- Don't try to get everything in one regexp. I spent many, many hours trying to make one that would catch all emails. Even if you minimize the false negatives, you'll have a ton of false positives to deal with.
- Try doing a multi-pass regexp. That is, if you have made a regexp that catches all sorts of dots (dot, dt, ., ;), group the matches that contain a dot variant together and do a second pass that converts them to your normal '.'.
- There is a very real difference between +, *, ? and the various combinations of them (such as +? and *?). Know these when you are trying to remove false positives.
- Break up your regexp into parts (see the post by Sergey Zhidkov).
- Normalize your inputs
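
As a rough sketch of the multi-pass and normalization tips above (not the poster's actual code): the first pass rewrites obfuscated separators into their normal form, and the second pass runs a deliberately simple pattern over the cleaned-up text. The names `DOT_RE`, `AT_RE`, `EMAIL_RE` and `find_emails`, and the particular spellings handled, are assumptions made for illustration.

```python
import re

# Pass 1: rewrite obfuscated separators into their normal form.
# The spellings handled here (" dot ", " dt ", ";", " at ", "(at)") are
# illustrative assumptions, not a complete list.
DOT_RE = re.compile(r'\s+(?:dot|dt)\s+|\s*;\s*', re.I)
AT_RE = re.compile(r'\s+at\s+|\s*\(at\)\s*', re.I)

# Pass 2: a deliberately simple address pattern, run on normalized text only.
EMAIL_RE = re.compile(r'[a-z0-9_.]+@(?:[a-z0-9-]+\.)+[a-z]{2,}', re.I)


def find_emails(text):
    """Normalize separators first, then extract lower-cased addresses."""
    normalized = DOT_RE.sub('.', text)
    normalized = AT_RE.sub('@', normalized)
    return [m.lower() for m in EMAIL_RE.findall(normalized)]


print(find_emails('Contact: jane at cs dot example dot edu'))
```

Each pattern stays small enough to debug on its own. Note that pass 1 is aggressive: ordinary prose containing " at " or ";" can still turn into false positives, so it needs the same tuning the tips above describe.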
=================================================
A couple of tips for writing complex regular expressions.
1) You can construct your regexp from smaller pieces. For example:
    # An oversimplified and likely incorrect python example. :)
    tld_list = ['edu', 'com']
    tld_re = r'(?:%s)' % '|'.join(tld_list)
    subdomain_re = r'(?:[a-z]+\.)+'
    name_re = r'[a-z]+'
    mail_re = r'(%s)@(%s%s)' % (name_re, subdomain_re, tld_re)

Then you compile your resulting expression as usual.
2) Both Python and Java have the 'x' flag for regular expressions (re.X or re.VERBOSE in Python, Pattern.COMMENTS or an inline (?x) in Java) that allows you to add whitespace and comments. So you could have written the above as:
    import re

    mail_re = re.compile(r'''
        ([a-z]+)            # user name
        @(                  # host name, consisting of:
          (?:[a-z]+\.)+     # repeated subdomain labels
          (?:com|edu)       # tld suffix
        )''', re.I + re.X)
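
To check that the two tips above describe the same pattern, here is a small sketch that compiles both forms and runs them on a sample string. The helper `parse` and the sample addresses are made up for illustration; the pattern itself is still the oversimplified one from the example.

```python
import re

# Tip 1: build the pattern string from smaller pieces, then compile it.
tld_list = ['edu', 'com']
tld_re = r'(?:%s)' % '|'.join(tld_list)
subdomain_re = r'(?:[a-z]+\.)+'
name_re = r'[a-z]+'
composed = re.compile(r'(%s)@(%s%s)' % (name_re, subdomain_re, tld_re), re.I)

# Tip 2: the same pattern written out with the verbose flag.
verbose = re.compile(r'''
    ([a-z]+)            # user name
    @(                  # host name, consisting of:
      (?:[a-z]+\.)+     # repeated subdomain labels
      (?:edu|com)       # tld suffix
    )''', re.I | re.X)


def parse(pattern, text):
    """Return (user, host) for the first match in text, or None."""
    m = pattern.search(text)
    return m.groups() if m else None


sample = 'Write to jdoe@cs.example.edu for details'
print(parse(composed, sample))
print(parse(verbose, sample))
```

Both calls should give the same two groups, the user name and the host name. An address whose suffix is not in `tld_list` is simply not matched, so extending the list is the place to start when tuning.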