For a recent
project I wrote some simple client-side form validation. Most
of it was trivial; but I also wanted to check for a syntactically
valid email address (I didn’t care whether it reached someone’s
inbox or not).
I performed an exhaustive Google search (no I didn’t) and found
one or two bits of code, but most of them annoyed me. I didn’t
want something that encoded a list of allowed top-level domains,
ferchrissakes, or some script kiddie idea of an email address that
disallowed subdomains. I wanted a regular expression that
referenced the appropriate RFC, that appeared to know what it was
talking about.
Of course I can find one now, now that I’ve written the
regular expression myself, but then I was dissatisfied. I
invoked the holy rite of Not Invented Here with a
soupçon of Never Knowingly Underengineered, and
settled down with a cup of tea, a copy of RFC 2822 and a
pile of shortbread.
The object of my desire was a regular expression that matched
the RFC 2822 ‘mailbox’ token, minus a few things. RFC 2822, not
the most gripping of reads, concerns itself with the format of
email messages for transmission over the net and so has to worry
itself ragged about line lengths and CRLF, and dealing with what
might charitably be called prior misunderstandings (those who
deployed software that didn’t conform to RFC 822, the previous
version of this specification). I decided to wave a Dilbertian
hand at all that nonsense. I mean, does anyone actually put
comments inside their email address? (This is a valid mailbox,
according to RFC 2822:
Pete(A wonderful \) chap) <pete(his account)@silly.test(his host)>
– the bits in parentheses are comments, which you can nest to an
arbitrary and pointless level.)
Here’s what I ended up with (with added line breaks for the web – remove before use). I’m reasonably confident it’s
correct; I used test-driven development techniques to derive it.
It’s licensed as cc-attrib mainly to reduce annoyance – GPL is
overkill for a regular expression, I feel. This licence allows
people to port it to their language of choice as long as they
credit me, and to incorporate it into their own code without any
additional licence burden. The copious comments are there to annoy
Roger.
function bValidMailbox(s)
{
  // This function (but not any surrounding code) is copyright
  // (c) 2007 David Smith (dave a t sheepshank d o t org).
  // This work is licensed under a Creative Commons Attribution
  // 2.0 UK: England & Wales License.
  // http://creativecommons.org/licenses/by/2.0/uk/

  // The regular expression below is based on RFC2822. It matches
  // the 'mailbox' token defined in that RFC, with the following
  // changes: no obsolete parts; no comments; no domain literals;
  // no spaces within or around the domain; no unquoted spaces in
  // the local-part; at least one dot in the domain; no CRLF
  // allowed.

  // It is believed to be accurate but YMMV. Use at your own risk.

  // Examples that PASS this test (one example per line):
  //   jdoe@example.org
  //   <boss@nil.test>
  //   John Doe <jdoe@machine.example>
  //   Who? <one@y.test>
  //   "Joe Q. Public" <john.q.public@example.com>
  //   Joe "Q." Public <john.q.public@example.com>
  //   "Giant; \"Big\" Box" <sysservices@example.net>
  //   Giant 'Big' Box <sysservices@example.net>
  //   "john q. doe"@machine.example
  //   John "Q." Doe <"john q. doe"@machine.example>

  // Examples that FAIL this test (reason after the dash):
  //   me - no domain
  //   me@you - domains must have a dot
  //   me@you. - that's not at the end
  //   me@.you - or the beginning
  //   me me@example.com - address spec not within <>
  //   my.name <me@example.com> - no unquoted dots allowed there
  //   me@example . com - no spaces allowed there
  //   me@ example.com - or there
  //   me @example.com - or there
  //   me < me@example.com > - or there
  //   me@[1.2.3.4] - domain literals not supported

  return /^(([\x20\x09]*[\x21\x23-\x27\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f
\x41-\x5a\x5e-\x7e]+[\x20\x09]*|[\x20\x09]*\x22([^\x00\x0a\x0d\x22\x5c\
x80-\xff]|\x5c[\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c\x0e-\x7f])*
[\x20\x09]*\x22[\x20\x09]*)*[\x20\x09]*\x3c([\x21\x23-\x27\x2a\x2b\x2d\
x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+(\x2e[\x21\x23-\x27\x2a\x2b\x2d
\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+)*|[\x20\x09]*\x22([^\x00\x0a\
x0d\x22\x5c\x80-\xff]|\x5c[\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c
\x0e-\x7f])*[\x20\x09]*\x22[\x20\x09]*)\x40[\x21\x23-\x27\x2a\x2b\x2d\x
2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+(\x2e[\x21\x23-\x27\x2a\x2b\x2d\
x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+)+\x3e[\x20\x09]*|([\x21\x23-\x
27\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+(\x2e[\x21\x23-\
x27\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+)*|[\x20\x09]*\
x22([^\x00\x0a\x0d\x22\x5c\x80-\xff]|\x5c[\x01\x02\x03\x04\x05\x06\x07\
x08\x09\x0b\x0c\x0e-\x7f])*[\x20\x09]*\x22[\x20\x09]*)\x40[\x21\x23-\x2
7\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+(\x2e[\x21\x23-\x
27\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+)+)$/.test(s);
}
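A pattern this long is hard to audit by eye, so here is a minimal sketch of the same idea built up from named RFC 2822 pieces and checked against the PASS and FAIL examples listed in the comments. This is not the function above: the names WSP, ATEXT, QUOTED, DOT_ATOM, DOMAIN, ADDR_SPEC, DISPLAY_NAME, mailboxPattern and isValidMailbox are all made up for illustration, and the single-quoted 'Big' example assumes plain apostrophes in the display name.

```javascript
// Sketch only: every name below is hypothetical and not part of the
// original function. The pieces follow the changes listed in its comments
// (no obsolete syntax, no comments, no domain literals, a dot in the domain).
const WSP = String.raw`[\x20\x09]*`; // optional spaces or tabs
// RFC 2822 atext: printable ASCII minus the specials, space and controls.
const ATEXT = String.raw`[\x21\x23-\x27\x2a\x2b\x2d\x2f\x30-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]`;
// quoted-string: plain characters or backslash-escaped pairs between quotes.
const QUOTED = String.raw`${WSP}\x22(?:[^\x00\x0a\x0d\x22\x5c\x80-\xff]|\x5c[\x01-\x09\x0b\x0c\x0e-\x7f])*${WSP}\x22${WSP}`;
const DOT_ATOM = String.raw`${ATEXT}+(?:\x2e${ATEXT}+)*`;
const DOMAIN = String.raw`${ATEXT}+(?:\x2e${ATEXT}+)+`; // at least one dot
const ADDR_SPEC = String.raw`(?:${DOT_ATOM}|${QUOTED})\x40${DOMAIN}`;
const DISPLAY_NAME = String.raw`(?:${WSP}${ATEXT}+${WSP}|${QUOTED})*`;

// mailbox = [display-name] "<" addr-spec ">"  /  addr-spec
const mailboxPattern = new RegExp(
  String.raw`^(?:${DISPLAY_NAME}${WSP}\x3c${ADDR_SPEC}\x3e${WSP}|${ADDR_SPEC})$`
);

function isValidMailbox(s) {
  return mailboxPattern.test(s);
}

// Table-driven check using the examples from the function's comments.
const shouldPass = [
  "jdoe@example.org",
  "<boss@nil.test>",
  "John Doe <jdoe@machine.example>",
  "Who? <one@y.test>",
  '"Joe Q. Public" <john.q.public@example.com>',
  'Joe "Q." Public <john.q.public@example.com>',
  '"Giant; \\"Big\\" Box" <sysservices@example.net>',
  "Giant 'Big' Box <sysservices@example.net>",
  '"john q. doe"@machine.example',
  'John "Q." Doe <"john q. doe"@machine.example>',
];
const shouldFail = [
  "me",
  "me@you",
  "me@you.",
  "me@.you",
  "me me@example.com",
  "my.name <me@example.com>",
  "me@example . com",
  "me@ example.com",
  "me @example.com",
  "me < me@example.com >",
  "me@[1.2.3.4]",
];

// Collect every example whose result differs from what the comments claim.
const mismatches = [
  ...shouldPass.filter((s) => !isValidMailbox(s)),
  ...shouldFail.filter((s) => isValidMailbox(s)),
];
console.log(mismatches.length === 0 ? "all examples behave as listed" : mismatches);
```

Keeping the examples in two arrays means a new edge case is one more string in a list, and any surprise shows up by name in mismatches.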
Eh?
Why are you allowing for the name portions as well? Those are used in the headers and in the SMTP protocol, but are not part of the actual address.
Also, if you look in the XEmacs source, there is some code to extract the name vs address portions. It was written by our hero, jwz!
Re: Eh?
I allowed for the name portion because I wanted to use it, if it was there, in the email sent to the ecard recipient – PHP was happy with it, and it seemed friendlier that way. See also “never knowingly under-engineered”.
Re: Eh?
My point is: what user, when encountering a field asking for their email address, types in:
Wiley Coyote
Re: Eh?
Oh, approximately nobody. Never knowingly etc. This is more about me scratching an itch than anything hugely beneficial.
Alternatively, I’m simply being liberal in what I accept :-)
Re: Eh?
In that case would you accept @@example.com – note that it is a perfectly valid address and actually works (when I ran qmail anyway).
It also works with Postfix in a test right now, although the mail agents present it as "@"@example.com, whereas I used @@example.com when typing raw SMTP.
Re: Eh?
@ is not a valid local-part according to RFC 2822, and doesn’t pass this regexp (I’ve checked). "@" is, and does.
I’m clearly not that liberal.
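The distinction in this exchange (a bare @ versus a quoted "@") can be shown with a small sketch that looks only at the local-part. The helper localPartKind and the names atom, dotAtomRe and quotedRe are hypothetical, written for this illustration rather than taken from the function above, and the quoted form here omits the optional surrounding whitespace.

```javascript
// Hypothetical helper: classifies a local-part as an RFC 2822 dot-atom,
// a quoted-string, or neither. Not part of the original function.
const atom = String.raw`[\x21\x23-\x27\x2a\x2b\x2d\x2f-\x39\x3d\x3f\x41-\x5a\x5e-\x7e]+`;
const dotAtomRe = new RegExp(`^${atom}(?:\\.${atom})*$`);
// Between the quotes: any character except NUL, CR, LF, quote, backslash
// and 8-bit characters, or a backslash followed by an escapable character.
const quotedRe = /^"(?:[^\x00\x0a\x0d"\\\x80-\xff]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"$/;

function localPartKind(localPart) {
  if (dotAtomRe.test(localPart)) return "dot-atom";
  if (quotedRe.test(localPart)) return "quoted-string";
  return "invalid";
}

console.log(localPartKind("@"));    // "@" is a special, so not atext
console.log(localPartKind('"@"'));  // fine once it is inside quotes
console.log(localPartKind("jdoe"));
```

So @@example.com fails because its local-part is neither form, while "@"@example.com gets through as a quoted-string.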