January 28th, 2012, 06:40 PM
Trying to achieve the following email-related regex
I'm using PCRE in PHP under the preg functions.
-at least one thing before the @
-exactly one @
-at least one thing after the @
-a period between above and below
-must be something, anything, after period
-check for multiple @s
Here are two expressions that I wrote:
Version 1: ^(.)+@(.)+\.(.)+$
Version 2: ^(.+)@(.+)\.(.+)$
Any problems with either of these expressions given the above criteria? (And which one is preferred?)
I'm new to regex, so all in-depth advice would be appreciated.
If possible, I'd like to do the following as well, but I don't know how to do it in regex. Right now I just do it through other programming functions. Can regex do this? I am not sure it could be built into the expression above, though.
disallow: multiple @s, semicolons, back or forward slashes, commas, and single quotes.
January 29th, 2012, 01:45 AM
The parentheses in your expressions are only useful if you want to capture the content of the parentheses and retrieve it via $match[1], $match[2], etc. (assuming $match is the third parameter in your preg_match call).
If you do want to capture something, your second version is better, as the first only captures one character.
But if you only want to validate (no capture), do away with the parentheses.
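To see the capture difference concretely, here is a minimal sketch. The sample address and the variable names $v1 and $v2 are illustrative assumptions, not part of the original post.

```php
<?php
// Version 1: the quantifier sits outside the group, so each group
// keeps only the last character it matched.
preg_match('/^(.)+@(.)+\.(.)+$/', 'joe@example.com', $v1);

// Version 2: the quantifier sits inside the group, so each group
// keeps the whole run of characters.
preg_match('/^(.+)@(.+)\.(.+)$/', 'joe@example.com', $v2);

print_r($v1); // $v1[1] is 'e', $v1[2] is 'e', $v1[3] is 'm'
print_r($v2); // $v2[1] is 'joe', $v2[2] is 'example', $v2[3] is 'com'
```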
Warning: both of your expressions allow multiple @ signs (the dot matches any character, so it can match an @).
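A quick check of that warning, as a sketch. The test input with two @ signs is a made-up example.

```php
<?php
$versions = array(
    'Version 1' => '/^(.)+@(.)+\.(.)+$/',
    'Version 2' => '/^(.+)@(.+)\.(.+)$/',
);

$address = 'a@b@c.com'; // hypothetical input containing two @ signs

$results = array();
foreach ($versions as $name => $regex) {
    // preg_match returns 1 on a match, 0 on no match
    $results[$name] = preg_match($regex, $address);
    echo $name . ': ' . $results[$name] . "\n"; // both print 1
}
```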
However, you can easily disallow all the characters you specified. Here's perhaps the simplest way to match what you asked for:
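A minimal sketch of such a pattern, not necessarily the one originally posted. It assumes a negated character class that excludes @, semicolon, both slashes, comma and single quote on each side of the @ and after the period. The function name looks_like_email is hypothetical.

```php
<?php
// Hypothetical helper: the name and the exact pattern are illustrative.
// The class [^@;/\\,'] means "any character except @ ; / \ , and '".
function looks_like_email($address)
{
    $regex = '~^[^@;/\\\\,\']+@[^@;/\\\\,\']+\.[^@;/\\\\,\']+$~';
    return preg_match($regex, $address) === 1;
}

var_dump(looks_like_email('joe@example.com'));  // bool(true)
var_dump(looks_like_email('a@b@c.com'));        // bool(false): two @ signs
var_dump(looks_like_email('jo;e@example.com')); // bool(false): semicolon
var_dump(looks_like_email('joe@examplecom'));   // bool(false): no period
```

Because @ is excluded from every class, the single literal @ in the pattern is the only one that can appear, which covers the "exactly one @" requirement without extra code.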
You can be a whole lot more specific if you like, with or without lookarounds.
For lookarounds, have a look at how you can validate a password by using regex lookaheads. The technique is the same for an email.
Hope this helps, let me know if you have any questions.
Wishing you a fun weekend.
Last edited by ragax; January 29th, 2012 at 02:00 AM.
January 29th, 2012, 02:09 AM
And here's a less simple version of the code that has several benefits.
foreach($emails as $email) echo preg_match($regex,$email).'<br />';
First, there's an array of test email addresses so you can add addresses and check that the expression works for you.
Second, it only says "no @, no semicolons etc" explicitly ([^@;/\\\\,\']) once at the beginning, later just calling that expression twice with the pattern repeat syntax: (?1).
This allows you to tweak this condition in a single place, which makes the expression easier to maintain.
I also threw in a few performance tweaks.
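A sketch of what such a listing could look like, pieced together from the fragments quoted elsewhere in this thread: the group ([^@;/\\\\,\']), the atomic group (?>(?1)+?\.) and the possessive ++. The exact pattern and the test addresses are assumptions, so treat this as an illustration rather than the original code.

```php
<?php
// Assumed reconstruction, not the original listing.
// Group 1 defines the allowed character once; (?1) reuses that definition.
$regex = '~^([^@;/\\\\,\'])++@(?>(?1)+?\.)++(?1)++$~';

// Hypothetical test addresses; add your own to check the expression.
$emails = array(
    'joe@example.com',
    'joe@mail.example.com',
    'a@b@c.com',
    'jo;e@example.com',
    'joe@examplecom',
);

$results = array();
foreach ($emails as $email) {
    $results[$email] = preg_match($regex, $email);
    echo $email . ' => ' . $results[$email] . "\n";
}
```

The first two addresses should print 1 and the last three 0. To change which characters are disallowed, only the class inside group 1 needs editing.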
Last edited by ragax; January 29th, 2012 at 02:21 AM.
Reason: performance tweaks
January 29th, 2012, 07:31 AM
ragax, it's a perfect example of repeating expressions. With them, the code is easier to maintain and just shorter. Nice regex!
January 29th, 2012, 01:39 PM
Wow, thanks for the help!
I am going to have to mull this over to get a better understanding of what is happening (I still can't look at a regex without looking things up!), but I just wanted to give you a quick thanks for the speedy help.
January 29th, 2012, 05:57 PM
I tried your two regexes out in RegexBuddy (which I use, along with Google, to make sense of regexes). The first one came out okay, but the second one gave some errors. However, when I tried out your code and added a few test emails of my own, everything worked well. So RegexBuddy could be wrong.
Unfortunately though, I could not understand the second regex you provided (although I was able to understand the first).
Here's the second one for reference:
My confusion begins at "(?1)" -- I'm not sure what this is doing. Same again with the second time the set of characters appears. (remember I am very inexperienced with regex!)
For the record, the error reported is for the atomic group "(?>(?1)+?": the characters "(?" before the 1 (so in "(?1") are said to be "Erroneous characters - possibly incomplete regextoken or unescaped metacharacters"
-it also gives the same error for the ) in "\.)"
-again for the characters (? in the second "(?1)"
-it's not an error, but for every 1 in the regex, it says you are trying to match it literally (I don't know what the "(?1)" syntax means, but I am going to assume you are not trying to match 1 literally).
-same error for the ) in the second ")++"
-it also gives the error "Quantifiers must be preceded by a token that can be repeated." for the second "++" (at the end of the regex)
I don't know if any of this makes sense to you or if RegexBuddy just got it wrong, but I thought I would let you know so you could comment.
January 29th, 2012, 07:19 PM
January 29th, 2012, 11:33 PM
Hi ABA, yes, isn't it cool? Thought you'd like that. Planning to start using those more and more.
ABA is right about RB. I love RB, but there are quite a few cool features of PHP regex that don't work in it. By the way, I also love ABA's own tool, ABA search and replace. It has a different focus: searching and replacing across multiple text files. Very powerful.
Did you sort it out by looking at the link ABA sent you?
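The (?1) just means "repeat the regex in the parentheses of group 1". It reuses the pattern itself, not the text that group matched (that would be a backreference such as \1). A great way to make your expressions more compact and maintainable.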
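A small sketch of the difference between reusing a pattern with (?1) and reusing matched text with \1. The sample strings and variable names are made up for illustration.

```php
<?php
// (?1) re-runs the pattern of group 1, so the second part may be different text.
$subroutine = preg_match('/^([a-z]+)-(?1)$/', 'abc-xyz');  // 1

// \1 demands the exact text that group 1 captured.
$backref = preg_match('/^([a-z]+)-\1$/', 'abc-xyz');       // 0
$backrefSame = preg_match('/^([a-z]+)-\1$/', 'abc-abc');   // 1

echo $subroutine . ' ' . $backref . ' ' . $backrefSame . "\n"; // 1 0 1
```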
As far as RB goes, that's really the only hiccup with this expression. When you remove the two (?1) patterns, RB calms down.
Let us know if you have other questions!
January 31st, 2012, 01:23 PM
Thanks ragax/abareplace. The link was helpful; I haven't used that feature before.
I have been looking at the regex for a few days now and it's helped improve other expressions I have had to write. One question I had, though, was about this sub-expression: "(?>(?1)+?\.)", in particular the +? part. In my original criteria, I needed to specify that there was something (except the characters listed) in between the "@" in an email and the "." before the TLD (so @(anything, but at least one thing).com). I recognize that the + is operating on the repeating expression "(?1)", but could you explain what the immediately following ? is doing? I understand "?" usually makes things optional, but if I am interpreting this correctly, it is making the (?1)+ optional, which in turn makes optional the check for "([^@;/\\\\,\'])" that is supposed to take place after the @. I assume I am just misinterpreting the role of the ? in the atomic expression after the repeating expression. Could you clear up what it is doing here?
January 31st, 2012, 01:50 PM
A ? after a quantifier (such as + or *) is not an "optional" quantifier, but a "lazy" flag. It turns the quantifier lazy (quantifiers are greedy by default).
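A minimal illustration of greedy versus lazy, using a made-up sample string rather than anything from this thread:

```php
<?php
$text = '<b>one</b> and <b>two</b>';

// Greedy: .+ grabs as much as possible, then backs off to the LAST </b>.
preg_match('~<b>.+</b>~', $text, $greedy);

// Lazy: .+? grabs as little as possible, stopping at the FIRST </b>.
preg_match('~<b>.+?</b>~', $text, $lazy);

echo $greedy[0] . "\n"; // <b>one</b> and <b>two</b>
echo $lazy[0] . "\n";   // <b>one</b>
```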
In this case, it's a bug on my part; you can remove it. It works, but it is not needed, and it slows down the match on the order of a millionth of a second, so a purist would not want it.
The reason it is not needed is that the expression (?1) repeated by the + can in fact be greedy. There is no risk that its "greed" will make it roll over the period after it (\.), because there is no period in the character class contained in (?1). That pattern (?1) will never eat a period, so let it eat up anything it likes without impediment.
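A sketch comparing the lazy and greedy forms on a few ordinary addresses. Both patterns are assumed reconstructions based on the fragments quoted in this thread, and the sample addresses are made up. One caveat: the class as quoted, [^@;/\\\\,\'], does not list the period, so with that exact class (?1) can match a period and the greedy form relies on backtracking inside the atomic group to find the \. that follows. Unusual inputs, such as hosts ending in consecutive periods, can then be treated differently by the two forms. Adding the period to the class would make the reasoning above hold exactly.

```php
<?php
// Assumed reconstructions; only the quantifier inside the atomic group differs.
$lazy   = '~^([^@;/\\\\,\'])++@(?>(?1)+?\.)++(?1)++$~';
$greedy = '~^([^@;/\\\\,\'])++@(?>(?1)+\.)++(?1)++$~';

// Hypothetical sample addresses.
$samples = array(
    'joe@example.com',
    'joe@mail.example.com',
    'joe@examplecom',
    'a@b@c.com',
);

$agree = true;
foreach ($samples as $sample) {
    $l = preg_match($lazy, $sample);
    $g = preg_match($greedy, $sample);
    echo $sample . ': lazy=' . $l . ' greedy=' . $g . "\n";
    if ($l !== $g) {
        $agree = false;
    }
}

var_dump($agree); // bool(true) for these samples
```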
If you're interested in this topic, I encourage you to read this little piece of mine on greedy and lazy quantifiers.
I'm really pleased you asked this question because (i) you're really getting into the nuts and bolts of the regex, which is awesome, and (ii) you put your finger on something to improve!
Wishing you a beautiful day.