Regex Programming
 
Dev Shed Forums > Programming Languages - More > Regex Programming

#1  March 6th, 2012, 05:12 AM
NeilHillman.com
Help to extract domain names

Hello, clever people who can do Regex!

I like to think of myself as an intelligent person, and a fairly good PHP scripter, but Regex makes my brain hurt! Can someone please help me write this rule before I start to reassess my career choices and retrain as a manual laborer...

I know there are many Regex rules already to isolate the various parts of a url, for example:

Code:
/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i
- or -
/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/


What I need is slightly different: I want to extract anything that fits the criteria for a domain name from a text string. As far as I can break it down, I need to identify:

Quote:
(not a hyphen or alpha-numeric character)
- followed by -
(2 or more alpha-numeric or hyphen characters)
- followed by -
(a dot ".")
- followed by -
(2-7 alpha characters)
- optionally followed by -
(a dot "." and 2-3 more alpha characters)

It should be able to identify and extract any valid domains, which could be located inside text, quotes, tags, URLs, anything. So the rule must require that the preceding and following characters are either the start or end of the line, or any character that is not alpha-numeric or a hyphen. For example:

Quote:
Contact me at: <a href="mailto:contact@me.co.uk?subject=xxx">blah</a> my favourite musium is "cymru.museum", ebay in australia is at:ebay.com.au! Some people use hacks like del.icio.us... and some domains are very ugly, like [y687-hy6-7yg54676-9076j--f798k-767658765.info]. Is this even possible!?

It should find:

me.co.uk
cymru.museum
ebay.com.au
icio.us
y687-hy6-7yg54676-9076j--f798k-767658765.info


Can anybody help me please!?

Many thanks in advance,

Neil

#2  March 6th, 2012, 03:10 PM
ragax
Hi Neil,
Quote:
Is this even possible!?

Given how carefully and precisely you have written your rule, that is actually a trivial regex problem. Without trying to assess how good your rule is, here's your rule translated into regex. I have used "comment mode" (aka whitespace mode) so you can see what is happening. The comments start after the # marks.
Code:
(?x)
[^-[:alnum:]] #(not a hyphen or alpha-numeric character)
( # start group 1 capture
[-[:alnum:]]{2,} # (2 or more alpha-numeric or hyphen characters)
\. # (a dot ".")
[[:alpha:]]{2,7} # (2-7 alpha characters)
(?:\.[[:alpha:]]{2,3})? # optionally followed by (a dot "." and 2-3 more alpha characters)
) # end group 1 capture
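
For comparison, here is a small sketch (not from the original post) that runs the commented pattern and the same pattern written on one line against a short sample. The sample string and the variable names are made up for illustration. The only point is that comment mode changes how the pattern reads, not what it matches.

```php
<?php
// Commented (comment-mode) form of the rule, as in the post above.
$commented = '~(?x)
[^-[:alnum:]]            # not a hyphen or alpha-numeric character
(                        # start group 1 capture
[-[:alnum:]]{2,}         # 2 or more alpha-numeric or hyphen characters
\.                       # a dot
[[:alpha:]]{2,7}         # 2-7 alpha characters
(?:\.[[:alpha:]]{2,3})?  # optional dot plus 2-3 more alpha characters
)                        # end group 1 capture
~';

// The same pattern with the whitespace and comments removed by hand.
$compact = '~[^-[:alnum:]]([-[:alnum:]]{2,}\.[[:alpha:]]{2,7}(?:\.[[:alpha:]]{2,3})?)~';

// Illustrative sample only (an assumption, not the asker's full test string).
$sample = 'mail contact@me.co.uk or visit "cymru.museum" today';

preg_match_all($commented, $sample, $a);
preg_match_all($compact, $sample, $b);

var_dump($a[1] === $b[1]);    // bool(true)
echo implode(', ', $a[1]);    // me.co.uk, cymru.museum
```

If the one-line form is easier to paste into your own code, it should behave the same way; the commented form is just kinder to whoever reads it next.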

And here is working PHP code using it:
Code:
<?php
// Sample text from the original question (the line breaks are part of the string).
$string='Contact me at: <a href="mailto:contact@me.co.uk?subject=xxx">blah</a> my favourite musium 
is "cymru.museum", ebay in australia is at:ebay.com.au! Some people use hacks like del.icio.us... 
and some domains are very ugly, like [y687-hy6-7yg54676-9076j--f798k-767658765.info]. Is this even possible!?';
$regex='~(?x)
[^-[:alnum:]] # (not a hyphen or alpha-numeric character)
( # start group 1 capture
[-[:alnum:]]{2,} # (2 or more alpha-numeric or hyphen characters)
\. # (a dot ".")
[[:alpha:]]{2,7} # (2-7 alpha characters)
(?:\.[[:alpha:]]{2,3})? # optionally followed by (a dot "." and 2-3 more alpha characters)
) # end group 1 capture
~';
// Group 1 holds each domain without the leading boundary character.
preg_match_all($regex, $string, $matches, PREG_PATTERN_ORDER);
foreach ($matches[1] as $domain) {
    echo $domain."<br />";
}
?>

Output:
me.co.uk
cymru.museum
ebay.com.au
del.icio.us
y687-hy6-7yg54676-9076j--f798k-767658765.info
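
One thing to watch: the rule as written needs a character in front of the domain, so a domain at the very start of the string is skipped, and nothing checks the character that follows it. A possible variation, sketched here on the assumption that lookarounds are acceptable for your use, swaps the leading character class for a negative lookbehind and adds a negative lookahead at the end. The function name extract_domains and the sample sentence are hypothetical, and this has only been checked against the short sample shown.

```php
<?php
// Hypothetical helper (not from the thread): returns domain-like tokens,
// allowing a match at the very start or very end of the text.
function extract_domains(string $text): array
{
    $regex = '~(?x)
    (?<![-[:alnum:]])        # not preceded by a hyphen or alpha-numeric character
    [-[:alnum:]]{2,}         # 2 or more alpha-numeric or hyphen characters
    \.                       # a dot
    [[:alpha:]]{2,7}         # 2-7 alpha characters
    (?:\.[[:alpha:]]{2,3})?  # optional dot plus 2-3 more alpha characters
    (?![-[:alnum:]])         # not followed by a hyphen or alpha-numeric character
    ~';
    // Lookarounds consume nothing, so the whole match is the domain itself.
    preg_match_all($regex, $text, $m);
    return $m[0];
}

// Illustrative sample: one domain at the start, one at the end.
$found = extract_domains('example.org starts the line, ends with ebay.com.au');
echo implode(', ', $found);   // example.org, ebay.com.au
```

Because the lookarounds do not capture anything, there is no group 1 here and the results come from $m[0] instead.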

Seeing how beautifully you have written the rule, it seems to me that you are only a couple steps away from writing efficient regex.

Let me know if you have any questions!