#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    318
    Rep Power
    1

    Regex To Extract 2nd Level Domains From All TLDs?


    Good Day Folks!

    1. Is the following regex OK for extracting a second-level domain together with its top-level domain (TLD)?
    [^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$

    2. How do I write PHP code that uses this regex?
    Any sample code is welcome.
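
As a minimal sketch for question 2: the pattern can be passed to PHP's preg_match() once it is wrapped in delimiters. The helper name match_domain and the sample hostnames are made up for illustration, and the pattern is applied to bare hostnames, not full URLs.

```php
<?php
// The pattern from the question, wrapped in "/" delimiters for preg_match().
const DOMAIN_PATTERN = '/[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/';

// Hypothetical helper: returns the matched text, or null when nothing matches.
function match_domain(string $host): ?string
{
    return preg_match(DOMAIN_PATTERN, $host, $m) === 1 ? $m[0] : null;
}

// Illustrative hostnames only.
$hosts = ['domain.com', 'subdomain.domain.com', 'www.example.co.uk',
          'www.abc.com', 'example.museum'];

foreach ($hosts as $host) {
    echo $host, ' => ', var_export(match_domain($host), true), PHP_EOL;
}
```

On these inputs the pattern picks out domain.com and example.co.uk, but it returns www.abc.com unchanged (a three-letter second-level name is mistaken for part of the suffix) and finds nothing in example.museum, because that TLD is longer than three characters.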
    Last edited by UniqueIdeaMan; May 26th, 2017 at 07:55 PM.
  2. #2
  3. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2003
    Location
    in da kitchen ...
    Posts
    12,869
    Rep Power
    6447
    1. Re your regex: test it in an online regex tester.
    2. It's a standard regex, so most online regex testers should be able to generate the code you need.
    --Ax
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    318
    Rep Power
    1
    Originally Posted by Axweildr
    1. Re your regex: test it in an online regex tester.
    2. It's a standard regex, so most online regex testers should be able to generate the code you need.

    Thank you for your willingness to help.
    I'm a complete beginner with regex, so any tutorial suggestions for complete beginners are welcome too!

    Anyway, as you know, web pages contain all sorts of internal and external links. No matter what a link looks like, the domain should be extracted from it. Imagine I'm running a web crawler: it would encounter an unlimited variety of links, some with just a domain, some with a subdomain, and so on.
    E.g.:

    301 Moved Permanently
    404 Error - Page Not Found


    Domain Name Registration and Web Hosting | Domain.com
    Domain Name Registration and Web Hosting | Domain.com


    Domain Name Registration and Web Hosting | Domain.com
    http://subdomain.domain.com


    domain.com/dir
    subdomain.domain.com/dir

    domain.com/dir/sub-dir
    subdomain.domain.com/dir/sub-dir


    Note: No matter how many subdomains or domain levels (3rd level, 4th level, etc.) or directories and sub-directories (at any depth) a link contains, the 2nd level domain should be extracted along with its TLD.
    From the examples above, the script should extract "domain.com" from every link.
    I need an example of the PHP code too, alongside the regex.
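
One possible way to handle the link variants above is to let PHP's parse_url() isolate the hostname first, then keep only its last two labels. This is a sketch under assumptions: host_from_link and last_two_labels are hypothetical helper names, and taking the last two labels only works for single-label TLDs such as .com, so it would get a name under .co.uk wrong.

```php
<?php
// Hypothetical helper: pull the hostname out of a link, with or without a scheme.
function host_from_link(string $link): ?string
{
    // parse_url() only reports a host when the link has a scheme,
    // so scheme-less links such as "domain.com/dir" get one prepended.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        $link = 'http://' . ltrim($link, '/');
    }
    $host = parse_url($link, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// Hypothetical helper: keep the last two dot-separated labels of a hostname.
function last_two_labels(string $host): string
{
    return implode('.', array_slice(explode('.', $host), -2));
}

$links = [
    'http://subdomain.domain.com',
    'domain.com/dir',
    'subdomain.domain.com/dir/sub-dir',
];

foreach ($links as $link) {
    $host = host_from_link($link);
    echo $link, ' => ', $host === null ? 'no host' : last_two_labels($host), PHP_EOL;
}
```

Splitting the work this way keeps paths, query strings, and schemes out of the domain logic entirely, so whatever rule is used for the domain only ever sees a hostname.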
  6. #4
  7. Wiser? Not exactly.
    Devshed God 2nd Plane (6000 - 6499 posts)

    Join Date
    May 2001
    Location
    Bonita Springs, FL
    Posts
    6,110
    Rep Power
    4103
    There's no good regex solution for what you want to do. Domains are more than something.ext. For example, you have suffixes like .co.uk or .ca to handle, and there are also all the newer special-purpose TLDs like .chat, .pictures, .movie, etc.

    You can extract something that kind of looks like a domain with your regex, but to confirm it is an actual domain you'd have to do something like try to resolve it or compare it against a list of valid TLDs.
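
The "compare it against a list" idea could look roughly like the sketch below. The KNOWN_SUFFIXES array is a tiny hand-picked sample for illustration, not a real list (a crawler would load a maintained one such as the Public Suffix List), and registrable_domain is a hypothetical helper name.

```php
<?php
// Illustrative sample only; a real list has thousands of entries.
const KNOWN_SUFFIXES = ['com', 'ca', 'uk', 'co.uk', 'chat', 'pictures', 'movie'];

// Hypothetical helper: return "label.suffix" for the longest known suffix,
// or null when the hostname does not end in any suffix from the list.
function registrable_domain(string $host, array $suffixes = KNOWN_SUFFIXES): ?string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so "co.uk" is tried before "uk".
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return null;
}

foreach (['a.b.domain.com', 'www.example.co.uk', 'my.site.chat', 'example.invalid'] as $host) {
    echo $host, ' => ', var_export(registrable_domain($host), true), PHP_EOL;
}
```

The other option mentioned, resolving the name, would need a DNS lookup (PHP has checkdnsrr() for that); it is left out here because it depends on network access.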
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    318
    Rep Power
    1
    Originally Posted by kicken
    There's no good regex solution for what you want to do. Domains are more than something.ext. For example, you have suffixes like .co.uk or .ca to handle, and there are also all the newer special-purpose TLDs like .chat, .pictures, .movie, etc.

    You can extract something that kind of looks like a domain with your regex, but to confirm it is an actual domain you'd have to do something like try to resolve it or compare it against a list of valid TLDs.
    Kicken,

    If you have the time, how about building one and contributing the code to this thread so that present and future members can learn from it?
    I'm very interested to see how short you can make it.

    Thanks.
