#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    1
    Rep Power
    0

    Using regex to deconstruct a "tag"?


    I'm kind of a Regex newbie, so this may sound like a crazy question...

    Is it possible to use Regex to help deconstruct a "tag" (for lack of a better word)? For example, say I have a "tag" in the form AaBbbCcDddddEe, where:
    Aa=State code (Ny, Fl, Tx, etc)
    Bbb=County code (3-char code representing a county within the state)
    Cc=City Code (2-char code representing the city within the county)
    Ddddd=House number (1 to 5-digit house number)
    Ee=Rating code (1 or 2 char rating code)

    The first letter of each code segment will be capitalized, for example:
    TxPotAm12044B (Texas, Potter county, Amarillo, house # 12044, rating B) or
    NmLeaHb457A2 (New Mexico, Lea county, Hobbs, house# 457, rating A2)

    Also, it is possible that the city, county and/or house number could be missing, leaving something like:
    Tx12044B or
    NmLeaA2

    Can Regex be used alone or in concert with code to do the deconstruction?

    Thanks for any help.

    (edited to clarify tag syntax)
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,115
    Rep Power
    9398
    Given your description I'm only mostly sure all possible strings can be "deconstructed" unambiguously.

    Code:
    /^(..)(\D\D\D)?(\D\D)?(\d*)(..?)$/
    Your mileage may vary by programming language.
  4. #3
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    835
    Rep Power
    496
    Hmmm, given that some of the fields may be missing, I am afraid that what you proposed, Requinix, might fail. For example, (\D\D\D)? might match a county code, but it might also match the city code plus the beginning of a rating code. Well, looking in detail at all the possibilities may eventually show that you get it right every time, but I would not rely on that rather fragile way of doing it.
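    To make the concern concrete, here is a minimal sketch that runs the pattern from post #2 against a tag with a city but no county and no house number. Python is an assumption (the OP never names a language), and both the function name loose_parse and the tag "TxAmA2" are made up for illustration.

```python
import re

# Pattern from post #2: it ignores letter case, so segment boundaries are guessed.
LOOSE = re.compile(r"^(..)(\D\D\D)?(\D\D)?(\d*)(..?)$")


def loose_parse(tag):
    """Return (state, county, city, house, rating), or None if nothing matches."""
    m = LOOSE.match(tag)
    return m.groups() if m else None


# Hypothetical tag: Texas, no county, city "Am", no house number, rating "A2".
print(loose_parse("TxAmA2"))
# -> ('Tx', 'AmA', None, '', '2')   county swallowed the city and half the rating

# The OP's own example with every field present still parses as intended.
print(loose_parse("TxPotAm12044B"))
# -> ('Tx', 'Pot', 'Am', '12044', 'B')
```

    So the pattern works when all fields are present, but it silently mis-assigns them once a middle field is missing.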

    I would take the pain of distinguishing between UC and LC letters to make sure that I am really matching what I want.

    So I would try something like this (using Perl regex syntax, but it should apply to other regex packages, I would think):

    Code:
    /^([A-Z][a-z])([A-Z][a-z]{2})?([A-Z][a-z])?(\d{1,5})?(\w\w?)$/
      state       county          city         house #   rate
    I used standard Perl regex syntax, but if I were really doing that in Perl, I would probably rather define the individual sub-pattern components separately and put them together at the end, for example something like this:

    Code:
    my $st = qr/[A-Z][a-z]/;
    my $county = qr/[A-Z][a-z]{2}/;
    ...
    and then build the final regex:

    Code:
    /^($st)($county)?... ... $/
    or possibly even:

    Code:
    my $uc = qr/[A-Z]/;
    my $lc = qr/[a-z]/;
    my $st = qr/$uc$lc/;
    my $county = qr/$uc$lc{2}/;
    ...
    (Everything posted here is untested, I am just trying to give the gist of it, there may be errors or typos.)

    I do not know whether constructing a regex from smaller components in the way shown is possible in the language used by the OP, but if it is, I would recommend something like this.
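    For what it is worth, the same build-from-components idea can be sketched in Python. This is an assumption about the language (the OP never says which one is in use), and every name below (UC, LC, TAG, parse_tag and so on) is made up for illustration. The sub-patterns are the ones from this post, with the rating kept as \w\w? exactly as above.

```python
import re

# Named sub-patterns, mirroring the $uc / $lc / $st / $county idea above.
UC = r"[A-Z]"
LC = r"[a-z]"
STATE = rf"{UC}{LC}"
COUNTY = rf"{UC}{LC}{{2}}"   # doubled braces give a literal {2} quantifier
CITY = rf"{UC}{LC}"
HOUSE = r"\d{1,5}"
RATING = r"\w\w?"

# Assemble the final pattern; only state and rating are mandatory.
TAG = re.compile(
    rf"(?P<state>{STATE})"
    rf"(?P<county>{COUNTY})?"
    rf"(?P<city>{CITY})?"
    rf"(?P<house>{HOUSE})?"
    rf"(?P<rating>{RATING})",
    re.ASCII,  # keep \d and \w limited to ASCII characters
)


def parse_tag(tag):
    """Return a dict of the five fields (missing ones are None), or None."""
    m = TAG.fullmatch(tag)  # fullmatch plays the role of ^ ... $
    return m.groupdict() if m else None


for tag in ("TxPotAm12044B", "NmLeaHb457A2", "Tx12044B", "NmLeaA2"):
    print(tag, parse_tag(tag))
```

    Because each optional piece must start with an uppercase letter followed by lowercase ones, a missing county or city no longer shifts the later fields: NmLeaA2 comes back with county "Lea", no city, no house number, and rating "A2".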

    Comments on this post

    • requinix agrees : yep, that one's ambiguous without checking letter case
