#1

    How should I search the following text?


    Hi, I have a long list of text in the following pattern, and I want to use PowerGREP to split it into multiple files. I tried ===\d of 2897 documents.*UNITED STATES OF AMERICA==== but I get only one hit in total.
    What regular expression should I use? Please help me. Thanks a lot!




    1 of 2897 DOCUMENTS

    UNITED STATES PUBLIC LAWS
    UNITED STATES OF AMERICA

    2 of 2897 DOCUMENTS
    UNITED STATES PUBLIC LAWS
    UNITED STATES OF AMERICA

    3 of 2897 DOCUMENTS
    UNITED STATES PUBLIC LAWS
    UNITED STATES OF AMERICA

    4 of 2897 DOCUMENTS
    UNITED STATES PUBLIC LAWS
    UNITED STATES OF AMERICA
#2
    I do not know PowerGREP and I don't quite understand what you are trying to do, but I can tell you that the '.*' part is probably what is wrong in your regex. In regular expressions, the dot-asterisk pattern is "greedy": it matches any character as many times as possible while still letting the rest of the regex match. So with ".*UNITED STATES OF AMERICA", the pattern matches your text all the way to the last occurrence of "UNITED STATES OF AMERICA" in the document. That is why you get only one match: you are basically matching the whole document once.

    Just to give another example, suppose you have a line like this:
    Code:
    "start=1;end=3;start=4;end=6;start=7;end=9;"
    and use a regex like this:
    Code:
    start=\d.*\d
    • the first '\d' will match 1

    • the '.*' will match
      Code:
      ';end=3;start=4;end=6;start=7;end='
      and

    • the second \d will match 9,

    which is presumably not what was intended.
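
The greedy behavior described above can be checked with a short sketch. It uses Python's `re` module rather than PowerGREP (an assumption made only for illustration), and the capture groups are added so that each part of the match can be inspected separately.

```python
import re

line = "start=1;end=3;start=4;end=6;start=7;end=9;"

# Greedy: '.*' first grabs everything up to the end of the line, then backs
# off one character at a time until the final '\d' can match.
greedy = re.search(r"start=(\d)(.*)(\d)", line)
first_digit, middle, last_digit = greedy.groups()

print(first_digit)  # 1
print(middle)       # ;end=3;start=4;end=6;start=7;end=
print(last_digit)   # 9
```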

    The problem comes from the fact that '*' is a greedy quantifier (it tries to match as much as possible). In this case, you would want to use a non-greedy quantifier. I do not know if this is implemented in PowerGREP, but the non-greedy form of '*' is usually '*?' in standard regex packages.

    So if I change my example regex above to:
    Code:
    start=\d.*?\d
    I would now have the following behavior:
    • the first '\d' will match 1

    • the '.*?' will match
      Code:
      ';end='
      and

    • the second \d will match 3,

    which is now presumably what was intended.
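
As a rough check of the difference, the sketch below runs both forms over the same line and collects every match. Again this is Python's `re` module, used here only as a stand-in for a "standard regex package"; PowerGREP itself is not involved.

```python
import re

line = "start=1;end=3;start=4;end=6;start=7;end=9;"

# Non-greedy: '.*?' stops at the first place where the following '\d' can match,
# so each start/end pair becomes its own match.
lazy_matches = re.findall(r"start=\d.*?\d", line)

# Greedy: '.*' runs to the last digit on the line, so there is a single match.
greedy_matches = re.findall(r"start=\d.*\d", line)

print(lazy_matches)    # ['start=1;end=3', 'start=4;end=6', 'start=7;end=9']
print(greedy_matches)  # ['start=1;end=3;start=4;end=6;start=7;end=9']
```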

    So, try changing '...documents.*UNITED STATES ...' to '...documents.*?UNITED STATES ...', and see if you get something more in line with what you are looking for.
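
Here is a minimal sketch of that change applied to the sample text from the first post. It uses Python's `re` module, not PowerGREP, and makes a few assumptions that are not in the original question: the dot is allowed to match line breaks (`re.DOTALL`), matching is case-insensitive (`re.IGNORECASE`, since the regex says "documents" and the text says "DOCUMENTS"), `\d+` is used so that multi-digit document numbers are matched in full, and the `===`/`====` around the original regex are treated as delimiters and dropped.

```python
import re

sample = """1 of 2897 DOCUMENTS

UNITED STATES PUBLIC LAWS
UNITED STATES OF AMERICA

2 of 2897 DOCUMENTS
UNITED STATES PUBLIC LAWS
UNITED STATES OF AMERICA

3 of 2897 DOCUMENTS
UNITED STATES PUBLIC LAWS
UNITED STATES OF AMERICA

4 of 2897 DOCUMENTS
UNITED STATES PUBLIC LAWS
UNITED STATES OF AMERICA
"""

# Assumed flags: dot matches newlines, and case is ignored.
flags = re.DOTALL | re.IGNORECASE

# Greedy version: one match spanning from document 1 to the last header.
greedy_hits = re.findall(r"\d+ of 2897 documents.*UNITED STATES OF AMERICA", sample, flags)

# Non-greedy version: one match per document header.
lazy_hits = re.findall(r"\d+ of 2897 documents.*?UNITED STATES OF AMERICA", sample, flags)

print(len(greedy_hits))  # 1
print(len(lazy_hits))    # 4
```

Whether PowerGREP needs equivalent options to be switched on for the same result is something to check in its own settings.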
