#1
    Help with a regex that reads the data between two strings with irregular no. of \n


    Hi,
    I'm fairly new to regex and I'm struggling to get a string that can scrape data from a web page.

    I need all the data between two fixed points; let's call each of them "string". However, the data in between could span anything from 1 to 6 lines. So, for example, it might look like this (I've bolded string for visibility; it would not be bolded in the file).

    string
    12312\n
    asd\n
    string fsdsg\n
    asgsdgfd sdfsd\n
    <saef 12n\
    af> sf \n
    123\n
    string\n
    123\n
    string gasfsfdsg\n

    In this case I would be looking to select

    12312\n
    asd\n

    fsdsg\n
    asgsdgfd sdfsd\n
    <saef 12n\
    af> sf \n
    123\n

    \n
    123\n


    I tried to set up a pattern using an if-then conditional like this:

    string.*\n(?(?=string)|(.*\n){0,6})

    But that doesn't seem to work. I'm guessing that's because the regex doesn't stop when the condition is true but just keeps going.

    Does anybody have any suggestions on how I could solve this problem please?

    Best regards Steve
  2. #2
    Because your problem involves a little bit more than pure regexes, it would be important to know which language you are working in. The solution might be very different depending on the language used.

    For example in Perl, I could just do something as simple as this

    Perl Code:
    my $input = "string\n12312\nasd\nstring fsdsg\nasgsdgfd sdfsd\n<saef 12n\af> sf \n123\nstring\n123\nstring gasfsfdsg\n";
    my @foo = split /string/, $input;

    Now the @foo array contains five elements: the first one is empty because string comes right at the beginning (easy to drop if this is not desired), then come the three chunks you are looking for, and the last one is whatever follows the final string:

    Code:
    0  ''
    1  '
    12312
    asd
    '
    2  " fsdsg\cJasgsdgfd sdfsd\cJ<saef 12n\cGf> sf \cJ123\cJ"
    3  '
    123
    '
    4  ' gasfsfdsg
    '
    But this very easy solution depends on the language being used (but I am sure other languages have similar facilities).
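
    For comparison, here is a rough sketch of the same idea in Python, assuming that is an option for you. The sample text is a shortened version of the data from the question, with real newlines in place of the \n shown in the post, and the names text, chunks and between are made up for the example. It shows the split approach and, as an alternative, a lazy match with a lookahead that returns only the pieces that sit between two markers.

```python
import re

# Shortened sample from the question; "string" stands in for the fixed marker.
text = (
    "string\n12312\nasd\n"
    "string fsdsg\nasgsdgfd sdfsd\n123\n"
    "string\n123\n"
    "string gasfsfdsg\n"
)

# Approach 1: split on the marker, like the Perl split above.
# The first piece is empty (the text starts with the marker) and the last
# piece is whatever follows the final marker, so both are dropped here.
chunks = re.split(r"string", text)
between_by_split = chunks[1:-1]

# Approach 2: lazy match up to the next marker.
# re.DOTALL lets "." match newlines, ".*?" stops as early as possible, and
# the lookahead (?=string) checks for the next marker without consuming it,
# so that marker can start the following match.
between = re.findall(r"string(.*?)(?=string)", text, flags=re.DOTALL)

for piece in between:
    print(repr(piece))
```

    Both approaches keep the newline or space that directly follows each marker, so you may want to strip the pieces afterwards. Neither one enforces the 1 to 6 line limit; that would need an extra check on each piece.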