    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0

    Some Regex Help - stuck newbie!


    Hey guys,

    Thanks in advance to anyone who can help me here. I'm not that well versed with perl, but have managed to fashion and implement a few handy scripts with time. I've just hit a bit of a wall, with what I'm trying to do.

    I have two datasets. One with a list of search probes, and another with masses of text, in which these probes exist. I'm a biologist, and the data comprises DNA sequences. The data I want to use as a probe will be structured like this:

    >CCCCCCCCATGGATGGTTGATTGAGGTCTTGGAAAGATATGGTGATGACCAAAACAAATAC
    >GTTTTCCCCACATTTAAATTTGTCTATGAAT

    And the data I wish to search for these lines in, would be the same, but with (perhaps) a little more surrounding text in each line:

    >GAATGGATGAAGTGATGTCCCCCCCCCATGGATGGTTGATTGAGGTCTTGGAAAGATATGGTGATGACAAAACAAATACAGCACCTTATGTGCCTAATA GTCTAATAGGGAAAACAGATAAAT<
    >GAATGGATGAAGTGATGTCCCCCCCCCAATGGATGAAGTGATGTTTTCCCCACATTTAAATTTGTCTATGAATTTTCCCGGAACCTCTGAAAACTGTTTTAGTATTTCCTTGCATATGGCTAATTCAGATATAGAAAAGTGTACACGTACCTATATATGTGGGGAAATGTGGGGAAAAGAGGCGGAGAGTGGACGGA<

    The bit in bold is an example of a line that matches (in this case, it matches the second line of the two probe sequences provided). What I want to be able to do is to get the script to read off each individual line in the file of probes, and search through the second, larger data file for that. If it finds the probe, I basically want to print out as much as 30 extra letters either side of it, and to then spit that out for further processes, which I've gotten to work, hopefully. The underlined bit above is the amount that I want to try and extract (30 characters either side of the match).

    The bit of script I have thus far is:

    #! usr/bin/perl #-w

    open (A, "Probes.txt");

    while ( @seqs = <A> )
    {
    }

    open (B, "Sequences.txt");

    while (@seqs2 = <B> )
    {

    foreach $seqs2 (@seqs2)
    {
    if ( $seqs2 =~ /(\w*@seqs.*)/ )
    {
    my $read = $1;
    print $1, "\n";
    }
    }
    }


    It's probably a disaster, but the bit I need to get working better is the regex in the if loop.

    At the moment, it's picking out the whole file, and not just the line with the search hit present.

    Apologies for posting something that is probably extremely basic, but any help here would be much appreciated!

    Many thanks in advance,

    Adam
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    When you want to read the content of a file, you should do one of two things.

    You can read one line at a time into a scalar variable (a variable starting with the $ sigil) and do something with that line, with a loop syntax such as this:

    Code:
    while (my $line = <$FILE_IN>) {
         # do something with the line you just read and process each line thanks to the while loop
    }
    Or you can "slurp" the whole file into an array variable with something like this:
    Code:
    my @list_of_lines = <$FILE_IN>;
    # now process each element of the array
    The first main problem of your code is that you mix up the two syntaxes.

    As an example:
    Code:
    while ( @seqs = <A> )
    {
    }
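    Either you use a while loop or you use an array variable, not both. Here the mix does not work: the first test of the while condition loads the whole file into @seqs, but the second test reads at end-of-file and assigns an empty list, so @seqs is empty once the loop ends. This is simpler and correct: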
    Code:
    my @seqs = <A>;
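
    One thing worth checking with the while form is what is left in the array once the loop has finished. Below is a minimal sketch of that check. The helper names slurp_in_while and slurp_directly are made up for illustration, and an in-memory string stands in for Probes.txt so the snippet runs on its own.

    ```perl
    use strict;
    use warnings;

    # Both helpers are hypothetical and read from an in-memory string
    # instead of a real file, so the example is self-contained.

    # The mixed form: a while loop whose condition slurps into an array.
    sub slurp_in_while {
        my ($text) = @_;
        open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
        my @lines;
        while (@lines = <$fh>) {
            # First test of the condition: @lines holds every line.
        }
        # Second test of the condition: <$fh> was at end-of-file, returned
        # an empty list, and that empty list was assigned to @lines.
        close $fh;
        return @lines;
    }

    # The plain slurp: one assignment, no loop.
    sub slurp_directly {
        my ($text) = @_;
        open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
        my @lines = <$fh>;
        close $fh;
        return @lines;
    }

    my $probes     = "ACGT\nTTGA\n";
    my @from_while = slurp_in_while($probes);
    my @from_slurp = slurp_directly($probes);
    print "while form kept ",  scalar(@from_while), " lines\n";
    print "plain slurp kept ", scalar(@from_slurp), " lines\n";
    ```

    With the two-line sample, the while form should report 0 lines and the plain slurp 2.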
    In the case of the second file, because you are doing other things within the loop, you get into a mess.

    The next problem, far more serious, is that your algorithm is just wrong. You really need two nested loops, and there, let us hope that your files are not too big, because if they are both really big, it might take ages to execute. It could be done in two different ways. In both cases, you need to open the search probes file and store it into memory (probably into an array, as you have done).
    Then you open the second file and:
    - read the first line of that file and try to match it against each record of the search probe array, and then proceed with the second line, and so on.
    - or do it the other way around: try each line of the second file against the first line of the search probe array, then proceed with the second line of the search probe array, and so on.

    If your search probe file has m records and your data n records, both strategies will require up to m x n regex matches, which is why it might take long if both files are big. But, depending on your data, the strategies might not be equivalent in terms of performance. One of the problems with the second one is that you also need to store the second file in an array, and this might be less efficient.

    Most of the time, the first strategy is better for some other various reasons. One of them is that, according to your description of the problem, once you have matched a data record against a probe, you probably don't need to try the other probes, because I understand that you are not going to find two probes in the same data record.

    So I would suggest something like this (untested code, just a basic architecture of a program):

    Perl Code:
    my @seqs = <A>; # loading the probes
    chomp @seqs; # removing newline characters from the lines of @seqs
    while (my $line = <B>) { #reading the data lines one at a time
         for my $probe (@seqs) {
              if ($line =~ /$probe/) {
                   # we've got a match, do something with it
                   last ; # we have a match, no need to try the other probes
              }
         }
    }


    There are a number of other problematic things in your code.

    Code:
    #! usr/bin/perl #-w
    Why do you comment out the -w flag? Bad idea. Don't silence warnings, they give you important information about possible bugs.

    But don't use the -w flag either; the modern way (well, modern for the last 15 years at least) to do this in Perl is:

    Code:
    #! usr/bin/perl
    use warnings;
    use strict;
    The 'use strict' pragma is another thing that will help you detect bugs early. I am sure you will not like it too much at the beginning, because the program that used to compile might just no longer compile. But it is immensely useful, because it detects at compile time errors that you might otherwise see only at run time and that are likely to bite you much more severely then. And I can promise you that after having used it for just a few days, you will love it. One of the consequences of using the strict pragma is that each variable needs to be declared the first time you use it, with the my function, as I have done in the code sample I gave you.

    As just one example of where the strict pragma can help you, suppose that you use the $seqs variable somewhere. Somewhere else in your program, you call it $seq. Just a stupid typo. You will not see it until you try to debug your program, and it might take you hours to figure out why $seq does not contain what you think. If you use the strict pragma, Perl will tell you straight away at compile time that you made this mistake.
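
    To see what that compile-time complaint looks like, here is a small sketch. The strict_error helper is made up for this illustration; it compiles a snippet with eval under strict and hands back whatever error Perl reports. The exact wording of the message varies a little between Perl versions.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: compile and run a snippet under strict and
    # return Perl's error message (empty string if there was none).
    sub strict_error {
        my ($code) = @_;
        eval "use strict; $code; 1" or return $@;
        return '';
    }

    # $seqs is declared, but the second statement misspells it as $seq.
    my $typo  = 'my $seqs = "ACGT"; my $n = length $seq';
    my $fixed = 'my $seqs = "ACGT"; my $n = length $seqs';

    print "typo:  ", strict_error($typo);
    print "fixed: ", (strict_error($fixed) || "no error"), "\n";
    ```

    The first line of output should be a 'Global symbol "$seq" requires explicit package name' error; the snippet with the correct name compiles cleanly.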

    A final point is that you are not using the best syntax to open your files. I just give an example of how it should be done according to commonly accepted best practices:

    Rather than

    Code:
    open (A, "Probes.txt");
    a much better syntax is:

    Perl Code:
    my $probes = "Probes.txt";
    open my $A, "<", $probes or die "Unable to open $probes $!\n";


    There are a number of important differences between the two syntaxes. But my post is already quite long, so I will leave that aside for the time being and would advise you to type "perldoc -f open" at your prompt for further details. Don't hesitate to ask if you want to know more. Many people on this site (including myself) will be willing to give you more details.
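
    For what it is worth, here is a minimal sketch of the three-argument open with a lexical filehandle and an error check, wrapped in a made-up read_lines helper. A temporary file created with File::Temp stands in for Probes.txt, so nothing here depends on your real files.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical helper: return the chomped lines of a file, or die
    # with the reason ($!) if the file cannot be opened.
    sub read_lines {
        my ($path) = @_;
        open my $fh, '<', $path or die "Unable to open $path: $!\n";
        my @lines = <$fh>;
        close $fh;
        chomp @lines;
        return @lines;
    }

    # A throwaway file standing in for Probes.txt.
    my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
    print {$tmp_fh} "ACGT\nTTGA\n";
    close $tmp_fh or die "Cannot close $tmp_path: $!\n";

    my @probes = read_lines($tmp_path);
    print scalar(@probes), " probes read\n";

    # A missing file is reported instead of silently yielding no data.
    my $ok = eval { read_lines("$tmp_path.missing"); 1 };
    print "Error caught: $@" unless $ok;
    ```

    The point of the "or die" is the last two lines: with the old two-argument form and no check, a wrong path just gives you an empty array and a script that prints nothing.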
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Originally Posted by Laurent_R
    Many people on this site (including myself) will be willing to give you more details.
    Hi Laurent - thank you so much for providing me with such a helpful, detailed post! I knew it would be full of mistakes. I'd basically had something down, and just tinkered every little bit, covered it in print statements, and just tried to figure out what was going on, mucking the whole thing up even more, in the process!

    Thank you for your help! I tried to use a bit of the code you put in, but still can't get it to do what I'm after. It prints nothing. I basically did this (forgive me for using the 'open (A, 'probes.txt')' etc. syntax here; I got a gazillion error messages when I tried my $probe = 'probes.txt'. Confused!).

    I basically tried to capture the match, and print it. Still won't work. My test files are literally exactly what I posted in the original post. I'm keeping it simple, before trying this on the real data. Whilst there's quite a few lines in the real data, it's not too massive, and I really don't mind waiting a few hours if it's going to work. I literally can't think of another way around doing what I'm trying to do, so whatever works is great!

    Code:
    #! usr/bin/perl
    use warnings;
    use strict;
    
    open (A, "Flanks/ALL_HITS_TEST.txt");
    my @seqs = <A>; # loading the probes
    chomp @seqs; # removing newline characters from the lines of @seqs
    open (B, "Flanks/SampleGenome.txt");
    while (my $line = <B>) { #reading the data lines one at a time
         for my $probe (@seqs) {
              if ($line =~ /($probe)/) {
    		print $1; #Not sure if this is what I should be doing here...
                   # we've got a match, do something with it
                   last ; # we have a match, no need to try the other probes
              }
         }
    }
    Forgive my amateurishness! This is still unexplored territory for me. Thank you again for your help - it's much appreciated!
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Just realised I changed the file names to what they actually are on my hard drive. Apologies! The ALL_HITS_TEST.txt is the probes file, and the SampleGenome.txt is the sequence file...
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Hi,

    I just made a quick try with the following code:

    Perl Code:
    #! usr/bin/perl
    use warnings;
    use strict;
     
    my @seqs = ("CCCCCCCCATGGATGGTTGATTGAGGTCTTGGAAAGATATGGTGATGACCAAAACAAATAC",
    "GTTTTCCCCACATTTAAATTTGTCTATGAAT");
     
    chomp @seqs; # removing newline characters from the lines of @seqs
     
    while (my $line = <DATA>) { #reading the data lines one at a time
         for my $probe (@seqs) {
              if ($line =~ /($probe)/) {
    		print $1, "\n"; #Not sure if this is what I should be doing here...
                   # we've got a match, do something with it
                   last ; # we have a match, no need to try the other probes
              }
         }
    }
    __DATA__
    GAATGGATGAAGTGATGTCCCCCCCCCATGGATGGTTGATTGAGGTCTTGGAAAGATATGGTGATGACAAAACAAATACAGCACCTTATGTGCCTAATAG TCTAATAGGGAAAACAGATAAAT
    GAATGGATGAAGTGATGTCCCCCCCCCAATGGATGAAGTGATGTTTTCCCCACATTTAAATTTGTCTATGAATTTTCCCGGAACCTCTGAAAACTGTTTT  AGTATTTCCTTGCATATGGCTAATTCAGATATAGAAAAGTGTACACGTACCTATATATGTGGGGAAATGTGGGGAAAAGAGGCGGAGAGTGGACGGA


    And this is what it is printing when I run it:

    Code:
     $ perl test_dn.pl
    GTTTTCCCCACATTTAAATTTGTCTATGAAT
    So the code per se seems to be working (except that you might want to print the full sequence, not just the probe, but that is up to you). Try to run the code above and see if it works. I am pretty sure it will, since it works for me. The problem probably has to do with the data.

    First, you posted your probes as:
    Code:
    >CCCCCCCCATGGATGGTTGATTGAGGTCTTGGAAAGATATGGTGATGACCAAAACAAATAC
     >GTTTTCCCCACATTTAAATTTGTCTATGAAT
    If you really have these '>' characters at the beginning of the string, then the match cannot occur.
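
    A tiny sketch of that point, with a made-up data line and a made-up probe_found helper: the same probe is tried once as stored (with the '>') and once with the '>' stripped.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: 1 if $probe occurs literally in $line, else 0.
    sub probe_found {
        my ($line, $probe) = @_;
        return $line =~ /\Q$probe\E/ ? 1 : 0;
    }

    my $line = "GATGTTTTCCCCACATTTAAATTT";   # made-up data line
    my $raw  = ">TTTTCCCCACATT";             # probe as stored, with '>'
    (my $bare = $raw) =~ s/^>//;             # same probe without the '>'

    print "with '>':    ", probe_found($line, $raw),  "\n";
    print "without '>': ", probe_found($line, $bare), "\n";
    ```

    The data line contains no '>' at all, so the first test reports 0 and the second reports 1.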

    You might also have a problem if you run your program under Unix and prepared your files under Windows. Please post your two files as attachments, so that I can see what they look like. Or else print your sequences, once you have them in the array, with delimiters before and after:

    Perl Code:
    print "[$_]" foreach @seqs;


    You might see some extra characters after the [ or, more probably, before the ] (for example a carriage return). If this is the problem, it is very easy to fix: add the following line after the chomp:
    Perl Code:
    s/\r//g foreach @seqs;
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Originally Posted by Laurent_R
    You might also have a problem if you run your program under Unix and prepared your files under Windows. Please post your two files as attachments, so that I can see what they look like.
    This is awesome - thank you so much! That appears to work. So you're saying that the problem is with the data? These are two test files - the actual data files are basically arranged exactly like this, but are just much larger. I was keeping it simple, to make sure the script was doing the right thing.

    The code you sent me works perfectly, and I added in the bits to make it select some surrounding sequence, too.
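
    The thread does not show the bits that were added for the surrounding sequence. One way it could be done, as a sketch only: capture up to 30 characters on each side of the probe in the same regex. The match_with_flanks helper and the sample line are made up for illustration, and the example uses a flank of 3 to keep the output short.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the probe plus up to $flank characters
    # on each side of it, or undef if the probe is not in the line.
    sub match_with_flanks {
        my ($line, $probe, $flank) = @_;
        # \Q...\E matches the probe literally; {0,N} lets the flanks be
        # shorter when the probe sits near either end of the line.
        if ($line =~ /(.{0,$flank})\Q$probe\E(.{0,$flank})/) {
            return $1 . $probe . $2;
        }
        return;
    }

    my $line = "AAAAACCCCCGGGTTTTTAAAAA\n";   # made-up data line
    my $hit  = match_with_flanks($line, "GGG", 3);
    print defined $hit ? "$hit\n" : "no match\n";
    ```

    For the real data the third argument would be 30. Because "." does not match a newline, the right-hand flank stops at the end of the line without needing a chomp.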

    As for attaching the files, I apparently don't have the privileges to do so on this forum, so here's a link. Either way, they were prepared on my Mac - they won't have ever encountered a Windows machine.

    dropbox(dot)com/sh/78hpal9j9emul8k/QVuEP0hwQq

    Forgive me, but I was unsure what you meant at the end of your last post... I didn't see any extra characters...

    Also, if a match can't occur with '>' being present, do you reckon the best thing to do would be just to substitute these out? Or can you put another character such as '<' at the other end of each line? Without these, I'm guessing perl will still know that each one is a new line?

    Again, I am extremely grateful for your help - many thanks!

    Adam
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Originally Posted by AdamLeeGuitaris
    As for attaching the files, I apparently don't have the privileges to do so on this forum, so here's a link. Either way, they were prepared on my Mac - they won't have ever encountered a Windows machine.
    If you prepared your files on a Mac and run your program in a Unix-like environment, then you will probably encounter the same type of problem with newline characters as the one I described for Windows (albeit not exactly the same one). But if you run Perl under Mac, it should be OK. But given that you have this line "#! usr/bin/perl" at the beginning of your script, I suspect you run it on U*ix. I'll try to look at your files later.

    Originally Posted by AdamLeeGuitaris
    Also, if a match can't occur with '>' being present, do you reckon the best thing to do would be just to substitute these out?
    Yes, that's what should be done. The line containing the search probe should have only the list of nucleotide letters (ACTG) and nothing else. Remember that a regex is really nothing more than a slightly advanced form of text matching.

    Originally Posted by AdamLeeGuitaris
    Or can you put another character such as '<' at the other end of each line?
    No, that won't work. Again, it is textual matching. If your probe looks like "<ACCTA...", the regex engine will scan your data line looking for a "<" immediately followed by an "A", followed by two "C"s, etc., and it will obviously fail to find a match. So the other solution (removing the leading ">" when loading the probes into the array) is a better solution.

    I'll look at your files and come back to you later today.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    I can't really see your file the way you have it, because it is wrapped in HTML code.

    I don't have a Mac, but let me give an example of this end of line problem with a Windows file under Unix.

    I am opening a perl script file written with an editor under Windows and read the first line into the $c variable:

    Code:
     print $c
    #!/usr/bin/perl
    
    print ">$c<\n"
    >#!/usr/bin/perl
    <
    As you see, adding the angle brackets around my variable make it possible to see that there is some new line char between the word perl and the end of the variable.

    I now chomp it:
    Code:
     chomp $c
     print ">$c<\n"
    <#!/usr/bin/perl
    It is still not correct: I don't see the opening ">". This is because there is still a carriage return (\r) at the end of the variable: it sends the cursor back to the start of the line, so the "<" printed next overwrites the ">". You are actually likely to see something like this if you run a similar Perl program under Unix on a text file prepared on Mac.

    Let's try to remove this carriage return:

    Code:
     $c =~  s/\r//g;
    
     print ">$c<"
    >#!/usr/bin/perl<
    Now it is correct: my line no longer has these nasty invisible characters.

    - On Mac, line separators are the carriage return \r
    - On Unix, line separators are the new line (or line feed) \n
    - On Windows, line separators are a combination of \r and \n.

    When running a program under Unix, the Perl chomp function knows that it should remove \n at the end of the line. But that does not work properly if the file has been prepared under Mac or Windows. Of course, this kind of problem occurs only with cross platform plays, in this case using a Windows file under Unix.

    The solution is to explicitly remove these invisible characters. Since we also want to remove the leading '>' characters, try to replace this line in the program I gave you yesterday:

    Perl Code:
    chomp @seqs; # removing newline characters from the lines of @seqs

    with this:

    Perl Code:
    s/[^ACTG]//gi for @seqs;


    This should remove from every element of the @seqs array any character which is not A, C, T or G (capital letter or lower case).
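
    A quick sketch of that substitution at work, on made-up probe strings carrying a leading '>' and Windows or Unix line endings. The clean_probes helper is only for illustration; it works on a copy so the raw strings stay untouched.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return copies of the probes with everything
    # except A, C, T and G (either case) removed.
    sub clean_probes {
        my @probes = @_;
        s/[^ACTG]//gi for @probes;
        return @probes;
    }

    # Made-up probes: a '>' prefix plus different line endings.
    my @raw = (">ACGTTGCA\r\n", ">ttgacc\n", "GGCC");
    print "[$_]\n" for clean_probes(@raw);
    ```

    Each probe should come out between the brackets with nothing before or after it. One caveat: the character class also strips any other letter, so if your sequences use N for an unknown base, it would need to be added to the class.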
    Last edited by Laurent_R; August 3rd, 2013 at 04:32 AM.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Dear Laurent,

    Firstly, please accept my apology for replying this late! We're just about to move house here, and things are all over the place. I'm very sorry!

    Secondly, thank you so much for that! I have replaced the line as you instructed. I have a feeling it will help, but am somehow encountering another issue. Please forgive the mess that is the following script! I've littered it with random print statements to test that each bit is working, to try and figure where the problem lies. I've gotten up to the very last for loop, where I've commented a random print statement that I can't get to work. Not exactly sure what's going on. Apologies for dragging this out, and thank you, once again, for your time and help - much appreciated!

    Code:
    my @seqs = "Flanks/ALL_HITS_TEST.txt";
                   #print "[$_]" foreach @seqs;
    #		print @seqs;
    
    
    
    open (A, "Flanks/ALL_HITS_TEST.txt");
    
    while ( @seqs = <A> )
    	{  
    	#print @seqs, "\n";
    }
    
    
    s/[^ACTG]//gi for @seqs; ; 
    
    
    
    open (B, "Flanks/SampleGenome.txt");
    
    while (my $line = <B>) { 
    	#print "hey", "\n";
    	#print $line, "\n";
    
         for my $probe (@seqs) {
    	#print $probe;
    	print "sup \n"; #THIS is the bit I can't get to happen, somehow.
              if ($line =~ /($probe)/) {
    		print $1, "\n";
    		print "hey";
                    last ;
              }
         }
    }
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    I should add that what I have been getting is a script that runs without errors, but gives no output to the 'print $1' command in the very last if loop... So the mess above was me trying to figure out where it was going wrong! I'm sure it's something obvious! Also forgive me for using the file handles again instead of my. It was giving me a load of problems, so I just went back to something I had success with previously. I'll try and tidy it all up later, once I see that it's working!
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Try two things to start with:

    Replace:
    Code:
    open (A, "Flanks/ALL_HITS_TEST.txt");
    by
    Perl Code:
    open (A, "<Flanks/ALL_HITS_TEST.txt") or die "could not open file ALL_HITS_TEST.txt $!";


    This will tell you if you encounter a problem opening the file. You should never open a file without checking whether it worked or failed (and usually aborting if it failed). Make the same check for the other file.

    Second:
    Code:
    while ( @seqs = <A> )
    	{  
    	#print @seqs, "\n";
    }
    I already told you earlier that this is wrong: if you slurp the file directly into an array, don't use a while loop. You need either of two things:
    Perl Code:
    @seqs = <A>;
    s/[\r\n]//g foreach @seqs;

    or
    Perl Code:
    while ( my $line = <A> )
    	{  
            $line =~ s/[\r\n]//g;
    	push @seqs, $line;
    }


    This might solve your issue. Please let me know.
  22. #12
  23. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,254
    Rep Power
    1810
    I haven't looked into the logic of the script in any way, just wanted to set the record straight about one item:

    - On Mac, line separators are the carriage return \r
    - On Unix, line separators are the new line (or line feed) \n
    - On Windows, line separators are a combination of \r and \n.
    That is only true for "classic" versions of Mac OS, prior to OS X (pre-2000). OS X is unix, and uses the standard "\n" line endings. There is no need to do anything other than chomp when processing these files.

    Also, in a few places the shebang line was written as:

    Code:
    #! usr/bin/perl
    rather than
    Code:
    #!/usr/bin/perl
    as it should be. The built-in version of perl on Mac is in this common directory, and the shebang needs the absolute path if you are starting your script without prefixing it with 'perl' on the command line.
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Originally Posted by Laurent_R
    This might solve your issue. Please let me know.
    Thanks again, Laurent! Apologies for the while loop mistake again.

    I implemented what you suggested, and it seems to be alright now. The issue is I won't get anything if I try to print $1 at the end of the script. So it's all running without obvious errors, but not performing the function of pulling out the probe sequence from the SampleGenome.txt file. Here's the latest one:

    Code:
    use warnings;
    use strict;
    
    my @seqs = "Flanks/ALL_HITS_TEST.txt";
                   #print "[$_]" foreach @seqs;
    #		print @seqs;
    
    
    
    open (A, "<Flanks/ALL_HITS_TEST.txt") or die "Could not open file ALL_HITS_TEST.txt $!"; 
    
    
    while ( my $line = <A> )
        {  
            $line =~ s/[\r\n]//g;
        push @seqs, $line;
    #	print $line, "\n";
    } 
    
    
    open (B, "Flanks/SampleGenome.txt") or die "Could not open file SampleGenome.txt $!";
    
    while (my $line = <B>) { 
    	#print "hey", "\n";
    	#print $line, "\n";
    
         for my $probe (@seqs) {
    #	print $probe, "\n";
              if ($line =~ /($probe)/) {
    		print $1, "\n";
    		print "hey";
                    last ;
              }
         }
    }
    Sorry this is taking up your time, and thanks again for your help!
  26. #14
  27. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    8
    Rep Power
    0
    Thanks so much - learning! I'm still using the perl command on the command line, so I removed the shebang line altogether now. Thanks for the info!
  28. #15
  29. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Your code seems to be correct. It is most probably a data driven problem, but I don't know your data, so it is difficult to help further.

    The one thing, though, I am thinking about is that your probes file had lines starting with ">", and your code does not remove any such character. If you still have these '>' in your probes file, it can't work the way it is now. You need to change the relevant line as follows:

    Perl Code:
    $line =~ s/[\r\n>]//g;


    to also remove '>' from input (only needed on the probe file, not required on the data file).