#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0

    Add Data to text file and align with sequence.


    I have a text file formatted in a specific manner. I need to add a new line based on the input data and align it with a specific segment. How should I go about doing this?

    Code:
    >sp|P04798|CP1A1_HUMAN Cytochrome P450 1A1 OS=Homo sapiens GN=CYP1A1 PE=1 SV=1
    MLFPISMSATEFLLASVIFCLVFWVIRASRPQVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDTIRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDPASSTSCYLEEHVSKEAEVLISTLQELMAGPGHFNPYRYVVVSVTNVICAICFGRRYDHNHQELLSLVNLNNNFGEVVGSGNPADFIPILRYLPNPSLNAFKDLNEKFYSFMQKMVKEHYKTFEKGHIRDITDSLIEHCQEKQLDENANVQLSDEKIINIVLDLFGAGFDTVTTAISWSLMYLVMNPRVQRKIQEELDTVIGRSRRPRLSDRSHLPYMEAFILETFRHSSFVPFTIPHSTTRDTSLKGFYIPKGRCVFVNQWQINHDQKLWVNPSEFLPERFLTPDGAIDKVLSEKVIIFGMGKRKCIGETIARWEVFLFLAILLQRVEFSVPLGVKVDMTPIYGLTMKHACCEHFQMQLRS
                                                                        C         C        CC   CC         C   C  C   C   C            C                                  C  CC  C            C  CC  C  CC                            CC  C  C                                                C CCCCCCCCCCC  C                                                  C    CC  CC C                      C                                     CCCC  CHCCC  CC  CC                            CC   
    				CCCCCCECCCCCCCECCCECHHHHCCCHHHHHHHHHHHHCCEEEEEECCEEEEEECCHHHHHHHHCCCHHHCCECCCCHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCECCCCCCCEHHHHHHHHHHHHHHHHHHHHHCCCCCECHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHCCCHHHHCCCCCCCHHHCCHHHHHCCCHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHCCCCCCCECCECCCCCHHHHCHHHHHHHHHHHHHHHHHHHHHHHHHHCHHHHHHHHHHHHHHCCCCCCCCHHHHHHCHHHHHHHHHHHHHHCCCCECCCEECCCCEEECCEEECCCCEEEEEHHHHHHCCCCCCCCCCCCHHHHEECCCEECHHHHCCCCCCCCHHHCCCCHHHHHHHHHHHHHHHHHHCCEECCCCCCCCCCEECCCCCEECCCCCEEEC
    				9249848202717256920027506710020006106842300104008320000011800430026104101000801005202424002005313510410271024004400138178299101013203710430053056228762401006200300010001000293044928302500241231040000124000345338582750520330065125004401720685177942700000011307266240884371426202100010025100300000000000004359108402710583026934051702760100100000000200010001001024514066200176100000010000178206508502031032998203662174000121240342134002200000000000503021399391625241010000360440309
    References: PDB model from SuperCYP: http://bioinformatics.charite.de/supercyp/, RSA and SS from Polyview: http://polyview.cchmc.org/, Cavity sites from CASTp: http://sts-fw.bioengr.uic.edu/castp/calculation.php, Where available Heme binding sites from polyview, transmembrane sites from uniprot, Uniprot ID: http://www.uniprot.org/uniprot/P04798
    I would need to be able to find the segment I need, add the new line, then find the segment I need to align. I have the code to read the data from the input, but don't know how to go about inserting into the text file.

    Hopefully the lines match up.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    828
    Rep Power
    496
    This is not clear to me. Please tell us how we are supposed to identify the segment you need, as well as the segment you want to align. Please also show your expected result.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    Originally Posted by Laurent_R
    This is not clear to me. Please tell us how we are supposed to identify the segment you need, as well as the segment you want to align. Please also show your expected result.
    I need to identify it based on a given segment. So, for example, the input file matches

    Code:
    VPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDTIRQALVRQG DDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDPASSTSCYLEEHVSKEAEVLISTLQELMAGPGHFNPYRYVVVSVTNVICA ICFGRRYDHNHQELLSLVNLN
    I need to find that in the line of the file corresponding to the input; in the original post that entry was |CP1A1_HUMAN|. I would need to find that entry, add a new line below the sequence, and insert the corresponding data in that line, aligned with the matched segment.

    ORIGINAL:
    Code:
    >sp|P04798|CP1A1_HUMAN Cytochrome P450 1A1 OS=Homo sapiens GN=CYP1A1 PE=1 SV=1
    MLFPISMSATEFLLASVIFCLVFWVIRASRPQVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDTIRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDPASSTSCYLEEHVSKEAEVLISTLQELMAGPGHFNPYRYVVVSVTNVICAICFGRRYDHNHQELLSLVNLNNNFGEVVGSGNPADFIPILRYLPNPSLNAFKDLNEKFYSFMQKMVKEHYKTFEKGHIRDITDSLIEHCQEKQLDENANVQLSDEKIINIVLDLFGAGFDTVTTAISWSLMYLVMNPRVQRKIQEELDTVIGRSRRPRLSDRSHLPYMEAFILETFRHSSFVPFTIPHSTTRDTSLKGFYIPKGRCVFVNQWQINHDQKLWVNPSEFLPERFLTPDGAIDKVLSEKVIIFGMGKRKCIGETIARWEVFLFLAILLQRVEFSVPLGVKVDMTPIYGLTMKHACCEHFQMQLRS
                                                                        C         C        CC   CC         C   C  C   C   C            C                                  C  CC  C            C  CC  C  CC                            CC  C  C                                                C CCCCCCCCCCC  C                                                  C    CC  CC C                      C                                     CCCC  CHCCC  CC  CC                            CC   
    				CCCCCCECCCCCCCECCCECHHHHCCCHHHHHHHHHHHHCCEEEEEECCEEEEEECCHHHHHHHHCCCHHHCCECCCCHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCECCCCCCCEHHHHHHHHHHHHHHHHHHHHHCCCCCECHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHCCCHHHHCCCCCCCHHHCCHHHHHCCCHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHCCCCCCCECCECCCCCHHHHCHHHHHHHHHHHHHHHHHHHHHHHHHHCHHHHHHHHHHHHHHCCCCCCCCHHHHHHCHHHHHHHHHHHHHHCCCCECCCEECCCCEEECCEEECCCCEEEEEHHHHHHCCCCCCCCCCCCHHHHEECCCEECHHHHCCCCCCCCHHHCCCCHHHHHHHHHHHHHHHHHHCCEECCCCCCCCCCEECCCCCEECCCCCEEEC
    				9249848202717256920027506710020006106842300104008320000011800430026104101000801005202424002005313510410271024004400138178299101013203710430053056228762401006200300010001000293044928302500241231040000124000345338582750520330065125004401720685177942700000011307266240884371426202100010025100300000000000004359108402710583026934051702760100100000000200010001001024514066200176100000010000178206508502031032998203662174000121240342134002200000000000503021399391625241010000360440309
    References: PDB model from SuperCYP: http://bioinformatics.charite.de/supercyp/, RSA and SS from Polyview: http://polyview.cchmc.org/, Cavity sites from CASTp: http://sts-fw.bioengr.uic.edu/castp/calculation.php, Where available Heme binding sites from polyview, transmembrane sites from uniprot, Uniprot ID: http://www.uniprot.org/uniprot/P04798
    New:

    Code:
    >sp|P04798|CP1A1_HUMAN Cytochrome P450 1A1 OS=Homo sapiens GN=CYP1A1 PE=1 SV=1
    MLFPISMSATEFLLASVIFCLVFWVIRASRPQVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDTIRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDPASSTSCYLEEHVSKEAEVLISTLQELMAGPGHFNPYRYVVVSVTNVICAICFGRRYDHNHQELLSLVNLNNNFGEVVGSGNPADFIPILRYLPNPSLNAFKDLNEKFYSFMQKMVKEHYKTFEKGHIRDITDSLIEHCQEKQLDENANVQLSDEKIINIVLDLFGAGFDTVTTAISWSLMYLVMNPRVQRKIQEELDTVIGRSRRPRLSDRSHLPYMEAFILETFRHSSFVPFTIPHSTTRDTSLKGFYIPKGRCVFVNQWQINHDQKLWVNPSEFLPERFLTPDGAIDKVLSEKVIIFGMGKRKCIGETIARWEVFLFLAILLQRVEFSVPLGVKVDMTPIYGLTMKHACCEHFQMQLRS
                                                                        C         C        CC   CC         C   C  C   C   C            C                                  C  CC  C            C  CC  C  CC                            CC  C  C                                                C CCCCCCCCCCC  C                                                  C    CC  CC C                      C                                     CCCC  CHCCC  CC  CC                            CC   
    				CCHASDJSAKDJASDLKFJALKDSFJALKDSJFALKSDJFLASDFIWAEFADSKFHAKDSHFKSADJFJADSFAF
    				CCCCCCECCCCCCCECCCECHHHHCCCHHHHHHHHHHHHCCEEEEEECCEEEEEECCHHHHHHHHCCCHHHCCECCCCHHHHCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCECCCCCCCEHHHHHHHHHHHHHHHHHHHHHCCCCCECHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHCCCHHHHCCCCCCCHHHCCHHHHHCCCHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHCCCCCCCECCECCCCCHHHHCHHHHHHHHHHHHHHHHHHHHHHHHHHCHHHHHHHHHHHHHHCCCCCCCCHHHHHHCHHHHHHHHHHHHHHCCCCECCCEECCCCEEECCEEECCCCEEEEEHHHHHHCCCCCCCCCCCCHHHHEECCCEECHHHHCCCCCCCCHHHCCCCHHHHHHHHHHHHHHHHHHCCEECCCCCCCCCCEECCCCCEECCCCCEEEC
    				9249848202717256920027506710020006106842300104008320000011800430026104101000801005202424002005313510410271024004400138178299101013203710430053056228762401006200300010001000293044928302500241231040000124000345338582750520330065125004401720685177942700000011307266240884371426202100010025100300000000000004359108402710583026934051702760100100000000200010001001024514066200176100000010000178206508502031032998203662174000121240342134002200000000000503021399391625241010000360440309
    References: PDB model from SuperCYP: http://bioinformatics.charite.de/supercyp/, RSA and SS from Polyview: http://polyview.cchmc.org/, Cavity sites from CASTp: http://sts-fw.bioengr.uic.edu/castp/calculation.php, Where available Heme binding sites from polyview, transmembrane sites from uniprot, Uniprot ID: http://www.uniprot.org/uniprot/P04798
    For the new version, I just typed some example gibberish (the fourth line) aligned with the beginning of my sequence. I don't need to match the new characters, only insert the string into a new line starting at the position of the matched segment, so that it is aligned with the segment.

    Hopefully that's clearer. Basically: look at the input file, match the name of the entry, insert a new line under the sequence (2nd line), then align the data from the input file with the matched segment of the sequence from the original file.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    828
    Rep Power
    496
    I still don't quite get it. Sorry, the last time I had a biology course was in 1974; as far as I know, nobody used computers in this field at the time.

    But, assuming you want to insert a new line of rubbish (sorry, I could not resist) above your sequence and align the left margin of this new line with your sequence, I guess you have to read your file until the sequence, use a regular expression to capture the whitespace at the beginning of that line, print that captured whitespace followed by your new line, and then print the line from the file.

    A quick session under the Perl debugger to show you the idea:

    Code:
      DB<24> my $line = "                             foo bar"
    
      DB<25> $spaces = $1  if $line =~ /(^\s+)/
    
      DB<26> print ">>$spaces<<"
    >>                             <<
      DB<27> print $spaces, "BAZ", "\n", $line;
                                 BAZ
                                 foo bar
      DB<28>
    Line 24: I initialize the line with leading spaces followed by "foo bar".
    Line 25: I capture the leading spaces before the sequence.
    Line 26: I print the leading spaces between >> and << delimiters to visualize them clearly.
    Line 27: I print the captured spaces, my new sequence (BAZ), a newline and the existing line. As you can see, BAZ is well aligned with foo bar.

    I think that's what you wanted. If not, well, sorry, I should probably not try to help people in the field of bioinformatics.
    Last edited by Laurent_R; July 29th, 2013 at 04:35 PM.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    Thank you, it's a start, but I need to align the rubbish with part of the other rubbish.

    So say I have the sequence ABCDEFGHIJKLMNOP, and I need to align CCHDHDGF with the part of the sequence FGHIJKLM. The code will have to add a new line, and then add the CCHDHDGF sequence like so:

    Code:
    ABCDEFGHIJKLMNOP
         CCHDHDGF
    I hope this makes it clearer. I can fiddle with your snippet, and I thank you profusely for your help.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    828
    Rep Power
    496
    OK. Maybe something like this (untested):

    Perl Code:
    my $rubbish = "CCHDHDGF";
    my $line = "ABCDEFGHIJKLMNOP";
    # capture whatever precedes the segment (list assignment avoids "my ... if")
    my ($offset) = $line =~ /^(.*?)FGHIJKLM/ or die "segment not found\n";
    $offset =~ s/./ /g; # convert offset into spaces
    print $offset, $rubbish, "\n";
    Last edited by Laurent_R; July 30th, 2013 at 03:14 AM.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    828
    Rep Power
    496
    Actually, I posted the above answer too quickly.

    There is a better solution. Using the index function on the sequence line will give you the position (offset) of the "FGHIJKLM" substring. You can then use this to build a string of spaces having the right length and print it in front of "CCHDHDGF".
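
    A minimal sketch of that index idea, using the toy strings from the earlier posts. The helper name align_under is made up for illustration, and the sketch assumes the text before the segment contains no tabs (a tab counts as one character for index but is displayed wider).

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): return $annotation padded
# with spaces so that it starts directly under the first occurrence of
# $segment in $line. Returns undef when $segment does not occur in $line.
sub align_under {
    my ($line, $segment, $annotation) = @_;
    my $pos = index($line, $segment);   # 0-based offset, -1 if not found
    return if $pos < 0;
    return (' ' x $pos) . $annotation;  # $pos spaces, then the new data
}

my $line    = "ABCDEFGHIJKLMNOP";
my $aligned = align_under($line, "FGHIJKLM", "CCHDHDGF");

# Print the existing line first, then the aligned line below it.
print "$line\n$aligned\n" if defined $aligned;
```

    Checking for -1 matters here: without it, a missing segment would silently produce an unpadded line instead of signalling that nothing matched.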
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    Originally Posted by Laurent_R
    Actually, I posted the above answer too quickly.

    There is a better solution. Using the index function on the sequence line will give you the position (offset) of the "FGHIJKLM" substring. You can then use this to build a string of spaces having the right length and print it in front of "CCHDHDGF".
    Ah, OK, I will look into that. Your original snippet added a new line above a line; how do I add one below? Unless I just go to the line under it and add one above.

    Thank you for all the help.
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    828
    Rep Power
    496
    Just print the other line first.

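    Please note that, in general, you cannot insert a line into the middle of an existing file in place; you need to write a new file that contains the extra line and, if you wish, replace the original with it afterwards.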
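
    A rough sketch of the whole read-and-rewrite loop, under some assumptions that are not stated in the thread: each record starts with a ">" header line, the sequence is the single line right after that header, and the entry is identified by a plain substring of the header such as CP1A1_HUMAN. The sub name annotate_file and the file names in the note below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out and, in the record whose header
# contains $entry_id, add $annotation on a new line below the sequence,
# starting under the first occurrence of $segment.
# Returns the number of lines inserted.
sub annotate_file {
    my ($in, $out, $entry_id, $segment, $annotation) = @_;
    open my $src, '<', $in  or die "Cannot read $in: $!";
    open my $dst, '>', $out or die "Cannot write $out: $!";
    my ($want_seq, $inserted) = (0, 0);
    while (my $line = <$src>) {
        print {$dst} $line;                  # always print the existing line first
        if ($line =~ /^\s*>/) {              # header line of a record
            $want_seq = index($line, $entry_id) >= 0;
            next;
        }
        next unless $want_seq;
        $want_seq = 0;                       # only the line right after the header
        my $pos = index($line, $segment);
        next if $pos < 0;                    # segment not in this sequence
        # Blank out everything before the segment but keep existing
        # whitespace (including tabs), so the columns still line up.
        (my $pad = substr($line, 0, $pos)) =~ s/\S/ /g;
        print {$dst} $pad, $annotation, "\n";
        $inserted++;
    }
    close $src;
    close $dst or die "Cannot close $out: $!";
    return $inserted;
}
```

    It could be called as annotate_file('proteins.txt', 'proteins.new.txt', 'CP1A1_HUMAN', $segment, $new_data), and the new file renamed over the old one once the result has been checked. Because the existing line is printed before the inserted one, the new data ends up below the sequence.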
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    Originally Posted by Laurent_R
    Just print the other line first.

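    Please note that, in general, you cannot insert a line into the middle of an existing file in place; you need to write a new file that contains the extra line and, if you wish, replace the original with it afterwards.
    Ah, OK. I have most of it working now. However, I have discovered a new problem outside the scope of this question, so I will consider this topic solved.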


    Thank you everyone for the help!!
