#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0

    Help needed in pattern extraction from file


    Hi,

    I have a large input file with the following pattern. Each record starts at "gene"; two consecutive records are shown below:


    ..
    ..
    gene complement(1455..1673)
    /locus_tag="HMPREF9503_01801"
    CDS complement(1455..1673)
    /locus_tag="HMPREF9503_01801"
    /note="Psort location: Cytoplasmic, score: 7.50"
    /codon_start=1
    /transl_table=11
    /product="conserved hypothetical protein"
    /protein_id="EFT99317.1"
    /db_xref="GI:315155301"
    /translation="MKYTIAGILSVIIMFGAVIFVFFGKKPVEESVHTSTETTVKSST
    KISESSTVSSTATTESSTEITSVSSDES"
    gene complement(1694..2218)
    /locus_tag="HMPREF9503_01802"
    CDS complement(1694..2218)
    /locus_tag="HMPREF9503_01802"
    /note="Psort location: Cytoplasmic, score: 7.50"
    /codon_start=1
    /transl_table=11
    /product="conserved hypothetical protein"
    /protein_id="EFT99318.1"
    /db_xref="GI:315155302"
    /translation="MIMHQVYVDIIVDAIIEQYQTEENFYSAYQIQAADWQAWKEGQF
    GLDNEVMQKIKNLFTDYEWMLTQKILRQTILFPEKRNLAVSEYKRLKTTIAKKWLQSD
    LGVVELIPNNKQEQEIAAGYIDLKVTLAYGEWGFDDIITFRLPATIQRQLEGSKVELL
    DWVNENLMDTYVGE"

    ..




    From this, I want to extract the GI, protein_id and translation sequence of each record and write them to another file (output.txt) in a specified format. For the input above, the output file would be:


    >gi|315155301|EFT99317.1
    MKYTIAGILSVIIMFGAVIFVFFGKKPVEESVHTSTETTVKSSTKISESSTVSSTATTESSTEITSVSSDES
    >gi|315155302|EFT99318.1
    MIMHQVYVDIIVDAIIEQYQTEENFYSAYQIQAADWQAWKEGQFGLDNEVMQKIKNLFTDYEWMLTQKILRQTILFPEKRNLAVSEYKRLKTTIAKKWLQSDLGVVELIPNNKQEQEIAAGYIDLKVTLAYGEWGFDDIITFRLPATIQRQLEGSKVELLDWVNENLMDTYVGE

    ...........
    .......

    Any idea? Thanks in advance..
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Location
    39N 104.28W
    Posts
    157
    Rep Power
    2
    There are no doubt more efficient ways to do this, but I usually just proceed by brute force. I assume the order of items is consistent:
    Code:
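    fid = open("<filename>")  # substitute the actual file name for <filename>
    outputlst = []
    for strline in fid:
        strline = strline.strip()  # drop leading indentation and the newline
        if strline.startswith("/protein_id"):
            pid = strline.split("=")[1].strip('"')
        elif strline.startswith("/db_xref"):
            xrf = strline.split("=")[1].strip('"')
        elif strline.startswith("/translation"):
            genestr = strline.split("=")[1].strip('"')
            outputlst.append(xrf + "|" + pid + "|" + genestr)
    fid.close()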
    This assumes each "translation" is continuous, i.e. it sits on a single line.
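    If the translation is not continuous (in the sample in the first post it wraps over several lines), the loop has to keep collecting lines until the closing quote. Here is a minimal sketch of that idea, not a definitive parser. It assumes every record has /protein_id and a /db_xref="GI:..." line before /translation, as in the sample; the names parse_records and sample are made up for illustration.

```python
def parse_records(lines):
    """Yield (gi, protein_id, sequence) for each /translation found."""
    gi = pid = None
    seq_parts = None  # None while we are outside a /translation value
    for raw in lines:
        line = raw.strip()  # qualifier lines are indented in the file
        if seq_parts is not None:
            # Continuation line of a wrapped /translation value
            seq_parts.append(line.rstrip('"'))
            if line.endswith('"'):  # closing quote ends the sequence
                yield gi, pid, "".join(seq_parts)
                seq_parts = None
        elif line.startswith("/protein_id="):
            pid = line.split("=", 1)[1].strip('"')
        elif line.startswith('/db_xref="GI:'):
            gi = line.split(":", 1)[1].strip('"')
        elif line.startswith("/translation="):
            value = line.split("=", 1)[1][1:]  # drop the opening quote
            if value.endswith('"'):
                yield gi, pid, value[:-1]  # sequence fits on one line
            else:
                seq_parts = [value]


# First record from the question, used as test input
sample = '''
    gene complement(1455..1673)
    /locus_tag="HMPREF9503_01801"
    CDS complement(1455..1673)
    /locus_tag="HMPREF9503_01801"
    /protein_id="EFT99317.1"
    /db_xref="GI:315155301"
    /translation="MKYTIAGILSVIIMFGAVIFVFFGKKPVEESVHTSTETTVKSST
    KISESSTVSSTATTESSTEITSVSSDES"
'''

records = list(parse_records(sample.splitlines()))
for gi, pid, seq in records:
    print(">gi|%s|%s" % (gi, pid))
    print(seq)
```

    For a real file, pass the open file object instead of sample.splitlines().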
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0
    Sorry, but it produces no output.

    Originally Posted by rrashkin
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Location
    39N 104.28W
    Posts
    157
    Rep Power
    2
    Did you substitute the actual file name for "<filename>"?
    Then you need to do something with outputlst, such as printing it or writing it to output.txt; the loop only builds the list.
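    As one example of doing something with the results, the entries could be written out in the ">gi|...|..." layout asked for in the first post. This is only a sketch: it assumes each entry has already been cleaned up into a (gi, protein_id, sequence) tuple, which is not exactly what outputlst holds, and write_fasta plus the temporary output path are illustrative choices.

```python
import os
import tempfile


def write_fasta(records, path):
    """Write each (gi, protein_id, sequence) tuple as a header line plus sequence line."""
    with open(path, "w") as out:
        for gi, pid, seq in records:
            out.write(">gi|%s|%s\n%s\n" % (gi, pid, seq))


# One cleaned-up record taken from the question
records = [
    ("315155301", "EFT99317.1",
     "MKYTIAGILSVIIMFGAVIFVFFGKKPVEESVHTSTETTVKSSTKISESSTVSSTATTESSTEITSVSSDES"),
]

# Written to a temporary directory here; use "output.txt" directly in practice
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_fasta(records, out_path)

with open(out_path) as f:
    written = f.read()
print(written)
```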
