#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0

    File header extraction


    Hi..

    I have a multifasta file (input.fasta) with many sequences as follows:

    >or3|1
    TASSWKVKNSGENKKEQKNGGN*
    >or3|2
    MEKKKREVINQILFQDISD*
    >or2|3
    MKVTQLLMWHTARELDLQISALN*

    .........
    ..

    and in another file (list.txt), I list the wanted header lines as:

    or3|2
    or2|3

    The output file (output.txt) should contain only the sequences whose headers are in the list file. For the above list, the output file would be:

    >or3|2
    MEKKKREVINQILFQDISD*
    >or2|3
    MKVTQLLMWHTARELDLQISALN*


    Help would be much appreciated. Thank you..
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481
    Code:
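    import sys

    # Wanted headers, one per line; strip whitespace and skip blank lines.
    with open('list.txt') as inf:
        keys = {line.strip() for line in inf if line.strip()}

    # Assumes each record is exactly two lines: a '>' header, then the sequence.
    with open('input.fasta') as inf:
        while True:
            K = inf.readline()
            if not K:            # end of file
                break
            if not K.startswith('>'):
                raise ValueError('expected a header line, got: %r' % K)
            key = K[1:].strip()  # drop the '>' and surrounding whitespace
            V = inf.readline()
            if key in keys:
                # Written to stdout; redirect to output.txt to save the result.
                sys.stdout.write(K)
                sys.stdout.write(V)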
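
    The code above expects every sequence to sit on a single line. If the real file wraps sequences over several lines, something like the following sketch might be closer to what is needed. The names filter_fasta and extract are made up for illustration, and it assumes each entry in list.txt matches the whole header text after the '>'.

```python
def filter_fasta(lines, wanted):
    """Yield the lines of every FASTA record whose header is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith('>'):
            # A new record starts here: decide once whether to keep it.
            keep = line[1:].strip() in wanted
        if keep:
            # Header and every following sequence line of a wanted record.
            yield line


def extract(fasta_path, list_path, out_path):
    """Copy the wanted records from fasta_path into out_path."""
    with open(list_path) as inf:
        # One wanted header per line; ignore blank lines.
        wanted = {line.strip() for line in inf if line.strip()}
    with open(fasta_path) as inf, open(out_path, 'w') as outf:
        outf.writelines(filter_fasta(inf, wanted))
```

    Calling extract('input.fasta', 'list.txt', 'output.txt') should then write the matching records straight to output.txt, keeping them in the order they appear in the fasta file rather than the order in the list.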

    Comments on this post

    • abhijit.bose agrees
    [code]Code tags[/code] are essential for python code and Makefiles!
