#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0

    Help needed in feature extraction from two input files


    Hi all,

    I have two input files. The first file (file1.txt) contains entries in the following tab-delimited format:

    gene1 or1|1234 or3|56 or4|793
    gene4 or2|347
    gene5 or3|23 or7|123456789

    .......
    ..
    The second file (file2.txt) contains the sequence for each identifier from the first file, under a header line starting with ">", such as:

    >or1|1234
    ATCGGATTCAGG
    >or2|347
    GAACCTATCGGGGGGGGAATTTA
    TATATTTTA
    >or3|56
    ATCGGAGATATAACCAATC
    >or3|23
    AAAATTAACAAGAGAATAGACAAAAAAA
    >or4|793
    ATCTCTCTCCTCTCTCTCTAAAAA
    >or7|123456789
    ACGTGTGTACCCCC

    ....
    ..
    From these two files, I have to extract the entries from file2.txt whose headers match the identifiers on each row of file1.txt, and name each output file after the first column of that row. For example, in the above case, 3 output files will be generated.

    The first output file would be named "gene1.txt" and would contain:

    >or1|1234
    ATCGGATTCAGG
    >or3|56
    ATCGGAGATATAACCAATC
    >or4|793
    ATCTCTCTCCTCTCTCTCTAAAAA
    The second output file would be named "gene4.txt" and would contain:

    >or2|347
    GAACCTATCGGGGGGGGAATTTATATATTTTA
    The third output file would be named "gene5.txt" and would contain:

    >or3|23
    AAAATTAACAAGAGAATAGACAAAAAAA
    >or7|123456789
    ACGTGTGTACCCCC
    Any help in solving the problem is highly appreciated. Thanks in advance.
  2. #2
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    40
    Rep Power
    10
    I don't know if this is the best way of doing it, but this would be my method:

    1) Read the data from the file. If the files are very long, then perhaps loop this process and do it in small chunks.
    i.e.
    Code:
    file = open("file_test.txt", "r")
    file_data = file.read()
    2) Then split each line of the file.
    i.e.
    Code:
    line_splits = file_data.split('\n')
    3) Then for each element of line_splits, you will have to split again on the delimiter. Your file is tab-delimited, so split on tabs (or on any whitespace) rather than on single spaces. You know that the first element of every line is of the form geneXX, so you can easily turn that into a file name and open a new file with that name. The remaining elements have to be matched against the header lines of the other file (probably using splits again).
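    The matching in step 3 is easier if the second file is first loaded into a dictionary keyed by header. The following is only a minimal sketch, not code from this thread: the function name `read_sequences` is hypothetical, and it assumes every header line starts with ">" and that a sequence wrapped over several lines should be joined into one string, as in the expected gene4.txt output.

```python
def read_sequences(lines):
    """Map each header (without the leading '>') to its joined sequence.

    Hypothetical helper: `lines` is any iterable of text lines, such as an
    open file object for file2.txt.
    """
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # Start a new entry for this header
            header = line[1:]
            sequences[header] = ''
        elif header is not None:
            # Wrapped sequence lines are appended to the current entry
            sequences[header] += line
    return sequences


# In-memory sample taken from the file2.txt excerpt in the question
sample = [
    ">or2|347",
    "GAACCTATCGGGGGGGGAATTTA",
    "TATATTTTA",
    ">or3|56",
    "ATCGGAGATATAACCAATC",
]
seqs = read_sequences(sample)
print(seqs["or2|347"])
```

    With a real file, something like `read_sequences(open("file2.txt"))` should work, since a file object yields its lines one at a time.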

    So basically if you add all that together you will probably have something along the lines of:
    Code:
    with open("file_test.txt", "r") as infile:
        file_data = infile.read()
    # Split the file into lines; splitlines() leaves no empty trailing entry
    data_splits = file_data.splitlines()
    for line in data_splits:
        # For each line, split again on whitespace (covers tabs and spaces)
        line_splits = line.split()
        if not line_splits:
            continue  # skip blank lines
        # Now you know the first element in line_splits is the new file name
        out_file = open(line_splits[0] + '.txt', 'w')
        for header in line_splits[1:]:
            # Now match each element of this list with your other file and add them all
            # to out_file.
            pass  # placeholder so the loop body is valid Python
        out_file.close()
    That should give you a rough idea of one way of skinning this cat. Go do the rest.
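    For the remaining write step, one possible sketch is shown here. It is an illustration under assumptions rather than a definitive solution: `write_gene_files` is a hypothetical name, the sequences are assumed to be already available as a dictionary mapping each header (without ">") to its sequence, and headers that are missing from that dictionary are simply skipped.

```python
import os
import tempfile


def write_gene_files(mapping_lines, sequences, out_dir):
    """Write one '<gene>.txt' file per mapping line; return the paths written.

    Hypothetical helper: `sequences` maps a header such as 'or1|1234'
    (no leading '>') to its sequence string.
    """
    written = []
    for line in mapping_lines:
        fields = line.split()  # splits on tabs or spaces
        if not fields:
            continue  # skip blank lines
        gene, headers = fields[0], fields[1:]
        path = os.path.join(out_dir, gene + '.txt')
        with open(path, 'w') as out_file:
            for header in headers:
                # Assumption: headers missing from file2 are skipped silently
                if header in sequences:
                    out_file.write('>' + header + '\n')
                    out_file.write(sequences[header] + '\n')
        written.append(path)
    return written


# Small in-memory stand-ins for file1.txt and the parsed file2.txt
mapping = ["gene4\tor2|347", "gene5\tor3|23\tor7|123456789"]
sequences = {
    "or2|347": "GAACCTATCGGGGGGGGAATTTATATATTTTA",
    "or3|23": "AAAATTAACAAGAGAATAGACAAAAAAA",
    "or7|123456789": "ACGTGTGTACCCCC",
}
out_dir = tempfile.mkdtemp()  # temporary directory so nothing is overwritten
paths = write_gene_files(mapping, sequences, out_dir)
with open(paths[1]) as handle:
    print(handle.read())
```

    The example writes into a temporary directory; passing "." as `out_dir` would put gene4.txt and gene5.txt in the current directory instead.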
