#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    2
    Rep Power
    0

    Issue script output/percentage


    Hey everybody

    I am in need of advice. I have two text files, each listing transcript names and their corresponding lengths. They look like this:

    File1:
    A 256
    B 456
    File2:
    A 245
    B 435
    I want to compare the lengths of the transcripts and check whether the length in file 2 is at least 90% of the length in file 1 for the corresponding transcript name (I hope I am clear!).

    I wrote the following script, but the output file only gives me one transcript instead of 100:

    Code:
    from collections import defaultdict
    import numpy as np

    ercctranscript_size = {}
    for line in open('ERCC.txt'):
        columns = line.strip().split()
        transcript = columns[0]
        size = columns[1]
        ercctranscript_size[transcript] = int(size)
        size = ercctranscript_size[transcript]

    unknown_transcript = open('Not_sequenced_ERCC_transcript.txt', 'w')
    blast_file = open('blast.txt')
    out_file = open('out.txt', 'w')

    blast_transcript = {}
    for line in blast_file:
        columns = line.strip().split()
        blasttranscript = columns[0].strip()
        blastsize = columns[1].strip()
        blast_transcript[blasttranscript] = int(blastsize)
        blastsize = blast_transcript[blasttranscript]

        if transcript not in blast_transcript:
            unknown_transcript.write('{0}\n'.format(transcript))
        else:
            if blastsize == size:
                print >> out_file, transcript, True
            else:
                print >> out_file, transcript, False
    1) Does anyone see why I am not getting the entire list as an output with false or true?
    2) How do I require that blastsize be > 0.9 x size?

    Thanks a lot for your help
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Location
    Usually Japan when not on contract
    Posts
    240
    Rep Power
    12
    The following is a version free of interesting tricks, written for Python 2.6 (higher Python versions are slightly cleaner).

    Embellishing this with shortcuts, error handling and argument substitution is strongly recommended. This should give you the basic idea of a working approach:
    python Code:
    from __future__ import division
     
    f = open('blast.txt', 'r')
    z = [q[:-1] for q in f.readlines()]
    blast = dict(q.split(' ') for q in z)
    for k in blast:
      blast[k] = int(blast[k])
    f.close()
     
    # Neat-o! Verbatim redundancy... perhaps this should be a function...
    f = open('out.txt', 'r')
    z = [q[:-1] for q in f.readlines()]
    out = dict(q.split(' ') for q in z)
    for k in out:
      out[k] = int(out[k])
    f.close()
     
    s = list(set(blast.keys()) & set(out.keys()))
    validation = {}
    for x in s:
      validation[x] = blast[x] / out[x] * 100
    for x in validation:
      print x, validation[x]

    When blast.txt is:
    Code:
    A 256
    B 456
    C 350
    D 700
    E 140
    and out.txt is
    Code:
    A 245
    B 435
    C 201
    D 1500
    F 200
    the output is
    Code:
    A 104.489795918
    C 174.129353234
    B 104.827586207
    D 46.6666666667
    Caveats are:
    1- I chopped the newlines off each line because I know they will be there. There are shortcuts for this that are platform agnostic, I just didn't use them.
    2- Any unexpected formatting will make it barf. Oh well, see recommendations above.
    3- There is a reporting language designed specifically to solve this kind of problem called awk. If you have problems like this often I recommend looking into awk, because its design lends itself naturally to these kinds of problems.
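
    The redundancy and newline caveats above, plus the original 90% question, could be handled along these lines. This is only a sketch: `parse_sizes` and `meets_threshold` are hypothetical names, not part of the script above, and it assumes the first file holds the reference lengths and that "at least 90%" means `>=`. Calling `split()` with no argument discards the trailing newline on any platform, and lines without exactly two fields are skipped rather than raising.

```python
from __future__ import division, print_function


def parse_sizes(lines):
    """Build {name: length} from an iterable of 'name length' lines.

    Hypothetical helper: accepts an open file or a list of strings.
    """
    sizes = {}
    for line in lines:
        fields = line.split()  # whitespace split also discards '\n' or '\r\n'
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        sizes[fields[0]] = int(fields[1])
    return sizes


def meets_threshold(reference, observed, threshold=0.9):
    """Map each name found in both dicts to True when
    observed length >= threshold * reference length."""
    shared = set(reference) & set(observed)
    return dict((name, observed[name] >= threshold * reference[name])
                for name in shared)


if __name__ == '__main__':
    # Inline sample data; with real files, pass open('file1.txt') instead.
    file1 = parse_sizes(['A 256\n', 'B 456\n', 'C 350\n'])
    file2 = parse_sizes(['A 245\n', 'B 435\n', 'C 201\n'])
    for name, ok in sorted(meets_threshold(file1, file2).items()):
        print(name, ok)
```

    Because `parse_sizes` takes any iterable of lines, the same function serves both input files, which removes the copy-pasted block.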
  4. #3
  5. Commie Mutant Traitor
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Jun 2004
    Location
    Alpharetta, GA
    Posts
    1,806
    Rep Power
    1570
    As a newcomer, you need to be made aware that the forum software does not by default retain indentation. This is especially important with Python code, as indentation is a significant part of the program itself. Therefore, you need to put code samples in between [code] tags, like so:

    [code]
    code goes here.
    [/code]

    You can do this automatically with the '#' button at the top of the editing window, or the highlight marker button right next to it.

    As a courtesy to you as a new member, I have re-posted your code with [highlight] tags; you can view and copy the plain source by clicking on the double chevron button at the top right of the viewing window.

    Python Code:
    from collections import defaultdict
    import numpy as np

    ercctranscript_size = {}
    for line in open('ERCC.txt'):
        columns = line.strip().split()
        transcript = columns[0]
        size = columns[1]
        ercctranscript_size[transcript] = int(size)
        size = ercctranscript_size[transcript]

    unknown_transcript = open('Not_sequenced_ERCC_transcript.txt', 'w')
    blast_file = open('blast.txt')
    out_file = open('out.txt', 'w')

    blast_transcript = {}
    for line in blast_file:
        columns = line.strip().split()
        blasttranscript = columns[0].strip()
        blastsize = columns[1].strip()
        blast_transcript[blasttranscript] = int(blastsize)
        blastsize = blast_transcript[blasttranscript]

        if transcript not in blast_transcript:
            unknown_transcript.write('{0}\n'.format(transcript))
        else:
            if blastsize == size:
                print >> out_file, transcript, True
            else:
                print >> out_file, transcript, False
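
    Reading the script as posted, a likely cause of the single-transcript output is that `transcript` and `size` still hold the values from the last line of ERCC.txt once the first loop has finished, so every comparison in the second loop reuses that one name. A minimal sketch of a restructured comparison follows. It assumes both dictionaries are filled completely first and that "at least 90%" means `>=`; `compare_transcripts` is a hypothetical name, and it returns lists instead of writing files so the logic is easy to check.

```python
from __future__ import print_function


def compare_transcripts(ercc_sizes, blast_sizes, threshold=0.9):
    """Return (results, missing).

    results: list of (transcript, bool) pairs for transcripts in both dicts;
             the bool is True when the blast length reaches
             threshold * ERCC length.
    missing: ERCC transcripts that have no entry in blast_sizes.
    """
    results = []
    missing = []
    # Loop over every ERCC transcript, not over a leftover loop variable.
    for transcript in sorted(ercc_sizes):
        if transcript not in blast_sizes:
            missing.append(transcript)
            continue
        long_enough = blast_sizes[transcript] >= threshold * ercc_sizes[transcript]
        results.append((transcript, long_enough))
    return results, missing


if __name__ == '__main__':
    # Made-up sample values for illustration only.
    ercc = {'A': 256, 'B': 456, 'E': 140}
    blast = {'A': 245, 'B': 300}
    results, missing = compare_transcripts(ercc, blast)
    for transcript, long_enough in results:
        print(transcript, long_enough)
    for transcript in missing:
        print(transcript, 'not found')
```

    The two returned lists correspond to what the original script writes to out.txt and Not_sequenced_ERCC_transcript.txt respectively.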
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    2
    Rep Power
    0
    Thanks a lot for explaining that to me. I have edited my post to include the code correctly!
