#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2013
    Posts
    1
    Rep Power
    0

    How to quickly search over a large number of files?


    Hi all,

    I am new to Python.

    I have about 500 search queries and about 52,000 files, and I want to find every match for each of the 500 queries in those files.

    How should I approach this? The straightforward way seems to be to loop through the files and go line by line, comparing each line against all of the queries, but I expect that to take too long.

    Can someone give me a suggestion as to how to minimize the search time?

    Thanks!
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,856
    Rep Power
    481
    I can't think of any shortcuts.

    loop over files:
        loop over queries: computation(file, query)
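
The two loops above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the queries are treated as plain substrings, the files are text, and `search_files` is a made-up name, not something from the original post.

```python
def search_files(paths, queries):
    """Return {query: [(path, line_number, line), ...]} using substring tests."""
    matches = {q: [] for q in queries}
    for path in paths:                                  # loop over files
        # errors="replace" keeps going if a file has undecodable bytes
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                for q in queries:                       # loop over queries
                    if q in line:                       # the "computation"
                        matches[q].append((path, number, line.rstrip("\n")))
    return matches
```

Each file is opened and read only once; the inner loop over the queries works on a line that is already in memory.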
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,856
    Rep Power
    481
    Or if the files are small text files you could concatenate them into one large file. Might not work for databases or other binary files, depending on what you're looking for.
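
A minimal sketch of the concatenation idea, assuming small text files. Prefixing each line with its source path is my own addition, so that a match in the combined file can still be traced back; `concatenate` and `combined_path` are hypothetical names.

```python
def concatenate(paths, combined_path):
    """Copy every file into combined_path, one output line per input line,
    each prefixed with the path it came from and a tab."""
    with open(combined_path, "w") as out:
        for path in paths:
            with open(path, errors="replace") as handle:
                for line in handle:
                    text = line.rstrip("\n")
                    out.write(path + "\t" + text + "\n")
```

The combined file can then be searched in a single pass instead of opening tens of thousands of files one by one.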
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2013
    Location
    Pennsylvania, USA
    Posts
    35
    Rep Power
    2
    Why not call an external program such as grep to do the search for you, using Python's subprocess.Popen()?
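    Code:
    import subprocess

    # yourGrepArgsHere is a placeholder list: grep options, the pattern, then file names
    p = subprocess.Popen(['grep'] + yourGrepArgsHere,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         universal_newlines=True)  # return text rather than bytes
    results, error_msgs = p.communicate()  # wait for grep to exit and collect its output
    print(results)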
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,856
    Rep Power
    481
    If it were my project, I doubt Python would be involved. I'd probably use gawk with bash. An egrep pattern file (one query per line, passed to grep with -f) might be right for this one.
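
For anyone who wants to stay in Python, a rough analogue of the pattern-file idea is to compile all the queries into one regular expression so each line is scanned once rather than 500 times. This is a sketch under assumptions, not what grep does internally: the queries are taken as literal strings, and `build_matcher` and `queries_in_line` are made-up names. Because the regex finds non-overlapping matches, a query that overlaps another match on the same line can be missed.

```python
import re

def build_matcher(queries):
    """Compile all literal queries into a single alternation, longest first."""
    ordered = sorted(set(queries), key=len, reverse=True)
    return re.compile("|".join(re.escape(q) for q in ordered))

def queries_in_line(matcher, line):
    """Return the set of queries found in line (non-overlapping matches only)."""
    return {m.group(0) for m in matcher.finditer(line)}
```

Usage would be along the lines of building the matcher once from the query list, then calling `queries_in_line(matcher, line)` for every line of every file.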
