#1 Raabot (Registered User)

    Python script for chunking CSV data


    Hi all,

    I'm very new to Python and have been trying to develop a script on my Raspberry Pi that takes a CSV file containing potentially millions of sensor records, splits the file into chunks of rows and inserts them into a sqlite3 database. This is simply to improve the SQL performance by adding the data in transactions of, say, 10,000 rows at a time instead of writing and locking the file after each row.

    So far I have:
    Code:
    import csv, sqlite3, time
    
    def chunks(data, rows=10000):
        for i in range (0, len(data), rows):
                yield data[i:i+rows]
    
    if __name__ == "__main__":
    
        t = time.time()
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS sensor;")
    cur.execute("CREATE TABLE sensor(key INT, reading REAL);")
    
    filename = 'dummy.csv'
    reader = csv.reader(open(filename,"r"))
    divdata = chunks(reader)
    
    for chunk in divdata:
        cur.execute('BEGIN TRANSACTION')
    
        for col1, col2 in chunk:
                cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
    
        con.execute('COMMIT')
    However, I get a TypeError on line 4 saying the csv.reader has no len(). I found that this was due to the csv reader object not being subscriptable, and was looking at approaches like http://stackoverflow.com/questions/4956984/how-do-you-split-a-csv-file-into-evenly-sized-chunks-in-python to solve that problem. Unfortunately I was getting invalid syntax errors with the chunking code and getting nowhere.

    My question really is: is there an elegant way to chunk a CSV file (where the number of rows is not known in advance) into memory for importing into SQLite, without resorting to the likes of pandas/Django etc. for a bloated solution? I was thinking itertools.islice may be the most relevant option but am unsure how to implement it.
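    Something along these lines is roughly what I had in mind with islice (completely untested on my part, and assuming every row really does have exactly two columns):
    Code:
    import csv, sqlite3
    from itertools import islice
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS sensor;")
    cur.execute("CREATE TABLE sensor(key INT, reading REAL);")
    
    with open('dummy.csv', 'r') as f:
        reader = csv.reader(f)
        while True:
            ## islice takes the next 10,000 rows (or fewer at the end)
            ## without ever needing len() on the reader
            chunk = list(islice(reader, 10000))
            if not chunk:
                break
            for col1, col2 in chunk:
                cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
            con.commit()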

    Any help appreciated!
    Last edited by Raabot; August 7th, 2013 at 10:15 AM. Reason: +Details
#2 rrashkin (Contributing User)
    Perhaps instead of using the csv module (which I don't think you're really doing much with anyway), just read the file into a list of lines with .readlines(). If necessary you can .strip('\n') to get rid of the linefeeds and .split(',') to make it a list of lists.
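    Roughly like this, off the top of my head (untested, using the file name and the two columns from your script):
    Code:
    f = open('dummy.csv', 'r')
    lines = f.readlines()      ## whole file in memory as a list of '1,87.4855606\n' strings
    f.close()
    
    ## strip the linefeeds and split on ',' to get a list of lists
    rows = [line.strip('\n').split(',') for line in lines]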
#3 dwblas (Contributing User)
    My question really is, is there an elegant way to chunk CSV files
    Yes, it is done all the time with log files. Do you have some delimiter that decides where the data breaks, i.e. this row ends and the next row begins, or do you just want to read each record (each record being a row) and then update the database with some of the columns from each record? Also, please provide some sample data.
    Last edited by dwblas; August 7th, 2013 at 11:39 AM.
#4 Raabot (Registered User)
    @rrashkin
    Doesn't .readlines() load the whole file into memory? Do I then use .split() to process the transaction in sqlite in a for loop?

    @dwblas

    There are ultimately two columns in the CSV file, delimited by ','. One column will be the key, using a timestamp of when the measurement was recorded, and the other the value the sensor is recording. The dummy CSV file I'm using at the minute looks like:
    1,87.4855606
    2,33.74082303
    3,97.05385477
    4,8.699844458
    5,73.04870788
    6,51.97528321
    7,47.27492163
    8,79.17062076
    9,1.113944584
    10,9.896821158
    The first column just represents an auto-incrementing field which will be replaced by the timestamp in due course. Each row represents a record. I will be recording at 10+ Hz into a CSV file and I want to take this data and place it in a DB at hourly intervals using a cronjob to run the script. The sheer number of records means I'll have 36,000+ records per hour, which is why I'm aiming to split the records into chunks and make the database write operation much more efficient with transactions of about 10,000 records each.
#5 rrashkin (Contributing User)
    Originally Posted by Raabot
    @rrashkin
    Doesn't .readlines() load the whole file into memory? Do I then use .split() to process the transaction in sqlite in a for loop?
    Yes, as a list of lines. Then as you chunk you can split with a list comprehension: [i.split(',') for i in biglist[j:j+42]]
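    For example (untested, with a chunk size of 10,000 rather than the 42 above, and reusing the test.db connection and sensor table from your script):
    Code:
    import sqlite3
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    
    biglist = open('dummy.csv', 'r').readlines()
    chunk = 10000
    
    for j in range(0, len(biglist), chunk):
        ## slice one chunk of lines and split each line into [key, reading]
        rows = [i.strip('\n').split(',') for i in biglist[j:j+chunk]]
        for col1, col2 in rows:
            cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
        con.commit()   ## one transaction per chunk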
#6 dwblas (Contributing User)
    I don't use the csv module because IMHO it is just as easy to read and split the lines yourself. The following should provide a starting place.
    Code:
    test_data = """1,87.4855606
     2,33.74082303
     3,97.05385477
     4,8.699844458
     5,73.04870788
     6,51.97528321
     7,47.27492163
     8,79.17062076
     9,1.113944584
     10,9.896821158
     11,1.0158"""
    
    test_list = test_data.split("\n")
    output_list=[]
    ctr = 0
    
    ##with open("file_name", "r") as fp:
    ##    for rec in fp:
    
    ## replace the above 2 lines with test data
    ## reading 5 records at a time for testing
    for rec in test_list:
        rec_as_list = rec.strip().split(",")
        ## test for bad data
        if 2 == len(rec_as_list):
            ## strip to get rid of any white space
            output_list.append([rec_as_list[0].strip(), rec_as_list[1].strip()])
            ctr += 1
            if 5 == ctr:
                ## print or write output_list to file
                print output_list, "\n"
                ctr = 0
                output_list=[]
        else:
            print "*****", rec
    
    ## final group
    print output_list
#7 zxq9 (Contributing User)
    Just out of curiosity...

    Do you have control over the code that is writing the CSV file? Can you make it change the name of the CSV file it's writing to every hour or so?

    Also... Are there any natural partitions to the data? As in, do you compare daily, hourly, monthly, etc.?

    And finally... how long are you going to be collecting this data? SQLite may not be an optimal solution for processing interesting answers over hundreds of millions of rows (but mere retrieval for external processing is probably OK).

    I'm leaning in the direction of more traditional log-style handling. Rotate the input file (easiest if you control the process writing it, but simply moving it works fine as well) and periodically ".import" it into the SQLite db. This happens to be much more straightforward from shell/cron than from Python, imo, since the tools needed are already written (instead of needing to write a new one as above).
#8 Raabot (Registered User)
    @dwblas

    Thanks for the kick-start. I've mashed around your code, no doubt horribly, but it now works on a 1,000,000 record dummy file using the code below. My main worry is that I'm not entirely sure whether the data is actually being chunked into batches of 10,000 rows with each batch processed as a single transaction. The very nature of Python's API means that transactions are automatically begun when performing SELECT/INSERT etc., so when I used the "BEGIN TRANSACTION" statement with "COMMIT" I got the following error:

    Code:
    OperationalError: cannot commit - no transaction is active
    I therefore used the normal insertion method, as seen below, presuming that a transaction was occurring for every 10,000 records. However, I ran a more basic script which doesn't chunk the data at all and it performed the operation in 338 seconds, compared to 392 seconds for the supposedly efficient transaction version below. The whole area of transactions/autocommit/isolation in sqlite3 with Python seems muddled and, as an experienced user on stackoverflow described it, broken.
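    From what I've read, setting isolation_level=None puts the connection in autocommit mode so the module stops opening transactions behind your back, and the explicit BEGIN/COMMIT version would then look roughly like this (an untested sketch on my part):
    Code:
    import sqlite3
    
    ## isolation_level=None = autocommit: sqlite3 issues no implicit BEGINs,
    ## so the explicit BEGIN/COMMIT below control the whole transaction
    con = sqlite3.connect('test.db', isolation_level=None)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS sensor(key INT, reading REAL);")
    
    output_list = [('1', '87.4855606'), ('2', '33.74082303')]  ## stand-in for one 10,000 record chunk
    
    cur.execute('BEGIN')
    for col1, col2 in output_list:
        cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
    cur.execute('COMMIT')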


    Code:
    import sqlite3, time
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    cur.execute("DROP TABLE IF EXISTS sensor;")
    cur.execute("CREATE TABLE sensor(key INT, reading REAL);")
    
    output_list=[]
    ctr=0
    
    t=time.time()
    
    with open("dummybig.csv", "r") as fp:
        for rec in fp:
            rec_as_list = rec.strip().split(",")
            ## test for bad data
            if 2 == len(rec_as_list):
                ## strip to get rid of any white space
                output_list.append([rec_as_list[0].strip(), rec_as_list[1].strip()])
                ctr += 1
            if 10000 == ctr:
                ## print or write output_list to file
                for col1, col2 in output_list:
                    cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
                con.commit()
                ctr = 0
                output_list=[]
            
    ## final group
    for col1, col2 in output_list:
        cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
    con.commit()
    
    print "\n Time Taken: %.3f sec" % (time.time()-t)
#9 Raabot (Registered User)
    Originally Posted by zxq9
    Just out of curiosity...

    Do you have control over the code that is writing the CSV file? Can you make it change the name of the CSV file it's writing to every hour or so?

    Are there any natural partitions to the data?

    And finally... how long are you going to be collecting this data?
    Yeah, the code that writes the CSV file can be controlled. It'll be something simple that pulls data from the CO2 sensor over a serial port on the Pi.

    No natural comparisons per se. The reason for putting it into a DB in the first place is simply to make the format more universal, without any techie having to worry too much about how the data is formatted.

    Collecting data for years, in remote conditions so the data can be easily accessed over an IP address without the need for someone to go and fetch it! Going to have to use a USB HD to avoid wear on the SD card.

    If I can't sort the whole transaction thing and see its benefits in real terms, I'll probably just use .import and make sure the CSV file is formatted perfectly so it runs smoothly. Want to try my current way first though!
#10 dwblas (Contributing User)
    The sqlite3 cursor does have an executemany, which should be faster, but SQLite is still a "lite" database, so you would have to change to PostgreSQL or MariaDB to improve much beyond that.
    Code:
    ## I think, but am not sure that you use a list of tuples, so
            if 2 == len(rec_as_list):
                ## strip to get rid of any white space
                output_list.append((rec_as_list[0].strip(), rec_as_list[1].strip()))  ## changed to tuple
                ctr += 1
    
            if 10000 == ctr:
                ## print or write output_list to file
                cur.executemany('INSERT INTO sensor (key, reading) VALUES (?, ?)', (output_list))
                con.commit() 
    ## etc for the final update
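    Untested, but combined with the itertools.islice you mentioned in your first post, the whole loop collapses to something like this (note there is no bad-data check here; put the len() == 2 test back in if your real files need it):
    Code:
    import sqlite3
    from itertools import islice
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS sensor(key INT, reading REAL);")
    
    with open('dummybig.csv', 'r') as fp:
        while True:
            ## read up to 10,000 lines and turn each into a (key, reading) tuple
            batch = [tuple(rec.strip().split(',')) for rec in islice(fp, 10000)]
            if not batch:
                break
            cur.executemany('INSERT INTO sensor (key, reading) VALUES (?, ?)', batch)
            con.commit()   ## one transaction per 10,000 record batch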
#11 zxq9 (Contributing User)
    Originally Posted by Raabot
    Yeah, the code that writes the CSV file can be controlled. It'll be something simple that pulls data from the CO2 sensor over a serial port on the Pi.
    I would then recommend falling back to the traditional logging practice of writing to a timestamped file per [period]. In the case of multiple sites I'd add the site name to the file, something like "co_data-foobarnia-20130809.log". This will drastically simplify three things:
    • Maintenance of raw backup data; it's already labeled and self-sorting.
    • Tracking of what data has already been loaded. The db can grow another table called "in_file" (or whatever) that tracks what raw data files have been successfully loaded.
    • The amount and nature of the code necessary to perform import, backup or whatever other operations you want.

    It also comes with the benefit of keeping your raw data around in a more useful form than one huge file (which you'd eventually probably ditch, leaving you with the hope that your SQLite db never, ever gets corrupted...). Don't discount those raw files. There is this whole language called AWK that can do amazing things with text input -- and though you may not have any comparisons in mind just now, this data will eventually be subject to analysis.
    If I can't sort the whole transaction thing and see its benefits in real terms, I'll probably just use .import and make sure the CSV file is formatted perfectly so it runs smoothly. Want to try my current way first though!
    Above I mentioned adding another table to the schema, one that keeps track of which raw data files have been loaded and which haven't. If the import of a file runs in the same transaction as the insert into the "in_file" table, then either both succeed or neither does, so you are guaranteed an accurate account of which data is present. If data from multiple sites may need to be merged in the future, relying on an auto-incrementing column or a timestamp is insufficient, so you'll want to know the origin and completeness of your data collection.
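    A rough Python sketch of that idea (untested; load_file and the column names are placeholders of my own, and it assumes the sensor table from earlier in the thread already exists):
    Code:
    import sqlite3
    
    con = sqlite3.connect('test.db')
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS in_file (name TEXT PRIMARY KEY, loaded TEXT)")
    
    def load_file(path):
        ## "with con:" wraps everything in a single transaction: either the whole
        ## file and its in_file record go in together, or neither does
        with con:
            with open(path, 'r') as fp:
                rows = (tuple(line.strip().split(',')) for line in fp)
                cur.executemany('INSERT INTO sensor (key, reading) VALUES (?, ?)', rows)
            ## name is the PRIMARY KEY, so trying to load the same file twice raises an error
            cur.execute("INSERT INTO in_file VALUES (?, datetime('now'))", (path,))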

    Anyway, sounds like fun. Hope the environment doesn't eat your systems too quickly, though!
