November 2nd, 2012, 08:53 PM
MySQL server setup
I'm relatively new to SQL and server setups.
I am trying to configure a mysql database for tracking data files from various research projects. I'm testing it with just one project and it currently contains references to 30 million files. The table contains 9 fields including datetime, path and filename and a few other fields about the equipment and site the datafile was collected at. Mostly varchar fields.
The table is indexed by datetime as virtually any query I would care to make on the database would include a timeframe of interest. It's relatively quick for single query but I'm trying to run an application to add new files to the database.
Due to the rate at which files get moved around on the data server I am scanning it in its entirety every time the application is run. The trouble is I need to compare each file found on the data server with the table in the database to see if it exists already. If it does, skip file, if not, insert new record.
The problem is, it has to query the database for each file it finds (about 30 million files). When it runs the queries sequentially it takes roughly 15 minutes to update the database. I figured if I split the queries up by year it would be faster. Run 10 queries at a time, one for each year. Well this slowed things a ton. After 45 minutes only 3 million files had been processed.
My question is, why. The table is innodb, each query is run in a separate connection to the database. The table is partitioned by year so in theory each query should only need to look at the particular year partition so there shouldn't be much contention between queries. I ran 'explain partitions' on the query and it does in fact only look at the partition containing the year of interest. I'm just not sure why mysql isn't scaling well with concurrent queries.
The mysql server is running on a 16 core server with over 400GB disk space (across 20+ drives in raid 6) and 48 GB memory.
The script is running on the data server with 16 cores, 20 disks in raid 6 + 64 GB memory. The network has a 40 MB/s transfer rate between servers.
The mysql server is the default mysql 5.1.61 setup. The only setting I've changed is setting innodb_buffer_size = 36G.
Anyone have any thoughts about why this is so slow and how I can speed things up?
November 3rd, 2012, 12:26 AM
Found out the bottleneck was in the threading code and not mysql. My python script was using the threading class originally. I switched to python's multiprocessing library and low and behold my program went from processing 2000 files/second to 17,000 files/second on 10 processes (nearly 10 times faster as expected).
If anyone has any configuration suggestions to add about server config I would be happy to hear it in case theres any other fine tuning that can be performed.