September 20th, 2013, 11:40 AM
Replace column data with info from external file
I have a table which is fairly large in columns and rows. I receive updates for certain columns from time to time. There is a row id that everything is based on.
external file = /User/folder1/folder2/file.csv
This file is comma delimited and has 3 columns
id, LastName, FirstName (sometimes it may have more than that and I would like to choose the column(s) to include in the update/replace)
table = mytable
This table has hundreds of columns and hundreds of thousands of rows
column(s) to update in this example = LastName, FirstName
Is there a way to load or use an external file and replace the column contents (LastName, FirstName) of mytable where the external files id = the mytable id?
September 20th, 2013, 12:11 PM
Insert all that data into a temporary table (normal or TEMPORARY, whatever) and then do a multi-table UPDATE query like
UPDATE mytable AS o
JOIN temp_table AS t ON o.id = t.id
SET o.FirstName = t.FirstName, o.LastName = t.LastName;
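Putting the whole sequence together, a sketch might look like the following (the table name, column names, and file path are taken from the question; the staging-table column types and the assumption that the CSV has a header row are mine):

```sql
-- Staging table matching the CSV layout described: id, LastName, FirstName
CREATE TEMPORARY TABLE temp_table (
    id INT PRIMARY KEY,
    LastName VARCHAR(100),
    FirstName VARCHAR(100)
);

-- Bulk-load the CSV in one statement (far less overhead than per-row INSERTs).
-- IGNORE 1 LINES assumes a header row; drop it if the file has none.
LOAD DATA LOCAL INFILE '/User/folder1/folder2/file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, LastName, FirstName);

-- Multi-table UPDATE: overwrite only the chosen columns where the ids match.
UPDATE mytable AS o
JOIN temp_table AS t ON o.id = t.id
SET o.LastName = t.LastName,
    o.FirstName = t.FirstName;
```

If the CSV sometimes carries extra columns, you would extend the staging table and the column list in LOAD DATA, and list only the columns you actually want to replace in the SET clause.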
September 20th, 2013, 12:27 PM
requinix - thanks for the reply
So is it your recommendation that loading the file into a temp table and then performing the update from that table is preferable versus using the file externally?
Just asking because the data is so large that it takes more than 10 hours to load the file. It might be quicker to reference the external file and do the update at the same time, or is that option not available in MySQL?
September 20th, 2013, 01:20 PM
If the file is so large it takes that long to import the data, it's also going to take a long time to process every row individually. An import and update is less work and you can get it started much sooner.
Are you concerned about concurrency problems?
September 20th, 2013, 02:03 PM
No, more interested in finding a way to speed up the process and thought that if I could eliminate downloading a temp file and then doing a replace that might speed things up. FYI: This process occurs to a remote location so everything is accomplished via ssh as well
September 20th, 2013, 02:42 PM
Continuing with the assumption that there's nothing else that needs consideration (such as the amount of time the database spends processing the data, or the time it takes to actually perform the changes), the first thing to do to cut down on the time is to limit the amount of traffic sent between machines. It'd be nice if you could SSH directly to the database server and do all the work there, but if not:
1. Download the file to the remote machine. No room for improvement here.
2. Import the file. The amount of data remains the same regardless, so the only thing you can really do is reduce the overhead costs. Individual INSERT or UPDATE statements for each row carry a lot of overhead, so, taking that to its logical conclusion, the best option is to INSERT/UPDATE everything at once (or as close to "at once" as you can). Since you can't UPDATE from a CSV directly, move the data into the database as a separate step.
3. Do the UPDATE. The database server does all the work and there's no traffic exchanged.
At some point you're going to spend more time trying to find a way to make the process quicker than you would have saved having made it quicker. Cutting down the time from 10 hours to 5 hours is great, but if it takes you a day to figure out how then you haven't gained anything.
Prepared statements may eliminate most of that per-statement overhead.
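If per-row updates ever are unavoidable, a server-side prepared statement parses the UPDATE once and reuses it for every row. A minimal sketch, reusing the table and columns from the question (the statement name and the sample values are placeholders):

```sql
-- Parse the UPDATE once; reuse it for each row of the file.
PREPARE upd FROM
    'UPDATE mytable SET LastName = ?, FirstName = ? WHERE id = ?';

-- Per row: bind the values and execute (sample values only).
SET @ln = 'Smith', @fn = 'Jane', @id = 42;
EXECUTE upd USING @ln, @fn, @id;

-- Release the statement when the loop is done.
DEALLOCATE PREPARE upd;
```

Even so, one LOAD DATA plus one multi-table UPDATE should beat hundreds of thousands of EXECUTEs, since the round-trip per row remains.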
Last edited by requinix; September 20th, 2013 at 02:45 PM.
September 21st, 2013, 12:34 PM
Loading the file to the server and then working from it there has proven to be a much faster way to do this.
Why spend a day to save 5 hours? That's how you learn. It then becomes what you do the next time, and improves whatever it is you do. Otherwise the wheel would never have been developed and we would not be where we are today doing whatever it is we do.