#1
Registered User

    Replace column data with info from external file


I have a table that is fairly large in both columns and rows. I receive updates for certain columns from time to time. There is a row id that everything is keyed on.

    For discussion:
    external file = /User/folder1/folder2/file.csv
This file is comma delimited and has 3 columns:
id, LastName, FirstName (sometimes it may have more than that, and I would like to choose the column(s) to include in the update/replace)

    Existing table:
    table = mytable
    This table has hundreds of columns and hundreds of thousands of rows
    column(s) to update in this example = LastName, FirstName

    Question:
Is there a way to load or use an external file and replace the column contents (LastName, FirstName) of mytable where the external file's id matches the mytable id?

    Thanks
#2
requinix
    Insert all that data into a temporary table (normal or TEMPORARY, whatever) and then do a multi-table UPDATE query like
    Code:
-- o = the existing big table, t = the staging table loaded from the CSV
-- (temp_import is a placeholder name for whatever you call the staging table)
UPDATE mytable AS o
JOIN temp_import AS t ON o.id = t.id
SET o.FirstName = t.FirstName,
    o.LastName = t.LastName;
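For the import step, a minimal sketch assuming MySQL's LOAD DATA INFILE; the staging table name (temp_import), the column types, and the header row are assumptions, and LOCAL requires local_infile to be enabled on both the client and the server:
Code:
-- Staging table matching the CSV layout (temp_import is a placeholder name)
CREATE TEMPORARY TABLE temp_import (
    id INT NOT NULL PRIMARY KEY,
    LastName VARCHAR(100),
    FirstName VARCHAR(100)
);

-- Bulk-load the CSV in a single statement
LOAD DATA LOCAL INFILE '/User/folder1/folder2/file.csv'
INTO TABLE temp_import
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES  -- assumes a header row; drop this if the file has none
(id, LastName, FirstName);
Note that a TEMPORARY table disappears when the session ends, so the load and the UPDATE above would need to run in the same connection; a normal table that you DROP afterwards works too, as the post says.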
#3
Registered User
    requinix - thanks for the reply

So is it your recommendation that loading the file into a temp table and then performing an update using that table is preferable versus using the file externally?

Just asking because the file is so large that it takes more than 10 hours to load. It might be quicker to reference the external file and do the update at the same time, or is that option not available in MySQL?
#4
requinix
    If the file is so large it takes that long to import the data, it's also going to take a long time to process every row individually. An import and update is less work and you can get it started much sooner.

    Are you concerned about concurrency problems?
#5
Registered User
    Originally Posted by requinix
    If the file is so large it takes that long to import the data, it's also going to take a long time to process every row individually. An import and update is less work and you can get it started much sooner.

    Are you concerned about concurrency problems?
No, I'm more interested in finding a way to speed up the process, and I thought that if I could eliminate downloading a temp file before doing the replace, that might speed things up. FYI: this process runs against a remote location, so everything is accomplished via ssh as well.
#6
requinix

Continuing with the assumption that there's nothing else that needs consideration (such as the amount of time the database spends processing data, or the time it takes to actually perform the changes):

The first thing to do to cut down on the time is to limit the amount of traffic sent between machines. It'd be nice if you could SSH directly to the database server and do all the work there, but if not:
    1. Download the file to the remote machine. No room for improvement here.
2. Import the file. The amount of data remains the same regardless of everything, so the only thing you can really do is reduce the overhead costs. Individual INSERT or UPDATE statements for each row could have a lot of overhead[1], so taking that to its logical conclusion, the best option would be to INSERT/UPDATE everything at once, or as close to "at once" as you can (see the sketch after this list). Since you can't UPDATE from a CSV directly to the database, move the data into the database as a separate step.
    3. Do the UPDATE. The database server does all the work and there's no traffic exchanged.
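To make the overhead point in step 2 concrete, here's a hypothetical contrast (the staging table and values are made up): batching many rows into a single INSERT statement pays the per-statement cost once instead of once per row.
Code:
-- Per-row: one statement, one round trip, one parse per row
INSERT INTO temp_import (id, LastName, FirstName) VALUES (1, 'Smith', 'Jane');
INSERT INTO temp_import (id, LastName, FirstName) VALUES (2, 'Jones', 'Bob');

-- Batched: many rows per statement, far less per-row overhead
INSERT INTO temp_import (id, LastName, FirstName) VALUES
    (1, 'Smith', 'Jane'),
    (2, 'Jones', 'Bob'),
    (3, 'Brown', 'Alice');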

    At some point you're going to spend more time trying to find a way to make the process quicker than you would have saved having made it quicker. Cutting down the time from 10 hours to 5 hours is great, but if it takes you a day to figure out how then you haven't gained anything.

    [1] Prepared statements may eliminate most of it.
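As an aside on footnote [1], a minimal sketch of per-row prepared statements using MySQL's own PREPARE/EXECUTE (names and values are placeholders): the statement is parsed once and re-executed for each row, which trims the parsing overhead but still costs one round trip per row, so the bulk import remains the faster route.
Code:
-- Parse the statement once
PREPARE upd FROM
    'UPDATE mytable SET LastName = ?, FirstName = ? WHERE id = ?';

-- Re-execute per row with new values
SET @ln = 'Smith', @fn = 'Jane', @id = 1;
EXECUTE upd USING @ln, @fn, @id;

DEALLOCATE PREPARE upd;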
    Last edited by requinix; September 20th, 2013 at 02:45 PM.
#7
Registered User
Originally Posted by requinix
At some point you're going to spend more time trying to find a way to make the process quicker than you would have saved having made it quicker. Cutting down the time from 10 hours to 5 hours is great, but if it takes you a day to figure out how then you haven't gained anything.
Loading the file to the server and then using it from there has shown itself to be a much faster way to do this.

Why spend a day to save 5 hours? That's how you learn. Then this becomes what you do the next time, and it improves whatever it is you do. Otherwise the wheel would never have been developed and we would not be where we are today doing whatever it is we do.
