#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    11
    Rep Power
    0

    Making PostgreSQL Insert Performance Reasonable


    Hi. We're looking for a silver bullet or two that would make us take PostgreSQL more seriously.

We're running two different ETL products loading 30 million rows from SQL Server to PostgreSQL. We've tried different drivers, unlogged tables, different memory settings, removing indexes, using the same server, using separate servers, high commit counts in some of the connectors, etc., but aren't even coming close to the one hour in which SSIS can load 30 million rows from one SQL Server db to another. Our machine is a physical machine with tons of SAN and 36 GB of RAM.

The best time we've seen so far loading PostgreSQL is 12 hours. The table's rows are a mix of integer, date, money and boolean columns. I believe my peer is leaving the PK defined when he runs these tests. It is type serial, but at the moment we are mapping an input SQL Server PK to this column.

I can provide a lot more info, but I suspect that with us being new there are one or two major settings we don't even know about. And we don't mind reading documentation, but we'd like to justify that kind of thing by first seeing this product get into the ballpark.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,777
    Rep Power
    348
    How do you load the data?

    The fastest way to load data is using the COPY command (with the input file located on the server).

Did you try "fsync = off"? This should be turned back on once you have loaded the data; otherwise you leave the database open to data corruption if your server is not shut down cleanly.

Btw: "high commit counts" will usually slow things down in PostgreSQL. Try committing only once.
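For illustration, a minimal sketch of what such a load could look like (table and file names here are invented; the file path is as seen by the server):

```sql
-- Single transaction, one commit at the end.
BEGIN;

-- Hypothetical target table; drop or disable indexes beforehand if possible.
COPY target_table FROM '/data/export.txt';  -- tab-delimited text file on the server

COMMIT;
```

Note that "fsync = off" itself goes into postgresql.conf (followed by a server reload), not into the SQL script.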
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    11
    Rep Power
    0
    Thx Shammat.

Each of the products we're looking at is an ETL product. One uses Talend as an OEM. The other is SSIS from Microsoft.

The Talend-based product came with its own PostgreSQL driver/connector, but that is all we know about it, and we don't know if it understands "bulk insert".

In SSIS, we've tried Devart's and CozyRoc's PostgreSQL connectors/components as what is called a "destination" component, and an off-the-shelf MS connector in the "source" component for our 30 million row SQL Server table.

We draw an arrow between the two and the ETL products know what to do from there. Some need a little help with data type conversions. For those cases we insert what are called "transform" components between the data source and the data destination.

If we use the COPY command, can the source be a SQL Server table? What formats can the source be in when using the COPY command? It looks like csv and txt are the choices. I'm assuming we can issue that command from the pgAdmin III front end.

    We'll look up and try the fsync thing. It looks like finding the instructions for issuing this command is going to be an adventure.

I misspoke on the commit count. It's really a higher number of rows being committed at a time than the default. It's a bit unclear to us at this time whether we are getting (or can get) what is called "bulk insert" behavior with these 3rd-party connectors. That's always been something we just checked in SQL Server destination components.

    I should have mentioned we're running PG 9.2 on Windows.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,777
    Rep Power
    348
    Originally Posted by db042188
We draw an arrow between the two and the ETL products know what to do from there. Some need a little help with data type conversions. For those cases we insert what are called "transform" components between the data source and the data destination.
Well, apparently those tools aren't using the native "bulk load" operation, which is the COPY command. I can load a ~1GB text file with ~35 million rows using the COPY command in about 5 minutes on a virtualized server (so the IO performance is not really that good).

If we use the COPY command, can the source be a SQL Server table? What formats can the source be in when using the COPY command? It looks like csv and txt are the choices.
COPY (to load data) only reads from files (plain text or CSV format), not directly from another database.

    I'm assuming we can issue that command from the PG ADMIN III front end.
    No idea. I don't use pgAdmin, but I would very much assume so. Especially as the file is read from the server by the PostgreSQL server process (unless COPY .. FROM stdin is used).
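For example, psql's \copy meta-command wraps COPY ... FROM STDIN, so the input file can sit on the client machine (table and path invented):

```
\copy target_table FROM 'C:\exports\export.txt'
```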


    We'll look up and try the fsync thing. It looks like finding the instructions for issuing this command is going to be an adventure.
    Why? It's all documented in the manual: http://www.postgresql.org/docs/current/static/runtime-config-wal.html

    There is a dedicated chapter on bulk loading in the manual: http://www.postgresql.org/docs/current/interactive/populate.html

    You also want to go through this page: http://wiki.postgresql.org/wiki/Performance_Optimization
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    11
    Rep Power
    0
Thx Shammat. We're taking PostgreSQL much more seriously now. We created a text file from SQL Server in 12 minutes, and with a COPY command in pgAdmin III loaded it into PostgreSQL in 25 minutes. The table consumes 8 GB of data space. So in 37 minutes we accomplished what takes a similar SQL Server-to-SQL Server task 1 hour. Very impressive. And that is without the fsync item figured out.
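As an aside, the export step can target COPY's default text format directly: tab-delimited columns, \N for NULL, and backslash-escaped control characters. A minimal Python sketch (table contents and file name are made up):

```python
def copy_escape(value):
    """Render one column value for COPY's default text format."""
    if value is None:
        return r"\N"  # COPY's NULL marker
    s = str(value)
    # Escape characters that would break the row/column framing.
    return (s.replace("\\", "\\\\")
             .replace("\t", "\\t")
             .replace("\n", "\\n")
             .replace("\r", "\\r"))

def write_copy_file(rows, path):
    """Write an iterable of row tuples as a COPY-compatible text file."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write("\t".join(copy_escape(col) for col in row) + "\n")

# Hypothetical rows: integer, date, money, boolean
rows = [(1, "2012-12-01", 19.99, True),
        (2, None, 0.0, False)]
write_copy_file(rows, "sales.txt")
```

A file written this way loads with a plain `COPY target_table FROM 'sales.txt'`, no format options needed.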

    Now we will look at query performance.

We're also curious how the COPY command can be issued from the (DOS) command line instead of a client tool. With that knowledge in our arsenal, we can easily have our ETL jobs execute a "COPY FROM" command rather than an inefficient 3rd-party PostgreSQL adapter component.

Some PostgreSQL documentation is written for PostgreSQL folks like you, and we want to spend only minimal dollars right now investigating the potential of this product. Terms like "manpage", found during our quick reads on fsync, mean nothing to us and turn us off temporarily from going down certain research paths.

    So for a while the answer to your question about documentation is "because we are new, some of the documentation is over our heads and we will invest time in reading tons of documentation only after we see some potential in any given product".

    As you can imagine, we are comparing a number of products right now. So we need to be careful upfront about where we invest time. But we intend to catch on quickly with a little patience from forums like this one.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,777
    Rep Power
    348
    Originally Posted by db042188
We're also curious how the COPY command can be issued from the (DOS) command line instead of a client tool.
COPY is a regular Postgres SQL statement, so you can run it with the psql command-line client, e.g. by putting the statement into a SQL script and then calling psql with the script name, the same way Oracle's SQL*Plus works.
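For instance (script, database and user names invented), with the COPY statement saved in a script called load.sql, the whole load becomes a one-liner:

```shell
# Hypothetical: run the SQL script non-interactively
psql -U postgres -d targetdb -f load.sql
```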

    Terms like "Manpage" found during our quick reads on fsync mean nothing to us and turn us off temporarily from going down certain research paths.
Did you read the link to the manual (not a "manpage") that I included in my previous answer?
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    11
    Rep Power
    0
Thx Shammat. We don't use Oracle here.

We're probably going to find that we need to run something like this: c:\>psql.exe param1, param2, ...etc., where one of these params is a "SQL statement". We'll look up the details.
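If it helps anyone following along, a hypothetical example of the full command line (server, database and paths invented) would look roughly like:

```
C:\> psql.exe -h localhost -U postgres -d targetdb -c "\copy target_table FROM 'C:\exports\export.txt'"
```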

Before your post we had probably hit most of the docs you listed. Somewhere in those docs, use of the word "manpage" (or something like that) turned us off temporarily from pursuing the exact recipe for turning off fsync. We plan to come back to your docs if PG's query engine shows promise.

I should add that we have already acquired a book on the subject. But we may need this or another forum to help us get started. We usually ask only a few strategic questions in forums before becoming mostly self-reliant. Your patience is appreciated.
