#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    90
    Rep Power
    10

    MySQL to PostgreSQL, LARGE ROW COUNT


    Hello, I work for an organization that uses MySQL heavily for articles and the like.

    Right now we have over 700,000 rows in the article table. The previous architect/programmer, whom I am in the process of replacing, used MySQL to eventually point to YAML files that hold the rest of the information about each article, such as the pages, the images, and the OCR text.

    When I came on I could not fathom why he didn't store the page data in the database as well. He mentioned a massive performance issue once a table has over 20 million rows, as it would if we logged the pages too, so now there is a lot of programming logic to alter/adjust/read these YAML files for the sake of speed.

    Personally, I am not entirely convinced. He made this decision 3 years ago, when computing power and storage were relatively expensive compared to today, and now we have the funding and the hardware to handle pretty much anything.

    Would PostgreSQL be significantly faster than the MySQL setup we have now?

    Could I implement a page table as a proof of concept to demonstrate this?

    Also, a lot of the code that deals with the MySQL database is already written; would moving over to PostgreSQL be as easy as changing the DBI calls in the Perl config files we use?

    I know this isn't very specific, and I will answer any questions you have to help you help me. I have been doing MySQL programming alongside Perl and PHP for 4 years now, and the thing I always liked about it is that if I left the job, someone with knowledge of any programming language could always pick it back up from the database structure left behind: a well-normalized, well-organized structure that was clear and required no programmatic "tricks" to get things done.

    That isn't the case here, and it makes me uneasy that the company almost relies on this setup, and on me, since Perl programmers are more and more rare, especially any worth a damn.

    Any help would be nice. I am really curious about PostgreSQL, coming from only MySQL; besides, I want another DB to put on the resume.

    Thanks a lot; I know I'm long-winded on this.
  2. #2
  3. Modding: Oracle MsSQL Firebird
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jun 2001
    Location
    Outside US
    Posts
    8,527
    Rep Power
    539
    Originally Posted by animusdk
    ...
    When I came on I could not fathom why he didn't store the page data in the database as well. He mentioned a massive performance issue once a table has over 20 million rows, as it would if we logged the pages too, so now there is a lot of programming logic to alter/adjust/read these YAML files for the sake of speed.

    Personally, I am not entirely convinced. ...
    The question has been asked many times on these forums and we all went against storage in the database.
  4. #3
  5. Gödelian monster
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Jul 1999
    Location
    Central Florida, USA
    Posts
    2,307
    Rep Power
    61
    Originally Posted by pabloj
    The question has been asked many times on these forums and we all went against storage in the database.
    No we didn't. I have never been against this approach in principle. It is simply a matter of application design requirements plus system capabilities. I don't understand why so many programmers are offended at someone even considering storing data like this in a database.

    The real question to ask is "why are you considering storing this in the database?" Yes, there are tradeoffs: better logical control over your data, but definite performance implications. The question to ask is whether the tradeoff is worth it to you.

    Also, this is as much a question of database design as anything else. A properly-designed database will not see as many performance problems from this as a poorly-designed one. To start with, if you store documents in the database, you probably don't want that column added to your article index table. You should have an associated table with a foreign key to your article index table, and then a column with an appropriate data type to store your article document. This way, the database only touches that data when it needs to retrieve a specific document or group of documents, rather than every single time the article index tables are searched.
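    A minimal sketch of that layout (table and column names here are invented for illustration, not taken from the OP's schema):

    ```sql
    -- Article index table: searched constantly, so keep it narrow.
    CREATE TABLE article (
        article_id  serial PRIMARY KEY,
        title       varchar(255) NOT NULL,
        published   date
    );

    -- Associated table: one row per stored page/document, touched
    -- only when a specific document is actually saved or retrieved.
    CREATE TABLE article_document (
        document_id serial PRIMARY KEY,
        article_id  integer NOT NULL REFERENCES article (article_id),
        page_no     integer NOT NULL,
        ocr_text    text,   -- text/varchar columns can hold up to 1 GB
        page_image  bytea   -- binary page scan, also up to 1 GB
    );
    ```

    Index searches hit only the narrow article table; the wide document rows are fetched by primary or foreign key when a page is actually needed.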

    The other thing you want to make sure of is that you never do *searches* on this table. If you want full-text indexing, extract the text from the document, and index it into another associated table with tsearch2, which is now in the 'contrib' section of the PostgreSQL distribution. Now there should be no reason to do large aggregate queries on your documents table. Keep it down to simple saves and retrieves and the performance of the rest of your database should not be significantly impacted.
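    Roughly like this, assuming tsearch2 has been installed from contrib into your database (the tsvector type, to_tsquery, and the @@ operator come from that module; table names are again invented):

    ```sql
    -- Separate table holding only the extracted text vectors, so
    -- full-text searches never scan the documents table itself.
    CREATE TABLE article_fts (
        article_id  integer NOT NULL,
        page_no     integer NOT NULL,
        textvec     tsvector
    );

    -- A GiST index makes the @@ match operator fast.
    CREATE INDEX article_fts_idx ON article_fts USING gist (textvec);

    -- Then queries look something like:
    --   SELECT article_id, page_no
    --   FROM article_fts
    --   WHERE textvec @@ to_tsquery('postgres & performance');
    ```

    You populate textvec from the extracted OCR text at save time, and the big documents table never sees a full-text query.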

    By default, PostgreSQL char, varchar, and text columns can be up to 1 GB in size (no lame 8K limit, as was common in most DBMSs up until recently). Binary data can also be encoded and stored in BYTEA columns of up to 1 GB. Really, though, I imagine your documents will be much smaller than this. If you are regularly storing 1 GB files in the database, then you will probably want to reconsider.

    If it turns out that this documents table is growing large and impacting main database performance, PostgreSQL 8+ has that wonderful feature known as tablespaces, which allow you to choose the physical data storage location of any table. Thus, you can dedicate an additional disk or even RAID array to your documents table, separate from all other database storage.
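    In 8.0 that looks something like this (the tablespace name and path are made up; the directory must exist and be owned by the postgres user):

    ```sql
    -- Point a tablespace at a directory on the dedicated disk or array.
    CREATE TABLESPACE docspace LOCATION '/mnt/raid1/pgdata';

    -- New tables can be created there directly...
    CREATE TABLE article_document_archive (doc bytea) TABLESPACE docspace;

    -- ...or an existing table can be moved over.
    ALTER TABLE article_document_archive SET TABLESPACE pg_default;
    ```

    Everything else in the database stays on the default storage; only the big table moves.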

    As to how PostgreSQL would compare to MySQL in this situation: I imagine that with the right design either one could work OK, although PostgreSQL is much more proven on the heavy-database end of things. There are plenty of companies out there with multi-terabyte databases in PostgreSQL. See http://www.powerpostgresql.com/Downl...es_osc2005.pdf for a couple of examples.
    The real n-tier system:

    FreeBSD -> PostgreSQL -> [any_language] -> Apache -> Mozilla/XUL

    Amazon wishlist -- rycamor (at) gmail.com
