Thread: Casting?

#46
    Normally I would agree, but in this case my design goal is to keep query response time at a minimum. I think I would rather pay the storage cost of the two extra characters (country code) on each row as opposed to the processing cost of some sort of range comparison join on lookup.
    Hmm.. well you'd have to benchmark it, but searching a few hundred ranges could easily be faster than searching 4 billion records. It depends largely on how you are going to search the countries.

    Even if I'm wrong, and I end up going with a separate table in the final production configuration, it is still an update I want to learn how to do...
That would be simple: UPDATE iptable SET countryid = xx WHERE ip BETWEEN range_lower_limit AND range_upper_limit;
But it's the millions of changes that cause the "problem".
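To make the comparison concrete, here is a minimal sketch of both options (iptable, ip_country_ranges and the literal values are only placeholders, not your actual schema):
[code]
-- Option 1: country code stored on every row, updated in bulk per allocation range
UPDATE iptable
SET    countryid = 'US'
WHERE  ip BETWEEN 16777216 AND 16777471;

-- Option 2: a small lookup table of ranges, joined at query time instead
CREATE TABLE ip_country_ranges (
    range_start bigint  NOT NULL,
    range_end   bigint  NOT NULL,
    countryid   char(2) NOT NULL
);

SELECT i.ip, r.countryid
FROM   iptable i
JOIN   ip_country_ranges r ON i.ip BETWEEN r.range_start AND r.range_end
WHERE  i.ip = 3221226219;
[/code]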
#47
    Originally Posted by bjquinniii
    Though there was some strange behavior. The first time I ran the first query, it took 23 seconds to complete. After that, all three queries dropped down to the 10 ms area. What would cause that?
    Caching.

    Before the first run no cache was filled. Neither Postgres' buffer cache nor the filesystem cache of the operating system.

    The second run was completely retrieved from the buffer (or file system) cache.
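If you want to see this for yourself, EXPLAIN (ANALYZE, BUFFERS) shows how many blocks came from the buffer cache versus from disk (the query and table name below are only an example):
[code]
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM iptable WHERE ip = 3221226219;

-- Cold cache:  "Buffers: shared read=..."  (blocks fetched from disk / OS cache)
-- Warm cache:  "Buffers: shared hit=..."   (blocks served from Postgres' buffer cache)
[/code]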
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#48
    The second run was completely retrieved from the buffer (or file system) cache.

And that is one of the reasons why you may want to normalize fields into separate tables; at some point a table and its indexes will become too large to fit into memory, causing delays in fetching the data from disk.

See also: effective_cache_size in your configuration. This tells the planner how much of the database it can expect to find in the filesystem cache.

This is the point where you start using EXPLAIN to see what the database is doing, so you can avoid sequential scans (which touch every record, which is very slow for 4bln rows).

The PostgreSQL wiki also has queries for index maintenance, which let you see how much disk space your indexes are using and how often they are being used.
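A quick sketch of both points (the effective_cache_size value shown is only an example; it depends on your RAM):
[code]
-- In postgresql.conf, tell the planner roughly how much RAM is available for caching:
--   effective_cache_size = 12GB

-- Check what is currently set:
SHOW effective_cache_size;

-- Plain EXPLAIN shows the plan without running the query, so you can spot a
-- "Seq Scan" over the whole table versus an "Index Scan":
EXPLAIN SELECT * FROM iptable WHERE ip = 3221226219;
[/code]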
#49

caching...


    Originally Posted by shammat
    Caching.

    Before the first run no cache was filled. Neither Postgres' buffer cache nor the filesystem cache of the operating system.

    The second run was completely retrieved from the buffer (or file system) cache.
Does that mean that currently the complete IP address index for the 4.3 billion rows is fitting in memory?
#50

updates


    Originally Posted by Vinny42
    Hmm.. well you'd have to benchmark it, but searching a few hundred ranges could easily be faster than searching 4 billion records. It depends largely on how you are going to search the countries.

That would be simple: UPDATE iptable SET countryid = xx WHERE ip BETWEEN range_lower_limit AND range_upper_limit;
But it's the millions of changes that cause the "problem".
Thank you...
    "Problem?" - no problem...
    Shouldn't be many changes. The by-country allocations stay pretty static as far as I know.

I think my typical usage is going to be along the lines of "tell me everything we know about IP xxx.xxx.xxx.xxx".
There may also be some more batch-oriented runs (perhaps map-reduce jobs) that dive into the whois info, which will eventually get stored in the hstore column.
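Something along these lines, I expect (the column names are just my working guesses for the schema described above):
[code]
SELECT ip,
       allocation_status,
       regional_registry,
       countryid,
       extra              -- the hstore column (whois details, banners, ...)
FROM   iptable
WHERE  ip = 3221226219;   -- integer form of the dotted-quad address
[/code]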
#51

indexes...


    Originally Posted by Vinny42
And that is one of the reasons why you may want to normalize fields into separate tables; at some point a table and its indexes will become too large to fit into memory, causing delays in fetching the data from disk.

See also: effective_cache_size in your configuration. This tells the planner how much of the database it can expect to find in the filesystem cache.

This is the point where you start using EXPLAIN to see what the database is doing, so you can avoid sequential scans (which touch every record, which is very slow for 4bln rows).

The PostgreSQL wiki also has queries for index maintenance, which let you see how much disk space your indexes are using and how often they are being used.
    The 10 ms retrieval time tells me (I think) that at least the index is staying in memory. I can't imagine the table has any chance of staying in memory (my data folder is currently taking up 637 GB of space... and this is the only database I have... and so far, it is just the one table).

    Thanks for the pointers... definitely going to have to dive into index maintenance stuff shortly (probably right after I update all of the country codes).
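For reference, a query along the lines of the wiki's index maintenance ones, showing size and usage per index (assuming the table is simply called iptable):
[code]
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan                                     AS times_used
FROM   pg_stat_user_indexes
WHERE  relname = 'iptable'
ORDER  BY pg_relation_size(indexrelid) DESC;
[/code]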
#52
    I can't imagine the table has any chance of staying in memory (my data folder is currently taking up 637 GB of space... and this is the only database I have... and so far, it is just the one table).
A single 637GB table can't, but a 2GB partition can. And since you are probably going to process the IPs sequentially, that partition can be kept (mostly) in memory along with the other tables that you need for the work you're doing.

The larger your records are, the fewer you can keep in cache and the more disk activity you will get. Even with huge tables this is still something to keep in mind.

    Anyway, you have lots of things to try and work out, so happy hacking!
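For the record, a minimal sketch of how partitioning is usually set up in 9.x, via inheritance and CHECK constraints (the names and boundaries are only examples):
[code]
CREATE TABLE iptable_p000 (
    CHECK (ip >= 0 AND ip < 16777216)   -- first /8 block
) INHERITS (iptable);

CREATE INDEX iptable_p000_ip_idx ON iptable_p000 (ip);

-- With constraint_exclusion = partition (the default), a query that filters on ip
-- only has to touch the partitions whose CHECK constraint can match:
SELECT * FROM iptable WHERE ip = 12345678;
[/code]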
#53

updating country codes


    Originally Posted by bjquinniii
    The 10 ms retrieval time tells me (I think) that at least the index is staying in memory. I can't imagine the table has any chance of staying in memory (my data folder is currently taking up 637 GB of space... and this is the only database I have... and so far, it is just the one table).

    Thanks for the pointers... definitely going to have to dive into index maintenance stuff shortly (probably right after I update all of the country codes).
    Just generated the update query script to take care of the country codes... there are 127,833 individual ranges... big fun
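Roughly what the generated script looks like: one UPDATE per allocation range (the ranges and codes below are made up, not actual allocations):
[code]
UPDATE iptable SET countryid = 'AU' WHERE ip BETWEEN 16777216 AND 16777471;
UPDATE iptable SET countryid = 'CN' WHERE ip BETWEEN 16777472 AND 16778239;
-- ... and so on for all 127,833 ranges
[/code]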
#54

Update Script Finished!


    Originally Posted by bjquinniii
    Just generated the update query script to take care of the country codes... there are 127,833 individual ranges... big fun
    A bit over 65 hours.
    Data folder size now 1.21 TB.
#55
    65 hours, did you tune the DB yet?
#56
    Originally Posted by Vinny42
    65 hours, did you tune the DB yet?
    I will have to learn what tuning is before any of that happens.
The 65 hours was for a bit over 100k range-based updates, which gets me to the real starting point. Right now I have a row for every IP address (IPv4), its allocation status, which regional center manages it, and what country it has been assigned to (if any). Each row also has an hstore, where I will be able to push additional information (banners, domains, etc.). I am still undecided about how I'm going to handle domain-specific information: I may create a new table for those, or I may just have duplicate info for the cases where a single DNS entry maps to more than one IP address... I'm fairly good at relational things, but don't have any experience yet with key-value stores.
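For what it's worth, this is the kind of hstore usage I have in mind (the column name and keys are only illustrative):
[code]
-- Append a key/value pair to the hstore column of one row
-- (coalesce in case the column is still NULL):
UPDATE iptable
SET    extra = coalesce(extra, ''::hstore) || hstore('banner', 'Apache/2.2.22')
WHERE  ip = 3221226219;

-- Pull a single key back out, only for rows that have it:
SELECT ip, extra -> 'banner' AS banner
FROM   iptable
WHERE  extra ? 'banner';
[/code]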
#57
    I will have to learn what tuning is before any of that happens.
Then spend a few hours learning that, because right now you are wasting a lot of your own time waiting for the database; an internal HD should be capable of at least 30 MB/sec, which means a simple insert like what you are doing should not take more than 12 hours because it hardly requires any other work than hard disk throughput.

    As for the HSTORE, I maintain that you should not use that for data that you expect to store, and that you should normalize.

One reason is that currently, the only real drawback of HSTORE is that you cannot nest them; you cannot put an HSTORE under a key in an HSTORE, so they are not good for storing multiples of the same thing, like two domain names, without resorting to the horrible "domain1", "domain2" poop.

Another reason is that your application has the nasty habit that a lot of the data you fetch for an IP will belong to the hosting company that's hosting it, so one technical contact will belong to thousands of IPs. If that guy's name is "Verylongfirstnameindeed MacVerylonglastnameindeedinfactitsmuchlonger", then you are wasting megabytes storing the same thing over and over, which means wider rows, bigger indexes and a serious risk that PgSQL will start TOASTing your rows: swapping them out from the main table to TOAST tables, which are slower than non-TOAST storage. (This is part of the reason why I want you to learn about optimizing before you continue building a database so large that you will notice every delay multiplied by a factor of several million.)
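A sketch of the normalized alternative, where the long contact name is stored once and referenced by id (table and column names are only examples):
[code]
CREATE TABLE contacts (
    contact_id serial PRIMARY KEY,
    name       text NOT NULL
);

CREATE TABLE ip_contacts (
    ip         bigint  NOT NULL,
    contact_id integer NOT NULL REFERENCES contacts,
    PRIMARY KEY (ip, contact_id)
);
-- "Verylongfirstnameindeed Mac..." is stored once in contacts;
-- the rows in iptable and ip_contacts stay narrow.
[/code]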
#58
    Originally Posted by Vinny42
Then spend a few hours learning that, because right now you are wasting a lot of your own time waiting for the database; an internal HD should be capable of at least 30 MB/sec, which means a simple insert like what you are doing should not take more than 12 hours because it hardly requires any other work than hard disk throughput.
    Just downloaded Smith's PostgreSQL 9.0 High Performance as a starting point... Any other recommendations (already looking through the Wiki)?

    Originally Posted by Vinny42
    As for the HSTORE, I maintain that you should not use that for data that you expect to store, and that you should normalize.
One reason is that currently, the only real drawback of HSTORE is that you cannot nest them; you cannot put an HSTORE under a key in an HSTORE, so they are not good for storing multiples of the same thing, like two domain names, without resorting to the horrible "domain1", "domain2" poop.
You're probably right, though I am still wrestling with what the best structure will be. Columns named "domain1", "domain2" were one thing I was definitely trying to avoid. I also wanted to avoid having a bunch of joined tables, but that may be unavoidable.
    Originally Posted by Vinny42
Another reason is that your application has the nasty habit that a lot of the data you fetch for an IP will belong to the hosting company that's hosting it, so one technical contact will belong to thousands of IPs. If that guy's name is "Verylongfirstnameindeed MacVerylonglastnameindeedinfactitsmuchlonger", then you are wasting megabytes storing the same thing over and over, which means wider rows, bigger indexes and a serious risk that PgSQL will start TOASTing your rows: swapping them out from the main table to TOAST tables, which are slower than non-TOAST storage. (This is part of the reason why I want you to learn about optimizing before you continue building a database so large that you will notice every delay multiplied by a factor of several million.)
    Again, yes, I'm probably going to have to suck it up and go with a relational model. At least that is something I know how to do.
#59
    Originally Posted by Vinny42
Then spend a few hours learning that, because right now you are wasting a lot of your own time waiting for the database; an internal HD should be capable of at least 30 MB/sec, which means a simple insert like what you are doing should not take more than 12 hours because it hardly requires any other work than hard disk throughput.
There may have been some other reasons for the excessive run time on those updates. Correct me if I'm wrong, but the range-based updates I was doing should take longer than simple inserts. The file system is also busy moving some stuff around as I prepare to adjust my hardware configuration a bit.
#60
There may have been some other reasons for the excessive run time on those updates. Correct me if I'm wrong, but the range-based updates I was doing should take longer than simple inserts.
Maybe, but given the size of what you are working with, my guess is that most of the time taken is the database recycling its logs as it tries to keep your checkpoints going while only using three 16 MB chunks of log space.
You also haven't tuned autovacuum etc., so you may even be fighting that too.
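The settings I'm referring to can be checked like this (the suggested values in the comments are only typical 9.x bulk-load starting points, not gospel):
[code]
SHOW checkpoint_segments;            -- default 3, i.e. three 16 MB WAL segments
SHOW checkpoint_completion_target;   -- default 0.5
SHOW autovacuum;                     -- on by default, but untuned

-- Typically raised in postgresql.conf for bulk loading, e.g.:
--   checkpoint_segments = 32
--   checkpoint_completion_target = 0.9
-- followed by a reload/restart; the High Performance book covers the trade-offs.
[/code]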

The high performance book is a very good start; try to read a few pages of that every day. There is no magic bullet to tuning your database; it's all about understanding what the database does and what you make it do. That book explains it very well.

    Again, yes, I'm probably going to have to suck it up and go with a relational model. At least that is something I know how to do.
Indeed. If I were given this assignment, I would first set up the fully normalised version and, if access speed requires it, I'd use a materialized view of some kind to create the "denormalized" tablus-giganticus that you are building now.
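Since 9.3 that can be spelled directly; a minimal sketch, reusing the placeholder names from earlier in the thread (iptable, ip_country_ranges):
[code]
CREATE MATERIALIZED VIEW ip_flat AS
SELECT i.ip, r.countryid, i.extra
FROM   iptable i
JOIN   ip_country_ranges r ON i.ip BETWEEN r.range_start AND r.range_end;

-- Re-run whenever the underlying tables change:
REFRESH MATERIALIZED VIEW ip_flat;
[/code]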

    But hey, there's only one way to know if something works: try it and see. :-)