Thread: Casting?

    #16 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    As much as I share your dislike for MySQL, that statement isn't true - especially when it comes to partitioning, which is (unfortunately) still a bit better in MySQL than in Postgres (although the mere existence of an error message that says "table is full" is kind of funny - if not embarrassing)
    I can give you a long list of embarrassing fails in MySQL. It will silently commit a transaction if you forget to roll back before opening the next one. It thinks 1/0 = NULL. Its default isolation level is REPEATABLE READ... No amount of partitioning support is going to make it a reliable database.
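    For illustration, the first two of those behaviors look like this in practice (a sketch against a hypothetical table t, MySQL/MariaDB defaults assumed):
    Code:
    START TRANSACTION;
    INSERT INTO t VALUES (1);
    START TRANSACTION;   -- silently COMMITs the pending transaction
    ROLLBACK;            -- too late: the first INSERT is already committed
    
    SELECT 1/0;          -- returns NULL instead of raising an error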


    But, back on topic: HBase or hstore, you're still going to have terabytes of data that you need to be able to access with some degree of speed. At some point you are going to want to do things like see how long ago an address was last checked, and you cannot afford to plow through 4 GB of data to do it: just reading the table from disk will take minutes, and if you need to do that a few thousand times you're in for a week of hard-disk activity. Swapping excess data to an external store can be done using foreign data wrappers, but that external store will still have to deal with terabytes of data, so unless the data can be easily and quickly compressed (and you don't have to search the compressed data) you're not going to get much benefit from that.
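    For what it's worth, the foreign data wrapper route mentioned above would look roughly like this (a sketch only; the server, credentials and table names are made up, and postgres_fdw needs 9.3 or later):
    Code:
    CREATE EXTENSION postgres_fdw;
    
    CREATE SERVER archive_srv
      FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'archive.example.com', dbname 'ipdata');
    
    CREATE USER MAPPING FOR CURRENT_USER
      SERVER archive_srv
      OPTIONS (user 'scanner', password 'secret');
    
    -- queries like a local table, but the rows live on the remote server
    CREATE FOREIGN TABLE ip_attributes_archive (
      address    inet,
      attributes hstore
    )
    SERVER archive_srv
    OPTIONS (schema_name 'public', table_name 'ip_attributes');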

    Oh, and all the information about IPs and countries is already available in IP-to-country databases that you can download for free these days. That may be a good start to get your database filled with some real-world data so you can run actual queries and see where you stand before making a decision.
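    Loading such a list is a one-off COPY; a sketch, assuming a downloaded CSV with start address, end address and country code columns (the actual layout differs per provider):
    Code:
    CREATE TABLE ip2country
    (
      start_ip     inet NOT NULL,
      end_ip       inet NOT NULL,
      country_code char(2) NOT NULL
    );
    
    COPY ip2country FROM '/tmp/ip2country.csv' WITH (FORMAT csv, HEADER true);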
    #17 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)

    hstore


    Originally Posted by shammat
    Adding a new (nullable) column to an existing table is basically a no-op and is extremely fast:
    Code:
    psql (9.3.0, server 9.2.2)
    Type "help" for help.
    
    postgres=# select count(*)
    postgres-# from foo;
    
      count
    ----------
     29999999
    (1 row)
    
    
    Time: 3466,883 ms
    postgres=# ALTER TABLE foo ADD COLUMN bar varchar(20);
    ALTER TABLE
    Time: 1,458 ms
    postgres=#
    Another alternative to the "dynamic attributes" problem is the hstore datatype which is essentially a key/value store inside a single row:
    Code:
    create table ip_attributes
    (
      address inet not null primary key,
      attributes hstore
    );
    
    insert into ip_attributes
    values
    ('10.1.1.1'),
    ('10.1.1.2'),
    ('10.1.1.9');
    
    select address, attributes -> 'ping' as ping, attributes -> 'nslookup' as nslookup
    from ip_attributes
    order by 1;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 |      |         
    10.1.1.2 |      |         
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('ping', 'OK')
    where address = '10.1.1.1'::inet;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 | OK   |         
    10.1.1.2 |      |         
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('nslookup', 'Zaphod')
    where address = '10.1.1.2'::inet;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 | OK   |         
    10.1.1.2 |      | Zaphod  
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('nslookup', 'Arthur')
    where address = '10.1.1.1'::inet;
    
    select address, attributes -> 'ping' as ping, attributes -> 'nslookup' as nslookup
    from ip_attributes
    where attributes ? 'nslookup'
    order by 1;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 |      | Arthur  
    10.1.1.2 |      | Zaphod
    You can add any attribute without the need to alter the table. The column can be indexed, and queries testing the keys or the values are extremely fast (e.g.: attributes ? 'nslookup' or where attributes @> hstore('ping','OK'))

    More details in the manual: http://www.postgresql.org/docs/current/static/hstore.html
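    One detail worth noting in the quoted example: assigning a new hstore replaces the whole column value, which is why 10.1.1.1 loses its 'ping' entry in the last result above. To add or change a single key while keeping the rest, the values can be merged with the || operator (a sketch against the same table):
    Code:
    -- || concatenates two hstores; keys from the right-hand side win on conflict,
    -- so an existing 'ping' => 'OK' entry survives this update
    UPDATE ip_attributes
       SET attributes = coalesce(attributes, ''::hstore) || hstore('nslookup', 'Arthur')
     WHERE address = '10.1.1.1'::inet;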
    This looks promising. I will definitely have to play with HSTORE. The table will absolutely be sparsely populated; I just have not made that adjustment in my thinking yet.
    As an example of how sparse: I took a particular country and mapped out its complete IPv4 footprint, which came in at just under 10 million addresses. Running a basic nmap reverse DNS scan on those, it came down to about 1.5 million domains. Running nslookup on those brings back some additional IPs (that didn't turn up in the original nmap scan) and also some aliases. So, of those 10 million rows, only 1.5 million would have a domain name and, eventually, whois info. Meanwhile, I will also do banner/port scanning (Shodan style), but that is likely to be sparse as well.
    I am curious about how quickly attribute-based queries will come back, and how to structure the indexes for the HSTORE data...

    As an aside, I have some Python experience, so am fairly well acquainted with the numerous Monty Python based naming conventions in that arena... noting your data examples, is HG2G typically used in Postgres-land?
    #18 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Partitioning I understand (conceptually, haven't worked with it yet, but do understand the importance for this project), but what would inheritance do for me?
    Sorry, I should have explained: PostgreSQL uses inheritance to do partitioning; each partition inherits its structure from the master table. Clients talk to the master table and don't see that it passes data back and forth to and from its children.
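    A minimal sketch of what that looks like (the table and range names are made up; rows still have to be routed into the right child, typically by a trigger on the master or by inserting into the child directly):
    Code:
    CREATE TABLE ip_scan
    (
      address inet        NOT NULL,
      checked timestamptz NOT NULL
    );
    
    -- each partition carries a CHECK constraint so the planner can skip it (constraint exclusion)
    CREATE TABLE ip_scan_2013_q4
    (
      CHECK (checked >= '2013-10-01' AND checked < '2014-01-01')
    ) INHERITS (ip_scan);
    
    CREATE TABLE ip_scan_2014_q1
    (
      CHECK (checked >= '2014-01-01' AND checked < '2014-04-01')
    ) INHERITS (ip_scan);
    
    -- queries against the master transparently include all children
    SELECT count(*) FROM ip_scan WHERE checked >= '2014-01-01';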

    Disks are cheap so I am not concerned about space (except for the whole 'how to manage' bit). I'm currently doing my development work on a box with 32 GB RAM and about 6 TB of hard drive space (the current tablespace is using a couple hundred gigs on one 3 TB drive). I'll scale up RAM and hard drives as necessary.
    Sounds good, but don't forget that you probably can't run this on a single disk: the throughput will be too low, and reliability requires a RAID 1+0 or RAID 5 array (which octuples (is that even a word?) the cost)

    But looking at your (and Shammat's) answers, it looks like I'll be heading down a pure Postgres road (the hstore data type looks like it will be a good fit for what I thought I was going to have to use HBASE for).
    HSTORE is exceedingly cool, but it is a key/value store designed for schemaless storage. It is best used for exactly that: data that you don't know the schema of, and that you don't have a need to enforce the structure of. You *can* enforce that some keys exist, but when you do that, you're basically wasting resources because a regular column is much faster at that.
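    For illustration, enforcing the presence of a key would be a CHECK constraint like the sketch below (using the ip_attributes table from the earlier example); a plain column, possibly NOT NULL, does the same job more cheaply:
    Code:
    -- every non-NULL attributes value must contain a 'ping' key
    ALTER TABLE ip_attributes
      ADD CONSTRAINT attributes_have_ping CHECK (attributes ? 'ping');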

    I'm curious about how this works out for you.
    #19 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)
    Originally Posted by Vinny42
    I can give you a long list of embarrassing fails in MySQL. It will silently commit a transaction if you forget to roll back before opening the next one. It thinks 1/0 = NULL. Its default isolation level is REPEATABLE READ... No amount of partitioning support is going to make it a reliable database.


    But, back on topic: HBase or hstore, you're still going to have terabytes of data that you need to be able to access with some degree of speed. At some point you are going to want to do things like see how long ago an address was last checked, and you cannot afford to plow through 4 GB of data to do it: just reading the table from disk will take minutes, and if you need to do that a few thousand times you're in for a week of hard-disk activity. Swapping excess data to an external store can be done using foreign data wrappers, but that external store will still have to deal with terabytes of data, so unless the data can be easily and quickly compressed (and you don't have to search the compressed data) you're not going to get much benefit from that.

    Oh, and all the information about IPs and countries is already available in IP-to-country databases that you can download for free these days. That may be a good start to get your database filled with some real-world data so you can run actual queries and see where you stand before making a decision.
    Can you point me at some of these free "IP-to-country databases" you mention? I haven't found any containing full domain listings or the other types of information I am trying to get at. ERIPP lets you query, but the most recent data in there seems to be a couple of years old. ShodanHQ has a ton of good stuff, but, at scale, it's pretty pricey to actually get (either in time or money - queries that return big piles of results are expensive, queries on individual IPs are slow, and a negative result is not definitive... it either means that nothing is there, or that he just hasn't gotten to that IP yet).
    #20 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Well, I haven't used offline databases in a while and I can't find really up-to-date stuff either, but the older ones should not be too bad:
    https://code.google.com/p/ip-countryside/

    Besides, your project should be able to update all this, right :-)
    #21 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)

    inheritance, disks, etc


    Originally Posted by Vinny42
    Sorry, I should have explained: PostgreSQL uses inheritance to do partitioning; each partition inherits its structure from the master table. Clients talk to the master table and don't see that it passes data back and forth to and from its children.
    Aaaaah, ok, thanks.


    Originally Posted by Vinny42
    Sounds good, but don't forget that you probably can't run this on a single disk: the throughput will be too low, and reliability requires a RAID 1+0 or RAID 5 array (which octuples (is that even a word?) the cost)
    Aye, just developing on a single disk for now. I also have a small NAS with about 4 TB of RAID 1+0 space, so I have some appreciation for escalating costs.



    Originally Posted by Vinny42
    HSTORE is exceedingly cool, but it is a key/value store designed for schemaless storage. It is best used for exactly that: data that you don't know the schema of, and that you don't have a need to enforce the structure of. You *can* enforce that some keys exist, but when you do that, you're basically wasting resources because a regular column is much faster at that.
    How about "semi-schemaless" storage? My problem is that I am still on the learning curve side of things, so I do not know what is possible. I know that there are IP addresses. That each has an allocation status. That many have domain names, whois info, aliases, etc. I know that I can get at a bunch of data by running nmap and other scanning tools. But I don't know in advance how many cool attributes I will eventually learn how to gather... so an open-ended key/value store seems like a good approach. If there is ever an end state where all of the potential columns are known/defined, then perhaps that structure would be more performant. I don't know.

    Originally Posted by Vinny42
    I'm curious about how this works out for you.
    Me too.
    #22 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Originally Posted by bjquinniii
    How about "semi-schemaless" storage? My problem is that I am still on the learning curve side of things, so I do not know what is possible. I know that there are IP addresses. That each has an allocation status. That many have domain names, whois info, aliases, etc. I know that I can get at a bunch of data by running nmap and other scanning tools. But I don't know in advance how many cool attributes I will eventually learn how to gather... so an open-ended key/value store seems like a good approach. If there is ever an end state where all of the potential columns are known/defined, then perhaps that structure would be more performant. I don't know.

    Hmm, normally I'm all for the "store everything, filter later" technique, and that will certainly be wise for a test, but there is a *lot* of data out there. Be careful that you don't end up with a dozen TB of data that you basically cannot query anymore because there is simply too much unindexed data.

    You should probably experiment with what data you can get, and be picky about what you keep.
    #23 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)
    Originally Posted by Vinny42
    Hmm, normally I'm all for the "store everything, filter later" technique, and that will certainly be wise for a test, but there is a *lot* of data out there. Be careful that you don't end up with a dozen TB of data that you basically cannot query anymore because there is simply too much unindexed data.

    You should probably experiment with what data you can get, and be picky about what you keep.
    The problem with being picky is that I do not know ahead of time which data will be useful, so I really do need to store everything for now. I will probably take the approach of picking a few countries to "go deep" on, then take the lessons learned from those to refine what I do on the others...
    #24 - shammat (Contributing User, Devshed Frequenter, Location: Germany, Join Date: Oct 2003, Posts: 2,784, Rep Power: 348)
    Originally Posted by bjquinniii
    I am curious about how quickly attribute-based queries will come back
    That is nearly impossible to answer for your specific case, but my experience is that it is really fast. But then, I have never tried it with billions of rows.

    Originally Posted by bjquinniii
    and how to structure the indexes for the HSTORE data
    That's easy: you only need a single index on the hstore column. You can either use a GiST index or a GIN index. I'm not sure which one would be better; you will need to do some performance testing based on your data. GIN is usually faster but requires much more space. There is some work going on to improve the speed of GIN indexes and hstore, and it looks like this will make it into the next release (9.4, probably Q3/2014).

    http://obartunov.livejournal.com/171959.html
    http://obartunov.livejournal.com/175235.html
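    For reference, creating such an index is a single statement; a sketch against the ip_addresses table used in the benchmark below (the index names are made up):
    Code:
    CREATE EXTENSION IF NOT EXISTS hstore;
    
    -- GIN: usually faster to query, but larger and slower to build and update
    CREATE INDEX ip_addresses_attr_gin ON ip_addresses USING gin (attributes);
    
    -- GiST alternative: smaller index, often slower for ? and @> queries
    -- CREATE INDEX ip_addresses_attr_gist ON ip_addresses USING gist (attributes);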

    Here is a little example executed on my pretty outdated desktop:
    Code:
    postgres=> select count(*) from ip_addresses;
      count
    ---------
     5000000
    (1 row)
    
    
    postgres=> \timing on
    Timing is on.
    postgres=> select count(*) from ip_addresses;
      count
    ---------
     5000000
    (1 row)
    
    
    Time: 1157,284 ms
    postgres=> select a, count(*)
    postgres-> from (
    postgres(>   select skeys(attributes) a
    postgres(>   from ip_addresses
    postgres(> ) t
    postgres-> group by a;
        a     |  count
    ----------+---------
     nslookup | 1000000
     whois    |  500000
     ping     |  500000
    (3 rows)
    
    
    Time: 7468,787 ms
    postgres=> select count(*) from ip_addresses where attributes @> hstore('whois', '101');
     count
    -------
         1
    (1 row)
    
    
    Time: 174,779 ms
    postgres=> select count(*) from ip_addresses where attributes @> hstore('ping', '100');
     count
    -------
         1
    (1 row)
    
    
    Time: 148,536 ms
    postgres=> select count(*) from ip_addresses where attributes @> hstore('nslookup', '100');
     count
    -------
         1
    (1 row)
    
    
    Time: 344,310 ms
    postgres=> select count(*) from ip_addresses where attributes ? 'whois';
     count
    --------
     500000
    (1 row)
    
    
    Time: 426,515 ms
    postgres=> select count(*) from ip_addresses where attributes ? 'nslookup';
      count
    ---------
     1000000
    (1 row)
    
    
    Time: 877,131 ms
    postgres=>
    Of course your mileage will vary greatly - you won't get around doing your own benchmarking.


    Originally Posted by bjquinniii
    noting your data examples, is HG2G typically used in Postgres-land?
    No, absolutely not. This is just my personal style.

    Originally Posted by Vinny42
    HSTORE is exceedingly cool, but it is a key/value store designed for schemaless storage. It is best used for exactly that: data that you don't know the schema of, and that you don't have a need to enforce the structure of. You *can* enforce that some keys exist, but when you do that, you're basically wasting resources because a regular column is much faster at that.
    I do agree that hstore is not always the solution, and simply adding new columns as new attributes are identified/implemented is a very feasible approach.

    But hstore does have one advantage in this case: you only need a single index to cover all potential queries. With a column per attribute you would need one index per attribute (although those could be made "sparse" as well by not indexing the NULL values).
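    A sketch of such a "sparse" (partial) index, with a hypothetical nslookup column standing in for the one-column-per-attribute design:
    Code:
    -- only rows that actually have a value are stored in the index,
    -- so a mostly-NULL column costs very little index space
    CREATE INDEX ip_attributes_nslookup_idx
      ON ip_attributes (nslookup)
      WHERE nslookup IS NOT NULL;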
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
    #25 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Originally Posted by bjquinniii
    The problem with being picky is that I do not know ahead of time which data will be useful,
    True, but just storing everything doesn't guarantee that you can actually do something with the data you've collected, either. If you're just plonking strings into an hstore then you may not be able to process the content into some useful format, so don't wait too long to analyze your data, and start getting picky soon.
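    A small sketch of the kind of processing that gets awkward: hstore values are plain text, so anything that is really a timestamp, number, etc. has to be parsed or cast at query time ('last_checked' is a hypothetical key):
    Code:
    SELECT address,
           (attributes -> 'last_checked')::timestamptz AS last_checked
      FROM ip_attributes
     WHERE attributes ? 'last_checked';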

    Anyway I think we agree that you should first get a test up and running and see what kind of data you are able to get.

    Comments on this post

    • shammat agrees : I too agree with that statement.
    #26 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Originally Posted by shammat
    But hstore does have one advantage in this case: you only need a single index to cover all potential queries.
    True, and that's also true for a regular EAV model (which is a horribly annoying model to use). But just as with an EAV model, having one index for everything usually means that you have an index that sort of works for most queries. Given the number of records and the amount of data expected, I think having a single index might not be an advantage after all.

    Also, index maintenance can be automated; especially with PostgreSQL's DDL (event) triggers you can automatically create indexes on new columns in a table. But you don't need that, because when you create specialized tables you already know what you are going to put in them and why, and therefore how you need them to be indexed.
    #27 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    As a follow-up to the "MySQL has better partitioning support" statement, I've looked up the limitations and... well... wow :-)

    "Server SQL mode. Tables employing user-defined partitioning do not preserve the SQL mode in effect at the time that they were created. As discussed in Section 5.1.7, “Server SQL Modes”, the results of many MySQL functions and operators may change according to the server SQL mode. Therefore, a change in the SQL mode at any time after the creation of partitioned tables may lead to major changes in the behavior of such tables, and could easily lead to corruption or loss of data."

    "Query cache not supported. "

    "Columns with spatial data types such as POINT or GEOMETRY cannot be used in partitioned tables."

    "A partitioning key must be either an integer column or an expression that resolves to an integer"

    And of course the kicker:

    "Foreign keys not supported for partitioned InnoDB tables. "


    http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html


    So yeah, MySQL is not an option for this application unless you want to end up in a very deep pool of purest poo.
    #28 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)

    MySQL


    Originally Posted by Vinny42
    As a follow-up to the "MySQL has better partitioning support" statement, I've looked up the limitations and... well... wow :-)
    [...]
    So yeah, MySQL is not an option for this application unless you want to end up in a very deep pool of purest poo.
    Even though most of my previous work has been of a smaller scale and MySQL (actually using the MariaDB fork) handled it nicely, this project quickly exceeded its capabilities. Though I have no doubt that it can handle more than what my skillset enabled, I think this would be a bit of a stretch for it. Right now I just need to play enough to find a reasonable PostgreSQL approach (doubt I'll hit on optimal any time soon, still in the beginning of the learning curve for this one... though this forum has been very helpful).
    #29 - Vinny42 (Contributing User, Devshed Newbie, Join Date: Oct 2013, Posts: 158, Rep Power: 10)
    Don't forget the most excellent mailing lists for PostgreSQL; the core developers also reply to user questions there. It might be a good idea to describe your project there too; they can tell you all the things you can do.
    #30 - bjquinniii (Contributing User, Devshed Newbie, Join Date: Nov 2013, Posts: 30, Rep Power: 1)
    Originally Posted by Vinny42
    Don't forget the most excellent mailing lists for PostgreSQL; the core developers also reply to user questions there. It might be a good idea to describe your project there too; they can tell you all the things you can do.
    Where do I find/subscribe to those?
