Thread: Casting?

#1
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    Casting?


Trying to create a single table with the complete IPv4 address space. So far, I have five columns.

    oct1 smallint
    oct2 smallint
    oct3 smallint
    oct4 smallint
    ipv4 character varying(16)[]

The rows were created using a series of cross joins from a single table with the values 0 - 255, so I have all ~4.3 billion rows. There are indexes on oct1 and ipv4, but the ipv4 values have not been entered yet. I am trying to write an update statement that will concatenate oct1 - oct4 into ipv4.

    This query:
    select *,cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as character varying(16)) from ipv4 where oct1=0 limit 100;

works, and gives me the expected results (though ipv4 shows up as "ipv4 character varying(16)[]" and the concatenation shows up as "varchar character varying(16)").

When I try to write the update, I get a variety of errors that basically tell me that I do not understand casting in PostgreSQL.

    For example,
    update ipv4
    set ipv4=cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as character varying(16)[])
    where oct1=0;
    reports:
    ERROR: array value must start with "{" or dimension information
    SQL state: 22P02

    while:
    update ipv4
    set ipv4=cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as character varying)
    where oct1=0;
    tells me:
    ERROR: column "ipv4" is of type character varying[] but expression is of type character varying
    LINE 2: set ipv4=cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as chara...
    ^
    HINT: You will need to rewrite or cast the expression.

Can someone clue me in on what I am doing wrong? (I have a fair amount of background in MS SQL Server and MySQL, but am just starting out in PostgreSQL.)

    (using the current 9.3 version, downloaded about a week ago).
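(Both errors point at the same root cause: the [] in the column definition makes ipv4 an array of varchars rather than a single 16-character string. Since no ipv4 values have been entered yet, a minimal sketch of a fix is to rebuild the column without the brackets — note this also drops the index on it:)

Code:
-- the [] made ipv4 an array column; rebuild it as a plain varchar
alter table ipv4 drop column ipv4;
alter table ipv4 add column ipv4 character varying(16);

-- after which the second update above works as written
update ipv4
set ipv4 = cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as character varying)
where oct1 = 0;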
#2
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,785
    Rep Power
    348
    Any reason you are not using the inet datatype?

    http://www.postgresql.org/docs/current/static/datatype-net-types.html#DATATYPE-INET
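For instance (a minimal sketch, not from the original post — the literals are made up):

Code:
-- inet stores the address natively, so string assembly casts straight over
select ('192'||'.'||'168'||'.'||'1'||'.'||'5')::inet;        -- 192.168.1.5

-- and it understands networks, e.g. containment tests:
select '192.168.1.5'::inet << '192.168.1.0/24'::inet;        -- true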
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#3
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    ignorance (mine)


Will try that, thank you.
    When I create a new column of type inet, what is the difference between inet and inet[]?
#4
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,785
    Rep Power
    348
    Originally Posted by bjquinniii
    Will try that, thank-you.
    When I create a new column of type inet, what is the difference between inet and inet[]?
    inet[] is an array

    http://www.postgresql.org/docs/current/static/arrays.html
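A tiny illustration (hypothetical table, not from this thread) of the difference:

Code:
-- inet holds one address; inet[] holds a whole list in a single cell
create table hosts (name text, addresses inet[]);

insert into hosts values ('gateway', array['10.0.0.1', '192.168.0.1']::inet[]);

select name, addresses[1] from hosts;   -- arrays are 1-indexed: 10.0.0.1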
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#5
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    update limit?


Looks like it is working now (I'll know better when the current update query finishes with 16.7 million rows... really should have built some smaller tables to play with first).

    I added a column of type inet called ipv4inet and ran the following:

    update ipv4
    set ipv4inet=cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as inet)
    where oct1=0;

Using the where clause drops the affected rows from 4.3 billion to 16.7 million. I'll update the rest if it works.

    This leads me to another question. In MySQL, I could put a "limit" clause on the update query to test things out, but it doesn't look like I can use that here. Is there a way to get that functionality in PostgreSQL? Perhaps by using a limit clause in the "with_query"?
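(PostgreSQL's UPDATE indeed has no LIMIT clause. One common workaround — a sketch, untested against this table — is to restrict the update through the hidden ctid column and a limited subquery, which achieves what the MySQL LIMIT would:)

Code:
-- emulate MySQL's UPDATE ... LIMIT via ctid
update ipv4
set ipv4inet = cast(oct1||'.'||oct2||'.'||oct3||'.'||oct4 as inet)
where ctid in (
  select ctid from ipv4 where oct1 = 0 limit 1000
);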
#6
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    array type


    PostgreSQL allows arrays within a single cell of the table?
    Interesting...
    Thanks again...
#7
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    update query


FYI, the update query worked.
It took a bit over three hours for the first 16 million rows, but the results look great.
Appreciate the clue.
#8
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    158
    Rep Power
    10
Why do you have four octet columns?
And for that matter, why are you inserting records for all possible IP addresses, all 4.2 billion of them? At four smallints plus one inet per address you will need at least 8 bytes × 4.2 billion addresses ≈ 34 GB of disk space. If you put all that in a single table you're going to spend most of your time waiting for your server to deal with the hard disk's throughput.

    Perhaps you should explain a bit more about what it is that you are trying to do...
#9
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    What I'm trying to do...


    Vinny,

    The four oct fields serve three purposes.
    The first is that they were how I constructed the full rowcount.
I started with a single table with values 0 - 255. I cross joined that table with itself and then had 256^2 rows. I cross joined the result with the first table and then had 256^3 rows. Same thing again gave me 256^4 rows, the full IPv4 space.
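(For reference, a sketch of that construction; the seed table name octets is assumed, since the original name was not given:)

Code:
-- seed table with the 256 possible octet values
create table octets (n smallint);
insert into octets select generate_series(0, 255);

-- a four-way cross join yields 256^4 = ~4.3 billion rows
insert into ipv4 (oct1, oct2, oct3, oct4)
select a.n, b.n, c.n, d.n
from octets a
  cross join octets b
  cross join octets c
  cross join octets d;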

Second purpose is to use the values in these fields to construct values for two more columns. The first column is the usual IP address, which is just a concatenation of the four oct fields (with dots inserted between the octets). The second column will be the integer value of the address, (256^3*oct1)+(256^2*oct2)+(256^1*oct3)+(256^0*oct4) - looks like I have to use bigint for that field in PostgreSQL. An unsigned INT would hold it, but "unsigned" does not seem to be an option here.
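(The bigint is indeed required: the maximum address, 255.255.255.255, is 4294967295, which exceeds the signed 32-bit maximum of 2147483647. A sketch of the computation — bit shifts sidestep the fact that ^ returns double precision in Postgres:)

Code:
-- integer form of the address; widen to bigint before shifting to avoid overflow
select (oct1::bigint << 24) + (oct2::bigint << 16) + (oct3::bigint << 8) + oct4 as ip_int
from ipv4
limit 5;

-- alternatively, once an inet column exists, inet subtraction yields the same bigint:
-- select ipv4inet - '0.0.0.0'::inet from ipv4 limit 5;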

    Third purpose is to evaluate how suitable PostgreSQL is for this particular project. I had started in MySQL but was running into size issues when I started trying to add indexes (even at 1/4 the full table size, when I tried adding the first index, MySQL would report that the table was "full").

    More columns will be added later. For example, there will be a column to hold the international allocation status for each IP. There will be another column for a ping status, and an nslookup, and a whois, and banners from various ports.

    Speed is not particularly important to me, just capacity/capability. Completed rows will eventually be pushed out to CSV and jammed into (probably) an HBASE store of some sort.
#10
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,785
    Rep Power
    348
    Originally Posted by bjquinniii
    The first is that they were how I constructed the full rowcount
    That is unnecessary, because that number never changes.

    Second purpose is to use the values in these fields to construct values for two more columns. The first column is the usual IP address, which is just a concatenation of the four oct fields (with decimals inserted between the octets). The second column will be the integer value of the address (256^3*oct1)+(256^2*oct2)+(256^1*oct3)+(256^0*oct4) - looks like I have to use Bigint for that field in PostgreSQL.
Sorry, I don't get it. Can you add some sample queries where you would need this? Are you aware that Postgres' inet data type supports an "ip range" as well?

    Postgres also has pretty powerful range datatypes - I don't really understand your real problem but maybe they might be of some help as well:
    http://www.postgresql.org/docs/current/static/rangetypes.html

    More columns will be added later. For example, there will be a column to hold the international allocation status for each IP. There will be another column for a ping status, and an nslookup, and a whois, and banners from various ports.
    You still didn't describe the real problem you are trying to solve with this table. If you want to store attributes per IP address, I would probably only store those IP addresses where I know the additional attributes. The final CSV file could then be generated by "exploding" that table to all IP addresses using the generate_series() function. Something like this:

    Code:
    create table ip_attributes
    (
      address inet,
      ping_status varchar(10),
      whois       varchar(100)  
    );
    
    insert into ip_attributes
    values
    ('10.1.1.1', 'NOT OK', null),
    ('10.1.1.2', 'OK', 'Arthur Dent'),
    ('10.1.1.9', 'NOT OK', 'Zaphod Beeblebrox');
    
    with all_addresses (address) as 
    (
      select '10.1.1.0'::inet + i
      from generate_series(1,10) i
    ) 
    select aa.address, attr.ping_status, attr.whois
    from all_addresses aa
      left join ip_attributes attr on attr.address = aa.address
    ;
    will output the following:

    Code:
    address   | ping_status | whois            
    ----------+-------------+------------------
    10.1.1.1  | NOT OK      |                  
    10.1.1.2  | OK          | Arthur Dent      
    10.1.1.3  |             |                  
    10.1.1.4  |             |                  
    10.1.1.5  |             |                  
    10.1.1.6  |             |                  
    10.1.1.7  |             |                  
    10.1.1.8  |             |                  
    10.1.1.9  | NOT OK      | Zaphod Beeblebrox
    10.1.1.10 |             |
    By changing the parameter for the generate_series() call you can generate any number of ip addresses you like without actually storing them.
    Last edited by shammat; November 18th, 2013 at 06:04 AM.
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#11
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    ranges, et al


    Originally Posted by shammat
    That is unnecessary, because that number never changes.
Well, I guess I'm just more comfortable having a static list, both to ensure that I have a complete framework and to avoid omission problems if something goes wrong with dynamic creation later (and because, at least in my prior experience, dynamically building the lists over and over again takes more processing power/time). That may not be the case here; I definitely was not familiar with the range support (or the inet type) available in PostgreSQL.

    Originally Posted by shammat
    Sorry I don't get it. Can you add some sample queries where you would need this? Are you aware that Postgres' inet data type supports an "ip range" as?
There are a couple of different types of questions/queries which I will be asking. At the end state (which will never be complete - so it would be more accurate to say "at the current state of data which has been gathered") I would like to be able to query on a particular IP address and get back everything my data store knows about it (country allocation, ping, dns, whois, ports, banners, etc.).

The other type of query/usage is to use the table as both the driver and storage container for data gathering. The sequence would go something like this: "select ipv4 from [table] where country code = 'XX';" Then use this recordset to drive a utility which performs some sort of action on each address (whether it be a ping, some type of nmap query, or whatever) and store the results back into the table.

As I think through this problem, and begin to become more familiar with PostgreSQL, it may make more sense to simply use PostgreSQL to store a table of the endpoints of the various country allocation blocks and use generate_series to come up with the driving list, store the results of individual runs, generate CSVs from those, and then push the results to some flavor of HBASE.
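(A rough sketch of that endpoint-table idea — all names hypothetical — using inet arithmetic with generate_series:)

Code:
-- one row per allocated block, stored only by its endpoints
create table country_blocks (
  country     char(2),
  block_start inet,
  block_end   inet
);

insert into country_blocks values ('XX', '10.1.1.0', '10.1.1.255');

-- expand a country's blocks into individual addresses on demand
select cb.country, cb.block_start + i as address
from country_blocks cb,
     generate_series(0, cb.block_end - cb.block_start) as i
where cb.country = 'XX';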

    Originally Posted by shammat
    Postgres also has pretty powerful range datatypes - I don't really understand your real problem but maybe they might be of some help as well:
    http://www.postgresql.org/docs/current/static/rangetypes.html
    Was not aware of the range datatypes in Postgres, and you're right, those may help dramatically simplify things for me.

    Originally Posted by shammat
    You still didn't describe the real problem you are trying to solve with this table. If you want to store attributes per IP address, I would probably only store those IP addresses where I know the additional attributes. The final CSV file could then be generated by "exploding" that table to all IP addresses using the generate_series() function. Something like this:

    Code:
    create table ip_attributes
    (
      address inet,
      ping_status varchar(10),
      whois       varchar(100)  
    );
    
    insert into ip_attributes
    values
    ('10.1.1.1', 'NOT OK', null),
    ('10.1.1.2', 'OK', 'Arthur Dent'),
    ('10.1.1.9', 'NOT OK', 'Zaphod Beeblebrox');
    
    with all_addresses (address) as 
    (
      select '10.1.1.0'::inet + i
      from generate_series(1,10) i
    ) 
    select aa.address, attr.ping_status, attr.whois
    from all_addresses aa
      left join ip_attributes attr on attr.address = aa.address
    ;
    will output the following:

    Code:
    address   | ping_status | whois            
    ----------+-------------+------------------
    10.1.1.1  | NOT OK      |                  
    10.1.1.2  | OK          | Arthur Dent      
    10.1.1.3  |             |                  
    10.1.1.4  |             |                  
    10.1.1.5  |             |                  
    10.1.1.6  |             |                  
    10.1.1.7  |             |                  
    10.1.1.8  |             |                  
    10.1.1.9  | NOT OK      | Zaphod Beeblebrox
    10.1.1.10 |             |
    By changing the parameter for the generate_series() call you can generate any number of ip addresses you like without actually storing them.
Again, this approach looks very promising and I will have to experiment. One thing (of many) that I don't know yet is whether or not my final data store will gracefully handle incremental column updates (which would be necessary if I switch to this more dynamic approach), or if I would need to persist the complete data store in the table so that I could re-export the full information for rows which need updating... that probably doesn't make a lot of sense... let me try an example.

Say that initially I have a utility which can only do the ping test. So I run it against the complete IPv4 footprint for country XX. Later, I create a utility which can do the nslookup and run it. At this point, I could push the new column into the final data store (if that type of operation is supported), or I may have to create a csv of all of the rows with both the original ping info and the new nslookup info and do a complete ingest/indexing.

For a final goal, think ERIPP (http://eripp.com/) + Shodanhq (http://www.shodanhq.com/) + whatever else I can come up with...
#12
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,785
    Rep Power
    348
    Originally Posted by bjquinniii
Say that initially I have a utility which can only do the ping test. So I run it against the complete IPv4 footprint for country XX. Later, I create a utility which can do the nslookup and run it. At this point, I could push the new column into the final data store (if that type of operation is supported), or I may have to create a csv of all of the rows with both the original ping info and the new nslookup info and do a complete ingest/indexing. For a final goal, think ERIPP (http://eripp.com/) + Shodanhq (http://www.shodanhq.com/) + whatever else I can come up with...
    Adding a new (nullable) column to an existing table is basically a no-op and is extremely fast:
    Code:
    psql (9.3.0, server 9.2.2)
    Type "help" for help.
    
    postgres=# select count(*)
    postgres-# from foo;
    
      count
    ----------
     29999999
    (1 row)
    
    
    Time: 3466,883 ms
    postgres=# ALTER TABLE foo ADD COLUMN bar varchar(20);
    ALTER TABLE
    Time: 1,458 ms
    postgres=#
    Another alternative to the "dynamic attributes" problem is the hstore datatype which is essentially a key/value store inside a single row:
    Code:
    create table ip_attributes
    (
      address inet not null primary key,
      attributes hstore
    );
    
    insert into ip_attributes
    values
    ('10.1.1.1'),
    ('10.1.1.2'),
    ('10.1.1.9');
    
    select address, attributes -> 'ping' as ping, attributes -> 'nslookup' as nslookup
    from ip_attributes
    order by 1;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 |      |         
    10.1.1.2 |      |         
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('ping', 'OK')
    where address = '10.1.1.1'::inet;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 | OK   |         
    10.1.1.2 |      |         
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('nslookup', 'Zaphod')
    where address = '10.1.1.2'::inet;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 | OK   |         
    10.1.1.2 |      | Zaphod  
    10.1.1.9 |      |         
    
    update ip_attributes
      set attributes = hstore('nslookup', 'Arthur')
    where address = '10.1.1.1'::inet;
    
    select address, attributes -> 'ping' as ping, attributes -> 'nslookup' as nslookup
    from ip_attributes
    where attributes ? 'nslookup'
    order by 1;
    
    address  | ping | nslookup
    ---------+------+---------
    10.1.1.1 |      | Arthur  
    10.1.1.2 |      | Zaphod
You can add any attribute without the need to alter the table. The column can be indexed, and queries testing the keys or the values are extremely fast (e.g. attributes ? 'nslookup' or attributes @> hstore('ping','OK')).
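For example (a sketch):

Code:
-- a GIN index on the hstore column serves the ? and @> operators shown above
create index idx_ip_attributes_attrs
  on ip_attributes using gin (attributes);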

    More details in the manual: http://www.postgresql.org/docs/current/static/hstore.html
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#13
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    158
    Rep Power
    10
    Third purpose is to evaluate how suitable PostgreSQL is for this particular project. I had started in MySQL but was running into size issues when I started trying to add indexes (even at 1/4 the full table size, when I tried adding the first index, MySQL would report that the table was "full").
MySQL is a toy database, not suited for serious applications.

    But that said; have you spent any time thinking about the size of what you are doing?

Again, 4.2 billion IP addresses: if you store just 400 bytes of data per address and you somehow manage to get just half of the addresses mapped, you are talking about 2.1 billion × 400 bytes ≈ 840 GB of data, just under one terabyte.

It seems to me that generating the actual addresses is the least of your worries. What you should be worried about is how to manage a 4-5 TB database.

PostgreSQL allows you to do partitioning, which you will absolutely have to do, because otherwise your indexes will be gigabytes in size and you simply will not have enough memory to deal with them (reading them from disk is too slow).

    Speed is not particularly important to me, just capacity/capability. Completed rows will eventually be pushed out to CSV and jammed into (probably) an HBASE store of some sort.
Hstore doesn't solve any problems here; you need terabytes of disk space and many, many gigabytes of RAM to deal with the data.


Look into inheritance and the partitioning of data in PostgreSQL. That is where your real problem is.
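(For illustration, a minimal sketch of what that looks like in 9.3, where partitioning is built from inheritance plus CHECK constraints — table and column names assumed from earlier in the thread:)

Code:
-- children carve the space up by first octet; the parent itself stays empty
create table ipv4_p0 (check (oct1 between 0   and 63))  inherits (ipv4);
create table ipv4_p1 (check (oct1 between 64  and 127)) inherits (ipv4);
create table ipv4_p2 (check (oct1 between 128 and 191)) inherits (ipv4);
create table ipv4_p3 (check (oct1 between 192 and 255)) inherits (ipv4);

-- with constraint_exclusion = partition (the default), a query such as
--   select * from ipv4 where oct1 = 10;
-- only scans ipv4_p0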
#14
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,785
    Rep Power
    348
    Originally Posted by Vinny42
MySQL is a toy database, not suited for serious applications.
As much as I share your dislike for MySQL, that statement isn't true - especially when it comes to partitioning, which is (unfortunately) still a bit better in MySQL than in Postgres (although the mere existence of an error message that says "table is full" is kind of funny, if not embarrassing).

Hstore doesn't solve any problems here; you need terabytes of disk space and many, many gigabytes of RAM to deal with the data.
bjquinniii was talking about HBASE, not hstore.

    It was me who brought hstore into the game. I think(!) it can actually help in reducing the storage if the attributes are sparse, i.e. only a small percentage of the rows has a specific attribute and another small percentage has a value for a different attribute. But I'm not entirely sure on that. It heavily depends on the overhead of a single column in the row header - especially for rows that don't have values for a specific column.
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#15
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    30
    Rep Power
    1

    Partitioning and Inheritance


    Originally Posted by Vinny42
MySQL is a toy database, not suited for serious applications.

    But that said; have you spent any time thinking about the size of what you are doing?

Again, 4.2 billion IP addresses: if you store just 400 bytes of data per address and you somehow manage to get just half of the addresses mapped, you are talking about 2.1 billion × 400 bytes ≈ 840 GB of data, just under one terabyte.

It seems to me that generating the actual addresses is the least of your worries. What you should be worried about is how to manage a 4-5 TB database.

PostgreSQL allows you to do partitioning, which you will absolutely have to do, because otherwise your indexes will be gigabytes in size and you simply will not have enough memory to deal with them (reading them from disk is too slow).



Hstore doesn't solve any problems here; you need terabytes of disk space and many, many gigabytes of RAM to deal with the data.


Look into inheritance and the partitioning of data in PostgreSQL. That is where your real problem is.
    Partitioning I understand (conceptually, haven't worked with it yet, but do understand the importance for this project), but what would inheritance do for me?

Disks are cheap, so I am not concerned about space (except for the whole 'how to manage' bit). I'm currently doing my development work on a box with 32 GB RAM and about 6 TB of hard drive space (the current tablespace is using a couple hundred gigs on one 3 TB drive). I'll scale up RAM and hard drives as necessary.

    But looking at your (and Shammat's) answers, it looks like I'll be heading down a pure Postgres road (the hstore data type looks like it will be a good fit for what I thought I was going to have to use HBASE for).