January 20th, 2003, 10:50 AM
limitations of a HD? (using numerous directories & files to split-up/sort databases)
I want to store unique IPs. I may potentially be storing up to 1,000,000 IPs at a time. Keep in mind that IPs are 4 bytes each.
One thought is to use a binary tree so the search for an existing IP is quick. However it is slow to load in 1,000,000 IPs from the disk each time.
Another thought is to use 256 different files (0.dat .. 255.dat) that stores only the IPs that end in the same name. This brings down 1,000,000 IPs in one file to an average of only 3,906 IPs in each file. Loading in only 4,000 IPs each time is much quicker than 1,000,000. (A binary tree optimization can still be used in this method too.) This uses the HD itself as a search function. However, is 256 files in one directory too much? Is this a slow search that I could somehow implement faster?
You can even take this one step further and use 256 directores and 256 files in each directory. (I am assuming that 65,536 files in one directory is too much). The directories correspond to the second to last number in the IP, and the files correspond to the last number in the IP. This *really* uses the HD as searching/sorting functions. The upside is this brings the average count of IPs in each file down from 1,000,000 to only 15. Once the file is found, the load time would be insignificant. (No binary tree would be needed either.) However, is this too much strain on a HD?
I would love to hear your input on this problem.
Last edited by Doucette; January 20th, 2003 at 10:57 AM.
January 20th, 2003, 12:47 PM
1,000,000 x 4 Bytes + some overhead = about 4MB.
what machine do you plan to run your program on that cannot load 4MB into RAM?
The HD is the slowest part of your system, so you donīt want to setup multiple files or directories as they produce overhead on accessing them (read FAT, read root dir, read dir, read file).
even if you didnīt have 4MB of free ram, you still want to use one file and seek in it.
For the limits: depends on your filesystem.
January 20th, 2003, 01:52 PM
The problem is not loading 4MB into RAM, the problem is that with 1,000,000 hits/day that is 12 hits/second on average. So the problem is loading 4MB into RAM 12 times every second. Also, 4 bytes/IP only stores the IP. I want to store referrer information too (URLs), so that will be considerably more information.
Sorry, I should have explained all of that, but I wanted to keep my post short.
Also, I agree that the HD is the slowest part of a system. No matter how I look at it I have to access the HD. (Unless the script runs completely in shared memory, which I am not sure how to code yet).
The question is basically: Which method of accessing the HD is the fastest?
Last edited by Doucette; January 20th, 2003 at 01:55 PM.
January 20th, 2003, 02:08 PM
you should choose the method that has the least hard disk access, i.e. load the whole file into the ram.
for a CGI program, you might want to use a server program that keeps the file in memory and the CGI scripts connect to this server for queries. like a database (anyway, why donīt you put it into a mysql DB and forget about all your problems?)
databases are already optimized for this special problem, so why not use one? Especially in a high-traffic environment like you described...
January 20th, 2003, 02:35 PM
I used MySQL and found it really slow. It was with Perl/DBI. I am not sure why it was so slow, but it was. All my CPU was being used up by the MySQL server, "mysqld", on my 400MHz CPU server.
This is why I have come to program in C for the code and for the database too.
Loading in the whole file is the least HD access? Are you sure about that? Loading in one file (out of 256) with 4,000 IPs or loading in just one file with 1,000,000 IPs has to be considerably different. How much time does it take to seek out the file on the HD? At what size of the file does splitting it up become the faster solution (if at all)?
(Loading the whole file each time would be faster if the file got cached. Loading hundreds (or thousands) of smaller files means you are never accessing the same file each time, so the file does not get cached. I'm not sure of how caching works on a Unix box though.)
Last edited by Doucette; January 20th, 2003 at 02:45 PM.
January 20th, 2003, 02:47 PM
The MySQL developers probably put more time on speed optimization than you have for the whole project.
Anyway, running a website for 12 hits/second on a 400Mhz is kinda adventure... if you really need this, learn assembler and how to optimize assembler for certain CPUs ;)
at the size where your OS has to start swapping...
yes. the way i told you it will only be loaded ONCE per server-restart!
- write a server program in C (using sockets or some other faster? IPC mechanism like SHM)
- load the file into RAM once
- make the CGI scripts query this server (just like a mysql database)
- compare the speed to a mysql database, i bet for MySQL...
but then - i said "use MySQL!", nothing else ...
Last edited by M.Hirsch; January 20th, 2003 at 02:51 PM.
January 20th, 2003, 02:58 PM
Ok I got you now. I obviously know that loading the file once into memory is the fastest. I was just confused with what you were saying. Now, which is the fastest if I ignore that approach and load nothing into memory permanently?
Also, I know that MySQL has put a lot of time into optimizations. But they have built a general purpose database and I can hardcode my database for my likings. Pure C with a hardcoded database can not be beat for speed, all other things equal.
Here's a few MySQL questions: Does MySQL stays loaded in memory? Does it keep the accessed databases in memory? If not, can it? Could this be why my Perl/DBI/MySQL script was so slow? What is "mysqld" specifically anyway? Is it the MySQL server?
As you said, the best solution is to write a C server (or use MySQL to do the same). But before I learn that, what is the best method?
Thanks a million for your input! :)
Last edited by Doucette; January 20th, 2003 at 03:01 PM.
January 20th, 2003, 03:07 PM
depends on how good you are ;) can you beat the MySQL team? (if you wanna try, read their "atomic operations" approach)
probably the PERL interpreter was so slow... for high performance you donīt want to use interpreted languages at all... (there is compilers for PERL, PHP and others on the īnet, but for performance you need a combo of C/ASM)
hard to tell...
- cluster size on the filesystem (how much data is loaded at once anyway depending on your FS, afaik eg. FAT=4k)
- cluster size on the hard disk (how much is loaded by the low-level driver at once)
- choosing the "fastest" filesystem for this special project
- probably use the approach where you split off the first byte. (just my personal guess, no mathematical background (yet)!)
for learning purposes, i would program all of them and profile the code afterwards for comparison...
January 21st, 2003, 10:24 AM
Is that the gist of it?
If MySQL keeps the databases in memory and I end up not being able to code my own databases likewise, then MySQL should beat me out hands down, no matter how well I code.
There is another issue as to why I do not want to use MySQL. I might market my script and it would be nice if there are no MySQL requirements. MySQL complicates the install too.
This probably is the case. But it was the "mysqld" process that was chewing up all the CPU, not the CGI script itself. What is "mysqld" excatly?
I agree. I should code all methods and test them out. Another consideration is that when the files are split up, they are PRE-sorted. This means that loading in 1,000,000 IPs requires a search on all 1,000,000 of those IPs to see if the current IP already exists or not. (That's not exactly true if you use binary search trees, but you get my point.) If I load only 4,000 IPs from one of the 256 files, 219.dat file for example, then I only have to search through 4,000 IPs, not 1,000,000, as I already have performed the necessary sorting calculations to get only the *.*.*.219 IPs.
So splitting the IPs into 256 different files decreases load time (smaller file), decreases search time (less IPs), and increases seek time (more files). The only downfall is the increased seek time, which I am beginning to think is insignificant. However, splitting up the IPs into 256 directores and 65,536 files might be a different story on the seek time.
Do you mean the method where I seperate the IP file into 256 seperate IP files based on the last byte of the IP? (I.E. IP 188.8.131.52 would be put in a 78.dat file along with all other *.*.*.78 IPs.)
Last edited by Doucette; January 21st, 2003 at 10:41 AM.
January 21st, 2003, 03:10 PM
there is another document on mysql.com somewhere describing the speed impact of atomic operations.
if you want to live without MySQL, you need to care about concurrent access and transactions/rollback yourself.
the "mysqld" is the actual MySQL server process. if it is eating up all CPU, you might need to optimize your queries. On the other hand, the database access is the only part of your program that really needs CPU power, so it is just natural that mysql seems to have 100% power.
There is different mysql server versions, iirc one is optimized for serving under high load (mysqld-max ?)
For your approach without a database, your speed will depend very much on a good filesystem.
There is filesystems that implement binary trees for the directory (iirc XFS?). They should imho work best with accessing 1000s of files simultaneously.
If you think about it (independent of the filesystem), it does not make a difference if you take one big file or 1000 small files. On the hard disk, data anyway is stored in one big "file" (ever tried to copy /dev/hda? ;) )
Either you rely on the OS having good and fast routines for accessing it (1000 small files) or you write your own (1 big file).
Did you think about getting a raid controller? (no software raid if you have high load already!) With raid0 or raid5, you can double (4x, 8x, ...) hard disk access speed. and they usually implement hardware cache which will again speed up things.
i suggested using the 256-file method as a middle-way. it relies partly on the OS but 256 files are not that much and you still can optimize access inside these files.
no. it depends on your hardware, OS and the filesystem being used.
on a hard drive you have several discs ("heads"). i.e. if you access a file on disc #1 and immediately afterwards a file on disc #2, youīll have better speed than two times accessing disc #1. - unless the second file is in the sector that lies directly after the first file, i.e. no seeking time...
probably one could make a lot of statistic calculations about this issue - sorry beyond my scope ;)
January 21st, 2003, 08:32 PM
Fact is that the HD is the slowest part, so you'll want to minimise read/write operations. Given enough memory, you might be able to rely on the OS caching all of it, but you might run into situations where some other process comes along and "claims" the cache.
The question is, will data be mostly read or written? In case of the former, you might actually consider the "let the OS cache it all" approach, and maybe just add a strip of RAM if need be.
In case of the latter, things get a little more complicated. If there's any chance of two or more processes (or threads, whatever) writing to the same space/file, you'll have to deal with concurrency.
If concurrency is not an issue, then you might consider using two processes and IPC. One that holds the database in-core, and periodically writes (part of) the file back to disk, and the other process telling the database process which entries to mutate.
As soon as multiple processes will be wanting to mutate the database, your best bet (IMHO) would be the basic client/server approach. Although in concept it's not that different from the two-process setup I mentioned, it takes much more careful programming to make sure that clients don't pile up, waiting to talk to the database process. This means choosing the right IPC system is paramount.
Or you could use PostgreSQL's approach, which uses a sort-of intermediary:
That way, whenever there's a higher load than usual, mutation requests from clients can pile up in a FIFO buffer in the server, allowing the client to move on. Obviously, this only works if the client isn't expecting a response from the back-end.
| client |<
---------- \ ---------- ------------
|---> | server | <---> | back-end |
---------- / ---------- ------------
| client |<
Anyway, just my €0.02 :)
"A poor programmer is he who blames his tools."
January 24th, 2003, 10:20 AM
I believe the origninal question was to sort/select from 1 million entries that are 4 bytes each. An ip address is nothing more than a 32-bit number (viewed in hex translated to decimal). Dimension your lookup table as 1 million 32-bit numbers and, if needed, a sub-field. You'll put this all in one file (reading 4MB is slow, but it'll probably only be on startup). As this is C, you can set/change a LOT of things, so try and set your file cache(buffer) to 4MB and use it as your "on-line" disk access. The opposite of putting it in memory is good, too, but that depends on the machine. Use you disk access scheme to access the other info that you need to retrieve.
If you MUST have the data all in one go, then just increase the file buffers and put your info in a fixed length, binary file:
unsigned long IP
unsigned char Octet
Forgive my pseudo-code, as it has been a while, but that's basically how we did it in the days of the dinosaurs. ;)
Best of luck!
January 24th, 2003, 11:56 AM
Another thing I should have mentioned right off the bat:
*** I want to market this script.***
This means various things: It has to be easy to install. This means it has to have as few restrictions and requirments as possible. It has to work on various Unix-based machines. I have to assume that not every machine is tweaked for optimal performance of my script. Etc.
Basically I have to assume I am running it on whatever the most 'average' server setup is, whatever that would be. Does that make sense?
January 24th, 2003, 02:29 PM
It does make sense. But i doubt this is possible. I donīt know of any C program that compiles on all favors of unix (besides the "helloworld.c").
Yes, apache et al do, but do you know how this works? 1000s of #ifdef and the "configure" program is used to resolve OS-, compiler- and distribution-dependent stuff as well as bugs in the implementations of certain functions on certain platforms...
imho you would have to code all of these features, choose the fastest for the OSs that is supported and if it isnīt, descent the slower algorithms until one fits. For speed optimization you cannot put this into normal "if ()"īs but you need to use #ifdef
I would not code it for the "average" system. Define minimum requirements. Maybe not mysql, but GB of RAM and type+MHz of the CPUs. Anyway you have to test your program on any OS that you claim it is working on...
from my experience: for any professional program, you always have to adapt your hardware and/or OS... eg. you canīt run Photoshop on a 386 nor on linux. Nor do premiere, novell, maya, eclipse, oracle, ... And i would not even recommend running apache/php on windows for a production environment. Believe me, i tried... :(
... just my 2ct ...
January 24th, 2003, 03:48 PM
Is this going to be a static list or will you make changes to it (if so, how frequently?). A static list allows you to make assumptions and pre-optimize retrieval.
Otherwise, you get into the realm of good ol' data-entry screens and the "fun" they can provide. If a database scheme is available, such as alluded to above (mySQL, et al), I say use it. Otherwise, you need to develop a very basic database on your own.
I was going to suggest buying a copy of Al Stevens' "C Database Development, 2nd Edition", but I see on Amazon that it's out-of-print (1991, jeez! Am I that old?). If you can somehow get a copy, it shows a very rudimentary Database system to include a nice intro to B(Binary)-trees (and compiles to less that 250k total!)
Another option is good ol' text. Unix is superb with its text processing routines, such as strtok and qsort. As quaint as it seems, it may be a good option. AND you can add entries using emacs, vi, etc.