January 22nd, 2013, 05:39 PM
Blacklist and files
I have a project, and I have a problem with one part of it.
We have a blacklist file with 4,000,000 lines, and each line contains a 10-digit integer.
Every minute we receive many files, and we must compare these files with the blacklist file and remove every line that appears in the blacklist. Each of these files may have 300,000 lines.
After removing the blacklisted lines, each file should contain only the lines that are not in the blacklist.
I have tried several solutions to this problem, such as using findstr on Windows, and ...
But these solutions are very slow and take a long time (10 minutes for 1 file).
Please help me solve this problem. (I need the fastest way to do this work.)
Sorry for my poor English.
January 23rd, 2013, 12:25 AM
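First make sure you're running a 64-bit version of PHP so that you can handle the numbers as integers rather than strings. A 10-digit number can be as large as 9,999,999,999, which does not fit in a signed 32-bit integer (maximum 2,147,483,647). Without 64-bit integers, this will be very slow no matter what you do.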
Assuming that your blacklist doesn't change or only changes rarely, sort it in numerical order in advance (before processing your files). You can then perform a binary search on the blacklist, which requires only about 22 comparisons per line in the input files (log2 of 4,000,000 is roughly 22). Make sure that you have enough RAM to store the whole blacklist in memory without swapping. If you don't, then again, this will be slow no matter what you do.
Also make sure that you have enough RAM to store the whole input file in memory twice without swapping.
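As a rough illustration of the sorted blacklist and binary lookup described above, here is a minimal sketch. It is written in Python only to keep it short; the same structure should carry over to PHP. The function names `load_sorted_blacklist` and `is_blacklisted` are made up for this example, and it assumes the blacklist has one integer per line.

```python
from array import array
from bisect import bisect_left


def load_sorted_blacklist(lines):
    """Parse one integer per line and return them sorted in a compact array.

    Hypothetical helper: assumes every non-blank line is a valid integer.
    """
    # Type code 'q' stores signed 64-bit values, which is enough for
    # 10-digit numbers and keeps memory use far below a list of objects.
    return array("q", sorted(int(s) for s in lines if s.strip()))


def is_blacklisted(sorted_blacklist, value):
    """Binary search: True if value occurs in the sorted blacklist."""
    i = bisect_left(sorted_blacklist, value)
    return i < len(sorted_blacklist) and sorted_blacklist[i] == value
```

The sorting cost is paid once, when the blacklist is loaded; after that each lookup only touches a handful of elements, so the blacklist can be reused for every incoming file.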
Loop through the input file line by line and perform the binary lookup on the blacklist to determine whether the integer exists in it. If the integer is not in the blacklist, append the integer to a separate buffer that holds non-blacklisted items. At the end of the whole loop, write the separate buffer to your destination file.
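The loop above could look something like the following sketch. Again this is Python used for illustration rather than a definitive implementation; `filter_lines` and `filter_file` are hypothetical names, the blacklist is assumed to be already sorted and held in memory, and every non-blank input line is assumed to be a valid integer.

```python
from bisect import bisect_left


def filter_lines(input_lines, sorted_blacklist):
    """Return the input lines whose integer value is not in sorted_blacklist."""
    kept = []  # separate buffer holding the non-blacklisted items
    n = len(sorted_blacklist)
    for line in input_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        value = int(text)
        # Binary lookup in the sorted blacklist.
        i = bisect_left(sorted_blacklist, value)
        if i == n or sorted_blacklist[i] != value:
            kept.append(text)
    return kept


def filter_file(src_path, dst_path, sorted_blacklist):
    """Filter src_path against the blacklist and write the result once."""
    with open(src_path) as src:
        kept = filter_lines(src, sorted_blacklist)
    # Write the whole buffer in one go at the end of the loop.
    with open(dst_path, "w") as dst:
        dst.write("\n".join(kept))
        if kept:
            dst.write("\n")
```

Collecting the surviving lines in a buffer and writing them once avoids one small disk write per line, which is the reason for keeping the input in memory twice.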
It will probably still take a fair amount of time to run, but you should be able to do it in under 10 minutes.