#1
Parsing a very large file in Unix
Hi all, I am trying to write a shell script that parses a very large text file of around 1.3 - 1.5 million records. I am parsing this large client detail text file based on the SSNs that I have stored in another text file. While parsing this detail file the script is literally crawling: even after a couple of days it had not finished parsing the input file.
What I am doing is looping over the file that holds the SSNs (10,000 records), reading one SSN at a time and then grepping the large text file to see whether it has any records matching the SSN I am checking; if so, I write them to a new file. If any of you out there can give me any kind of suggestion that would expedite the process, it would be very, very helpful. I am working to a deadline and I have to complete this task before this weekend. Thanks in advance.
|
#2
Divide your small SSN file into several pieces of, say, 2000 lines each. The Unix split command will do this for you; read the man page.
Next, give grep, say, 10-50 SSNs to work on at one time. In this example it reads through the file looking for 10 SSNs instead of one. Code:
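#!/bin/ksh
# script: get_ssn
# $1 = SSN input file name
# ssnfilename is the monster file
arguments=""
let i=0
while read ssn
do
    # collect each SSN as a separate -e pattern
    arguments="$arguments -e $ssn"
    let i=i+1
    if [ $i -eq 10 ]; then
        # search the big file for 10 SSNs in a single pass
        grep $arguments ssnfilename
        let i=0
        arguments=""
    fi
done < "$1"
# flush any SSNs left over when the count is not a multiple of 10
if [ $i -gt 0 ]; then
    grep $arguments ssnfilename
fi
exit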
Finally, run the above code in separate processes, something like this: Code:
Process #1: get_ssn ssninputfile1 > ssnoutputfile1
......
get_ssn ssninputfile5 > ssnoutputfile5
You can create processes that run in the background by adding a <space>& at the end of the command.
Since grep opens files read-only there will not be any file contention, but this whole deal will be I/O bound - the disk where the big file lives will see hundreds of I/O requests per second. However, most modern disk controllers have a data cache of 10-20MB, so part of the file will be cached in memory at all times. This multiple-process trick will work better on a multi-processor system. You should see a really substantial gain.
The numbers I chose, like 5 sub-processes, were a guess. You can adjust them. I think a command in ksh can be up to 4096 characters long, so the grep argument string could be longer than I made it.
As a side note, the SSN files that make up the patterns could be converted into so-called 'pattern files', so you call grep like this: Code:
grep -f pattern_file ssnfilename
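For what it's worth, the split, background-process and pattern-file ideas above can be combined into one function. This is only a minimal sketch under some assumptions: the function name parallel_ssn_grep and its arguments are made up for illustration, mktemp -d is assumed to be available, and -F is added so that grep treats each SSN as a fixed string rather than a regular expression.

```shell
#!/bin/sh
# Sketch only: parallel_ssn_grep and its arguments are hypothetical names.
# Usage: parallel_ssn_grep SSN_FILE BIG_FILE OUT_FILE [LINES_PER_CHUNK]
parallel_ssn_grep() {
    ssn_file=$1
    big_file=$2
    out_file=$3
    chunk_lines=${4:-2000}      # 2000 is the guess from the post above

    work_dir=$(mktemp -d) || return 1

    # Cut the SSN list into pattern files of chunk_lines lines each.
    split -l "$chunk_lines" "$ssn_file" "$work_dir/ssn."

    # Start one background grep per pattern file.
    # -F: patterns are fixed strings; -f: read the patterns from a file.
    for chunk in "$work_dir"/ssn.*; do
        grep -F -f "$chunk" "$big_file" > "$chunk.out" &
    done
    wait    # block until every background grep has finished

    # Collect the per-chunk results into one output file.
    cat "$work_dir"/ssn.*.out > "$out_file"
    rm -rf "$work_dir"
}
```

It could be called as parallel_ssn_grep ssnfile ssnfilename matches.txt. Two caveats: a record that happens to contain SSNs from two different chunks would be written twice, and grep matches the SSN anywhere in the line, not only in the SSN column.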
|
#3
Maybe not a big job, but
what's an SSN? (Sorry for my ignorance.)
|
#4
SSN is a Social Security Number - it's a unique personal identifier issued by the government, used by businesses in the USA - especially for anything to do with banking, credit, or taxes.
|
|
#5
Shell is not the tool for this job.
I suppose you are not allowed to sort the BIGfile, but you can certainly sort the ssnfile with duplicates removed; that helps a little by avoiding duplicate loops. As jim said, an SSN is unique, so find a tool that STOPS once it has matched: [ef]grep runs through the whole file 10000 times, even after it has already found the unique string. If you use grep, prefer fgrep, which matches fixed strings.
-----
Maybe the best way is: put the BIGfile in a db (e.g. MySQL), construct an SQL query out of the ssnfile, and execute it.
-----
sed could also help, but it has limits: (not sure) 200 commands at a time. BUT it can be stopped.
-----
A quick-and-dirty way, I would:
- number the BIGfile, [pre|ap]pending a number to every entry
- sort both files on the SSN key, isolating the SSN in the BIGfile
- look for common entries using comm, diff, or another tool (no man pages available at the moment)
- the [pre|ap]pended entry number says where the entry is in the original BIGfile.
-----
Lastly, if performance is really needed, write your own C program.
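The sort-and-compare idea can be sketched with sort and join. This is a rough illustration only, under one big assumption: the SSN is the first whitespace-separated field of every record in the BIGfile. The function name match_by_join is hypothetical, and the line-numbering step from the list above is left out, so the matches come back in SSN order rather than in their original order.

```shell
#!/bin/sh
# Sketch only: match_by_join is a hypothetical name.
# Assumes the SSN is the first whitespace-separated field of each record.
# Usage: match_by_join SSN_FILE BIG_FILE
match_by_join() {
    ssn_file=$1
    big_file=$2
    work_dir=$(mktemp -d) || return 1

    # join needs both inputs sorted on the join field with the same
    # collation, so the locale is pinned to C for every command.
    LC_ALL=C sort -u "$ssn_file" > "$work_dir/ssn.sorted"       # -u drops duplicate SSNs
    LC_ALL=C sort -k1,1 "$big_file" > "$work_dir/big.sorted"

    # Print every BIGfile record whose first field appears in the SSN list.
    LC_ALL=C join "$work_dir/ssn.sorted" "$work_dir/big.sorted"

    rm -rf "$work_dir"
}
```

Each file is read once for sorting and once for the join, instead of the BIGfile being scanned once per SSN. Note that join rewrites the field separators in its output as single spaces.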
|
#6
Simple. You will find below the fastest program/script to accomplish the task. I have used Perl to illustrate the solution; you can easily translate it to C or any other language you may need to use. To build an in-memory index of SSNs, do something like this in Perl. Code:
#!/usr/bin/perl
# Pass 1: read the master file (named on the command line) and build the index.
while (<>) {
    # is there anything that looks like an SSN?
    if (m/((\d\d\d)-(\d\d)-(\d\d\d\d))/) {
        # use the SSN from the master file as an offset into a one-bit array
        # contained in the scalar variable $ssndb, and set the corresponding bit
        $x = int("$2$3$4");
        vec($ssndb, $x, 1) = 1;
    }
}
# we are done with building the database
# now read the input file containing the SSNs for which you want
# to know whether the SSN is valid, or do whatever you like, as follows
open INP_SSN, "<my_input_ssn_file" or die "cannot open my_input_ssn_file: $!";
while (<INP_SSN>) {
    chomp;
    ($inp_ssn = $_) =~ s/\D//g;    # keep only the digits of the SSN
    next unless length $inp_ssn;
    if (vec($ssndb, $inp_ssn, 1)) {
        print "The ssn $inp_ssn is valid !\n";
    }
}
If, instead of printing, you want to do more serious processing once an SSN is found to be valid, the first loop should also build another Berkeley DB (bdb) file containing the start address and length of each record in the master file. The inp_ssn can then be used to retrieve the offset and length from the bdb, and the original matching record can be read at that offset for further processing.
Hope this solves your problem. If you need any further help, send me an email or post back.
babumovva
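If the goal is to pull the matching records themselves out of the big file in a single pass, a similar in-memory lookup can be sketched with awk. To be clear, this swaps in a different technique from the Perl above: an awk associative array keyed on the 10,000 SSNs, not a bit vector built from the master file. It assumes the SSN is the first whitespace-separated field of each master record, and the function name match_in_one_pass is made up for the example.

```shell
#!/bin/sh
# Sketch only: match_in_one_pass is a hypothetical name.
# Assumes the SSN is the first whitespace-separated field of each record.
# Usage: match_in_one_pass SSN_FILE BIG_FILE
match_in_one_pass() {
    # NR == FNR holds only while awk is reading the first file (the SSN
    # list), so that rule loads the lookup table. The second rule then
    # prints each BIGfile record whose first field is in the table.
    awk 'NR == FNR { wanted[$1] = 1; next }
         $1 in wanted' "$1" "$2"
}
```

The big file is read exactly once and the matches keep their original order. The lookup table only holds the short SSN list, so memory use should stay small. The sketch expects a non-empty SSN file.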