UNIX Help
 
  #1  
August 5th, 2004, 12:00 PM
knookula
Parsing a very large file in Unix

Hi all, I am trying to write a shell script that parses a very large text file of around 1.3 to 1.5 million records. I am filtering this large client detail file based on SSNs that I have stored in another text file. With more than a million records in the detail file, the script is literally crawling: even after a couple of days it had not finished parsing the input file.
What I am doing is looping over the file of SSNs (10,000 records), reading one SSN at a time, and then grepping the large text file to see whether it has any records matching that SSN; if so, I write them to a new file.
If any of you can suggest anything that would speed this up, it would be very helpful. I am working to a deadline and have to complete this task before this weekend.
Thanks in advance

  #2  
August 5th, 2004, 01:59 PM
jim mcnamara
Divide your small SSN file into several pieces of, say, 2000 lines each. The Unix split command will do this for you; read the man page.
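
A minimal sketch of that split step, assuming one SSN per line. The file name ssnfile, the ssnpart_ prefix and the generated sample data are invented for illustration; only the use of split itself comes from the advice above.

```shell
#!/bin/sh
# Sketch: cut an SSN list into 2000-line pieces with split.
# ssnfile and the ssnpart_ prefix are made-up names for this example.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Build a 5000-line stand-in for the real SSN list.
awk 'BEGIN { for (i = 1; i <= 5000; i++) printf "%03d-%02d-%04d\n", i % 1000, i % 100, i }' > ssnfile

# -l 2000 puts at most 2000 lines in each piece: ssnpart_aa, ssnpart_ab, ...
split -l 2000 ssnfile ssnpart_

pieces=$(ls ssnpart_* | wc -l)
echo "pieces: $pieces"
wc -l ssnpart_*
```

With 5000 input lines this leaves two full pieces and a shorter third one, each of which can be handed to its own process.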

Next, give grep, say, 10 to 50 SSNs to work on at a time. The example below batches 10, so each pass through the file looks for 10 SSNs instead of one.

Code:
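#!/bin/ksh
# script: get_ssn
# $1 = SSN input file name
# ssnfilename is the monster file
arguments=""
let i=0
while read ssn
do
    # add each SSN as one more -e pattern for a single grep call
    arguments="$arguments -e $ssn"
    let i=i+1
    if [ "$i" -eq 10 ]; then
        # $arguments is left unquoted on purpose so it splits into words
        grep $arguments ssnfilename
        let i=0
        arguments=""
    fi
done < "$1"
# search for the last batch when the SSN count is not a multiple of 10
if [ "$i" -gt 0 ]; then
    grep $arguments ssnfilename
fi
exit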


Finally, run the above code in separate processes, something like this:
Code:
Process #1:
get_ssn ssninputfile1 > ssnoutputfile1
......
get_ssn ssninputfile5 > ssnoutputfile5


You can run a process in the background by adding a space and an & at the end of the command.
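
As a rough sketch of the background-job idea: each search is started with a trailing & and the script then calls wait so that it only continues once all of them are done. The get_ssn function here is a simplified stand-in, not the script above, and every file name and record is made up.

```shell
#!/bin/sh
# Sketch: run one search per SSN piece in the background, then wait for all.
# get_ssn is a stand-in function for this example; all data is invented.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' '111-11-1111 alice' '222-22-2222 bob' '333-33-3333 carol' > bigfile
printf '%s\n' '111-11-1111' > ssninputfile1
printf '%s\n' '333-33-3333' > ssninputfile2

get_ssn() {
    # print every record of bigfile that contains an SSN listed in file $1
    while read -r ssn
    do
        grep -e "$ssn" bigfile
    done < "$1"
}

# The trailing & starts each job in the background.
get_ssn ssninputfile1 > ssnoutputfile1 &
get_ssn ssninputfile2 > ssnoutputfile2 &

# wait blocks until every background job has finished.
wait

cat ssnoutputfile1 ssnoutputfile2 > all_matches
cat all_matches
```

Each job writes to its own output file, so the jobs never interleave their results; the pieces are only concatenated after wait returns.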

Since grep opens files read-only there will not be any file contention, but the whole job will be I/O bound: the disk holding the big file will see hundreds of I/O requests per second. However, most modern disk controllers have a data cache of 10-20MB, so part of the file will be cached in memory at all times. This multiple-process trick will work better on a multi-processor system.

You should see a really substantial gain. The numbers I chose, like 5 sub-processes, were a guess; you can adjust them. I think a command in ksh can be up to 4096 characters long, so the grep argument string could be longer than I made it.

As a side note, the SSN files that make up the patterns could serve as so-called 'pattern files', with one pattern per line, so you call grep like this:
Code:
grep -f pattern_file ssnfilename
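
A small sketch of the pattern-file call, assuming the SSNs are plain strings with one per line. Adding -F tells grep to treat them as fixed strings rather than regular expressions. The sample records and their pipe-separated layout are invented; the real file layout is not given in the thread.

```shell
#!/bin/sh
# Sketch: one pass over the big file with every SSN as a fixed-string pattern.
# File contents are invented sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' \
    '111-11-1111|Alice|NY' \
    '222-22-2222|Bob|CA' \
    '333-33-3333|Carol|TX' \
    '111-11-1111|Alice|second record' > ssnfilename

printf '%s\n' '111-11-1111' '333-33-3333' > pattern_file

# -F: patterns are fixed strings; -f: read the patterns from a file.
grep -F -f pattern_file ssnfilename > matches
cat matches
```

Two things to watch: an empty line in the pattern file matches every record, and a pattern matches anywhere on the line, not only in the SSN field.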

  #3  
August 5th, 2004, 07:36 PM
guggach
Maybe not a big job, but
what's an SSN?
(Sorry for my ignorance.)

  #4  
August 6th, 2004, 09:54 AM
jim mcnamara
SSN is a Social Security Number - it's a unique personal identifier issued by the government, used by businesses in the USA, especially for anything to do with banking, credit, or taxes.

  #5  
August 9th, 2004, 07:09 AM
guggach
Shell is not the tool for this job.

I suppose you are not allowed to sort the BIG file,
but you can certainly sort the SSN file with the unique option (sort -u).
That helps a little by avoiding duplicate loops.
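
For illustration, de-duplicating the SSN list could look like the sketch below. The file names and the five sample SSNs are made up.

```shell
#!/bin/sh
# Sketch: drop duplicate SSNs before searching, so no SSN is looked up twice.
# ssnfile and ssnfile.uniq are made-up names; the data is invented.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' '222-22-2222' '111-11-1111' '222-22-2222' \
    '111-11-1111' '333-33-3333' > ssnfile

# -u keeps a single copy of each line while sorting.
sort -u ssnfile > ssnfile.uniq

before=$(wc -l < ssnfile)
after=$(wc -l < ssnfile.uniq)
echo "before: $before, after: $after"
```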

As jim said, an SSN is unique, so find a tool that STOPS once it has matched.
[ef]grep reads the whole file on each of the 10000 runs, even after it has found the unique string.
If you use grep, prefer f(ast)grep.
-----
Maybe the best way is to put the BIG file in a database (e.g. MySQL),
construct an SQL query out of the SSN file, and execute it.
-----
sed could also help, but it has limits (not sure: 200 commands at a time),
BUT it can be stopped.
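
A minimal sketch of the "stop once matched" idea with sed, using invented sample records. It only makes sense if each SSN really occurs at most once in the big file; if a client can have several records, stopping early would lose the later ones.

```shell
#!/bin/sh
# Sketch: print the first record that contains an SSN, then stop reading.
# bigfile and its records are invented sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' \
    '111-11-1111 alice' \
    '222-22-2222 bob first' \
    '333-33-3333 carol' \
    '222-22-2222 bob second' > bigfile

# -n suppresses the default output; on the first matching line,
# p prints it and q quits, so the rest of the file is never read.
first=$(sed -n '/222-22-2222/{p;q;}' bigfile)
echo "$first"
```

GNU grep offers the same behaviour through its -m 1 option, but that option is an extension and may be missing from older grep implementations.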
-----
A quick-and-dirty way; I would:

- number the BIG file, [pre|ap]pending a number to every entry
- sort both files on the SSN key, isolating the SSN in the BIG file
- look for common entries using comm, diff, or another tool
(no man pages available at the moment)
- the [pre|ap]pended entry number says where the entry is in
the original BIG file.
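
The steps above can be sketched roughly as follows. This version uses join as the "other tool" for finding common entries, and it assumes the SSN is the first pipe-separated field of each record; that layout, the file names and the sample data are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: number the records, sort both sides on the SSN, join them,
# then use the record numbers to pull the originals in file order.
# Assumes the SSN is the first |-separated field; all data is invented.
LC_ALL=C
export LC_ALL          # sort and join must agree on the collation order
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' \
    '333-33-3333|Carol|TX' \
    '111-11-1111|Alice|NY' \
    '222-22-2222|Bob|CA' \
    '111-11-1111|Alice|second record' > bigfile
printf '%s\n' '111-11-1111' '333-33-3333' '111-11-1111' > ssnfile

# 1. Isolate the SSN and append the record number, then sort on the SSN.
awk -F'|' '{ print $1, NR }' bigfile | sort -k1,1 > big.keys

# 2. Sort the SSN list, dropping duplicates.
sort -u ssnfile > ssn.sorted

# 3. join prints "SSN record-number" for every SSN present in both files.
join ssn.sorted big.keys | awk '{ print $2 }' | sort -n > wanted_lines

# 4. The record numbers say where each entry sits in the original file.
awk 'NR == FNR { want[$1]; next } FNR in want' wanted_lines bigfile > matches
cat matches
```

The big file itself is never reordered; only the small key file is sorted, and the matches come out in their original order.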
-----
Lastly, if performance is really needed:
- write your own C program

  #6  
August 25th, 2004, 07:53 PM
babumovva
Simple. Below is a fast program/script to accomplish the task. I have used Perl to illustrate the solution; you can easily translate it to C or any other language you need to use.

To build an in-memory index of the SSNs, do something like this in Perl:

#!/usr/bin/perl
use strict;
use warnings;

# The master file is read from the file named on the command line
# (or from standard input).
# Use each SSN from the master file as an offset into a one-bit array
# contained in the scalar variable $ssndb, and set the corresponding bit.
my $ssndb = '';
while (<>) {
    # whether there is anything that looks like an SSN
    if (m/(\d\d\d)-(\d\d)-(\d\d\d\d)/) {
        my $x = int("$1$2$3");
        vec($ssndb, $x, 1) = 1;
    }
}

# We are done with building the database.
# Now read the input file containing the SSNs for which you want
# to know whether the SSN is valid, or do whatever you like, as follows.
open(my $inp, '<', 'my_input_ssn_file')
    or die "cannot open my_input_ssn_file: $!";
while (my $inp_ssn = <$inp>) {
    chomp $inp_ssn;
    (my $digits = $inp_ssn) =~ tr/0-9//cd;    # keep only the digits
    next unless length $digits;
    if (vec($ssndb, int($digits), 1)) {
        print "The ssn $inp_ssn is valid !\n";
    }
}
close($inp);

# If, instead of printing, you wanted to do more serious processing
# once an SSN is found to be valid, the first loop should also build
# a Berkeley DB (bdb) file holding the start offset and length of each
# record in the master file. The input SSN can then be used to retrieve
# the offset and length from the bdb file and read the original
# matching record for further processing.
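
For comparison, a similar in-memory lookup can be sketched with awk from the shell. Note that this swaps the technique: it keeps the wanted SSNs in an associative array and streams the master file once, printing whole matching records, rather than building a bit array over the master file. File names and sample records are invented.

```shell
#!/bin/sh
# Sketch: hold the wanted SSNs in memory, then stream the big file once.
# ssnfile, bigfile and their contents are invented sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' \
    'Alice 111-11-1111 NY' \
    'Bob 222-22-2222 CA' \
    'no ssn on this line' \
    'Carol 333-33-3333 TX' > bigfile
printf '%s\n' '333-33-3333' '111-11-1111' > ssnfile

awk '
    # First file: remember every wanted SSN as a key of the array want.
    NR == FNR { want[$0]; next }
    # Second file: locate the first thing that looks like an SSN and
    # print the record when that SSN is one of the wanted keys.
    match($0, /[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]/) {
        if (substr($0, RSTART, RLENGTH) in want) print
    }
' ssnfile bigfile > matches
cat matches
```

Only the first SSN-looking string on each line is checked, which mirrors the single regular-expression match in the Perl loop.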

Hope this solves your problem.


If you need any further help, send me an email or post back.

babumovva

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 6 hosted by Hostway