|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
|
|
Get inside! Sample the range of functionality easily built with JMSL Library for Time Series Data Analysis, Heat Maps, Portfolio Optimization, Monte Carlo Simulation, Stock Price Charting and more. Download Now! |
|
#1
|
|||
|
|||
|
Hi,
I'm working with a really large (over 7 million records ) tab delimited file with many columns. I'm fairly new to UNIX, but it seems like the best choice for splitting up my file.My file contains information on people and where they live. I need to split it into smaller files based on where people live. For example the file contains: Code:
ID FirstName LastName Town 1 John Smith Chicago 2 Mary Jones New York 3 Judy Johnson Chicago 4 Mark George Los Angeles 5 Susan Hatfield New York This file would then be split into 3 files (one for each Town). I've looked into using csplit since it works with large files, but I can't see how I would indicate that I want to divide on only that particular column. Any suggestions? Thanks in advance! Courtenay |
|
#2
|
|||
|
|||
|
put in sedcmd:
Code:
#n /.*New York/w ny /.*Chicago/w ch /.*Los Angeles/w la then enter: sed -f sedcmd inputfile |
|
#3
|
|||
|
|||
|
Thanks!
Thanks a ton - it worked perfectly!
Cheers, Courtenay |
|
#4
|
|||
|
|||
|
OOPS - Just realized that I run into problems when the name of a town appears in more than one column in a person's record. (Which happens to be the case with many of the records in my file)
Is there a way to specify which column I want to have searched for splitting purposes? Thanks! Courtenay |
|
#5
|
|||
|
|||
|
a) decide what format your file has
b) what is the field separator my sed works on: aa bbb new york and of aa new york new york new york new york assumig the fieldsep is a space and you look 4 the 3. field /\(.*\) \(.*\) \([Nn]ew [Yy]ork\).*/w ny will work. |
|
#6
|
|||
|
|||
|
Thanks again!
Huge help! I got it to work
. Although I found I had to add a search for an extra colmun after the search for the name of the town so I didn't end up repeating the search for columns after the third one. I think this requires that I can't have the column I'm searching and then spliting on be the last column in the row, but otherwise it seems to work.Here is what I did (This creates a file of all records from a tab delimited file where the 4th column has the word "New York"). /\(.*\) \(.*\) \(.*\) \([Nn]ew [Yy]ork\) \(.*\)*/w ny Thanks a ton! Courtenay |
|
#7
|
|||
|
|||
|
long wait?
Hi again,
I've just tried running the code on a file with close to a million records (254MB). It works correctly, but it is taking hours to create just 2 split files. In fact - after an hour the 2 new files reached only about 3MB a piece. After that, it appeared no more records were being added to either file. Is that normal? Here is what I'm doing: I search the 49th column in a tab delimited file for 01 and 02 to create separate files for each record. In the sedcmd.txt file I have the following: Code:
/\(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(01\) \(.*\)*/w 01 /\(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(.*\) \(02\) \(.*\)*/w 02 To run this in the command prompt I type: Code:
sed -n -f sedcmd.txt large_file.txt Could there be something I need to change in memory settings? I'm using cygwin which allocates 384 MB of memory (program+data). Does this need to be changed? If so, what exactly do I need to type in the bash shell? I'm very hesitant about changing the registry. Or am I missing something in my code? Thank you so much for all of your help! Courtenay |
|
#8
|
|||
|
|||
|
don't change settings.
i never used sed on a 256mb file ![]() post 10 lines of your big-file, maybe there is a better, more efficient regexpr |
|
#9
|
|||
|
|||
|
Sample of big file
Hi,
Here is a sample of the first 10 records. Unfortunately the columns are not lining up properly in the display below. Hopefully it will still make sense. I'm trying to split on the 49th column labled "td". This is actually one of the smaller files I'm working with. Others are 2GB+. Thanks, Courtenay Code:
mpid hhid vid title fname mname lname suffix stnum stpredir stname stsuf stpstdir staptt stapt stoe city state zip5 zip4 acod staddress maddress mcity mstate mzip5 mzip4 mdp mcrrt mlotc mlots macod pdate pcode ncode ndate gender group race ethnic pob affno absent bdate dob rdate sales code district td fd ad status town zone s2004 g2003 p2003 s2003 g2002 p2002 s2002 g2001 p2001 s2001 g2000 p2000 pp2000 s2000 g1999 p1999 s1999 g1998 p1998 s1998 g1997 p1997 g1996 p1996 g1995 p1995 g1994 p1994 p1992 g2003abs p2003abs g2002abs p2002abs g2000abs p2000abs email code2 code3 fn_check prec_02 dupgrp duperror matched ge11_00 ge11_02 ge11_03 ge11_00a ge11_02a ge11_03a sp6_04 phone unverphone ID 61257299 1 1 Mr. John Anbadges E Otisfield AZ 53592 S Otisfield AZ 53592 M Z U / / 10/4/2002 13 1 48 22 St. George KNO U U 2 John N 11 61257300 1 11 Ms. Mabel Anbadges E Otisfield AZ 53592 S Otisfield AZ 53592 F X U / / 10/4/2002 13 1 48 22 St. George KNO U U 2 Mabel N 12 61300010 2 1 Ms. Nicole Astbury E Blue Hill AZ 53592 S Blue Hill AZ 53592 F Z U / / / / 9 2 37 28 Blue Hill HAN R R1 Nicole N 13 61259304 3 1 Ms. Marge Balacock E Castine AZ 53592 S Castine AZ 53592 F Z U / / 10/23/2002 9 2 37 31 Castine HAN R R1 Marge N 14 52428301 4 1 Ms. Anna E. Balowitz E Tenants Harbor AZ 53592 S PO Box 182 Tenants Harbor AZ 53592 182 82 B002 1 A F Z U 19730810 8/10/1973 / / 13 1 48 22 A St. George KNO Y U U 1 Anna N 15 61300011 5 1 Mr. Nelson Henry Beaudoin E St. George AZ 53592 S St. George AZ 53592 M Z U / / / / 13 1 48 22 St. George KNO U U 1 Nelson N 16 61300012 6 1 Mr. Andrew D. Benner E St George AZ 53592 S St George Rd St George AZ 53592 M X U / / / / 13 1 48 22 St. George KNO D D 1 Andrew N 17 52428401 7 1 Mr. David F. Bennett E Tenants Harbor AZ 53592 S PO Box 463 Tenants Harbor AZ 53592 463 63 B005 1 A M X 2 19420606 6/6/1942 / / 13 1 48 22 A St. George KNO Y U U 1 David N 18 61300013 8 1 Ms. Janet Boulter E Gray AZ 53592 S Gray AZ 53592 F X U / / / / 5 1 11 Gray CUM R R1 Janet N 19 61300014 9 1 Ms. Judith Bragar E Blue Hill AZ 53592 S Blue Hill AZ 53592 F X U / / / / 9 2 37 28 Blue Hill HAN U U 1 Judith N 20 |
|
#10
|
|||
|
|||
|
fds = file descriptors
the sed is correct, you cannot hope more, 2 probls a) the data amount b) sed open and close the fds every time it write to it this costs time as exemple, in a shell Code:
echo >output echo >>output echo >>output and Code:
( echo echo echo ) >ouptut is really not the same, in the 1.case you open and close 3 times the 'output' file, in 2. once! this does not matter working on small files, but in your case it costs a lot of I/O operations. i would try at least a perl (or better c program) opening once fds and closing at the end. i don't know how many fds perl can handle, c has no limits (using integers). if the file is formatted, i mean the 49col begins always at pos X you don't need to split the string, this also will speed up the prog. i can help you writing a c prog. |
|
#11
|
|||
|
|||
|
thanks again
Thanks! Your explanation on why it's taking so long makes a lot of sense.
I know c++, so I would be more comfortable with c than with perl. If you are willing to help me I would really appreciate it! Thanks, Courtenay |
|
#12
|
|||
|
|||
|
i played a little with your probl, here an example
it takes 1,3 mins for 1 MIO records, records are 80 char && i a looking for the 4. tab. feel free to adapt it for your requirements. PS: my sys is an old sparc (>6years) and solaris7 Code:
#include <stdio.h>
#include <fcntl.h>
/*
NOTA: this is a 30mins toy, by guggach@yahoo.com,
not really a programm, just an example.
- 'regexp' AND 'fname' HAVE to be UNIQUE
- this version DOES NOT supports spaces in 'regexp'
- spaces in 'fname' are also a very bad practice
- i use a lot of pointers, i love them and they are the power of 'c'
because a lot of people does not understand and miss-uses them,
'c++' has NO pointers :(
- i also like columning, it makes soooo easy to see errors :)
- to compile: (g)cc -o execname thisfile.c
- to run: execname inputfile
*/
/*
assumptions:
2*MAXBUFF is large enough to hold 1 record
and MAXBUFF is large enough to hold 1 regexpr
in a productive-prog i would check both
*/
#define MAXBUFF 256
#define ONPOS 4
#define TRAILER '\t' /* this is a tab char */
typedef struct { char *regexp, *fname; int fd; } MYREG;
MYREG myreg[] =
{{ "abc" , "/home/user/abc" , 0 }
,{ "1234" , "/tmp/abc" , 0 }
,{ "ytr" , "/var/tmp/abc" , 0 }
,{ "qwe" , "/tmp/qwe" , 0 }
,{ "zzzz" , "zzzz" , 0 }
,{ (char *)NULL , (char *)NULL , 0 }};
int main(int argc, char **argv)
{
int openonce(char *, int);
FILE *input;
MYREG *mr;
char buff[2*MAXBUFF];
char regexp[MAXBUFF];
char *bpt;
char *npt;
register int tabs;
if(!*++argv) /* i love that, it's for 'c' programmer only */
exit(printf("give me a file-name to read\n"));
if(!(input = fopen(*argv,"r")))
exit(printf("cannot read %s\n",*argv));
while(fgets(buff,sizeof(buff),input)){
/* set a pointer on 'buff' && walk until ONPOS is reached */
for(tabs = 0, bpt = buff; *bpt && tabs != ONPOS; ++bpt)
if(*bpt == TRAILER) ++tabs;
if(tabs != ONPOS) continue; /* not reached, scan next input */
while(isspace(*bpt)) ++bpt; /* purge possible leading spaces */
/* set a pointer on 'regexp' and isole it */
for(npt = regexp; *bpt && !isspace(*bpt); ++bpt) *npt++ = *bpt;
if(npt == regexp) continue; /* nothing appropriate found */
*npt = 0; /* terminate 'regexp' */
/* scan for 'regexp' the list of regexpression */
for(mr = myreg; mr->regexp && strcmp(regexp,mr->regexp); ++mr);
if(!mr->regexp) continue; /* nothing appropriate found */
/* if not still open, try to open the file 'fname' */
if(!(mr->fd = openonce(mr->fname,mr->fd))) continue;
/* OK, put this record in that file */
write(mr->fd,buff,strlen(buff));
}
/* housekeeping,
not really necessary in this code but a good practice */
fclose(input);
for(mr = myreg; mr->regexp; ++mr) if(mr->fd) close(mr->fd);
exit(0);
}
int openonce(char *name, int fd)
{
/* ATTENTION: no check about filename !!! just try opening */
if(fd) return(fd);
if(0 >(fd = open(name,O_WRONLY|O_CREAT,0644))) return(0);
return(fd);
}
|
|
#13
|
|||
|
|||
|
Wow! THANKS!!! I'll keep you posted on how it goes as I adapt it to my files.
I can't thank you enough for helping me out! Cheers, Courtenay |
![]() |
| Viewing: Dev Shed Forums > Operating Systems > UNIX Help > split large file by groups in a column |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|