UNIX Help
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsOperating SystemsUNIX Help

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
Get inside! Sample the range of functionality easily built with JMSL Library for Time Series Data Analysis, Heat Maps, Portfolio Optimization, Monte Carlo Simulation, Stock Price Charting and more. Download Now!
  #1  
Old August 30th, 2004, 09:33 PM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
Red face split large file by groups in a column

Hi,

I'm working with a really large (over 7 million records ) tab delimited file with many columns. I'm fairly new to UNIX, but it seems like the best choice for splitting up my file.

My file contains information on people and where they live. I need to split it into smaller files based on where people live.

For example the file contains:
Code:
ID FirstName LastName Town
1  John        Smith      Chicago
2   Mary        Jones      New York
3   Judy        Johnson   Chicago
4   Mark        George    Los Angeles
5   Susan      Hatfield   New York


This file would then be split into 3 files (one for each Town).

I've looked into using csplit since it works with large files, but I can't see how I would indicate that I want to divide on only that particular column.

Any suggestions?

Thanks in advance!

Courtenay

Reply With Quote
  #2  
Old August 31st, 2004, 05:41 AM
guggach guggach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Jul 2004
Location: Middle Europa
Posts: 1,083 guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 4 Days 19 h 44 m 45 sec
Reputation Power: 9
put in sedcmd:

Code:
#n
/.*New York/w ny
/.*Chicago/w ch
/.*Los Angeles/w la


then enter: sed -f sedcmd inputfile

Reply With Quote
  #3  
Old August 31st, 2004, 11:32 AM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
Thanks!

Thanks a ton - it worked perfectly!

Cheers,
Courtenay

Reply With Quote
  #4  
Old August 31st, 2004, 11:42 AM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
OOPS - Just realized that I run into problems when the name of a town appears in more than one column in a person's record. (Which happens to be the case with many of the records in my file)

Is there a way to specify which column I want to have searched for splitting purposes?

Thanks!
Courtenay

Reply With Quote
  #5  
Old August 31st, 2004, 12:35 PM
guggach guggach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Jul 2004
Location: Middle Europa
Posts: 1,083 guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 4 Days 19 h 44 m 45 sec
Reputation Power: 9
a) decide what format your file has
b) what is the field separator

my sed works on:

aa bbb new york

and of

aa new york new york new york new york

assumig the fieldsep is a space and you look 4 the 3. field

/\(.*\) \(.*\) \([Nn]ew [Yy]ork\).*/w ny

will work.

Reply With Quote
  #6  
Old September 1st, 2004, 11:06 AM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
Thanks again!

Huge help! I got it to work . Although I found I had to add a search for an extra colmun after the search for the name of the town so I didn't end up repeating the search for columns after the third one. I think this requires that I can't have the column I'm searching and then spliting on be the last column in the row, but otherwise it seems to work.

Here is what I did (This creates a file of all records from a tab delimited file where the 4th column has the word "New York").
/\(.*\) \(.*\) \(.*\) \([Nn]ew [Yy]ork\) \(.*\)*/w ny

Thanks a ton!
Courtenay

Reply With Quote
  #7  
Old September 1st, 2004, 05:02 PM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
long wait?

Hi again,

I've just tried running the code on a file with close to a million records (254MB). It works correctly, but it is taking hours to create just 2 split files. In fact - after an hour the 2 new files reached only about 3MB a piece. After that, it appeared no more records were being added to either file.

Is that normal?

Here is what I'm doing:

I search the 49th column in a tab delimited file for
01 and 02 to create separate files for each record.

In the sedcmd.txt file I have the following:

Code:
/\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(01\)	\(.*\)*/w 01
/\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(.*\)	\(02\)	\(.*\)*/w 02


To run this in the command prompt I type:

Code:
sed -n -f sedcmd.txt large_file.txt


Could there be something I need to change in memory settings? I'm using cygwin which allocates 384 MB of memory (program+data). Does this need to be changed? If so, what exactly do I need to type in the bash shell? I'm very hesitant about changing the registry.

Or am I missing something in my code?

Thank you so much for all of your help!

Courtenay

Reply With Quote
  #8  
Old September 2nd, 2004, 04:17 AM
guggach guggach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Jul 2004
Location: Middle Europa
Posts: 1,083 guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 4 Days 19 h 44 m 45 sec
Reputation Power: 9
don't change settings.
i never used sed on a 256mb file
post 10 lines of your big-file, maybe there is a better,
more efficient regexpr

Reply With Quote
  #9  
Old September 2nd, 2004, 11:08 AM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
Sample of big file

Hi,

Here is a sample of the first 10 records. Unfortunately the columns are not lining up properly in the display below. Hopefully it will still make sense. I'm trying to split on the 49th column labled "td". This is actually one of the smaller files I'm working with. Others are 2GB+.

Thanks,
Courtenay

Code:
mpid	hhid	vid	title	fname	mname	lname	suffix	stnum	stpredir	stname	stsuf	stpstdir	staptt	stapt	stoe	city	state	zip5	zip4	acod	staddress	maddress	mcity	mstate	mzip5	mzip4	mdp	mcrrt	mlotc	mlots	macod	pdate	pcode	ncode	ndate	gender	group	race	ethnic	pob	affno	absent	bdate	dob	rdate	sales code	district	td	fd	ad	status	town	zone	s2004	g2003	p2003	s2003	g2002	p2002	s2002	g2001	p2001	s2001	g2000	p2000	pp2000	s2000	g1999	p1999	s1999	g1998	p1998	s1998	g1997	p1997	g1996	p1996	g1995	p1995	g1994	p1994	p1992	g2003abs	p2003abs	g2002abs	p2002abs	g2000abs	p2000abs	email	code2	code3	fn_check	prec_02	dupgrp	duperror	matched	ge11_00	ge11_02	ge11_03	ge11_00a	ge11_02a	ge11_03a	sp6_04	phone	unverphone	ID
61257299	1	1	Mr.	John		Anbadges									E	Otisfield	AZ	53592		S			Otisfield	AZ	53592											M	Z		U					/  /	10/4/2002	13		1	48	22		St. George	KNO																																					U	 U 2	John				N										11
61257300	1	11	Ms.	Mabel		Anbadges									E	Otisfield	AZ	53592		S			Otisfield	AZ	53592											F	X		U					/  /	10/4/2002	13		1	48	22		St. George	KNO																																					U	 U 2	Mabel				N										12
61300010	2	1	Ms.	Nicole		Astbury									E	Blue Hill	AZ	53592		S			Blue Hill	AZ	53592											F	Z		U					/  /	/  /	9		2	37	28		Blue Hill	HAN																																					R	  R1	Nicole				N										13
61259304	3	1	Ms.	Marge		Balacock									E	Castine	AZ	53592		S			Castine	AZ	53592											F	Z		U					/  /	10/23/2002	9		2	37	31		Castine	HAN																																					R	  R1	Marge				N										14
52428301	4	1	Ms.	Anna	E.	Balowitz									E	Tenants Harbor	AZ	53592		S		PO Box 182	Tenants Harbor	AZ	53592	182	82	B002	1	A						F	Z		U				19730810	8/10/1973	/  /	13		1	48	22	A	St. George	KNO											Y																										U	 U 1	Anna				N										15
61300011	5	1	Mr.	Nelson	Henry	Beaudoin									E	St. George	AZ	53592		S			St. George	AZ	53592											M	Z		U					/  /	/  /	13		1	48	22		St. George	KNO																																					U	 U 1	Nelson				N										16
61300012	6	1	Mr.	Andrew	D.	Benner									E	St George	AZ	53592		S		St George Rd	St George	AZ	53592											M	X		U					/  /	/  /	13		1	48	22		St. George	KNO																																					D	D  1	Andrew				N										17
52428401	7	1	Mr.	David	F.	Bennett									E	Tenants Harbor	AZ	53592		S		PO Box 463	Tenants Harbor	AZ	53592	463	63	B005	1	A						M	X		2				19420606	6/6/1942	/  /	13		1	48	22	A	St. George	KNO											Y																										U	 U 1	David				N										18
61300013	8	1	Ms.	Janet		Boulter									E	Gray	AZ	53592		S			Gray	AZ	53592											F	X		U					/  /	/  /	5		1		11		Gray	CUM																																					R	  R1	Janet				N										19
61300014	9	1	Ms.	Judith		Bragar									E	Blue Hill	AZ	53592		S			Blue Hill	AZ	53592											F	X		U					/  /	/  /	9		2	37	28		Blue Hill	HAN																																					U	 U 1	Judith				N										20

Reply With Quote
  #10  
Old September 3rd, 2004, 04:25 AM
guggach guggach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Jul 2004
Location: Middle Europa
Posts: 1,083 guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 4 Days 19 h 44 m 45 sec
Reputation Power: 9
fds = file descriptors

the sed is correct, you cannot hope more, 2 probls
a) the data amount
b) sed open and close the fds every time it write to it
this costs time
as exemple, in a shell
Code:
echo >output
echo >>output
echo >>output

and
Code:
(
echo
echo
echo
) >ouptut

is really not the same, in the 1.case you open and close 3 times
the 'output' file, in 2. once!
this does not matter working on small files, but in your case
it costs a lot of I/O operations.

i would try at least a perl (or better c program)
opening once fds and closing at the end. i don't know how many
fds perl can handle, c has no limits (using integers).
if the file is formatted, i mean the 49col begins always at pos X
you don't need to split the string, this also will speed up the prog.

i can help you writing a c prog.

Reply With Quote
  #11  
Old September 3rd, 2004, 02:36 PM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
thanks again

Thanks! Your explanation on why it's taking so long makes a lot of sense.

I know c++, so I would be more comfortable with c than with perl. If you are willing to help me I would really appreciate it!

Thanks,
Courtenay

Reply With Quote
  #12  
Old September 4th, 2004, 06:16 AM
guggach guggach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Jul 2004
Location: Middle Europa
Posts: 1,083 guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level)guggach User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 4 Days 19 h 44 m 45 sec
Reputation Power: 9
i played a little with your probl, here an example
it takes 1,3 mins for 1 MIO records, records are 80 char &&
i a looking for the 4. tab.
feel free to adapt it for your requirements.
PS: my sys is an old sparc (>6years) and solaris7
Code:
#include <stdio.h>
#include <fcntl.h>

/* 
   NOTA: this is a 30mins toy, by guggach@yahoo.com,
   not really a programm, just an example.
   - 'regexp' AND 'fname' HAVE to be UNIQUE
   - this version DOES NOT supports spaces in 'regexp'
   - spaces in 'fname' are also a very bad practice
   - i use a lot of pointers, i love them and they are the power of 'c' 
   because a lot of people does not understand and miss-uses them,
   'c++' has NO pointers :(
   - i also like columning, it makes soooo easy to see errors :)

   - to compile: (g)cc -o execname thisfile.c
   - to run: execname inputfile
*/

/*
   assumptions:
     2*MAXBUFF is large enough to hold 1 record
   and MAXBUFF is large enough to hold 1 regexpr
   in a productive-prog i would check both
*/

#define MAXBUFF 256
#define ONPOS 4
#define TRAILER '\t'  /* this is a tab char */

typedef struct { char *regexp, *fname; int fd; } MYREG;

MYREG myreg[] = 
{{  "abc"        , "/home/user/abc"   , 0 }
,{  "1234"       , "/tmp/abc"         , 0 }
,{  "ytr"        , "/var/tmp/abc"     , 0 }
,{  "qwe"        , "/tmp/qwe"         , 0 }
,{  "zzzz"       , "zzzz"             , 0 }
,{  (char *)NULL , (char *)NULL       , 0 }};

int main(int argc, char **argv)
{
   int openonce(char *, int);
   FILE *input;
   MYREG *mr;
   char buff[2*MAXBUFF];
   char regexp[MAXBUFF];
   char *bpt;
   char *npt;
   register int tabs;

   if(!*++argv) /* i love that, it's for 'c' programmer only */
      exit(printf("give me a file-name to read\n"));
   if(!(input = fopen(*argv,"r")))
      exit(printf("cannot read %s\n",*argv));

   while(fgets(buff,sizeof(buff),input)){
      /* set a pointer on 'buff' && walk until ONPOS is reached */
      for(tabs = 0, bpt = buff; *bpt && tabs != ONPOS; ++bpt)
         if(*bpt == TRAILER) ++tabs;
      if(tabs != ONPOS) continue; /* not reached, scan next input */
      while(isspace(*bpt)) ++bpt; /* purge possible leading spaces */
      /* set a pointer on 'regexp' and isole it */
      for(npt = regexp; *bpt && !isspace(*bpt); ++bpt) *npt++ = *bpt;
      if(npt == regexp) continue; /* nothing appropriate found */
      *npt = 0; /* terminate 'regexp' */
      /* scan for 'regexp' the list of regexpression */
      for(mr = myreg; mr->regexp && strcmp(regexp,mr->regexp); ++mr);
      if(!mr->regexp) continue; /* nothing appropriate found */
      /* if not still open, try to open the file 'fname' */
      if(!(mr->fd = openonce(mr->fname,mr->fd))) continue;
      /* OK, put this record in that file */
      write(mr->fd,buff,strlen(buff));
   }
   /* housekeeping,
      not really necessary in this code but a good practice */
   fclose(input);
   for(mr = myreg; mr->regexp; ++mr) if(mr->fd) close(mr->fd);
   exit(0);
}
int openonce(char *name, int fd)
{
   /* ATTENTION: no check about filename !!! just try opening */
   if(fd) return(fd);
   if(0 >(fd = open(name,O_WRONLY|O_CREAT,0644))) return(0);
   return(fd);
}

Reply With Quote
  #13  
Old September 4th, 2004, 05:16 PM
Courtenayt Courtenayt is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jun 2004
Posts: 48 Courtenayt User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 18 m 31 sec
Reputation Power: 5
Wow! THANKS!!! I'll keep you posted on how it goes as I adapt it to my files.
I can't thank you enough for helping me out!

Cheers,
Courtenay

Reply With Quote
Reply

Viewing: Dev Shed ForumsOperating SystemsUNIX Help > split large file by groups in a column


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump