Perl Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsProgramming LanguagesPerl Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
Get inside! Sample the range of functionality easily built with JMSL Library for Time Series Data Analysis, Heat Maps, Portfolio Optimization, Monte Carlo Simulation, Stock Price Charting and more. Download Now!
  #1  
Old April 16th, 2008, 01:12 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
Tab delimited parse

Hi...encountering a difficult file format.

Please see attached file: tsv_parse.txt (open with excel to see format)

tsv_result.txt is what I would like the data to look like.

Any direction or help would be appreciated.

So far I have this code:

Code:
while (<$file>) {

	$lines++;
    $len = length($_);
    my @fields = split(/\t/);

	if ($len > 2) {

print @fields;

...
 
Attached Files
File Type: txt tsv_parse.txt (168 Bytes, 32 views)
File Type: txt tsv_result.txt (334 Bytes, 27 views)

Reply With Quote
  #2  
Old April 16th, 2008, 04:01 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
my other approach was to use excel vba however, the data in the one cell is separated by ALT+ENTER. I also tried importing to Access and what I saw was a question mark sign closed by a square symbol to represent the delimiter.

Reply With Quote
  #3  
Old April 16th, 2008, 04:10 PM
keath's Avatar
keath keath is offline
!~ /m$/
Dev Shed Frequenter (2500 - 2999 posts)
 
Join Date: May 2004
Location: Leawood, Kansas
Posts: 2,652 keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level)keath User rank is Brigadier General (60000 - 70000 Reputation Level) 
Time spent in forums: 1 Week 5 Days 1 h 24 m 27 sec
Reputation Power: 633
I can't look at your files from here, but from experience I think Excel does a terrible job of creating tab delimited files. It tends to add unwanted quotes around the place anyway.

I prefer to export as CSV when working with Excel, and parse it with Text::CSV or Text::CSV_XS. It's slightly more work, but better results.
Comments on this post
ishnid agrees: Yep. IIRC, Text::CSV_XS may handle newlines within fields better. Can't quite remember.

Reply With Quote
  #4  
Old April 16th, 2008, 04:38 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
Substitute the | with ALT+Enter symbol
Code:
 
For Instance if I have:
id name zip
1 bob 98789
2 tom 23434
3 abe 345 | 356 | 467 (these 3 zip's are in ONE cell)

I want that to look like this:
id name zip
1 bob 98789
2 tom 23434
3 abe 345 
3 abe 356 
3 abe 467


I need each entry to go down to a new row that also copies the rest of the row to the new row.

Here is the exact problem I am having but I didn't understand it either and rather do this with perl:
http://www.mrexcel.com/archive/Data/951.html

Reply With Quote
  #5  
Old April 16th, 2008, 06:30 PM
OmegaZero OmegaZero is online now
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2007
Posts: 126 OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level)OmegaZero User rank is First Lieutenant (10000 - 20000 Reputation Level) 
Time spent in forums: 6 Days 20 h 30 m 24 sec
Reputation Power: 183
I took a look at your files. Take the following as a reason to use CSV...

I think this will solve the problem of getting the data into memory. The files use have have CRLFs ending the lines but just LFs in the multi-line field.
Code:
# Add this before the loop
binmode($file);
$/ = "\cM\cJ";


There's some stray quotes in around the multi-line fields so you probably want to clean that up too.
Code:
s/^"(.*)"$/$1/s for @fields


For the second part you can just split the record on "\cJ" and print out the entire line substituting in each piece of the split record with something like this:
Code:
print @fields[0,1], $_, @fields[3,4,5,6,7,8] for split /\cJ/, $fields[2];


But the problem gets more interesting if you don't know which field is the multi-line field or if you can have multiple multi-line fields. I think the easiest way of solving it would be looping over the columns creating an array of arrays as you go: you flatten any multi-lines in column 1 building a new array, then you flatten column two resulting in a new array and so on.

The real answer is that (A) Excel should not be used as a database and (B) you need to normalize that database. ; )
__________________
sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);

Last edited by OmegaZero : April 16th, 2008 at 06:34 PM.

Reply With Quote
  #6  
Old April 17th, 2008, 01:29 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
OmegaZero, that worked very nicely =)

Now how would I dump that in a database?

I know I should be normalizing... I will down the road but for the time being I need this working.

not sure if this is the best way for the sql statement:

Code:

my $insert = "INSERT INTO table A (1, 2, 3, 4, 5, 6, 7, 8)  VALUES (@field)";

	 my $sth = $dbh->prepare($insert);
	 $isth->execute;
	
	 $isth->finish();
	 $dbh->disconnect();



Quote:
Originally Posted by OmegaZero
I took a look at your files. Take the following as a reason to use CSV...

I think this will solve the problem of getting the data into memory. The files use have have CRLFs ending the lines but just LFs in the multi-line field.
Code:
# Add this before the loop
binmode($file);
$/ = "\cM\cJ";


There's some stray quotes in around the multi-line fields so you probably want to clean that up too.
Code:
s/^"(.*)"$/$1/s for @fields


For the second part you can just split the record on "\cJ" and print out the entire line substituting in each piece of the split record with something like this:
Code:
print @fields[0,1], $_, @fields[3,4,5,6,7,8] for split /\cJ/, $fields[2];


But the problem gets more interesting if you don't know which field is the multi-line field or if you can have multiple multi-line fields. I think the easiest way of solving it would be looping over the columns creating an array of arrays as you go: you flatten any multi-lines in column 1 building a new array, then you flatten column two resulting in a new array and so on.

The real answer is that (A) Excel should not be used as a database and (B) you need to normalize that database. ; )

Reply With Quote
  #7  
Old April 17th, 2008, 01:45 PM
Clueless Newbie Clueless Newbie is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 440 Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 1 Week 3 Days 1 h 25 m 10 sec
Reputation Power: 93
In general ...

Code:
#!\user\bin\perl -w
while (my @a_Field=GetTSV(@a_Field)) {
  Print('',@a_Field);
   };

  sub Print{
    my $string=shift;
    my @field=@_;
    if (@field == 0) { # all done
       substr($string,-1,1)="\n";
       print "$string";
        } else { # more to come
       for (split(/\n/,shift(@field))) {
          Print("$string$_\t",@field);
           };
        };
     };

  sub GetTSV{
    my($s_Line,$s_Text);
    while ($s_Line=<DATA>) {
      #print "'$s_Line'\n";
      my(@a_Field);
      $s_Text.=$s_Line;
      while ($s_Text =~ m{([^"\t]*)([\t\n])|"((?:[^"]|"")*)"([\t\n])}gs) {
        unless (substr($`,-1,1) eq '"') {
          if (defined $1) {
            push(@a_Field,$1);
            if ($2 eq "\n") {
              return @a_Field;
               };
             } else {
            my($s_Field);
            ($s_Field=$3)=~s/""/"/g;
            push(@a_Field,$s_Field);
            if ($4 eq "\n") {
              return @a_Field;
               };
             };
           };
         };
       };
    return ();
     };

__DATA__
data1	data2	"data3a
data3b
data3c
data3d"	data4	"data5a
data5b"	data6	"data7a
data7b"	data8	data9	data10


which gives
Code:
data1   data2   data3a  data4   data5a  data6   data7a  data8   data9   data10
data1   data2   data3a  data4   data5a  data6   data7b  data8   data9   data10
data1   data2   data3a  data4   data5b  data6   data7a  data8   data9   data10
data1   data2   data3a  data4   data5b  data6   data7b  data8   data9   data10
data1   data2   data3b  data4   data5a  data6   data7a  data8   data9   data10
data1   data2   data3b  data4   data5a  data6   data7b  data8   data9   data10
data1   data2   data3b  data4   data5b  data6   data7a  data8   data9   data10
data1   data2   data3b  data4   data5b  data6   data7b  data8   data9   data10
data1   data2   data3c  data4   data5a  data6   data7a  data8   data9   data10
data1   data2   data3c  data4   data5a  data6   data7b  data8   data9   data10
data1   data2   data3c  data4   data5b  data6   data7a  data8   data9   data10
data1   data2   data3c  data4   data5b  data6   data7b  data8   data9   data10
data1   data2   data3d  data4   data5a  data6   data7a  data8   data9   data10
data1   data2   data3d  data4   data5a  data6   data7b  data8   data9   data10
data1   data2   data3d  data4   data5b  data6   data7a  data8   data9   data10
data1   data2   data3d  data4   data5b  data6   data7b  data8   data9   data10
Comments on this post
ewish agrees: Nice!
__________________
"Plagiarize the code of others, I say!"

That's about the worst you can do. It's the fast way to make a very bad coder out of you. It's important to understand what you are doing. Just asking what you should do, and blindly copying that if you would have gotten an unambigious answer is very, very bad.

-- Abigail

Last edited by Clueless Newbie : April 17th, 2008 at 02:03 PM.

Reply With Quote
  #8  
Old April 17th, 2008, 05:33 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
amazing...any books out there to teach me all this stuff?

Reply With Quote
  #9  
Old April 21st, 2008, 05:29 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
I am still having some difficulty insert this into a database.
Any hints appreciated.

Code:
while (<$file>) {

	$lines++;
    $len = length($_);
    my @fields = split(/\t/); 

    binmode($file);
    $/ = "\cM\cJ";

   	s/^"(.*)"$/$1/s for @fields;
        
	if ($len > 2) {
		
	print @fields[0,1], $_, @fields[3,4,5,6,7,8,9,10,11,12,13] for split /\cJ/, $fields[2];
	
	my $insert = "INSERT INTO table () values ()";

		my $sth = $dbh->prepare($insert);
 	 	$sth->execute;
 	
 	 	$sth->finish();
 	 	$dbh->disconnect();

	}# closes if $len

}# closes while
Attached Files
File Type: txt tsv_parse3.txt (148 Bytes, 22 views)

Last edited by electron_89 : April 21st, 2008 at 06:04 PM.

Reply With Quote
  #10  
Old April 22nd, 2008, 02:03 PM
Clueless Newbie Clueless Newbie is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 440 Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 1 Week 3 Days 1 h 25 m 10 sec
Reputation Power: 93
Code:
#!\user\bin\perl -w
use strict;
use warnings;
use DBI;

my $Temp_dbh_o=DBI->connect...
#$Temp_dbh_o->do("create table example (p0,p1,p2,p3,p4,p5,p6,p7,p8,p9)");
my $Temp_sth_o=$Temp_dbh_o->prepare("insert into example values(?,?,?,?,?,?,?,?,?,?)");
my @a_Field;
while (@a_Field=GetTSV(@a_Field)) {
  Insert([],@a_Field);
   };
$Temp_dbh_o->commit;
$Temp_dbh_o->disconnect();
exit;

  sub Insert{
    my $value_ra=shift;
    my @field=@_;
    if (@field == 0) { # all done
       $Temp_sth_o->execute(@$value_ra);
        } else { # more to come
       for (split(/\n/,shift(@field))) {
          Insert([@$value_ra,$_],@field);
           };
        };
     };

  sub GetTSV{
    my($s_Line,$s_Text);
    while ($s_Line=<DATA>) {
      #print "'$s_Line'\n";
      my(@a_Field);
      $s_Text.=$s_Line;
      while ($s_Text =~ m{([^"\t]*)([\t\n])|"((?:[^"]|"")*)"([\t\n])}gs) {
        unless (substr($`,-1,1) eq '"') {
          if (defined $1) {
            push(@a_Field,$1);
            if ($2 eq "\n") {
              return @a_Field;
               };
             } else {
            my($s_Field);
            ($s_Field=$3)=~s/""/"/g;
            push(@a_Field,$s_Field);
            if ($4 eq "\n") {
              return @a_Field;
               };
             };
           };
         };
       };
    return ();
     };

__DATA__
data1	data2	"data3a
data3b
data3c
data3d"	data4	"data5a
data5b"	data6	"data7a
data7b"	data8	data9	data10


HTH,
clueless

Reply With Quote
  #11  
Old April 23rd, 2008, 03:20 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
Thanks but I can't get that working on a form.
What happens is user browses for the tsv file, clicks submit.
It works with OmegaZero's example.

Thanks
Quote:
Originally Posted by Clueless Newbie
Code:
#!\user\bin\perl -w
use strict;
use warnings;
use DBI;

my $Temp_dbh_o=DBI->connect...
#$Temp_dbh_o->do("create table example (p0,p1,p2,p3,p4,p5,p6,p7,p8,p9)");
my $Temp_sth_o=$Temp_dbh_o->prepare("insert into example values(?,?,?,?,?,?,?,?,?,?)");
my @a_Field;
while (@a_Field=GetTSV(@a_Field)) {
  Insert([],@a_Field);
   };
$Temp_dbh_o->commit;
$Temp_dbh_o->disconnect();
exit;

  sub Insert{
    my $value_ra=shift;
    my @field=@_;
    if (@field == 0) { # all done
       $Temp_sth_o->execute(@$value_ra);
        } else { # more to come
       for (split(/\n/,shift(@field))) {
          Insert([@$value_ra,$_],@field);
           };
        };
     };

  sub GetTSV{
    my($s_Line,$s_Text);
    while ($s_Line=<DATA>) {
      #print "'$s_Line'\n";
      my(@a_Field);
      $s_Text.=$s_Line;
      while ($s_Text =~ m{([^"\t]*)([\t\n])|"((?:[^"]|"")*)"([\t\n])}gs) {
        unless (substr($`,-1,1) eq '"') {
          if (defined $1) {
            push(@a_Field,$1);
            if ($2 eq "\n") {
              return @a_Field;
               };
             } else {
            my($s_Field);
            ($s_Field=$3)=~s/""/"/g;
            push(@a_Field,$s_Field);
            if ($4 eq "\n") {
              return @a_Field;
               };
             };
           };
         };
       };
    return ();
     };

__DATA__
data1	data2	"data3a
data3b
data3c
data3d"	data4	"data5a
data5b"	data6	"data7a
data7b"	data8	data9	data10


HTH,
clueless

Reply With Quote
  #12  
Old April 23rd, 2008, 06:24 PM
Clueless Newbie Clueless Newbie is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 440 Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level)Clueless Newbie User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 1 Week 3 Days 1 h 25 m 10 sec
Reputation Power: 93
Post the code.

Reply With Quote
  #13  
Old April 23rd, 2008, 07:32 PM
electron_89 electron_89 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2006
Posts: 91 electron_89 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 h 24 m 5 sec
Reputation Power: 2
This is the partial working version I have. Your code is very nice but hopefully it can be incorporated into this and will import into my database. As you probably can tell, my insert statement won't work.

thanks

Code:
print "Content-type: text/html\n\n";

print <<HTML;
<head><title>parser</title>
</head>
<body>
<table>
  <tr>
    <td>
	<form action="parse4.cgi" enctype="multipart/form-data" method="post">
	<table>
	  <tr>
	    <td><input type="file" name="filename" size="40"></td>
	  </tr>
	  <tr>  
	    <td><input type="submit" value="Import"></td>
	  </tr>
	</table>
	</form>
    </td>
  </tr>
</table>
</body>
HTML

if (my $file = param('filename')) {

$db="";
$host="";
$port="";
$userid="";
$passwd="";
$connectionInfo="DBI:mysql:database=$db;$host:$port";

my $dbh = DBI->connect($connectionInfo,$userid,$passwd)
						or die("Cannot connect: $DBI::errstr\n");
						
while (<$file>) {

    $lines++;
    $len = length($_);
    my @fields = split(/\t/); 

    binmode($file);
    $/ = "\cM\cJ";

   	s/^"(.*)"$/$1/s for @fields;
        
	if ($len > 2) {
		
	print @fields[0,1], $_, @fields[3,4,5,6,7,8,9,10,11,12,13] for split /\cJ/, $fields[2];
	
	my $insert = "INSERT INTO table (column1, column2, column3, column4, column5, column6, column7, column8, column9, column10,
										column11, column12, column13, column14)
								values (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";

		my $sth = $dbh->prepare($insert);
 	 	$sth->execute;
 	
 	 	$sth->finish();
 	 	$dbh->disconnect();
	}
}
}


Quote:
Originally Posted by Clueless Newbie
Post the code.

Reply With Quote
  #14  
Old April 24th, 2008, 01:01 AM
keath's Avatar
keath keath is offline
!~ /m$/
Dev Shed Frequenter (2500 - 2999 posts)
 
Join Date: May 2004
Location: Leawood, Kansas