Perl Programming
 
  #1  
Old October 6th, 2012, 06:13 AM
metabo metabo is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2004
Posts: 234
Time spent in forums: 2 Days 14 h 32 m 50 sec
Reputation Power: 0
ISO 8859 encoding issues within a dataset of more than 10000 lines

With WWW::Mechanize I get a dataset like the following.

Here is a chunk of the data:

Loosdorftown Ledochowskastra�e 4 3382 Loosdorftown Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4


linux-wyee:/home/martin/perl #
The script gives back a result like this one:
Loosdorf
Ledochowskastraße
3382 Loostown
Telefonnummer: 0002754 6257
FAX-Nummer: 0002754 6257-4

We have the following options here.

To print to a file instead of printing to the screen, we just have to change:

say $text;

to:

print $OUT_FILE $text;

Some explanation: $OUT_FILE is a filehandle for the output file, which we have to open before entering the for loop.
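A minimal sketch of that change, under stated assumptions: the file name `addresses.txt` and the sample lines are made up for illustration, and the output is assumed to be wanted as UTF-8. The handle is opened once before the loop and closed after it.

```perl
use strict;
use warnings;

my $out_path = 'addresses.txt';    # assumed file name, not from the original script

# Sample lines for illustration; "\x{DF}" is the character "ß".
my @lines = ('Loosdorf', "Ledochowskastra\x{DF}e 4", '3382 Loosdorf');

# Open the handle once, before the loop, with an explicit output encoding.
open(my $OUT_FILE, '>:encoding(UTF-8)', $out_path)
    or die "Cannot open $out_path: $!";

for my $text (@lines) {
    # Unlike say, print does not append a newline, so add it here.
    print {$OUT_FILE} "$text\n";
}

close($OUT_FILE) or die "Cannot close $out_path: $!";
```

The `:encoding(UTF-8)` layer is one choice; `:encoding(ISO-8859-1)` would work the same way if the file has to be Latin-1.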

This would work for the code as it is, but it might be different if we use the Text::CSV module, which probably has dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it, although I should probably change that, because I work with CSV files from time to time.) Let me describe in more detail what the output file should look like: I want the comma to separate the fields of the addresses (or the records?).


Take this site, for example: katholisch.at

We have the following dataset. I want each record separated into its individual parts. In other words, if I end up with a dataset that delimits and separates lines like this one:

Loosdorf Ledochowskastra�e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

I would be very happy. Note: there is also an encoding issue. See Ledochowskastra�e: the original contains the character "ß", so we have to take care of the ISO 8859 encoding, don't we?


I would love it if you could give some hints. That would be very helpful. Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.


see more results:
Marias Neustift Neustifttown 28 4443 Marias Neussstift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.inmarianeustift@dioezese-linz.at
Marias Puchheim Gmundnertown Stra�e 1b 4800 Attnanger-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.inmariapuchheim@dioezese-linz.at
Marias Scharten Schartenstown 1 4612 Schartensbook Telefonnummer: 007272/5210
Marias Schmolln Maria Schmollntown 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.inmariaschmolln@dioezese-linz.at
Mattighofen R�merstra�e 12 5230 Mattighofentown Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.inmattighofen@dioezese-linz.at
Mauerkirchens Pfarrhofstra�e 4 5270 Mauerkirchentown Telefonnummer: 007724/2262



As you can see, we have an ISO 8859 encoding issue here.

What can we do? At the end of the day, I have to get everything into CSV format.
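A minimal sketch of one common remedy, assuming the scraped text arrives as raw ISO-8859-1 bytes (this thread does not confirm where the bytes get mangled). The helper name `decode_latin1` is made up for illustration. If the string has already been decoded to characters, only the `binmode` line on the output handle should be needed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw ISO-8859-1 bytes into a Perl character string.
sub decode_latin1 {
    my ($bytes) = @_;
    return decode('ISO-8859-1', $bytes);
}

# The single byte 0xDF is "ß" in ISO-8859-1.
my $raw  = "Ledochowskastra\xDFe 4";
my $text = decode_latin1($raw);

# Encode explicitly on the way out so a UTF-8 terminal receives valid bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";

# The same string as UTF-8 bytes: "ß" becomes the two bytes 0xC3 0x9F.
my $utf8_bytes = encode('UTF-8', $text);
```

Printing a decoded string to a handle without an encoding layer is a frequent source of the "�" seen above on UTF-8 terminals.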



By the way, a friend also suggested using Text::CSV, which will load Text::CSV_XS when it is installed.

At the moment the script only prints the data to stdout (the console). I'm sure I can modify that. :-)

I just installed Text::CSV_XS.

I took it from here: http://search.cpan.org/~hmbrand/Text-CSV_XS-0.91/CSV_XS.pm


I would love to hear from you.

Greetings,

metabo

  #2  
Old October 6th, 2012, 06:54 AM
metabo metabo is offline
I can try the Text::CSV module too.


The Text::CSV module provides functions for both parsing and producing CSV data. However, we'll focus on the parsing functionality here. The following code sample opens the prospects.csv file and parses each line in turn, printing out all the fields it finds.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $file = 'prospects.csv';
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;

while (<CSV>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        print "@columns\n";
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;



Running the code produces the following output:

Quote:
Name Address Floors Donated last year Contact
Charlotte French Cakes 1179 Glenhuntly Rd 1 Y John
Glenhuntly Pharmacy 1181 Glenhuntly Rd 1 Y Paul
**** Wicks Magnetic Pain Relief 1183-1185 Glenhuntly Rd 1 Y George
Gilmour's Shoes 1187 Glenhuntly Rd 1 Y Ringo


And by replacing the line:
Code:
print "@columns\n";



with:

Code:
print "Name: $columns[0]\n\tContact: $columns[4]\n";



we can get more particular about which fields we want to output. And while we're at it, let's skip past the first line of our csv file, since it's only a list of column names.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $file = 'prospects.csv';
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;

while (<CSV>) {
    next if ($. == 1);
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        print "Name: $columns[0]\n\tContact: $columns[4]\n";
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;



Running this code will give us the following output:

Quote:
Name: Charlotte French Cakes
	Contact: John
Name: Glenhuntly Pharmacy
	Contact: Paul
Name: **** Wicks Magnetic Pain Relief
	Contact: George
Name: Gilmour's Shoes
	Contact: Ringo





Well, I can draw some analogies from this for my case. What do you think?

  #3  
Old October 6th, 2012, 12:28 PM
metabo metabo is offline
Well, I can do the job with the split function too.

  #4  
Old October 6th, 2012, 08:15 PM
keath's Avatar
keath keath is offline
!~ /m$/
Dev Shed Specialist (4000 - 4499 posts)
 
Join Date: May 2004
Location: Reno, NV
Posts: 4,099
Time spent in forums: 2 Weeks 4 Days 8 h 20 m 20 sec
Reputation Power: 1809
Quote:
Originally Posted by metabo
Well, I can do the job with the split function too.


How?

Need to define the problem. You can get HTML with Mechanize. You can write CSV with Text::CSV. The remaining issue you haven't discussed is parsing the HTML. You need to get the part you want from the web page. The CSV module doesn't recognize HTML, and can't parse it.

In one of your other posts, you had a small example of getting text from the HTML document, but it was using the wrong delimiters for the job. I think you'll see if you look at the HTML source that your data is mostly separated by <br> tokens.
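For the CSV-writing step, a minimal sketch might look like the following. The file name `addresses.csv`, the column order, and the sample records are assumptions for illustration. The `binary` option lets non-ASCII characters such as "ß" through, and `eol` adds the line ending.

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 allows non-ASCII characters; eol => "\n" terminates each row.
my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Hypothetical records: name, street, postal code and town, phone, fax.
my @records = (
    ['Loosdorf',  "Ledochowskastra\x{DF}e 4", '3382 Loosdorf',  '02754 6257', '02754 6257-4'],
    ['Schalchen', "Hummelbachstra\x{DF}e 7",  '5231 Schalchen', '07742/2513', '07742/2513'],
);

my $out_path = 'addresses.csv';    # assumed file name
open(my $fh, '>:encoding(UTF-8)', $out_path)
    or die "Cannot open $out_path: $!";

# print() takes a filehandle and an array reference, and handles the quoting.
$csv->print($fh, $_) for @records;

close($fh) or die "Cannot close $out_path: $!";
```

Letting the module do the quoting avoids broken rows when a field itself contains a comma or a quote character.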

  #5  
Old October 6th, 2012, 10:29 PM
metabo metabo is offline
Here is more description

Hello dear Keath,

many thanks for the reply. Glad to hear from you.

Quote:
Originally Posted by keath
How?

Need to define the problem. You can get HTML with Mechanize. You can write CSV with Text::CSV. The remaining issue you haven't discussed is parsing the HTML. You need to get the part you want from the web page. The CSV module doesn't recognize HTML, and can't parse it.

In one of your other posts, you had a small example of getting text from the HTML document, but it was using the wrong delimiters for the job. I think you'll see if you look at the HTML source that your data is mostly separated by <br> tokens.




Here is the dataset with its specific design, characters, and structure:

http://katholisch.at/content/site/pfarrfinder/address/4604.html

Seen with Firebug:

Code:
<div class="address detail addressDetail">
<div class="topIcons"><div class="iconsDetail noprint">
</div></div>

<strong>Schalchen</strong><br><br>
Hummelbachstraße 7<br>5231 Schalchen<br><br>
Telefonnummer: 07742/2513<br>FAX-Nummer: 07742/2513<br>E-Mail: <a href="mailto:pfarre.schalchen@dioezese-linz.at">pfarre.schalchen@dioezese-linz.at</a><br


and without firebug:

Schalchen

Hummelbachstraße 7
5231 Schalchen

Telefonnummer: 07742/2513
FAX-Nummer: 07742/2513
E-Mail: pfarre.schalchen@dioezese-linz.at


This is the structure of the dataset as it can be found in almost all the results.

Keath, the dataset is the result of a Mechanize job.


Here is how the dataset should look, in other words its fields and structure:


Name (of the institution)

Street and house number
Postal code and town (a 4-digit postal code is obligatory; the town name can consist of several words)
Telephone number (follows after a ":"; the number can be very long)
Fax number (consists of a long string)
E-Mail (note: not every record has an e-mail address)


To sum up: fax and e-mail are not present in every record. See below for more information.


Here is an example.

Code:
Pichlsbier beim Wels Pfarrhoferplatz 1 4632 Pichlbier bei Wels Telefonnummer0333337444247/6444457770 0655550076/85557765291 FAX-Nummer055567247/5556777-4 E-Mailpichldasbierchen.wels@linzertorte.at
Pierbach Dorfstra�e 1 4282 Pierbach Telefonnummer: 07267/8205 0676/8776529
Pinsdorf Moargasse 2 4812 Pinsdorf Telefonnummer07612/63952 0676/87765293 E-Mailpichelsteiner@linzertorte.at
Pischelsdorf Pischelsdorf 2 5233 Pischelsdorf am Engelbach Telefonnummer07742/7207 0676/87765294 E-Mailpischelsdorf@linze



P.S.: Here is the code that calls Mechanize:

Code:
#!/usr/bin/perl
## This is how I would go about doing what I understand you're trying to do
## EXAMPLE only

use 5.014;
use strict;
use warnings;

use WWW::Mechanize;
use HTML::TokeParser;
use Data::Dumper;

my $target_url   = 'http://katholisch.at/content/site/pfarrfinder/address/'; # base url
my $page         = 4000;    # page start number
my $format       = '.html'; # ending format
my $max_page_num = 4100;    # 2300 max page number

# loop through the pages
for (0..$max_page_num) {
    # get mech
    my $mech = WWW::Mechanize->new();
    # set agent
    $mech->agent_alias('Windows Mozilla');

    # this combines to make the url
    my $url = $target_url . "$page" . "$format";

    # get the page
    $mech->get($url);

    # get the page content
    my $page_content = $mech->content();

    # filter the html
    my $html = HTML::TokeParser->new(\$page_content);

    # search and match
    while (my $tag = $html->get_tag('strong')) {
        my $text = $html->get_trimmed_text('script');
        say $text;
    }

    $page++;
}

1;



Well, Keath, do you think that Mechanize can do the complete job?
Note that the code above uses HTML::TokeParser:

Code:
use HTML::TokeParser;


The script shown above gives back only one result per page.

See a demo page:
http://katholisch.at/content/site/pfarrfinder/address/4000.html

But note: it does not give back the results in the form shown below. Instead, each record comes back as one single line.


I therefore need the dataset like the following:


Loosdorf
Ledochowskastraße 4
3382 Loosdorf
Telefonnummer: 02754 6257
FAX-Nummer: 02754 6257-4


Is this doable, perhaps with the split function? Or, even better, can Mechanize do the whole job?


I look forward to hearing from you.

Last edited by metabo : October 6th, 2012 at 10:53 PM.

  #6  
Old October 7th, 2012, 04:26 AM
metabo metabo is offline
Hello dear Keath, you gave very good hints. I guess the following makes things clear.

TokeParser does the magic; we do not need anything more, do we?

I found the following description here:
http://search.cpan.org/~gaas/HTML-Parser-3.69/lib/HTML/TokeParser.pm

If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

If parsing from an UTF-8 encoded string buffer decode it first:

utf8::decode($document);
my $p = HTML::TokeParser->new( \$document );
# ...

$p->get_token

This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions. The rest of the token array depend on the type like this:

["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]

where $attr is a hash reference, $attrseq is an array reference and the rest are plain scalars. The "Argspec" in HTML::Parser explains the details.

$p->unget_token( @tokens )

If you find you have read too many tokens you can push them back, so that they are returned the next time $p->get_token is called.

$p->get_tag
$p->get_tag( @tags )

This method returns the next start or end tag (skipping any other tokens), or undef if there are no more tags in the document. If one or more arguments are given, then we skip tokens until one of the specified tag types is found. For example:

$p->get_tag("font", "/font");

will find the next start or end tag for a font-element.

The tag information is returned as an array reference in the same form as for $p->get_token above, but the type code (first element) is missing. A start tag will be returned like this:

[$tag, $attr, $attrseq, $text]

The tagname of end tags are prefixed with "/", i.e. end tag is returned like this:

["/$tag", $text]

$p->get_text
$p->get_text( @endtags )

This method returns all text found at the current position. It will return a zero length string if the next token is not text. Any entities will be converted to their corresponding character. If one or more arguments are given, then we return all text occurring before the first of the specified tags found. For example:


Dear Keath, I think I have to define things as described here:


Quote:
$p->get_token

This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions. The rest of the token array depend on the type like this:

["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
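The token shapes quoted above can be explored with a short loop. This is only a sketch: the HTML string is a made-up fragment modelled on the address pages, and the `start:`/`end:`/`text:` labels are invented for the printout.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up fragment, for illustration only.
my $html = '<strong>Schalchen</strong><br>5231 Schalchen<br><!-- note -->';
my $p    = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @seen;
while (my $token = $p->get_token) {
    my $type = $token->[0];    # "S", "E", "T", "C", "D" or "PI"
    if    ($type eq 'S') { push @seen, "start:$token->[1]" }
    elsif ($type eq 'E') { push @seen, "end:$token->[1]" }
    elsif ($type eq 'T') { push @seen, "text:$token->[1]" }
    else                 { push @seen, "other:$type" }
}

print "$_\n" for @seen;
```

For the address pages, get_tag and get_text are usually more convenient than walking every token by hand, since they skip the token types that are not of interest.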



I look forward to hearing from you.

Greetings,
meta

  #7  
Old October 7th, 2012, 10:21 AM
keath keath is offline
There is no need for you to quote the documentation back to us. We know how the modules work.

This is the code you are using:
Code:
#search and match  
      while(my $tag = $html->get_tag('strong')){  
      my $text = $html->get_trimmed_text('script');  


Here's the HTML you are trying to parse:

Code:
<div id="contentBox">
	<div class="address detail addressDetail">
		<div class="topIcons">
			<div class="iconsDetail noprint"></div>
		</div><strong>Dom- und Metropolitanpfarre St. Stephan</strong><br>
		<br>
		Stephansplatz 3<br>
		1010 Wien<br>
		<br>
		Telefonnummer: 515 52-3530<br>
		FAX-Nummer: 515 52-3720<br>
		E-Mail: <a href="mailto:dompfarre-st.stephan@edw.or.at">dompfarre-st.stephan@edw.or.at</a><br>
		Web: <a href="http://www.st.stephan.at" target="_blank">http://www.st.stephan.at</a><br>
		<script type="text/javascript">

		window.addEvent('domready', function() {
		
		// blah blah blah
		
		</script><br>
		<br>
		<div id="google_map"></div>
		<div class="bottomIcons">
			<div class="iconsDetail noprint"></div>
		</div>
	</div>
</div>


You looked for a strong tag. Once one is found, everything from there until the javascript is found is returned as a string with no HTML tags in it. If another strong tag is found later in the document, that one is used instead.

What is the best, most reliable way to parse this into accurate fields?

In the first place, I would probably instruct the parser to look for the 'contentBox' div, because that tag is unique. You could check to make sure you are inside a div with an 'address' class afterwards, but the extra step is probably not necessary.

The only thing identifying the 'name' field, is the use of strong tags. So you should only get the text inside those tags for that field. I'll say it again: don't start getting text at the first start tag and then go to the 'script' tag. Stop parsing when you see the closing 'strong' tag, and put that text into a 'name' variable.

The address is the hardest field because it has no identifier, and it is spread over several lines. I recommend you get all these fields one at a time grabbing text up to any 'br' tag. Use a hash to collect data. If the text contains a colon, split on that and assign the data to the key (field name) prior to the colon. If no colon is present, append the text to an 'address' field.

Stop when 'script' is reached.
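The steps above could be sketched roughly like this. It is not a definitive implementation: the helper name `parse_address`, the `address` key, and the joining of address lines with ", " are choices made for illustration, and the sample HTML is a shortened copy of the page shown earlier.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper following the steps above; returns a hash reference.
sub parse_address {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";
    my %record;

    # 1. Skip ahead to the unique 'contentBox' div.
    while (my $tag = $p->get_tag('div')) {
        last if ($tag->[1]{id} // '') eq 'contentBox';
    }

    # 2. The name is only the text between <strong> and </strong>.
    $p->get_tag('strong') or return;
    $record{name} = $p->get_trimmed_text('/strong');

    # 3. Take the text after each <br>, one piece at a time; stop at <script>.
    my @address;
    while (my $tag = $p->get_tag('br', 'script')) {
        last if $tag->[0] eq 'script';
        my $text = $p->get_trimmed_text('br', 'script');
        next unless length $text;
        if ($text =~ /^([^:]+):\s*(.*)$/) {
            $record{$1} = $2;         # "Label: value" becomes key and value
        } else {
            push @address, $text;     # unlabeled lines belong to the address
        }
    }
    $record{address} = join ', ', @address;
    return \%record;
}

my $html = <<'HTML';
<div id="contentBox"><div class="address detail addressDetail">
<strong>Dom- und Metropolitanpfarre St. Stephan</strong><br><br>
Stephansplatz 3<br>1010 Wien<br><br>
Telefonnummer: 515 52-3530<br>FAX-Nummer: 515 52-3720<br>
<script type="text/javascript">// ...</script>
</div></div>
HTML

my $rec = parse_address($html);
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

Splitting only on the first colon keeps values such as a "Web: http://..." line intact. A street name that itself contains a colon would be misread as a label, so real data may need a stricter list of known labels.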

Last edited by keath : October 7th, 2012 at 04:47 PM. Reason: typo

  #8  
Old October 7th, 2012, 02:01 PM
metabo metabo is offline
Hello dear keath,

many thanks for this in-depth tutorial.

I will do as advised.

greetings

metabo
