#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0

    Encoding iso 8859 issues within a dataset of more than 10000 lines


    With Mechanize I get a dataset like the following.

    Here is a data chunk:

    Loosdorftown Ledochowskastra�e 4 3382 Loosdorftown Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4


    linux-wyee:/home/martin/perl #
    The script below gives back a result like this one:
    Loosdorf
    Ledochowskastraße
    3382 Loostown
    Telefonnummer: 0002754 6257
    FAX-Nummer: 0002754 6257-4

    Well, we have the following options here.

    To print to a file instead of printing to the screen, we just have to change:

    say $text;

    to:

    print $OUT_FILE $text;

    Some explanation: $OUT_FILE is a filehandle for the output file, which we have to open before entering the "for" loop.
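A minimal sketch of that change, not the final script. The helper name `write_lines` and whatever path is passed to it are hypothetical, and the `:encoding(UTF-8)` layer is an assumption about the encoding the output file should have.

```perl
use strict;
use warnings;

# Hypothetical helper: open the output file once, before the loop,
# then print every record through the same filehandle.
sub write_lines {
    my ($path, @lines) = @_;

    # Assumption: the file should be UTF-8, so that characters such
    # as a sharp s survive the trip to disk.
    open my $OUT_FILE, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";

    for my $text (@lines) {
        print {$OUT_FILE} $text, "\n";    # replaces: say $text;
    }

    close $OUT_FILE or die "Cannot close $path: $!";
    return scalar @lines;                 # number of lines written
}
```

In the real script the `open` would sit above the page loop and the `close` below it, so the file is not reopened for every page.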

    This would work for the code as it is, but it might be different if we are using the Text::CSV module, which probably has dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it, although I should probably change that, because I use CSV files from time to time.) Well, I will try to describe in more detail what the output file should look like. Do I want the comma to separate the fields of the addresses, or the records?


    If we take katholisch.at as an example, we have the following dataset.


    I want each record separated into its individual fields. In other words, I want a delimiter between the parts of a line like this one:

    Loosdorf Ledochowskastra�e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

    If that worked, I would be very happy. Note: there is also an encoding issue. See "Ledochowskastra�e": the street name contains the character "ß", so we have to take care of the ISO 8859 encoding, don't we?


    I would love it if you could give me some hints and a helping hand. That would be very supportive. Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.


    see more results:
    Marias Neustift Neustifttown 28 4443 Marias Neussstift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.inmarianeustift@dioezese-linz.at
    Marias Puchheim Gmundnertown Stra�e 1b 4800 Attnanger-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.inmariapuchheim@dioezese-linz.at
    Marias Scharten Schartenstown 1 4612 Schartensbook Telefonnummer: 007272/5210
    Marias Schmolln Maria Schmollntown 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.inmariaschmolln@dioezese-linz.at
    Mattighofen R�merstra�e 12 5230 Mattighofentown Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.inmattighofen@dioezese-linz.at
    Mauerkirchens Pfarrhofstra�e 4 5270 Mauerkirchentown Telefonnummer: 007724/2262



    Well, you see, we have an ISO 8859 encoding issue here.

    What can we do? At the end of the day, I have to get everything into CSV format.
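One possible way to handle the encoding, as a sketch only. It assumes the raw bytes really are ISO-8859-1 and that the output should be UTF-8. The helper name `latin1_to_utf8` is hypothetical. Note that Mechanize's `content` normally returns text that is already decoded, in which case only the output side (an encoding layer on the filehandle) needs attention.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw ISO-8859-1 bytes into UTF-8 bytes.
sub latin1_to_utf8 {
    my ($raw_bytes) = @_;
    my $chars = decode('ISO-8859-1', $raw_bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);                 # characters -> UTF-8 bytes
}

# "\xDF" is the single ISO-8859-1 byte for the sharp s.
my $utf8 = latin1_to_utf8("Ledochowskastra\xDFe 4");
```

If the replacement character already shows up in the saved data, the original byte is lost, so the fix has to happen at the point where the page is read or written, not afterwards.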



    By the way, a friend also suggested using Text::CSV, which will load Text::CSV_XS or, if that is not installed, the pure-Perl Text::CSV_PP.

    Well, at the moment all the results are only printed to stdout (the console). I'm sure that I can modify that... :-)

    I just installed Text::CSV_XS.

    I took it from here: http://search.cpan.org/~hmbrand/Text-CSV_XS-0.91/CSV_XS.pm
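A rough sketch of how one address record could become a CSV line with Text::CSV, using its `combine` and `string` methods. The column order in `@COLUMNS`, the hash keys, and the helper name `record_to_csv_line` are assumptions made for illustration. The sample values are taken from the St. Stephan page quoted later in this thread.

```perl
use strict;
use warnings;
use Text::CSV;

# Assumed column order, for illustration only.
my @COLUMNS = qw(name street postal_code town phone fax email);

# Hypothetical helper: turn one address record (a hash reference)
# into a single CSV line. Missing fields become empty columns.
sub record_to_csv_line {
    my ($csv, $rec) = @_;
    my @fields = map { defined $rec->{$_} ? $rec->{$_} : '' } @COLUMNS;
    $csv->combine(@fields)
        or die "Cannot build CSV line: " . $csv->error_input;
    return $csv->string;
}

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create Text::CSV object";

my $line = record_to_csv_line($csv, {
    name        => 'Dom- und Metropolitanpfarre St. Stephan',
    street      => 'Stephansplatz 3',
    postal_code => '1010',
    town        => 'Wien',
    phone       => '515 52-3530',
    fax         => '515 52-3720',
});
print "$line\n";
```

Letting the module build the line means quoting of commas, quotes, and spaces inside a field is handled for us, which a plain `join ','` would not do.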


    love to hear from you

    greetings

    your metabo
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0
    i can try the Text::CSV module too....


    The Text::CSV module provides functions for both parsing and producing CSV data. However, we'll focus on the parsing functionality here. The following code sample opens the prospects.csv file and parses each line in turn, printing out all the fields it finds.

    PHP Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    my $file = 'prospects.csv';
    my $csv = Text::CSV->new();

    open (CSV, "<", $file) or die $!;

    while (<CSV>) {
        if ($csv->parse($_)) {
            my @columns = $csv->fields();
            print "@columns\n";
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close CSV;

    Running the code produces the following output:

    Name Address Floors Donated last year Contact
    Charlotte French Cakes 1179 Glenhuntly Rd 1 Y John
    Glenhuntly Pharmacy 1181 Glenhuntly Rd 1 Y Paul
    **** Wicks Magnetic Pain Relief 1183-1185 Glenhuntly Rd 1 Y George
    Gilmour's Shoes 1187 Glenhuntly Rd 1 Y Ringo

    And by replacing the line:
    PHP Code:
    print "@columns\n"

    with:

    PHP Code:
    print "Name: $columns[0]\n\tContact: $columns[4]\n"

    we can get more particular about which fields we want to output. And while we're at it, let's skip past the first line of our csv file, since it's only a list of column names.

    PHP Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    my $file = 'prospects.csv';
    my $csv = Text::CSV->new();

    open (CSV, "<", $file) or die $!;

    while (<CSV>) {
        next if ($. == 1);
        if ($csv->parse($_)) {
            my @columns = $csv->fields();
            print "Name: $columns[0]\n\tContact: $columns[4]\n";
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close CSV;

    Running this code will give us the following output:

    Name: Charlotte French Cakes
    	Contact: John
    Name: Glenhuntly Pharmacy
    	Contact: Paul
    Name: **** Wicks Magnetic Pain Relief
    	Contact: George
    Name: Gilmour's Shoes
    	Contact: Ringo



    Well, I can probably adapt this by analogy. What do you think?
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0
    well i can do the job with split - the function split too
  6. #4
  7. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,252
    Rep Power
    1810
    Originally Posted by metabo
    well i can do the job with split - the function split too
    How?

    Need to define the problem. You can get HTML with Mechanize. You can write CSV with Text::CSV. The remaining issue you haven't discussed is parsing the HTML. You need to get the part you want from the web page. The CSV module doesn't recognize HTML, and can't parse it.

    In one of your other posts, you had a small example of getting text from the HTML document, but it was using the wrong delimiters for the job. I think you'll see if you look at the HTML source that your data is mostly separated by <br> tokens.
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0

    Here is a more detailed description.


    Hello dear keath,

    many thanks for the reply. Glad to hear from you.

    Originally Posted by keath
    How?

    Need to define the problem. You can get HTML with Mechanize. You can write CSV with Text::CSV. The remaining issue you haven't discussed is parsing the HTML. You need to get the part you want from the web page. The CSV module doesn't recognize HTML, and can't parse it.

    In one of your other posts, you had a small example of getting text from the HTML document, but it was using the wrong delimiters for the job. I think you'll see if you look at the HTML source that your data is mostly separated by <br> tokens.


    Here is the dataset with its specific layout, characters, and structure:

    http://katholisch.at/content/site/pfarrfinder/address/4604.html

    Seen with Firebug:

    PHP Code:
    <div class="address detail addressDetail">
    <div class="topIcons"><div class="iconsDetail noprint">
    </div></div>

    <strong>Schalchen</strong><br><br>
    Hummelbachstraße 7<br>5231 Schalchen<br><br>
    Telefonnummer07742/2513<br>FAX-Nummer07742/2513<br>E-Mail: <a href="mailto:pfarre.schalchen@dioezese-linz.at">pfarre.schalchen@dioezese-linz.at</a><br
    and without firebug:

    Schalchen

    Hummelbachstraße 7
    5231 Schalchen

    Telefonnummer: 07742/2513
    FAX-Nummer: 07742/2513
    E-Mail: pfarre.schalchen@dioezese-linz.at


    This is the structure of the dataset, as it can be found in almost all the results.

    Well, keath, the dataset is the result of a Mechanize job.


    Here is how the dataset should look, with its specific fields and structure:


    Name (of the institution)

    Street and house number
    Postal code and town (the 4-digit postal code is obligatory; the town name can consist of several words)
    Telephone number (follows a ":"; the telephone number can be very long)
    Fax number (also a long string)
    E-Mail (note: not every record has an e-mail address)


    To summarize: fax and e-mail are not present in every line. See below for more information.


    Here is an example:

    PHP Code:
    Pichlsbier beim Wels Pfarrhoferplatz 1 4632 Pichlbier bei Wels Telefonnummer0333337444247/6444457770 0655550076/85557765291 FAX-Nummer055567247/5556777-4 E-Mailpichldasbierchen.wels@linzertorte.at
    Pierbach Dorfstra�e 1 4282 Pierbach Telefonnummer: 07267/8205 0676/8776529
    Pinsdorf Moargasse 2 4812 Pinsdorf Telefonnummer07612/63952 0676/87765293 E-Mailpichelsteiner@linzertorte.at
    Pischelsdorf Pischelsdorf 2 5233 Pischelsdorf am Engelbach Telefonnummer07742/7207 0676/87765294 E-Mailpischelsdorf@linze
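For comparison, here is a sketch of what splitting such a flattened line with regular expressions could look like. The helper name `split_record` and the hash keys are hypothetical. It assumes the labels "Telefonnummer", "FAX-Nummer", and "E-Mail" appear in that order, with or without a colon, and that the postal code is the last standalone 4-digit number. Note the limitation: once the line is flattened there is nothing left that marks where the name ends and the street begins, so both stay in one field.

```perl
use strict;
use warnings;

# Hypothetical helper: split one flattened address line into fields.
sub split_record {
    my ($line) = @_;
    my %rec;

    # The labelled fields are optional, so peel them off from the right.
    $rec{email} = $1 if $line =~ s/\s*E-Mail:?\s*(\S+)\s*$//;
    $rec{fax}   = $1 if $line =~ s/\s*FAX-Nummer:?\s*(.+?)\s*$//;
    $rec{phone} = $1 if $line =~ s/\s*Telefonnummer:?\s*(.+?)\s*$//;

    # What remains is "name street number", a 4-digit postal code, and the town.
    if ($line =~ /^(.*\S)\s+(\d{4})\s+(\S.*)$/) {
        @rec{qw(name_and_street postal_code town)} = ($1, $2, $3);
    }
    return \%rec;
}
```

Because the name/street boundary cannot be recovered this way, reading the fields from the HTML while the tags are still present is the more reliable route.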


    Post Scriptum:
    see here the code - that does call Mechanize:

    PHP Code:
    #!/usr/bin/perl
    ## This is how I would go about doing what I understand you are trying to do
    ## EXAMPLE only

    use 5.014;
    use strict;
    use warnings;

    use WWW::Mechanize;
    use HTML::TokeParser;
    use Data::Dumper;

    my $target_url   = 'http://katholisch.at/content/site/pfarrfinder/address/'; #base url
    my $page         = 4000;    #page start number
    my $format       = '.html'; #ending format
    my $max_page_num = 4100;    #2300 max page number

    #loop through the pages, from the start page up to the max page number
    while ($page <= $max_page_num) {
        #get mech
        my $mech = WWW::Mechanize->new();
        #set agent
        $mech->agent_alias('Windows Mozilla');

        #this combines to make the url
        my $url = $target_url . $page . $format;

        #get the page
        $mech->get($url);

        #get the page content
        my $page_content = $mech->content();

        #filter the html
        my $html = HTML::TokeParser->new(\$page_content);

        #search and match
        while (my $tag = $html->get_tag('strong')) {
            my $text = $html->get_trimmed_text('script');
            say $text;
        }

        $page++;
    }

    1;

    Well, keath,

    do you think that Mechanize can do the complete job?
    Note that the code above uses HTML::TokeParser:

    PHP Code:
    use HTML::TokeParser
    The script shown above gives back only one kind of result.

    See a demo page:
    http://katholisch.at/content/site/pfarrfinder/address/4000.html

    But note: it does not give back the results in the form shown
    below. Instead, each record comes back as one single line.


    I therefore need the dataset to look like the following:


    Loosdorf
    Ledochowskastraße 4
    3382 Loosdorf
    Telefonnummer: 02754 6257
    FAX-Nummer: 02754 6257-4


    Is this doable, perhaps with the split function? Or, even better, can Mechanize do the whole job?


    look forward to hear from you
    Last edited by metabo; October 6th, 2012 at 10:53 PM.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0
    Hello dear keath, you gave very good hints. I guess the following makes things clear.

    TokeParser does the magic. We do not need anything more, do we?

    I found the following description here:
    http://search.cpan.org/~gaas/HTML-Parser-3.69/lib/HTML/TokeParser.pm

    If a $filename is passed to the constructor the file will be opened in raw mode and the parsing result will only be valid if its content is Latin-1 or pure ASCII.

    If parsing from an UTF-8 encoded string buffer decode it first:

    utf8::decode($document);
    my $p = HTML::TokeParser->new( \$document );
    # ...

    $p->get_token

    This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions. The rest of the token array depend on the type like this:

    ["S", $tag, $attr, $attrseq, $text]
    ["E", $tag, $text]
    ["T", $text, $is_data]
    ["C", $text]
    ["D", $text]
    ["PI", $token0, $text]

    where $attr is a hash reference, $attrseq is an array reference and the rest are plain scalars. The "Argspec" in HTML::Parser explains the details.
    $p->unget_token( @tokens )

    If you find you have read too many tokens you can push them back, so that they are returned the next time $p->get_token is called.
    $p->get_tag
    $p->get_tag( @tags )

    This method returns the next start or end tag (skipping any other tokens), or undef if there are no more tags in the document. If one or more arguments are given, then we skip tokens until one of the specified tag types is found. For example:

    $p->get_tag("font", "/font");

    will find the next start or end tag for a font-element.

    The tag information is returned as an array reference in the same form as for $p->get_token above, but the type code (first element) is missing. A start tag will be returned like this:

    [$tag, $attr, $attrseq, $text]

    The tagname of end tags are prefixed with "/", i.e. end tag is returned like this:

    ["/$tag", $text]

    $p->get_text
    $p->get_text( @endtags )

    This method returns all text found at the current position. It will return a zero length string if the next token is not text. Any entities will be converted to their corresponding character. If one or more arguments are given, then we return all text occurring before the first of the specified tags found. For example:


    Well, dear keath, I think I have to handle the tokens the way it is described under $p->get_token above.

    Looking forward to hearing from you.

    greetings
    meta
  12. #7
  13. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,252
    Rep Power
    1810
    There is no need for you to quote the documentation back to us. We know how the modules work.

    This is the code you are using:
    Code:
    #search and match  
          while(my $tag = $html->get_tag('strong')){  
          my $text = $html->get_trimmed_text('script');
    Here's the HTML you are trying to parse:

    Code:
    <div id="contentBox">
    	<div class="address detail addressDetail">
    		<div class="topIcons">
    			<div class="iconsDetail noprint"></div>
    		</div><strong>Dom- und Metropolitanpfarre St. Stephan</strong><br>
    		<br>
    		Stephansplatz 3<br>
    		1010 Wien<br>
    		<br>
    		Telefonnummer: 515 52-3530<br>
    		FAX-Nummer: 515 52-3720<br>
    		E-Mail: <a href="mailto:dompfarre-st.stephan@edw.or.at">dompfarre-st.stephan@edw.or.at</a><br>
    		Web: <a href="http://www.st.stephan.at" target="_blank">http://www.st.stephan.at</a><br>
    		<script type="text/javascript">
    
    		window.addEvent('domready', function() {
    		
    		// blah blah blah
    		
    		</script><br>
    		<br>
    		<div id="google_map"></div>
    		<div class="bottomIcons">
    			<div class="iconsDetail noprint"></div>
    		</div>
    	</div>
    </div>
    You looked for a strong tag. Once one is found, everything from there until the javascript is found is returned as a string with no HTML tags in it. If another strong tag is found later in the document, that one is used instead.

    What is the best, most reliable way to parse this into accurate fields?

    In the first place, I would probably instruct the parser to look for the 'contentBox' div, because that tag is unique. You could check to make sure you are inside a div with an 'address' class afterwards, but the extra step is probably not necessary.

    The only thing identifying the 'name' field, is the use of strong tags. So you should only get the text inside those tags for that field. I'll say it again: don't start getting text at the first start tag and then go to the 'script' tag. Stop parsing when you see the closing 'strong' tag, and put that text into a 'name' variable.

    The address is the hardest field because it has no identifier, and it is spread over several lines. I recommend you get all these fields one at a time grabbing text up to any 'br' tag. Use a hash to collect data. If the text contains a colon, split on that and assign the data to the key (field name) prior to the colon. If no colon is present, append the text to an 'address' field.

    Stop when 'script' is reached.
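The steps above could be sketched roughly like this. It is an untested-against-the-live-site outline, not a finished scraper. The helper name `parse_address`, the ", " used to join address lines, and the choice to use the label text itself as the hash key are assumptions.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns a hash reference with 'name', 'address',
# and one key per labelled field (for example 'Telefonnummer').
sub parse_address {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    # 1. Jump to the unique 'contentBox' div.
    while (my $div = $p->get_tag('div')) {
        last if ($div->[1]{id} // '') eq 'contentBox';
    }

    # 2. The name is only the text inside <strong> ... </strong>.
    $p->get_tag('strong') or return;
    my %rec = (name => $p->get_trimmed_text('/strong'), address => '');

    # 3. Read one chunk of text at a time, up to the next <br> or <script>.
    while (1) {
        my $text = $p->get_trimmed_text('br', 'script');
        if ($text =~ /^([^:]+):\s*(.*)$/) {
            $rec{$1} = $2;                       # "Label: value"
        }
        elsif (length $text) {
            $rec{address} .= length $rec{address} ? ", $text" : $text;
        }

        # 4. Consume the tag we stopped at; stop for good at <script>.
        my $tag = $p->get_tag or last;
        last if $tag->[0] eq 'script';
    }
    return \%rec;
}
```

Called with the page content from Mechanize, for example `parse_address($mech->content)`, this would give one hash per page, which can then be written out as one CSV row.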
    Last edited by keath; October 7th, 2012 at 04:47 PM. Reason: typo
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    305
    Rep Power
    0
    hello dear keath

    many many thanks for this in-depth-tutorial.

    I will do as advised.

    greetings

    metabo
