#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0

    [solved] How to install HTML-TreeBuilder-LibXML on OpenSuse v. 11.3


    Hello dear community,

    This issue is solved: I was finally able to install the package. See below!

    Many thanks to Axweildr and Keath for this great walkthrough. It was a great lesson with a nice, steep learning curve, and you are great guides!
    Next I will move on to a Perl task, but that can be discussed in a new thread, which I will open in the next few hours.


    Original post, here it is:

    This is an easy question for the experts among you.

    How do I install HTML-TreeBuilder-LibXML on openSUSE 11.3?


    I am new to Linux and also pretty new to Perl

    Any and all help will be appreciated

    Metabo
    Last edited by metabo; September 26th, 2010 at 01:35 PM. Reason: solved - thx to the great help!
  2. #2
  3. 'fie' on me, allege-dly
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2003
    Location
    in da kitchen ...
    Posts
    12,889
    Rep Power
    6444
    At the console:
    perl -MCPAN -e shell
    On the first run you'll need to configure it; just follow the defaults and get local repositories.
    cpan> install HTML::TreeBuilder::LibXML

    Then you should be good. Alternatively, your package manager may let you install Perl modules as well.

    Here are a few options: http://linuxpoison.blogspot.com/2009/07/how-to-install-perl-modules.html
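For a non-interactive install, the same request can presumably be issued in one line from the system shell instead of from the `cpan>` prompt. The sketch below only prints that command rather than running it. `cpan_install_cmd` is a hypothetical helper name, and running the printed command for a system-wide install typically needs root privileges and network access.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the one-line CPAN
# install command for the module name given as the first argument.
cpan_install_cmd() {
    printf "perl -MCPAN -e 'install \"%s\"'\n" "$1"
}

# Dry run: show the command instead of executing it.
cpan_install_cmd HTML::TreeBuilder::LibXML
```

Printing first makes it easy to review the command before pasting it into a root shell.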
    --Ax
    without exception, there is no rule ...
    Handmade Irish Jewellery
    Targeted Advertising Cookie Optout (TACO) extension for Firefox
    The great thing about Object Oriented code is that it can make small, simple problems look like large, complex ones


    09 F9 11 02
    9D 74 E3 5B
    D8 41 56 C5
    63 56 88 C0
    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
    -- Jamie Zawinski
    Detavil - the devil is in the detail, allegedly, and I use the term advisedly, allegedly ... oh, no, wait I did ...
    BIT COINS ANYONE
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0
    Hi Axweildr,

    many thanks - you are very quick!!

    Originally Posted by Axweildr
    at the console
    perl -MCPAN -e shell
    on first run you'll need to configure it, just follow defaults, get local repositories
    cpan> install HTML::TreeBuilder::LibXML
    I am just running the above-mentioned command on the console...

    if you prefer the automatic configuration, answer 'yes' below.
    Would you like me to configure as much as possible automatically? [yes]
    I said yes. Now I am waiting while it works. I will come back and report all my findings...


    Until later.

    Metabo


    Update: there seems to be serious trouble. I think the system did not install TreeBuilder!? Look at the last lines in the console...

    CPAN.pm: Going to build M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz

    Warning: Prerequisite 'XML::XPathEngine => 0.12' for 'MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz' failed when processing 'MIROD/XML-XPathEngine-0.12.tar.gz' with 'make => NO'. Continuing, but chances to succeed are limited.
    Can't exec "make": No such file or directory at /usr/lib/perl5/5.12.1/CPAN/Distribution.pm line 2026.
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    make -- NOT OK
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-LibXML-0.12-dCBR48

    CPAN.pm: Going to build T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz

    Warning: Prerequisite 'HTML::TreeBuilder::XPath => 0.11' for 'TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz' failed when processing 'MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz' with 'make => NO'. Continuing, but chances to succeed are limited.
    Can't exec "make": No such file or directory at /usr/lib/perl5/5.12.1/CPAN/Distribution.pm line 2026.
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    make -- NOT OK
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible
    Failed during this command:
    MIROD/XML-XPathEngine-0.12.tar.gz : make NO
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz: make NO
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz : make NO
    Last edited by metabo; September 25th, 2010 at 10:06 AM.
  6. #4
  7. 'fie' on me, allege-dly
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2003
    Location
    in da kitchen ...
    Posts
    12,889
    Rep Power
    6444
    pay attention to the repositories ...
    --Ax
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0
    hi Axweildr,

    Thanks, but how!?

    Originally Posted by Axweildr
    pay attention to the repositories ...
    Hmm, have I done something wrong? Perhaps I have to redo it!

    I will try again, then come back and report all my findings.

    Until later

    Metabo


    Update: here are some strange findings

    cpan[2]> install HTML::TreeBuilder::LibXML
    Running install for module 'HTML::TreeBuilder::LibXML'
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-LibXML-0.12-dCBR48
    Could not make: Unknown error
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible

    cpan[3]>
    Last edited by metabo; September 25th, 2010 at 11:17 AM.
  10. #6
  11. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,261
    Rep Power
    1810
    It seems to me that once a make fails in CPAN, it gets stubborn about trying to make again.

    I'm concerned about the earlier errors though, since they said you didn't have the prerequisites installed.

    Try telling cpan to initialize again. Log in to the shell and send the command:
    o conf init
    You can accept the defaults for most things, but when you get to the prerequisites line, choose "follow".

    Life with cpan
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0

    Hello to you both, and many thanks for the help so far. See the results:


    Hello Keath, hello Axweildr, many thanks for your help so far.


    By the way, I learned a lot while following your hints! Incredible support here. I like this forum for its very supportive folks. Again, thanks to you both! metabo aka martin


    Many thanks for the hints. I had some bad responses while trying to install HTML::TreeBuilder::LibXML from CPAN.

    I have tried everything, including the ideas that you, Keath, posted. See the results below. After all attempts failed, I did the following in order to get some things done on my openSUSE 11.3: with YaST (the install manager) I tried to get the packages from the openSUSE repository:

    http://download.opensuse.org/repositories/devel:/languages:/perl/openSUSE_11.3/
    That went well, but I guess that I still do not have HTML::TreeBuilder::LibXML (from CPAN).

    I tried to verify the situation on my Perl installation. I went to the console and ran the following command:
    zypper in perl-HTML-Tree
    The console told me that perl-HTML-Tree is already installed.


    Question: is HTML::TreeBuilder::LibXML already included on my machine!?

    That is what I have so far. I wanted HTML::TreeBuilder::LibXML to do some HTML-parsing jobs, but I am not sure whether I have it installed now.

    Can I work with perl-HTML-Tree?


    @Keath: see the results of the CPAN approach below.

    I look forward to hearing from you

    regards metabo



    Originally Posted by keath
    It seems to me that once a make fails in CPAN, it gets stubborn about trying to make again.

    I'm concerned about the earlier errors though, since they said you didn't have the prerequisites installed.

    Try telling cpan to initialize again. Log in to the shell and send the command:
    o conf init
    You can accept the defaults for most things, but when you get to the prerequisites line, choose "follow".

    Life with cpan
    @Keath: see here the results that I got while trying to run your commands on the console.


    cpan[2]> o conf init

    CPAN is the world-wide archive of perl resources. It consists of about
    300 sites that all replicate the same contents around the globe. Many
    countries have at least one CPAN site already. The resources found on
    CPAN are easily accessible with the CPAN.pm module. If you want to use
    CPAN.pm, lots of things have to be configured. Fortunately, most of
    them can be determined automatically. If you prefer the automatic
    configuration, answer 'yes' below.

    If you prefer to enter a dialog instead, you can answer 'no' to this
    question and I'll let you configure in small steps one thing after the
    other. (Note: you can revisit this dialog anytime later by typing 'o
    conf init' at the cpan prompt.)

    Would you like me to configure as much as possible automatically? [yes] yes


    ALERT: 'make' is an essential tool for building perl Modules. Please make sure you have 'make' (or some equivalent) working.

    Your 'urllist' is already configured. Type 'o conf init urllist' to change it.

    Autoconfiguration complete.

    commit: wrote '/usr/lib/perl5/5.12.1/CPAN/Config.pm'

    cpan[3]> install HTML::TreeBuilder::LibXML
    Going to read '/root/.cpan/Metadata'
    Database was generated on Fri, 24 Sep 2010 22:28:50 GMT
    Running install for module 'HTML::TreeBuilder::LibXML'
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Checksum for /root/.cpan/sources/authors/id/T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz ok
    Scanning cache /root/.cpan/build for sizes
    ............................................................................DONE

    CPAN.pm: Going to build T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz

    Cannot determine perl version info from lib/HTML/TreeBuilder/LibXML.pm
    Checking if your kit is complete...
    Looks good
    Warning: prerequisite HTML::TreeBuilder::XPath 0.11 not found.
    Writing Makefile for HTML::TreeBuilder::LibXML
    ---- Unsatisfied dependencies detected during ----
    ---- TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz ----
    HTML::TreeBuilder::XPath [requires]
    Running make test
    Delayed until after prerequisites
    Running make install
    Delayed until after prerequisites
    Running install for module 'HTML::TreeBuilder::XPath'
    Running make for M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    Checksum for /root/.cpan/sources/authors/id/M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz ok

    CPAN.pm: Going to build M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz

    Checking if your kit is complete...
    Looks good
    Warning: prerequisite XML::XPathEngine 0.12 not found.
    Writing Makefile for HTML::TreeBuilder::XPath
    ---- Unsatisfied dependencies detected during ----
    ---- MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz ----
    XML::XPathEngine [requires]
    Running make test
    Delayed until after prerequisites
    Running make install
    Delayed until after prerequisites
    Running install for module 'XML::XPathEngine'
    Running make for M/MI/MIROD/XML-XPathEngine-0.12.tar.gz
    Checksum for /root/.cpan/sources/authors/id/M/MI/MIROD/XML-XPathEngine-0.12.tar.gz ok

    CPAN.pm: Going to build M/MI/MIROD/XML-XPathEngine-0.12.tar.gz

    Checking if your kit is complete...
    Looks good
    Writing Makefile for XML::XPathEngine
    Can't exec "make": No such file or directory at /usr/lib/perl5/5.12.1/CPAN/Distribution.pm line 2026.
    MIROD/XML-XPathEngine-0.12.tar.gz
    make -- NOT OK
    'YAML' not installed, will not store persistent state
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible
    Running make for M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-XPath-0.11-a5HoR1

    CPAN.pm: Going to build M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz

    Warning: Prerequisite 'XML::XPathEngine => 0.12' for 'MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz' failed when processing 'MIROD/XML-XPathEngine-0.12.tar.gz' with 'make => NO'. Continuing, but chances to succeed are limited.
    Can't exec "make": No such file or directory at /usr/lib/perl5/5.12.1/CPAN/Distribution.pm line 2026.
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    make -- NOT OK
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-LibXML-0.12-KNwSoq

    CPAN.pm: Going to build T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz

    Warning: Prerequisite 'HTML::TreeBuilder::XPath => 0.11' for 'TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz' failed when processing 'MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz' with 'make => NO'. Continuing, but chances to succeed are limited.
    Can't exec "make": No such file or directory at /usr/lib/perl5/5.12.1/CPAN/Distribution.pm line 2026.
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    make -- NOT OK
    Running make test
    Can't test without successful make
    Running make install
    Make had returned bad status, install seems impossible
    Failed during this command:
    MIROD/XML-XPathEngine-0.12.tar.gz : make NO
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz : make NO
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz: make NO
    @Keath and Axweildr: I guess that I have errors with make,
    but at the moment the question is:

    I wanted HTML::TreeBuilder::LibXML to do some HTML-parsing jobs, but I am not sure whether I have it installed now.

    Can I work with perl-HTML-Tree?

    I look forward to hearing from you ... Metabo
  14. #8
  15. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,261
    Rep Power
    1810
    You are still failing to build the prerequisites. You can do that manually, or you can check the cpan configuration again. At the cpan prompt, type 'o conf' and look at how it is set up. This is the option you are looking for:
    prerequisites_policy [follow]
    prerequisites_policy is the option, and 'follow' is the choice I made.

    Using that will allow cpan to automatically download and build whatever it needs to install. The link I provided above gives instructions on changing just that one option, if you don't want to do the whole init process.

    As to whether or not you can use HTML-Tree in place of HTML::TreeBuilder::LibXML, I don't know. I don't know what you are trying to do, and I've never used either module.

    Comments on this post

    • Axweildr agrees
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0
    Hello Keath, good evening. [And of course hello to the whole community.]

    Keath, many thanks for writing, very helpful! This is obviously an in-depth introduction to Perl as well as to Linux. Again, thanks to you and Axweildr for this great walkthrough. It is a great lesson with a nice, steep learning curve, and you are great guides. I like this forum for its supportive users!!

    Originally Posted by keath
    As to whether or not you can use HTML-Tree in place of HTML::TreeBuilder::LibXML, I don't know. I don't know what you are trying to do, and I've never used either module.
    I tried to follow your instructions, and then I remembered that I already had some issues with make yesterday. See the console logs and the responses further above.

    So I installed make:
    make - GNU make - The GNU make command with extensive documentation
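The repeated `Can't exec "make": No such file or directory` lines in the earlier logs point at a missing build tool rather than a CPAN problem. Here is a minimal sketch of how that could be checked before starting the cpan shell; `have_cmd` is a hypothetical helper name, and the suggested zypper command assumes the openSUSE package is simply called `make`, as the package description above indicates.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd make; then
    echo "make found at: $(command -v make)"
else
    # Assumption: on openSUSE the package is named "make"; run as root.
    echo "make is missing; try: zypper install make"
fi
```

The same check can be repeated for a C compiler if a module with compiled parts fails to build, although the three modules in this thread are copied as plain .pm files.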
    Afterwards I ran the command again: install HTML::TreeBuilder::LibXML

    This time it worked, without any break! See the log at the end of this post.
    As for what I want to use HTML::TreeBuilder::LibXML for: I want to parse a set of HTML files in order to extract a chunk of information from each of them.

    I think I will start a new thread for this task, since this one has come to a good and happy end...


    But wait, let me describe what I want to do:

    I have a large number of HTML files in a folder. I want to read each HTML file and extract a certain chunk of text from it into a new .txt file. I am only interested in that text content, and I want to create one single .txt file. I thought of using some Perl modules that support this task. Below is the piece of code I have so far; it is not tested yet.

    Here is an example of the HTML file, one of more than 14 thousand that all look the same. Note: there is some HTML stuff around this that is not wanted and can be stripped out:

    PHP Code:
    <h2>[b]Hit 7 out of 120517[/b]</h2>
    <img src="http://myweb.org/images/wappen/ni.gif" class="wappen_pos" width="45" height="53" alt="country" title="countryname" />
    <div style="width: 40em;">
    <div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://myWeb.org/222237520031111"></a></div>
    <div class="fm_linkeSpalte"><h2>name 1</h2>
    <span class="schulart_text">type: one (for example) </span>
    <p class="einzel_text">Adress: Paris, 3ne Boulevard Saint Lo
    <br />
    Telefon: 048 334555664  Fax: 048 334555667
    <br />
    MyWeb-Nummer: 222237520031111 <br />
    Webmaster: <a href="mailto: webmaster@demosite.fr" class="p1">master</a><br /></p> </div>
    <div>
    <p class="ta_left einzel_text">
    </p></div>
    <p class="ta_left einzel_text">[b]Listed since:[/b] 20.08.2002</p>
    </div>
    Note 1: I have included the important parts, which contain the chunk of text (all kinds of address data) that I am interested in.
    Note 2: there is some HTML stuff around this that is not wanted and can be stripped out!


    Conclusion: what do I need?

    1. I need to write everything into one big text file.
    2. The HTML tags have to be stripped.
    3. All results have to be stored in one single file, with some cleanup.

    How can we do this!?

    Here is the piece of code I have so far. It extracts the text using HTML::TokeParser::Simple, which is built on HTML::TokeParser (itself based on HTML::Parser). Some white space has been added to the HTML for clarity.

    PHP Code:
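    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    # Reads the HTML from the DATA handle, i.e. whatever follows a
    # __DATA__ marker at the end of this script.
    my $p = HTML::TokeParser::Simple->new(*DATA)
      or die qq{can't parse html: $!\n};

    my @text;
    while (my $t = $p->get_token) {
      # Only text tokens are of interest; tags and comments are skipped.
      next unless $t->is_text;
      my $txt = $t->as_is;
      # Flip-flop: true from the "Hit" text through the "Listed since" text.
      if ($txt =~ /Hit/ .. $txt =~ /Listed since/) {
        # Trim leading and trailing white space.
        for ($txt) {
          s/^\s+//;
          s/\s+$//;
        }
        next unless $txt;
        push @text, $txt;
      }
    }

    print qq{$_\n} for @text;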
    Intended result: I want to get output that looks like this:

    Hit 7 out of 120517
    name 1
    type: one (for example)
    Adress: Paris, 3ne Boulevard Saint Lo
    Telefon:048 + 334555664 , Fax: 048 + 334555667
    MyWeb-Nummer: 222237520031111
    Webmaster:
    master
    Listed since: 20.08.2002
    But this can be discussed in a new thread... I will open this in the next few hours.

    And now back to the topic of this thread. Again, thanks to
    Axweildr and Keath for this great walkthrough. It is a great lesson with a nice, steep learning curve, and you are great guides.

    Here are the results of installing HTML::TreeBuilder::LibXML:

    cpan[7]> install HTML::TreeBuilder::LibXML
    Going to read '/root/.cpan/Metadata'
    Database was generated on Fri, 24 Sep 2010 22:28:50 GMT
    Fetching with LWP:
    ftp://cpan.cpantesters.org/CPAN/authors/01mailrc.txt.gz
    Going to read '/root/.cpan/sources/authors/01mailrc.txt.gz'
    ............................................................................DONE
    Fetching with LWP:
    ftp://cpan.cpantesters.org/CPAN/modules/02packages.details.txt.gz
    Going to read '/root/.cpan/sources/modules/02packages.details.txt.gz'
    Database was generated on Sun, 26 Sep 2010 17:28:39 GMT
    ............................................................................DONE
    Fetching with LWP:
    ftp://cpan.cpantesters.org/CPAN/modules/03modlist.data.gz
    Going to read '/root/.cpan/sources/modules/03modlist.data.gz'
    ............................................................................DONE
    Going to write /root/.cpan/Metadata
    Running install for module 'HTML::TreeBuilder::LibXML'
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Checksum for /root/.cpan/sources/authors/id/T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz ok
    Scanning cache /root/.cpan/build for sizes
    ............................................................................DONE

    CPAN.pm: Going to build T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz

    Cannot determine perl version info from lib/HTML/TreeBuilder/LibXML.pm
    Checking if your kit is complete...
    Looks good
    Warning: prerequisite HTML::TreeBuilder::XPath 0.11 not found.
    Writing Makefile for HTML::TreeBuilder::LibXML
    ---- Unsatisfied dependencies detected during ----
    ---- TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz ----
    HTML::TreeBuilder::XPath [requires]
    Running make test
    Delayed until after prerequisites
    Running make install
    Delayed until after prerequisites
    Running install for module 'HTML::TreeBuilder::XPath'
    Running make for M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    Checksum for /root/.cpan/sources/authors/id/M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz ok

    CPAN.pm: Going to build M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz

    Checking if your kit is complete...
    Looks good
    Warning: prerequisite XML::XPathEngine 0.12 not found.
    Writing Makefile for HTML::TreeBuilder::XPath
    ---- Unsatisfied dependencies detected during ----
    ---- MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz ----
    XML::XPathEngine [requires]
    Running make test
    Delayed until after prerequisites
    Running make install
    Delayed until after prerequisites
    Running install for module 'XML::XPathEngine'
    Running make for M/MI/MIROD/XML-XPathEngine-0.12.tar.gz
    Checksum for /root/.cpan/sources/authors/id/M/MI/MIROD/XML-XPathEngine-0.12.tar.gz ok

    CPAN.pm: Going to build M/MI/MIROD/XML-XPathEngine-0.12.tar.gz

    Checking if your kit is complete...
    Looks good
    Writing Makefile for XML::XPathEngine
    cp lib/XML/XPathEngine/Literal.pm blib/lib/XML/XPathEngine/Literal.pm
    cp lib/XML/XPathEngine/Number.pm blib/lib/XML/XPathEngine/Number.pm
    cp lib/XML/XPathEngine.pm blib/lib/XML/XPathEngine.pm
    cp lib/XML/XPathEngine/Expr.pm blib/lib/XML/XPathEngine/Expr.pm
    cp lib/XML/XPathEngine/NodeSet.pm blib/lib/XML/XPathEngine/NodeSet.pm
    cp lib/XML/XPathEngine/Variable.pm blib/lib/XML/XPathEngine/Variable.pm
    cp lib/XML/XPathEngine/Root.pm blib/lib/XML/XPathEngine/Root.pm
    cp lib/XML/XPathEngine/Function.pm blib/lib/XML/XPathEngine/Function.pm
    cp lib/XML/XPathEngine/Step.pm blib/lib/XML/XPathEngine/Step.pm
    cp lib/XML/XPathEngine/Boolean.pm blib/lib/XML/XPathEngine/Boolean.pm
    cp lib/XML/XPathEngine/LocationPath.pm blib/lib/XML/XPathEngine/LocationPath.pm
    Manifying blib/man3/XML::XPathEngine::Literal.3pm
    Manifying blib/man3/XML::XPathEngine::Number.3pm
    Manifying blib/man3/XML::XPathEngine::NodeSet.3pm
    Manifying blib/man3/XML::XPathEngine.3pm
    Manifying blib/man3/XML::XPathEngine::Boolean.3pm
    MIROD/XML-XPathEngine-0.12.tar.gz
    make -- OK
    Running make test
    PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/00-load.t ....... 1/1 # Testing XML::XPathEngine 0.12, Perl 5.012001, /usr/bin/perl
    t/00-load.t ....... ok
    t/01_basic.t ...... ok
    t/pod-coverage.t .. skipped: Test::Pod::Coverage 1.04 required for testing POD coverage
    t/pod.t ........... skipped: Test::Pod 1.14 required for testing POD
    All tests successful.
    Files=4, Tests=34, 1 wallclock secs ( 0.10 usr 0.03 sys + 0.53 cusr 0.05 csys = 0.71 CPU)
    Result: PASS
    MIROD/XML-XPathEngine-0.12.tar.gz
    make test -- OK
    Running make install
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Expr.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Literal.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Root.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/NodeSet.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/LocationPath.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Boolean.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Function.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Number.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Variable.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/XML/XPathEngine/Step.pm
    Installing /usr/share/man/man3/XML::XPathEngine::Literal.3pm
    Installing /usr/share/man/man3/XML::XPathEngine.3pm
    Installing /usr/share/man/man3/XML::XPathEngine::Boolean.3pm
    Installing /usr/share/man/man3/XML::XPathEngine::Number.3pm
    Installing /usr/share/man/man3/XML::XPathEngine::NodeSet.3pm
    Appending installation info to /usr/lib/perl5/5.12.1/i586-linux-thread-multi/perllocal.pod
    MIROD/XML-XPathEngine-0.12.tar.gz
    make install -- OK
    Running make for M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-XPath-0.11-LlHD2d

    CPAN.pm: Going to build M/MI/MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz

    cp lib/HTML/TreeBuilder/XPath.pm blib/lib/HTML/TreeBuilder/XPath.pm
    Manifying blib/man3/HTML::TreeBuilder::XPath.3pm
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    make -- OK
    Running make test
    PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/HTML-TreeBuilder-XPath.t .. ok
    t/pod.t ..................... skipping, Test::Pod required
    t/pod.t ..................... ok
    t/pod_coverage.t ............ Test::Pod::Coverage 1.00 required for testing POD coverage at t/pod_coverage.t line 6.
    t/pod_coverage.t ............ ok
    t/test_following.t .......... ok
    t/test_preceding.t .......... ok
    All tests successful.
    Files=5, Tests=81, 2 wallclock secs ( 0.14 usr 0.02 sys + 1.10 cusr 0.10 csys = 1.36 CPU)
    Result: PASS
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    make test -- OK
    Running make install
    Installing /usr/lib/perl5/site_perl/5.12.1/HTML/TreeBuilder/XPath.pm
    Installing /usr/share/man/man3/HTML::TreeBuilder::XPath.3pm
    Appending installation info to /usr/lib/perl5/5.12.1/i586-linux-thread-multi/perllocal.pod
    MIROD/HTML-TreeBuilder-XPath-0.11.tar.gz
    make install -- OK
    Running make for T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/HTML-TreeBuilder-LibXML-0.12-KeQ5T7

    CPAN.pm: Going to build T/TO/TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz

    cp lib/HTML/TreeBuilder/LibXML/Node.pm blib/lib/HTML/TreeBuilder/LibXML/Node.pm
    cp lib/HTML/TreeBuilder/LibXML.pm blib/lib/HTML/TreeBuilder/LibXML.pm
    Manifying blib/man3/HTML::TreeBuilder::LibXML::Node.3pm
    Manifying blib/man3/HTML::TreeBuilder::LibXML.3pm
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    make -- OK
    Running make test
    PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'inc', 'blib/lib', 'blib/arch')" t/*.t
    t/00_compile.t .............. 1/1 # soft dependencies
    # HTML::TreeBuilder::XPath: 0.11
    t/00_compile.t .............. ok
    t/01_simple.t ............... # HTML::TreeBuilder::XPath
    t/01_simple.t ............... 1/62 # HTML::TreeBuilder::LibXML
    t/01_simple.t ............... ok
    t/02_web_scraper.t .......... skipped: this test requires Web::Scraper
    t/03_destructor.t ........... ok
    t/04_new_methods.t .......... ok
    t/05_empty.t ................ 1/15 HTML parser error : Document is empty

    ^
    t/05_empty.t ................ 8/15 HTML parser error : Document is empty

    ^
    HTML parser error : Document is empty

    ^
    t/05_empty.t ................ ok
    t/HTML-TreeBuilder-XPath.t .. ok
    All tests successful.

    Test Summary Report
    -------------------
    t/HTML-TreeBuilder-XPath.t (Wstat: 0 Tests: 19 Failed: 0)
    TODO passed: 10-13
    Files=7, Tests=101, 5 wallclock secs ( 0.17 usr 0.03 sys + 1.93 cusr 0.24 csys = 2.37 CPU)
    Result: PASS
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    make test -- OK
    Running make install
    Installing /usr/lib/perl5/site_perl/5.12.1/HTML/TreeBuilder/LibXML.pm
    Installing /usr/lib/perl5/site_perl/5.12.1/HTML/TreeBuilder/LibXML/Node.pm
    Installing /usr/share/man/man3/HTML::TreeBuilder::LibXML.3pm
    Installing /usr/share/man/man3/HTML::TreeBuilder::LibXML::Node.3pm
    Appending installation info to /usr/lib/perl5/5.12.1/i586-linux-thread-multi/perllocal.pod
    TOKUHIROM/HTML-TreeBuilder-LibXML-0.12.tar.gz
    make install -- OK
    And later today I will start a new thread... ;-)
    Last edited by metabo; September 26th, 2010 at 01:29 PM.
  18. #10
  19. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,261
    Rep Power
    1810
    When it comes to parsing HTML, context is everything.

    In a case such as the one you provided, I would want to know what comes before the <h2> tag. There may be many h2 tags, or only this one. The heading may appear in a <div> with a unique ID, which would be even better.

    I recommend you be specific, and provide as much of the file as possible for best assistance.
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0
    Hello Keath,

    many thanks for the quick reply. You are very quick.


    Originally Posted by keath
    When it comes to parsing HTML, context is everything.
    I recommend you be specific, and provide as much of the file as possible for best assistance.
    Okay Keath, you convinced me! I guess it is pretty important to provide as much information as possible. The task: see the following webpage (a somewhat old and in some respects outdated page, created by some administrative departments of the German school authority).

    See the page (it can take a few seconds to load, since it is a very big set of data): http://schulweb.de/de/schulsuche/liste.html?trefferzahlauswahl=alle&x=19&y=8&kategorie=&region=de&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=

    Here you have an overview of German schools:

    "Treffer 1 - 10517 von 10517" (hits 1 - 10517 of 10517, which means there are 10517 schools listed on this page). Each line shows one school, such as the following:

    • 1. Stiftung Louisenlund, 24357 Güby
      2. Bayerische Landesanstalt für Weinbau und Gartenbau, 97205 Veitshöchheim
      3. Katharina-Fischer-Schule Sonderpädagogisches Förderzentrum Erding, 85435 Erding
      4. 02 Grundschule Reinickendorf (Am Schäfersee), 13407 Berlin
      and so on...

    If you click on the "i", which stands for information, you get a page that provides more information on the school:


    Name: the school xy
    School type: one (for example)
    Address: Paris, 3ne Boulevard Saint Lo
    Telephone: 048 + 334555664
    Fax: 048 + 334555667
    Web-Nummer: 222237520031111
    Webmaster: Admin@school-Paris.fr
    Listed since: 20.08.2002

    Well, I have downloaded all the HTML pages to my computer. They are stored on my Linux machine. Now I want to read out the information pages. Each page should be opened and parsed in order to get the text chunk. I want to store the text in a local MySQL database, or at least in an OpenOffice spreadsheet. But the first task is to get the address text out of the 10517 HTML files. I think this is a job for Perl; with Perl it should be easy to do.

    So, in order to provide you with as much information as possible, take a look at one of the result pages. It shows the following:
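
    The batch part of this job (open every stored page, pull out some fields, write one row per school) could be sketched roughly as follows. This is only a sketch under assumptions: the directory name, the output file name and the subs csv_line and pages_to_csv are invented for illustration, and the extractor passed in is a placeholder that returns the file size instead of real school data. It writes a CSV file, which a spreadsheet program can import.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a list of fields into one CSV line: every field is wrapped in
# double quotes, embedded double quotes are doubled.
sub csv_line {
    my @fields = map { defined $_ ? $_ : '' } @_;
    s/"/""/g for @fields;
    return join(',', map { qq{"$_"} } @fields) . "\n";
}

# Run $extract on every *.html file in $dir and write one row per file.
# $extract is a code ref: it gets a file name and returns a list of fields.
# Returns the number of files processed.
sub pages_to_csv {
    my ($dir, $csv_file, $extract) = @_;

    open my $out, '>', $csv_file or die "Can't write $csv_file: $!";

    my @files = glob("$dir/*.html");
    my $count = 0;
    foreach my $file (sort @files) {
        print {$out} csv_line($file, $extract->($file));
        $count++;
    }
    close $out or die "Can't close $csv_file: $!";
    return $count;
}

# Usage: perl pages_to_csv.pl <directory> <output.csv>
# The extractor here is a placeholder; a real one would parse the page.
if (@ARGV == 2) {
    my $n = pages_to_csv($ARGV[0], $ARGV[1], sub { my $f = shift; return (-s $f) });
    print "$n pages written to $ARGV[1]\n";
}
```

    Keeping the extractor as a separate code ref means the loop does not need to change once a real page parser exists.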

    Bayerische Landesanstalt für Weinbau und Gartenbau
    Schulart: Berufsbildungseinrichtung
    Adresse: 97205 Veitshöchheim, An der Steige 15
    Telefon: 0931/ 9801-0, Fax: 0931/ 9801-100
    SchulWeb-Nummer: 9720500
    Email: poststelle@lwg.bayern.de
    Webmaster: Michael Gengenbach

    Description: Die beiden Schulen sind Teil der Bayerischen Landesanstalt für Weinbau und Gartenbau (LWG) mit ihren Forschungs- und Informationseinrichtungen. Neu in Veitshöchheim ist die Internet-Fachklasse in der Fachrichtung Garten- und Landschaftsbau. Sie wendet sich an Interessenten, die bereits mit beiden Beinen im Berufsleben stehen. Dank kurzer Präsenzphasen in den Wintermonaten und ergänzendem Online-Unterricht am heimischen Bildschirm ist eine Qualifikation zum Meister neben der Berufstätigkeit möglich.
    Im SchulWeb seit: 15.04.1997 (end of quotation)

    Very interesting: on this page you see a little house, a symbol that refers to the school's web link. When you move the mouse over this symbol you see the link. For the page mentioned above it is http://schulweb.de/9720500

    Well, this would be another great Perl task. The addresses have to be "translated" somehow, so that we have the real address in the result. Could we call what Perl has to do here a "link extractor function"? Anyway, I need the links, either in untranslated or (much better) in translated form.
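
    A "link extractor" of this kind could look roughly like the sketch below, which collects the href of every <a> tag with HTML::TokeParser. The sample markup and the sub name extract_links are assumptions for illustration; the real page may be structured differently. It only gives the untranslated links: turning a short link into the real address would additionally need an HTTP request, which is left out here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Collect the href attribute of every <a> tag in an HTML string.
# Anchors without an href (for example <a name="...">) are skipped.
sub extract_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse document";

    my @links;
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        push @links, $href if defined $href and length $href;
    }
    return @links;
}

# Hypothetical fragment; the markup of the real page may differ
my $sample = q{
<p><a href="http://schulweb.de/9720500"><img src="haus.gif" alt="Homepage"></a>
<a name="top">no href here</a>
<a href="mailto:poststelle@lwg.bayern.de">Email</a></p>
};

print "$_\n" for extract_links($sample);
```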

    For the general parser job: I guess that if you take a closer look at one of the pages, you will see the full information. I thought it would be best to take a parser that helps us strip the many HTML tags and leaves only a small piece of text behind. After experimenting with a few CPAN modules I found that HTML-Parser could do the job. Note: every one of the 10517 pages looks the same! So if we can parse one page, then we can parse every page. ;-)

    But Keath, I am pretty sure that you have many more ideas about how this job can be done. I know that you are an expert in Perl tasks.

    So I really look forward to hearing your ideas.

    Many thanks in advance!

    Best regards
    metabo

    Note: I am working as a teacher, and I want to provide my colleagues, the pupils, the parents and, last but not least, everyone who is interested in these data with more up-to-date information. I think that this old database needs an overhaul. You can see that the site is not maintained very well.
    So I am starting to grab all the data and parse it, in order to get a new set of complete information.
    By the way: the example mentioned above, which is the fourth of more than 10,000, shows a very interesting detail: this school was listed in the year 1997 (see the last line of the quotation: Im SchulWeb seit: 15.04.1997). Thirteen years: that is not very new. Among these 10517 pages there are many entries which need to be updated. That is the job I want to do. But first of all I have to get the information into one single document.

    Well, I hope that the parser job can be done with Perl ;-)

    Again, I look forward to hearing your ideas...
    Last edited by metabo; September 26th, 2010 at 06:20 PM.
  22. #12
  23. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,261
    Rep Power
    1810
    This parses the detail page for a school:
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use HTML::TokeParser;
    
    my $file = 'school.html';
    my $p = HTML::TokeParser->new($file) or die "Can't open: $!";
    
    my %school;
    while (my $tag = $p->get_tag('div', '/html')) {
    	# first move to the right div that contains the information
    	last if $tag->[0] eq '/html';
    	next unless exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'inhalt_large';
    	
    	$p->get_tag('h1');
    	$school{'location'} = $p->get_text('/h1');
    	
    	while (my $tag = $p->get_tag('div')) {
    		last if exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'fusszeile';
    		
    		# get the school name from the heading
    		next unless exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'fm_linkeSpalte';
    		$p->get_tag('h2');
    		$school{'name'} = $p->get_text('/h2');
    		
    		# verify format for school type
    		$tag = $p->get_tag('span');
    		unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'schulart_text') {
    			warn "unexpected format: parsing stopped";
    			last;
    		}
    		$school{'type'} = $p->get_text('/span');
    		
    		# verify format for address
    		$tag = $p->get_tag('p');
    		unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'einzel_text') {
    			warn "unexpected format: parsing stopped";
    			last;
    		}
    		$school{'address'} = clean_address($p->get_text('/p'));
    		
    		# find the description
    		$tag = $p->get_tag('p');
    		$school{'description'} = $p->get_text('/p');
    	}
    }
    
    print qq/$school{'name'}\n/;
    print qq/$school{'location'}\n/;
    print qq/$school{'type'}\n/;
    
    foreach (@{$school{'address'}}) {
    	print "$_\n";
    }
    
    print qq/\nDescription: $school{'description'}\n/;
    
    sub clean_address {
    	my $text = shift;
    	my @lines = split "\n", $text;
    	foreach (@lines) {
    		s/^\s+//;
    		s/\s+$//;
    	}
    	return \@lines;
    }
    That's as much as I can work on it today.
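
    One possible refinement, offered only as a sketch: clean_address above trims each line but may leave empty entries when the text contains blank or whitespace-only lines. The variant below (the name clean_address_lines and the sample input are made up for illustration) drops those entries as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Like clean_address, but also drops lines that are empty after trimming.
# Returns a reference to the list of remaining lines.
sub clean_address_lines {
    my $text = shift;
    my @lines;
    foreach my $line (split /\n/, $text) {
        $line =~ s/^\s+//;    # strip leading whitespace
        $line =~ s/\s+$//;    # strip trailing whitespace
        push @lines, $line if length $line;
    }
    return \@lines;
}

# Made-up input that mimics text pulled from an indented <p> block
my $raw = "\n   Telefon: 0931/ 9801-0, Fax: 0931/ 9801-100\n\n   SchulWeb-Nummer: 9720500\n  ";

print "$_\n" for @{ clean_address_lines($raw) };
```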
  24. #13
  25. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    317
    Rep Power
    0

    many many thanks!!


    Hi Keath,

    Many thanks for the quick reply. Incredible, I am amazed! I will try this out later tonight...

    Again, many thanks for all your help! This is very supportive!

    You rock!
    Best regards
    metabo

    PS: I will come back and report all the findings...

