#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2016
    Posts
    5
    Rep Power
    0

    Regex help - parsing a website


    Hello,

    I have the following sample of a website that I would like to parse with "preg_match" in PHP:

    Code:
    </table></div><div class="span-11 prepend-1" style="float:right"><table class="info"><tr><th colspan="2" class="round-top">HEIZEN</th></tr>  <tr class="even">
        <td class="key">AUSSENTEMPERATUR</td>
        <td class="value">22,6 °C</td>
      </tr>
      <tr class="odd">
        <td class="key">ISTWERT HK1</td>
        <td class="value">25,2 °C</td>
      </tr>
      <tr class="even">
        <td class="key">SOLLWERT HK1</td>
        <td class="value">28,0 °C</td>
      </tr>
      <tr class="odd">
        <td class="key">ISTWERT HK2</td>
        <td class="value">25,5 °C</td>
      </tr>
      <tr class="even">
        <td class="key">SOLLWERT HK2</td>
        <td class="value">30,1 °C</td>
      </tr>
      <tr class="odd">
        <td class="key">VORLAUFTEMPERATUR</td>
        <td class="value">24,8 °C</td>
      </tr>
      <tr class="even">
        <td class="key">RÜCKLAUFTEMPERATUR</td>
        <td class="value">25,4 °C</td>
      </tr>
      <tr class="odd">
        <td class="key">DRUCK HEIZKREIS</td>
        <td class="value">1,3 bar</td>
      </tr>
      <tr class="even">
        <td class="key round-leftbottom">VOLUMENSTROM</td>
        <td class="value round-rightbottom">0,0 l/min</td>
      </tr>
    </table>
    I would like to use the header for this section ("HEIZEN") and then the key name to finally get the values.

    The last row is a special case: its cells carry extra classes (class="value round-rightbottom"), so it is not always exactly <td class="value">.

    It would be nice if somebody could post a regex and maybe explain it for a noob.

    Greetings, Erich
  2. #2
  3. Headless Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,977
    Rep Power
    9647
    Don't use regular expressions for parsing HTML. It sucks. They're really bad for it.

    Do use DOMDocument, and possibly DOMXPath for more complicated stuff. Check the user comments for examples, or ask and I'll move this thread to the PHP forum and we can talk about your code.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Ooh, I have never heard of the DOM stuff before.

    I have had some success with regex. I didn't post my pattern at first so as not to confuse you, because it doesn't work in all situations with "preg_match".
    When I use an online regex tester it works, but not with preg_match.

    So maybe you can see the mistake:

    /HEIZEN.*?RÜCKLAUFTEMPERATUR.*?([0-9,?\???.]+)/s

    Thanks Erich

    P.S. If you give me an example with DOM, of course I will try to get used to it and implement it.
    My main focus is on getting the values; it doesn't matter how.
  6. #4
  7. Headless Moderator
    Devshed Supreme Being (6500+ posts)

    See, here's the problem: to do this properly you need to

    1. Find the <table> that says "HEIZEN",
    2. Find all the <tr>s that contain key/value pairs
    3. Extract the key and value from each pair

    Doing that with a regular expression for HTML is hard.

    But with DOM it's easy:
    PHP Code:
    $dom = new DOMDocument();
    // $dom->loadHTMLFile(file) or $dom->loadHTML(string)

    $xpath = new DOMXPath($dom);

    // find all the <tr>s
    $trs = $xpath->query("//table[tr/th='HEIZEN']/tr[@class]"); // checking each <tr> for a class seems sufficient, and
    //$trs = $xpath->query("//table[tr/th='HEIZEN']/tr[td[contains(@class,'key')]][td[contains(@class,'value')]]"); is kinda long
    foreach ($trs as $tr) {
        // grab the key and value. using | because it's unlikely that the "key" contains it
        $kvp = $xpath->evaluate("concat(td[contains(@class,'key')], '|', td[contains(@class,'value')])", $tr);
        list($key, $value) = explode("|", $kvp, 2);
        // ...
    }

    Now I wouldn't expect someone new to DOM to just come up with that, but since I'm writing the code and I know what I'm doing then I'll do the thing.
    Take a couple minutes to just read through that. You don't have to fully understand what query() and evaluate() are doing, but if you look at the string then it shouldn't be too hard to make an educated guess about it.

    I'm quite good with regular expressions. That code above took minutes to write and test. Trying to do the same thing with regular expressions could take me all day because they just don't have the tools needed to understand HTML.
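
    For anyone who wants to try the approach in isolation, here is a minimal self-contained sketch. It assumes a trimmed two-row copy of the sample table from the first post, passed in as a string; the $readings array is a name introduced only for this illustration.

    ```php
    <?php
    // Assumption: a trimmed copy of the sample table from post #1 (two rows only).
    $html = '<table class="info"><tr><th colspan="2" class="round-top">HEIZEN</th></tr>
      <tr class="odd">
        <td class="key">DRUCK HEIZKREIS</td>
        <td class="value">1,3 bar</td>
      </tr>
      <tr class="even">
        <td class="key round-leftbottom">VOLUMENSTROM</td>
        <td class="value round-rightbottom">0,0 l/min</td>
      </tr>
    </table>';

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // $readings is a hypothetical name: key => value pairs of the HEIZEN table.
    $readings = [];
    foreach ($xpath->query("//table[tr/th='HEIZEN']/tr[@class]") as $tr) {
        // contains() also matches cells with extra classes such as "value round-rightbottom"
        $kvp = $xpath->evaluate("concat(td[contains(@class,'key')], '|', td[contains(@class,'value')])", $tr);
        list($key, $value) = explode("|", $kvp, 2);
        $readings[trim($key)] = trim($value);
    }

    foreach ($readings as $key => $value) {
        echo "$key => $value\n";
    }
    ```

    Matching on contains(@class, ...) rather than an exact class is what covers the "special case" last row, whose cells carry additional rounding classes.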
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)


    Thumbs up


    @requinix

    You are a genius!

    I had to install "php-xml" on my CentOS.
    And one modification was necessary.

    Instead of $dom->loadHTML(string)
    I had to use $dom->loadHTMLFile(URL)

    Now I get the following result:

    Code:
    AUSSENTEMPERATUR => 20,0 °C
    ISTWERT HK1 => 24,4 °C
    SOLLWERT HK1 => 21,5 °C
    ISTWERT HK2 => 24,8 °C
    SOLLWERT HK2 => 29,3 °C
    VORLAUFTEMPERATUR => 24,0 °C
    RÜCKLAUFTEMPERATUR => 24,9 °C
    DRUCK HEIZKREIS => 1,3 bar
    VOLUMENSTROM => 0,0 l/min
    Is there an easy way to get only the values, without the units?

    Thanks

    Erich
  10. #6
  11. Headless Moderator
    Devshed Supreme Being (6500+ posts)

    It looks like they're formatted as "number unit" - that is, a space between them. You can repeat the explode() thing to split the $value apart by the space:
    PHP Code:
    list($num, $unit) = explode(" ", $value, 2);
    Or you could cast the string to a float. PHP will stop at the first character it can't read as part of a number and ignore the rest. Note that the cast always expects "." as the fractional separator, whatever locale PHP is running with, so replace the comma first.
    PHP Code:
    $num = (float)str_replace(",", ".", $value);
    If you only get the whole part (20, 24, 21, etc.) and not the whole value (20.0, 24.4, 21.5) then the comma was not replaced; changing the locale will not fix that.
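
    A small sketch combining both ideas, for illustration only: split off the unit, then convert the number. The splitReading() helper is hypothetical and not from this thread. It relies on the fact that PHP's (float) cast parses "." as the decimal separator and does not consult the locale, so the comma has to be replaced by hand.

    ```php
    <?php
    // Hypothetical helper, not from the thread: splits "24,4 °C" into a float and a unit.
    function splitReading(string $value): array
    {
        // Pad so that a value without a unit still yields two elements.
        list($num, $unit) = array_pad(explode(" ", trim($value), 2), 2, "");
        // (float) expects "." as the decimal separator, so swap the comma first.
        return [(float) str_replace(",", ".", $num), $unit];
    }

    list($num, $unit) = splitReading("1,3 bar");
    echo $num, " ", $unit, "\n";
    ```

    This keeps the locale settings untouched, which matters if other tools on the same machine depend on them.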
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Nice - your first method works perfectly.

    I didn't want to change the locale settings because I don't know the impact on other tools.

    Thanks, Erich
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Update:
    When the query looks like this:

    Code:
    $trs = $xpath->query("//table[tr/th='KÜHLEN']/tr[@class]");
    (so it contains a "Ü"), it fails.
    Do you know what I have to do?

    Erich
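
    The thread ends without an answer to this. One possible cause, offered purely as an assumption, is an encoding mismatch: DOM works with UTF-8 strings, so the 'KÜHLEN' literal in the XPath query must be UTF-8 as well. If the PHP script is saved as ISO-8859-1, the "Ü" is a single byte that will not match the page text. The sketch below uses a made-up one-row KÜHLEN table (not real data from the page) and converts a Latin-1 name with iconv() before querying.

    ```php
    <?php
    // Made-up sample page, UTF-8 encoded; "\u{00DC}" is "Ü" and "\u{00B0}" is "°".
    $html = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head><body>"
          . "<table class=\"info\"><tr><th>K\u{00DC}HLEN</th></tr>"
          . "<tr class=\"even\"><td class=\"key\">SOLLWERT</td><td class=\"value\">21,0 \u{00B0}C</td></tr>"
          . "</table></body></html>";

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // What the literal would look like in a script saved as ISO-8859-1 (assumed scenario).
    $latin1Name = "K\xDCHLEN";
    // Convert to UTF-8 before building the XPath expression.
    $utf8Name = iconv('ISO-8859-1', 'UTF-8', $latin1Name);

    $rows = $xpath->query("//table[tr/th='" . $utf8Name . "']/tr[@class]");
    echo $rows->length, "\n";
    ```

    Saving the script itself as UTF-8 would make the conversion unnecessary. If the page is served in a different encoding without declaring it, that would need separate handling.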
