#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Location
    aussie
    Posts
    8
    Rep Power
    0

    *Newbie needs help...


    Hi all! I'm new to VB.Net, and pretty much a beginner in programming. I'm trying to write an HTML parser/search algorithm, but it doesn't seem to be working. Please guide me, thanks.

    OK, so basically the idea is really simple: the function receives a search string and the related StreamReader (loaded from a website or anywhere), and it scans through the text until it finds the beginning of an anchor ("<a"). Then it uses a queue-structured loop to find an "href=" heading, and similarly reads the hyperlink into a temp variable. After this, it checks whether the anchored text matches the given search string, and if it does, the function returns the hyperlink.
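
    For illustration only, here is a minimal sketch of the scan described above, written in Python rather than VB.Net. It is not the poster's code: `search_for`, `read_until` and `HREF` are hypothetical names, a `deque` with `maxlen` stands in for the Dequeue/Enqueue window, and it assumes quoted href values and plain-text link text (no nested tags).

```python
import io
from collections import deque

HREF = "href="

def read_until(stream, stop_chars):
    """Read up to (not including) the first character found in stop_chars.

    Returns (text_read, stop_char); stop_char is "" at end of stream.
    """
    chars = []
    while True:
        ch = stream.read(1)
        if ch == "" or ch in stop_chars:
            return "".join(chars), ch
        chars.append(ch)

def search_for(search_string, stream):
    """Return the href of every anchor whose link text contains search_string."""
    results = []
    while True:
        _, stop = read_until(stream, "<")        # skip ahead to the next tag
        if stop == "":
            return results                        # end of stream
        start = stream.read(2).lower()
        if len(start) < 2 or start[0] != "a" or not start[1].isspace():
            continue                              # not an <a ...> tag
        # Slide a fixed-size window over the tag until it spells "href=".
        window = deque(maxlen=len(HREF))
        link = None
        while link is None:
            ch = stream.read(1)
            if ch == "" or ch == ">":
                break                             # tag ended without an href
            window.append(ch.lower())
            if "".join(window) == HREF:
                quote = stream.read(1)
                if quote not in ("'", '"'):
                    break                         # unquoted href: give up on this tag
                link, _ = read_until(stream, quote)
                read_until(stream, ">")           # skip the rest of the opening tag
        if link is None:
            continue
        text, _ = read_until(stream, "<")         # link text up to the next tag
        if search_string in text:
            results.append(link)

html = '<p>News</p><a class="n" href="/one">First story</a> <a href=\'/two\'>Second</a>'
print(search_for("story", io.StringIO(html)))     # prints ['/one']
```

    The fixed-size deque drops its oldest character automatically, so the window never has to be emptied by hand between comparisons.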

    The problem now is that the program just scans through the HTML stream without any of the If statements being triggered; it doesn't seem to respond to the opening "<" or anything else.

    Please help! Am I using a very stupid method? Here is the code:
    ============================================
    Imports System.IO

    Public Class SearchClass

        Public Sub SearchClass()

        End Sub

        '***Method 1.0 - search function***
        Public Function searchFor(ByVal searchString As String, ByVal SR As StreamReader)
            Dim resultLink As New ArrayList()
            Dim count As Integer

            While (SR.Peek() > -1)

                'Section 1.2 - searching for [<a] anchor statement in html
                If SR.Read().Equals("<") Then

                    '===test code==='
                    System.Windows.Forms.MessageBox.Show("1.2 begin found ""<""")
                    '===end test code==='

                    If SR.Peek().Equals("a") Then

                        SR.Read()
                        Dim tempQueue As New Queue()
                        Dim hrefString As String = "href="""
                        Dim hrefstring2 As String = "href='"
                        Dim hrefCompareString As String
                        Dim tempLink As String

                        'Section 1.3 - enqueuing Chars to search for [href="] statement in html
                        While Not SR.Peek().Equals("<")

                            If tempQueue.Count() < hrefString.Length Then
                                tempQueue.Enqueue(SR.Read())
                            Else
                                tempQueue.Dequeue()
                                tempQueue.Enqueue(SR.Read())
                            End If

                            If tempQueue.Count() = hrefString.Length Then
                                Dim myCollection As IEnumerable = tempQueue
                                Dim myEnumerator As System.Collections.IEnumerator = myCollection.GetEnumerator()
                                While myEnumerator.MoveNext()
                                    hrefCompareString = String.Concat(hrefCompareString, myEnumerator.Current())
                                    '===begin TEST code===
                                    System.Windows.Forms.MessageBox.Show(vbCrLf & hrefCompareString)
                                    '===end TEST code===
                                End While
                            End If

                            'Section 1.4 - enqueuing hyperlink to tempLink (ONLY when [href="] is found)
                            If hrefCompareString.Equals(hrefString) Or hrefCompareString.Equals(hrefstring2) Then
                                Dim tempQueue2 As New Queue()
                                While Not SR.Peek().Equals("""") Or SR.Peek().Equals("'")
                                    tempQueue2.Enqueue(SR.Read())
                                End While
                                Dim myCollection2 As IEnumerable = tempQueue2
                                Dim myEnumerator2 As System.Collections.IEnumerator = myCollection2.GetEnumerator()
                                While myEnumerator2.MoveNext()
                                    tempLink = String.Concat(tempLink, myEnumerator2.Current())
                                End While

                                'Section 1.5 - Look for searchString match until end of anchor ([</a>])
                                Dim tempString As String
                                Dim tempQueue3 As New Queue()

                                While Not SR.Peek.Equals(">")
                                    SR.Read()
                                End While

                                While Not SR.Peek().Equals(">")
                                    If tempQueue3.Count() < searchString.Length Then
                                        tempQueue3.Enqueue(SR.Read())
                                    Else
                                        tempQueue3.Dequeue()
                                        tempQueue3.Enqueue(SR.Read())
                                    End If

                                    If tempQueue.Count() = searchString.Length Then
                                        Dim myCollection3 As IEnumerable = tempQueue3
                                        Dim myEnumerator3 As System.Collections.IEnumerator = myCollection3.GetEnumerator()
                                        While myEnumerator3.MoveNext()
                                            tempString = String.Concat(tempString, myEnumerator3.Current())
                                        End While

                                        If tempString.Equals(searchString) Then
                                            resultLink.Add(tempLink)
                                        End If
                                    End If
                                End While
                                'end of section 1.5
                            End If
                            'end of section 1.4 (storing templink)
                        End While
                        'end of section 1.3 (searching for [href="] head within anchor
                    End If
                End If
                'end of section 1.2 (searching for anchor)
                count += 1
            End While
            'end of SR stream

            System.Windows.Forms.MessageBox.Show("Search ended with " & count)

            Return resultLink
        End Function
        '***End of search function 1.0***
    End Class
    ==============================================
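
    A likely reason none of the If branches fire: StreamReader.Read() and Peek() return an Integer character code (or -1 at the end of the stream), not a character, so Equals("<") compares a number with a String and comes out False every time. The Python sketch below imitates that behaviour; `read_code` is a hypothetical stand-in for SR.Read(), not a real API.

```python
import io

def read_code(stream):
    """Hypothetical stand-in for SR.Read(): returns a character code, or -1 at end of stream."""
    ch = stream.read(1)
    return ord(ch) if ch else -1

stream = io.StringIO("<a href='x.html'>")
code = read_code(stream)          # integer code of "<"

# An integer never equals a string, so a test like this can never be true.
wrong = (code == "<")

# Converting the code to a character first makes the comparison meaningful.
right = (chr(code) == "<")

print(code, wrong, right)         # prints: 60 False True
```

    In VB.Net the corresponding change would presumably be to convert before comparing, for example ChrW(SR.Read()) = "<"c, and to convert the values before they are enqueued so that String.Concat joins characters rather than numbers.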
  2. #2
  3. Superhero
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Location
    OH
    Posts
    27
    Rep Power
    0
    I don't know much ASP, but I did this recently using PHP and regular expression matching - posting it in case it helps.

    Basically, it uses the curl library to store the html in a variable, searches the variable for a specific string ($anchor) then stores all hyperlinks after the anchor that match the regular expression in $matchstr.

    A typical $matchstr and $anchor would be:
    $siteurl = 'http://msn.espn.go.com/';
    $matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';
    $anchor = '/Top News Headlines(.*?)/';

    Which would pull the hyperlinks from the Headline News section of ESPN's website

    PHP Code:

    // Connect to web site and store HTML in text string $s
    $c = curl_init($siteurl);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    $s = curl_exec($c);
    curl_close($c);

    $a = array();
    $b = array();

    // If the anchor is valid, split the HTML, storing everything after the anchor in text string $b
    if (preg_match($anchor, $s))
        $b = preg_split($anchor, $s);
    // If not, the regular expression for the anchor doesn't match any text on the site - generate error
    else
        return 'Error - Anchor ['.$sitename.']';

    // Pull out urls as indicated by the regular expression in $matchstr ($b[1] contains the text after the anchor)
    if (preg_match_all($matchstr, $b[1], $matches, PREG_SET_ORDER))
    {
        foreach ($matches as $match)
        {
            // Put url and text into an array
            array_push($a, array($match[1], $match[2]));
        }
    }
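
    The same flow can be sketched in Python (used here only for illustration; .NET offers comparable matching through the Regex class in System.Text.RegularExpressions). This is a sketch under assumptions, not a port of the code above: `links_after` and `LINK_RE` are hypothetical names, it works on an HTML string that has already been downloaded, it keeps everything after the first anchor match, and the pattern is a stricter variant that leaves the quotes out of the captured URL.

```python
import re

# Stricter variant of the thread's pattern (an assumption, not the poster's regex):
# group 1 is the URL without its quotes, group 2 is the link text.
LINK_RE = re.compile(
    r'<a\s+[^>]*?href\s*=\s*["\']?([^"\'\s>]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def links_after(html, anchor_pattern):
    """Return (url, text) pairs for every link that follows the anchor text."""
    parts = re.split(anchor_pattern, html, maxsplit=1)
    if len(parts) < 2:
        # The anchor pattern matched nothing, so there is no "after" section.
        raise ValueError("anchor not found: %r" % anchor_pattern)
    # The last element is the text after the anchor, even if the
    # anchor pattern contains capture groups of its own.
    return LINK_RE.findall(parts[-1])

html = ('<a href="/home">Home</a><h2>Top News Headlines</h2>'
        '<a href="/a1">Story one</a> <a class="x" href=\'/a2\'>Story two</a>')
for url, text in links_after(html, r"Top News Headlines"):
    print(url, text)
```

    Links that appear before the anchor text, such as the "Home" link in the sample, are left out because only the part after the split is searched.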
  4. #3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2003
    Location
    aussie
    Posts
    8
    Rep Power
    0
    ??? I don't really understand PHP.. eheheh.. ^_^"
  6. #4
  7. Superhero
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Location
    OH
    Posts
    27
    Rep Power
    0
    If ASP has some pattern matching functions like preg_match series in PHP, I think that and regular expressions would be the way to go.

    Maybe take a look at http://www.php.net/preg_match to get an idea what they do and look for similar functions in ASP?

    Unfortunately, I don't know enough ASP to do the conversion for you, but the logic would remain the same:
    1. Find the anchor using a regular expression match:
    PHP Code:
    if (preg_match($anchor, $s))
        $b = preg_split($anchor, $s);
    2. Find the hyperlink using another regular expression match that flags the url and the text of the hyperlink - (.*?) is the placeholder for a value you want to save from the match:
    PHP Code:
    $matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';

    if (preg_match_all($matchstr, $b[1], $matches, PREG_SET_ORDER))
    3. Store them for use later:
    PHP Code:
    foreach ($matches as $match)
    {
        // Put url and text into an array
        array_push($a, array($match[1], $match[2]));
    }
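
    To see what the two (.*?) groups capture, here is a small Python transcription of $matchstr run against a made-up link (the sample HTML and the names `MATCHSTR`, `url`, `text` and `cleaned` are assumptions for the example). Because the first group runs up to the tag's closing ">", the closing quote is captured along with the URL, so it may need trimming.

```python
import re

# The thread's pattern transcribed to Python; each (.*?) is a lazy capture group.
MATCHSTR = re.compile(r'<a\s+.*?href=["\'\s]?(.*?)>(.*?)</a>', re.IGNORECASE)

html = '<a href="story.html">Big game tonight</a>'
m = MATCHSTR.search(html)

url = m.group(1)     # first (.*?): everything between the opening quote and ">"
text = m.group(2)    # second (.*?): the link text between ">" and "</a>"

# Group 1 still carries the closing quote, so strip quotes from both ends.
cleaned = url.strip('"\'')

print(repr(url), repr(text), repr(cleaned))
```

    Lazy groups stop at the first place the rest of the pattern can match, which is why the link text ends at the nearest "</a>" rather than the last one on the page.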
