#1
Hi all! I'm new to VB.Net, and pretty much a beginner in programming. I'm trying to write an HTML parser/search algorithm, but it doesn't seem to be working. Please guide me, thanks.
OK, so the idea is really simple. The function receives a search string and the related StreamReader (loaded from a website or anywhere else), and it scans through the text until it finds the beginning of an anchor ("<a"). Then it uses a queue-based loop to find an ("href=") heading, and similarly reads the hyperlink into a temp variable. After this, it checks whether the anchored text matches the given search string, and if it does, the function will return the hyperlink.
The problem now is that the program just scans through the HTML stream without any If statement being triggered; it doesn't respond to the beginning ("<") and so on. Please help! Am I using a very stupid method? Here is the code:
============================================
Imports System.IO

Public Class SearchClass

    Public Sub SearchClass()
    End Sub

    '***Method 1.0 - search function***
    Public Function searchFor(ByVal searchString As String, ByVal SR As StreamReader)
        Dim resultLink As New ArrayList()
        Dim count As Integer
        While (SR.Peek() > -1)
            'Section 1.2 - searching for [<a] anchor statement in html
            If SR.Read().Equals("<") Then
                '===test code==='
                System.Windows.Forms.MessageBox.Show("1.2 begin found ""<""")
                '===end test code==='
                If SR.Peek().Equals("a") Then
                    SR.Read()
                    Dim tempQueue As New Queue()
                    Dim hrefString As String = "href="""
                    Dim hrefstring2 As String = "href='"
                    Dim hrefCompareString As String
                    Dim tempLink As String
                    'Section 1.3 - enqueuing Chars to search for [href="] statement in html
                    While Not SR.Peek().Equals("<")
                        If tempQueue.Count() < hrefString.Length Then
                            tempQueue.Enqueue(SR.Read())
                        Else
                            tempQueue.Dequeue()
                            tempQueue.Enqueue(SR.Read())
                        End If
                        If tempQueue.Count() = hrefString.Length Then
                            Dim myCollection As IEnumerable = tempQueue
                            Dim myEnumerator As System.Collections.IEnumerator = myCollection.GetEnumerator()
                            While myEnumerator.MoveNext()
                                hrefCompareString = String.Concat(hrefCompareString, myEnumerator.Current())
                                '===begin TEST code===
                                System.Windows.Forms.MessageBox.Show(vbCrLf & hrefCompareString)
                                '===end TEST code===
                            End While
                        End If
                        'Section 1.4 - enqueuing hyperlink to tempLink (ONLY when [href="] is found)
                        If hrefCompareString.Equals(hrefString) Or hrefCompareString.Equals(hrefstring2) Then
                            Dim tempQueue2 As New Queue()
                            While Not SR.Peek().Equals("""") Or SR.Peek().Equals("'")
                                tempQueue2.Enqueue(SR.Read())
                            End While
                            Dim myCollection2 As IEnumerable = tempQueue2
                            Dim myEnumerator2 As System.Collections.IEnumerator = myCollection2.GetEnumerator()
                            While myEnumerator2.MoveNext()
                                tempLink = String.Concat(tempLink, myEnumerator2.Current())
                            End While
                            'Section 1.5 - Look for searchString match until end of anchor ([</a>])
                            Dim tempString As String
                            Dim tempQueue3 As New Queue()
                            While Not SR.Peek.Equals(">")
                                SR.Read()
                            End While
                            While Not SR.Peek().Equals(">")
                                If tempQueue3.Count() < searchString.Length Then
                                    tempQueue3.Enqueue(SR.Read())
                                Else
                                    tempQueue3.Dequeue()
                                    tempQueue3.Enqueue(SR.Read())
                                End If
                                If tempQueue.Count() = searchString.Length Then
                                    Dim myCollection3 As IEnumerable = tempQueue3
                                    Dim myEnumerator3 As System.Collections.IEnumerator = myCollection3.GetEnumerator()
                                    While myEnumerator3.MoveNext()
                                        tempString = String.Concat(tempString, myEnumerator3.Current())
                                    End While
                                    If tempString.Equals(searchString) Then
                                        resultLink.Add(tempLink)
                                    End If
                                End If
                            End While 'end of section 1.5
                        End If 'end of section 1.4 (storing templink)
                    End While 'end of section 1.3 (searching for [href="] head within anchor
                End If
            End If 'end of section 1.2 (searching for anchor)
            count += 1
        End While 'end of SR stream
        System.Windows.Forms.MessageBox.Show("Search ended with " & count)
        Return resultLink
    End Function
    '***End of search function 1.0***

End Class
==============================================
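A likely reason that none of the If statements fire: StreamReader.Read() and Peek() return an Integer (the character code, or -1 at the end of the stream), not a Char or a String. Calling .Equals("<") on that Integer compares a number with a String, which is never equal, so the test is always False. Below is a minimal sketch of the same scan with the Integer converted to a Char first. The module and function names (ReadCharDemo, CountAnchorStarts) and the sample HTML are made up for illustration, and the function takes a TextReader so that it works with both a StreamReader and a StringReader.

```vb
Imports System
Imports System.IO

Module ReadCharDemo
    ' Hypothetical helper: counts how often "<a" (any case) appears in the reader.
    Function CountAnchorStarts(ByVal reader As TextReader) As Integer
        Dim found As Integer = 0
        While reader.Peek() > -1
            ' Read() returns an Integer, so convert it to a Char before comparing.
            Dim c As Char = ChrW(reader.Read())
            If c = "<"c AndAlso reader.Peek() > -1 Then
                ' Peek() also returns an Integer and does not consume the character.
                Dim nextChar As Char = ChrW(reader.Peek())
                If Char.ToLowerInvariant(nextChar) = "a"c Then
                    found += 1
                End If
            End If
        End While
        Return found
    End Function

    Sub Main()
        Dim html As String = "<p>x</p><a href=""http://example.com/"">Example</a>"
        ' Only one tag in the sample starts with "<a", so this prints 1.
        Console.WriteLine(CountAnchorStarts(New StringReader(html)))
    End Sub
End Module
```

Note that this simple check would also count tags such as "<abbr", so a real parser would need to look at the character after the "a" as well.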
#2
I don't know much ASP, but I did this recently using PHP and regular expression matching, so I'm posting it in case it helps.
Basically, it uses the curl library to store the HTML in a variable, searches the variable for a specific string ($anchor), then stores all hyperlinks after the anchor that match the regular expression in $matchstr. A typical $siteurl, $matchstr and $anchor would be:
$siteurl = 'http://msn.espn.go.com/';
$matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';
$anchor = '/Top News Headlines(.*?)/';
which would pull the hyperlinks from the Headline News section of ESPN's website.
PHP Code:
#3
??? I don't really understand PHP.. eheheh.. ^_^"
#4
If ASP has pattern matching functions like the preg_match family in PHP, I think those and regular expressions would be the way to go.
Maybe take a look at http://www.php.net/preg_match to get an idea of what they do, and look for similar functions in ASP? Unfortunately, I don't know enough ASP to do the conversion for you, but the logic would remain the same:
1. Find the anchor using a regular expression match: PHP Code:
2. Find the hyperlink using another regular expression match that flags the url and the text of the hyperlink - (.*?) is the placeholder for a value you want to save from the match: PHP Code:
3. Store them for use later: PHP Code:
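In VB.Net the closest counterpart to the preg_match functions is the Regex class in System.Text.RegularExpressions. The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the module and function names (LinkExtractor, LinksAfter) and the sample HTML are made up, and the link pattern is a slightly tightened variation of $matchstr that leaves the closing quote out of the captured URL, not the exact pattern from the PHP version.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Hypothetical helper: maps link text to URL for every link found after anchorText.
    Function LinksAfter(ByVal html As String, ByVal anchorText As String) As Dictionary(Of String, String)
        Dim links As New Dictionary(Of String, String)()

        ' Step 1: find the anchor text; return an empty result if it is missing.
        Dim anchorMatch As Match = Regex.Match(html, Regex.Escape(anchorText))
        If Not anchorMatch.Success Then Return links

        ' Step 2: match each hyperlink that starts after the anchor.
        ' Group 1 is the URL, group 2 is the text between <a ...> and </a>.
        Dim linkRegex As New Regex( _
            "<a\s[^>]*?href\s*=\s*[""']?([^""'\s>]+)[""']?[^>]*>(.*?)</a>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        Dim startAt As Integer = anchorMatch.Index + anchorMatch.Length

        ' Step 3: store the matches for later use (a repeated link text overwrites the earlier URL).
        For Each m As Match In linkRegex.Matches(html, startAt)
            links(m.Groups(2).Value) = m.Groups(1).Value
        Next
        Return links
    End Function

    Sub Main()
        Dim html As String = "<a href='/skip'>Before</a> Top News Headlines " & _
            "<a href=""/story1"">First story</a> <A HREF='/story2'>Second story</A>"
        For Each pair As KeyValuePair(Of String, String) In LinksAfter(html, "Top News Headlines")
            Console.WriteLine(pair.Key & " -> " & pair.Value)
        Next
    End Sub
End Module
```

To get the behaviour of the original searchFor function, the caller could loop over the returned pairs and keep only the URLs whose link text contains the search string.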
Dev Shed Forums > Programming Languages - More > .Net Development > *Newbie needs help...