.Net Development
 
#1
September 6th, 2003, 11:39 AM
theronkid
*Newbie needs help...

Hi all! I'm new to VB.Net, and pretty much a beginner in programming. I'm trying to write an HTML parser/search algorithm, but it doesn't seem to be working. Please guide me, thanks.

OK, so basically the idea is really simple. The function receives a search string and the related StreamReader (loaded from a website or anywhere else), and it scans through the text until it finds the beginning of an anchor ("<a"). Then it uses a queue as a sliding window to find an ("href=") attribute, and similarly reads the hyperlink into a temp variable. After this, it checks whether the anchor text matches the given search string, and if it does, the function returns the hyperlink.

The problem now is that the program just scans through the HTML stream without any of the If statements being triggered. It doesn't respond to the start of a tag ("<") or anything else.

Please help! Am I using a very stupid method? Here is the code:
============================================
Imports System.IO

Public Class SearchClass

    Public Sub SearchClass()

    End Sub

    '***Method 1.0 - search function***
    Public Function searchFor(ByVal searchString As String, ByVal SR As StreamReader)
        Dim resultLink As New ArrayList()
        Dim count As Integer

        While (SR.Peek() > -1)

            'Section 1.2 - searching for [<a] anchor statement in html
            If SR.Read().Equals("<") Then

                '===test code==='
                System.Windows.Forms.MessageBox.Show("1.2 begin found ""<""")
                '===end test code==='

                If SR.Peek().Equals("a") Then

                    SR.Read()
                    Dim tempQueue As New Queue()
                    Dim hrefString As String = "href="""
                    Dim hrefstring2 As String = "href='"
                    Dim hrefCompareString As String
                    Dim tempLink As String

                    'Section 1.3 - enqueuing Chars to search for [href="] statement in html
                    While Not SR.Peek().Equals("<")

                        If tempQueue.Count() < hrefString.Length Then
                            tempQueue.Enqueue(SR.Read())
                        Else
                            tempQueue.Dequeue()
                            tempQueue.Enqueue(SR.Read())
                        End If

                        If tempQueue.Count() = hrefString.Length Then
                            Dim myCollection As IEnumerable = tempQueue
                            Dim myEnumerator As System.Collections.IEnumerator = myCollection.GetEnumerator()
                            While myEnumerator.MoveNext()
                                hrefCompareString = String.Concat(hrefCompareString, myEnumerator.Current())
                                '===begin TEST code===
                                System.Windows.Forms.MessageBox.Show(vbCrLf & hrefCompareString)
                                '===end TEST code===
                            End While
                        End If

                        'Section 1.4 - enqueuing hyperlink to tempLink (ONLY when [href="] is found)
                        If hrefCompareString.Equals(hrefString) Or hrefCompareString.Equals(hrefstring2) Then
                            Dim tempQueue2 As New Queue()
                            While Not SR.Peek().Equals("""") Or SR.Peek().Equals("'")
                                tempQueue2.Enqueue(SR.Read())
                            End While
                            Dim myCollection2 As IEnumerable = tempQueue2
                            Dim myEnumerator2 As System.Collections.IEnumerator = myCollection2.GetEnumerator()
                            While myEnumerator2.MoveNext()
                                tempLink = String.Concat(tempLink, myEnumerator2.Current())
                            End While

                            'Section 1.5 - Look for searchString match untill end of anchor ([</a>])
                            Dim tempString As String
                            Dim tempQueue3 As New Queue()

                            While Not SR.Peek.Equals(">")
                                SR.Read()
                            End While

                            While Not SR.Peek().Equals(">")
                                If tempQueue3.Count() < searchString.Length Then
                                    tempQueue3.Enqueue(SR.Read())
                                Else
                                    tempQueue3.Dequeue()
                                    tempQueue3.Enqueue(SR.Read())
                                End If

                                If tempQueue.Count() = searchString.Length Then
                                    Dim myCollection3 As IEnumerable = tempQueue3
                                    Dim myEnumerator3 As System.Collections.IEnumerator = myCollection3.GetEnumerator()
                                    While myEnumerator3.MoveNext()
                                        tempString = String.Concat(tempString, myEnumerator3.Current())
                                    End While

                                    If tempString.Equals(searchString) Then
                                        resultLink.Add(tempLink)
                                    End If
                                End If
                            End While
                            'end of section 1.5
                        End If
                        'end of section 1.4 (storing templink)
                    End While
                    'end of section 1.3 (searching for [href="] head within anchor
                End If
            End If
            'end of section 1.2 (searching for anchor)
            count += 1
        End While
        'end of SR stream

        System.Windows.Forms.MessageBox.Show("Search ended with " & count)

        Return resultLink
    End Function
    '***End of search function 1.0***
End Class
==============================================
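
A note on the symptom described above, offered as a sketch rather than a full fix: StreamReader.Read() and Peek() return an Integer character code (or -1 at the end of the stream), not a String. A call like SR.Read().Equals("<") therefore compares an Integer with a String and comes out False for every character, which would explain why none of the If branches fire. Converting the code to a Char before comparing is one way around it. The names ReadCharDemo and CountOpenBrackets below are made up for illustration, and a StringReader stands in for the web stream.

```vbnet
Imports System
Imports System.IO

Module ReadCharDemo
    ' Hypothetical helper (not from the original post): counts "<" characters.
    ' It takes a TextReader, so a StreamReader or a StringReader both work.
    Function CountOpenBrackets(ByVal reader As TextReader) As Integer
        Dim count As Integer = 0
        While reader.Peek() > -1
            ' Read() returns the character code as an Integer,
            ' so convert it to a Char before comparing.
            Dim c As Char = Convert.ToChar(reader.Read())
            If c = "<"c Then
                count += 1
            End If
        End While
        Return count
    End Function

    Sub Main()
        Dim html As String = "<a href=""x.html"">link</a>"

        ' Integer compared with String: Equals reports False,
        ' even though the first character really is "<".
        Dim raw As New StringReader(html)
        Console.WriteLine(raw.Read().Equals("<"))

        ' Char compared with Char: the brackets are found.
        Console.WriteLine(CountOpenBrackets(New StringReader(html)))
    End Sub
End Module
```

The same conversion would presumably be needed before each Enqueue(SR.Read()) call, since otherwise the queues hold character codes and String.Concat builds a string of digits instead of the text being searched for.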

#2
September 6th, 2003, 10:22 PM
milenko1054
I don't know much ASP, but I did this recently using PHP and regular expression matching - posting it in case it helps.

Basically, it uses the curl library to store the html in a variable, searches the variable for a specific string ($anchor) then stores all hyperlinks after the anchor that match the regular expression in $matchstr.

A typical $matchstr and $anchor would be:
$siteurl = 'http://msn.espn.go.com/';
$matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';
$anchor = '/Top News Headlines(.*?)/';

Which would pull the hyperlinks from the Headline News section of ESPN's website

PHP Code:
// Connect to web site and store HTML in text string $s
$c = curl_init($siteurl);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$s = curl_exec($c);
curl_close($c);

$a = array();
$b = array();

// If the anchor is valid, split the HTML, storing everything after the anchor in text string $b
if (preg_match($anchor, $s))
    $b = preg_split($anchor, $s);
// If not, the regular expression for the anchor doesn't match any text on the site - generate error
else
    return 'Error - Anchor ['.$sitename.']';

// Pull out urls as indicated by the regular expression in $matchstr ($b[1] contains the text after the anchor)
if (preg_match_all($matchstr, $b[1], $matches, PREG_SET_ORDER))
  {
    foreach ($matches as $match)
    {
      // Put url and text into an array
      array_push($a, array($match[1], $match[2]));
    }
  }

#3
September 7th, 2003, 12:26 AM
theronkid
??? I don't really understand PHP.. eheheh.. ^_^"

#4
September 7th, 2003, 06:10 AM
milenko1054
If ASP has some pattern matching functions like preg_match series in PHP, I think that and regular expressions would be the way to go.

Maybe take a look at http://www.php.net/preg_match to get an idea what they do and look for similar functions in ASP?

Unfortunately, I don't know enough ASP to do the conversion for you, but the logic would remain the same:
1. Find the anchor using a regular expression match:
PHP Code:
if (preg_match($anchor, $s))
    $b = preg_split($anchor, $s);

2. Find the hyperlink using another regular expression match that flags the url and the text of the hyperlink - (.*?) is the placeholder for a value you want to save from the match:
PHP Code:
$matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';

if (preg_match_all($matchstr, $b[1], $matches, PREG_SET_ORDER))

3. Store them for use later:
PHP Code:
foreach ($matches as $match)
    {
      // Put url and text into an array
      array_push($a, array($match[1], $match[2]));
    }
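
For VB.Net, the three steps above map fairly directly onto the Regex class in System.Text.RegularExpressions (Regex.Match for finding the anchor, Regex.Matches for collecting the links). What follows is only a minimal sketch under some assumptions: the HTML is already in a string, the helper name ExtractLinks and the sample HTML are made up for illustration, and the link pattern is a tightened variant rather than the exact $matchstr from the PHP version. It also takes everything after the first anchor match with Substring instead of splitting the way preg_split does.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractDemo
    ' Hypothetical helper: returns {url, text} pairs for every link found
    ' after the first match of anchorPattern in html.
    Function ExtractLinks(ByVal html As String, ByVal anchorPattern As String) As List(Of String())
        Dim results As New List(Of String())

        ' Step 1: find the anchor; return an empty list if it is not present.
        Dim anchorMatch As Match = Regex.Match(html, anchorPattern)
        If Not anchorMatch.Success Then
            Return results
        End If
        Dim rest As String = html.Substring(anchorMatch.Index + anchorMatch.Length)

        ' Step 2: match links; group 1 is the url, group 2 is the link text.
        Dim linkPattern As String = "<a\s+[^>]*?href\s*=\s*[""']?([^""'\s>]+)[""']?[^>]*>(.*?)</a>"
        Dim linkOptions As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Step 3: store each url/text pair for later use.
        For Each m As Match In Regex.Matches(rest, linkPattern, linkOptions)
            results.Add(New String() {m.Groups(1).Value, m.Groups(2).Value})
        Next
        Return results
    End Function

    Sub Main()
        ' Sample HTML invented for this sketch.
        Dim html As String = "<h2>Top News Headlines</h2>" & _
            "<a href=""a.html"">First</a> <a class='x' href='b.html'>Second</a>"
        For Each link As String() In ExtractLinks(html, "Top News Headlines")
            Console.WriteLine(link(0) & " -> " & link(1))
        Next
    End Sub
End Module
```

To get back to the original goal of returning the link whose text matches a search string, the loop in step 3 could compare m.Groups(2).Value against that string before adding the url.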
