Python Programming
 
  #1  
March 1st, 2004, 08:47 PM
7imz
trying to create a webcrawler

I'm trying to create a web crawler that, given a website, follows all the links up to three levels deep. The good thing is I can already do all of this, but I want to ignore all the dead ends (links leading to JPG files, PDF files, etc.). Any suggestions would be greatly appreciated.

Here's the fragment of code that I'm mainly relying on:

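import re
import urllib.request

# Capture the href value of each simple <a href="..."> tag.
pattern = r'<a href="(.+?)">'
# urlopen() returns bytes in Python 3, so decode before matching.
html = urllib.request.urlopen('http://www.python.org/').read().decode('utf-8', errors='replace')
links = re.findall(pattern, html)

for l in links:
    print(l)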
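
For readers following along, here is a minimal sketch of the "three levels" part, not the poster's actual crawler. It assumes a hypothetical `fetch` callable that takes a URL and returns the page's HTML as a string (or None on failure); the names `crawl`, `fetch`, and `max_depth` are made up for illustration. It reuses the same simple regex as the fragment above, so it only sees plain `<a href="...">` tags.

```python
import re
from urllib.parse import urljoin

LINK_PATTERN = re.compile(r'<a href="(.+?)">')

def crawl(start_url, fetch, max_depth=3):
    """Breadth-first crawl: collect URLs up to max_depth links from start_url.

    `fetch` is a hypothetical callable: URL in, HTML string (or None) out.
    Pages found at the last level are recorded but not fetched themselves.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue  # unreadable page, nothing to follow
            for href in LINK_PATTERN.findall(html):
                absolute = urljoin(url, href)  # resolve relative links
                if absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return seen
```

In real use, `fetch` could be a small wrapper around `urllib.request.urlopen`; keeping it as a parameter makes the crawl logic easy to try out against a dictionary of fake pages.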

  #2  
March 1st, 2004, 10:38 PM
rebbit

Just checking the extension of each link would probably work. If it doesn't have an extension (a link like python.org/search/) or the extension is in a list of valid extensions (which you would need to specify), then keep it as a valid link.
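
A rough sketch of that whitelist check, assuming the links have already been turned into absolute URLs. The `VALID_EXTENSIONS` set and the `is_followable` name are made up for illustration; you would pick your own list.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical whitelist; adjust to the page types you want to follow.
VALID_EXTENSIONS = {".html", ".htm", ".php", ".asp"}

def is_followable(url):
    """Keep a link if its path has no extension or a whitelisted one."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext == "" or ext in VALID_EXTENSIONS
```

For example, `is_followable("http://www.python.org/search/")` keeps the link because the path has no extension, while a link ending in `.jpg` is rejected because that extension is not in the list.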

  #3  
March 2nd, 2004, 05:14 AM
netytan
It's probably better (and easier) to check against a list of common unwanted file types. This way you don't exclude a possibly valid page. You can then delete unwanted entries from the list using the del statement. Then obviously you'll need to store your results.
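
One way the blacklist idea might look, as a sketch only. The `UNWANTED_EXTENSIONS` set and the `drop_unwanted` name are assumptions for illustration, and the links are assumed to be absolute URLs.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical blacklist of common dead-end file types; extend as needed.
UNWANTED_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".pdf", ".zip", ".exe"}

def drop_unwanted(links):
    """Remove links to unwanted file types from `links` in place using del."""
    # Walk backwards so deleting an entry does not shift the ones still to check.
    for i in range(len(links) - 1, -1, -1):
        ext = posixpath.splitext(urlparse(links[i]).path)[1].lower()
        if ext in UNWANTED_EXTENSIONS:
            del links[i]
    return links
```

Whatever survives in `links` is what you would store, for instance in a set of visited URLs or a text file, before moving on to the next level.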

Mark.


© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 6 hosted by Hostway