#1: Contributing User
Devshed Newbie (0 - 499 posts)
Join Date: Feb 2003 | Posts: 156 | Rep Power: 17

    Extracting content from Wikipedia


I have the following simple method for extracting the HTML source of a Wikipedia article:
    Code:
public String getArticle(String searchTerm) {
    try {
        URL url = new URL("http://en.wikipedia.org/wiki/Special:Search?search="
                + searchTerm + "&go=Go");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setDoOutput(true);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    } catch (MalformedURLException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    return null;
}
When I attempt to run it, I get an IOException with an HTTP status of 403 (Forbidden). Am I using the wrong protocol, or is there a more fundamental reason why I can't extract Wikipedia articles?
#2: AYBABTU
Devshed Beginner (1000 - 1499 posts)
Join Date: Jul 2004 | Location: Here or There | Posts: 1,256 | Rep Power: 379
    Works fine for me.
    A common mistake people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.
    Douglas Adams
#3: Contributing User
Devshed Newbie (0 - 499 posts)
Join Date: Mar 2005 | Posts: 191 | Rep Power: 85
    Have a look at this thread

Basically, you need to set the User-Agent request property to convince Wikipedia that your program is really a web browser...

    ... although perhaps given wsa1971's reply (which was virtually simultaneous to mine), this might not be the problem in your case... can you retrieve other web sites with your code?
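If it is the problem, something along these lines before getInputStream() is called should sort it out (the User-Agent string here is only illustrative; any browser-like value should do):
Code:
// HttpURLConnection identifies itself as "Java/<version>" by default,
// which some sites reject with a 403. Supplying a browser-like
// User-Agent request property avoids that.
connection.setRequestProperty("User-Agent",
        "Mozilla/5.0 (compatible; ArticleFetcher/0.1)");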
    Last edited by occam999; January 29th, 2007 at 06:08 AM.
#4: AYBABTU
Devshed Beginner (1000 - 1499 posts)
Join Date: Jul 2004 | Location: Here or There | Posts: 1,256 | Rep Power: 379
I have tested the code unaltered, so yes, it should work. I did get a 400 response when I used "All your base" as a search term; it turned out that the space between the terms was the problem, not the code. It was fixed by using "All%20your%20base".
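Rather than escaping the spaces by hand, the standard library can do the encoding for you. A minimal sketch (the class name is just for the example):
Code:
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        // URLEncoder percent-encodes characters that are not legal in a
        // query string; a space becomes "+", which servers treat the
        // same as %20 in query parameters.
        String encoded = URLEncoder.encode("All your base", "UTF-8");
        System.out.println("http://en.wikipedia.org/wiki/Special:Search?search="
                + encoded + "&go=Go");
    }
}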
    A common mistake people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.
    Douglas Adams
#5: Contributing User
Devshed Supreme Being (6500+ posts)
Join Date: Jan 2005 | Location: Internet | Posts: 7,625 | Rep Power: 6087
@OP: Also, I'm not sure Wikipedia would be too happy to have someone crawling their content without permission. Look into that.
    Chat Server Project & Tutorial | WiFi-remote-control sailboat (building) | Joke Thread
    “Rational thinkers deplore the excesses of democracy; it abuses the individual and elevates the mob. The death of Socrates was its finest fruit.”
    Use XXX in a comment to flag something that is bogus but works. Use FIXME to flag something that is bogus and broken. Use TODO to leave yourself reminders. Calling a program finished before all these points are checked off is lazy.
    -Partial Credit: Sun

    If I ask you to redescribe your problem, it's because when you describe issues in detail, you often get a *click* and you suddenly know the solutions.
    Ches Koblents
#6: Contributing User
Devshed Newbie (0 - 499 posts)
Join Date: Feb 2003 | Posts: 156 | Rep Power: 17
Adding an additional property to fool Wikipedia into thinking the request was sent from Mozilla works for me!

I'd be interested to know why the above code works for some people and not others without this additional property. I'm running OS X and executing the code from Eclipse, if that makes a difference.

I've yet to fully think through seeking permission for large-scale web crawling, but for my purposes I'm still very much at the inception stage!
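For anyone finding this thread later, here's roughly what the fixed method looks like. This is a sketch: the User-Agent value is arbitrary, the class name is just for illustration, the search term is also URL-encoded per wsa1971's note, the method returns the page instead of printing it, and setDoOutput(true) is dropped since it isn't needed for a plain GET.
Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class WikipediaFetcher {

    public String getArticle(String searchTerm) {
        try {
            // Encode the search term so spaces and other special
            // characters are legal in the query string.
            String encoded = URLEncoder.encode(searchTerm, "UTF-8");
            URL url = new URL("http://en.wikipedia.org/wiki/Special:Search?search="
                    + encoded + "&go=Go");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            // The browser-like User-Agent is what cures the 403;
            // the exact string is arbitrary.
            connection.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (compatible; ArticleFetcher/0.1)");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();
            return page.toString();
        } catch (IOException e) {
            // MalformedURLException is a subclass of IOException,
            // so one catch block covers both.
            e.printStackTrace();
            return null;
        }
    }
}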
#7: rebel with a cause
Devshed God 1st Plane (5500 - 5999 posts)
Join Date: May 2004 | Location: The Batsh!t Crazy State. | Posts: 5,814 | Rep Power: 3465
I'd like to expand on the previous concerns. If your actions are in fact a violation of Wikipedia's TOS, then this thread is a violation of Devshed's TOS. Provide documentation that your actions are allowed, or I'll feel obligated to recommend that the moderators close this thread and remove your source code.
    Dear God. What is it like in your funny little brains? It must be so boring.
#8: Contributing User
Devshed Newbie (0 - 499 posts)
Join Date: Feb 2003 | Posts: 156 | Rep Power: 17
I'm using the above code on Wikipedia purely for personal use, so I'm in no way violating Wikipedia's TOS.

If and when I am in a position to seriously consider the possibility of large-scale screen scraping, I will of course ensure that my material is covered by the GFDL. For the insomniacs, here's the relevant excerpt:

    If you want to use Wikipedia materials in your own books/articles/web sites or other publications, you can do so, but you have to follow the GFDL. If you are simply duplicating the Wikipedia article, you must follow section two of the GFDL on verbatim copying, as discussed at Wikipedia:Verbatim copying.
If you create a derivative version by changing or adding content, this entails the following:
• your materials in turn have to be licensed under the GFDL,
• you must acknowledge the authorship of the article (section 4B), and
• you must provide access to the "transparent copy" of the material (section 4J). (The "transparent copy" of a Wikipedia article is any of a number of formats available from us, including the wiki text, the HTML web pages, the XML feed, etc.)
    You may be able to partially fulfill the latter two obligations by providing a conspicuous direct link back to the Wikipedia article hosted on this website. You also need to provide access to a transparent copy of the new text. However, please note that the Wikimedia Foundation makes no guarantee to retain authorship information and a transparent copy of articles. Therefore, you are encouraged to provide this authorship information and a transparent copy with your derived works.

    Comments on this post

• wsa1971 agrees: Was going to post something similar.
#9: rebel with a cause
Devshed God 1st Plane (5500 - 5999 posts)
Join Date: May 2004 | Location: The Batsh!t Crazy State. | Posts: 5,814 | Rep Power: 3465
Okay. That's sufficient for me.
    Dear God. What is it like in your funny little brains? It must be so boring.
