#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0

    Scrape embedded PDF from website


    Hi, I am writing a program that needs to scrape a website which returns a PDF document in an HTML <embed> tag with type=application/pdf.

    I need to save that PDF to a local directory.

    What is the best approach? I have written custom web browsers in VB and have extensive experience in HTTP headers and stuff.

    I am not sure how I would pull the file, as it's probably not a downloadable file stream (??). It's something that calls Acrobat to display it (right?)

    The target platform is Windows XP+

    To keep the application simple and lightweight, I was thinking of a console application or a Win32 application (yes, I know, but I like the Win32 API calls :D) to run in the background. I don't have much experience with Windows Services, and a console app works for me.

    thanks in advance :)
    Last edited by speedbooster; January 3rd, 2013 at 01:41 PM.
  2. #2
  3. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    The embed will tell the source of the file. Just download that.

    The Acrobat thing is something the browser does. The server says the type of file is application/pdf, the browser looks up in a list and finds that the type is mapped to a plugin, and it fires up the plugin.
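
    A minimal sketch of the "find the source of the embed" step, assuming the page HTML has already been fetched into a string. The function name extractEmbedSrc is made up for illustration, and the regex only handles quoted src values; a real page may need a proper HTML parser.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the src attribute of the first <embed> tag
// found in `html`, or an empty string if there is none.
// Assumption: the attribute value is wrapped in single or double quotes.
std::string extractEmbedSrc(const std::string& html) {
    // Case-insensitive match of <embed ... src="..."> or src='...'.
    static const std::regex pattern(
        R"re(<embed\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1)re",
        std::regex::icase);
    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[2].str();  // group 2 is the text between the quotes
    }
    return std::string();
}
```

    For example, calling extractEmbedSrc on `<embed src="http://www.server.com/file.pdf" type="application/pdf">` returns `http://www.server.com/file.pdf`. That URL is what gets requested next; if it is relative, it still has to be resolved against the page URL.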
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    How is that different from a simple file stream?
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    I mean, why didn't the web developers just provide a file stream by linking to the PDF directly, so that the MIME type in the headers would be PDF?

    Instead, they give me this embed tag?

    Oh, wait, so the source of the embed tag is the file stream I'm talking about?
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    Originally Posted by requinix
    The embed will tell the source of the file. Just download that.

    The Acrobat thing is something the browser does. The server says the type of file is application/pdf, the browser looks up in a list and finds that the type is mapped to a plugin, and it fires up the plugin.
    But if I put the source URL in a web browser and view its code, it still gives me the same <embed> tag HTML.

    Perhaps the website is doing this to prevent downloads?
  10. #6
  11. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,255
    Rep Power
    2222
    This might be a stupid question, but why can't you simply use the reader's save feature? When my browser runs Adobe Acrobat to display a PDF, one of the options it gives me is to save the file locally (it's the icon that looks like a 3.5" floppy diskette).
  12. #7
  13. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    The embed links to something. If that isn't the PDF itself then something in it will. Or in whatever that links to. Somewhere there is a reference to the actual PDF content.
    Find that.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    @dwise:
    I want to automate it, hence the term 'scrape'.

    @requinix:
    Actually, the source is a .PDF file with some GET parameters. But it returns the embed HTML code with the 'same' source again....

    Maybe the server is faking the extension and there's a PHP-like script handling .PDF extensions?
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    OK, so I checked. Let us call the 'source' URL: http://www.server.com/file.pdf&var=value

    Now, when I visit this using IE or Firefox, it shows the PDF.
    But if I put this URL in a download manager, it returns the HTML output of a JSP page which says I don't have cookies enabled in my web browser.

    What is happening is this:
    I am behind an HTTP proxy, and the admins have somehow logged the whole network in to 'server.com' (cookies or IP recognition, I don't know; probably cookies, because the download manager failed).

    Now, how do I know which cookies the admins are setting for the requests generated from the network computers? And most importantly, how are they setting them? Why did the cookies not get set with the download manager?
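
    One way to tell these two responses apart in code is to look at the first bytes of the body: a well-formed PDF starts with the signature "%PDF-", while the JSP error page starts with HTML. A minimal sketch, assuming the response body is already in memory; looksLikePdf is a hypothetical name, not part of any library.

```cpp
#include <string>

// Hypothetical helper: returns true if the response body begins with the
// PDF signature "%PDF-". An HTML error page will fail this check.
bool looksLikePdf(const std::string& body) {
    static const std::string signature = "%PDF-";
    // compare() yields non-zero when the body is shorter than the signature
    // or when the leading bytes differ.
    return body.compare(0, signature.size(), signature) == 0;
}
```

    Checking this before writing the file to disk avoids silently saving an HTML page under a .pdf name. Checking the Content-Type response header for application/pdf is a complementary test.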
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    Sorry again, server.com states that it uses IP recognition.

    But I guess I need to enable cookies in my 'custom' web browser?
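
    In a hand-rolled HTTP client, "enabling cookies" roughly means remembering the Set-Cookie headers from each response and sending them back in a Cookie header on later requests. A rough sketch of that bookkeeping, assuming the Set-Cookie values have already been pulled out of the response headers; buildCookieHeader is a made-up name, and attributes such as Path, Domain and Expires are ignored here, which a real client should not do.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: takes the raw values of the Set-Cookie response
// headers and returns the value for a Cookie request header,
// e.g. "name=value; name2=value2".
std::string buildCookieHeader(const std::vector<std::string>& setCookieValues) {
    std::map<std::string, std::string> jar;  // a later cookie overwrites an earlier one
    for (const std::string& raw : setCookieValues) {
        // The name=value pair is everything before the first ';'.
        const std::string pair = raw.substr(0, raw.find(';'));
        const std::string::size_type eq = pair.find('=');
        if (eq == std::string::npos || eq == 0) {
            continue;  // skip malformed entries
        }
        jar[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    std::string header;
    for (const auto& entry : jar) {
        if (!header.empty()) {
            header += "; ";
        }
        header += entry.first + "=" + entry.second;
    }
    return header;
}
```

    For instance, the two values `JSESSIONID=abc123; Path=/; HttpOnly` and `lang=en` (invented cookie names) produce `JSESSIONID=abc123; lang=en`, which would be sent as `Cookie: JSESSIONID=abc123; lang=en` on the request for the PDF.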
  20. #11
  21. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    Sounds like they're trying pretty hard to prevent automated bots and scrapers from pulling content from the site. Can you ask the site owners for a way to get what you need? Maybe there's an RSS feed or list you can grab URLs from.
  22. #12
  23. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    OK, so I checked the headers in Firefox (via a plugin) and you're right: the 'source' is actually the PDF stream. The reason it was showing <embed> code is that I was viewing the page in Firebug, which showed the <embed> tag.

    Now I have the stream. But there is something else I want. Is it possible to use Internet Explorer to fetch the PDF file for me?

    The firewall won't let me open a connection, even to port 80, so I thought I could use IE to fetch the PDF. I have used web browser controls in VB, but can it be done in Win32 C++? If so, how?
  24. #13
  25. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Posts
    227
    Rep Power
    0
    OK, I'm using IWebBrowser2 to do the job. Hopefully I can download the PDF this way.

    The problem now is that I cannot kill/close the IE window.

    Code:
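    HWND ieWindow = NULL;
    SHANDLE_PTR rawHandle = 0;  // get_HWND expects a SHANDLE_PTR*
    if (SUCCEEDED(pWBApp->get_HWND(&rawHandle))) {
        ieWindow = reinterpret_cast<HWND>(rawHandle);
        //MessageBox(hWnd, L"got handle to IE window", L"notice", MB_OK);
        // No effect here: DestroyWindow cannot destroy a window that was
        // created by a different thread, which is the case for the IE window.
        DestroyWindow(ieWindow);
    }

    pWBApp->Quit() kills the 'tab', not the whole window.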

    Any ideas?
