#1
  1. Rut row Raggy!
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2001
    Location
    Tornado Alley
    Posts
    560
    Rep Power
    31

    Question Obtaining html files and saving them as text in C or C++


    I got Petzold's book today from Amazon.com, and while it's a great book, it spends little time on the internet (even though it spends hundreds of pages on bitmaps), which I thought was odd since the book was printed in '98. Anyway, the program I want to develop needs to open HTML files from a website, edit them (saving only the parts needed), and save them as hidden .dat files. Or I could just read the HTML files. How would I accomplish this in C or C++? Would I use the normal file I/O functions? And which is better: saving the file or just reading it?
    Matt
  2. #2
  3. No Profile Picture
    .
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2002
    Posts
    296
    Rep Power
    12
    The code linked to in the 'a very small http server in c' thread might be helpful. I think fopen will help.
    Last edited by balance; February 18th, 2003 at 09:19 PM.
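As a rough illustration of the "save only the parts needed" step from the original question, here is a minimal sketch in plain C. It assumes the page text is already in memory: standard `fopen` only opens local files, so the page itself has to be fetched by some other means first. The function names `extract_between` and `save_text` and the marker strings are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the text between the first `start` marker and
   the next `end` marker into `out` (NUL-terminated). Returns the number of
   bytes copied, or -1 if a marker is missing or `out` is too small. */
long extract_between(const char *html, const char *start, const char *end,
                     char *out, size_t outsz)
{
    const char *b = strstr(html, start);
    const char *e;
    size_t n;

    if (b == NULL) return -1;
    b += strlen(start);              /* skip past the opening marker */
    e = strstr(b, end);
    if (e == NULL) return -1;
    n = (size_t)(e - b);
    if (n + 1 > outsz) return -1;    /* need room for the terminator */
    memcpy(out, b, n);
    out[n] = '\0';
    return (long)n;
}

/* Hypothetical helper: write `text` to a local file with ordinary stdio.
   Returns 0 on success, -1 on failure. */
int save_text(const char *path, const char *text)
{
    FILE *f = fopen(path, "wb");
    size_t len = strlen(text);

    if (f == NULL) return -1;
    if (fwrite(text, 1, len, f) != len) { fclose(f); return -1; }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling `extract_between(page, "<title>", "</title>", buf, sizeof buf)` followed by `save_text("page.dat", buf)` would keep just the title. Plain substring matching like this is only a stand-in for real HTML parsing and will miss markup that varies in case or attributes.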
  4. #3
  5. Rut row Raggy!
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2001
    Location
    Tornado Alley
    Posts
    560
    Rep Power
    31
    That's for Unix, I'm making an app for Win32.
    Matt
  6. #4
  7. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,155
    Rep Power
    2222
    Originally posted by marron79
    That's for Unix, I'm making an app for Win32.
    Winsock also supports the standard sockets API for the most part. Read "Transitioning from UNIX to Windows Socket Programming" by Paul O'Steen at http://cs.baylor.edu/~donahoo/practi...owsSockets.pdf for instructions on converting a UNIX sockets program to a Win32 Winsock console application. You can even do multithreading in both UNIX and Win32, though the function names are a bit different. About the only thing you can't do in Win32 is process forking.
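To give an idea of how small the UNIX/Winsock differences can be, here is a hedged sketch of the usual wrappers: Winsock needs `WSAStartup`/`WSACleanup` around socket use and closes sockets with `closesocket`, while UNIX uses plain `close`. The wrapper names (`net_init`, `net_cleanup`, `net_close`, `build_get_request`) and the `sock_t` typedef are invented for this example and are not taken from the linked paper. `snprintf` is assumed to be available (C99 or a recent Visual C++).

```c
#include <stdio.h>

#ifdef _WIN32
#include <winsock2.h>            /* link with ws2_32.lib */
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int sock_t;
#endif

/* Winsock must be started before any socket call; UNIX needs nothing. */
int net_init(void)
{
#ifdef _WIN32
    WSADATA wsa;
    return WSAStartup(MAKEWORD(2, 2), &wsa) == 0 ? 0 : -1;
#else
    return 0;
#endif
}

/* Matching shutdown call; a no-op outside Windows. */
void net_cleanup(void)
{
#ifdef _WIN32
    WSACleanup();
#endif
}

/* Sockets are closed with closesocket() on Windows and close() on UNIX. */
int net_close(sock_t s)
{
#ifdef _WIN32
    return closesocket(s);
#else
    return close(s);
#endif
}

/* Format a minimal HTTP/1.0 GET request into buf.
   Returns the request length, or -1 if it does not fit. */
int build_get_request(char *buf, size_t bufsz,
                      const char *host, const char *path)
{
    int n = snprintf(buf, bufsz, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                     path, host);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : n;
}
```

The calls in between (`socket`, `connect`, `send`, `recv`) are spelled the same on both platforms, so the request built here could be sent over a connected socket and the reply read back with `recv` until it returns 0.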

    But it sounded more like you were talking about parsing HTML. If that's the case, I do believe that there's an MFC view class for HTML.
  8. #5
  9. Rut row Raggy!
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2001
    Location
    Tornado Alley
    Posts
    560
    Rep Power
    31
    Originally posted by dwise1_aol
    But it sounded more like you were talking about parsing HTML. If that's the case, I do believe that there's an MFC view class for HTML.
    Yes, I've seen the MFC thing for HTML views, but I'm using Win32. Is there one for Win32?
    Matt
  10. #6
  11. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,155
    Rep Power
    2222
    Originally posted by marron79
    Yes, I've seen the MFC thing for HTML views, but I'm using Win32. Is there one for Win32?
    The source for the implementation of the CHtmlView class is in the VIEWHTML.CPP file in the C:\Program Files\Microsoft Visual Studio\VC98\MFC\SRC directory (actual path may vary). I haven't looked at it closely yet, but it no doubt depends on the rest of MFC to work. Still, you might find some information or ideas you could use.

    Searching on Google for -- Win32 HTML parser C++ -- produced some likely hits. One of them linked me to Odin Consulting's OPP (Open Plus Plus) page at http://www.odin-consulting.com/OPP/ . Their OPP library contains an HTML parser that:
    "can interpret HTML as a human reader can, understanding tables, fonts and so on. It can also "fix" broken HTML. This is a proof of concept implementation, with better and more compliant versions to come."
    The library comes in a "tar ball" that WinZIP can easily handle.

    Also, an observation about Petzold. He's been writing pretty much the same book since Windows v2 and possibly even before. When a new version of Windows would come out, he'd update the book to cover the new features. I even have a copy that was rewritten for OS/2's Presentation Manager (very similar to the Windows SDK). Since C and the SDK (software development kit, AKA "Windows API") were the only way to write Windows programs when the first book was written, that is the approach his books still offer -- at least up to the Windows 95 edition, which is the most recent of his books that I have. That is also why he doesn't cover the Internet and network programming -- in Win16, sockets programming was somewhat cumbersome and required a lot of message processing. Still, it's a good book that does contain a lot of good information about Windows API programming.
  12. #7
  13. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,625
    Rep Power
    4247
    As for getting an HTTP page in Windows, you can also use the WinINET functions: http://msdn.microsoft.com/library/de..._functions.asp

    In particular, you would be looking at InternetOpenUrl() and InternetReadFile()/InternetReadFileEx().

    These functions require that IE be installed (which it is for practically any windoze installation).
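A minimal sketch of how those two calls might fit together, downloading a page and writing it to a local file with stdio. The function name `fetch_url_to_file`, the agent string, and the buffer size are placeholders chosen for this example. The non-Windows branch is only a stub so that the file still compiles elsewhere.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <wininet.h>             /* link with wininet.lib */

/* Hypothetical helper: download `url` and write the raw bytes to `path`.
   Returns 0 on success, -1 on any failure. */
int fetch_url_to_file(const char *url, const char *path)
{
    HINTERNET session, request;
    FILE *out;
    char buf[4096];
    DWORD got = 0;
    int rc = -1;

    session = InternetOpenA("ExampleAgent/1.0",
                            INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (session == NULL) return -1;

    request = InternetOpenUrlA(session, url, NULL, 0,
                               INTERNET_FLAG_RELOAD, 0);
    if (request != NULL) {
        out = fopen(path, "wb");
        if (out != NULL) {
            rc = 0;
            /* End of data is signalled by a successful read of 0 bytes. */
            for (;;) {
                if (!InternetReadFile(request, buf, (DWORD)sizeof buf, &got)) {
                    rc = -1;
                    break;
                }
                if (got == 0) break;
                if (fwrite(buf, 1, got, out) != got) { rc = -1; break; }
            }
            if (fclose(out) != 0) rc = -1;
        }
        InternetCloseHandle(request);
    }
    InternetCloseHandle(session);
    return rc;
}
#else
/* WinINet is Windows-only; this stub just reports failure elsewhere. */
int fetch_url_to_file(const char *url, const char *path)
{
    (void)url;
    (void)path;
    return -1;
}
#endif
```

For the hidden .dat files mentioned in the first post, a call such as `SetFileAttributesA(path, FILE_ATTRIBUTE_HIDDEN)` after a successful download is one way to mark the file hidden.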
  14. #8
  15. Rut row Raggy!
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2001
    Location
    Tornado Alley
    Posts
    560
    Rep Power
    31
    Originally posted by Scorpions4ever
    As for getting an HTTP page in Windows, you can also use the WinINET functions: http://msdn.microsoft.com/library/de..._functions.asp

    In particular, you would be looking at InternetOpenUrl() and InternetReadFile()/InternetReadFileEx().

    These functions require that IE be installed (which it is for practically any windoze installation).
    Sounds like what I'm looking for! Thanks for your replies. :)
    Matt
