February 18th, 2003, 05:15 PM
Obtaining html files and saving them as text in C or C++
I got Petzold's book today from Amazon.com, and while it's a great book, it spends little time on the internet (even though it spends hundreds of pages on bitmaps), which I thought was odd since the book was printed in '98. Anyway, the program I want to develop needs to open HTML files from a website, edit them (saving only the parts needed), and save them as hidden .dat files. Or I could just read the HTML files. How would I accomplish this in C or C++? Would I use the normal file I/O functions and which is better (saving the file or reading it)?
February 18th, 2003, 09:10 PM
The code linked to in the 'a very small http server in c' thread might be helpful. I think fopen will help.
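For the saving side, the standard C file functions should be enough once the page is in memory. Here is a minimal sketch, assuming the HTML has already been read into a NUL-terminated string. The helper names extract_between and save_bytes are made up for illustration and do not come from any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the text between the first `open` marker and the
 * next `close` marker into `out`. Returns the copied length, or -1 if a
 * marker is missing or `out` is too small. Matching is case-sensitive. */
long extract_between(const char *html, const char *open, const char *close,
                     char *out, size_t cap)
{
    const char *start = strstr(html, open);
    const char *end;
    size_t len;

    if (start == NULL)
        return -1;
    start += strlen(open);
    end = strstr(start, close);
    if (end == NULL)
        return -1;
    len = (size_t)(end - start);
    if (len + 1 > cap)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}

/* Hypothetical helper: write `len` bytes to `path` in binary mode so that
 * Windows does not translate line endings. Returns 0 on success. */
int save_bytes(const char *path, const char *data, size_t len)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL)
        return -1;
    written = fwrite(data, 1, len, fp);
    if (fclose(fp) != 0)
        return -1;
    return written == len ? 0 : -1;
}
```

Calling extract_between(page, "<title>", "</title>", buf, sizeof buf) and then save_bytes("page.dat", buf, n) would keep only the title. Real pages mix upper- and lower-case tags, so a plain strstr search like this is only a starting point.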
Last edited by balance; February 18th, 2003 at 09:19 PM.
February 18th, 2003, 10:35 PM
That's for Unix; I'm making an app for Win32.
February 18th, 2003, 11:51 PM
Winsock also supports the standard sockets API for the most part. Read "Transitioning from UNIX to Windows Socket Programming" by Paul O'Steen at http://cs.baylor.edu/~donahoo/practi...owsSockets.pdf for instructions on converting a UNIX sockets program to a Win32 Winsock console application. You can even do multithreading in both UNIX and Win32, though the function names are a bit different. About the only thing you can't do in Win32 is process forking.
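To give an idea of how small the differences are, here is a rough sketch of a plain-sockets HTTP fetch that should build on both UNIX and Win32. It assumes a plain HTTP/1.0 request on port 80 and a Winsock 2 SDK recent enough to provide getaddrinfo. The names build_get_request and http_fetch are made up for this example.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L   /* expose getaddrinfo on strict compilers */
#endif
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>             /* link with ws2_32.lib */
#include <ws2tcpip.h>
typedef SOCKET sock_t;
#define CLOSESOCK closesocket
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
typedef int sock_t;
#define INVALID_SOCKET (-1)
#define CLOSESOCK close
#endif

/* Hypothetical helper: format a minimal HTTP/1.0 GET request.
 * Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t cap, const char *host, const char *path)
{
    int n = snprintf(buf, cap, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Hypothetical helper: fetch http://host/path and write the raw response
 * (headers included) to out. Returns 0 on success, -1 on any failure. */
int http_fetch(const char *host, const char *path, FILE *out)
{
    char req[1024], chunk[4096];
    struct addrinfo hints, *res = NULL, *ai;
    sock_t s = INVALID_SOCKET;
    int len, n = -1, rc = -1;
#ifdef _WIN32
    WSADATA wsa;                  /* Winsock must be started explicitly */
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return -1;
#endif
    len = build_get_request(req, sizeof req, host, path);
    if (len < 0)
        goto done;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        goto done;

    /* Try each resolved address until one connects. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (s == INVALID_SOCKET)
            continue;
        if (connect(s, ai->ai_addr, (int)ai->ai_addrlen) == 0)
            break;
        CLOSESOCK(s);
        s = INVALID_SOCKET;
    }
    if (s == INVALID_SOCKET)
        goto done;

    if (send(s, req, len, 0) != len)
        goto done;
    /* HTTP/1.0 servers close the connection when the response is complete. */
    while ((n = (int)recv(s, chunk, sizeof chunk, 0)) > 0)
        fwrite(chunk, 1, (size_t)n, out);
    rc = (n == 0) ? 0 : -1;
done:
    if (s != INVALID_SOCKET)
        CLOSESOCK(s);
    if (res != NULL)
        freeaddrinfo(res);
#ifdef _WIN32
    WSACleanup();
#endif
    return rc;
}
```

The only Win32-specific parts are the headers, WSAStartup/WSACleanup, and closesocket in place of close. Note that the output still contains the HTTP headers, so the HTML starts after the first blank line.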
But it sounded more like you were talking about parsing HTML. If that's the case, I do believe that there's an MFC view class for HTML.
February 19th, 2003, 02:57 AM
Yes, I've seen the MFC thing for HTML views, but I'm using Win32. Is there one for Win32?
February 19th, 2003, 10:03 AM
The source for the implementation of the CHtmlView class is in the VIEWHTML.CPP file in the C:\Program Files\Microsoft Visual Studio\VC98\MFC\SRC directory (actual path may vary). I haven't looked at it closely yet, but it no doubt depends on the rest of MFC to work. Still, you might find some information or ideas you could use.
Searching on Google for -- Win32 HTML parser C++ -- produced some likely hits. One of them linked me to Odin Consulting's OPP (Open Plus Plus) page at http://www.odin-consulting.com/OPP/ . Their OPP library contains an HTML parser that:
"can interpret HTML as a human reader can, understanding tables, fonts and so on. It can also "fix" broken HTML. This is a proof of concept implementation, with better and more compliant versions to come."
The library comes in a "tar ball" that WinZIP can easily handle.
Also, an observation about Petzold. He's been writing pretty much the same book since Windows v2 and possibly even before. When a new version of Windows would come out, he'd update the book to cover the new features. I even have a copy that was rewritten for OS/2's Presentation Manager (very similar to the Windows SDK). Since C and the SDK (software development kit, AKA "Windows API") were the only way to write Windows programs when the first book was written, that is the approach his books still offer -- at least up to the Windows 95 edition, which is the most recent of his books that I have. That is also why he doesn't cover the Internet and network programming -- in Win16, sockets programming was somewhat cumbersome and required a lot of message processing. Still, it's a good book that contains a lot of good information about Windows API programming.
February 19th, 2003, 01:29 PM
As for getting an HTTP page in Windows, you can also use the WinINET functions: http://msdn.microsoft.com/library/de..._functions.asp
In particular, you would be looking at InternetOpenUrl() and InternetReadFile()/InternetReadFileEx().
These functions require that IE be installed (which it is for practically any windoze installation).
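A rough sketch of how those two calls fit together, assuming the ANSI variants (InternetOpenA, InternetOpenUrlA) and a made-up wrapper name, wininet_fetch. The agent string is arbitrary. On non-Windows builds the function simply reports failure, since WinINET is not available there.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <wininet.h>   /* link with wininet.lib */
#endif

/* Hypothetical wrapper: download `url` and write the body to `out`.
 * Returns 0 on success, -1 on failure (always -1 outside Windows). */
int wininet_fetch(const char *url, FILE *out)
{
    int rc = -1;

    if (url == NULL || out == NULL)
        return -1;
#ifdef _WIN32
    {
        char chunk[4096];
        DWORD got = 0;
        HINTERNET hUrl = NULL;
        /* Use the proxy settings already configured for IE. */
        HINTERNET hNet = InternetOpenA("html-fetch-sketch",
                                       INTERNET_OPEN_TYPE_PRECONFIG,
                                       NULL, NULL, 0);
        if (hNet == NULL)
            return -1;

        hUrl = InternetOpenUrlA(hNet, url, NULL, 0, INTERNET_FLAG_RELOAD, 0);
        if (hUrl != NULL) {
            rc = 0;
            for (;;) {
                if (!InternetReadFile(hUrl, chunk, sizeof chunk, &got)) {
                    rc = -1;          /* read error */
                    break;
                }
                if (got == 0)         /* success with 0 bytes means end of data */
                    break;
                if (fwrite(chunk, 1, got, out) != got) {
                    rc = -1;          /* disk write error */
                    break;
                }
            }
            InternetCloseHandle(hUrl);
        }
        InternetCloseHandle(hNet);
    }
#endif
    return rc;
}
```

Open the output with fopen(path, "wb") before calling it so the bytes are written unchanged. Unlike raw sockets, WinINET strips the HTTP headers, so what lands in the file is just the page. To hide the saved file afterwards, SetFileAttributes with FILE_ATTRIBUTE_HIDDEN should do it.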
February 19th, 2003, 03:05 PM
Sounds like what I'm looking for! Thanks for your replies.