
January 3rd, 2004, 09:30 PM
Contributing User
Join Date: Dec 2003
Location: Perth, Western Australia
Posts: 30
You can open an HTML file as plain text (since that's all it is anyway), but finding 'words' will be a nightmare, since a word can be represented in several ways and disguised in many more (e.g. the many tricks that spam HTML uses, especially for rude words and the keywords that anti-spam programs look for).
You can use InStr() to search for the word "rude", but the letters in the word can be written as character references of the form &#xxx;, or the word can be split up using comments like this:
ru<!--comment-->de
This will display in a browser as 'rude', since browsers ignore comments. Words can also be split by putting a carriage return inside a tag or comment in the middle of the word, since browsers ignore line breaks there (a bare line break in the text itself is rendered as a space).
You will need to take into account every single way of masking a word. You can preprocess the text first by stripping out comments and carriage returns, converting &#xxx; references to their character values, and undoing the other tricks that spammers come up with,
e.g. inserting a 1x1 pixel image in the middle of a word, or converting the text to an image (there is no way around that except to remove all images!).
The list goes on and on...
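As a rough illustration of the preprocessing described above, here is a minimal sketch in Python (not VB, purely for brevity). The function name normalize_html and the regular expressions are my own assumptions, and it handles only the tricks mentioned in this post; a real HTML parser would be more robust than regexes.

```python
import html
import re

# Hypothetical helper, not from any library: undo a few common
# word-masking tricks before searching the text.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # comments, even multi-line ones
TAG = re.compile(r"<[^>]*>")                    # any tag, e.g. a 1x1 <img>

def normalize_html(text):
    text = COMMENT.sub("", text)   # ru<!--comment-->de -> rude
    text = TAG.sub("", text)       # drop tags inserted inside words
    text = html.unescape(text)     # &#114; -> r, &amp; -> &
    # Remove carriage returns and line feeds, as suggested above.
    # Note this can also glue together words that sat on separate lines.
    return text.replace("\r", "").replace("\n", "")

print(normalize_html("ru<!--comment-->de"))
print(normalize_html("&#114;&#117;&#100;&#101;"))
```

After this step a plain substring search (the equivalent of InStr()) has a much better chance of finding the word. The tag pattern is naive and will be fooled by a '>' inside an attribute value, and nothing here helps with text rendered as an image.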
Also, you shouldn't work with HTML files line by line, since line breaks are not required in HTML. A whole web page with images and text can be written on one (very long) line.
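To show why line-by-line processing goes wrong, here is a small Python sketch (again an assumption-laden illustration, with made-up function names) that searches the same page both ways. The comment that splits the word spans two lines, so stripping comments one line at a time never sees a complete comment.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def found_line_by_line(html_text, word):
    # Strip comments and search each line separately.
    return any(word in COMMENT.sub("", line)
               for line in html_text.splitlines())

def found_whole_document(html_text, word):
    # Strip comments and search the document as one string.
    return word in COMMENT.sub("", html_text)

# Hypothetical sample page: the comment hiding the word spans two lines.
page = "<p>ru<!-- split\nacross lines -->de</p>"

print(found_line_by_line(page, "rude"))
print(found_whole_document(page, "rude"))
```

In practice that means reading the whole file into one string before doing any stripping or searching.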