#1
Separating text content from HTML using Perl
Can you tell me how to extract the text portion from an HTML page?
I have an HTML page like the following:

    <HTML>
    <HEAD> <TITLE></TITLE> </HEAD>
    <BODY>
    <TABLE WIDTH=450><TR><TD>
    <FONT SIZE=5 FACE="verdana"> Header 1xxxxxx:</FONT><P>
    <FONT SIZE=3 FACE="verdana"><BR>
    <I> Dear Mr Badger Who wrote that? Let's consider the possibilities. <BR><BR>These files are kept safe from downloading from online directories before modifying/uploading to server.<BR>i.e. safe live copies of the core, changeable, data.<BR><BR>In the event of a freeze, re-uploading should revert to the original and workable state.<BR><BR>April 25 2005<BR><BR>03.05.2001<BR><BR>These files taken from various sources-see notes.<BR>Should be ready to rock. <BR>All safety copies ready.<BR>(only .txt, and new updated QAGRID.DAT + QALOCTAB.DAT are untried)<BR><BR>Mr,B,Hagerty,Hawthorns,5 Granborough Road,North Marston,Bucks,MK18 3PN,On the back of a lorry that pulled out in front of me on the M25</I><BR><BR>
    <FONT SIZE=5 FACE="verdana"> Header 2:</FONT><P>
    Application Procedure:<BR><BR>When filing an application form, the student should specify the department or doctoral program subcommittee under which he or she wishes to study. In any given term, a student may apply for study under only one department or subcommittee.
    </TD></TR></TABLE>
    </BODY>
    </HTML>

I would like to separate the header and the content from the HTML, like this:

    Header: Header 1xxxxxx
    Content: Dear Mr Badger Who wrote that? Let's consider the possibilities. These files are kept safe from downloading from online directories before modifying/uploading to server.i.e. safe live copies of the core, changeable, data.In the event of a freeze, re-uploading should revert to the original and workable state.April 25 2005 03.05.2001 These files taken from various sources-see notes.Should be ready to rock. All safety copies ready.(only .txt, and new updated QAGRID.DAT + QALOCTAB.DAT are untried)Mr,B,Hagerty,Hawthorns,5 Granborough Road,North Marston,Bucks,MK18 3PN,On the back of a lorry that pulled out in front of me on the M25

Can you tell me how I would separate the content from the HTML using Perl? Thanks in advance.
#2
You could look into the HTML library. HTML::Parser is used for, that's right, parsing HTML. I've never actually used it, so I couldn't very well direct you on that. This would be your best bet, especially if you are dealing with files of variable formatting and length.

If you are dealing with only this file, or a situation where the headers are in the same general format as far as number of lines used, then you could just count lines and write regular expressions to get the info you need and remove the tags you don't want. Doing this wouldn't be any good if you are dealing with a number of dissimilar files, so I wouldn't suggest it unless the script will be used specifically for only ONE file or ONE TYPE of file.
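For anyone who wants to try the HTML::Parser route, here is a minimal sketch, not tested against the real files. It assumes that headers are always wrapped in <FONT SIZE=5> as in the sample page from post #1, and that HTML::Parser is installed from CPAN. The function name split_sections and the header/content hash keys are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Split an HTML page into sections. ASSUMPTION: text inside <FONT SIZE=5>
# is a header, as in the sample page; everything up to the next header
# is that section's content.
sub split_sections {
    my ($html) = @_;
    my @sections;      # list of { header => ..., content => ... }
    my @font_sizes;    # SIZE values of the currently open <FONT> tags

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'font') {
                my $size = $attr->{size} // '';
                push @font_sizes, $size;
                # A SIZE=5 font tag starts a new section.
                push @sections, { header => '', content => '' } if $size eq '5';
            }
            elsif (@sections && ($tag eq 'br' || $tag eq 'p')) {
                $sections[-1]{content} .= "\n";    # keep line breaks
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            pop @font_sizes if $tag eq 'font' && @font_sizes;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return unless @sections;    # ignore text before the first header
            my $in_header = @font_sizes && $font_sizes[-1] eq '5';
            $sections[-1]{ $in_header ? 'header' : 'content' } .= $text;
        }, 'dtext' ],    # dtext = text with entities decoded
    );
    $parser->parse($html);
    $parser->eof;

    # Tidy whitespace and drop the trailing colon from each header.
    for my $s (@sections) {
        for my $field (qw(header content)) {
            $s->{$field} =~ s/[ \t]+/ /g;
            $s->{$field} =~ s/\s*\n\s*/\n/g;
            $s->{$field} =~ s/^\s+|\s+$//g;
        }
        $s->{header} =~ s/:$//;
    }
    return @sections;
}
```

Calling split_sections($html) returns a list of hash references, so printing "Header: $_->{header}\nContent: $_->{content}\n" for each element gives roughly the layout asked for in post #1. Pages that mark their headers some other way would need a different test in the start handler.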
__________________
- dsb - Perl Guy
#3
Thanks dsb for your reply.
The HTML files I'm using are in different formats. What I need to do is just extract the text portions (header, content, and emails) separately from the HTML. I'm actually trying to create an online editing tool for users. Thanks again, any more help would be really appreciated.
#4
Use regular expressions to find the essential tags and strip out the rest:

    /<title>(.*?)<\/title>/is;   # $1 contains the title
    s/<[^>]+>//g;                # all other tags are stripped out

The first match leaves the title in $1 (non-greedy, and with /s so it also works when the title spans lines); the second strips out everything between < and > brackets.
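The same tag-stripping idea can be stretched to pull out the header/content pairs from the page in post #1. This is only a rough sketch in core Perl: it assumes every header sits in a <FONT SIZE=5 ...>...</FONT> pair as in that sample, and the names strip_tags and split_by_regex are invented here for illustration. It does not decode entities such as &amp;, and it will misfire on pages where a > appears inside an attribute value.

```perl
use strict;
use warnings;

# Remove tags from an HTML fragment and tidy the whitespace.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:br|p)\b[^>]*>/\n/gi;   # turn <BR> and <P> into newlines
    $html =~ s/<[^>]+>//g;                # drop every remaining tag
    $html =~ s/[ \t]+/ /g;                # collapse runs of spaces
    $html =~ s/\s*\n\s*/\n/g;             # collapse blank lines
    $html =~ s/^\s+|\s+$//g;              # trim both ends
    return $html;
}

# ASSUMPTION: headers look like <FONT SIZE=5 ...>text</FONT>.
# split keeps the captured header text between the content pieces.
sub split_by_regex {
    my ($html) = @_;
    my @parts = split /<font\s[^>]*\bsize\s*=\s*"?5(?!\d)"?[^>]*>(.*?)<\/font>/is, $html;
    shift @parts;    # discard everything before the first header

    my @sections;
    while (@parts) {
        my $header  = strip_tags(shift @parts);
        my $content = strip_tags(shift(@parts) // '');
        $header =~ s/:$//;    # "Header 2:" becomes "Header 2"
        push @sections, { header => $header, content => $content };
    }
    return @sections;
}
```

As noted earlier in the thread, this kind of pattern matching only holds up while all the files share one layout; for mixed layouts a real parser is the safer choice.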
Viewing: Dev Shed Forums > Programming Languages > Perl Programming > Separating text content from HTML using Perl