Perl Programming
 
  #1  
July 1st, 2001, 02:11 PM
Haris
Separating text content from HTML using Perl

Can you tell me how to extract the text portion of an HTML page?

I have an HTML page like the following:


<HTML>
<HEAD>
<TITLE></TITLE>
</HEAD>
<BODY>
<TABLE WIDTH=450><TR><TD>
<FONT SIZE=5 FACE="verdana">
Header 1xxxxxx:</FONT><P>
<FONT SIZE=3 FACE="verdana"><BR>
<I>
Dear Mr Badger Who wrote that? Let's consider the possibilities.
<BR><BR>These files are kept safe from downloading from online
directories before modifying/uploading to server.<BR>i.e. safe
live copies of the core, changeable, data.<BR><BR>In the event
of a freeze, re-uploading should revert to the original and workable
state.<BR><BR>April 25 2005<BR><BR>03.05.2001<BR><BR>These files
taken from various sources-see notes.<BR>Should be ready to rock.
<BR>All safety copies ready.<BR>(only .txt, and new updated QAGRID.DAT
+ QALOCTAB.DAT are untried)<BR><BR>Mr,B,Hagerty,Hawthorns,5 Granborough
Road,North Marston,Bucks,MK18 3PN,On the back of a lorry that pulled out
in front of me on the M25</I><BR><BR>
<FONT SIZE=5 FACE="verdana">
Header 2:</FONT><P>
Application Procedure:<BR><BR>When filing an application form,
the student should specify the department or doctoral program
subcommittee under which he or she wishes to study. In any given
term, a student may apply for study under only one department
or subcommittee.
</TD></TR></TABLE>
</BODY>
</HTML>



I would like to separate the header and content from the HTML, like this:


Header:

Header 1xxxxxx

Content:

Dear Mr Badger Who wrote that? Let's consider the possibilities.
These files are kept safe from downloading from online
directories before modifying/uploading to server.i.e. safe
live copies of the core, changeable, data.In the event
of a freeze, re-uploading should revert to the original and workable
state.April 25 2005 03.05.2001 These files
taken from various sources-see notes.Should be ready to rock.
All safety copies ready.(only .txt, and new updated QAGRID.DAT
+ QALOCTAB.DAT are untried)Mr,B,Hagerty,Hawthorns,5 Granborough
Road,North Marston,Bucks,MK18 3PN,On the back of a lorry that pulled out
in front of me on the M25


Can you tell me how I would separate the content from the HTML using Perl?

Thanks in advance

  #2  
July 1st, 2001, 04:17 PM
dsb
You could look into the HTML modules on CPAN. HTML::Parser is used for, that's right, parsing HTML. I've never actually used them, so I can't give you much direction there. This would be your best bet, especially if you are dealing with files of variable formatting and length.
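As a rough illustration of the parser route, here is a minimal sketch that pulls the visible text out of an HTML string with HTML::Parser's version 3 handler interface. It assumes the module is installed from CPAN; the helper name `html_to_text` and the choice to turn `<BR>` and `<P>` into line breaks are assumptions made for this example, not anything from the thread.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands over text with entities such as &amp; already decoded.
        text_h  => [ sub { $text .= shift }, 'dtext' ],
        # Turn <BR> and <P> into line breaks so sentences do not run together.
        start_h => [ sub { $text .= "\n" if $_[0] =~ /^(?:br|p)$/ }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;

    $text =~ s/[ \t]+/ /g;      # collapse runs of spaces and tabs
    $text =~ s/\n\s*/\n/g;      # drop blank lines and leading whitespace
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

print html_to_text('<FONT SIZE=5>Header 2:</FONT><P>Application Procedure:<BR><BR>When filing'), "\n";
```

Because the parser tokenizes the markup itself, this should cope with pages whose layout differs from file to file, which is the case described later in the thread.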

If you are dealing with only this file, or with files whose headers follow the same general format (for example, the same number of lines), you could just count lines and write regular expressions to pull out the info you need and remove the tags you don't want. This is no good if you are dealing with a number of dissimilar files, so I wouldn't suggest it unless the script will be used for only ONE file or ONE TYPE of file.
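For the single-format case, a sketch along these lines might do. It assumes every header is wrapped in an unquoted `<FONT SIZE=5 ...>...</FONT>` pair exactly as in the sample page, and that everything up to the next such tag is that header's content. The helper name `split_sections` is hypothetical, and a page that writes `SIZE="5"` or nests its markup differently would need a different pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: split a page into [header, content] pairs, assuming
# each header sits in <FONT SIZE=5 ...>...</FONT> as in the sample page.
sub split_sections {
    my ($html) = @_;
    my @sections;

    # Capture each header, then everything up to the next header (or the end).
    while ($html =~ m{<FONT\s+SIZE=5[^>]*>(.*?)</FONT>(.*?)(?=<FONT\s+SIZE=5|\z)}gis) {
        my ($header, $content) = ($1, $2);
        for ($header, $content) {
            s/<BR\s*\/?>/\n/gi;   # keep line breaks
            s/<[^>]+>//g;         # strip the remaining tags
            s/^\s+|\s+$//g;       # trim both ends
        }
        $header =~ s/:$//;        # the sample headers end with a colon
        push @sections, [ $header, $content ];
    }
    return @sections;
}

my $page = <<'HTML';
<FONT SIZE=5 FACE="verdana">
Header 1xxxxxx:</FONT><P>
<FONT SIZE=3 FACE="verdana"><BR>
<I>Dear Mr Badger Who wrote that?</I><BR><BR>
<FONT SIZE=5 FACE="verdana">
Header 2:</FONT><P>
Application Procedure:<BR><BR>When filing an application form
HTML

for my $section (split_sections($page)) {
    print "Header:\n\n$section->[0]\n\nContent:\n\n$section->[1]\n\n";
}
```

This is tied to one page layout, so it fits the "ONE TYPE of file" caveat above rather than the general case.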

  #3  
July 2nd, 2001, 01:00 AM
Haris
Thanks for your reply, dsb.
The HTML files I'm using are in different formats. What I need to do is extract the text portions, header, content, and emails separately from the HTML. I'm trying to create an online editing tool for users.

Thanks again. Any more help would be really appreciated.

  #4  
July 2nd, 2001, 12:19 PM
Pressly
A Load of GREP

Use regular expressions (GREP-style matching) to find the essential tags and strip out the rest:

/<title>(.*?)<\/title>/is; # $1 contains the title (non-greedy; may be empty or span lines)
s/<[^>]+>//g; # All other tags are stripped out

The first match leaves the title in $1; the second strips out everything between < and > brackets.
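Wrapped into a small script, the two steps might look like the sketch below. The helper name `title_and_text` is hypothetical, and removing the whole `<head>` block before stripping tags is an assumption added here so the title does not show up again in the body text. The title pattern uses a non-greedy `.*?` so that an empty `<TITLE></TITLE>`, as in the sample page, still matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return (title, text) from an HTML string.
sub title_and_text {
    my ($html) = @_;

    # Non-greedy so only the first <title>...</title> pair is captured;
    # /s lets the title span lines, /i ignores tag case.
    my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}is;
    $title = '' unless defined $title;

    # Drop the head section so the title is not repeated in the text.
    (my $text = $html) =~ s{<head\b.*?</head>}{}is;
    $text =~ s/<[^>]+>//g;       # strip every remaining tag
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return ($title, $text);
}

my ($title, $text) = title_and_text(
    '<HTML><HEAD><TITLE>My Page</TITLE></HEAD><BODY><B>Hello</B> world</BODY></HTML>'
);
print "Title: $title\nText: $text\n";
```

The tag-stripping pattern is only a rough tool: a `>` inside an attribute value or an HTML comment can trip it up, which is where a real parser earns its keep.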
