XML Programming
 
  #1  
August 29th, 2001, 10:36 PM
orionvisuals
HTML content extractor for XML conversion

My employer (Museum Victoria) is beginning the process of upgrading its tens of thousands of web pages from HTML to XML. The benefits of this are numerous, but I assume everyone here knows them.

The problem we currently face is how to get the data (content) out of the existing pages, leaving us with essentially text content that preserves only the basic formatting (heading levels, font emphasis, etc.). I have found a number of HTML strippers, and they are great at taking out the HTML tags alone, but they don't remove any non-body text (e.g. navigation text etc.), and they don't preserve any of the body-text formatting.

HTML-to-XML converters usually just convert the page to XHTML, which is not what we want. Other XML extractors either don't work on more than one page, require substantial scripting, or won't run on a Windows platform, or else I don't know how to automate them to work on our many thousands of pages.

Additionally, because of the many different designs of the pages across the collection of museum sites, a great deal of time needs to be spent creating filters for HTML strippers or XML converters that will work on each site. Ideally a utility which is intelligent enough to do most of the work and leave me to do the fine-tuning would be perfect!

After many hours of searching the web, I'm starting to run out of ideas. Can anyone here help me with this challenge? I'm certain this is going to become a very widespread problem in the next year or so as content providers migrate to XML.

Your help is greatly appreciated!

Regards,
Neil

  #2  
September 12th, 2001, 05:41 AM
cloud9
NoteTab Pro

Hi Neil

My personal favorite, which I have used since 1997, is NoteTab Pro. You can find it at URL. It is a very powerful text/HTML editor that includes an easy-to-learn macro language called the "Clip Language". You can write macro commands in that language to process large numbers of files (the maximum size of a single file is 2 gigabytes, which I assume is enough capacity).


Do not expect NoteTab Pro to do miracles.

There are no easy solutions to some of your requirements, though:

a)
Quote:
but they don't remove any non-body text (e.g. navigation text etc.),
The critical point here is how to distinguish navigation elements (mostly strings enclosed in <a href=...> tags) from "normal" links within the body text.
Write a macro to strip (search and replace) those navigation elements first.

b)
Quote:
and they don't preserve any of the body-text formatting
Write a macro that converts the elements you want to preserve into HTML entities, so that they survive the tag-stripping step below.

Do the following further steps:
c) Strip HTML from your files

d) Reconvert HTML-entities to HTML-tags and attributes.

e) Convert the files to XML or XHTML with the NoteTab Pro built-in conversion filter
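
For anyone who wants to try steps a) to d) outside NoteTab, here is a rough sketch of the same idea. Note that it is Python, not Clip Language, and it only illustrates the protect / strip / restore sequence. The names `KEEP_TAGS`, `NAV_BLOCK` and `extract` are made up for this example, and the assumption that navigation sits in a `<div class="nav">` block is purely hypothetical; each site would need its own pattern there. Step e) is left to the conversion filter.

```python
import re

# Assumption: the formatting tags worth keeping. Adjust per site.
KEEP_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6", "b", "i", "em", "strong", "p")

# Assumption (hypothetical markup): navigation is wrapped in
# <div class="nav"> ... </div> with no nested <div> inside it.
NAV_BLOCK = re.compile(r'<div\s+class="nav"[^>]*>.*?</div>',
                       re.IGNORECASE | re.DOTALL)

# Opening or closing form of any tag we want to keep, with any attributes.
KEEP_TAG = re.compile(r"<(/?)(%s)\b[^>]*>" % "|".join(KEEP_TAGS),
                      re.IGNORECASE)

# Any remaining tag at all.
ANY_TAG = re.compile(r"<[^>]+>")

# The entity-escaped form produced by the protect step.
PROTECTED = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(KEEP_TAGS))


def extract(html):
    """Return body text with only the KEEP_TAGS formatting left in place."""
    # a) Strip navigation blocks first.
    text = NAV_BLOCK.sub("", html)
    # b) Protect the tags to preserve by turning them into HTML entities.
    #    Attributes are dropped and tag names are lowercased.
    text = KEEP_TAG.sub(
        lambda m: "&lt;%s%s&gt;" % (m.group(1), m.group(2).lower()), text)
    # c) Strip every tag that is still a real tag.
    text = ANY_TAG.sub("", text)
    # d) Turn the protected entities back into tags.
    #    Caveat: text that already contained e.g. "&lt;b&gt;" literally
    #    would also be turned into a tag by this step.
    return PROTECTED.sub(r"<\1\2>", text)


page = ('<html><body><div class="nav"><a href="/">Home</a></div>'
        '<h1 class="t">Title</h1>'
        '<p>Some <b>bold</b> and <a href="x.html">linked</a> text.</p>'
        '</body></html>')
result = extract(page)
print(result)
```

Regular expressions are a blunt tool for HTML, so treat the output as a first pass that still needs the per-site fine-tuning mentioned in the original question.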

Quote:
Ideally a utility which is intelligent enough to do most of the work and leave me to do the fine-tuning would be perfect!

The Clip Language - in my opinion - meets your requirements.

cheers, tom
