#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2012
    Location
    Terlingua, TX
    Posts
    16
    Rep Power
    0

    Question Parse a subset of an XML file with regex


    What follows are extracts of an XML document contained in an EPUB file. These lines always fall within the <metadata> and </metadata> markers. Generally these are only four out of 10-15 lines to be found there.
    It's easy enough to parse the file line by line and use the information I find there. (Well, not particularly easy, since the information on the lines can vary pretty drastically.) It would be easier if I could use a regex to pull only the information between the > and the </ (for example, "Dave Barry" in the first example). Can anyone suggest a regex, please? No, I don't know what I'm doing here, and I have no idea where to begin telling a regex to ignore "all this" and give me back "this".

    Most of these are converted ebooks - mobi to epub, for example. My intent is to be able to standardize the names of the books in my library.

    Programmatically I can do everything except comfortably derive the series and series number. Calibre generally gives the series name (calibre:series) and the book number in the series (calibre:series_index). The problem I'm having trying to read this as an XML file is that there isn't any real order for the information in the file.

    Examples of the XML extract ...
    From Calibre:
    <dc:creator opf:file-as="Unknown" opf:role="aut">Dave Barry</dc:creator>
    <dc:title>Dave Barry’s Greatest Hits</dc:title>
    File as: Barry, Dave - Dave Barry's Greatest Hits

    <meta name="calibre:series_index" content="22"/>
    <dc:creator opf:file-as="Hamilton, Laurell K." opf:role="aut">Laurell K. Hamilton</dc:creator>
    <meta name="calibre:series" content="Anita Blake"/>
    <dc:title>Anita Blake - 22 - Affliction</dc:title>
    File as: Hamilton, Laurell K - Anita Blake 22 - Affliction
    Even given the "right information" I still have to parse out some things to be able to file it correctly, but that's no big deal. For example, this title already contains the series and series number.

    <dc:creator opf:file-as="Byrde, Ann-Katrin" opf:role="aut">Ann-Katrin Byrde</dc:creator>
    <dc:title>A Baby for the Firefighter</dc:title>
    <meta name="calibre:series" content="Oceanport Omegas"/>
    <meta name="calibre:series_index" content="2"/>
    File as: Byrde, Ann-Katrin - Oceanport Omegas 02 - A Baby for the Firefighter

    <dc:title>A Body on the Beach</dc:title>
    <dc:creator opf:role="aut" opf:file-as="Baker, Blythe">Blythe Baker</dc:creator>
    <meta name="calibre:series" content="Sunrise Island Mysteries"/>
    <meta name="calibre:series_index" content="1.0"/>

    <dc:creator>Unknown</dc:creator>
    <dc:title> </dc:title>
    Ignore the "Unknown"; I'll be looking for the author and title in the file name. Hopefully something will be there; otherwise it's filed as "Unknown".

    From Epubor:
    <dc:creator opf:file-as="Diana Xarissa" opf:role="aut">Diana Xarissa</dc:creator>
    <dc:title>Aunt Bessie Knows (An Isle of Man Cozy Mystery Book 11)</dc:title>
  2. #2
  3. Backwards Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,904
    Rep Power
    9646
    XML is specifically and deliberately designed to be very easily machine-readable. Every programming language out there has some sort of XML library, either built-in or easily installed.
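
    For what it's worth, here is a minimal sketch of the library approach in Python, using the standard library's xml.etree.ElementTree. The sample document is built from one of the extracts in the first post; the <package> wrapper and the namespace declarations are assumptions about what the rest of the OPF file looks like, and read_metadata is just an illustrative name. Because the parser looks elements up by name, the order of the lines inside <metadata> doesn't matter.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for OPF and Dublin Core (assumed to match the real file).
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample: the four metadata lines come from the thread,
# the surrounding <package>/<metadata> wrapper is an assumption.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:creator opf:file-as="Byrde, Ann-Katrin" opf:role="aut">Ann-Katrin Byrde</dc:creator>
    <dc:title>A Baby for the Firefighter</dc:title>
    <meta name="calibre:series" content="Oceanport Omegas"/>
    <meta name="calibre:series_index" content="2"/>
  </metadata>
</package>
"""


def read_metadata(opf_text):
    """Return a dict of title, author, file-as and calibre:* values."""
    root = ET.fromstring(opf_text)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        return {}

    info = {}
    title = metadata.find("dc:title", NS)
    creator = metadata.find("dc:creator", NS)
    if title is not None:
        info["title"] = (title.text or "").strip()
    if creator is not None:
        info["author"] = (creator.text or "").strip()
        # Namespaced attributes are stored as "{uri}localname".
        info["file_as"] = creator.get("{%s}file-as" % NS["opf"], "")

    # <meta name="calibre:..." content="..."/> can appear in any order.
    for meta in metadata.findall("opf:meta", NS):
        name = meta.get("name", "")
        if name.startswith("calibre:"):
            info[name] = meta.get("content", "")
    return info


if __name__ == "__main__":
    print(read_metadata(SAMPLE_OPF))
```

    Missing elements simply don't show up in the returned dict, so the "Unknown" and blank-title cases can be handled by checking for empty or absent keys.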

    Can you try that approach instead? Regular expressions can do this, but it takes a lot of work to make sure they behave correctly for all possible inputs.
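
    If you do want to see what the regex route looks like, here is a rough sketch in Python. The patterns and the name scan_metadata are illustrative only, and they bake in assumptions that real files may break: that name comes before content in every <meta> tag, that attributes use double quotes, and that the text contains no XML entities such as &amp;.

```python
import re

# Text between <dc:title ...> or <dc:creator ...> and the matching close tag.
TAG_TEXT = re.compile(r"<dc:(title|creator)\b[^>]*>(.*?)</dc:\1>", re.DOTALL)

# <meta name="calibre:..." content="..."/>, assuming this attribute order.
META = re.compile(r'<meta\s+name="(calibre:[^"]+)"\s+content="([^"]*)"\s*/>')

# Sample lines copied from the first post.
SAMPLE = """
<dc:title>A Body on the Beach</dc:title>
<dc:creator opf:role="aut" opf:file-as="Baker, Blythe">Blythe Baker</dc:creator>
<meta name="calibre:series" content="Sunrise Island Mysteries"/>
<meta name="calibre:series_index" content="1.0"/>
"""


def scan_metadata(text):
    """Collect title, creator and calibre:* values with regexes only."""
    found = {}
    for tag, value in TAG_TEXT.findall(text):
        found[tag] = value.strip()
    for name, content in META.findall(text):
        found[name] = content
    return found


if __name__ == "__main__":
    print(scan_metadata(SAMPLE))
```

    A file that writes content="..." before name="..." would slip past the second pattern without any error, which is the kind of input that takes the extra work.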
