#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    76
    Rep Power
    2

    Making Text Files Look Neater


    So, in addition to coding, I'm also something of a poet.
    I've had a few changes over the years to how I wrote the headers in my poetry files, as illustrated below:

    Torn To Bits
    By *****
    Originally published June 21, 2011

    Where's My Moment?
    By *****
    Published To Facebook November 13, 2010

    Whisper Goodbye
    By *****
    Written August 4, 2011

    Bathed In Blood And Glory
    By *****
    Inspired by Psalms 32
    Written July 2, 2011 [This is the only date I actually want]
    Originally published to facebook July 3, 2011

    So, I had an idea about writing a program that would take these and turn them into the following uniform format:

    [Poem Name]
    [By My Name]
    Written on [Date]
    [Series Names]
    [Any additional notes that would originally have been mixed in the wrong space, like the 'Inspired by Psalms 32' found above]

    (Series names are based on the folder they're found in. The examples above, coincidentally, didn't seem to have them on there, although some headers do.)
    And I wanted to create a text file that would go in each sub category in the following manner:

    Name Date Written Series Name

    Organized not by name (Which the folder ought to do by default) but by Date Written.

    So, here are the questions I have to ask:

    How do I search those inconsistent headers, find the first line that contains a date, and pull that date out regardless of what format it might be in? I don't think I have any purely numeric dates, but I'm not certain. I know there are one or two instances where the date only includes a year. If I could catch the cases that don't follow the normal format and look at them separately to make judgement calls, that would also work.

    How do I make a list with uniform tabs that would fit to any sized poem names? My longest poem name is like 20+ characters long, and my shortest is like 8, so I don't know how to tab them to uniform length.
    Last edited by Mr909; August 12th, 2013 at 01:11 PM.
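For the uniform-columns question, one possible approach is to pad with spaces rather than tab characters, since tabs render at different widths in different editors. The sketch below is only an illustration: `format_index`, the `gap` parameter, the ISO date display, and the series names in the usage example are all assumptions, not something from the post.

```python
from datetime import date


def format_index(rows, gap=2):
    """Return aligned text lines for (name, date_written, series) rows.

    Rows are sorted by date written, and each of the first two columns is
    padded to the width of its longest entry plus `gap` spaces.
    """
    ordered = sorted(rows, key=lambda row: row[1])
    table = [("Name", "Date Written", "Series Name")]
    # ISO format (YYYY-MM-DD) is an assumption; any fixed display format works.
    table += [(name, written.isoformat(), series) for name, written, series in ordered]

    # Column width = longest cell in that column + the gap between columns.
    name_width = max(len(row[0]) for row in table) + gap
    date_width = max(len(row[1]) for row in table) + gap

    return [
        name.ljust(name_width) + written.ljust(date_width) + series
        for name, written, series in table
    ]


if __name__ == "__main__":
    # Series names here are hypothetical placeholders.
    poems = [
        ("Torn To Bits", date(2011, 6, 21), "Series A"),
        ("Whisper Goodbye", date(2011, 8, 4), "Series A"),
        ("Where's My Moment?", date(2010, 11, 13), "Series B"),
    ]
    print("\n".join(format_index(poems)))
```

Because the widths are computed from the data, the same function works whether the longest title is 8 characters or 40. The columns only line up when the file is viewed in a monospaced font.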
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,900
    Rep Power
    481
    I usually use flex and bison to parse grammars.
    You might interpret this as "Use a finite state machine" if regular expressions (which are themselves FSMs) don't work.
    (Apologies for the non-sentences. I shouldn't post while consuming wine.) The dates look like a problem. (As I think you said. I haven't read your post today.) It may be easiest in your case to find the parts that do not look like a date. Then we'd have the problem of converting the dates to a standard form. It's a translation. Google translation works with Bayes' law (Bayes networks), with a fantastic amount of data providing the probabilities.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    76
    Rep Power
    2
    I want to do this... within Python.
    It's not that I have problems with other frameworks, it's just that I kind of want to stick with this one.
    Any way to approach this sort of thing with that caveat?
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2010
    Posts
    153
    Rep Power
    5
    You need to get very familiar with the "re" module, and regular expressions. Brace yourself.

    Humans are very good at looking at a wide variety of information and inferring relationships of various kinds from subtleties. Computers need very tight rules, so if you want a computer to translate this stuff you need to tell it the rules.

    In my experience doing similar kinds of things (e.g, turning free-form spreadsheet data into clean, uniform RDBMS databases), you basically have to survey the data and develop a set of rules that matches the widest possible set of records, then check the output and fix the edge cases.

    There may be a "magic module" somewhere in a GitHub repo that translates date strings to datetime objects, but I'm not specifically aware of one. Most likely you're just going to have to develop a regex.
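As a rough illustration of the "develop a regex, then check the edge cases" approach, here is a minimal sketch using only the standard `re` and `datetime` modules. It assumes the dates look like the ones in the first post ("June 21, 2011") or are a bare four-digit year. The name `find_first_date` and the `needs_review` flag are made up for this example, and real headers may well need more patterns.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Month name, day, optional comma, four-digit year, e.g. "July 2, 2011".
FULL_DATE = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2}),?\s+(\d{4})\b",
    re.IGNORECASE,
)
# Fallback for headers that only mention a year (assumed to be 19xx or 20xx).
YEAR_ONLY = re.compile(r"\b((?:19|20)\d{2})\b")


def find_first_date(header_lines):
    """Return (value, needs_review) for the first line that contains a date.

    value is a datetime.date for a full date, an int for a year-only match,
    or None when nothing was found or the match was not a real calendar date.
    """
    for line in header_lines:
        match = FULL_DATE.search(line)
        if match:
            month, day, year = match.groups()
            try:
                return date(int(year), MONTHS[month.lower()], int(day)), False
            except ValueError:
                # Looked like a date but is not one, e.g. "June 31, 2011".
                return None, True
        match = YEAR_ONLY.search(line)
        if match:
            # Year-only headers are flagged so a human can make the call.
            return int(match.group(1)), True
    return None, True


if __name__ == "__main__":
    header = [
        "Bathed In Blood And Glory",
        "By *****",
        "Inspired by Psalms 32",
        "Written July 2, 2011",
        "Originally published to facebook July 3, 2011",
    ]
    print(find_first_date(header))
```

Anything returned with `needs_review` set to `True` can be written to a separate list and handled by hand, which matches the "fix the edge cases" step described above.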
