#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Posts
    154
    Rep Power
    14

    Problem formulating a regular expression


    I'm having trouble writing a regular expression to group text enclosed in an XML tag. I'm trying to create a list which has, as element 0 (1st element), all the text before the first '<s>' (sentence) tag and then for each proceeding element up to the penultimate element, the text inside each '<S>...text...</S>' is stored. The last element then contains all the data occurring after the last '</S>' tag e.g. '</DIV></BODY></REFERENCES><SURNAME>joebloggs</SURNAME>...etc.'

    When I use the following regular expression:

    pat = '</?S[^A-Z]\s?I?D?=?\'?S?-?\d?\d?\d?\'?\s?T?Y?P?E?=?\'?\w*\'?>'

    the tag "<S ID='S-10' TYPE='ITEM'>" is found, but the closing sentence tags "</S>" aren't.

    If I change the expression to:

    pat = '</?S\s?I?D?=?\'?S?-?\d?\d?\d?\'?\s?T?Y?P?E?=?\'?\w*\'?>'

    both the opening "<S ID=...>" and closing "</S>" of sentence tags are matched, but also the tag "<SURNAME>...name...</SURNAME>" is matched.

    I need to write a single regular expression which will exactly match both:

    1. Opening of sentence tags e.g. <S ID='S-10' TYPE='ITEM'>

    2. Closing of sentence tags i.e. </S>

    but not match any other tags such as <SURNAME>


    I can see why other tags beginning with an S i.e. <S*> are matched, but I'm not sure how to write an expression capable of matching both the above conditions, but not any other tags.

    Any ideas appreciated???
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2003
    Posts
    624
    Rep Power
    34
    Complete the popular saying:

    "I'm not parsing XML with an XML parser because..."

  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Posts
    154
    Rep Power
    14
    One could answer posts in many tones, but misplaced sarcasm was not the answer I was looking for. I'm using reg. expressions because that's what I've got to work with to fully utilise a natural language processing library which extracts all words occurring between a certain pattern match!!

    In the meantime, I have hacked together a temporary solution which satisfies the conditions of my original post, but I'd be interested to hear the thoughts of the wider community to see if a more elegant expression can be constructed.
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    I don't have much time to explain this right now, but you should be looking at “back-references” (\\1 and (?P=name) in Python; $1 in Perl). These let you refer to a part of the expression that has already matched and use it later in the regex.

    For example, this means that you could write a regex that will match the closing tag that corresponds to its opening tag. Something like this:

    '<(\w.+?)\s.+?>.*?</\\1>'

    This is described in part in the regular expression how-to which you can find at

    http://www.amk.ca/python/howto/regex/
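
To make the back-reference idea concrete, here is a minimal sketch. The sample string and the names `pair` and `named` are made up for illustration, and the pattern is a simplified variant of the one above that also accepts tags without attributes; it is not meant as a general XML matcher.

```python
import re

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
# Making the attribute part optional is an assumption of this sketch.
pair = re.compile(r"<(\w+)(?:\s[^>]*)?>(.*?)</\1>", re.DOTALL)

# Hypothetical sample input, modelled on the tags mentioned in the thread.
sample = "<S ID='S-10' TYPE='ITEM'>first</S><SURNAME>joebloggs</SURNAME>"

matches = [(m.group(1), m.group(2)) for m in pair.finditer(sample)]
print(matches)

# The same idea with a named group and a named back-reference.
named = re.compile(
    r"<(?P<tag>\w+)(?:\s[^>]*)?>(?P<body>.*?)</(?P=tag)>", re.DOTALL
)
tag_names = [m.group("tag") for m in named.finditer(sample)]
print(tag_names)
```

Using a raw string (`r"..."`) means the back-reference can be written as `\1` rather than `\\1`.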

    Since you're working with pretty complex regexes, you might also want to take a look at Kodos, a regular expression debugger for Python.

    http://kodos.sourceforge.net/

    Take care,

    Mark.
    programming language development: www.netytan.com Hula

  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2003
    Posts
    624
    Rep Power
    34
    Originally Posted by markb_1984
    One could answer posts in many tones, but misplaced sarcasm was not the answer I was looking for.
    OK.

    Your pattern is a bit redundant from what I can see - there is no need to specify a ? after each character when matching a straight string. That there must be a T followed by a Y is enough to stop there being two T's matched in a row.

    This seems to match from my limited testing, but I might be missing some intended functionality in your original post;

    Code:
    >>> import re
    >>> text = """
    ... <SOME_TEXT>text</SOME_TEXT>
    ... <S ID='S-10' TYPE='ITEM'>This is text segment one, over</S>
    ... <SURNAME>Bloggs</SURNAME>
    ... <SURNAME NAME='Fred'>Bloggs</SURNAME>
    ... <XML>nothing</XML>
    ... <S ID='S-11' TYPE='OTHER'>This is text segment two, over</S>"""
    >>> 
    >>> pattern = "<S ID=\'(.*?)\' TYPE=\'(.*?)\'>(.*?)</S>"
    >>> 
    >>> results = re.findall(pattern, text)
    >>>
    >>> results[0]
    ('S-10', 'ITEM', 'This is text segment one, over')
    >>> results[1]
    ('S-11', 'OTHER', 'This is text segment two, over')
    >>>
    Your expression constrains the content of the ID='...' attribute to "S-" followed by up to three digits (each \d? is optional, so it would even accept no digits at all), while your example shows S-10, which has two. I don't know which is the one you need, but to require between one and three digits, you could try something like:

    Code:
    <S ID=\'(S-\d{1,3})\' TYPE=\'(.*?)\'>(.*?)</S>
    To keep the regex simple, I have ignored beginning and ending characters, as these are handled quite easily with standard string functions, such as

    Code:
    text[:text.find("<S ")]
    to get all the text up to the first S tag, and

    Code:
    text[text.rfind("</S>"):]
    to get all from the last </S> tag. You could put those in your list separately. (Though, I might read from your next comment that this isn't an option, I'm not sure).
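
Putting the pieces together, a rough sketch of building the list described in the original post might look like this. The sample `doc` string and the variable names are hypothetical, and the code assumes the text contains at least one `<S ...>` tag (otherwise `find` returns -1 and the slices would be wrong). Unlike the slice above, the tail here skips past the closing tag itself.

```python
import re

# Hypothetical sample document; element names follow the thread's examples.
doc = (
    "<BODY><DIV>"
    "<S ID='S-10' TYPE='ITEM'>This is text segment one, over</S>"
    "<S ID='S-11' TYPE='OTHER'>This is text segment two, over</S>"
    "</DIV></BODY><SURNAME>joebloggs</SURNAME>"
)

sentence = re.compile(r"<S ID='(S-\d{1,3})' TYPE='(.*?)'>(.*?)</S>", re.DOTALL)

first_open = doc.find("<S ")    # the trailing space keeps <SURNAME> out
last_close = doc.rfind("</S>")

head = doc[:first_open]                  # everything before the first <S ...>
tail = doc[last_close + len("</S>"):]    # everything after the last </S>
bodies = [m.group(3) for m in sentence.finditer(doc)]

parts = [head] + bodies + [tail]
print(parts)
```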

    That might be more what you were looking for (at least, I hope it's accurate to the question you posted).

    I'm using reg. expressions because that's what I've got to work with to fully utilise a natural language processing library which extracts all words occurring between a certain pattern match!!
    I don't fully understand your point. You mean you have to pass the regular expression to the library?

    I ask because from your original post, it seems that you get XML, you extract content from it, and format it into a list - and Python comes with an XML parsing library (or two) as well as the regular expression library.

    Code:
    import xml.dom.minidom

    # text must be well-formed XML with a single root element
    doc = xml.dom.minidom.parseString(text)
    # or use .parse(filename)

    for sentence in doc.getElementsByTagName("S"):
        print(sentence.childNodes[0].nodeValue)     # text inside the tag
        print(sentence.attributes["ID"].nodeValue)
        print(sentence.attributes["TYPE"].nodeValue)
    This would quite easily allow you to handle "for each proceeding element up to the penultimate element, the text inside each '<S>...text...</S>' is stored." in a way that's easier to read than the regular expression, and is using the right tool for the job described.
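
One caveat is that the parser needs a single root element, which a bare run of tags like the sample text does not have. A small sketch of one way around that follows; the `ROOT` wrapper name and the `fragment` string are hypothetical, and wrapping is only an assumption about how the input might be adapted.

```python
import xml.dom.minidom

# Hypothetical fragment with several top-level elements and no single root.
fragment = (
    "<SURNAME>Bloggs</SURNAME>"
    "<S ID='S-10' TYPE='ITEM'>This is text segment one, over</S>"
    "<S ID='S-11' TYPE='OTHER'>This is text segment two, over</S>"
)

# ROOT is a made-up wrapper element so the parser sees one document element.
doc = xml.dom.minidom.parseString("<ROOT>" + fragment + "</ROOT>")

sentences = [
    (s.getAttribute("ID"), s.getAttribute("TYPE"), s.firstChild.nodeValue)
    for s in doc.getElementsByTagName("S")
]
print(sentences)
```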


    Aside from that concern, natural language isn't really regular as far as I see it, so what does the library do with the expressions? (Just curiosity here, I've never looked at any natural language work at all).
    Last edited by sfb; December 15th, 2004 at 06:06 PM.
  10. #6
  11. Reinvent the Circle
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2004
    Location
    Jonesville, MI
    Posts
    218
    Rep Power
    16
    I fully agree with sfb, in that modules are very good things, under most circumstances. You can see the code sfb presents is very short and concise for the job.

    Nevertheless, to answer this problem:

    Originally Posted by markb_1984
    pat = '</?S[^A-Z]\s?I?D?=?\'?S?-?\d?\d?\d?\'?\s?T?Y?P?E?=?\'?\w*\'?>'

    . . .

    I need to write a single regular expression which will exactly match both:
    1. Opening of sentence tags e.g. <S ID='S-10' TYPE='ITEM'>
    2. Closing of sentence tags i.e. </S>
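    The immediate problem is that [^A-Z] is mandatory: in "</S>" it consumes the ">", leaving nothing for the final > in the pattern. Adding a ? after it fixes that, but on its own it lets <SURNAME> match again, because everything after it is optional and \w* will absorb "URNAME". So the more important problem is not making use of grouping by parentheses.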
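
A quick way to see this behaviour is to run the two patterns from the original post against the three tags in question. This is only a sketch: the patterns are rewritten as raw strings with double quotes (which should be equivalent), and the helper `hits` and the variable names are made up for illustration.

```python
import re

tags = ["<S ID='S-10' TYPE='ITEM'>", "</S>", "<SURNAME>"]

# The two patterns from the original post, as raw strings.
first = r"</?S[^A-Z]\s?I?D?=?'?S?-?\d?\d?\d?'?\s?T?Y?P?E?=?'?\w*'?>"
second = r"</?S\s?I?D?=?'?S?-?\d?\d?\d?'?\s?T?Y?P?E?=?'?\w*'?>"

# The first pattern with [^A-Z] made optional.
patched = first.replace("[^A-Z]", "[^A-Z]?")

def hits(pattern):
    """Return the tags that the pattern matches in full."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]

first_hits = hits(first)      # only the opening <S ...> tag
second_hits = hits(second)    # all three, including <SURNAME>
patched_hits = hits(patched)  # all three again

print(first_hits, second_hits, patched_hits, sep="\n")
```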

    This...
    Code:
    pat = r'</?[sS](?:\s+ID=([\'\"])S-\d{2,3}\1\s+TYPE=([\'\"])\w*\2)?\s*>'
    ...is cleaner. Notice the \s+ instead of \s? -- you always want to think fault-tolerance. I believe XML allows more than one whitespace character between the parts of a tag. Maybe it is also correct to have a lowercase <s>? Thus [sS]. Notice also the ([\'\"]) and \1 & \2 -- This will allow for single- and double-quotes while at the same time ensuring that the beginning and end quotes are the same.
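
As a rough check of the quote-matching idea, the sketch below tries a variant of this pattern against a handful of made-up tags. The pattern is written as a raw triple-quoted string so the quote characters need no escaping; the name `sentence_tag` and the candidate list are assumptions for illustration.

```python
import re

# \1 and \2 must repeat whichever quote character opened each value.
sentence_tag = re.compile(
    r"""</?[sS](?:\s+ID=(['"])S-\d{2,3}\1\s+TYPE=(['"])\w*\2)?\s*>"""
)

candidates = [
    "<S ID='S-10' TYPE='ITEM'>",     # single quotes
    '<S ID="S-10"   TYPE="ITEM">',   # double quotes, extra whitespace
    "</S>",                          # closing tag
    "<SURNAME>",                     # a different tag starting with S
    "<S ID='S-10\" TYPE='ITEM'>",    # mismatched quotes around the ID
]

accepted = [c for c in candidates if sentence_tag.fullmatch(c)]
print(accepted)
```

Only the first three candidates should be accepted: the other tag name and the mismatched quotes are both rejected.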

    If you wanted more flexibility in your regex, you might consider changing its format to something like this:
    Code:
    pat = r'</?[sS](?:\s+[a-zA-Z]+=[\'\"][\w-]*[\'\"])*\s*>'
    This would allow any number (0-n) of properties (lower or uppercase), with their values (in quotes). If you changed [sS] to [a-zA-Z]+, this regex would match most simple XML tags, that is, those whose names are plain letters and whose attribute values are simple quoted tokens.
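
One possible use of such a pattern, sketched under the assumption that the goal is the list from the original post, is to hand it to `re.split`. The `doc` string and the names below are hypothetical, and this variant accepts any quoted attribute value rather than only word characters.

```python
import re

# Non-capturing groups only, so re.split returns just the text between tags.
any_s_tag = re.compile(r"""</?[sS](?:\s+[a-zA-Z]+=['"][^'"]*['"])*\s*>""")

# Hypothetical sample document.
doc = (
    "<BODY><SURNAME>joebloggs</SURNAME>"
    "<S ID='S-10' TYPE='ITEM'>one</S>"
    "<S ID='S-11' TYPE='OTHER'>two</S>"
    "</BODY>"
)

pieces = any_s_tag.split(doc)
print(pieces)

# Adjacent sentences leave an empty string between </S> and the next <S ...>;
# dropping empty pieces gives head, sentence texts, tail.
parts = [pieces[0]] + [p for p in pieces[1:-1] if p] + [pieces[-1]]
print(parts)
```

Note that any text sitting between two sentences would also end up in the list, so this only fits documents where the sentences are contiguous.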

    Just an exercise in regexes, as much or more for other readers as for the original party.
    -Yanno

    "If it will have to be done more than once, don't do it. Make something that does it for you."

    "If you want to get out of the box, you must not think outside the box, you must think about the box. In fact, think about destroying the box."
