#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0

    My regex works on sandbox but not on Python


    Hi,

    I need your help to solve the problem with either my regex or the Python code.

    My regex is supposed to select all the sentences that start with a dash (-) and end with a dash (-), a full stop (.), a comma (,) or a question mark (?). It does work in the sandbox, highlighting all the strings I need. My regex is:

    \-(.*?)-|-(.*?).*

    But once I use it in the code for parsing the text, the code looks like this:

    import re

    filename="C:/Users/Gabriele/My Documents/python/XXX"
    fileobj=open(filename, "r")
    ptext=fileobj.read()
    fileobj.close()
    pat=r'\-(.*?)-|-(.*?).*'
    m=re.findall(pat, ptext)

    print m

    Python returns the following mess:

    [('\x00 \x00D\x00i\x00l\x00z\x00e\x00,\x00 \x00', ''), ('', ''), ('', ''), ('\x00 \x00E\x00i\x00n\x00u\x00,\x00 \x00e\x00i\x00n\x00u\x00,\x00 \x00', ''), ('\x00 \x00A\x00a\x01 \x00\r\x01i\x00a\x00.\x00 \x00P\x00r\x00i\x00p\x00i\x00l\x00s\x00i\x00u\x00 \x00j\x00\x05\x01,\x00 \x00k\x00a\x00i\x00 \x00t\x00i\x00k\x00 \x00v\x00a\x00n\x00d\x00u\x00o\x00 \x00s\x00u\x00a\x01i\x00l\x00s\x00,\x00 \x00', ''), ('', '')]

    What am I doing wrong?

    Btw, the text I parse is in Lithuanian.

    Many many thanks!

    Gabriele
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,843
    Rep Power
    480
    Code:
    pat=r'-[^-!.?]*[-!.?]'
    m=re.findall(pat, ptext)
    for sentence in m:
        print(sentence)
    My pattern differs from your pattern, but that's not really the point.
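A minimal sketch of how the two patterns differ in what `re.findall` returns, written for Python 3. The English sample sentence is made up for illustration; only the two patterns come from the thread. With capture groups, `findall` returns one tuple of group contents per match (empty strings for groups that did not take part); without groups it returns the whole matched text.

```python
import re

# Hypothetical sample line, shaped like the dialogue in the thread.
text = "- Dilze, - she called. - Dilze!"

# Original pattern: two capture groups, so findall yields tuples of groups.
grouped = re.findall(r'\-(.*?)-|-(.*?).*', text)

# Suggested pattern: no groups, so findall yields the full matches.
whole = re.findall(r'-[^-!.?]*[-!.?]', text)

print(grouped)  # [(' Dilze, ', ''), ('', '')]
print(whole)    # ['- Dilze, -', '- Dilze!']
```

The empty tuples come from the second alternative, where the lazy group `(.*?)` matches nothing and the trailing `.*` swallows the rest of the line outside any group.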
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Code:
    pat=r'-[^-!.?]*[-!.?]'
    m=re.findall(pat, ptext)
    for sentence in m:
        print(sentence)
    My pattern differs from your pattern, but that's not really the point.
    Hi,
    Thanks for this, but even this code returns something like the following:

    ('\x00 \x00D\x00i\x00l\x00z\x00e\x00,\x00 \x00', '')
    ('', '')
    ('', '')
    ('\x00 \x00E\x00i\x00n\x00u\x00,\x00 \x00e\x00i\x00n\x00u\x00,\x00 \x00', '')
    ('\x00 \x00A\x00a\x01 \x00\r\x01i\x00a\x00.\x00 \x00P\x00r\x00i\x00p\x00i\x00l\x00s\x00i\x00u\x00 \x00j\x00\x05\x01,\x00 \x00k\x00a\x00i\x00 \x00t\x00i\x00k\x00 \x00v\x00a\x00n\x00d\x00u\x00o\x00 \x00s\x00u\x00a\x01i\x00l\x00s\x00,\x00 \x00', '')
    ('', '')

    Could it be for the language which contains many diacritics? I saved the txt in unicode...

    More help is appreciated!
    Thanks,
    Gabriele
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Is it possible to have a sample file just to see the format of the searched statements (if it is not confidential, of course)?

    What do you use as delimiter of statements? space, new line, tabs?

    Regards,
    Dariyoosh
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Hi Dariyoosh,

    Thanks for reply.
    The sample looks as follows:

    - Dilze, - šaukė ji be jokios intonacijos, pabrėžtinumo ar skubos, tarsi nesitikėdama atsakymo. - Dilze!
    Dilzė atsakė ir liovės barškinusi rykais, stovinčiais ant krosnies, bet dar nespėio pereit per virtuvę, kai ponia Kompson pašaukė dar kartą, o kol ji perėjo per valgomąjį ir kyštelėjo galvą į tą pilką lango šviesą, - dar vieną kartą.
    - Einu, einu, - atsakė Dilzė. - Aš čia. Pripilsiu ją, kai tik vanduo sušils, - pasikaišė sijoną ir ėmė kopti laiptais, visai užstodama tą pilką šviesą. - Padėkit ją antžemės ir grįžkite į lovą.

    If I understand your question re delimiters, the dash (-) always delimits the beginning of the string I want to capture, and the ending could be (,.?!) followed by another dash (-) or whitespace.
    I highlighted the strings in the sample that illustrate how they may look and vary.

    I hope this might be solved. I'm a literary scholar so the code writing is a deep deep forest for me

    Thank you in advance!
    Gabriele
  10. #6
  11. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Ok, looking at your sample, as b49P23TIvg said, findall() can do the job. Yet there is something that I think has to be clear before deciding how to read the file (I mean reading it as one whole string with read(), or line by line with readlines() or a loop over the file).

    Can the following happen in your file? For example:

    Code:
    " - Beginning of the searched string ....... \n
        The rest of the searched string ................. ? -
    That is, can we have a matched string which is written on several lines? because if it is the case, this can have an impact on how tokens are detected based on the regular expression and more importantly based on how the file is read.

    My question may seem trivial, but as I didn't understand the content (language) of your text, I thought it might be important to point out.
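For reference, a small sketch of the difference between the two ways of reading mentioned above, written for Python 3. The sample text and the temporary file are made up so that the example is self-contained: `read()` returns the whole file as one string, while `readlines()` returns a list with one string per line.

```python
import os
import tempfile

# Hypothetical two-line sample: one "sentence" wrapped over a line break.
sample = "- first part of a sentence\nthat continues on the next line. -\n"

# Write a throwaway file so the example does not depend on any real file.
handle, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "w") as f:
    f.write(sample)

with open(path) as f:
    whole = f.read()        # a single string, newlines included

with open(path) as f:
    lines = f.readlines()   # a list of strings, each ending in "\n"

os.remove(path)

print(repr(whole))
print(lines)
```

Joining the list with `"".join(lines)` gives back the same string as `read()`, so for a regex search over the whole text a plain `read()` is enough.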

    Regards,
    Dariyoosh
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Originally Posted by dariyoosh
    Ok, looking at your sample, as b49P23TIvg said, findall() can do the job. Yet there is something that I think has to be clear before deciding how to read the file (I mean reading it as one whole string with read(), or line by line with readlines() or a loop over the file).

    Can the following happen in your file? For example:

    Code:
    " - Beginning of the searched string ....... \n
        The rest of the searched string ................. ? -
    That is, can we have a matched string which is written on several lines? because if it is the case, this can have an impact on how tokens are detected based on the regular expression and more importantly based on how the file is read.

    My question may seem trivial, but as I didn't understand the content (language) of your text, I thought it might be important to point out.

    Regards,
    Dariyoosh
    Hi,
    I used findall() in the first place and b49P23TIvg suggested using the 'for' loop. But both versions of the code returned exactly the same thing, which makes no sense to me.

    Re your sample above, my text sample is not formatted to show newline \n. As it is normal for the sentences in the coherent text, the sentences carry over to a new line if they are long enough. But I'm not sure if this is what you are asking.....

    Thank you.
    Gabriele
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2007
    Location
    Joensuu, Finland
    Posts
    431
    Rep Power
    67
    Must be encoding-related because I got it to work nicely in IDLE using either pattern. b49P23TIvg’s seems to work better. Python 2 again?

    Code:
    >>> p = re.compile(r'-[^-!.?]*[-!.?]')
    >>> p.findall(s)
    ['- Dilze, -', '- Dilze!', '- dar vieną kartą.', '- Einu, einu, -', '- Aš čia.', '- pasikaišė sijoną ir ėmė kopti laiptais, visai užstodama tą pilką šviesą.', '- Padėkit ją antžemės ir grįžkite į lovą.']
    >>> q = re.compile(r'\-(.*?)-|-(.*?).*')
    >>> q.findall(s)
    [(' Dilze, ', ''), ('', ''), ('', ''), (' Einu, einu, ', ''), (' Aš čia. Pripilsiu ją, kai tik vanduo sušils, ', ''), ('', '')]
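To illustrate the encoding suspicion, here is a minimal Python 3 sketch. It assumes the file was saved as UTF-16 (little-endian), which is what a "Unicode" save option in some Windows editors produces; the thread does not confirm this. The sample string is a shortened, made-up variant of the text in the thread. Searching the undecoded bytes leaves a `\x00` next to every ASCII character, much like the output reported earlier; decoding first gives clean matches.

```python
import re

# Shortened sample with two Lithuanian letters (U+0161, U+0117).
text = u"- Dilze, - \u0161auk\u0117 ji. - Dilze!"

# Assumption: the file on disk holds UTF-16 little-endian bytes.
raw = text.encode("utf-16-le")

# Searching the raw bytes: every ASCII character drags a NUL byte along.
garbled = re.findall(br'-[^-!.?]*[-!.?]', raw)

# Decoding to text first, then searching, gives the expected sentences.
clean = re.findall(r'-[^-!.?]*[-!.?]', raw.decode("utf-16-le"))

print(garbled[0])
print(clean)
```

When reading from a file, `codecs.open(filename, "r", encoding="utf-16")` performs the decoding step while reading; `codecs.open` is available in Python 2 as well.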

    Comments on this post

    • b49P23TIvg agrees : Yes, I suspect there's a different version of python running in "the sandbox".
    My armada: openSUSE 13.1 (home desktop, home laptop), Crunchbang Linux 11 (mini laptop, work laptop), Android 4.2.1 (tablet)
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Originally Posted by SuperOscar
    Must be encoding-related because I got it to work nicely in IDLE using either pattern. b49P23TIvg’s seems to work better. Python 2 again?

    Code:
    >>> p = re.compile(r'-[^-!.?]*[-!.?]')
    >>> p.findall(s)
    ['- Dilze, -', '- Dilze!', '- dar vieną kartą.', '- Einu, einu, -', '- Aš čia.', '- pasikaišė sijoną ir ėmė kopti laiptais, visai užstodama tą pilką šviesą.', '- Padėkit ją antžemės ir grįžkite į lovą.']
    >>> q = re.compile(r'\-(.*?)-|-(.*?).*')
    >>> q.findall(s)
    [(' Dilze, ', ''), ('', ''), ('', ''), (' Einu, einu, ', ''), (' Aš čia. Pripilsiu ją, kai tik vanduo sušils, ', ''), ('', '')]
    Hi, SuperOscar,

    yes, it's Python 2. Is it that bad?

    What would the whole code look like? Where do you define (s)? Sorry, this must be a silly question to you, but I'm from linguistics and just learning a bit of programming...

    Thanks,
    Gabriele
  18. #10
  19. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Well, maybe I didn't understand the problem correctly. But I still believe that if a searched string is too long and therefore it is written in several (at least two) lines, then you will have a problem and you cannot retrieve the list of the matched tokens simply by calling findall(). Here is how you defined your criteria in order to detect the specific statements in your file.
    Originally Posted by gabrielemucho
    ... the dash (-) always delimits the beginning of the string I want to capture, and the ending could be (,.?!) followed by another dash (-) or whitespace ...
    Consider the following example as input file:
    Code:
    ZZZZZZZZZZZZZ-search1_1st_part_
    search1_2nd_part!- ZZZZZZZZZZZZZZZZZZZZ
    ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    -search_2. ZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    ZZZZZZZZZ-search_3, ZZZZZZ-search_4? Z
    According to your specification, I'm supposed to have a list including everything which is not "Z". Here is my test script:
    Code:
    import re
    
    def main():
        inputFile = open("testdata2.txt", "r")
        stringList = inputFile.readlines()
        searchedText = "".join(stringList)
        prog = re.compile("-[^ ]*[.?!,][- ]")
        tokens = prog.findall(searchedText)
        for token in tokens:
            print(token)
        inputFile.close()
               
    main()
    And the output is:
    Code:
    -search1_1st_part_
    search1_2nd_part!-
    -search_2.
    -search_3,
    -search_4?
    As you can see, "search1" is printed on two lines: the class [^ ] also matches the newline character, so the token is found as one match but still contains the line break from the file, which may not be how you wish to arrange your statements. So I think the solution is to remove the newline characters.
    Code:
    import re
    
    def main():
        inputFile = open("testdata2.txt", "r")
        stringList = inputFile.readlines()
        searchedText = "".join(stringList)
        
        # So here you get rid of new line characters
        searchedText = searchedText.replace("\n", "")
        
        prog = re.compile("-[^ ]*[.?!,][- ]")
        tokens = prog.findall(searchedText)
        for token in tokens:
            print(token)
        inputFile.close()
               
    main()
    And now it gives you the desired result:
    Code:
    -search1_1st_part_search1_2nd_part!-
    -search_2.
    -search_3,
    -search_4?
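One caveat for real prose, sketched below for Python 3 with a made-up, ASCII-only sample: deleting `"\n"` outright glues together the last word of one line and the first word of the next. Replacing the newline with a space is one possible alternative. The pattern used here is the one b49P23TIvg suggested; its negated class already matches across line breaks.

```python
import re

# Hypothetical wrapped sentence: the line break falls between two words.
wrapped = "- Pripilsiu ja, kai tik\nvanduo susils. - As cia."
pat = re.compile(r"-[^-!.?]*[-!.?]")

as_is = pat.findall(wrapped)                      # newline kept in the match
glued = pat.findall(wrapped.replace("\n", ""))    # words run together
spaced = pat.findall(wrapped.replace("\n", " "))  # words stay separated

print(repr(as_is[0]))   # '- Pripilsiu ja, kai tik\nvanduo susils.'
print(glued[0])         # - Pripilsiu ja, kai tikvanduo susils.
print(spaced[0])        # - Pripilsiu ja, kai tik vanduo susils.
```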

    Regards,
    Dariyoosh
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Originally Posted by dariyoosh
    . . . So I think the solution is to remove new line characters . . .

    Hi all,

    thanks for numerous efforts to solve the problem, but...

    I'm really thick in these matters, so I copied your code faithfully, and it returns literally nothing.

    Does it matter that I use Python 2.5.5?

    G
  22. #12
  23. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    OK, Dariyoosh, for your code it says 'tokens' are not defined.

    G
  24. #13
  25. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Originally Posted by gabrielemucho

    OK, Dariyoosh, for your code it says 'tokens' are not defined.

    G
    I tested the whole program (I used Python 3 for it) in the Windows command line (I mean a Python session in Windows CMD) before posting it here, and I've just done the same test again. The result was what I wrote in my previous comment, so maybe you made a mistake while copying the code.

    Regards,
    Dariyoosh
  26. #14
  27. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    15
    Rep Power
    0
    Originally Posted by dariyoosh
    I tested the whole program (I used Python 3 for it) in the Windows command line (I mean a Python session in Windows CMD) before posting it here, and I've just done the same test again. The result was what I wrote in my previous comment, so maybe you made a mistake while copying the code.

    Regards,
    Dariyoosh

    I copied the code correctly, and it does not work. But I think I've pinned down the problem. Simple code like this should more or less work:

    import re

    filename = r"C:\Users\XX\LITH.txt"
    fileobj = open(filename, "r")
    ptext=fileobj.read()
    fileobj.close()
    pat = re.compile(r'-[^-!.?]*[-!.?]')
    m = re.findall(pat, ptext)

    print (m)

    The problem seems to be the txt and the language encoding. My sample was saved in Unicode, in which case the parsing result was nonsense.

    I re-saved the sample text in UTF-8 and the result is this:

    ['- Dilze, -', '- Dilze!', '- dar vien\xc4\x85 kart\xc4\x85.', '- Einu, einu, -', '- A\xc5\xa1 \xc4\x8dia.', '- pasikai\xc5\xa1\xc4\x97 sijon\xc4\x85 ir \xc4\x97m\xc4\x97 kopti laiptais, visai u\xc5\xbestodama t\xc4\x85 pilk\xc4\x85 \xc5\xa1vies\xc4\x85.', '- Pad\xc4\x97kit j\xc4\x85 ant\xc5\xbeem\xc4\x97s ir gr\xc4\xaf\xc5\xbekite \xc4\xaf lov\xc4\x85

    The \x.. codes stand for the letters with diacritics such as ž, š, ū etc.

    So the code should be working but the language (of the text) does not. What can be done? UTF-8 or Unicode usually works for Lithuanian...
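A short Python 3 sketch of what those escapes are, using one item from the output above. The bytes are UTF-8: each accented letter takes two bytes, and a printed list shows them as `\x..` escapes even though the data itself is intact. Decoding the bytes turns them back into characters.

```python
# One match copied from the output above, as raw UTF-8 bytes.
raw = b"- dar vien\xc4\x85 kart\xc4\x85."

# Decoding turns each two-byte sequence into a single character.
text = raw.decode("utf-8")

print(len(raw), len(text))  # 20 18
print(text)
```

In Python 2, printing a list shows the escaped form of each string, while printing the items one by one (as in the `for` loop suggested earlier in the thread) sends the bytes themselves to the console. Whether they then display correctly depends on the console's own encoding.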

    Thank you,
    Gabriele
  28. #15
  29. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Originally Posted by gabrielemucho
    . . . I copied the code correctly, and it does not work . . . The problem seems to be the txt and the language encoding. My sample was saved in Unicode, in which case the parsing result was nonsense . . . So the code should be working but the language (of the text) does not. What can be done? UTF-8 or Unicode usually works for Lithuanian . . .
    Instead of printing the tokens directly in the console, you can write them into a file, try the following:
    Code:
    import re
    
    def main():
        with open("testdata2.txt", "r") as inputFile, open("output.txt", "w") as outputFile:
            stringList = inputFile.readlines()
            searchedText = "".join(stringList)
            searchedText = searchedText.replace("\n", "")
            prog = re.compile("-[^ ]*[.?!,][- ]")
            tokens = prog.findall(searchedText)
            for token in tokens:
                outputFile.write(token + "\n")
               
    main()
    In this example, the file "output.txt" contains the result.
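A variant of the same idea with the encoding stated explicitly on both the input and the output file, sketched for Python 3. The file names, the temporary directory and the sample text are made up so the example runs on its own, and UTF-8 is an assumption about how the input was saved. The pattern is the one b49P23TIvg suggested, since it allows spaces inside a sentence.

```python
import io
import os
import re
import shutil
import tempfile

# Hypothetical sample text and throwaway file locations.
text = u"- Dilze, - \u0161auk\u0117 ji. - Dilze!\n- A\u0161 \u010dia."
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "output.txt")

with io.open(in_path, "w", encoding="utf-8") as f:
    f.write(text)

# Decode while reading, so the regex works on characters, not bytes.
with io.open(in_path, "r", encoding="utf-8") as src:
    tokens = re.findall(r"-[^-!.?]*[-!.?]", src.read())

# Encode while writing, one match per line.
with io.open(out_path, "w", encoding="utf-8") as dst:
    for token in tokens:
        dst.write(token + u"\n")

with io.open(out_path, "r", encoding="utf-8") as f:
    written = f.read()

shutil.rmtree(workdir)
print(written)
```

Opening the resulting file in an editor that reads UTF-8 should then show the Lithuanian letters directly, independent of what the console can display.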

    Regards,
    Dariyoosh
