#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Location
    USA
    Posts
    12
    Rep Power
    0

    Ideas on how to approach translating *.xlsx to *.txt


    I'm using Python2.7.3 to write a program. Basically, I have a *.txt file which is a hand history record of a No Limit Hold'em Heads Up Sit and Go tournament (2 players, 1 table). That *.txt file is from another poker client that I do not use myself, but I have it because it is a common hand history format and is easily pasted into a hand history converter, which outputs BB code that I can then paste into forums and ask questions about the hand and get advice. That *.txt file is my template, so to speak.

    When I get my hand histories, however, they are in the *.xlsx format of an Excel2007 workbook. This *.xlsx format is NOT an industry norm because I've never in my life seen any other poker client use the *.xlsx format for poker hand histories (usually always *.txt or XML).

    Anyway, I have 50,000 *.xlsx hand histories and I need to get them all translated, or whatever the correct term is, into the *.txt format, for which I have about 60 templates to use as a guide to the proper formatting. You see, they must all be formatted correctly before the hand history converter will convert them to BB code for me.

    So, after searching around the internet for a few weeks and not being able to find any relatively easier ways to translate those 50K *.xlsx files to *.txt, I've decided to write my own program for that sole purpose. The program is going to be designed to specifically do just that one thing because I'm not a coding guru (it's been a long time) but I have some knowledge of a few different languages. I chose to learn Python because it's supposed to support both procedural and OOP styles and is not as strongly typed as Java, etc. I've been brushing up to get comfortable with coding again, and messing around with some ideas on how to actually go about writing such a program (the steps involved). That's where my memory is vague, so I wanted to ask some questions from experienced coders who are also up to date on this sort of thing.

    Does the version of Python that I'm using (2.7.3) have standard libraries that are capable enough to read *.xlsx files, or should I go ahead and get the openpyxl library? And, let's say that I've gotten to the point where I can open up a *.xlsx file and read it with the program. When I read that *.xlsx file into memory, will the size that it takes up be somewhat comparable to the file's size on my hard drive? I'm not sure of how it exactly works yet, but can I read in the *.xlsx file and just have all of its data stay in memory, so that I can tinker with it a bit until I get the formatted results into a *.txt file? Or, is it like a fleeting type of datastream, where once I read it in, I'll have to perform some quick functions on the data before the stream is gone? I'm wondering also, with something like the openpyxl library, can I read in rows of data at a time, and maybe even look ahead a few rows and, maybe only read parts of those later rows depending upon which data is contained in them?

    Before I became more familiar with Python, and also poker hand histories, I figured that it might be as relatively simple as having the program convert the *.xlsx files to *.csv files, and them reformat those into *.txt files. But, I'm not so sure that method will be the best way now that learned a bit more about it all. Those are about all of the things that I'm wondering about before I get going. I'm almost ready to start the coding part and instead of just trial and error, I figured that I'd better get some experienced advice on the matter to save me a lot of wasted time about some things I may or may not try. I have some images that I can link to later if anyone is really interested in what the source document and the final document both look like, but I'm just wondering generally about the questions above, because I have several ideas in mind on how to approach the problem but I don't know the specifics of the Python libraries enough to be able to narrow down my options and start off in the right direction.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Location
    /dev/null
    Posts
    163
    Rep Power
    19
    My apologies. I didn't read the entire post. Just read parts of it and though I'm no expert, below are my suggestions:

    Originally Posted by WhoDidWhat
    I chose to learn Python because it's supposed to support both procedural and OOP styles and is not as strongly typed as Java, etc.
    Python is a great language. For this task (reading an excel file), I would prefer perl over python. Primarily because, I've never used python to read excel files before. And secondly, I feel perl is more suited for this kind of data munching job (less coding too, since there's more than one way to do it in perl )

    Originally Posted by WhoDidWhat
    Does the version of Python that I'm using (2.7.3) have standard libraries that are capable enough to read *.xlsx files, or should I go ahead and get the openpyxl library?
    I don't think there are any standard modules for this. You may have to depend on 3rd party libraries like openpyxl.

    Originally Posted by WhoDidWhat
    I'm not sure of how it exactly works yet, but can I read in the *.xlsx file and just have all of its data stay in memory
    Not necessarily. Filehandling methods are pretty efficient. They read chunks of data at once, so that you have a good compromise on speed and memory utilisation.

    Originally Posted by WhoDidWhat
    so that I can tinker with it a bit until I get the formatted results into a *.txt file? Or, is it like a fleeting type of datastream, where once I read it in, I'll have to perform some quick functions on the data before the stream is gone?
    If it's plain conversion of xlsx to txt, you might not have a lot to tinker, do you? Usually, it's good to read data, perform some action, store the result and then move ahead with the next set of data, rather than jumping around seeking data from fag end of file to the beginning.

    Originally Posted by WhoDidWhat
    I'm wondering also, with something like the openpyxl library, can I read in rows of data at a time, and maybe even look ahead a few rows and, maybe only read parts of those later rows depending upon which data is contained in them?
    Again, if it's plain conversion of xlsx to txt, you might not need all such features. Depends on how you want it sorted out. Read a bit more about the library and you'll figure out it's features.

    Let us know your findings and what you could brew
    Last edited by noobie1000; October 16th, 2013 at 05:29 AM.
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,995
    Rep Power
    481
    Isn't xlsx an xml format? Use expat or other xml library.
    [code]Code tags[/code] are essential for python code and Makefiles!
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Location
    USA
    Posts
    12
    Rep Power
    0

    RE: Ideas on how to approach translating *.xlsx to *.txt


    I was going to post a few image links to the source and final output file for a better idea of what I'm working with, but I'm not allowed to post hyperlinks or URL's yet. Having just installed Python about 2 days ago, I was dreading the thought of having to try and brute force my way through learning another OOP language. So far though, it's been easy. I've only been coding small stuff so far, to get used to the new syntax and Object Model; OOP stumped me the last time around. With Python, it feels almost as easy to work with as Qbasic but with most of the power and without the strictness of Java. But I'm not that far along yet, so time will tell if I'm up to the task or not. Mostly, I think it'll be a large amount of relatively smaller tasks. So, if I can maybe get a handle on correctly organizing the smaller tasks into objects, to keep the code more compact and easier for me to think about as a whole, then I might actually get to the finish line.

    The "Translator" prog that I'll be messing with will just have the sole task (to keep it KISS for me) of translating this EDIT1 hand history (*.xlsx) into that EDIT2 hand history format (*.txt). Unless a person is used to looking at hand history files, the above two images may be confusing to look at without some sort of context. Basically, they are two different representations of this hand EDIT4 , which is how my other hand histories will appear on poker forums once they are also translated into the *.txt format.
    Since my forum status doesn't allow the posting of URL's yet - "EDIT1", "EDIT2", and "EDIT4" just above correspond to the 1st, 2nd and 4th links (4th post down) in the thread that I made in twoplustwo's "programming" sub-forum entitled "Coding a hand history translator - getting started - seeking experienced advice ???"
    I don't expect anyone to go through the trouble but the really curious can find them by:
    > Open a new web browser tab or window.
    > Type in = forumserverDOTtwoplustwoDOTcom.
    > Scroll down that page until you see the "Programming" sub-forum link in the light-grayish area; on the left side of the page in the 14th bolded category down, called "Other Topics" (you may have to "expand" the forum menu to see it this way).
    > After entering the "Programming" sub-forum, my thread is about the 12th one down, called "Coding a hand history translator - getting started - seeking experienced advice ???".
    > Once in that thread, the links are in the 4th post from the top.

    "Python is a great language. For this task (reading an excel file)..." - Yeah, it has been easier than I expected to get into programming again. If I get stuck too much, later, I might have a look at Perl. Before the OP, I wasn't sure if the CSV module would cut it or not. After the OP, though, I was still too anxious for sleep so I went ahead and got the OpenPyXL library. I'm just going through some basic "by experience" tutorials with it for the time being, to get used to it and also give me some ideas on how I might specifically go about tackling the problem. It is pretty handy to have from what I've seen in the docs and what I've experienced so far.

    "If it's plain conversion of xlsx to txt, you might" - At first, I was thinking that it would be just a matter of: *.xlsx to *.txt conversion; stripping out irrelevant data from the resulting converted *.txt; and then inserting some explicit data (or data that remains constant throughout all hand histories) into the correct places in the converted *.txt. I'm still keeping this approach as a possible option. Maybe I'm just not able to see how I can express this in code yet. It still seems like a logical approach to me but the fact that all of the *.xlsx hand histories are dynamic stumps me as of right now [ie- # of betting rounds, length of hands, etc]. I'm browsing the originals now, and noting which data is both consistent and also inconsistent throughout all files. Any "rules" that I can find, I note them for later use in the code if they lend themselves to it.
    "Usually, it's good to read data, perform some action, store the result and then move ahead with the next set of data, rather than jumping around seeking data" - Whenever I put my thinking cap on, I keep returning to this line of thought. The *.txt format, which will be the final output of the program, I have been going through that and sort of making my own documentation on it, 'cause I've not ever seen a formal one anywhere. I don't know what the correct term for it would be. I call it an "abstract" of what it contains. All of the data in a final *.txt file is just a bunch of fields of info, each separated by one or more spaces. Then when I break that view down further, each field holds its data in 1 to 3 ways: either explicit data that is constant throughout all hands; variable data that is liable to change it's value, be present, or not be present throughout all hands; or a combination of explicit and variable data. Using the "usual" process that you refer to above of "read data, perform an action, store the result", maybe I should try something like making a first pass at reading an *.xlsx file (the largest is only 14Kb) and writing all of the "explicit" data in it to the beginnings of a final *.txt file. Then make one or several more passes at reading the *.xlsx file and performing algorithms on certain cells to get at the "variable" data or the extent to which the "variable" data occurs (how many rows or cells within a row are variable, etc). I don't know... it's kind of "you have to be there" to see what I'm talking about because I do a poor job of explaining it in writing.

    "Isn't xlsx an xml format?" - I never knew that. I'm going to look more into the expat or other xml libraries. Which reminds me, I still want to do "save as" on one of those original *.xlsx files to several other formats to see if any of those may be easier for me to think about and work with. But unless they are dramatically easier to work with over the original format, I'll probably stick with it so that I'm less prone to make mistakes that end up in the final *.txt. If it ends up that the final hands have the same data integrity as the originals, then I can also import them into a poker database (HUD) and analyze my play that way too. But the major benefit for me will just be able to review them quickly in a replayer and note such things as which betting actions particular opponents make with certain starting hands, etc. I'm actually surprised that there's not a piece of software out there yet (open source or commercial) that does what I'm trying to do, given that American players only have a handful of options on where to play online poker.

    When I first decided to give this a try and start coding again, I asked about it in the "programming" sub-forum of a poker forum. And, there was some mention of ANTLR, parsers, building grammars, trees, and things that I wasn't really all that familiar with. As of now, if I'm understanding correctly, about all I'll absolutely need to get the job done is Python, Openpyxl (or other related library), the source files (*.xlsx), and the destination files (*.txt). Is that pretty reasonable to assume on my part, or should I be prepared to definitely have to learn other technologies (languages) also???
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Location
    USA
    Posts
    12
    Rep Power
    0

    Bump - related to the above post


    Code:
    1)  OriginalFile = open("C:\\Current\\MY_PYTHON\\BOL_HH_Translator\\TestFile.txt","r")
    
    2)  CopyFile = open("C:\\Current\\MY_PYTHON\\BOL_HH_Translator\\TestFile2.txt","w")
    
    3)  OriginalFileContents = OriginalFile.read()
    
    4)  OriginalFile.read()
    
    5)  print(OriginalFileContents)
    
    6)  CopyFile.write(OriginalFileContents)    # This line was originally "CopyFile.write(OriginalFile.read())"
    
    7)  OriginalFile.close()
    
    8)  CopyFile.close()
    I'm wondering a few things about the above 8 lines of Python code. First off, excuse anything in the following text that doesn't quite make sense or "add up" exactly. The reason that what I'm writing may not be %100 accurate is because the code above went through a few changes: mostly adding a line or two and changing the sequence of a line or two. I have about 20 lines of comments in my copy of the code that I use to follow along. But, now that I finally get a moment to post my questions, I've lost track a bit of the order in which I made certain changes, etc. I am just noodling around, trying stuff out, and trying to get more familiar with OOP concepts before I dive into bigger tasks. The above code does what I expect it to now, so I'm posting to mainly get a handle on any misconceptions that I might've had before about some OOP terminology/concepts and why I was getting an "expected a character buffer object" error before I fixed it.

    A Python program containing the above code will: open a file named TestFile.txt and create a new file named TestFile2.txt; it will also read the contents of TestFile.txt, print those contents to the screen in the Python Shell, and write those contents to the new file that it creates named TestFile2.txt. Originally, the code looked like:

    Code:
    1)  OriginalFile = open("C:\\Current\\MY_PYTHON\\BOL_HH_Translator\\TestFile.txt","r")
    
    2)  CopyFile = open("C:\\Current\\MY_PYTHON\\BOL_HH_Translator\\TestFile2.txt","w")
    
    4)  OriginalFile.read()
    
    6)  CopyFile.write(OriginalFile.read())
    
    7)  OriginalFile.close()
    
    8)  CopyFile.close()
    But, it was throwing a TypeError: expected a character buffer object, for line 6. Was this maybe because after OriginalFile was read, it was automatically closing and then when I tried to write its' contents to CopyFile, the contents weren't available? So I needed to maybe: create a character buffer object to keep read() results in memory; then read OriginalFile; then put the read contents into the character buffer object so that when OriginalFile closes, I'll still have its' contents available to write to another file ?

    Also, I'm not sure - The second after "OriginalFile.read()" executes, do the file's contents stay in memory until I explicitly use close(), or are my last two close() statements redundant?

    Next to lastly, to get me thinking more about OOP concepts, I'm wondering about the group of code at the top of the post. Of the 5 fundamental OOP concepts of: Inheritance, Polymorphism, Abstraction, Cohesion & Coupling, and Encapsulation... Which of those OOP concepts are demonstrated in the top most code?

    And lastly, I ordered the 5 OOP concepts above the way that I did just because it's easiest for me to recall them by remembering the acronymn "IPACE" and then recalling what each letter stands for. But, let's say for someone who's hardly had any OOP theory, what order should the concepts be in; as far as their ability to be understood, noticed in code, and implemented when planning a medium sized program? I'd think that it would be something like: Inheritance, Encapsulation, Abstraction, Cohesion & Coupling, and then Polymorphism, but I'm not sure.
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,995
    Rep Power
    481
    The xlsx format seems to be a zip archive of xml files.

    $ file ~/Documents/a.xlsx
    /home/me/Documents/a.xlsx: Zip archive data, at least v2.0 to extract


    I suggest that you look at your data.

    unzip one of the .xlsx archives. Examine the resulting .xml files with a text editor (this is trivial using emacs). Using LibreCalc I made an xlsx file from a spreadsheet having a bit of pi and some text in other cells. The file xl/worksheets/sheet1.xml of the zip archive contains ...<v>3.14159</v>... and I can't begin to find the text. I suppose it went to unicode hell.

    You might be able to make sense of your data. You might need a third party library.


    Three hours elapsed. I'm now drunk. The lady in the downstairs apartment locked herself out. I let her in and she fed me alcohol and I no longer know anything about this thread. Given that I might have already written something useful, I press submit.
    [code]Code tags[/code] are essential for python code and Makefiles!
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Location
    USA
    Posts
    12
    Rep Power
    0

    wonder wtf is up with this Listbox...


    Hey b49P23TIvg, was your lady friend able to get back into her apartment or did she happen to maybe stay the night at your place?

    From several other online sources, I have read something similar to what you mentioned, about the xlsx format being based on XML.

    I put some problem code below, under the "----..." separator. But, before I posted this whole thing, I read your post above and figured that I'd have a look at what kind of XML my *.xlsx files produced when doing a "save as" to an *.xml format. I wasn't able to save to an xml format, though - I get an error window saying, "Cannot save XML data because the workbook does not contain any XML mappings...." I clicked on "help" for more info, but it just directed me to Help&Support and I didn't feel up to wading through all of that stuff at this hour (maybe later).

    Having already unzipped the archive (WinRAR), the *.xlsx files that are in it are Excel2007 workbooks. I tried to open one of the files in Notepad but all I was able to get out of it was some type of machine code, similar to:
    "+-dybaZ;,^i:kK„&:%$ l!Zaa9%?%Rl%&LuZO cȾ  PK   ! \..."
    I might try to do something later, like you mentioned, where I'll see if the *.xlsx archive lets me unzip it to *.xml format. Maybe that'll produce hand history files that're able to be more seamlessly translated into the text format.


    -----------------------------------------------------------------


    I'm still learning the nuances of the Python language, so the answer is probably obvious but I am stumped as to why I cannot populate the simple Listbox in the code below with each filename in the current working directory? When I type:

    for fname in Contents:
    print fname

    ... in at the >>> prompt and [enter], the screen displays a list of the current working directory's contents, similar to:

    husng_SMDMF111_118833298-1180837149 (1).xlsx
    husng_SMDMF111_118833298-1180837149 (10).xlsx
    husng_SMDMF111_118833298-1180837149 (11).xlsx ...

    But I can't seem to figure out why I'm not able to get the same output in the HhtListbox, in the code below? A listbox appears when I run the prog, but it's empty. Maybe it's just time to quit looking at the screen for a bit and try again later today. If anyone sees where I'm messing up at, let me know.


    Code:
    import sys
    import os
    from Tkinter import *    # I've read that it's better to do: import Tkinter as tk, but the DOCs have "from Tkinter import *".
    
    Hht = Tk()
    Hht.geometry("999x650+30+20")
    Hht.title("BOL Hand History Translator")
    S1 = "    Translate your BOL hand histories from their native *.xlsx format into a more traditional *.txt format.  "
    S2 = "The *.txt HH's are then compatible with various poker databases, forums, and hand replayers."+"\n"+"\n"
    S3 = "    CHECK AND MAKE SURE THAT:  1) The current working directory is the correct folder containing all of the BOL hand "
    S4 = "history files that you wish to translate into *.txt form, AND  2) Also that the contents of this folder contain the correct 
    
    *.xlsx "
    S5 = "files that you wish to translate."+"\n"+"\n"
    S6 = "    NOTE - this program will ONLY attempt to translate any *.xlsx BOL hand histories that "
    S7 = "are present in the current working directory.  However, as a precaution, you should remove any unrelated files or subfolders "
    S8 = "from this directory.  At least until you're finished using this program and have removed it from any sensitive directory.  "
    S9 = "Now, run this thing and then go review some hands.  You might learn something........................ENJOY!"
    HhtMessage = Message(Hht, text=S1+S2+S3+S4+S5+S6+S7+S8+S9, relief=RAISED, aspect=300, padx=20, pady=20)
    HhtMessage.pack()
    
    HhtLabel1 = Label(Hht, text="The current working directory, from which this program instance is running, is:", relief=RAISED, padx=5, 
    
    pady=5)
    HhtLabel1.pack()
    PathName = os.getcwd()
    HhtLabel2 = Label(Hht, text=PathName, relief=RAISED, padx=5, pady=5, fg="blue")
    HhtLabel2.pack()
    HhtLabel3 = Label(Hht, text="The contents of this directory are:", relief=RAISED, padx=5, pady=5)
    HhtLabel3.pack()
    
    Contents = os.listdir(PathName)
    
    HhtListbox = Listbox(Hht, selectmode=SINGLE, fg="blue", width=100, height=12)    # 999 : 794 should be the correct aspect ratio of 
    
    widths but the Message2 window width doesn't change much when I alter the size...
    
    i = 0            #### WTF?  From the prompt, I can print each fname in Contents fine, but I can't populate this Listbox ??? ####
    for fname in Contents:
        HhtListbox.insert(i, fname)
        i = i + 1
    
    #for fname in Contents:
    #    HhtListbox.insert(0, fname)
    
    #for fname in Contents:
    #    HhtListbox.activate(fname)
    
    #for fname in range(len(Contents)):
    #    HhtListbox.insert(fname, Contents.fname)
    
        ####  search:  os.listdir(), Listbox, for...
    
    HhtListbox.pack()
    HhtScroll = Scrollbar(HhtListbox)
    HhtScroll.pack(side=RIGHT, fill=Y)
    HhtScroll.config(command=HhtListbox.yview)
    HhtListbox.config(yscrollcommand=HhtScroll.set)
    
    Hht.mainloop()

    A pic of what the above code produces can be seen at:

    h t t p : / / i DOT i m g u r DOT c o m / O t B 6 7 5 K DOT p n g

    From the top down, on a window, it's basically just a: Message, Label, Label, Label, and a scrollable Listbox.

    However, I've been stumped for awhile as to why the HhtListbox is not populating with the filenames of the current working directory? Once this little glitch is solved, then I'll concentrate on trying to think up some code that translates the hands from *.xlsx files into a *.txt hand history format. That's when the real head scratching is bound to start.
  14. #8
  15. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,995
    Rep Power
    481
    Hey b49P23TIvg, was your lady friend able to get back into her apartment or did she happen to maybe stay the night at your place?
    She fed me latkes and whiskey with diet coke. I refused pork chops. She told me how grateful she was that I had gotten her seed for her bird feeder and how sad she was about her dog which she found dead on the floor a few days earlier. Then her boyfriend and her daughter's boyfriend both showed up. Have seen just once since then, in passing on the walk.
    [code]Code tags[/code] are essential for python code and Makefiles!
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Location
    USA
    Posts
    12
    Rep Power
    0
    "whiskey with diet coke..."

    I like that part. At first, I was like "uuggghh! diet coke!", but then I figured "Oh well, the whiskey will kill the taste." I probably wouldn't have been able to refuse the pork chops, though. I bet she'll remember your hospitality for a while and you'll likely at least get some more free booze out of it.

    SCRATCH that post that's two above this one ("h t t p : / / i DOT i m g u r DOT c o m / O t B 6 7 5 K DOT p n g").

    I finally found the answer, which wasn't even difficult after I seen it. I was mixing up what I vaguely remembered about how Java and HTML layout stuff, and I also finally found some comprehensive resources online that explain in more detail about how pack() lays out widgets in Python. So, I'm done with that small GUI after I add a button. I'll clean up the appearance of it more after I get the main functionality of the program working. I may have to tie on a serious drunk, though, before I start that. Afterwards, I'll have to abstain for awhile so as not to cloud my concentration.

IMN logo majestic logo threadwatch logo seochat tools logo