#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    8
    Rep Power
    0

    Trying to make a MOBI dictionary


    I'm trying to take some HTML files decompiled from a CHM dictionary and turn them into a dictionary MOBI for my Kindle. I have the main part down (changing the HTML tags to Mobipocket tags), but I've been trying to use regular expressions to make certain strings part italic and part regular text. Specifically, I want to italicize the source language in the example entries while leaving the target language (English) alone.

    I can tell I'm part-way to getting it right, but I'm just too new at regex to get it perfect. I've been trying to figure it out on my own for a long time, but I'm currently stuck.

    Here's an example of what I want to change, highlighted in blue

    Code:
    <br />	<i>du/ad.</i>
    		to tire, exhaust 
    	 <br />
    		to exhaust; 
    		<br />
    			hainbeste kilometro ibilita, gorputza ~a gelditu zitzaion
    					having covered so many kilometres, he became physically exhausted;
    		<br />
    			haurgintzaren nekeak ez du ~
    					the pangs of labour haven't exhausted her
    <br />					
    			<i>(irud.)</i>to overburden; 
    			<br />
    				abailtzen zaituen pisu hori arintzeko
    	in order to lighten the weight that is overburdening you<br >
    	
    		<i>( arbola )</i>
    				to overburden; 
    		<br />
    			sagarrondoa ~a zegoen
    					the apple tree was overburdened with fruit 
    		</idx:entry><hr />
    I want the first line (Basque) to be italicized, while keeping the second line (English) the same.

    I came up with an undoubtedly contrived regex of
    Code:
    <br />\n\t*((.*).*$)\n\t*((.*).*$)\n
    which works for most entries, but also includes false positives, like the part containing "to exhaust". I know I could do the changes in multiple steps, but I am still not skilled enough to do that.

    Also, for some reason, if I debug
    Code:
    <br />\n\t*((.*).*$)\n\t*((.*).*$)\n<br />
    with RegexBuddy on the line starting with "hainbeste kilometro", I get an almost 8 million long line of console output... it keeps recursing for a reason unknown to me.

    So, I'd be very grateful if someone experienced in regular expressions could simplify this for me!
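
    A likely explanation for the runaway debugger output, which the thread never confirms, is backtracking: in ((.*).*$) the two adjacent .* can split a line between them in many ways, and when the trailing <br /> fails to match, the engine retries every combination for both lines. The Python sketch below (Python is only an illustration here, and the sample text is abbreviated from the post) suggests the inner groups are redundant: a single (.*) per line matches the same text with far fewer paths to retry.

```python
import re

# The pattern from the post; MULTILINE lets $ match at each line end.
nested = re.compile(r"<br />\n\t*((.*).*$)\n\t*((.*).*$)\n", re.MULTILINE)
# Assumed simplification: a single greedy group per line.
simple = re.compile(r"<br />\n\t*(.*)\n\t*(.*)\n")

# Abbreviated sample: one Basque line, one English line, then a <br /> line.
sample = (
    "<br />\n"
    "\t\t\tsagarrondoa ~a zegoen\n"
    "\t\t\t\t\tthe apple tree was overburdened with fruit \n"
    "\t\t<br />\n"
)

m_nested = nested.search(sample)
m_simple = simple.search(sample)

# Both patterns match the same span and capture the same two lines.
same_span = m_nested.group(0) == m_simple.group(0)
same_lines = (m_nested.group(1), m_nested.group(3)) == m_simple.groups()
print(same_span, same_lines)
```

    This only removes the redundant nesting; it does not by itself get rid of the false positives such as "to exhaust".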
  2. #2
  3. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    Hi sjheiss!

    Run this exact code. I think it does what you want.

    PHP Code:
    <?php 
    $s = '<br />    <i>du/ad.</i>
            to tire, exhaust 
         <br />
            to exhaust; 
            <br />
                hainbeste kilometro ibilita, gorputza ~a gelditu zitzaion
                        having covered so many kilometres, he became physically exhausted;
            <br />
                haurgintzaren nekeak ez du ~
                        the pangs of labour havent exhausted her
    <br />                    
                <i>(irud.)</i>to overburden; 
                <br />
                    abailtzen zaituen pisu hori arintzeko
        in order to lighten the weight that is overburdening you<br >
        
            <i>( arbola )</i>
                    to overburden; 
            <br />
                sagarrondoa ~a zegoen
                        the apple tree was overburdened with fruit 
            </idx:entry><hr />

    '
    ;
    $pattern='%(?x)
    <br[ ]/>\s*\r\n  # opening <br /> up to the end of its line
    (?!.*<br[ ]/)\s*(.*?)\r\n # first line, which must not contain a <br />
    (?!.*<br[ ]/).*?\r\n # second line, which must not contain a <br />
    %'
    ;
    preg_match_all($pattern, $s, $matches, PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
    $sz=count($matches[1]);
    for ($i = 0; $i < $sz; $i++) {
        echo "<i>" . $matches[1][$i][0] . "</i><br />";
    }
    ?>
    Output:
    hainbeste kilometro ibilita, gorputza ~a gelditu zitzaion
    haurgintzaren nekeak ez du ~
    abailtzen zaituen pisu hori arintzeko
    sagarrondoa ~a zegoen


    Just so you know, in the string I changed "haven't" to "havent" (I could have escaped the quote instead). That shouldn't be a problem for your input, but I needed it in order to use your exact text.

    If you have questions I will explain later (it's 9pm here in NZ).
    But briefly: the regex pattern looks for a <br />, then two lines without a <br />.
    It works, but perhaps by luck, as your input is quite dirty. For instance, there is a <br > at the end of one of your English lines, but luckily it is not a <br />!

    Please let me know how this works for you.

    Wishing you a beautiful day
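
    The PHP script above only prints the matched Basque lines. For readers who want to do the replacement itself, here is a rough Python sketch of the same idea (a <br /> line followed by two lines that contain no <br />). Python is used only for illustration; the name italicize and the exact whitespace handling are assumptions, not something from the thread, and the pattern accepts both \r\n and \n line endings.

```python
import re

# Assumption: roughly the same logic as the PCRE pattern above, rewritten for
# Python's re module and for either \r\n or \n line endings.
PAIR = re.compile(r"""
    (<br[ ]/>[ \t]*\r?\n[ \t]*)   # 1: a <br /> ending its line, plus the next indent
    (?![^\r\n]*<br[ ]/)           # first line must not contain <br />
    (\S[^\r\n]*?)                 # 2: the source-language (Basque) text
    ([ \t]*\r?\n)                 # 3: trailing blanks and the line break
    (?![^\r\n]*<br[ ]/)           # second line must not contain <br /> either
    (?=[^\r\n]*\r?\n)             # ...and must be a complete line
""", re.VERBOSE)


def italicize(html):
    """Wrap the first line of each example pair in <i>...</i>."""
    return PAIR.sub(r"\1<i>\2</i>\3", html)


# Abbreviated sample from the post: "to exhaust" must stay untouched.
sample = (
    "<br />\n"
    "\t\tto exhaust; \n"
    "\t\t<br />\n"
    "\t\t\tsagarrondoa ~a zegoen\n"
    "\t\t\t\t\tthe apple tree was overburdened with fruit \n"
    "\t\t</idx:entry><hr />\n"
)
print(italicize(sample))
```

    On this sample only the "sagarrondoa ~a zegoen" line is wrapped; "to exhaust;" is skipped because the line after it is a <br />. Like the original pattern, this still depends on the input's quirks (for example the <br > that is not a <br />).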
  4. #3
  5. No Profile Picture
    Registered User
    I'd love to thank you, but I'm not sure how I should run that script.

    I have Windows 7 (although I could easily get Ubuntu if I need to), and the only good regex programs I have are RegexBuddy and EditPad Pro (excluding Notepad++, since it's very limited in its scope). As far as I know, neither of those programs supports running scripts. What would you recommend I use for your script? Thanks.

    And sorry about the unclean code, I'm still in the process of converting the HTML to work with Mobipocket.
  6. #4
  7. Turn left at the third duck
    Hi, more detailed reply later, but in the meantime just paste this into the RegexBuddy expression box (as is):

    Code:
    (?x)
    <br[ ]/>\s*\r\n  # opening <br /> up to the end of its line
    (?!.*<br[ ]/)\s*(.*?)\r\n # first line, which must not contain a <br />
    (?!.*<br[ ]/).*?\r\n # second line, which must not contain a <br />
    It's the exact pattern from the code I sent you. Then paste your test string in RB.
    On the test tab, select List All group matches in Columns.
    Scroll all the way to the right.
    Tada... All the Basque is selected.
  8. #5
  9. Turn left at the third duck
    And the quick reply on how to run the PHP regex code on Windows 7: install XAMPP 1.7.7.
    It's an easy install and will be up in no time. Make sure NOT to choose the "install services" option.
    Then take the code I sent you, save it as "test.php", and move it to c:\xampp\htdocs\
    Then open your browser and go to localhost/test.php.

    Wishing you a fun day. Please report on how the expression I sent you works in RegexBuddy. I'm signing off for the evening but will check on the thread tomorrow.
  10. #6
  11. No Profile Picture
    Registered User
    Thanks a ton! It worked perfectly for those examples, and should work just fine for the rest of the document.

    I have a new question though, also relating to this document. I'm trying to use a regular expression to find and replace a line of the form
    Code:
    \t\t ="font-size:x-small; vertical-align: top; style="bold"">"[0-1][0-9]".\r\n
    so that I can replace it with
    Code:
    \t\t <br />
    \t\t "[0-1][0-9]".
    Note that the blue part ([0-1][0-9]) is not literally in the text; it stands for the number that appears there. The number is anywhere from 4 to 18, and there wouldn't be any false positives.

    I made what I thought would be a good regex, but I get the error "Invalid class typecast" in RegexBuddy. Here's my attempt:
    Code:
    (\t\t="font-size:x-small; vertical-align: top; style="bold"">)(1?[0-9])[.]\r\n
    First, do you know why I would be getting that error? And second, what would be a correct regex for this?

    Many thanks.
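
    No real sample of this input appears in the thread, so the following Python sketch is only a guess at the intended replacement. It assumes the line looks exactly like the model above; the optional space after the two tabs is an assumption (the two snippets in the post differ there), and the names NUMBERED and split_numbered are made up for illustration.

```python
import re

# Hypothetical line shape, copied from the model in the post. The optional
# space after the tabs is an assumption.
NUMBERED = re.compile(
    r'(\t\t) ?="font-size:x-small; vertical-align: top; style="bold"">'
    r'(1?[0-9])\.(\r?\n)'
)


def split_numbered(text):
    # Emit a <br /> line, then the number on its own line with the same indent.
    return NUMBERED.sub(r"\1 <br />\3\1 \2.\3", text)


line = '\t\t ="font-size:x-small; vertical-align: top; style="bold"">12.\r\n'
print(repr(split_numbered(line)))
```

    Lines that do not fit the model are returned unchanged, so the function can be run over the whole file.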
  12. #7
  13. Turn left at the third duck
    Thanks for your message, really glad it works!

    Moving on to your next question.
    I pasted your regex in RB. It does not give me any kind of error. Are you in PCRE mode at the very top?

    Happy to help you with the expression, but I need to see an actual sample of input and output. It's just too abstract for my overloaded little brain to try to understand abstract descriptions of what needs to be done, I need to see it.

    Wishing you a fun day,
    Talk soon.
  14. #8
  15. No Profile Picture
    Registered User
    No, I had it in JGsoft. I switched it and got it to work.

    I have made quite a lot of progress, but it now seems that the next step would take days with regex alone, so I might have to do it with a programming/scripting language, unless it is possible with regex (which I highly doubt). I started over with the formatting so it would look much cleaner, and the next step is converting the entries into the format I showed above.

    Since the entries vary in length and content, there would be many steps involved if I did everything with regex, but I think it would be much easier if I could use a programming language for recursion and other things.

    If I sent you the HTML file and a few examples of what I want the finished product to look like, could you tell me if what I want to do is even viable? The file is only about 31 MB.

    I love doing this stuff and have worked on it almost all day for a few days this week, but I still do not know nearly as much as you seem to!

    I love programming, and once wanted to get into it, but it was supplanted by my interest in learning languages (which is what this project is meant to help me with). Now, though, I may just start getting into it again, since it's so much fun.

    Have a nice night/day!

    P.S. In the meantime you could take a look at the source code for entries on the site I got the entries from, if you'd like. Search "Morris Hiztegia" (no double quotes of course) on Google and just click the first link. You can search "A,a" in the left box and click "Bilatu" for the simplest entry possible, "ababor" for a bit more complex one, and "egin" for probably the most complex entry in the dictionary.

    As you can see, they follow the same general format, although some are obviously longer than others. Thus, I think I need to use a programming/scripting language for entries of varying composition and length.
  16. #9
  17. Turn left at the third duck
    Hi sjheiss,
    Glad it worked once you switched the mode in RB!
    There are a lot of settings on that screen that can affect your match. For instance, "Whole file" vs. "line by line".


    Sorry, I am out of time, won't be able to look at your detailed code.

    Wishing you a fun new year's eve.
  18. #10
  19. No Profile Picture
    Registered User
    OK. I think I'll be able to do it with regex alone.

    Is it possible to take a text like this:

    Code:
    </span><span style="font-size: 8pt; font-family:Arial, Helvetica, sans-serif;font-weight:bold;color:#000000; text-align: left; vertical-align: top">
    	~reko aldetik
    </span><span style="font-size: 8pt; font-family:Arial, Helvetica, sans-serif;color:#000000; text-align: left; vertical-align: top">
    	from portside;
    </span><span style="font-size: 8pt; font-family:Arial, Helvetica, sans-serif;font-weight:bold;color:#000000; text-align: left; vertical-align: top">
    	~rean
    </span><span style="font-size: 8pt; font-family:Arial, Helvetica, sans-serif;color:#000000; text-align: left; vertical-align: top">
    	on portside
    </span>
    and using regular expressions, extract different parts in one step? I've made a regex that can match the first 4 lines in a set, and the second 4 lines in a set. What I want to do is extract the words on the even-numbered lines, without having to do it in many steps. There is a pattern that alternates every 4 lines in my example, and I wish to use this to my advantage to avoid having to use many steps of regular expressions.

    If I am correct (which I actually hope I'm not), this is not possible with regular expressions, as they're too limited in power. Instead, I think I'd either have to find and replace each set of 4 lines manually, or perhaps use a scripting language like PHP.

    So, does anyone know if there's a way to do this so I can be lazy and not have to learn a new programming language?

    Thanks in advance!


    Here is the regex I mentioned that I made:

    Code:
    \t*</span>(\t*<span.*font-weight:bold;color:#000000;.*>
    \t*[A-Za-z0-9 .,;:~]*
    \t*</span><span.*>
    \t*[A-Za-z0-9 .,;:~]*
    \t*</span>)+
    It's not perfect, but it works just fine for what I want it to do.

    Also, is there any shorthand for the symbols/special characters, so I don't have to type them all within the square brackets?
  20. #11
  21. Turn left at the third duck
    "Is there any shorthand for the symbols/special characters, so I don't have to type them all within the square brackets?"
    In your example, the shortcut to [0-9] is \d
    The shortcut to [A-Za-z0-9] is [[:alnum:]],
    which you can combine with your other characters:
    Code:
    [[:alnum:] .,;:~]
    If you allowed underscores, you could use \w as a shortcut to [_A-Za-z0-9], yielding an even shorter:
    Code:
    [\w .,;:~]
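
    A side note for anyone using these shortcuts outside RegexBuddy, sketched in Python purely as an illustration: Python's built-in re module does not support POSIX classes such as [[:alnum:]], so the \w form is the one to use there, with re.ASCII to keep \w limited to [A-Za-z0-9_]. The name TEXT is made up for this example.

```python
import re

# \w with re.ASCII is [A-Za-z0-9_]; the extra characters come from the post.
TEXT = re.compile(r"[\w .,;:~]+", re.ASCII)

# The first two strings are text lines from the thread; the tag is not text.
for line in ["~reko aldetik", "from portside;", "<span>"]:
    print(repr(line), bool(TEXT.fullmatch(line)))
```
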
    About your other question (matching even lines).
    The short answer is Yes, you can match text in even lines, but I'm not sure about the exact details of what you are doing. I know you are working in RB, and here is an example you can adapt. The numbers on all even lines are captured in group 1.

    Paste this in the Subject box:
    Code:
    Hi 11
    Ho 22
    Hi 33
    Ho 44
    Hi 55
    Ho 66
    Paste this in the Pattern box:
    Code:
    [^\r]*\r\n[^\d]*(\d+)(?:\r|$)
    Make sure you are in MATCH mode (at the top left) for now, not REPLACE.
    In the List All Menu, select List All Matches of Group 1, then Update Automatically.
    In the next menu, select Whole File not Line by Line.

    The output:
    Code:
    22
    44
    66
    I hope you're able to adapt this to your needs!
    Wishing you a fun day.
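
    For readers who end up scripting this after all, the same two ideas can be sketched in Python (an illustration only, not code from the thread). The first function reuses the even-line pattern above unchanged. The second is an assumed adaptation to the span markup from post #10: it pairs each bold span with the non-bold span that follows it. The function and pattern names are hypothetical, and the style attributes in the sample are abbreviated.

```python
import re


def even_line_numbers(text):
    """Return the number found on every second line (lines 2, 4, 6, ...)."""
    return re.findall(r"[^\r]*\r\n[^\d]*(\d+)(?:\r|$)", text)


# Assumed adaptation: a bold span holds the Basque phrase, and the next
# non-bold span holds its English gloss.
SPAN_PAIR = re.compile(r"""
    <span[^>]*font-weight:bold[^>]*>\s*       # bold span opens
    ([^<]*?)\s*</span>\s*                     # group 1: Basque text, trimmed
    <span(?![^>]*font-weight:bold)[^>]*>\s*   # following non-bold span opens
    ([^<]*?)\s*</span>                        # group 2: English text, trimmed
""", re.VERBOSE)


def extract_pairs(html):
    """Return a list of (basque, english) tuples."""
    return SPAN_PAIR.findall(html)


subject = "Hi 11\r\nHo 22\r\nHi 33\r\nHo 44\r\nHi 55\r\nHo 66"
print(even_line_numbers(subject))  # ['22', '44', '66']

# Style attributes abbreviated from the post.
html = (
    '</span><span style="font-weight:bold;color:#000000">\n\t~reko aldetik\n'
    '</span><span style="color:#000000">\n\tfrom portside;\n'
    '</span><span style="font-weight:bold;color:#000000">\n\t~rean\n'
    '</span><span style="color:#000000">\n\ton portside\n</span>'
)
print(extract_pairs(html))
```

    Capturing both texts in one pass avoids a separate find-and-replace step for each set of four lines.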
