#1
  1. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Mar 2008
    Posts
    1,928
    Rep Power
    378

    Regex and automating regex


    This starts out as a (probably very simple) regex question, but I ask here because 1. you're all very clever and 2. I'd like to generalise the problem if at all possible...

    So, keeping things simple, I have a string that looks like this
    Code:
    | Project Name :
    
    New Barn
    |
    and I'd like to extract the project name, either as a variable or an array, so for example:

    $project_name = 'New Barn';

    Moving forward, I have a whole bunch of key-value pairs in a string that are delimited in the following fashion:

    | key : value |

    To save on typing, it'd be nice if I could automatically extract all keys and their corresponding values.

    Makes sense? As always any help with any of the above greatly appreciated.
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,002
    Rep Power
    9398
    preg_match_all can find all the things matching a pattern...
    PHP Code:
    preg_match_all('/\|(.*?):(.*?)\|/s', $text, $matches);
    You might want to set one of the PREG flags to get $matches in the format you want. And you'll need to trim() the individual values.
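
A minimal sketch of that suggestion, assuming the input is the single `| key : value |` pair from the first post. `PREG_SET_ORDER` is one of the documented `preg_match_all` flags; the variable name `$pairs` is chosen here purely for illustration.

```php
<?php
// Sample input in the "| key : value |" form from the first post.
$text = "| Project Name :\n\nNew Barn\n|";

// PREG_SET_ORDER groups the captures per match:
// $m[1] is the raw key, $m[2] the raw value.
preg_match_all('/\|(.*?):(.*?)\|/s', $text, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // The captures still carry the surrounding spaces and newlines.
    $pairs[trim($m[1])] = trim($m[2]);
}

print_r($pairs);
```

One thing to watch: this pattern consumes the closing pipe, so in a string such as `| a : b | c : d |`, where neighbouring pairs share a single pipe, only the first pair would be found.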
  4. #3
  5. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    Hi,

    Originally Posted by requinix
    PHP Code:
    preg_match_all('/\|(.*?):(.*?)\|/s', $text, $matches);
    You should generally avoid the "." and non-greedy quantifiers, because they can involve a lot of backtracking. It's better to use explicit patterns. This also makes the regex more readable.

    I'd use this pattern:
    Code:
    /\|([^:]+):([^|]+)\|/

    Comments on this post

    • requinix agrees : shouldn't be any backtracking in this situation, as for readability... eh, to each his own
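
A rough sketch of the negated-character-class approach on a string where neighbouring pairs share one pipe. The lookahead `(?=\|)` and the extra `|` inside the key class are additions made here, not part of the pattern posted above; they are assumptions so that the closing pipe can also open the next pair.

```php
<?php
$text = "| Project Name : New Barn | Scale : AS SHOWN (A1) |";

// Negated classes: the key stops at the first ":", the value at the next "|".
// (?=\|) checks for the closing pipe without consuming it, so the same pipe
// can start the following pair.
preg_match_all('/\|([^:|]+):([^|]+)(?=\|)/', $text, $m);

// $m[1] holds all keys, $m[2] all values (default PREG_PATTERN_ORDER).
$pairs = array_combine(array_map('trim', $m[1]), array_map('trim', $m[2]));

print_r($pairs);
```

Because each class names exactly which characters may not appear, the engine never has to guess where a key or value ends.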
  6. #4
  7. No Profile Picture
    This is brilliant, but highlights the fact that I grossly oversimplified the problem... and perhaps 'backtracking' (is that the same as a 'lookbehind'?) is inevitable.

    A more representative example (and desired output) is provided below. Given the current pattern, I know I can access values with something like $matches[2][0][0], but that seems clumsy and unscalable.

    I should stress that I don't (or at least shouldn't) need someone to just write out the whole thing for me - but another hefty push in the right direction would be HUGELY appreciated, as even with the help of cheatsheets, I find regex arcane in the extreme! :-)

    Code:
    $string="
    [ F | 19/10/2012 | I ]
    
    Some stuff
    some more stuff.
    And more stuff
    
    [ E | 10/10/2012 | C ]
    
    Gas tap location added
    
    [ D | 02/10/2012 | C ]
    
    [ C | 01/06/2012 | T ]
    
    [ A | 08/03/2012 | I ]
    
    | Project Name :
    
    New Barn
    | Title:
    
    GROUND FLOOR
    LIVING ROOM DETAILS I
    | Scale :
    
    AS SHOWN (A1)
    | Status :
    
    CONSTRUCTION
    | Date :
    
    MAR 2012
    | Drg No :
    
    2266 XX 01
    
    F";
    Desired output (could be an array):
    Code:
    Explicit key value pairs
    Project Name:	New Barn
    Title: 		GROUND FLOOR LIVING ROOM DETAILS I
    Scale: 		AS SHOWN (A1)
    Status:		CONSTRUCTION
    Date Created:	MAR 2012
    Drg No:		2266 XX 01
    
    Values with implicit keys:
    Revision:	F 
    Revision Date:	19/10/2012
    Issued for:	I 
    Revision Notes:	Some stuff some more stuff. And more stuff
  8. #5
  9. --
    Well, this looks like a rather complex structure. Where does it come from? Is it a specified data format? Or did somebody just make it up for this project?

    I guess you could parse this with regexes, but it will be complicated and involve some handiwork to pull the data out of the matches. This isn't a simple pattern like the one before.



    Originally Posted by cafelatte
    This is brilliant, but highlights the fact that I grossly oversimplified the problem... and perhaps 'backtracking' (is that the same as a 'lookbehind'?) is inevitable.
    No, those have nothing to do with each other. Backtracking is an issue that happens with certain patterns. When the regex engine cannot match the input but has a chance to adjust the previous matches, it will try out all possibilities. This will slow down the whole process (not necessarily much -- it depends on the concrete case).
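
To make the distinction concrete, here is a small sketch with an assumed one-line sample string. The first pattern uses a lookbehind, which is a piece of syntax you write; the second shows a pattern where the engine has to backtrack, which is behaviour you do not write explicitly.

```php
<?php
$text = "| Scale : AS SHOWN (A1) |";

// Lookbehind: (?<=Scale : ) is a zero-width assertion. It requires "Scale : "
// directly before the match, but the prefix is not part of the result.
preg_match('/(?<=Scale : )[^|]+/', $text, $m);
$value = trim($m[0]);
echo $value, "\n";

// Backtracking: the greedy .* first runs to the end of the string, then gives
// characters back one at a time until the final \| can match.
preg_match('/\|(.*)\|/', $text, $g);
$inner = trim($g[1]);
echo $inner, "\n";
```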
  10. #6
  11. No Profile Picture
    Originally Posted by Jacques1
    Well, this looks like a rather complex structure. Where does it come from? Is it a specified data format?
    I wish! It's scraped from a PDF!

    No, those have nothing to do with each other. Backtracking is an issue that happens with certain patterns. When the regex engine cannot match the input but has a chance to adjust the previous matches, it will try out all possibilities. This will slow down the whole process (not necessarily much -- it depends on the concrete case).
    I see (I think!?!). Thanks for the clarification.

    I'll keep plugging away. It seems doable!
  12. #7
  13. No Profile Picture
    If the above was a well-formed array, I guess it would look like the following (perhaps with a few handles to distinguish between 'revision' data and 'project' data). Perhaps the easier question then is 'how do I get from that to this?':

    Code:
    Array
    (
        [0] => Array  //revision_data
            (
                [0] => Array //revision0
                    (
                        [0] => Array //revision_info
                            (
                                [0] => F
                                [1] => 19/10/2012
                                [2] => I
                            )
    
                        [1] => Array //revision_notes
                            (
                                [0] => Some stuff some more stuff. And more stuff
                            )
    
                    )
    
                [1] => Array  //revision1
                    (
                        [0] => Array //revision_info
                            (
                                [0] => E
                                [1] => 10/10/2012
                                [2] => C
                            )
    
                        [1] => Array //revision_notes
                            (
                                [0] => Gas tap location added
                            )
    
                    )
    
                [2] => Array  //revision2
                    (
                        [0] => Array //revision_info
                            (
                                [0] => D
                                [1] => 02/10/2012
                                [2] => C
                            )
    
                    )
    
                [3] => Array   //revision3
                    (
                        [0] => Array //revision_info
                            (
                                [0] => C
                                [1] => 01/06/2012
                                [2] => T
                            )
    
                    )
    
                [4] => Array  //revision4
                    (
                        [0] => Array //revision_info
                            (
                                [0] => A
                                [1] => 08/03/2012
                                [2] => I
                            )
    
                    )
    
            )
    
        [1] => Array //project_data
            (
                [Project Name] => New Barn
                [Title] => GROUND FLOOR LIVING ROOM DETAILS I
                [Scale] => AS SHOWN (A1)
                [Status] => CONSTRUCTION
                [Date] => MAR 2012
                [Drg No] => 2266 XX 01
                [0] => F
            )
    
    )
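
One possible way to get from the raw string to an array of that shape is to split the text into a revision part and a project part and run one pattern over each. This is only a sketch under several assumptions: revision blocks always come first, revision notes never contain `[`, values never contain `|`, and a blank line inside the last value marks trailing text (the lone `F`). The function names `squash` and `parse_drawing` and the array keys are made up for illustration, and the sample is a shortened version of the string above.

```php
<?php
// Hypothetical helper: collapse every whitespace run to a single space.
function squash(string $s): string
{
    return trim(preg_replace('/\s+/', ' ', $s));
}

// Hypothetical parser for the layout shown in the thread.
function parse_drawing(string $input): array
{
    // Find the first line of the form "| key :"; everything before it is revision data.
    $found = preg_match('/^[ \t]*\|[^:|\n]+:/m', $input, $hit, PREG_OFFSET_CAPTURE);
    $pos = $found ? $hit[0][1] : strlen($input);
    $revisionText = substr($input, 0, $pos);
    $projectText = substr($input, $pos);

    // "[ F | 19/10/2012 | I ]" followed by optional notes up to the next "[".
    $revisions = [];
    preg_match_all('/\[([^|\]]+)\|([^|\]]+)\|([^\]]+)\]([^\[]*)/', $revisionText, $sets, PREG_SET_ORDER);
    foreach ($sets as $s) {
        $revisions[] = [
            'revision'   => trim($s[1]),
            'date'       => trim($s[2]),
            'issued_for' => trim($s[3]),
            'notes'      => squash($s[4]),
        ];
    }

    // "| key : value" where the value runs up to the next pipe or the end.
    $project = [];
    preg_match_all('/\|([^:|]+):([^|]*)/', $projectText, $sets, PREG_SET_ORDER);
    foreach ($sets as $s) {
        // Text after a blank line inside a value is kept under a numeric key.
        $parts = preg_split('/\n\s*\n/', trim($s[2]));
        $project[trim($s[1])] = squash($parts[0]);
        foreach (array_slice($parts, 1) as $extra) {
            $project[] = squash($extra);
        }
    }

    return ['revisions' => $revisions, 'project' => $project];
}

$sample = "
[ F | 19/10/2012 | I ]

Some stuff
some more stuff.
And more stuff

[ E | 10/10/2012 | C ]

Gas tap location added

[ D | 02/10/2012 | C ]

| Project Name :

New Barn
| Title:

GROUND FLOOR
LIVING ROOM DETAILS I
| Drg No :

2266 XX 01

F";

$result = parse_drawing($sample);
print_r($result);
```

Named keys such as `revision` and `notes` are used instead of the nested numeric indexes so that the values can be read without remembering positions. Real PDF output may break any of the assumptions above, so the result is worth checking against a few actual files.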
  14. #8
  15. No Profile Picture
    There's probably a law against publishing code as ugly as this but, for now, this seems to be working, so I'm going to run with it anyway...

    I slightly modified the input string to ease the process...

    Code:
    <?php
    
    $input = "
    [ F | 19/10/2012 | I ]
    
    Some stuff
    some more stuff.
    And more stuff
    
    [ E | 10/10/2012 | C ]
    
    GAS TAP LOCATION ADDED
    
    [ D | 02/10/2012 | C ]
    
    [ C | 01/06/2012 | T ]
    
    [ A | 08/03/2012 | I ]
    
    |
    
    |
    
    |
    
    |
    
    |
    
    | Project Name :
    
    ASGILL LODGE
    | Title :
    
    GROUND FLOOR
    LIVING ROOM DETAILS I
    | Scale :
    
    AS SHOWN (A1)
    | Status :
    
    CONSTRUCTION
    | Date :
    
    MAR 2012
    | Drg No :
    
    2266 XX 01
    
    F
    ";
    $output = $input;
    $output = preg_replace('/\s{2,}/',' ',$output);
    $output = preg_replace('/[\|\s]{5,}/',']],{"',$output);
    $output = preg_replace('/\s\]\s/','"],["',$output);
    $output = preg_replace('/\s\[\s/','"]],[["',$output);
    $output = preg_replace('/\s\|\s/','","',$output);
    $output = preg_replace('/\"\],\[\"\[\s/','"]],[["',$output);
    $output = preg_replace('/\s\]/','"]',$output);
    $output = preg_replace('/\s\:\s/','":"',$output);
    $output = preg_replace('/^\"\]\],/','[[',$output);
    $output = preg_replace('/[\s]$/','"}]',$output);
    $output = preg_replace('/\//','\/',$output);
    echo $output."<br>\n";
    print_r(json_decode($output));
    
    ?>

    Comments on this post

    • Jacques1 agrees : Nice idea
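
Since `json_decode` returns `null` when the hand-built string is not valid JSON, `print_r` would then print nothing and the failure could go unnoticed. A small sketch of how the result might be checked; `decode_or_explain` is a hypothetical helper, and the two input strings are made-up fragments in the style of the output above.

```php
<?php
// Hypothetical helper: decode JSON, or return a readable error message.
function decode_or_explain(string $json)
{
    // true: return associative arrays instead of stdClass objects.
    $data = json_decode($json, true);
    if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
        return 'JSON error: ' . json_last_error_msg();
    }
    return $data;
}

// Valid: slashes escaped as in the last preg_replace step.
$good = decode_or_explain('[["F","19\/10\/2012","I"],{"Scale":"AS SHOWN (A1)"}]');

// Invalid: a raw newline inside a JSON string is not allowed.
$bad = decode_or_explain("[\"Some stuff\nsome more stuff.\"]");

print_r($good);
echo $bad, "\n";
```

The second case is the kind of failure to expect if a single line break survives the whitespace replacements, which may depend on the line endings of the scraped text.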
  16. #9
  17. --
    Turning this into JSON is actually an idea that didn't come to my mind.

    I haven't exactly checked it, but it's probably the best you can get without putting too much effort into this task. You've got "bad" input, anyway.
