#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2007
    Posts
    29
    Rep Power
    0

    Regexp for importing data


    I am working on a personal project which requires a lot of data to be imported, because it is just too much for me to add manually.

    The data is episode guides of TV shows formatted like this:
    1. First Episode
    First aired: 9/30/1982
    Writer: John Smithson, Smith Johnson
    Director: John Johnson
    Guest star: Joe Bloggs
    Global rating: 1.2

    First episode description here. It was interesting.

    2. Second Episode
    First aired: 10/7/1982
    Writer: John Smithson, Smith Johnson
    Director: John Johnson
    Guest star: Joe Bloggs
    Global rating: 1.2

    The description of the second episode here.
    This data is coming from another website, but (having checked their copyright terms) I'm OK to use the data for this personal non-commercial activity.

    I need to extract the name of the episode, the airdate, and the description to put in my personal database. I'm not sure how to do that. There'll be about 24 episodes going in at a time, so it'd have to go through them all.

    I think regexp is what I need, but I'm not sure how to do it. Any help will be much appreciated.
  2. #2
  3. mod_dev_shed
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2002
    Location
    Atlanta, GA
    Posts
    14,817
    Rep Power
    1099
    Do you just need help with the regular expressions or do you need help implementing them with PHP?
    # Jeremy

    Explain your problem instead of asking how to do what you decided was the solution.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2007
    Posts
    29
    Rep Power
    0
    Originally Posted by jharnois
    Do you just need help with the regular expressions or do you need help implementing them with PHP?
    Well, I've used regular expressions in PHP, but I'm not totally confident with them. I definitely need help with the expression, but I might need help implementing it as well. I'm not sure how I'd do it in PHP when expecting multiple occurrences of the same pattern, because I've only used it where there was going to be a single occurrence. These things go over my head a bit.
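
    On the question of multiple occurrences: a minimal sketch, assuming a made-up two-line sample string, of how preg_match() and preg_match_all() differ. The variable names are illustrative only and are not taken from the thread.

```php
<?php
// Hypothetical sample input: two air dates in one string.
$text = "First aired: 9/30/1982\nFirst aired: 10/7/1982\n";
$pattern = '/First aired: (\S+)/';

// preg_match() stops after the first occurrence.
preg_match($pattern, $text, $first);
echo $first[1], "\n";

// preg_match_all() collects every occurrence. PREG_SET_ORDER groups the
// captures per match, so each $m holds one complete hit.
$count = preg_match_all($pattern, $text, $all, PREG_SET_ORDER);
foreach ($all as $m) {
    echo $m[1], "\n";
}
```

    With PREG_SET_ORDER, looping over the result gives one array per episode, which tends to be easier to feed into a database insert than the default column-wise layout.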

  6. #4
  7. mod_dev_shed
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2002
    Location
    Atlanta, GA
    Posts
    14,817
    Rep Power
    1099
    This shouldn't be too difficult as long as the description is one line (technically, not visually), because the contents of the file can be treated as a single string, and the "s" modifier can be used to make . match newline characters. So pretend the file's contents are all on one line and start writing the pattern.

    Note that this may not be the most efficient method.
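
    To illustrate the "s" modifier described above, here is a small sketch using a shortened copy of the sample data from the first post. The lazy quantifier (.*?) is an addition of this sketch, not part of the advice above; it keeps the dot from running on to the last "Global rating" in the string.

```php
<?php
// Sample data from the first post, shortened to one episode.
$text = "1. First Episode\nFirst aired: 9/30/1982\n"
      . "Writer: John Smithson, Smith Johnson\nGlobal rating: 1.2\n\n"
      . "First episode description here. It was interesting.\n";

// Without "s", "." stops at a newline, so the pattern cannot reach
// from the "First aired" line to the "Global rating" line.
$withoutS = preg_match('/First aired: (.*)Global rating/', $text, $m1);

// With "s", "." also matches "\n". The lazy ".*?" stops at the first
// "Global rating" it can reach.
$withS = preg_match('/First aired: (\S+).*?Global rating: (\S+)/s', $text, $m2);

var_dump($withoutS, $withS);
echo $m2[1], ' ', $m2[2], "\n";
```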
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2007
    Posts
    29
    Rep Power
    0
    OK, I've got this. It doesn't work that well: it won't separate the global rating from the episode synopsis, and when I add more than one episode to the string it doesn't respond well.

    Code:
    $Data = '1. First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting.';
    
    if(preg_match_all('#(?<digit>\d+)[^a-zA-Z](.*)First aired:(.*)Writer:(.*)Director:(.*)Guest star:(.*)Global rating:[^0-9](.*)[^a-zA-Z](.*)#', $Data, $Matches))    print_r($Matches);
    The response is:
    Code:
    Array ( [0] => Array ( [0] => 1. First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting. ) [digit] => Array ( [0] => 1 ) [1] => Array ( [0] => 1 ) [2] => Array ( [0] => First Episode ) [3] => Array ( [0] => 9/30/1982 ) [4] => Array ( [0] => John Smithson, Smith Johnson ) [5] => Array ( [0] => John Johnson ) [6] => Array ( [0] => Joe Bloggs ) [7] => Array ( [0] => 1.2 First episode description here. It was interesting ) [8] => Array ( [0] => ) )
    With the string having 2 episodes:
    Code:
    $Data = '1. First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting. 2. Second Episode First aired: 10/06/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 2.1 Second episode description here. It was not interesting.';
    The response is:
    Code:
    Array ( [0] => Array ( [0] => 1. First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting. 2. Second Episode First aired: 10/06/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 2.1 Second episode description here. It was not interesting. ) [digit] => Array ( [0] => 1 ) [1] => Array ( [0] => 1 ) [2] => Array ( [0] => First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting. 2. Second Episode ) [3] => Array ( [0] => 10/06/1982 ) [4] => Array ( [0] => John Smithson, Smith Johnson ) [5] => Array ( [0] => John Johnson ) [6] => Array ( [0] => Joe Bloggs ) [7] => Array ( [0] => 2.1 Second episode description here. It was not interesting ) [8] => Array ( [0] => ) )

    This stuff is really confusing me.
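
    The output above shows each greedy (.*) grabbing as much as it can, which is why the title group swallows the whole first episode once a second one is added. One possible fix, sketched below, is to make the quantifiers lazy (.*?) and end the description with a lookahead for the next "N. " marker or the end of the string. This assumes the flattened single-line format from this post and that no description contains something like " 2. " itself.

```php
<?php
$data = '1. First Episode First aired: 9/30/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 1.2 First episode description here. It was interesting. 2. Second Episode First aired: 10/06/1982 Writer: John Smithson, Smith Johnson Director: John Johnson Guest star: Joe Bloggs Global rating: 2.1 Second episode description here. It was not interesting.';

// Groups: 1 = number, 2 = title, 3 = air date, 4 = rating, 5 = description.
// (?= \d+\. |$) ends the description just before the next episode
// number, or at the end of the string for the last episode.
$pattern = '/(\d+)\. (.*?) First aired: (\S+) Writer: .*? Global rating: (\S+) (.*?)(?= \d+\. |$)/';

$count = preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "title = {$m[2]}\n";
    echo "aired = {$m[3]}\n";
    echo "descr = {$m[5]}\n\n";
}
```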
  10. #6
  11. mod_dev_shed
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2002
    Location
    Atlanta, GA
    Posts
    14,817
    Rep Power
    1099
    Thread moved from PHP to Regex ... I think you'll be fine with the implementation in PHP once you get a good expression, and this forum is best for that.
  12. #7
  13. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    I don't recommend using the s-modifier: the DOT-STAR (.*) is a dangerous thing combined with it. A safer way is to leave it off, so that the DOT-STAR will at most "consume" an entire line, and to use the m-modifier so that ^ and $ match at the start and end of each line. The \s will catch new line characters regardless of whether they're Windows, *nix or MacOS new line characters.
    Here's a possible way:

    [CODE=php]<?php
    $text =<<< BLOCK
    1. First Episode
    First aired: 9/30/1982
    Writer: John Smithson, Smith Johnson
    Director: John Johnson
    Guest star: Joe Bloggs
    Global rating: 1.2

    First episode description here. It was interesting.

    2. Second Episode
    First aired: 10/7/1982
    Writer: John Smithson, Smith Johnson
    Director: John Johnson
    Guest star: Joe Bloggs
    Global rating: 1.2

    The description of the second episode here.
    BLOCK;

    $regex = '/
    ^ (\d+\.\s.*) $ \s
    ^ first\saired:\s(.*) $ \s
    ^ writer:.* $ \s
    ^ director:.* $ \s
    ^ guest\sstar.* $ \s
    ^ global\srating:.* $ \s
    ^ [\t ]* $ \s
    ^ (.*) $
    /imx';

    if (preg_match_all($regex, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            echo "title = {$match[1]}\n";
            echo "aired = {$match[2]}\n";
            echo "descr = {$match[3]}\n";
            echo "\n";
        }
    }
    ?>[/CODE]
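
    A possible follow-up step before the data goes into a database, sketched under a few assumptions: the input may arrive with Windows line endings (which can leave a trailing "\r" in the captured text, depending on how PCRE treats newlines), the dates are month/day/year, and the database wants ISO dates. The shortened pattern and the variable names here are illustrative, not part of the code above.

```php
<?php
// Hypothetical input with Windows line endings ("\r\n").
$raw = "1. First Episode\r\nFirst aired: 9/30/1982\r\n";

// Normalize "\r\n" and lone "\r" to "\n" before matching, so that "$"
// in multiline mode lines up with every line break.
$text = preg_replace('/\r\n?/', "\n", $raw);

// Shortened version of the pattern: number, title, air date.
preg_match('/^(\d+)\.\s(.*)$\s^first\saired:\s(.*)$/im', $text, $m);

// Split the number from the title and convert the date (assumed to be
// month/day/year) into Y-m-d form.
$number = (int) $m[1];
$title  = $m[2];
$date   = DateTime::createFromFormat('!n/j/Y', $m[3]);
$aired  = $date ? $date->format('Y-m-d') : null;

echo "$number | $title | $aired\n";
```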

  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2007
    Posts
    29
    Rep Power
    0
    Thanks for that prometheuzz, you've been incredibly helpful

  16. #9
  17. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by Wambaugh
    Thanks for that prometheuzz, you've been incredibly helpful
    No problem.
