#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Location
    PA, USA
    Posts
    6
    Rep Power
    0

    Regex for mod_rewrite


    So I have a site that was recently translated from ColdFusion to PHP and moved to an Apache server.

    On the CF server, I could specify a 404 handler and the server permitted me to send "200" header response codes. The new server also lets me do that, but sends the 404 codes first!! So many search engines are now NOT indexing these pages. Annoying but unchangeable (as far as I can tell - if someone knows some htaccess hack, let me know). It is on a shared host, so I do have limits as to what I can do on the server itself.

    So now I have a situation where I must use mod_rewrite instead of using the 404 error handler script to send the correct HTML to the user's browser. Previously, I simply broke the URL apart in PHP or ColdFusion into arrays from "/" or "-" delimited strings and created variables then included the appropriate php script to generate the HTML.

    Problem is, I have little to no experience writing regex (for mod_rewrite)!

    If I get the hang of it for one kind of setting, I know I can apply it to others.

    So here is the scenario:

    I have URL requests coming in like:

    /Company-Products-X/string1-Y/string2
    or
    /Company-Products-X/string1-string2-Y/string3
    (X and Y are integers; the last string sometimes contains numbers and hyphens)

    and it needs to be translated in mod_rewrite to something like:
    /CompanyProducts/actualscript.php?prodID=Y&type=X

    Only URLs that have the initial directory of "/Company-Products-X" should be handled in this fashion.

    So I think I need an initial search for that first directory to CONTAIN "Company-Products" and an integer. Then I need to grab the lone integer at the end of the hyphen-delimited string that makes up the second directory and use them to create the target URL.

    The site has been around in the CF server so long that many incoming links are already in place on 3rd party websites. That is why I am bothering to handle this rather than change it to directories that appear to be only integers. I have to handle all the incoming requests from these 3rd party sites.

    Thanks for any help you can provide. Once I get this pattern done, I am sure I can figure out the other rules I will need for other sections of the site that have similar problems.
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,236
    Rep Power
    9400
    By the way, for SEO your code still needs to check that "/Company-Products-X/string1-Y/string2" is the correct URL for the page. (If SEO is relevant.)

    Assuming you will do that, and so don't actually care whether string1 contains hyphens, a rule you could use is
    Code:
    RewriteRule ^/?Company-Products-(\d+)/[^/]+-(\d+)/ CompanyProducts/actualscript.php?prodID=$2&type=$1 [L]
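    To see what the two capture groups pick up without touching the server, the same pattern can be tried in PHP, since mod_rewrite and preg_match both use PCRE-style expressions. This is only a sketch: parse_product_path() is a made-up helper name, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Sketch only: mirrors the RewriteRule pattern so it can be tried outside Apache.
// parse_product_path() is a hypothetical helper, not part of the site.
function parse_product_path(string $path): ?array
{
    // Same pattern as the rule; '#' is the delimiter so '/' needs no escaping.
    $pattern = '#^/?Company-Products-(\d+)/[^/]+-(\d+)/#';
    if (preg_match($pattern, $path, $m) !== 1) {
        return null; // not a /Company-Products-X/... URL
    }
    // First group is X (the type), second group is Y (the product ID).
    return ['type' => (int) $m[1], 'prodID' => (int) $m[2]];
}
```

    Calling it with one of the sample URLs from the first post shows which number lands in which group, which is the quickest way to confirm that $1 and $2 are the right way round in the substitution.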

    Comments on this post

    • jschwarz agrees : It worked!
  4. #3
    Registered User
    Originally Posted by requinix
    By the way, for SEO your code still needs to check that "/Company-Products-X/string1-Y/string2" is the correct URL for the page. (If SEO is relevant.)

    Assuming you will do that, and so don't actually care whether string1 contains hyphens, a rule you could use is
    Code:
    RewriteRule ^/?Company-Products-(\d+)/[^/]+-(\d+)/ CompanyProducts/actualscript.php?prodID=$2&type=$1 [L]
    Thank you for replying. I am not sure if I understand what you mean about checking whether or not the correct URL is being passed. If the parameters are passed and the page is called, and the parameters don't make any sense, the php that is trying to process will display an error. I am not too concerned about that.

    I will try this rule and see how it works! Thanks again!!
  6. #4
    Registered User
    Upon trying this rule, it seems that when the X or Y parameter is greater than 9, it grabs the first digit only. I need it to see the number as an integer and take the entire value...

    like

    /Company-Products-15/string1-46/string2

    pointing at

    /CompanyProducts/actualscript.php?prodID=46&type=15

    thanks again!! I am starting to understand this a bit more.
  8. #5
    Registered User
    scratch that, I just had the $2 and the $1 in the wrong spots! It is working perfectly!!! Thank you very much!!!

    I will take this string and modify it now for other parts of the site that need similar handling. Thanks for getting me started!!
  10. #6
  11. Transforming Moderator
    Originally Posted by jschwarz
    I am not sure if I understand what you mean about checking whether or not the correct URL is being passed. If the parameters are passed and the page is called, and the parameters don't make any sense, the php that is trying to process will display an error. I am not too concerned about that.
    Consider the following two (partial) URLs:
    Code:
    /Company-Products-15/Im-a-little-teapot-46/short-and-stout
    Code:
    /Company-Products-15/Here-is-my-handle-46/here-is-my-spout
    Those will both work, right? The problem is that search engines will look at it and see duplicate content: the URLs are different (they don't know you're rewriting behind the scenes) but the content is the same. You'll be penalized.
    So in actualscript.php you should check that the URL requested (look in $_SERVER) is what it should be. If the correct URL is the... the teapot one ... but someone went to the handle URL instead then you can 301 redirect. No penalty.
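    A minimal sketch of that check, assuming the two slugs can be looked up from the product data for the given IDs. The helper names below are hypothetical, and the type declarations assume PHP 7.1 or later. $_SERVER['REQUEST_URI'] still holds the URL the visitor asked for, even after the internal rewrite.

```php
<?php
// Sketch only. The slug arguments are assumed to come from your own product
// data; all three functions are hypothetical helpers.

function build_canonical_path(int $typeId, string $typeSlug, int $prodId, string $descSlug): string
{
    // Follows the thread's URL shape: /Company-Products-X/string1-Y/string2
    return "/Company-Products-{$typeId}/{$typeSlug}-{$prodId}/{$descSlug}";
}

function canonical_redirect_target(string $requestUri, string $canonicalPath): ?string
{
    // REQUEST_URI may carry a query string; compare the path part only.
    $requestedPath = parse_url($requestUri, PHP_URL_PATH);
    return $requestedPath === $canonicalPath ? null : $canonicalPath;
}

function redirect_if_not_canonical(string $canonicalPath): void
{
    $target = canonical_redirect_target($_SERVER['REQUEST_URI'], $canonicalPath);
    if ($target !== null) {
        header('Location: ' . $target, true, 301); // permanent redirect
        exit;
    }
}
```

    In actualscript.php the idea would be to build the canonical path from prodID and type, then call redirect_if_not_canonical() before any output is sent.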
  12. #7
    Registered User
    Well, I see what you mean... we have never created links in a different fashion than the standard... so unless someone out there has an axe to grind with us, AND knows this SEO issue AND decides to create a slew of pages with altered links, we should be ok, right?

    we ALWAYS format the link as "/Company-Products-X/Product-Type-Y/Product-Description"

    Where "X" is the ID of the product type and "Y" is the ID of the product itself.

    make sense? think we are safe? or should I put the error checking in place to set a no-index flag?
  14. #8
  15. Transforming Moderator
    A rel=canonical would probably be better than a meta noindex.
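    For reference, a rel=canonical is a single link element in the page head that names the preferred URL. A rough sketch of emitting it from PHP follows; canonical_link_tag() is a hypothetical helper and www.example.com is a placeholder host.

```php
<?php
// Sketch only: canonical_link_tag() is a hypothetical helper.
function canonical_link_tag(string $canonicalUrl): string
{
    // Escape the URL so characters such as & or " cannot break the attribute.
    $href = htmlspecialchars($canonicalUrl, ENT_QUOTES, 'UTF-8');
    return '<link rel="canonical" href="' . $href . '">';
}

// Example call inside the <head> section of the page (placeholder host):
// echo canonical_link_tag('https://www.example.com/Company-Products-15/Im-a-little-teapot-46/short-and-stout');
```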

    Bots are either stupid or clever, I'm not sure which. At my job we regularly see bots hitting completely nonsensical URLs, but they put the nonsense in a place that's already somewhat arbitrary. For example a page to browse a listing of things: our URL looks like /browse/(category)/(letter), and though the "browse" is consistent, the "category" or "letter" may be gibberish.

    Now, them indexing your site (if that's what they're doing) may not be important, but if you serve them 301s or, better, 404s, then that will encourage them to stop hitting your server. Keeps more resources free for the legitimate requests.
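    One way to picture the 301-versus-404 choice is a small decision function. This is a sketch under assumptions: choose_status() is a hypothetical helper, and whether the product exists and whether the URL is the canonical one are both things the script would have to work out itself.

```php
<?php
// Sketch only: choose_status() is a hypothetical helper.
function choose_status(bool $productExists, bool $urlIsCanonical): int
{
    if (!$productExists) {
        return 404; // unknown IDs or gibberish: nothing to index here
    }
    // A 301 must be paired with a Location header naming the correct URL.
    return $urlIsCanonical ? 200 : 301;
}
```

    The result can be passed to http_response_code() (PHP 5.4 or later) before any output is sent.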
  16. #9
    Registered User
    Thanks for the advice!
