mod_rewrite mystery

Posted by Pierre Igot in: Blogging, Technology
June 22nd, 2007 • 4:55 pm

I am not particularly good with regular expressions, but you know, I can manage. Today, however, I am faced with a problem that I simply cannot figure out.

I still have, in older posts in this blog, URLs such as this one:

http://www.latext.com/blog/2003/04/15.html#a191

I want a mod_rewrite rule on my www.latext.com server that automatically converts such a URL to this:

https://www.betalogue.com/2003/04/15/

That shouldn’t be so hard to do, should it? Well, I simply cannot get it to work. For some strange reason, I can get half-way there, but not any further. Here is what I have been able to get to work:

RewriteCond %{PATH_INFO} ([0-9]+)/([0-9]+)(\.html) [NC]
RewriteRule blog https://www.betalogue.com/%1/%2/ [R=301,L]

This successfully converts the above URL to:

https://www.betalogue.com/04/15/

It’s not what I want, but it’s part of it. I am still missing the year part.

However, as soon as I add the year part in the rule’s condition, it completely fails:

RewriteCond %{PATH_INFO} ([0-9]+)/([0-9]+)/([0-9]+)(\.html) [NC]
RewriteRule blog https://www.betalogue.com/%1/%2/%3/ [R=301,L]

Can anyone tell me what I am doing wrong here?

It’s not the slash character separating the year from the month, since I also have a slash character separating the month from the day and my rule is able to process that one just fine.

For some reason, it just won’t process it.

I have tried all kinds of things for an hour now and I just cannot figure out what’s wrong. The Apache page on mod_rewrite is just too cryptic for me. I need help!


10 Responses to “mod_rewrite mystery”

  1. Arden says:

    {} can be used for selecting a specific number of matches. For example:

    e{6} matches eeeeee
    e{4,6} matches eeee, eeeee, and eeeeee
    e{4,} matches eeee, eeeee, eeeeee, eeeeeee…
    e{0,4} matches [empty], e, ee, eee, eeee

    So perhaps try [0-9]{4} for the year. Also, give Regular-Expressions.info a good, thorough read when you have the chance.

  2. Arden says:

    Upon further contemplation, this seems to be what you should use:

    RewriteCond %{PATH_INFO} ([0-9]{4})/([0-9]{2})/([0-9]{2})(\.html) [NC]
    RewriteRule blog https://www.betalogue.com/%1/%2/%3/ [R=301,L]

    And I also suggest reading the Comprehensive guide to .htaccess at Javascriptkit.com.

  3. Pierre Igot says:

    Thanks, but I am afraid your suggestion makes no difference. I don’t think specifying the number of matches is necessary here anyway. But even if I try something like

    ([0-9][0-9][0-9][0-9])/([0-9][0-9])/([0-9][0-9])

    it still doesn’t work. So the issue is not with the expressions for the numbers.

    I’ll read the references you suggested!

  4. Arden says:

    What about:

    RewriteCond %{PATH_INFO} ([0-9/]+)(\.html) [NC]
    RewriteRule blog https://www.betalogue.com/%1/ [R=301,L]

    Do you run into any problems here? Since the only things you really need to match are numbers and forward slashes, just match as many of them in a row as you can.

  5. Pierre Igot says:

    Now that’s interesting. Your idea was good. Indeed, I do not need separate sections if I can capture the date sequence/path all at once. So I tried what you suggested, and discovered that if I use

    ([0-9/]+)(\.html)

    to try and capture the date sequence/path as %1 and then use %1 in the rewrite rule, I get a URL like this:

    www.betalogue.com/04/15

    So it looks like ([0-9/]+) actually only captures the sequence “/04/15”, and not the year bit that comes before it. And I have no idea why. It looks as if the year bit is not part of the PATH_INFO variable!

    That also explains why my earlier solutions didn’t work. But it doesn’t tell me how to retrieve that year bit! I would expect ([0-9/]+) to capture “/2003/04/15” but it does not. Maybe I am using the wrong variable here and PATH_INFO cannot be used for this.

  6. danridley says:

    I’ve always matched on REQUEST_URI, rather than PATH_INFO.

    My understanding is that PATH_INFO is “extra” path information, which is designed to be passed on to a script; for example, if Apache knows that /2007/* is handled by WordPress, it passes on the *remainder* of the URL in PATH_INFO, because that’s the virtual path that is required by WordPress. (The advantage to this approach is that if WordPress is installed such that its URL is, say, /blog/2007, it’s still going to get the same data in PATH_INFO.)
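
    A quick way to see what each variable actually holds is to echo both back in a temporary redirect. This is only a sketch: the /debug target is a hypothetical path that does not appear anywhere in this thread, and it assumes the rule sits in the .htaccess file at the document root with the rewrite engine already enabled.

    ```apache
    # Temporary debugging rule: remove it once you have your answer.
    # /debug is a hypothetical target; it does not need to exist, because
    # the values of interest show up in the redirected URL's query string.
    # R=302 keeps the redirect temporary, so browsers should not cache it.
    RewriteRule ^blog/ /debug?path_info=%{PATH_INFO}&request_uri=%{REQUEST_URI} [R=302,L]
    ```

    After requesting one of the old /blog/ URLs, the address bar shows the two values side by side, which makes it easy to tell which one contains the year.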

  7. Pierre Igot says:

    That did it! Using REQUEST_URI did indeed solve the problem. The weird thing is that I really thought I’d tried it earlier on already, without success. But I think what happened was that there was cached stuff on the server or in the browser and it wasn’t reloading the .htaccess file, which I discovered later on. After that I tried a different URL each time, and that helped work around the cache issue, but I should have retried the REQUEST_URI thing.

    Thanks for your help and persistence!
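
    The thread does not quote the final rule. A minimal sketch of what a REQUEST_URI-based version could look like, assuming the .htaccess file sits in the document root of www.latext.com and reusing the digit patterns from the earlier attempts:

    ```apache
    # Assumption: this lives in the .htaccess file at the document root.
    RewriteEngine On
    # REQUEST_URI holds the full request path, e.g. /blog/2003/04/15.html,
    # so the year, month, and day can all be captured as %1, %2, and %3.
    RewriteCond %{REQUEST_URI} ^/blog/([0-9]{4})/([0-9]{2})/([0-9]{2})\.html$
    RewriteRule ^blog/ https://www.betalogue.com/%1/%2/%3/ [R=301,L]
    ```

    Since browsers tend to cache 301 redirects, it may be easier to test with R=302 first and switch to R=301 once the rule behaves as expected.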

  8. Pierre Igot says:

    Actually there’s one more tiny problem :). In Safari, the resulting URL is what I want, but in Camino, I still get the trailing “#xxx” stuff added at the end. It’s not a big problem, because the desired page loads fine either way, but I should probably try to get rid of it. I figure each browser processes and sends URLs in a different way, and that’s why Safari drops it.

  9. danridley says:

    Okay, I just spent more time and effort than I probably should have to verify my hunch: the anchor in the URL (the # and everything that follows it) stays in the browser; it never gets passed to the server at all, so mod_rewrite won’t have a chance to affect it.

    Safari drops the anchor on a redirect; Camino does not. I personally would consider this a Safari bug.

    If you cared enough, you could use JavaScript and parse the anchor out on the client side, by looking at location.href. That is undoubtedly overkill, but it might be fun :-) .

  10. Pierre Igot says:

    Indeed, you shouldn’t have :). But thanks, the additional insight is greatly appreciated. I don’t think I’ll bother with the JavaScript stuff. I’ve already spent far too much time trying to figure out the rewrite stuff.

    The rewrite already takes people to the right page. That’s good enough as far as I am concerned, especially since we’re talking about someone possibly following an old link once in a blue moon.

    Thanks again for all your help.
