From pMachine to WordPress: Rewriting the URLs
Posted by Pierre Igot in: Blogging • July 19th, 2005 • 10:13 am
This is the final chapter in the mini-saga of my switch from pMachine to WordPress as the blog engine powering Betalogue.
As soon as I decided to switch from pMachine to WordPress, I knew that I would have to deal with the URL issue (or URI or whatever you want to call it). One of the reasons for the switch was that the search-engine-friendly permanent links used by pMachine were cryptic URLs like:
http://www.latext.com/pm/comments/xxx_0_1_0_C1/
(where “xxx” is the post ID of the post in question). The URLs for the category pages and archive pages were no better.
The problem with such URLs is that they are user-hostile. It’s far too easy, for example, to make a mistake when typing out such addresses in a browser’s address bar. And they contain absolutely no information about the contents of the pages that they are referring to.
pMachine’s successor, ExpressionEngine, has a better system, but it’s far from perfect and, as far as I can tell, it still ends up generating URLs that are pretty long and not exactly cruft-free. (This was one of the reasons why I didn’t want to switch from pMachine to ExpressionEngine — although the main reason was cost.)
The system adopted by WordPress, on the other hand, is both flexible and well-designed. The default configuration uses URLs that look like this:
https://www.betalogue.com/index.php?p=xxx
(where “xxx” is the post ID). But it’s very easy to switch from this scheme to a search-engine-friendly scheme that provides the user with valuable information and is pretty easy to follow. For example, the URL for the post about “cruft-free” URLs mentioned above is:
https://www.betalogue.com/2003/08/19/cruft-free-urls/
It’s pretty clear: first the domain name, then the date (year/month/day), and then a cleaned-up, case-insensitive version of the blog post’s title. All this is handled on the fly by WordPress. The only thing that the writer of the blog entry might have to do is to enter a shorter, simpler alternative when the cleaned-up version of the title ends up being a very long string of text. This can be done using an optional field on the form for writing a post.
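To make the scheme concrete, here is a minimal sketch in Python of how such a URL could be put together from a date and a title. It is only an approximation for illustration, not WordPress's actual code: the `slugify` and `permalink` helpers are hypothetical names, and the post title used in the example is assumed.

```python
import re

def slugify(title):
    """Rough approximation of a URL slug: lowercase words joined by hyphens."""
    slug = title.lower()
    # Replace every run of characters other than a-z and 0-9 with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")

def permalink(year, month, day, title):
    """Build a date-based URL of the form /year/month/day/slug/."""
    return f"https://www.betalogue.com/{year:04d}/{month:02d}/{day:02d}/{slugify(title)}/"

# The title below is assumed for the sake of the example.
print(permalink(2003, 8, 19, "Cruft-free URLs"))
```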
The problem for me, when I decided to switch from pMachine to WordPress, was that I already had approximately 1,500 posts, and that I had numerous cross-references to these posts in my own blog entries, all of which used the lousy pMachine URL scheme described above. In addition, all these existing pages have been indexed by search engines such as Google and AskJeeves and are frequently visited and referenced by other people on other web sites. I couldn’t exactly switch and make all these URLs invalid overnight.
The first indication that the switch would not lead to such major disturbance was the fact that, in the process of importing the blog data from pMachine into WordPress, the post IDs were preserved. In other words, the post with the ID “28” in pMachine still had the ID “28” in WordPress. This meant that I could fairly easily convert pMachine’s URLs into WordPress-friendly URLs. For example, if the post had the URL:
http://www.latext.com/pm/comments/28_0_1_0_C1/
in pMachine, then the following URL:
https://www.betalogue.com/?p=28
would work as a way to refer to the same post in WordPress. (The “index.php” part in the WordPress URLs is optional, since it’s the name of the default page in the directory.)
With this in mind, I first had to change all the cross-referencing URLs in my own blog entries. I did that using BBEdit’s powerful find/replace function, which lets you use regular expressions (a.k.a. grep). Since the MySQL dump from pMachine was just one gigantic text file containing all my entries (including all the cross-references), I just had to use grep patterns matching the different types of URLs used by pMachine, and use them to replace the URLs with their WordPress equivalents. Fortunately, pMachine’s URL scheme only uses a limited number of patterns, so replacing them only involved a handful of find/replace operations.
For example, the grep pattern for the type of URL mentioned above was simply:
http://www.latext.com/pm/comments/([0-9]+)_0_1_0_C1/
All I had to do was replace this with:
https://www.betalogue.com/?p=\1
Pretty basic stuff as far as grep patterns go.
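For those without BBEdit, the same kind of find/replace can be sketched in a few lines of Python. This is only an illustration of the technique, not what I actually ran: the `convert_links` helper and the sample text are hypothetical, and the dots in the pattern are escaped so that they only match literal dots.

```python
import re

# One type of pMachine permalink; the parentheses capture the post ID.
PMACHINE_COMMENTS = re.compile(
    r"http://www\.latext\.com/pm/comments/([0-9]+)_0_1_0_C1/"
)

def convert_links(text):
    """Replace pMachine comment-page URLs with WordPress ?p= URLs."""
    # \1 in the replacement stands for the captured post ID.
    return PMACHINE_COMMENTS.sub(r"https://www.betalogue.com/?p=\1", text)

# Hypothetical fragment of a blog entry taken from the MySQL dump.
sample = 'See <a href="http://www.latext.com/pm/comments/28_0_1_0_C1/">this post</a>.'
print(convert_links(sample))
```

Each of the other pMachine URL types would need its own pattern, applied the same way.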
This took care of the cross-references in my own blog entries. But what about references on other people’s web sites? Good as it is, BBEdit doesn’t yet include an option for world-wide find/replace operations… And it most definitely never will! :-)
I knew that references on other people’s web sites would have to be handled through mod_rewrite rules included in the .htaccess file on my old LATEXT server. The problem was that, well, I didn’t know much about mod_rewrite. I didn’t exactly feel like spending hours learning how to write mod_rewrite rules. And the documentation about this stuff is still rather user-hostile.
Then I thought that ExpressionEngine, being an upgrade from pMachine, obviously had to use a similar scheme for redirecting URLs going to pMachine entries. So I went to the pMachine web site and looked for documentation about this. (I figured that pMachine “owed” me that little bit of free help, since they had discontinued the product on me!)
Indeed, I quickly found the following wiki page, which explains how ExpressionEngine does its own magic when it comes to redirecting URLs leading to old pMachine posts. I didn’t analyze all that stuff in full detail. I just grabbed the aspects of it that were relevant to my situation. And I soon found out that it wouldn’t be as difficult as I initially thought.
First, I dealt with the category pages, because it was the most straightforward thing. In pMachine, the category pages are referred to with URLs like this:
http://latext.com/pm/betalogue/C0_xx_1/
where “xx” is the number ID of the category in question. (Again, hugely user-friendly stuff…)
I had 12 categories in my blog, so 12 URLs to redirect. But there was no on-the-fly calculation involved. I just had to find the number ID for each category and then redirect it to the appropriate category page in WordPress. In other words, my .htaccess file on the LATEXT server simply had to include the following 12 redirections:
Redirect permanent /pm/betalogue/C0_1_1/ https://www.betalogue.com/category/macintosh/
Redirect permanent /pm/betalogue/C0_2_1/ https://www.betalogue.com/category/arts/
Redirect permanent /pm/betalogue/C0_3_1/ https://www.betalogue.com/category/blogging/
Redirect permanent /pm/betalogue/C0_4_1/ https://www.betalogue.com/category/football/
Redirect permanent /pm/betalogue/C0_5_1/ https://www.betalogue.com/category/french-stuff/
Redirect permanent /pm/betalogue/C0_6_1/ https://www.betalogue.com/category/music/
Redirect permanent /pm/betalogue/C0_7_1/ https://www.betalogue.com/category/nature/
Redirect permanent /pm/betalogue/C0_8_1/ https://www.betalogue.com/category/society/
Redirect permanent /pm/betalogue/C0_9_1/ https://www.betalogue.com/category/technology/
Redirect permanent /pm/betalogue/C0_10_1/ https://www.betalogue.com/category/writing/
Redirect permanent /pm/betalogue/C0_11_1/ https://www.betalogue.com/category/movies/
Redirect permanent /pm/betalogue/C0_12_1/ https://www.betalogue.com/category/language/
It would seem that the only problem with this is that the categories cannot be changed (but subcategories can still be added without interfering). But in fact it’s not a problem. If I ever decide to change my category structure in WordPress, I’ll just have to change the redirection rules accordingly. As long as the .htaccess file on the old server redirects to something on the new Betalogue web site, I am fine.
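If the category structure ever does change, the redirection lines could be regenerated rather than edited by hand. Here is a minimal sketch in Python, assuming the ID-to-category mapping listed above; the `redirect_lines` helper is a hypothetical name used for illustration.

```python
# pMachine category ID -> WordPress category slug, taken from the list above.
CATEGORIES = {
    1: "macintosh", 2: "arts", 3: "blogging", 4: "football",
    5: "french-stuff", 6: "music", 7: "nature", 8: "society",
    9: "technology", 10: "writing", 11: "movies", 12: "language",
}

def redirect_lines(categories):
    """Return one 'Redirect permanent' line per category, ordered by ID."""
    return [
        f"Redirect permanent /pm/betalogue/C0_{cat_id}_1/ "
        f"https://www.betalogue.com/category/{slug}/"
        for cat_id, slug in sorted(categories.items())
    ]

print("\n".join(redirect_lines(CATEGORIES)))
```

Changing a slug in the mapping and rerunning the script produces an updated set of lines to paste into the .htaccess file.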
The rest of the pMachine stuff has to be handled with actual rewriting rules involving the “RewriteEngine” in Apache’s mod_rewrite module, which can redirect people on the fly. As we’ve already seen, there are post-specific pMachine URLs that use the post ID:
http://www.latext.com/pm/comments/Axxx_0_1_0_C
http://www.latext.com/pm/betalogue/Pxxx
http://www.latext.com/pm/comments/Pxxx_0_1_0
http://www.latext.com/pm/comments/xxx_0_1_0_C
http://www.latext.com/pm/comments/xxx_0_1_0_C1
Then there are pMachine’s day pages, which contain all the posts published on the same day, with URLs that look like this:
http://latext.com/pm/betalogue/D20041229/
And then there are archival pages that contain a month’s worth of posts, with URLs like this one:
http://latext.com/pm/archives/A2004101
(Don’t ask me what the trailing “1” means. This is the page for “200410”, i.e. October 2004. Also, all of those pMachine URLs can include an optional slash at the very end.)
What I quickly discovered when trying to familiarize myself with mod_rewrite was that it uses pretty much the same type of grep patterns as the ones I had used in my batch find/replace operations in BBEdit. In addition, the ExpressionEngine help page provided me with ready-made patterns to match existing pMachine URLs. After that, it was just a matter of defining the WordPress-friendly URLs that would replace the pMachine URLs.
And so I ended up with the following rewriting rules:
RewriteEngine On
RewriteCond %{PATH_INFO} D([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9]) [NC]
RewriteRule betalogue https://www.betalogue.com/%1/%2/%3/ [R=301,L]
RewriteCond %{PATH_INFO} P([0-9]+) [NC]
RewriteRule betalogue https://www.betalogue.com/?p=%1 [R=301,L]
RewriteCond %{PATH_INFO} P([0-9]+) [NC]
RewriteRule comments https://www.betalogue.com/?p=%1 [R=301,L]
RewriteCond %{PATH_INFO} /([0-9]+)_0 [NC]
RewriteRule comments https://www.betalogue.com/?p=%1 [R=301,L]
RewriteCond %{PATH_INFO} A([0-9][0-9][0-9][0-9])([0-9][0-9]) [NC]
RewriteRule archives https://www.betalogue.com/%1/%2/ [R=301,L]
RewriteCond %{PATH_INFO} A([0-9]+)_0 [NC]
RewriteRule comments https://www.betalogue.com/?p=%1 [R=301,L]
I still do not know what some of the stuff means (more specifically, the “[NC]” and “[R=301,L]” at the end of the line), but all I know is that it works! These six rules are enough to handle all the existing pMachine URLs that people might still be using when referring to my blog.
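For readers curious about the flags: in Apache's mod_rewrite, “NC” makes the pattern match case-insensitive, “R=301” sends the visitor a permanent redirect, and “L” stops processing further rules once one has matched. The matching logic of the six rules can be tried out offline with a rough Python sketch. This is a simplification and not how Apache itself evaluates things: the hypothetical `redirect_target` function checks both patterns against the whole path, whereas Apache tests each condition against PATH_INFO only.

```python
import re

BASE = "https://www.betalogue.com"

# (condition pattern, rule pattern, target template), in the same order
# as the six rewriting rules. {0}, {1}, {2} stand for %1, %2, %3.
RULES = [
    (r"D([0-9]{4})([0-9]{2})([0-9]{2})", "betalogue", "/{0}/{1}/{2}/"),
    (r"P([0-9]+)", "betalogue", "/?p={0}"),
    (r"P([0-9]+)", "comments", "/?p={0}"),
    (r"/([0-9]+)_0", "comments", "/?p={0}"),
    (r"A([0-9]{4})([0-9]{2})", "archives", "/{0}/{1}/"),
    (r"A([0-9]+)_0", "comments", "/?p={0}"),
]

def redirect_target(path):
    """Return the new URL for an old pMachine path, or None if no rule applies."""
    for cond, rule, template in RULES:
        if not re.search(rule, path):  # the RewriteRule pattern
            continue
        # re.IGNORECASE plays the role of [NC] on the condition.
        match = re.search(cond, path, re.IGNORECASE)
        if match:
            # The first matching rule wins, like the L flag.
            return BASE + template.format(*match.groups())
    return None

for old in ("/pm/betalogue/D20041229/", "/pm/comments/28_0_1_0_C1/",
            "/pm/archives/A2004101", "/pm/comments/A123_0_1_0_C"):
    print(old, "->", redirect_target(old))
```

The post ID 123 in the last sample path is made up for the example; the other paths come from the URL types listed earlier.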
In fact, I could probably have skipped the step where I updated the cross-referencing URLs in my own blog entries with batch find/replace operations in BBEdit, because these rewriting rules would have redirected these cross-references to the appropriate WordPress pages on the new site anyway. But of course I prefer having WordPress blog entries that don’t depend on the rules defined in an .htaccess file on another site (www.latext.com) to refer to each other, even though this other site is mine as well.
In conclusion, I should stress that this is not, of course, a perfect solution. Referring to my blog entries in WordPress with a URL like:
https://www.betalogue.com/?p=xxx
works, but it’s not the ideal way to refer to them, which would be to use their cruft-free URLs with the date and the cleaned-up blog title:
https://www.betalogue.com/2003/08/19/cruft-free-urls/
Unfortunately, there is simply no way that mod_rewrite can find the cruft-free URL for post ID xxx on the fly. It would involve looking up the entry itself in the MySQL database! I doubt very much that this is possible with mod_rewrite rules.
The other thing that I don’t really know is how a search engine such as Google will cope with the redirection. As far as I know, Google indexes my Betalogue pages on a regular basis, but now that the URLs are redirected to a new site, with “?p=xxx” in the actual page references, I suspect that Google won’t like that too much and will probably stop indexing those pages.
On the other hand, it probably will not be long before Google starts indexing my new pages directly, using the cruft-free, search-engine-friendly URLs generated by WordPress. So any lull in the Google indexing process for Betalogue will probably be only temporary. Since I do not depend on Google for my livelihood, this is not a particularly big problem.
So there you have it. www.betalogue.com is operational, and all the URLs to specific blog entries or blog pages on the old www.latext.com site are automatically redirected to the corresponding entries/pages on the new site. The only thing that’s left for me to do is to add a single line to my .htaccess file on the old server:
Redirect permanent /pm/betalogue https://www.betalogue.com
This line will automatically redirect people visiting the old Betalogue home page to the new Betalogue home page. I have not added it yet, because I actually want people to know what’s going on before they are redirected to the new site, so I want them to be able to read the final post on the old Betalogue home page, which explains the situation. (Of course, rather than redirecting it to the dynamic new home page, I could simply redirect /pm/betalogue to a specific static page on the new site that would explain the situation. That’s what I might end up doing after a while.)
I could also redirect the old RSS feed URL to the new one, but for the same reason I have not done so. I think I am more comfortable with people knowingly updating their RSS subscriptions — especially in light of the fact that there are now three different RSS feeds (full text entries, excerpts, and reader comments).
One last thing… To me, the most amazing aspect of this switch from pMachine to WordPress (and from one domain name to another, and from one provider to another) was that I did it all… on a pokey 28.8 kbps modem connection. It is really amazing what you can do with so little bandwidth, isn’t it?
July 19th, 2005 | 11:04 am
Congrats on the move :) – glad you got all the little things sorted out.
I know what you mean about modem connections, I’ve had 56k dialup for a month now (HELL when you are used to 8Mb broadband) – but I still managed to get a lot of stuff done online.
July 19th, 2005 | 12:31 pm
Thanks :).
These things might be “little”, but they are quite important for an already established blog such as this one.
As for bandwidth, I guess I shouldn’t complain too much, since I can go to the university 10 minutes from here whenever I need to download really large files. But of course it’s still a major pain to have to cope with dial-up on a daily basis for browsing. No podcasting for me! (Not that I would be any good at it, mind you…)