Unicode, WordPress, Panther Server and BBEdit: UTF-8 with or without BOM

Posted by Pierre Igot in: Macintosh
September 27th, 2004 • 12:00 am

I am having fun these days developing and designing a web site powered by the WordPress blogging system. It’s working well for the most part.

Yesterday, however, I encountered a problem that might have left me stuck for a long time if I hadn’t developed a general awareness of potential Unicode-related issues over the years.

When setting up and designing this new web site, I made a conscious decision to use Unicode for character encoding. Previously, I had always designed web pages using ISO-Latin1 — but I figured it was time to put the new standard to the test. After all, Panther Server, WordPress, etc. were all tools that were supposed to fully support Unicode.

I am also using the latest version of BBEdit to develop my web pages — and BBEdit features built-in support for a variety of character encoding schemes, including several flavours of Unicode.

The key phrase here is several flavours. Unfortunately, the promise of one universal character encoding scheme that Unicode was supposed to fulfill has been marred, so far, by the fact that there are many flavours of Unicode: UTF-16, UTF-8, with BOM (Byte-Order Mark) or without BOM, etc.
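To make these flavours concrete, here is a small illustrative sketch in Python (chosen only for demonstration; it has nothing to do with BBEdit or WordPress) that encodes the same accented character in several ways. The helper name `encoded_forms` is hypothetical, and treating Python's `utf-8-sig` codec as the equivalent of UTF-8 with BOM is an assumption made for the example.

```python
from typing import Dict

# Encodings to compare; "utf-8-sig" is Python's name for UTF-8 with a leading BOM.
FLAVOURS = ["iso-8859-1", "utf-8", "utf-8-sig", "utf-16-le", "utf-16-be"]


def encoded_forms(text: str) -> Dict[str, bytes]:
    """Return the raw bytes produced for `text` by each encoding in FLAVOURS."""
    return {name: text.encode(name) for name in FLAVOURS}


if __name__ == "__main__":
    # Print each encoding name next to the bytes it produces, in hexadecimal.
    for name, data in encoded_forms("é").items():
        print(name.ljust(11), data.hex())
```

The single character “é” comes out as one byte in ISO-Latin1, two bytes in UTF-8, and the same two bytes preceded by three extra BOM bytes in the BOM variant, so files that look identical in an editor can differ at the byte level.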

So yesterday, I was busy customizing my WordPress files. Since this is a French-language web site that I am developing, I needed to be careful about accented characters and French punctuation issues. I had configured WordPress to use UTF-8 as the default character encoding for all pages. After opening WordPress files in BBEdit, therefore, I made sure that I was saving them as UTF-8 files.

BBEdit offers two UTF-8 options: “UTF-8” and “UTF-8, no BOM”. Since I wasn’t sure which one to use, I selected the regular “UTF-8” one. There was nothing in the WordPress documentation that specified anything about the UTF-8 encoding.

And everything seemed to work fine. I added French text with accented characters to the files, and the French text was rendered properly in the web pages generated by WordPress.

And then I started customizing some WordPress files inside the “wp-includes” folder, which include bits of English here and there that I wanted to replace with their French equivalents. Here again, I made sure that I was saving the files with “UTF-8” encoding.

The web pages generated by WordPress still looked fine, and the bits of English that I had replaced with their French equivalents were rendered properly. But then I went to the blogging system’s admin pages to make some changes to some posts. And I was greeted with the following error message in Safari, in multiple copies:

Warning: Cannot modify header information - headers already sent by (output started at /Library/WebServer/Documents/wp-includes/locale.php:1) in /Library/WebServer/Documents/wp-login.php on line XX

Oops. The whole admin side of things in WordPress was completely broken. I could no longer log in, edit posts, or do anything else. And the error message given by WordPress wasn’t exactly helpful about the nature of the problem.

Since things had been working fine until then, including on the admin side, I immediately suspected that the source of the problems was the “wp-includes” files that I had just updated, especially since they were referred to in the error message (even though what the error message was saying about them was completely opaque to me). And because I am somewhat aware of possible Unicode issues, I immediately suspected that the problem had to do with the particular flavour of Unicode that I was using (even though the error message didn’t mention anything about character encoding).

Sure enough, I reopened the “wp-includes” files that I had edited, and saved them in BBEdit using “UTF-8, no BOM” encoding (instead of the “UTF-8” that I had been using so far). I uploaded the files back to the server, and tried accessing the blogging system’s admin pages. And it worked!

Why does “UTF-8, no BOM” work for these files when “UTF-8” doesn’t? I have no idea. Why does “UTF-8” work for the other files used by WordPress (outside the “wp-includes” folder)? I have no idea.
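Whatever the underlying reason, checking a file for a BOM and removing it can be scripted. The following is a minimal sketch in Python, not something provided by WordPress or BBEdit; the function names are hypothetical, and it assumes the files are UTF-8, so that a BOM, if present, is the three bytes EF BB BF at the very start of the file.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other data is unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: Path) -> bool:
    """Rewrite the file without its BOM; return True if the file was changed."""
    data = path.read_bytes()
    if not has_utf8_bom(data):
        return False
    path.write_bytes(strip_utf8_bom(data))
    return True
```

Calling `strip_bom_in_file(Path("wp-includes/locale.php"))` (a relative path assumed for illustration) would rewrite that file in place only if it begins with a BOM. Working on raw bytes rather than decoded text leaves the rest of the file, accented characters included, exactly as it was saved.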

What I do know is that Unicode is great… when it works. When it doesn’t, well, as my experience above shows, you are pretty much on your own, and experimenting is often the only option. Since character encoding issues are more likely to have an impact on non-English users of the software, it means that they are not documented as well as other issues in software documentation and on-line support forums. That’s why I feel it is particularly important for me to report on such issues here in this blog, so that it might help other non-English users out there.

Regardless of the issues encountered, this experience is also making me feel more enthusiastic about embracing Unicode for my own web site, including this blog. Right now, I am still using ISO-Latin1, and I am constantly frustrated by its limitations. But upgrading my blogging system would not be a small undertaking. With the amount of work that I have these days, it’ll take me some time… Unfortunately, other people’s web sites are a higher priority than mine!


17 Responses to “Unicode, WordPress, Panther Server and BBEdit: UTF-8 with or without BOM”

  1. Zach says:

    Agreed – the options should be “UTF-8” and “UTF-8, with BOM”. I wish they would have just completely left out the option for a BOM for UTF-8, cause it’s caused me trouble in the past too.

  2. Pierre Igot says:

    Olivier and Zach: Thanks for the clarifications. I did suspect something like this, but don’t know enough about PHP to understand the specifics. Still, the BOM doesn’t seem to bother PHP in root level files…

    I think what happens here is that the BOM is harmless at the start of a root level file, but it bothers PHP when it sits in the “header” (beginning) of an “include” file, because an include file is inserted inside an existing file when the dynamic pages are built. So we end up with a page that contains a BOM somewhere in the middle of it… and that doesn’t work.

    However, based on what you guys are saying, I should probably get rid of the BOM everywhere. Intuitively, it doesn’t make much sense to have a BOM for UTF-8. I wonder why it even is an option.

  3. Olivier says:

    A quick guess:

    The BOM (Byte Order Mark) is composed of two invisible characters at the very beginning of the file. They’re used to distinguish between big endian and little endian byte order in UTF-16. There is also a BOM for UTF-8, even though there is no byte order issue with it.

    When a PHP script uses the header function to send special HTTP headers (like a graphical counter that sends a Content-Type: image/png header), it must happen before any content is produced. Therefore, the php opening tag must be the very first characters in the file. When there’s a BOM, the two invisible characters composing it are treated as content and the header function doesn’t work anymore.

    I remember reading that using a BOM with UTF-8 is not recommended, for this reason among others. I agree it’s not clear in BBEdit that UTF-8 no BOM is actually the “standard” encoding and UTF-8 tout court is the special one.

  4. Mike P. says:

    Yeah,

    I’ve had BOM problems when coding up PHP with Dreamweaver and utf-8, and found that Topstyle works well for hunting them down and getting rid of them.

    Maybe there’s a better way in DW… but this works..

  5. ssp says:

    Huh, this is strange. I was under the impression that no BOM is needed in UTF-8 (as UTF-8 is, well, made up of 8-bit pieces). The only case where it might be needed is to give programs a hint that a file is Unicode. But I thought that’s more of a dirty trick and not necessary if the software is otherwise informed of that fact.

    Being a traditional BBEdit ‘disliker’, I’ll take it as another hint at that application’s particular ‘quality’. ;)

    Or take it from the horse’s mouth: http://www.unicode.org/faq/utf_bom.html

  6. Olivier says:

    ssp: BBEdit has many ways of figuring out the encoding of text files when it first encounters them. It supports a number of methods that are likely to be used by different people using different programs, and I’m glad that it does not limit itself to the ones *you* like to use. A BOM in a UTF-8 file can exist (and is legal, according to unicode.org) so BBEdit is able to read it when it encounters it. Reciprocally, it is capable of writing in any character set it reads and does not play cop by telling you how you should save your documents. If you like simpler applications that take you by the hand using a wizard whenever you save a document, fine, but it does not mean that a more powerful tool is necessarily of lower ‘quality’.

    Pierre: I made a quick test with BOMs and PHP. The problem is indeed with the header() function. If you put anything before the php block in the file (a blank line, some text), PHP fails with the error you mentioned. If the file is a BOMmed UTF-8, PHP treats the BOM as text and fails the same way. It does not seem to like any flavour of UTF-16, though.

  7. Pierre Igot says:

    Thanks to everyone for their input. It does look like the BOM can be an issue with various tools, and not just PHP scripts. I guess it’ll still be years before the use of Unicode is so prevalent that most tools support it “transparently”. Maybe what the BBEdit developers could do is alter their UI a bit so that the user is made aware of the fact that the presence of a BOM in a UTF-8 file can cause problems with various tools.

  8. Lachlan Hunt says:

    I had this exact problem when I started learning PHP. The problem is caused because PHP begins the content from the first non-whitespace character that is not part of any PHP code. The UTF-8 BOM, U+FEFF, is represented by the octets: 0xEF 0xBB 0xBF. Thus, when you save a file as UTF-8, the first few characters of the file will look like this, if each octet is interpreted as one character, as in ISO-8859-1:

    ï»¿<?php

    Since none of those octets represent white space, PHP assumes it is the beginning of the content, sends out all the default HTTP headers, and begins the content with those bytes. So, when you try to send out additional HTTP headers, the content has already begun, and it’s too late: you can’t bring back what you’ve already sent.

    While looking for a solution, I read in a forum somewhere that this problem either had been or would be fixed in PHP5. But I can’t find that forum now, so I can’t be certain. I just know that my ISP has an old version, and I was forced to locate an editor for Windows that allowed me to choose not to output the BOM.

  9. Paul Ingraham says:

    A couple more relevant notes…

    Here’s an excerpt from BBEdit’s manual which may help to further clarify the issue:

    no BOM: When saving Unicode files, you should always include a byte-order mark (BOM) so that the reading application knows what byte order the file’s data is in. For maximum compatibility, the BOM should be used whenever possible. Use one of the “no BOM” options only if there is a specific reason to do so, such as providing compatibility with software that malfunctions when a BOM is present. (For purposes of recognition when you use this option, the UTF-16 BOM is FEFF, and the UTF-8 BOM is EFBBBF.)

    After reading this, I thought, “Okay, sure thing, UTF-8 with a BOM for me, I sure am probably not one of those users with a ‘specific reason’ to use the ‘no BOM’ option.” So far so good, although from the sounds of the comments here I have probably come within a hair’s breadth of running afoul of the BOM/PHP conflict that y’all have been discussing. Maybe I am one of those users with specific reasons to use the no BOM option. :-)

    I’ve also discovered that PHP doesn’t like UTF-16. Lord knows why, I naively experimented with encoding some files as UTF-16, which broke rendering of PHP includes in those pages. Not really surprising in retrospect.

  10. Olivier says:

    Surely PHP must have its own internal string encoding (UTF-8, UTF-16, what do I care as long as it can hold any character that exists in any language) and it only deals with other charsets on input and output. That is, it converts text to the internal encoding when it’s read (the source encoding must be specified somehow, e.g. with the headers of a HTTP POST request) and also converts it to any encoding the developer sees fit when it’s written somewhere (with a sensible encoding set by default). That’s how software that has to deal with text data must be written. Then functions like strpos() must only be able to deal with strings encoded in PHP’s internal charset.

    As for a PHP script encoded in UTF-16, I see no reason why PHP fails to parse it. It has a BOM, PHP should recognise it and act accordingly. What’s inside a PHP block should be converted to whatever encoding PHP likes to parse and the rest should be output as is.

  11. Olivier says:

    Why would (NULL)H(NULL)i(NULL)! be so confusing to PHP? All it has to do with it is send it as is to the output, it does not even need to parse it. It’s not a big challenge, it’s a matter of being aware that UTF-16 exists. Unfortunately, that seems to be the problem with many developers, those of PHP among them.

    ? Unicode Ribbon Campaign ? No ASCII, anywhere
    <http://ithink.ch/unicode>

  12. J. King says:

    I’ve also discovered that PHP doesn’t like UTF-16. Lord knows why, I naively experimented with encoding some files as UTF-16, which broke rendering of PHP includes in those pages. Not really surprising in retrospect.

    Simply: because PHP isn’t a Unicode application. To PHP, a script encoded in UTF-16 looks something like this:

    (NULL)H(NULL)i(NULL)!

    You can just imagine how confusing that would be. PHP will work okay with UTF-8 because ASCII characters have the same octet values in UTF-8 as they do in ISO-8859-1. If you try a sorting function, though, PHP will not see your é character as an accented letter, but as a garbled string of two nonsense characters (much like the three of the UTF-8 BOM). UTF-8 is fine for most purposes in PHP (4), but it’s certainly not fully supported, or even a good idea, to be honest. PHP 5’s default package contains some Unicode-compatible functions, but they don’t cover everything, unfortunately.

  13. J. King says:

    Why would (NULL)H(NULL)i(NULL)! be so confusing to PHP? All it has to do with it is send it as is to the output, it does not even need to parse it.

    Consider strpos(). Sure, PHP can -output- UTF-16 (in theory; it’s not practical to do so), but if you have a PHP script encoded in UTF-16 or that works with UTF-16 strings using the standard functions, then that changes matters greatly.

  14. J. King says:

    As I said, Olivier: PHP isn’t a Unicode application. Indeed, it’s remarkably dense with character encodings. All it knows (natively, without any added modules) is ISO-8859-1. Period. It won’t convert a UTF-16 script to one it can understand because it assumes that any script it receives will without fail be encoded in ISO-8859-1—because to PHP, nothing else exists.

    Modules like ‘iconv’ and ‘mbstring’ address most of the problems inherent with using multi-byte encodings (including various encodings of the Unicode character set), but these are usually not available with most Web hosts, I am told. They certainly weren’t with mine until I asked for them.

    Even then I don’t know if they would allow you to feed PHP UTF-16 scripts.

  15. Musings & Meanderings » Blog Archive » Waylaid by the BOM in UTF8 says:

    […] Although I found many entries in help forums by webmasters waylaid by BOM, the only formal faq I’ve found on it is by Sun and Unicode. The Wikipedia entry refers to this being a problem with Unix and not Windows servers, and I’ve read that including the BOM in UTF-8 by default was one of those unilateral Microsoft decisions. Here also is a post by WordPress blogger Pierre, and a related issue post on translating character sets and collation in WordPress. […]

  17. Character Encodings « Mr Chimp Learns to Write says:

    […] Here are some more links: Character Encoding Issues UTF-8 With or without BOM UTF/BOM […]
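
The byte-level points raised in the comments above can be reproduced with a short sketch. It is written in Python rather than PHP, purely for illustration, and the helper name `as_latin1` is hypothetical. It only shows what the bytes look like to a program that reads one octet as one character; it makes no claim about how PHP is implemented internally.

```python
import codecs


def as_latin1(data: bytes) -> str:
    """Decode bytes one octet per character, as ISO-8859-1 does."""
    return data.decode("iso-8859-1")


# Start of a PHP file saved as UTF-8 with BOM: three extra characters
# appear before the opening tag.
bom_file_start = as_latin1(codecs.BOM_UTF8 + b"<?php")

# "Hi!" in big-endian UTF-16: each ASCII letter is preceded by a NUL byte.
utf16_hi = "Hi!".encode("utf-16-be")

# "é" in UTF-8 occupies two octets, so a byte-oriented view sees two characters.
e_acute_view = as_latin1("é".encode("utf-8"))

if __name__ == "__main__":
    print(repr(bom_file_start))
    print(utf16_hi)
    print(repr(e_acute_view), len(e_acute_view))
```

Anything that counts or compares bytes rather than characters will see the BOM as three characters of content, the UTF-16 text as letters interleaved with NULs, and the accented letter as two separate characters.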
