Mac OS X 10.6 (Snow Leopard): More on Preview problem with accented characters

Posted by Pierre Igot in: Macintosh
March 23rd, 2010 • 3:54 pm

Following yesterday’s post on the other problem I recently encountered in Mac OS X’s Preview, with the Clipboard failing to properly preserve accented characters, I have received a few e-mails which have helped me narrow down the problem.

Betalogue reader John M. wonders whether it makes any difference in which application I attempt to paste the “qualité” word (with the final accented “é”) selected and copied from the sample PDF document in Preview.

It does not, at least not on my machine running the latest available version of Snow Leopard. Whether I paste the word into Mail, Pages, TextEdit, or BBEdit, I get the same result, i.e. “qualite” without the accented “é.”

John, on the other hand, notes that, in Tiger, he is able to reproduce the problem when pasting into a Pages document, but not into a Mail message. So it looks like a bug that’s been with us for a while, but manifests itself differently across OS versions.

John also notes that you can use a little tool called Clipboard Viewer to view the contents of the Clipboard itself after copying the selected text. He says Clipboard Viewer is installed as part of the Xcode developer tools. However, after installing the Snow Leopard developer tools on my startup volume, I couldn’t find that utility anywhere. But Spotlight found it in an archive of an older system that I had on another hard drive partition. It looks like Clipboard Viewer is actually a sample project/application included in a previous version of the Xcode developer tools, but no longer included in the latest version.

In any case, I am able to run Clipboard Viewer on my system and here’s what I can see. Clipboard Viewer is a pretty crude application, but it lets you view the various things that Mac OS X actually stores in the Clipboard when you copy something in an application:

Clipboard View 1

The picture above shows what I see in Clipboard Viewer after selecting the word “qualité” in my sample PDF (see yesterday’s post) and copying it.

The string of bytes at the bottom is what Clipboard Viewer displays when I click on the third line in the top half, i.e. the one that reads “public.utf8-plain-text.” As you can see, it shows a “plain text” word “qualite” without the accent.

Of course, it does not make sense that the plain text version of the copied text should be stripped of its accents, since accents are not part of the format of a text string (although Microsoft’s engineers would probably have something to say about that), but part of its encoding. And UTF-8 encoding is more than capable of handling accented characters, since it’s a flavour of Unicode, which normally includes all possible combinations of diacritics in the world’s alphabets, and then some.

Since Mac OS X only strips the accent when it’s part of the very last character in the selected text that is copied from a PDF file in Preview to the Clipboard, the next step is to look at what Clipboard Viewer shows when copying text that includes accented characters somewhere within the text string. Here’s what I get when I select the string “qualité et les mécanismes d’assurance de la qualité” in my sample PDF file and then copy it:

Clipboard View 2

As you can see, the first accented “é” in the first occurrence of the “qualité” word is listed as “65 cc 81,” whereas the last “e” in the string is simply listed as “65,” which is indeed the code for the plain “e” without an accent.

So basically it looks like a character encoding issue, with Mac OS X’s Clipboard somehow failing to preserve the full code of the final character in the selected text.
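For readers who want to check the byte values themselves, here is a minimal Python sketch. It assumes that the text on the Clipboard uses the decomposed spelling of “é” (a plain “e” followed by the combining acute accent, U+0301), which is what the “65 cc 81” sequence suggests. The helper name hex_bytes is my own, purely for illustration.

```python
import unicodedata

# Decomposed spelling: plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "qualite\u0301"
# Precomposed spelling of the same word: "é" as the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

print(hex_bytes(decomposed[-2:]))   # 65 cc 81  (e + combining accent)
print(hex_bytes(precomposed[-1:]))  # c3 a9     (precomposed é)
print(hex_bytes("e"))               # 65        (plain e, accent lost)
```

In other words, losing the last two bytes of “65 cc 81” is enough to turn the final “é” into a plain “e.”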

I was not surprised to also receive an e-mail from Sven-S. Porst, since he’s a long-time reader and also part of earthlingsoft, the makers of the UnicodeChecker utility, so that kind of stuff is right up his alley:

It seems you can also reproduce it by just typing qualité in TextEdit and printing to PDF.

Interestingly, running the file through mdimport -d5 faithfully reproduces the accent.

The problem seems to happen with any accented character. And it doesn’t seem to be about normalisation forms.

My guess is that Preview doesn’t compute the selection index correctly but gets that of the first and the last character in the selection, thereby ignoring that the character may have a composing accent coming after it (the problem also occurs in the middle of a word, btw, when you’re selecting only up to the accented characters).

To me it sounds like a bug, but like one that should be easy to fix.

I agree with him, of course, and suspect that, if Apple’s engineers pay any attention to my bug reports (and there are signs that they do), this will eventually get fixed.
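To make Sven’s guess more concrete, here is a small Python sketch of what such a bug could look like. It is only an illustration of the hypothesis, not Preview’s actual code: the functions naive_copy and extended_copy are hypothetical, and the sketch assumes that the PDF text stores each accent as a separate combining character after its base letter.

```python
import unicodedata

def naive_copy(text, start, end):
    """Copy code points start..end inclusive, where end is the index
    of the last selected base character (the hypothesised bug)."""
    return text[start:end + 1]

def extended_copy(text, start, end):
    """Same selection, but extend the end past any combining marks
    attached to the last selected base character."""
    stop = end + 1
    while stop < len(text) and unicodedata.combining(text[stop]) != 0:
        stop += 1
    return text[start:stop]

# Decomposed text, as assumed above: "e" followed by U+0301.
text = "qualite\u0301 et les me\u0301canismes"

# Selecting the first word: the last base character is the "e" at index 6.
print(naive_copy(text, 0, 6) == "qualite")             # accent dropped
print(extended_copy(text, 0, 6) == "qualite\u0301")    # accent kept
```

Note that, in this sketch, accents in the middle of the selection survive either way, which matches what Clipboard Viewer shows for the longer string: only the accent on the very last selected character is lost.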

I still find it interesting that something so basic manages to slip through unnoticed. It is probably revealing that it is a problem that has to do with copying text from a PDF and with accented characters. If other Mac users are like me, they are just used to things not working quite right when attempting to copy text from PDF documents and probably do not pay too much attention to such smallish flaws. They just fix the accented “e” manually and get on with their work.

And the fact that the problem affects accented characters means that it will primarily be noticed by non-English-speaking users who routinely attempt to copy text in PDF documents that are in languages making heavy use of diacritics. Here again, such users tend to be used to things not working quite right on their computers, simply because, as noted yesterday, these computers are designed by English-speaking engineers and have a long history of treating other languages as an afterthought.

Things have undoubtedly got better in recent years in that department, but to indicate that we still have a way to go, I only need to tell you that, if you want a copy of Microsoft Word with the French user interface, you need to purchase a separate copy of the software, because Microsoft refuses to support Mac OS X’s built-in international features, which enable you to switch the language of your user interface on the fly. (Both the English and the French versions of Word 2008 include English and French spell-checking capabilities, but to switch the language of the user interface, you need to purchase a separate version of the same software.)

