Comments on: Mac OS X’s Preview: When a space is not a space

By: danridley

danridley — Thu, 19 Jul 2007 18:33:20 +0000

DT Pro Office is actually running OCR; not just converting the text data from the PDF. Conceptually, it almost seems like a step backward — basically throwing away the text data and treating the PDF like an image — but if the text data in the PDF is junk, you’ll get better results from OCR.

I don’t know how PDF2RTF is implemented, but since it’s free (and decent OCR engines are expensive) and fairly small (and OCR engines are big), I assume it’s extracting the text with either PDFKit or something similar, and would therefore have the same limitations as Preview, as you saw.

By: Pierre Igot

Pierre Igot — Thu, 19 Jul 2007 17:02:01 +0000

Thanks for the suggestion. Interestingly, DEVONtechnologies' own freeware PDF2RTF service suffers from the same problem as Preview, i.e. it fails to "see" the spaces after the commas and semi-colons and the resulting RTF file fails to include the required space characters in those locations. Presumably PDF2RTF is based on the same underlying engine as Preview, whereas the DEVONthink Pro Office has its own algorithms. Of course, I already own Acrobat Pro 8 (CS3), so there's no real incentive for me to spend money on another product for this particular purpose—although of course DEVONthink Pro Office has several other intriguing features.

By: danridley

danridley — Thu, 19 Jul 2007 16:31:42 +0000

There's a lovely Mac program called DevonThink Pro Office which I use for document management; the Pro Office version includes an OCR engine that works for PDFs. Since John discovered the document in question, I went ahead and downloaded it and ran it through DT Pro Office's OCR engine. For what it's worth, it seems to have handled things perfectly. Without OCR, it behaved like Preview (I assume it uses PDFKit); after OCR, it handled text selection correctly (and, since it saves the PDF with the OCR data, so did Preview). If you run into this type of problem frequently, or have other reasons to run OCR on PDFs or scans, you might look at DevonThink.

By: danridley

danridley — Thu, 19 Jul 2007 16:24:08 +0000

I like your ideals, I just don’t think that can happen in the real world, that’s all :-)

The PDF spec can do everything you want, and PDFs generated with good tools can too. Over time, I reckon the average quality of PDFs will go up as people use better tools.

By: Pierre Igot

Pierre Igot — Thu, 19 Jul 2007 13:43:18 +0000

Dan: My point was that, whatever the PDF file format was initially designed for and regardless of how it is generated, it is now commonly used as a file format for sharing print-ready documents, and in many cases the sharing is not intended to preclude the recipient from using the document in a variety of ways, including selecting chunks of text to copy and reuse them elsewhere.

Today’s PDF viewing tools need to adapt not just to the evolution of the file format itself, but also to the evolution of its uses. If people perceive PDF as a suitable file format for sharing text files, then PDF viewers must also suit this purpose, whatever the technical challenges are. The actual, real-world uses of a technology are just as important as—if not more than—its originally intended uses.

By: danridley

danridley — Thu, 19 Jul 2007 00:07:39 +0000

Part of the issue is that PDF is such a broad category of files, that it’s challenging to make broad assertions about how PDF files should be handled. I expect that if a designer handed you an Adobe Illustrator .AI file that had been autogenerated from some other application, it would make sense to you that the text was oddly represented… but in a Distiller PDF, that’s pretty much what you’re being given. Likewise, if I gave you a JPEG, you would understand that you couldn’t copy and paste text out of it; and some PDFs are nothing more than JPEGs with page numbers. On the other hand, a well-made PDF that came from software with native support can give you perfect copy/pastable text as well as all sorts of other features.

Distiller is common, but it’s a hack. It takes PostScript data, usually from an application that thinks it’s sending data to a printer, and wraps it up in a PDF. This gives you cross-platform viewing and page numbers, but it can’t magic spaces into the text where there were none before; and Preview and Acrobat Reader alike are left guessing.

This is a case of poor tools (Distiller) giving you poor results (a badly formed PDF).

And actually I think your Web browser analogy is fairly apt — a poorly made Web site will often look good at first glance in all the browsers, but try to do something slightly less common — change text size, say — and you’ll find that the behavior across browsers suddenly deviates wildly. Likewise, a Distiller PDF gives you the core behavior — a file you can view and print cross platform — but go beyond the simplest interaction and you’re left dealing with software having to guess about the intent of some other piece of software, without having enough information to work from.

By: Pierre Igot

Pierre Igot — Wed, 18 Jul 2007 21:18:09 +0000

John, if you can host your pictures somewhere, you can always insert HTML tags referring to them in your comments.

As for the rest, of course these are all valid points you make. I am sure it is more complex than it looks like. It always is :-). And I truly appreciate your insights.

Ultimately, though, I am still hoping that, some day, somehow, our computers will become really good at what they are supposed to be good at, i.e. automating repetitive tasks where no human intervention should be required. As a professional Mac user, I cannot help but feel that the potential of computers in general (and of the Mac in particular) as professional tools is not fully tapped, for a variety of reasons (focus on “consumer-level” products, lack of resources, lack of real competition in the market, etc.). It is frustrating, and it is a frustration that I experience on a daily basis with a myriad of “smaller” issues such as the one that started this thread.

By: John Calhoun

John Calhoun — Wed, 18 Jul 2007 20:44:23 +0000

>> the rule needs to be able to “see” the space between words and to render
>> that space as a space char. If I can see it (and if an OCR program can see it
>> when scanning the text), then I expect Preview to be able to see it too.

PDF Kit does have an algorithm that tries to determine where spaces are to be inserted. I’m sure OCR software is much more complex, but what PDF Kit does is look at the gaps between characters and determine whether they are wide enough to warrant inserting a space. “Wide enough” is a complicated thing. Using the font size helps (a larger gap would be required for larger fonts, for example). The problem is that you can never get it right every time. If you’re too aggressive, you end up inserting spaces in the middle of words; if you’re not aggressive enough, you run words together. You can play with ratios all day and never get a magic number that works for all PDF’s.

To that end, for PDF Kit it was decided to go down this path as a last resort. In the example PDF we are talking about, there are spaces elsewhere in the PDF. In fact, but for the commas, the PDF is otherwise well-behaved. For this reason, PDF Kit is not aggressive with it. This is again, why I call it the worst kind of PDF — it half tries to do the right thing.
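To make the gap test described above concrete, here is a minimal sketch in Python. It is only an illustration, not PDF Kit's actual code: the `Glyph` type, the `extract_line` function and the 0.2 ratio are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str         # the character drawn
    x: float          # left edge of the glyph box, in points
    width: float      # width of the glyph box, in points
    font_size: float  # font size, in points


# Hypothetical ratio: a gap wider than this fraction of the font size
# is treated as a word space. No single value suits every PDF.
GAP_RATIO = 0.2


def extract_line(glyphs, gap_ratio=GAP_RATIO):
    """Join the glyphs of one line, inserting a space at each wide gap."""
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        # Only guess when neither neighbour is already a real space.
        if prev is not None and prev.char != " " and g.char != " ":
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * g.font_size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# "a" and "," touch each other; "b" starts 3 pt after the comma ends.
line = [
    Glyph("a", 0.0, 5.0, 10.0),
    Glyph(",", 5.0, 3.0, 10.0),
    Glyph("b", 11.0, 5.0, 10.0),
]
print(extract_line(line))                 # ratio 0.2: the 3 pt gap counts
print(extract_line(line, gap_ratio=0.5))  # ratio 0.5: it does not
```

With the default ratio the comma is followed by a space; with the less aggressive ratio the same glyphs run together, which is the trade-off described above.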

The other point is of course that you are a lot better than a computer will perhaps ever be at deciding where spaces should be by looking at a document. OCR software too would go the extra mile and use dictionary look-ups to decide the borderline cases.
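A dictionary look-up for a borderline gap could be sketched along these lines. This is an assumption about how such a check might work, not how any particular OCR product does it; the `WORDS` set and the `resolve_borderline` function are invented for the example.

```python
# Tiny stand-in word list; a real system would use a full dictionary.
WORDS = {"the", "cat", "document", "space"}


def resolve_borderline(left, right, words=WORDS):
    """Decide whether a borderline gap between two fragments is a space.

    The fragments are joined without a space only when the joined form
    is a known word and the two separate fragments are not both known
    words. In every other case the space is kept.
    """
    joined_ok = (left + right).lower() in words
    split_ok = left.lower() in words and right.lower() in words
    if joined_ok and not split_ok:
        return left + right
    return left + " " + right


print(resolve_borderline("docu", "ment"))
print(resolve_borderline("the", "cat"))
```

Pairs where both readings are valid words stay ambiguous, so a look-up like this narrows the borderline cases rather than eliminating them.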

I wish I could post images here. I could show you some interesting examples. I can show you this PDF with each character bounded (and you can see that there is a missing space). It would be interesting too to post a PDF with only character bounds (no text) and see if you can decide where the space should be.

Again though, back to Preview, if Adobe is doing it, PDF Kit can as well — but it’s going to take considerable resources.

>> My other point is that “bad” PDFs are only as bad as the programs used to
>> create them (and the ways that these programs are used) are bad. In this
>> case, the program used was Adobe Distiller 6 for Mac.

Distiller 6 is only half the problem (or solution perhaps). It can be boiled down to garbage in, garbage out.

Distiller 6 (and other PDF creation tools) present a PDF “context” for the client application to render into. Anything rendered into the context will be dutifully recorded in a PDF rendering stream.

The problem is when the client application doesn’t render the spaces into the context. No spaces in, no spaces out.

As an example, take a PostScript file. These are very typically devoid of spaces (why would you bother wasting bandwidth sending a space character to a printer that will ignore it anyway?). Run the PostScript file through Distiller 6 and see what comes out.

Don’t misunderstand my intent here – I’m not trying to be either defensive of Preview or critical of your comments. PDF’s are not a well understood format and so I’m trying to expose some of the challenges that they present and the causes of these challenges. (Okay, perhaps that is being a bit defensive of Preview.)

By: Pierre Igot

Pierre Igot — Wed, 18 Jul 2007 17:00:10 +0000

I cannot test Acrobat Distiller 6, but, in my experience, Acrobat Pro CS3 (version 8) produces PDFs that are similar to the ones produced by Mac OS X itself or by the Export command in InDesign CS3. It may be that there have been improvements since Distiller 6, but it could also be that many factors other than the actual software version are involved. Maybe it also has to do with the original application that sent the file to Distiller (Quark Xpress?), with the font choices, with the specific formatting options used for this particular document, etc.

Not every designer uses the latest version of everything, and not every designer (in my experience) does things in the “normal” / expected way. There are just too many possibilities here. We simply need PDF viewers/readers that are able to handle most PDF files properly, just like we need web browsers that are able to handle most web pages properly, regardless of how badly coded they are.

By: ssp

ssp — Wed, 18 Jul 2007 15:45:58 +0000

While Distiller may be a common tool, I can imagine that it’s not the best, and that, because of the way it works (converting files from PostScript), it may just not have the full information at hand for the job.

With ‘good’ PDF files in mind, Distiller (or at least an old Distiller?) may just not be the best solution for the problem. Systems that create PDF files directly may be preferable.

For example, the simple PDF files I just tested (created with TextEdit and the Print dialogue, and with pdfTeX, respectively) did reasonably well in terms of copy and paste.

By: Pierre Igot

Pierre Igot — Wed, 18 Jul 2007 13:19:12 +0000

My thinking is that, if there are no space chars in the PDF itself, then the algorithm must take into account the space between words and render it as space characters when selecting/copying text. There is clearly a space here in this particular PDF after the commas and semi-colons. There is no ambiguity about this, even for a “dumb” algorithm that doesn’t understand English.

So I wouldn’t say that the rule would be to always insert a space after commas and semi-colons (there are clearly cases where this shouldn’t happen, such as the use of the comma as a decimal separator in French and other languages), but rather that the rule needs to be able to “see” the space between words and to render that space as a space char. If I can see it (and if an OCR program can see it when scanning the text), then I expect Preview to be able to see it too.

My other point is that “bad” PDFs are only as bad as the programs used to create them (and the ways that these programs are used) are bad. In this case, the program used was Adobe Distiller 6 for Mac. Not exactly a rare/exotic PDF authoring tool… I expect Preview to be able to handle PDFs created by Adobe Distiller 6 for Mac, and I don’t think it’s an unreasonable expectation.

It would still be interesting to determine exactly what caused this particular PDF to be as “bad” as it is, but unfortunately, that’s probably beyond my capabilities.

By: John Calhoun

John Calhoun — Wed, 18 Jul 2007 02:37:18 +0000

>> I agree that this PDF is probably a “bad” PDF, but the reality is…

I agree with your point. PDF Kit does need to handle these cases – as you say, users have certain expectations that the stuff just works.

Earlier in the post and comments though there was some confusion as to whether the spaces were being ignored by Preview due to a bug (and why does it work for other PDF’s) and I wanted to make it clear to anyone reading that not all PDF’s are created equal (so to speak). Some PDF’s create more challenges than others (this would be one of the more challenging ones).

Sadly too, there isn’t anything really in the PDF spec that would suggest these problems and how to accommodate them. The reality is, someone has to physically come across one of the PDF’s in the wild and write up a bug.

You can imagine an obvious fix for this PDF would be to *always* insert a space after a comma (if no space is found – does that break with number sequences? – special case that…). Sadly though, another PDF appears one day with different problems that the “space after comma fix” doesn’t address. So you keep adding layers of rules….
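The “space after comma” fix, with the number special case, might look roughly like this in Python. It is a sketch under assumptions, not a proposed PDF Kit patch: the function name is made up, and extending the rule to semicolons follows the problem described in the original post.

```python
def add_space_after_separators(text):
    """Insert a space after each comma or semicolon that lacks one.

    Special case: a comma between two digits ("1,000" or "3,14") is
    left alone, since it is probably part of a number.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch in ",;" and i + 1 < len(text):
            nxt = text[i + 1]
            prev = text[i - 1] if i > 0 else ""
            in_number = ch == "," and prev.isdigit() and nxt.isdigit()
            if not nxt.isspace() and not in_number:
                out.append(" ")
    return "".join(out)


print(add_space_after_separators("red,green;blue"))
print(add_space_after_separators("1,000 items,each"))
```

Even this small rule has gaps: a comma directly followed by a closing quotation mark would get an unwanted space, which is the kind of extra layer the paragraph above warns about.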

I’ve certainly seen PDF’s so messed up that neither Preview nor Reader comes even close.

>> Thanks for filing it as a bug. I had already filed a bug report myself.

I should have waited … your bug just showed up (with a portion of the aforementioned PDF).

By: Pierre Igot

Pierre Igot — Wed, 18 Jul 2007 02:01:55 +0000

Good job with fishing out the original document! :-) I could have shared this information with you myself… As you found out, the document is in the public domain.

I agree that this PDF is probably a “bad” PDF, but the reality is that, just like badly coded web pages and badly formatted Word documents, we have to live with them, and software applications have to deal with them as well.

As you can easily imagine, there is not much I can do in my position about the situation. This particular document was probably contracted out to a graphic designer, and either the designer himself or the PDF authoring tool he used is responsible for this. And it’s not like it’s a major issue with the document that makes it unreadable or unprintable. It’s effectively one of those many more minor issues that professional Mac users have to deal with on a daily basis in a world dominated by badly designed software used by badly trained people.

Thanks for filing it as a bug. I had already filed a bug report myself.

By: John Calhoun

John Calhoun — Tue, 17 Jul 2007 21:27:10 +0000

I found the PDF file you refer to: “REFERENCE_GUIDE_FINAL_15_March.pdf”.

A quick Google search with some of the words you illustrated found it here:
http://www.dukeofed.org/ns/docs/REFERENCE_GUIDE_FINAL_15_March.pdf

So I checked it out.

In fact, there are physically no space characters in the PDF after the comma character in the example you cite (on page 5).

This makes the PDF in my mind of the worst kind. It has some spaces in the document but not everywhere…. This goes back to the algorithm point I was making earlier.

So, try not to beat up Preview too badly — perhaps take your ire out on the creator of the PDF (or perhaps the looseness of the PDF spec itself).

Again, I’m not denying that Adobe does a much better job with these edge cases (a richer algorithm to detect and insert spaces). To help improve Apple’s technology though, I’m going to go ahead and file a bug against PDF Kit and attach this PDF as a reference.

Perhaps a future Preview will better measure up to Adobe in these regards.

By: Pierre Igot

Pierre Igot — Tue, 17 Jul 2007 02:07:06 +0000

Thanks for the insights, John. What I do find interesting is that, when I use Mac OS X’s built-in PDF architecture to create a PDF file from, say, a Pages document, and then open it in Illustrator, most paragraphs of text consist of chunks of text that are usually whole lines. (There are a few exceptions here and there.) Yet in other PDF files, like this document produced by Distiller, the text is divided into much smaller chunks.

And if I take the same text and place it in an InDesign publication, and then export it as PDF, again I get much bigger chunks.

And again if I create a PDF from my Pages document with Distiller (CS3).

So it looks to me like the issue also has lots to do with the algorithms that are used to create the PDF files in the first place—and possibly also with various font settings. (For example, in the problem PDF mentioned above, there are lines where the space between chars—the tracking—is bigger, and these lines end up as individual characters in the PDF as exposed in Illustrator, whereas for the other lines in the same page, the chunks are bigger.)

Anyway, the basic issue here remains that Preview fails to “see” certain spaces (after commas and semi-colons) in this particular document, whereas Acrobat does. I don’t see why Preview can “see” the other spaces in the lines of text, but not the spaces after the commas and semi-colons. It looks like a bug to me, rather than just a slightly less effective algorithm.