Microsoft Word: Chris Pratley on content vs. formatting
Posted by Pierre Igot in: Technology, May 3rd, 2004 • 4:11 am
In his fourth post on Word’s development process, MS program manager Chris Pratley says the following:
A couple of people have asked about the permanence of electronic information and access to it in the future if it is in Word format. Microsoft takes this very seriously. That’s one of the reasons we make the format documentation available to governments and other institutions, so that there is no concern that they will not have the ability to access the information at a later date. Personally, I find this whole discussion a little bit overwrought though. If it is access to the content of a Word doc that is a concern, just about any word processor available today can import Word documents sufficiently that you can access their content. You don’t need a Microsoft product for that. The issues are usually around getting the formatting exactly right, not access to the content.
This is highly problematic. Elsewhere Chris makes a big deal of Word’s adoption of XML as a way to describe the structure of Word documents. So how can he now separate the “content” from the “formatting” and allege that the issue of access to archived Word stuff is not really serious because we’ll always have tools (provided either by MS or by third parties) to access the content, even if the formatting is not “exactly right”?
Personally, I didn’t wait for Microsoft to embrace XML before I started using formatting as more than superficial text “decoration”. When I use a character style called “emphasis” in my Word document, it’s not just a fancy way to change the text formatting to italics. It actually conveys essential information about the content of my document. It is therefore very important that this “formatting” be preserved.
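To make the distinction concrete, here is a minimal Python sketch. It does not reflect how Word actually stores documents. The `Run` class, the `to_plain_text` and `to_html` functions, and the style name "emphasis" are all hypothetical names chosen for illustration. It only shows that a "content only" import loses the emphasis, while an import that understands the named style can carry it over as meaning rather than as decoration.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class Run:
    """A stretch of text with optional formatting (hypothetical model)."""
    text: str
    style: Optional[str] = None  # named character style, e.g. "emphasis"
    italic: bool = False         # direct (manual) formatting


def to_plain_text(runs: List[Run]) -> str:
    """A 'content only' import: every style and format is dropped."""
    return "".join(run.text for run in runs)


def to_html(runs: List[Run]) -> str:
    """An import that maps the named style to semantic markup."""
    parts = []
    for run in runs:
        text = escape(run.text)
        if run.style == "emphasis":
            parts.append(f"<em>{text}</em>")  # meaning is preserved
        elif run.italic:
            parts.append(f"<i>{text}</i>")    # decoration only
        else:
            parts.append(text)
    return "".join(parts)


# The same sentence, with one word marked up with a named style.
runs = [Run("I did "), Run("not", style="emphasis"), Run(" say that.")]

plain = to_plain_text(runs)
html = to_html(runs)
```

In the plain-text version, nothing tells the reader which word carried the stress. In the second version, the stress survives even if the typeface, line breaks and page layout do not.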
Reader “Juan B.” does take Chris to task on this particular issue in the comments section of Chris’s blog entry, but Chris’s response still doesn’t address this major concern. He still says that his point was “in regards to archival access to information as opposed to daily use of the software”.
The fact that I used “formatting” to emphasize text in a Word document 10 years ago does not mean that it is no longer important for me to be able to access this Word document with the formatting intact! The emphasis is part of the “content” of the document and is just as important 10 years later as it was when I first used the software to create the document.
May 3rd, 2004 at 5:07 pm
I think you’re taking me a little too literally. I was not referring strictly to content (as in “plain text”) vs. formatting as in a computer science definition. As I mentioned in my post, you only have to open a Word doc in just about any capable word processor (non-Microsoft) to see that access to content and typical formatting is not a problem, even without documentation for the format publicly available or Microsoft code running. The issues (for any capable word processor) are relatively small things – a document might have an extra line, or a line might wrap in a different place, or a drop cap might appear in a slightly different location. These are big things to customers who want an exact reproduction, but not necessarily for access to content in the future, and won’t materially affect the meaning or interpretation of the content as in “emphasis” (for equivalents, think: yellowed newspaper, scratched parchment, etc.). Note that I am not talking about big problems, which are more due to limitations in whatever word processor you might use than about the file format (e.g. if your word processor doesn’t support Chinese, or three-column layout). Just in case that or even the slight formatting issues are a concern, the format documentation is available for governments and institutions to access if there is interest (many have accessed it).
These same issues exist for any format (e.g. HTML), since 100 years from now, we may not have “browsers” that use HTML as a format, so environments that can read HTML and display it have to be maintained, or you need to have a tool that can convert HTML into a format that the future viewers can understand – and that conversion will likely have a few small errors in it. Nothing really different here.
May 3rd, 2004 at 10:23 pm
Chris: Thanks for your reply. I think that things boil down to the following: My own experience with Word files from other people and with my own files is that Word in its various incarnations is not very good at preserving the integrity of people’s data, and cannot be trusted. It’s a rather complex issue, because there are several Word weaknesses involved.
One of them is document corruption (discussed elsewhere). You might be working on a document with section-based formatting for a while (nothing fancy, just different sections with changing headers or footers, etc.) — and then, all of a sudden, Word starts acting up, crashing when you access a certain section, mangling the headers/footers or page numbers, etc. It’s not just a matter of “getting the formatting exactly right”, it’s also a matter of guaranteeing that both the content and the formatting will always be preserved. When people create a PDF file of something, they are not afraid that, the next time they open the document, all of a sudden their application is going to start crashing on them or mangling the page numbers. Granted, PDF is not editable (per se), but you get the idea.
One other key weakness is that too many people still use manual formatting, because Microsoft has done a really poor job of encouraging users to use styles. And, as every MVP out there will agree, manual formatting is a really bad idea. It’s unreliable, it might contribute to document corruption (see above), and it really behaves very poorly when transferring files from user to user or blocks of text from document to document. The key benefit of having electronic files is that they can be reused. If I can open a Word file and look at it and hopefully print it — OK, that’s not too bad. But if I cannot take a paragraph from it and reuse it elsewhere without having to redo all the formatting, etc. then it becomes a serious issue.
There are other issues (automatic numbering/bullets is another one), but the bottom line is that, at this point, the level of trust is pretty low. And that includes what you call “typical formatting”, not just fancy little things like drop caps. I honestly don’t think any serious computer user uses Word to achieve “exact reproduction”. There are too many unpredictable behaviours in Word for this. But even within the range of “typical” things that a word processor should be able to handle gracefully and reliably, Word has serious problems.