So I'm a naive user of technology. (No. Really I am. Ask anyone that's worked with me.) I am definitely not an expert in modern XML document standards. (I have actually hacked troff escapes in a document production chain to insert commands in the PostScript output stream that would be recognized by the PDF generator to produce a hyper-linked document, so I know a little bit about the concepts involved, but that was also 10 years ago now.)
I have been a [marvelously happy] Mac user for the past two years. That means I already have iWork 2008 loaded with the new improved Pages '08 (the Apple word processor). On the Apple web site, if I search for "office open xml" then I end up on this page (31 Aug, 2007), which tells me all about Pages '08:
Widely compatible.
Pages ‘08 supports industry-standard formats, so you can easily open documents created in other word processing applications and share documents with others. Whether they’re using a Mac or a PC.
Open for business.
Import your Microsoft Word documents into Pages ’08 with ease. Whether they’re Microsoft Office 2007 (Office Open XML) or earlier Word files, Pages will open them. Pages imports not only the text, but also the styles, tables, inline and floating objects, charts, footnotes, endnotes, bookmarks, hyperlinks, lists, sections, change tracking, and other elements of your original Word document.
COOL! I'm in! This is awesome. I want to see how well I can read interesting docx files. As it happens, ECMA International makes the Office Open XML standard available as both PDF and as docx files. Clever — it's a document format standard see, and so they've provided it in its own format. Perfect.
So I download the .docx version of ECMA-376. All 5 parts of it. And I open "Part 1 - Fundamentals" and immediately get told some warnings occurred:
I choose to review and get:
The file mostly looks good, but not quite as clean as the PDF image with the other font (Consolas?). And clicking on the first warning (about the unsupported field) gives me NO additional information to understand what/where the error might be. Now this is what we in the standards industry call "a quality of implementation issue". Clearly Apple has not done a good job. Get used to hearing this phrase a lot in the press — I'm predicting Microsoft will be forced to apply it liberally to their partners that helped them win votes and helped with the marketing message.
Then I notice the paging problem. I have no idea why, but there seems to be page drift between the PDF and .docx versions. [More on THAT little problem in a minute.] The paging problem does NOT mean there's necessarily a problem with the standard itself but rather the document production machine ECMA was using — we don't know what the definitive source and tool chain was that produced the PDF. (Serious document production is the same as serious software production, something most word processor users fortunately don't get to experience.) Oh, and there are line numbers in the PDF that don't appear in the .docx as opened by Pages '08.
Ignoring the document version skew problem, I decide to see what happens when I throw an even bigger docx file at Pages '08. So I open "Part 3 - Primer" and ...
A few more "warnings" to deal with here. More missing font problems. Things were "removed". No helpful information as to what or how.
I asked a friend with Office 2007 to download and open the two .docx files. You guessed it — no warnings. So we're now on the slippery slope. Apparently I can create files in Office 2007 that Microsoft marketing claims are "standard" Office Open XML that may (or may not) use proprietary extensions. Or maybe Apple did a really bad job. How would a government customer interested in preserving documents know? But it gets worse. The Office 2007 pagination perfectly matched the PDF version. And there are line numbers in the Office 2007 version just like the PDF version.
Ooops.
I'm betting the average business or government office person saving a file won't think twice about it. You see, Office 2007 gives you no way to save something as "strict" Office Open XML. Not just not by default; not at all. Microsoft's definition of "Office Open XML" appears to be .docx itself.
Indeed, even Apple's Pages '08 will only EXPORT to the old Microsoft Office format (.doc) and not the standard Office Open XML (.docx) format. So I appear to have no way to generate an OOXML file from Pages '08. [Yes, yes, yes: Microsoft will again point out it's a quality of implementation problem. Or they'll point out that Pages '08 is a "consumer" of OOXML only, which is allowed by the standard. I get it. It's not Microsoft's fault. I'm beginning, however, to wonder at the quality of implementation on the Novell platform. There's a business partnership under duress.]
So as an adjudicated monopoly of desktop operating systems, supplying an office productivity suite with 95+% market share, they will be able to claim instant victory for the adoption of their international standard because .docx files equal Office Open XML standard files. Oh, wait — that's what was essentially done in the IDC study published this week that was "sponsored by Microsoft".
[Now we're about to get a wee bit tedious and exact as standards wonks are prone to be. I'm going to try to explain the conformance game. It can be subtle. Apologies in advance for perhaps getting too ... well boring. If you're not interested in standards mechanics, you can safely stop reading.]
So OOXML defines a couple of types of conformance. There is Document Conformance, and Application Conformance. And conforming applications can be producers (i.e. OOXML document writers) or consumers (i.e. OOXML document readers) or both. Here's the text from the standard [Part 1, PDF edition, p. 3, lines 8-30]:
2.3 What this Standard Specifies
To address the issues listed above, this Standard constrains both syntax and semantics, but it is not intended to predefine application behavior. Therefore, it includes, among others, the following three types of information:
- Schemas and an associated validation procedure for validating document syntax against those schemas. (The validation procedure includes un-zipping, locating files, processing the extensibility elements and attributes, and XML Schema validation.)
- Additional syntax constraints in written form, wherever these constraints cannot feasibly be expressed in the schema language.
- Descriptions of element semantics. The semantics of an element refers to its intended interpretation by a human being.
2.4 Document Conformance
Document conformance is purely syntactic; it involves only Items 1 and 2 in §2.3 above.
- A conforming document shall conform to the schema (Item 1) and any additional syntax constraints (Item 2).
- The document character set shall conform to the Unicode Standard and ISO/IEC 10646-1, with either the UTF-8 or UTF-16 encoding form, as required by the XML 1.0 standard.
- Any XML element or attribute not explicitly included in this Standard shall use the extensibility mechanisms described by Parts 4 and 5 of this Standard.
2.5 Application Conformance
Application conformance is purely syntactic; it also involves only Items 1 and 2 in §2.3 above.
- A conforming consumer shall not reject any conforming documents of the document type (§4) expected by that application.
- A conforming producer shall be able to produce conforming documents.
This is the traditional way things are done with programming language standards as well. The ISO/ANSI C standard defines the concept of a strictly conforming C-language program so that it can then define conformance of an actual implementation (i.e. a C-language compiler). In the OOXML standard, document conformance exists to be able to talk about implementation conformance, i.e. what readers/writers need to produce or accept if they conform to the standard.
For completeness' sake, the "document type" reference in 2.5 above is described in section §4 as [Part 1, PDF Edition, p. 6, lines 16-26]:
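[For the code-minded, here is a rough Python sketch of the first, purely mechanical steps of the validation procedure quoted in §2.3: un-zip the package, locate the XML parts, and check the character-set rule from §2.4. It is my own illustration, not anything from ECMA. The function names are made up, the two "required" package items are my assumption about what any OOXML package carries, and it stops well short of the XML Schema validation and extensibility processing, so passing it says nothing about actual conformance.]

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Encoding forms permitted by the quoted §2.4 text.
ALLOWED_ENCODINGS = {"UTF-8", "UTF-16"}


def declared_encoding(data: bytes) -> str:
    """Best-effort guess at the encoding an XML part declares (hypothetical helper)."""
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):  # UTF-16 byte-order mark
        return "UTF-16"
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte-order mark
        data = data[3:]
    match = re.match(rb"<\?xml[^>]*?encoding=[\"']([A-Za-z0-9._-]+)[\"']", data)
    # XML 1.0 falls back to UTF-8 when no encoding is declared.
    return match.group(1).decode("ascii").upper() if match else "UTF-8"


def precheck_package(source) -> list[str]:
    """Return the problems found; an empty list only means this pre-check passed."""
    problems = []
    with zipfile.ZipFile(source) as package:  # step 1: un-zip
        names = package.namelist()
        # Step 2: locate files. These two item names are an assumption of this sketch.
        for required in ("[Content_Types].xml", "_rels/.rels"):
            if required not in names:
                problems.append(f"missing package item: {required}")
        for name in names:
            if not name.endswith((".xml", ".rels")):
                continue
            data = package.read(name)
            encoding = declared_encoding(data)
            if encoding not in ALLOWED_ENCODINGS:
                problems.append(f"{name}: encoding {encoding} is not UTF-8 or UTF-16")
            try:
                ET.fromstring(data)  # well-formedness only, NOT schema validation
            except ET.ParseError as error:
                problems.append(f"{name}: not well-formed XML ({error})")
    return problems
```

Calling `precheck_package()` on a path or file-like object returns a list of complaints. Real validation would still have to run every part against the published schemas and the written constraints, which is exactly the part a customer cannot eyeball.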
document type — One of the three types of Office Open XML documents: Wordprocessing, Spreadsheet, and Presentation, defined as follows:
- A document whose package-relationship item contains a relationship to a Main Document part (§11.3.10) is a document of type Wordprocessing.
- A document whose package-relationship item contains a relationship to a Workbook part (§12.3.23) is a document of type Spreadsheet.
- A document whose package-relationship item contains a relationship to a Presentation part (§13.3.6) is a document of type Presentation.
An Office Open XML document can contain one or more embedded Office Open XML packages (§15.2.10) with each embedded package having any of the three document types. However, the presence of these embedded packages does not change the type of the document.
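[Again for the code-minded: the document-type rule above can be sketched in a few lines of Python. This is my own reading, not text from the standard. I am assuming the package-relationship item lives at `_rels/.rels`, that the main part is the target of the relationship whose type ends in `/relationships/officeDocument`, and that the three content-type strings below identify the Main Document, Workbook, and Presentation parts. Templates and macro-enabled variants use other content types and are simply reported as unknown here.]

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
TYPES_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
# Matching on the suffix is a simplification made for this sketch.
OFFICE_DOCUMENT_SUFFIX = "/relationships/officeDocument"

# Content types of the three main parts (assumed mapping, see the note above).
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document.main+xml": "Wordprocessing",
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet.main+xml": "Spreadsheet",
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation.main+xml": "Presentation",
}


def document_type(source):
    """Return 'Wordprocessing', 'Spreadsheet', 'Presentation', or None if unknown."""
    with zipfile.ZipFile(source) as package:
        rels = ET.fromstring(package.read("_rels/.rels"))
        types = ET.fromstring(package.read("[Content_Types].xml"))

    # Content types are declared per part (Override) or per file extension (Default).
    overrides = {
        item.get("PartName"): item.get("ContentType")
        for item in types.iter(TYPES_NS + "Override")
    }
    defaults = {
        item.get("Extension", "").lower(): item.get("ContentType")
        for item in types.iter(TYPES_NS + "Default")
    }

    for rel in rels.iter(RELS_NS + "Relationship"):
        if not rel.get("Type", "").endswith(OFFICE_DOCUMENT_SUFFIX):
            continue
        part_name = "/" + rel.get("Target", "").lstrip("/")
        content_type = overrides.get(part_name)
        if content_type is None:
            extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
            content_type = defaults.get(extension)
        # Only the package-level relationship is consulted, so embedded
        # packages never change the answer, as the quoted text requires.
        return MAIN_PART_TYPES.get(content_type)
    return None
```

Note what this does and does not tell you: it names the document type a conforming consumer is expected not to reject, and nothing more. It says nothing about whether the document itself conforms.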
Now there is no statement of conformance to Office Open XML on the Apple web site beyond the above statement of "support". A search in Pages '08 Help for "office open xml" finds no reference at all. So Apple appears not to actually claim conformance to the OOXML standard anywhere. They simply "support" it. So they're not really guilty of not reading a conforming OOXML document.
But the Microsoft standards and marketing machines are claiming "support" for their standard with the assured tones that "support" = "conformance". Aside from the successful "adoption" claims in the aforementioned IDC report (where Office 2007 market share apparently equates to Office Open XML adoption) we have Tom Robertson (Microsoft General Manager of Standards and Interoperability) "citing support in products from Novell, Corel, Apple and others." Disingenuous at best.
Jason Matusow points out on his blog:
A real litmus test for the viability of the ISO/IEC DIS (draft international standard) 29500 (Open XML) is whether or not there are independent implementations. The answer to this question for Open XML is an unequivocal yes. There are independent Open XML implementations based on the existing specification in applications that run on Linux, Mac, Palm OS, iPhone, and Windows.
Again note the complete lack of reference to actual conformance per the definitions in the standard they have driven through the process. These are the people that are responsible for standards management and messaging at Microsoft. They are by definition the folks that should be defending the strict conformance of the standards in which they participate, and not merely suggesting that partial implementations are a "great start".
So where does this leave the government customer that thought they were buying an open document format for document exchange and interop? It is indeed finally time to roll out the certification machine — for everybody. Let the games continue.
Great analysis, especially about the document-production tool and transformation chain.
I suspect that font changes will change pagination, as will differences in printer metrics (with Office 200x) and these are probably beyond the scope of OOXML. The early ECMA drafts were stricter on conformance, as I recall, and the room for greater variation came later (it is now closer to the ultra-loose ODF language), as it usually does.
It is important to raise attention on the difficulty of interchange fidelity and what that may mean (versus be assumed/presumed) in one case or another.
In fairness, it would be good to conduct this same experiment with two ODF-supporting desktop products (OpenOffice.org claims ODF "support," not conformance). I think this is going to be eye-opening all the way around and should also calibrate people's expectations about whatever "fidelity" translators will be able to preserve and what round-tripping is unlikely to accomplish (although OOXML has built-in support for roundtripping and OpenOffice.org takes advantage of ODF alternate-rendition roundtripping provisions when interchanging to and from Microsoft Office formats).
Of course, neither ODF nor OOXML specify presentation fidelity, unlike Adobe Postscript (and Microsoft XPS) and the much-ignored ODA (a genuine Open Document Architecture scheme with allowance for layout fidelity).
To be clear: I think this is a great analysis. The next step is to point out that all of the current standardization efforts for office-document formats suffer this problem with regard to the difficulty of fidelity preservation across implementations. This needs to be understood much more broadly. It is critical for understanding of collaboration, interchange, and preservation prospects moving into a document-standards based future.
I have been waiting to see how government procurement agencies learn to qualify products and also see what happens when these practical difficulties are recognized. As far as I can tell in the Massachusetts poster-child case, ODF has simply come to mean whatever OpenOffice.org does, much as ANSI COBOL became whatever a particular IBM compiler did. I thought we'd do better here, but apparently not.
Posted by: orcmid | 01 September 2007 at 08:13
Oh, a PS: You said "[Microsoft] are by definition the folks that should be defending the strict conformance of the standards in which they participate, and not merely suggesting that partial implementations are a 'great start'."
Once we move into the standards world, there might be competition around who is more compliant than who else, but it is no longer Microsoft's job. Turning governance over to a standards body relieves them of any ability to enforce compliance (e.g., the way Sun did with Java licenses).
Likewise, the standards bodies eschew enforcement and for the most part certification/qualification of vendor offerings. It is going to come down to procurement practices and any third-party arrangements for certification of the conformance of a product. Test suites will be good, although they might not deal with presentation fidelity issues in these cases. Introduction of NIST into this process would be useful, although my sense is that NIST has been rather defanged and defunded in this area over the past several years.
Vendors make the claims vendors make unless there is some penalty for their ingenuity. How many ODF-compliant products are there and what does it mean to say/claim that?
This is important, and it is important for all efforts to adopt standardized document formats. There are important lessons here.
Posted by: orcmid | 01 September 2007 at 08:22
Morning orcmid: Thanks for the excellent commentary. I completely agree that those with the economic need for certification need to put the model in place that works for them.
NIST did that for the U.S. for a long time for U.S. government procurement, and the commercial world gained the benefits as well. (I believe NIST was castrated when the head of NIST became a presidential appointment instead of the traditional civil service position. NIST's mission then began tracking White House policy instead of the boring stuff you want it to do with your tax dollars.)
I started on this meme of certification a while ago here:
Conformance and Certification: The ODF Standard and Microsoft's Office Open XML Specification
Posted by: stephe | 01 September 2007 at 11:38
Stephen,
Really interesting analysis. When the 'softies started crowing about how Apple was "supporting OOXML", I challenged them that I didn't really consider an import-only ability as even worthy of being labeled "support".
See comments: notes2self.net/archive/2007/08/14/iwork-08-supports-openxml.aspx
Their response (from Brian Jones): "I would think import is more challenging than export but I guess it really depends on how your application is modeled. ... I think the reason you see import built first isn't as much around difficulty as it is around scenarios."
Posted by: Ed Brill | 02 September 2007 at 04:04
I downloaded and somewhat forcefed Novell's Office Open XML Plugin to my Mandriva 2k7 OpenOffice.org 2.x install. I was pleased to see that it did at least appear in the Save As ... menu entry.
I have tried to open bona fide TC45*.docx files, with as much luck as you - OO.org tells me that the file is corrupt and I should allow OO.org to fix it for me. I don't think so ... I also saved an "almost-throw-away" half-started novella as odf, docx, sxw, and [MSO]xml, copied them to zip so that konqueror knew what to do with them, and opened them.
The docx file is indistinguishable from the odf one and the sxw one.
So, either the Novell docx isn't working as an import filter, or it isn't connecting in any meaningful way to my version of OO.org. I don't know which.
If I could get a meaningful response to my application for a MS Office 2k7 Trial Edition serial number from Microsoft - ie, one that recognizes that someone who is sent there with the APC reader's reference, is by definition not in the US of A - I would find it worth testing further, to see just what it is - I've also got a copy of Novell SLED 10.2, and so should be able to see if saving as docx results in a file that is indistinguishable from a file saved as odf ...
But Microsoft being Microsoft, I doubt that they'll allow me this test - I made two posts about this sort of thing on Brian Jones' blog, and the second one got censored.
Posted by: Wesley Parish | 03 September 2007 at 04:34
One thing I like about iWork is that it gives a list of "potential problems" during format conversion, which is better than the vague warnings given by OpenOffice.org and MSOffice.
The font problem is entrenched in ALL document formats, not OOXML alone. If you do not have the font on your computer, rendering will suffer. That's why, like "number of zeros after decimal point", I normally do not put fonts in the list of "must haves" in a document conversion process. Unfortunately, OOXML justifies its existence by wanting to faithfully represent all of Microsoft's document formats to date, which to me means getting applications to render the document exactly. That puts the "font" and "decimal point" issues into the "must have" list... and then it fails to deliver.
Posted by: Wu MingShi | 03 September 2007 at 10:10
@Stephen: Thanks for the link to your January post. We think much alike on this aspect. I will hone some sort of post about it eventually.
@Wesley: I'm not sure what the deal was with Brian Jones, but I can probably do a verification for you. It may be that the Novell plug-in is vetted with the Windows version of Novell's OO.o distro.
Here's what I can provide if you want to do some confirmation testing: On my Windows XP SP2 Machine, I have Office 2003 (SP2 not SP3 yet) with the Office Compatibility Kit and I work in OOXML almost exclusively now. I also have OO.o installed on that system and I will upgrade to the new OO.o 2.3 which I have just downloaded. I don't think the Sun translator comes with it. I don't remember why I decided not to install the Sun Translator beta, but if I do it goes here. On my Tablet PC I have Vista and Office 2007 and I can install the Novell OO.o here. I have the distro and their plug-in, though I should check for any updates before installing. I think I would install the Microsoft-sponsored translator here.
I'm willing to do reasonable experiments with this configuration and try various tests, and also report the results/problems to the appropriate parties. I just want to be careful of the mix and not destabilize anything very much. You (and anyone else interested in this kind of activity) can contact me at my e-mail address (see the contact information on my blog). Be sure to put ODF: or OOXML: in the subject so I will catch it when reviewing my junk mail folder for legitimate mail from unknown addresses.
Posted by: orcmid | 20 September 2007 at 14:45