If you have to extract information from Microsoft Excel workbooks, Microsoft PowerPoint presentations, or Microsoft Word documents, you can use several methods. These methods include API programming calls, Office Open XML, XML, RTF, or HTML. If these methods do not address your needs, you may be eligible to participate in a Royalty-Free File Format Program and to receive technical documentation for certain Microsoft Office binary file formats.
This article describes several techniques that are available
for extracting information from Excel workbooks, PowerPoint presentations, and
Word documents.
Office Open XML
The Office Open XML Formats are designed so that multiple applications on multiple platforms can create and access Office Open XML documents. By using the Office Open XML Format, you can directly manipulate the file format. You do not have to use Microsoft Office applications to create or to access the files.
Benefits of Office Open XML
- It is open. Office Open XML is openly licensed and documented. It is refined in the open Ecma process so that it works across a wide variety of platforms, applications, and usages.
- It is XML. Office Open XML is a standard technology that many tools and applications can easily and transparently use.
- It is backward compatible and interoperable. This enables you to preserve documents in their original form while they are converted to an open, modern format. Additionally, different applications can use the Office Open XML Format with predictable results.
- It works with what you have through custom XML schema support, through free updates for existing versions of Office, and through support of important accessibility functions for disabled workers.
- It is ready for the future. With Office Open XML, you can use all the features in the 2007 Microsoft Office programs to create documents. Office Open XML provides ways to subset or to extend these features while it maintains conformity.
- It can help improve security. IT security procedures and applications can more easily discover and fix potential problems, while documents are less likely to be corrupted.
For more information about the Office Open XML Format, read the Office Open XML v1.0 draft on the following Ecma International Web site:
Additionally, visit the following OpenXMLDeveloper.org Web site:
The Office Open XML Formats use the Open Packaging Conventions to store the Office Open XML file information on disk. For more information about the Open Packaging Conventions as used by Office Open XML, see the Office Open XML v1.0 draft, part 2, "Open Packaging Conventions".
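Because an Open Packaging Conventions package is stored on disk as a ZIP archive, a general-purpose ZIP library can open it without any Office application. The following Python sketch illustrates the idea under stated assumptions. It builds a tiny stand-in package in memory (not a complete Word document) and assumes that the main document part is named word/document.xml. That is the conventional location, but a robust reader would locate the part through the package relationships. The function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in the document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (not a complete .docx file)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Open </w:t></w:r><w:r><w:t>XML</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Assumption: the main part lives at word/document.xml.
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def extract_paragraphs(package_bytes: bytes) -> list[str]:
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t elements.
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


paragraphs = extract_paragraphs(build_sample_package())
print(paragraphs)
```

To try the same approach on a real document, pass the bytes of the file to extract_paragraphs instead of the in-memory sample.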
Office Application Programming Interfaces (APIs)
Office binary file formats are designed to be accessed through
the Office Application Programming Interfaces (APIs), instead of by direct
manipulation of the file format. Because of the complexity of the formats, direct
manipulation can cause corruption and is strongly discouraged.
For more information about the Office APIs, visit the following Microsoft Web
site:
The Office 97-2003 binary file formats use the Windows Structured Storage
APIs. The Office-specific information is stored as streams in this more
generalized format. Common elements, such as document properties, can be
accessed through the Structured Storage APIs and do not require access to the
Office binary file format documentation.
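Before choosing an extraction method, it can help to check which container a file uses. The following Python sketch is one possible way to do that, not an official method. It relies on the 8-byte signature that begins every compound file (the Structured Storage container) and on the fact that Office Open XML packages are ZIP archives. The function name and the return values are hypothetical. The check identifies only the container, not the Office program that created the file.

```python
import io
import zipfile

# The 8-byte signature at the start of a compound file (Structured Storage).
COMPOUND_FILE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_container(data: bytes) -> str:
    """Guess the container type from the raw bytes of a file."""
    if data[:8] == COMPOUND_FILE_SIGNATURE:
        return "structured-storage"
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip-package"
    return "unknown"


# Signature plus zero padding: enough for the check, but not a real document.
legacy_header = COMPOUND_FILE_SIGNATURE + bytes(504)
print(classify_container(legacy_header))
```

Reading the streams themselves, such as the document properties, still requires the Structured Storage APIs or a library that implements the compound file format.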
For more information about the Windows Structured Storage APIs, visit the
following Microsoft Web site:
The Microsoft Excel 2007 binary format (*.xlsb) stores binary records. This format uses the same part and packaging technologies that are found in SpreadsheetML. SpreadsheetML is part of the Office Open XML Format.
Important Reading or manipulating the structure directly can cause
corruption and is strongly discouraged.
XML
XML is a plain-text, Unicode-based metalanguage (a language for
defining markup languages). XML is not tied to any programming language,
operating system, or software vendor. XML provides access to a plethora of
technologies for manipulating, structuring, transforming, and querying data. As
the use of XML has grown, it has become generally accepted that XML is useful
not only for describing new document formats for the Web, but also for
describing structured data. Examples of structured data include information that
is typically contained in spreadsheets, program configuration files, and
network protocols.
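As an illustration of querying structured data that is stored as XML, the following Python sketch parses a small spreadsheet-like data set and selects the rows that meet a condition. The element names, the attribute names, and the data are invented for this example. They do not come from any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: each <item> plays the role of a spreadsheet row.
SAMPLE = """<inventory>
  <item sku="A-100"><name>Widget</name><quantity>4</quantity></item>
  <item sku="B-200"><name>Gadget</name><quantity>0</quantity></item>
  <item sku="C-300"><name>Sprocket</name><quantity>12</quantity></item>
</inventory>"""

root = ET.fromstring(SAMPLE)

# Keep only the items whose quantity is greater than zero.
in_stock = [
    (item.get("sku"), item.findtext("name"), int(item.findtext("quantity")))
    for item in root.findall("item")
    if int(item.findtext("quantity")) > 0
]
print(in_stock)
```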
Microsoft Office includes support for XML schemas.
Microsoft maintains a licensing program for certain Office XML
schemas.
To learn more about Office XML schemas, visit the following
Microsoft Web site to view the
Microsoft Office System and XML: Bringing XML to the Desktop article:
Rich Text Format (RTF)
The Rich Text Format (RTF) specification is a method of encoding
formatted text and graphics for easy transfer between programs. The RTF
specification provides a format for text and graphics interchange that can be
used with different output devices, operating environments, and operating
systems. RTF uses the American National Standards Institute (ANSI), PC-8,
Macintosh, or IBM PC character set to control the representation and the
formatting of a document, both on the screen and in print. With the RTF
specification, documents that are created under different operating systems and
that are created by using different software programs can be transferred
between those operating systems and those programs.
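To show how RTF mixes control information with text, the following Python sketch strips the markup from a small RTF string and keeps the plain text. It is a deliberately simplified illustration and not a conforming RTF reader. It assumes that hex-escaped characters use the Windows-1252 code page, it skips only a short hypothetical list of destination groups, and it ignores Unicode escapes, pictures, and fields. The function name and the sample text are invented for this example.

```python
import re

# One RTF token: a control word, a hex escape, a control symbol, a brace,
# or a run of plain text.
TOKEN = re.compile(
    r"\\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)? ?"  # control word, e.g. \fs24
    r"|\\'(?P<hex>[0-9a-fA-F]{2})"               # hex-escaped character
    r"|\\(?P<symbol>[^a-zA-Z])"                  # control symbol, e.g. \{ or \*
    r"|(?P<brace>[{}])"
    r"|(?P<text>[^\\{}]+)"
)

# Assumption: these destinations hold formatting data, not document text.
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info"}


def rtf_to_text(rtf: str) -> str:
    output = []
    group_stack = []   # saved "skipping" flag for each open group
    skipping = False   # True while inside a destination that is ignored
    for match in TOKEN.finditer(rtf):
        if match["brace"] == "{":
            group_stack.append(skipping)
        elif match["brace"] == "}":
            skipping = group_stack.pop() if group_stack else False
        elif skipping:
            continue
        elif match["word"]:
            word = match["word"]
            if word in SKIPPED_DESTINATIONS:
                skipping = True
            elif word in ("par", "line"):
                output.append("\n")
            elif word == "tab":
                output.append("\t")
        elif match["hex"]:
            # Assumption: the document uses the Windows-1252 code page.
            raw = bytes.fromhex(match["hex"])
            output.append(raw.decode("cp1252", errors="replace"))
        elif match["symbol"]:
            if match["symbol"] == "*":
                skipping = True   # ignorable destination
            elif match["symbol"] in "\\{}":
                output.append(match["symbol"])
        elif match["text"]:
            # Raw line breaks in RTF source carry no meaning.
            output.append(match["text"].replace("\r", "").replace("\n", ""))
    return "".join(output)


SAMPLE = (
    r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}"
    r"\f0\fs24 Quarterly \b results\b0\par "
    r"Caf\'e9 sales: 10\{approx\}}"
)
plain_text = rtf_to_text(SAMPLE)
print(plain_text)
```

The sketch tracks group nesting with a stack so that formatting tables such as the font table are skipped as a whole, while text in the body of the document is kept.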
For more information about how to write or how to implement a sample RTF
reader, visit the following Microsoft Web site, and then type RTF Reader in the
Search MSDN For box:
Visio XML schema
Through Microsoft documentation and a royalty-free license, customers and
partners can take advantage of the XML schema for Microsoft Visio, the
Microsoft diagramming and data visualization tool. The Visio schema provides a
complete and W3C-compliant description of the Visio Extensible Markup Language
(XML) file format. This enables organizations to access the information that is
captured in their Visio diagrams and to use it with other XML-enabled programs,
such as customer relationship management (CRM) and enterprise resource planning
(ERP) systems, as part of their business processes. For more information and to
download the schema, visit the following Microsoft Web site:
HTML
HTML files are text files that include the information that users
will see, and tags that specify formatting information about how the
information will be presented for display purposes. You can use HTML to store,
distribute, and present Office documents and data in a format that can be
viewed by using most Web browsers while retaining the rich content and
functionality of Office documents.
Note In Microsoft Excel 2007, the HTML file format does not save features that are specific to Excel. Additionally, the HTML format does not support or render all the features in Excel 2007 when you save a workbook as HTML.
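When a workbook or a document has been saved as HTML, the table data can be read with any HTML parser. The following Python sketch uses the standard library parser to collect the text of each table cell, row by row. The class name and the sample markup are hypothetical. HTML that an Office program saves typically contains much more markup, and nested tables are not handled here.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each <td> or <th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self._in_cell:
            self.rows[-1][-1] += data


SAMPLE = """<table>
<tr><th>Region</th><th>Sales</th></tr>
<tr><td>North</td><td><b>120</b></td></tr>
<tr><td>South</td><td>95</td></tr>
</table>"""

parser = TableTextParser()
parser.feed(SAMPLE)
parser.close()
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```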
For more information about how to
edit HTML, visit the following Microsoft Web site:
For more information about how to work with code, HTML, and
resource files, visit the following Microsoft Web site:
Royalty-Free File Format Programs
Microsoft Office Binary File Formats
Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and for forensic reference purposes.
Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program. The documentation that covers the binary file format specifications is cumulative and covers the most current form of the binary file formats as well as earlier versions.
Office Binary File Format specifications are available under the Open Specification Promise. To obtain documentation, visit the following Microsoft Web site: