Posted to commits@tika.apache.org by ni...@apache.org on 2014/04/01 17:01:41 UTC
svn commit: r1583700 - in /tika/site/src/site/apt/1.6: ./ formats.apt
Author: nick
Date: Tue Apr 1 15:01:41 2014
New Revision: 1583700
URL: http://svn.apache.org/r1583700
Log:
Start on the 1.6 supported formats document, to avoid us forgetting to update it with the new formats when we release
Added:
tika/site/src/site/apt/1.6/
tika/site/src/site/apt/1.6/formats.apt
Added: tika/site/src/site/apt/1.6/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.6/formats.apt?rev=1583700&view=auto
==============================================================================
--- tika/site/src/site/apt/1.6/formats.apt (added)
+++ tika/site/src/site/apt/1.6/formats.apt Tue Apr 1 15:01:41 2014
@@ -0,0 +1,211 @@
+ --------------------------
+ Supported Document Formats
+ --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+ This page lists all the document formats supported by Apache Tika 1.6.
+ Follow the links to the various parser class javadocs for more detailed
+ information about each document format and how it is parsed by Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {HyperText Markup Language}
+
+ The HyperText Markup Language (HTML) is the lingua franca of the web.
+ Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+ library to support virtually any kind of HTML found on the web.
+ The output from the
+ {{{api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+ is guaranteed to be well-formed and valid XHTML, and various heuristics
+ are used to prevent things like inline scripts from cluttering the
+ extracted text content.
+
+* {XML and derived formats}
+
+ The Extensible Markup Language (XML) format is a generic format that can
+ be used for all kinds of content. Tika has custom parsers for some widely
+ used XML vocabularies like XHTML, OOXML and ODF, but the default
+ {{{api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+ class simply extracts the text content of the document and ignores any XML
+ structure. The only exceptions to this rule are Dublin Core metadata
+ elements that are used for the document metadata.
+
+* {Microsoft Office document formats}
+
+ Microsoft Office and some related applications produce documents in the
+ generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+ older OLE 2 format was introduced in Microsoft Office version 97 and was
+ the default format until Office version 2007 introduced the new XML-based
+ OOXML format. The
+ {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+ and
+ {{{api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+ classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+ text and metadata extraction from both OLE2 and OOXML documents.
+
+* {OpenDocument Format}
+
+ The OpenDocument format (ODF) is used most notably as the default format
+ of the OpenOffice.org office suite. The
+ {{{api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+ class supports this format and the earlier OpenOffice 1.0 format on which
+ ODF is based.
+
+* {iWorks document formats}
+
+ The various iWork document formats (Numbers, Pages, Keynote) are supported
+ by the
+ {{{api/org/apache/tika/parser/iwork/IWorkPackageParser.html}IWorkPackageParser}}
+ class, which extracts text and metadata.
+
+* {Portable Document Format}
+
+ The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+ parses Portable Document Format (PDF) documents using the
+ {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+ The {{{api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+ supports the Electronic Publication Format (EPUB) used for many digital
+ books.
+
+* {Rich Text Format}
+
+ The {{{api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+ uses the standard javax.swing.text.rtf feature to extract text content
+ from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+ Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+ library to support various compression and packaging formats. The
+ {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+ class and its subclasses first parse the top level compression or
+ packaging format and then pass the unpacked document streams to a
+ second parsing stage using the parser instance specified in the
+ parse context. Formats supported include Tar, CPIO, Zip and 7Zip.
+
+* {Text formats}
+
+ Extracting text content from plain text files seems like a simple task
+ until you start thinking of all the possible character encodings. The
+ {{{api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+ encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+ project to automatically detect the character encoding of a text document.
+
+* {Feed and Syndication formats}
+
+ The {{{api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} class
+ supports the RSS and Atom feed syndication formats.
+
+* {Help formats}
+
+ The {{{api/org/apache/tika/parser/chm/ChmParser.html}ChmParser}} class
+ supports the CHM Help format.
+
+* {Audio formats}
+
+ Tika can detect several common audio formats and extract metadata
+ from them. Even text extraction is supported for some audio files that
+ contain lyrics or other textual content. Extracted metadata includes
+ sampling rates, channels, format information, artists, titles etc. The
+ {{{api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+ and {{{api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+ classes use standard javax.sound features to process simple audio
+ formats. The
+ {{{api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+ adds support for the widely used MP3 format, and the
+ {{{api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class
+ provides it for MP4 audio. The Ogg family of audio formats (Vorbis,
+ Speex, Opus, FLAC etc.) is supported by the
+ {{{api/org/gagravarr/tika/VorbisParser.html}VorbisParser}},
+ {{{api/org/gagravarr/tika/OpusParser.html}OpusParser}},
+ {{{api/org/gagravarr/tika/SpeexParser.html}SpeexParser}} and
+ {{{api/org/gagravarr/tika/FlacParser.html}FlacParser}}
+ classes.
+
+* {Image formats}
+
+ The {{{api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+ class uses the standard javax.imageio feature to extract simple metadata
+ from image formats supported by the Java platform, such as PNG, GIF
+ and BMP. More complex image metadata is available through the
+ {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} and
+ {{{api/org/apache/tika/parser/image/TiffParser.html}TiffParser}} classes,
+ which use the metadata-extractor library to support Exif metadata
+ extraction from JPEG and TIFF images. The
+ {{{api/org/apache/tika/parser/image/PSDParser.html}PSDParser}} class
+ extracts metadata from PSD images.
+
+* {Video formats}
+
+ Tika supports the Flash video format using a simple parsing algorithm
+ implemented in the
+ {{{api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
+
+ The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported
+ by the {{{api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class,
+ which extracts metadata on the video, along with the audio stream
+ (if present).
+
+ For the Ogg family of video formats, a limited amount of metadata is
+ extracted by the
+ {{{api/org/gagravarr/tika/OggParser.html}OggParser}} class.
+
+* {Java class files and archives}
+
+ The {{{api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
+ extracts class names and method signatures from Java class files, and
+ the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
+ also supports JAR archives.
+
+* {Source code}
+
+ The {{{api/org/apache/tika/parser/code/SourceCodeParser.html}SourceCodeParser}} class
+ handles a number of source code formats, including Java, C, C++ and Groovy.
+ It provides a formatted form of the code, along with some simple metadata.
+
+* {Mail formats}
+
+ The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+ extract email messages from the mbox format used by many email archives
+ and Unix-style mailboxes.
+
+ The {{{api/org/apache/tika/parser/mbox/PSTParser.html}PSTParser}} can
+ extract email messages from the Microsoft Outlook PST email format.
+
+* {CAD formats}
+
+ The {{{api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
+ extract simple metadata from the DWG CAD format.
+
+* {Font formats}
+
+ The {{{api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}}
+ class can extract simple metadata from the TrueType font format.
+ The {{{api/org/apache/tika/parser/font/AdobeFontMetricParser.html}AdobeFontMetricParser}}
+ class does something similar for Adobe Font Metrics files.
+
+* {Executable programs and libraries}
+
+ The {{{api/org/apache/tika/parser/executable/ExecutableParser.html}ExecutableParser}} can
+ extract metadata information on platforms, architectures and types from a range
+ of executable formats and libraries, such as Windows Executables and Linux / BSD
+ programs and libraries.
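
The "Image formats" entry in the page above says that ImageParser relies on the standard javax.imageio feature to pull simple metadata out of formats such as PNG, GIF and BMP. The following is a minimal sketch of what that JDK feature offers on its own; it is not Tika's code. The class name ImageMetadataSketch and its two helper methods are hypothetical, chosen only for this illustration, and the property names "format", "width" and "height" are assumptions rather than Tika metadata keys.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical illustration class; not part of Apache Tika.
class ImageMetadataSketch {

    /** Reads basic image properties through javax.imageio without decoding the pixels. */
    public static Map<String, String> describe(byte[] imageBytes) throws IOException {
        Map<String, String> properties = new LinkedHashMap<>();
        try (ImageInputStream input =
                new MemoryCacheImageInputStream(new ByteArrayInputStream(imageBytes))) {
            // Ask the JDK which installed reader recognises this stream.
            Iterator<ImageReader> readers = ImageIO.getImageReaders(input);
            if (!readers.hasNext()) {
                throw new IOException("No javax.imageio reader recognises this data");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(input);
                // Property names below are assumptions made for this sketch.
                properties.put("format", reader.getFormatName());
                properties.put("width", Integer.toString(reader.getWidth(0)));
                properties.put("height", Integer.toString(reader.getHeight(0)));
            } finally {
                reader.dispose();
            }
        }
        return properties;
    }

    /** Builds a small blank PNG in memory so the example needs no external file. */
    public static byte[] samplePng(int width, int height) throws IOException {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream output = new MemoryCacheImageOutputStream(bytes)) {
            if (!ImageIO.write(image, "png", output)) {
                throw new IOException("No PNG writer available");
            }
        } // closing the stream flushes the encoded data into 'bytes'
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describe(samplePng(4, 3)));
    }
}
```

The sketch asks the reader for dimensions only, so no pixel data is decoded. A real parser would also have to map such values onto its own metadata model, which is outside the scope of this illustration.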