Posted to commits@tika.apache.org by ju...@apache.org on 2011/10/11 00:12:20 UTC
svn commit: r1181271 [1/3] - in /tika/site/src/site/apt: ./ 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/
Author: jukka
Date: Mon Oct 10 22:12:19 2011
New Revision: 1181271
URL: http://svn.apache.org/viewvc?rev=1181271&view=rev
Log:
site: Add svn:eol-style settings
Modified:
tika/site/src/site/apt/0.10/detection.apt (props changed)
tika/site/src/site/apt/0.10/formats.apt (props changed)
tika/site/src/site/apt/0.10/gettingstarted.apt (props changed)
tika/site/src/site/apt/0.10/index.apt (props changed)
tika/site/src/site/apt/0.10/parser.apt (props changed)
tika/site/src/site/apt/0.10/parser_guide.apt (props changed)
tika/site/src/site/apt/0.5/documentation.apt (contents, props changed)
tika/site/src/site/apt/0.5/formats.apt (contents, props changed)
tika/site/src/site/apt/0.5/gettingstarted.apt (contents, props changed)
tika/site/src/site/apt/0.5/index.apt (contents, props changed)
tika/site/src/site/apt/0.6/formats.apt (contents, props changed)
tika/site/src/site/apt/0.6/gettingstarted.apt (contents, props changed)
tika/site/src/site/apt/0.6/index.apt (contents, props changed)
tika/site/src/site/apt/0.6/parser.apt (contents, props changed)
tika/site/src/site/apt/0.7/detection.apt (contents, props changed)
tika/site/src/site/apt/0.7/formats.apt (contents, props changed)
tika/site/src/site/apt/0.7/gettingstarted.apt (contents, props changed)
tika/site/src/site/apt/0.7/index.apt (contents, props changed)
tika/site/src/site/apt/0.7/parser.apt (contents, props changed)
tika/site/src/site/apt/0.7/parser_guide.apt (contents, props changed)
tika/site/src/site/apt/0.8/detection.apt (props changed)
tika/site/src/site/apt/0.8/formats.apt (props changed)
tika/site/src/site/apt/0.8/gettingstarted.apt (props changed)
tika/site/src/site/apt/0.8/index.apt (props changed)
tika/site/src/site/apt/0.8/parser.apt (props changed)
tika/site/src/site/apt/0.8/parser_guide.apt (props changed)
tika/site/src/site/apt/0.9/detection.apt (props changed)
tika/site/src/site/apt/0.9/formats.apt (props changed)
tika/site/src/site/apt/0.9/gettingstarted.apt (props changed)
tika/site/src/site/apt/0.9/index.apt (props changed)
tika/site/src/site/apt/0.9/parser.apt (props changed)
tika/site/src/site/apt/0.9/parser_guide.apt (props changed)
tika/site/src/site/apt/1.0/parser_guide.apt (props changed)
tika/site/src/site/apt/download.apt (contents, props changed)
Propchange: tika/site/src/site/apt/0.10/detection.apt
------------------------------------------------------------------------------
svn:eol-style = native
Propchange: tika/site/src/site/apt/0.10/formats.apt
------------------------------------------------------------------------------
svn:eol-style = native
Propchange: tika/site/src/site/apt/0.10/gettingstarted.apt
------------------------------------------------------------------------------
svn:eol-style = native
Propchange: tika/site/src/site/apt/0.10/index.apt
------------------------------------------------------------------------------
svn:eol-style = native
Propchange: tika/site/src/site/apt/0.10/parser.apt
------------------------------------------------------------------------------
svn:eol-style = native
Propchange: tika/site/src/site/apt/0.10/parser_guide.apt
------------------------------------------------------------------------------
svn:eol-style = native
Modified: tika/site/src/site/apt/0.5/documentation.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/documentation.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/documentation.apt (original)
+++ tika/site/src/site/apt/0.5/documentation.apt Mon Oct 10 22:12:19 2011
@@ -1,227 +1,227 @@
- -------------------------
- Apache Tika Documentation
- -------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Apache Tika Documentation
-
- This document describes the key abstractions and usage of Apache Tika.
-
-The Parser interface
-
- The {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
- interface is the key concept of Apache Tika. It hides the complexity of
- different file formats and parsing libraries while providing a simple and
- powerful mechanism for client applications to extract structured text
- content and metadata from all sorts of documents. All this is achieved
- with a single method:
-
----
-void parse(InputStream stream, ContentHandler handler, Metadata metadata)
- throws IOException, SAXException, TikaException;
----
-
- The <<<parse>>> method takes the document to be parsed and related metadata
- as input and outputs the results as XHTML SAX events and extra metadata.
- The main criteria that lead to this design were:
-
- [Streamed parsing] The interface should require neither the client
- application nor the parser implementation to keep the full document
- content in memory or spooled to disk. This allows even huge documents
- to be parsed without excessive resource requirements.
-
- [Structured content] A parser implementation should be able to
- include structural information (headings, links, etc.) in the extracted
- content. A client application can use this information for example to
- better judge the relevance of different parts of the parsed document.
-
- [Input metadata] A client application should be able to include metadata
- like the file name or declared content type with the document to be
- parsed. The parser implementation can use this information to better
- guide the parsing process.
-
- [Output metadata] A parser implementation should be able to return
- document metadata in addition to document content. Many document
- formats contain metadata like the name of the author that may be useful
- to client applications.
-
- []
-
- These criteria are reflected in the arguments of the <<<parse>>> method.
-
-Document input stream
-
- The first argument is an
- {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
- for reading the document to be parsed.
-
- If this document stream can not be read, then parsing stops and the thrown
- {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
- is passed up to the client application. If the stream can be read but
- not parsed (for example if the document is corrupted), then the parser
- throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
-
- The parser implementation will consume this stream but <will not close it>.
- Closing the stream is the responsibility of the client application that
- opened it in the first place. The recommended pattern for using streams
- with the <<<parse>>> method is:
-
----
-InputStream stream = ...; // open the stream
-try {
- parser.parse(stream, ...); // parse the stream
-} finally {
- stream.close(); // close the stream
-}
----
-
- Some document formats like the OLE2 Compound Document Format used by
- Microsoft Office are best parsed as random access files. In such cases the
- content of the input stream is automatically spooled to a temporary file
- that gets removed once parsed. A future version of Tika may make it possible
- to avoid this extra file if the input document is already a file in the
- local file system. See
- {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
- of this feature request.
-
-XHTML SAX events
-
- The parsed content of the document stream is returned to the client
- application as a sequence of XHTML SAX events. XHTML is used to express
- structured content of the document and SAX events enable streamed
- processing. Note that the XHTML format is used here only to convey
- structural information, not to render the documents for browsing!
-
- The XHTML SAX events produced by the parser implementation are sent to a
- {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
- instance given to the <<<parse>>> method. If this the content handler
- fails to process an event, then parsing stops and the thrown
- {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
- is passed up to the client application.
-
- The overall structure of the generated event stream is (with indenting
- added for clarity):
-
----
-<html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <title>...</title>
- </head>
- <body>
- ...
- </body>
-</html>
----
-
- Parser implementations typically use the
- {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
- utility class to generate the XHTML output.
-
- Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
- version 0.2) comes with a number of utility classes that can be used to
- process and convert the event stream to other representations.
-
- For example, the
- {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
- class can be used to extract just the body part of the XHTML output and
- feed it either as SAX events to another content handler or as characters
- to an output stream, a writer, or simply a string. The following code
- snippet parses a document from the standard input stream and outputs the
- extracted text content to standard output:
-
----
-ContentHandler handler = new BodyContentHandler(System.out);
-parser.parse(System.in, handler, ...);
----
-
- Another useful class is
- {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
- uses a background thread to parse the document and returns the extracted
- text content as a character stream:
-
----
-InputStream stream = ...; // the document to be parsed
-Reader reader = new ParsingReader(parser, stream, ...);
-try {
- ...; // read the document text using the reader
-} finally {
- reader.close(); // the document stream is closed automatically
-}
----
-
-Document metadata
-
- The final argument to the <<<parse>>> method is used to pass document
- metadata both in and out of the parser. Document metadata is expressed
- as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
-
- The following are some of the more interesting metadata properties:
-
- [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
- the document.
-
- A client application can set this property to allow the parser to use
- file name heuristics to determine the format of the document.
-
- The parser implementation may set this property if the file format
- contains the canonical name of the file (for example the Gzip format
- has a slot for the file name).
-
- [Metadata.CONTENT_TYPE] The declared content type of the document.
-
- A client application can set this property based on for example a HTTP
- Content-Type header. The declared content type may help the parser to
- correctly interpret the document.
-
- The parser implementation sets this property to the content type according
- to which the document was parsed.
-
- [Metadata.TITLE] The title of the document.
-
- The parser implementation sets this property if the document format
- contains an explicit title field.
-
- [Metadata.AUTHOR] The name of the author of the document.
-
- The parser implementation sets this property if the document format
- contains an explicit author field.
-
- []
-
- Note that metadata handling is still being discussed by the Tika development
- team, and it is likely that there will be some (backwards incompatible)
- changes in metadata handling before Tika 1.0.
-
-Parser implementations
-
- Apache Tika comes with a number of parser classes for parsing
- {{{./formats.html}various document formats}}. You can also extend Tika
- with your own parsers, and of course any contributions to Tika are
- warmly welcome.
-
- The goal of Tika is to reuse existing parser libraries like
- {{{http://www.pdfbox.org/}PDFBox}} or
- {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
- of the parser classes in Tika are adapters to such external libraries.
-
- Tika also contains some general purpose parser implementations that are
- not targeted at any specific document formats. The most notable of these
- is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
- class that encapsulates all Tika functionality into a single parser that
- can handle any types of documents. This parser will automatically determine
- the type of the incoming document based on various heuristics and will then
- parse the document accordingly.
+ -------------------------
+ Apache Tika Documentation
+ -------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika Documentation
+
+ This document describes the key abstractions and usage of Apache Tika.
+
+The Parser interface
+
+ The {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+ interface is the key concept of Apache Tika. It hides the complexity of
+ different file formats and parsing libraries while providing a simple and
+ powerful mechanism for client applications to extract structured text
+ content and metadata from all sorts of documents. All this is achieved
+ with a single method:
+
+---
+void parse(InputStream stream, ContentHandler handler, Metadata metadata)
+ throws IOException, SAXException, TikaException;
+---
+
+ The <<<parse>>> method takes the document to be parsed and related metadata
+ as input and outputs the results as XHTML SAX events and extra metadata.
+ The main criteria that lead to this design were:
+
+ [Streamed parsing] The interface should require neither the client
+ application nor the parser implementation to keep the full document
+ content in memory or spooled to disk. This allows even huge documents
+ to be parsed without excessive resource requirements.
+
+ [Structured content] A parser implementation should be able to
+ include structural information (headings, links, etc.) in the extracted
+ content. A client application can use this information for example to
+ better judge the relevance of different parts of the parsed document.
+
+ [Input metadata] A client application should be able to include metadata
+ like the file name or declared content type with the document to be
+ parsed. The parser implementation can use this information to better
+ guide the parsing process.
+
+ [Output metadata] A parser implementation should be able to return
+ document metadata in addition to document content. Many document
+ formats contain metadata like the name of the author that may be useful
+ to client applications.
+
+ []
+
+ These criteria are reflected in the arguments of the <<<parse>>> method.
+
+Document input stream
+
+ The first argument is an
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+ for reading the document to be parsed.
+
+ If this document stream can not be read, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+ is passed up to the client application. If the stream can be read but
+ not parsed (for example if the document is corrupted), then the parser
+ throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+ The parser implementation will consume this stream but <will not close it>.
+ Closing the stream is the responsibility of the client application that
+ opened it in the first place. The recommended pattern for using streams
+ with the <<<parse>>> method is:
+
+---
+InputStream stream = ...; // open the stream
+try {
+ parser.parse(stream, ...); // parse the stream
+} finally {
+ stream.close(); // close the stream
+}
+---
+
+ Some document formats like the OLE2 Compound Document Format used by
+ Microsoft Office are best parsed as random access files. In such cases the
+ content of the input stream is automatically spooled to a temporary file
+ that gets removed once parsed. A future version of Tika may make it possible
+ to avoid this extra file if the input document is already a file in the
+ local file system. See
+ {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+ of this feature request.
+
+XHTML SAX events
+
+ The parsed content of the document stream is returned to the client
+ application as a sequence of XHTML SAX events. XHTML is used to express
+ structured content of the document and SAX events enable streamed
+ processing. Note that the XHTML format is used here only to convey
+ structural information, not to render the documents for browsing!
+
+ The XHTML SAX events produced by the parser implementation are sent to a
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+ instance given to the <<<parse>>> method. If this the content handler
+ fails to process an event, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+ is passed up to the client application.
+
+ The overall structure of the generated event stream is (with indenting
+ added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>...</title>
+ </head>
+ <body>
+ ...
+ </body>
+</html>
+---
+
+ Parser implementations typically use the
+ {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+ utility class to generate the XHTML output.
+
+ Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
+ version 0.2) comes with a number of utility classes that can be used to
+ process and convert the event stream to other representations.
+
+ For example, the
+ {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+ class can be used to extract just the body part of the XHTML output and
+ feed it either as SAX events to another content handler or as characters
+ to an output stream, a writer, or simply a string. The following code
+ snippet parses a document from the standard input stream and outputs the
+ extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+ Another useful class is
+ {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+ uses a background thread to parse the document and returns the extracted
+ text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+ ...; // read the document text using the reader
+} finally {
+ reader.close(); // the document stream is closed automatically
+}
+---
+
+Document metadata
+
+ The final argument to the <<<parse>>> method is used to pass document
+ metadata both in and out of the parser. Document metadata is expressed
+ as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+ The following are some of the more interesting metadata properties:
+
+ [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+ the document.
+
+ A client application can set this property to allow the parser to use
+ file name heuristics to determine the format of the document.
+
+ The parser implementation may set this property if the file format
+ contains the canonical name of the file (for example the Gzip format
+ has a slot for the file name).
+
+ [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+ A client application can set this property based on for example a HTTP
+ Content-Type header. The declared content type may help the parser to
+ correctly interpret the document.
+
+ The parser implementation sets this property to the content type according
+ to which the document was parsed.
+
+ [Metadata.TITLE] The title of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit title field.
+
+ [Metadata.AUTHOR] The name of the author of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit author field.
+
+ []
+
+ Note that metadata handling is still being discussed by the Tika development
+ team, and it is likely that there will be some (backwards incompatible)
+ changes in metadata handling before Tika 1.0.
+
+Parser implementations
+
+ Apache Tika comes with a number of parser classes for parsing
+ {{{./formats.html}various document formats}}. You can also extend Tika
+ with your own parsers, and of course any contributions to Tika are
+ warmly welcome.
+
+ The goal of Tika is to reuse existing parser libraries like
+ {{{http://www.pdfbox.org/}PDFBox}} or
+ {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+ of the parser classes in Tika are adapters to such external libraries.
+
+ Tika also contains some general purpose parser implementations that are
+ not targeted at any specific document formats. The most notable of these
+ is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+ class that encapsulates all Tika functionality into a single parser that
+ can handle any types of documents. This parser will automatically determine
+ the type of the incoming document based on various heuristics and will then
+ parse the document accordingly.
Propchange: tika/site/src/site/apt/0.5/documentation.apt
------------------------------------------------------------------------------
svn:eol-style = native
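Note on the 0.5/documentation.apt text above: the stream-handling and XHTML SAX contract it describes can be sketched with JDK classes alone. This is a minimal illustration, not Tika code. The class ParseContractSketch, its static parse method, and BodyTextCollector are hypothetical stand-ins invented here; they only mimic the documented behaviour (the parser consumes the stream without closing it, output arrives as XHTML SAX events, and the caller closes the stream in a finally block).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ParseContractSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /**
     * Hypothetical stand-in for a parser (not a Tika class): reads the
     * stream as UTF-8 text and reports it as XHTML SAX events.
     * It consumes the stream but deliberately does not close it.
     */
    static void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "head", "head", none);
        handler.startElement(XHTML, "title", "title", none);
        handler.endElement(XHTML, "title", "title");
        handler.endElement(XHTML, "head", "head");
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);

        // Stream the content in chunks so the whole document never
        // has to be held in memory at once.
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            handler.characters(buffer, 0, n);
        }

        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Collects only the character data that appears inside the body element. */
    static class BodyTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if ("body".equals(localName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(localName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Caller side: open the stream, parse, and close it in a finally block. */
    static String extractBodyText(byte[] document)
            throws IOException, SAXException {
        InputStream stream = new ByteArrayInputStream(document);
        try {
            BodyTextCollector handler = new BodyTextCollector();
            parse(stream, handler);
            return handler.text();
        } finally {
            stream.close(); // closing is the caller's responsibility
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "Hello, Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractBodyText(doc));
    }
}
```

With a real Tika parser, the documented BodyContentHandler would take the place of the hand-written collector; the try/finally shape on the caller side stays the same.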
Modified: tika/site/src/site/apt/0.5/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/formats.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/formats.apt (original)
+++ tika/site/src/site/apt/0.5/formats.apt Mon Oct 10 22:12:19 2011
@@ -1,303 +1,303 @@
- --------------------------
- Supported Document Formats
- --------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Supported Document Formats
-
- This page lists all the document formats supported by Apache Tika.
-
-* Microsoft's OLE 2 Compound Document format
-
- A number of Microsoft applications, most notably the Microsoft Office
- suite, use the generic OLE 2 Compound Document format as the basis of
- their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
- to support a number of these formats.
-
- The OLE2 Compound Document format is designed for use with random access
- files, and so the input stream passed to a Tika parser needs to be spooled
- in memory or in a temporary file depending on the size of the document.
- See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
- effort to avoid this extra temporary file if the input document already
- comes from a file.
-
- In addition to the shared base format there's also a shared sets of
- metadata in typical OLE2 documents. Tika uses the
- {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
- property sets and exposes them as the following document metadata:
-
- * <<<TITLE>>> Title
-
- * <<<SUBJECT>>> Subject
-
- * <<<AUTHOR>>> Author
-
- * <<<KEYWORDS>>> Keywords
-
- * <<<COMMENTS>>> Comments
-
- * <<<TEMPLATE>>> Template
-
- * <<<LAST_SAVED>>> Last Saved By
-
- * <<<REVISION_NUMBER>>> Revision Number
-
- * <<<LAST_PRINTED>>> Last Printed
-
- * <<<LAST_SAVED>>> Last Saved Time/Date
-
- * <<<LAST_SAVED>>> Last Saved Time/Date
-
- * <<<PAGE_COUNT>>> Number of Pages
-
- * <<<WORD_COUNT>>> Number of Words
-
- * <<<CHARACTER_COUNT>>> Number of Characters
-
- * <<<APPLICATION_NAME>>> Name of Creating Application
-
- Note that in practice the metadata in many documents is either missing,
- incomplete or even incorrect, so a client application should not rely
- too much on this information.
-
- Support for the new Office Open XML format used by Microsoft Office
- version 2007 is pending for a POI upgrade. Current status is recorded in
- {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
-
- The generic OLE2 Compound Document format is automatically detected using
- a magic number, and further parsing can automatically determine the more
- specific document format. Tika also knows a number of common glob patterns
- like <<<*.doc>>> and <<<*.ppt>>> for these formats.
-
- The supported OLE 2 Compound Document formats are:
-
- [Microsoft Excel (application/vnd.ms-excel)]
- Excel spreadsheet support is available in all versions of Tika and is
- based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
-
- The Excel parser in Tika uses the
- {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
- is able to extract much of the document structure, including all
- (non-empty) worksheets and their table structures. Formula results are
- extracted as stored in the Excel file, and cell links are exposed as
- XHTML links. These features were added in Tika version 0.2.
-
- Cell comments and formatting are currently not supported. See
- {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
- {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
- respective issues.
-
- [Microsoft Word (application/msword)]
- Word document support is available in all versions of Tika and is based
- on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
-
- The Word parser uses the
- {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
- class from HWPF to extract document content as a sequence of paragraphs.
-
- [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
- PowerPoint presentation support is available in all versions of Tika and
- is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
-
- The PowerPoint parser uses the
- {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
- class from HSLF to extract spreadsheet content as a single paragraph.
-
- [Microsoft Visio (application/vnd.visio)]
- Visio diagram support was added in Tika version 0.2 and is based on the
- {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
-
- The Visio parser uses the
- {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioExtractor}}
- class from HDGF to extract diagram content as a sequence of paragraphs.
-
- [Microsoft Outlook (application/vnd.ms-outlook)]
- Outlook message support was added in Tika version 0.2 and is based on the
- {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
-
- The Outlook parser extracts the subject of the message and the From,
- To, Cc, and Bcc addresses (formatted for display) along with the body
- text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
- <<<SUBJECT>>> metadata properties are set explicitly, overriding
- potential generic document metadata retrieved from OLE2 property sets.
-
-* Compression formats
-
- General purpose compression formats are used to reduce the size of
- any kinds of documents. Tika uses a parsing pipeline to support general
- purpose compression: in the first stage the compressed stream decompressed
- and the resulting decompressed stream is passed on to a second parsing
- stage where it will be processed as if the document had never been
- compressed.
-
- Tika contains magic numbers and glob patterns for auto-detecting all
- supported compression formats. The glob patterns of compression formats
- are also used to determine the name of the original uncompressed document.
- If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
- property that matches such a glob pattern, then the decompressing first
- parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
- with the deduced original document name before passing control to the
- second parsing stage.
-
- Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
- property, no document metadata is passed to or from the second parsing
- stage. Only the text content extracted by the second stage parser is
- returned to the client application.
-
- The supported compression formats are:
-
- [gzip compression (application/x-gzip)]
- {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
- Tika version 0.2 and is based on the
- {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
- class in the Java 5 class library.
-
- The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
- and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
- <<<*>>> as described above.
-
- [bzip2 compression (application/x-bzip)]
- {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
- Tika version 0.2 and is based on bzip2 parsing code from
- {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
- based on work by Keiron Liddle from Aftex Software.
-
- The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
- and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
- <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
-
-* Audio formats
-
- Tika can detect several common audio formats and extract metadata
- from them. Text extraction is supported for some MIDI-based karaoke
- formats that contain the lyrics of the encoded audio.
-
- See {{{https://issues.apache.org/jira/browse/TIKA-94}TIKA-94}} for
- an effort to integrate speech recognition support to Tika.
-
- [MP3 Audio (audio/mpeg)]
- The parsing of {{{http://www.id3.org/ID3v1}ID3v1}} tags from MP3 files
- was added in Tika version 0.2. If found the following metadata is
- extracted and set:
-
- * <<<TITLE>>> Title
-
- * <<<SUBJECT>>> Subject
-
- The above information, as well as the <<<Album>>>, <<<Track>>>,
- <<<Year>>>, <<<Genre>>> and additional <<<Comment>>> are extracted
- when set in the file.
-
- [MIDI audio (audio/midi)]
- Tika uses the MIDI support in <<<javax.audio.midi>>> to parse MIDI
- sequence files. Many karaoke file formats are based on MIDI, and
- contain lyrics as embedded text tracks that Tika knows how to extract.
-
- Support for MIDI files was added in Tika 0.3.
-
- [Wave audio (audio/basic)]
- Tika supports sampled wave audio (.wav files, etc.) using the
- <<<javax.audio.sampled>>> package. Only sampling metadata is extracted.
-
- Support for sampled wave audio was added in Tika 0.3.
-
-* Other supported formats
-
- [Extensible Markup Language (application/xml)]
- Tika uses the <<<javax.xml>>> classes to parse Extensible Markup Language files.
- Support for Extensible Markup Language files was added in Tika 0.1.
-
- [HyperText Markup Language (text/html)]
- Tika uses the {{{http://sourceforge.net/projects/nekohtml}CyberNeko}} library to parse HyperText Markup Language files.
- Support for HyperText Markup Language files was added in Tika 0.1.
-
- [Images (image/*)]
- Tika uses the <<<javax.imageio>>> classes to extract metadata
- from image files.
-
- Support for Image files was added in Tika 0.2.
-
- [Java class files]
- The parsing of Java Class files is based on the asm library and
- work by Dave Brosius in JCR-1522.
-
- Support for Java Class files was added in Tika 0.2.
-
- [Java jar archives]
- The parsing of Java JAR archives is performed using a combination of
- the ZIP and Java class file parsers.
-
- Support for Java JAR archives was added in Tika 0.2.
-
- [OpenDocument (application/vnd.oasis.opendocument.*)]
- Tika uses the built-in ZIP and XML features in Java to parse the
- {{{http://en.wikipedia.org/wiki/OpenDocument}OpenDocument}} document types
- used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0
- formats are also supported, though they are currently not auto-detected
- as well as the newer formats.
-
- Support for the OpenDocument formats was added in Tika 0.3.
-
- [Plain text (text/plain)]
- Tika uses the
- {{{http://www.icu-project.org/}International Components for Unicode}}
- Java library (ICU4J) to parse plain text. Support for plain text was added
- in Tika 0.1.
-
- Extracting text content from plain text files is actually a relatively
- complex task due to the fact that the character encoding of the text
- file is often unknown to the parser.
-
- The text parser in Tika uses the ICU4J
- {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
- class to automatically detect the character encoding of any text input.
- As an added benefit, the ICU4J library is in some cases able to detect
- also the language in which the text is written.
-
- The character encoding and language of the plain text document are
- returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
- metadata properties. If the (declared) content encoding of a text document
- is already known to the client application, then it can be supplied as the
- <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
- simplify encoding detection.
-
- [Portable Document Format (application/pdf)]
- Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
- Portable Document Format (PDF) documents.
-
- Support for PDF was added in Tika 0.1.
-
- [Rich Text Format (application/rtf)]
- Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
- documents. Support for RTF was added in Tika 0.1.
-
- The RTF parser in Tika uses the Swing
- {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
- class to extract all text from an RTF document as a single paragraph.
- Document metadata extraction is currently not supported.
-
- [tar archive (application/x-tar)]
- Tika uses an adapted version of the tar parsing code from
- {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
- The tar code is originally based on work by Timothy Gerard Endres.
-
- Support for tar archives was added in Tika 0.2.
-
- [ZIP archive (application/zip)]
- Tika uses Java's built-in Zip classes to parse ZIP files.
-
- Support for ZIP was added in Tika 0.2.
+ --------------------------
+ Supported Document Formats
+ --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+ This page lists all the document formats supported by Apache Tika.
+
+* Microsoft's OLE 2 Compound Document format
+
+ A number of Microsoft applications, most notably the Microsoft Office
+ suite, use the generic OLE 2 Compound Document format as the basis of
+ their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
+ to support a number of these formats.
+
+ The OLE2 Compound Document format is designed for use with random access
+ files, and so the input stream passed to a Tika parser needs to be spooled
+ in memory or in a temporary file depending on the size of the document.
+ See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
+ effort to avoid this extra temporary file if the input document already
+ comes from a file.
+
+ In addition to the shared base format there's also a shared set of
+ metadata in typical OLE2 documents. Tika uses the
+ {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
+ property sets and exposes them as the following document metadata:
+
+ * <<<TITLE>>> Title
+
+ * <<<SUBJECT>>> Subject
+
+ * <<<AUTHOR>>> Author
+
+ * <<<KEYWORDS>>> Keywords
+
+ * <<<COMMENTS>>> Comments
+
+ * <<<TEMPLATE>>> Template
+
+ * <<<LAST_SAVED>>> Last Saved By
+
+ * <<<REVISION_NUMBER>>> Revision Number
+
+ * <<<LAST_PRINTED>>> Last Printed
+
+ * <<<LAST_SAVED>>> Last Saved Time/Date
+
+ * <<<LAST_SAVED>>> Last Saved Time/Date
+
+ * <<<PAGE_COUNT>>> Number of Pages
+
+ * <<<WORD_COUNT>>> Number of Words
+
+ * <<<CHARACTER_COUNT>>> Number of Characters
+
+ * <<<APPLICATION_NAME>>> Name of Creating Application
+
+ Note that in practice the metadata in many documents is either missing,
+ incomplete or even incorrect, so a client application should not rely
+ too much on this information.
+
+ Support for the new Office Open XML format used by Microsoft Office
+ version 2007 is pending a POI upgrade. The current status is recorded in
+ {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
+
+ The generic OLE2 Compound Document format is automatically detected using
+ a magic number, and further parsing can automatically determine the more
+ specific document format. Tika also knows a number of common glob patterns
+ like <<<*.doc>>> and <<<*.ppt>>> for these formats.
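
 The magic number check mentioned above can be sketched as follows. This is
 a minimal illustration that assumes only the well-known 8-byte OLE 2
 signature; the class and method names are hypothetical and are not part of
 the Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Ole2MagicSketch {
    // The 8-byte signature found at the start of an OLE 2 Compound Document.
    private static final int[] OLE2_MAGIC = {
        0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
    };

    /** Returns true if the stream starts with the OLE 2 signature. */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        for (int expected : OLE2_MAGIC) {
            // read() yields 0..255, or -1 at end of stream
            if (in.read() != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0};
        byte[] text = "plain text".getBytes("US-ASCII");
        System.out.println(looksLikeOle2(new ByteArrayInputStream(ole2)));
        System.out.println(looksLikeOle2(new ByteArrayInputStream(text)));
    }
}
```

 A match only identifies the generic container; telling Word from Excel
 still requires looking inside the document, as described above.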
+
+ The supported OLE 2 Compound Document formats are:
+
+ [Microsoft Excel (application/vnd.ms-excel)]
+ Excel spreadsheet support is available in all versions of Tika and is
+ based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
+
+ The Excel parser in Tika uses the
+ {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
+ is able to extract much of the document structure, including all
+ (non-empty) worksheets and their table structures. Formula results are
+ extracted as stored in the Excel file, and cell links are exposed as
+ XHTML links. These features were added in Tika version 0.2.
+
+ Cell comments and formatting are currently not supported. See
+ {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
+ {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
+ respective issues.
+
+ [Microsoft Word (application/msword)]
+ Word document support is available in all versions of Tika and is based
+ on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
+
+ The Word parser uses the
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
+ class from HWPF to extract document content as a sequence of paragraphs.
+
+ [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
+ PowerPoint presentation support is available in all versions of Tika and
+ is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
+
+ The PowerPoint parser uses the
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
+ class from HSLF to extract presentation content as a single paragraph.
+
+ [Microsoft Visio (application/vnd.visio)]
+ Visio diagram support was added in Tika version 0.2 and is based on the
+ {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
+
+ The Visio parser uses the
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioTextExtractor}}
+ class from HDGF to extract diagram content as a sequence of paragraphs.
+
+ [Microsoft Outlook (application/vnd.ms-outlook)]
+ Outlook message support was added in Tika version 0.2 and is based on the
+ {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
+
+ The Outlook parser extracts the subject of the message and the From,
+ To, Cc, and Bcc addresses (formatted for display) along with the body
+ text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
+ <<<SUBJECT>>> metadata properties are set explicitly, overriding
+ potential generic document metadata retrieved from OLE2 property sets.
+
+* Compression formats
+
+ General purpose compression formats are used to reduce the size of
+ any kind of document. Tika uses a parsing pipeline to support general
+ purpose compression: in the first stage the compressed stream is decompressed
+ and the resulting decompressed stream is passed on to a second parsing
+ stage where it will be processed as if the document had never been
+ compressed.
+
+ Tika contains magic numbers and glob patterns for auto-detecting all
+ supported compression formats. The glob patterns of compression formats
+ are also used to determine the name of the original uncompressed document.
+ If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
+ property that matches such a glob pattern, then the decompressing first
+ parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
+ with the deduced original document name before passing control to the
+ second parsing stage.
+
+ Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
+ property, no document metadata is passed to or from the second parsing
+ stage. Only the text content extracted by the second stage parser is
+ returned to the client application.
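
 The two-stage pipeline described above can be sketched with the JDK alone.
 This is only an illustration: <<<extractText>>> is a hypothetical stand-in
 for the second parsing stage, and none of these names come from Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TwoStageSketch {
    /** Hypothetical second stage: treats the stream as UTF-8 text. */
    static String extractText(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** First stage: unwrap the gzip layer, then delegate to stage two. */
    static String parseGzip(InputStream compressed) throws IOException {
        try (InputStream plain = new GZIPInputStream(compressed)) {
            // Only the extracted text comes back; no metadata is passed on.
            return extractText(plain);
        }
    }

    /** Helper that builds gzip test input in memory. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello tika");
        System.out.println(parseGzip(new ByteArrayInputStream(compressed)));
    }
}
```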
+
+ The supported compression formats are:
+
+ [gzip compression (application/x-gzip)]
+ {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
+ Tika version 0.2 and is based on the
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
+ class in the Java 5 class library.
+
+ The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
+ and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
+ <<<*>>> as described above.
+
+ [bzip2 compression (application/x-bzip)]
+ {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
+ Tika version 0.2 and is based on bzip2 parsing code from
+ {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
+ based on work by Keiron Liddle from Aftex Software.
+
+ The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
+ and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
+ <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
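
 The name replacement for the gzip and bzip2 glob patterns listed above can
 be sketched as a simple suffix table. The class and method names are
 hypothetical, and matching is assumed to be case-sensitive; the actual
 Tika implementation may differ.

```java
public class OriginalNameSketch {
    // {compressed suffix, replacement} pairs taken from the glob patterns
    // listed above; longer suffixes come first.
    private static final String[][] RULES = {
        {".tbz2", ".tar"}, {".tgz", ".tar"}, {".tbz", ".tar"},
        {".bz2", ""}, {".gz", ""}, {"-gz", ""}, {".bz", ""},
    };

    /** Deduces the uncompressed name, or returns the input unchanged. */
    static String originalName(String name) {
        for (String[] rule : RULES) {
            String suffix = rule[0];
            // Require at least one character before the suffix ("*" part).
            if (name.length() > suffix.length() && name.endsWith(suffix)) {
                return name.substring(0, name.length() - suffix.length())
                        + rule[1];
            }
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(originalName("archive.tgz"));
        System.out.println(originalName("report.doc.gz"));
        System.out.println(originalName("backup.tbz2"));
        System.out.println(originalName("notes.txt"));
    }
}
```

 With a table like this, a <<<RESOURCE_NAME_KEY>>> of <<<archive.tgz>>>
 would reach the second stage as <<<archive.tar>>>.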
+
+* Audio formats
+
+ Tika can detect several common audio formats and extract metadata
+ from them. Text extraction is supported for some MIDI-based karaoke
+ formats that contain the lyrics of the encoded audio.
+
+ See {{{https://issues.apache.org/jira/browse/TIKA-94}TIKA-94}} for
+ an effort to integrate speech recognition support into Tika.
+
+ [MP3 Audio (audio/mpeg)]
+ The parsing of {{{http://www.id3.org/ID3v1}ID3v1}} tags from MP3 files
+ was added in Tika version 0.2. If found, the following metadata is
+ extracted and set:
+
+ * <<<TITLE>>> Title
+
+ * <<<SUBJECT>>> Subject
+
+ The above information, as well as the <<<Album>>>, <<<Track>>>,
+ <<<Year>>>, <<<Genre>>> and additional <<<Comment>>> are extracted
+ when set in the file.
+
+ [MIDI audio (audio/midi)]
+ Tika uses the MIDI support in <<<javax.sound.midi>>> to parse MIDI
+ sequence files. Many karaoke file formats are based on MIDI, and
+ contain lyrics as embedded text tracks that Tika knows how to extract.
+
+ Support for MIDI files was added in Tika 0.3.
+
+ [Wave audio (audio/basic)]
+ Tika supports sampled wave audio (.wav files, etc.) using the
+ <<<javax.sound.sampled>>> package. Only sampling metadata is extracted.
+
+ Support for sampled wave audio was added in Tika 0.3.
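
 For the ID3v1 tags mentioned in the MP3 entry above, the following sketch
 shows how the fixed-width fields can be read. It assumes the standard
 ID3v1 layout (a 128-byte block at the end of the file, starting with
 <<<TAG>>>); the class and method names are hypothetical, and the mapping
 to Tika metadata properties is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Id3v1Sketch {
    static final int TAG_SIZE = 128;

    /** Reads one fixed-width field, dropping NUL and space padding. */
    static String field(byte[] tag, int offset, int length) {
        String raw = new String(tag, offset, length,
                StandardCharsets.ISO_8859_1);
        int nul = raw.indexOf('\0');
        if (nul >= 0) {
            raw = raw.substring(0, nul);
        }
        return raw.trim();
    }

    /**
     * Returns {title, artist, album, year, comment}, or null when the
     * data does not end with an ID3v1 tag.
     */
    static String[] parse(byte[] file) {
        if (file.length < TAG_SIZE) {
            return null;
        }
        byte[] tag = Arrays.copyOfRange(
                file, file.length - TAG_SIZE, file.length);
        if (tag[0] != 'T' || tag[1] != 'A' || tag[2] != 'G') {
            return null;
        }
        return new String[] {
            field(tag, 3, 30),   // title
            field(tag, 33, 30),  // artist
            field(tag, 63, 30),  // album
            field(tag, 93, 4),   // year
            field(tag, 97, 30),  // comment
        };
    }

    public static void main(String[] args) {
        byte[] data = new byte[TAG_SIZE];
        byte[] head = "TAGExample Title".getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(head, 0, data, 0, head.length);
        System.out.println(parse(data)[0]);
    }
}
```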
+
+* Other supported formats
+
+ [Extensible Markup Language (application/xml)]
+ Tika uses the <<<javax.xml>>> classes to parse Extensible Markup Language files.
+ Support for Extensible Markup Language files was added in Tika 0.1.
+
+ [HyperText Markup Language (text/html)]
+ Tika uses the {{{http://sourceforge.net/projects/nekohtml}CyberNeko}} library to parse HyperText Markup Language files.
+ Support for HyperText Markup Language files was added in Tika 0.1.
+
+ [Images (image/*)]
+ Tika uses the <<<javax.imageio>>> classes to extract metadata
+ from image files.
+
+ Support for Image files was added in Tika 0.2.
+
+ [Java class files]
+ The parsing of Java Class files is based on the asm library and
+ work by Dave Brosius in JCR-1522.
+
+ Support for Java Class files was added in Tika 0.2.
+
+ [Java jar archives]
+ The parsing of Java JAR archives is performed using a combination of
+ the ZIP and Java class file parsers.
+
+ Support for Java JAR archives was added in Tika 0.2.
+
+ [OpenDocument (application/vnd.oasis.opendocument.*)]
+ Tika uses the built-in ZIP and XML features in Java to parse the
+ {{{http://en.wikipedia.org/wiki/OpenDocument}OpenDocument}} document types
+ used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0
+ formats are also supported, though they are currently not auto-detected
+ as well as the newer formats.
+
+ Support for the OpenDocument formats was added in Tika 0.3.
+
+ [Plain text (text/plain)]
+ Tika uses the
+ {{{http://www.icu-project.org/}International Components for Unicode}}
+ Java library (ICU4J) to parse plain text. Support for plain text was added
+ in Tika 0.1.
+
+ Extracting text content from plain text files is actually a relatively
+ complex task due to the fact that the character encoding of the text
+ file is often unknown to the parser.
+
+ The text parser in Tika uses the ICU4J
+ {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
+ class to automatically detect the character encoding of any text input.
+ As an added benefit, the ICU4J library is in some cases also able to
+ detect the language in which the text is written.
+
+ The character encoding and language of the plain text document are
+ returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
+ metadata properties. If the (declared) content encoding of a text document
+ is already known to the client application, then it can be supplied as the
+ <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
+ simplify encoding detection.
+
+ [Portable Document Format (application/pdf)]
+ Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
+ Portable Document Format (PDF) documents.
+
+ Support for PDF was added in Tika 0.1.
+
+ [Rich Text Format (application/rtf)]
+ Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
+ documents. Support for RTF was added in Tika 0.1.
+
+ The RTF parser in Tika uses the Swing
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
+ class to extract all text from an RTF document as a single paragraph.
+ Document metadata extraction is currently not supported.
+
+ [tar archive (application/x-tar)]
+ Tika uses an adapted version of the tar parsing code from
+ {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
+ The tar code is originally based on work by Timothy Gerard Endres.
+
+ Support for tar archives was added in Tika 0.2.
+
+ [ZIP archive (application/zip)]
+ Tika uses Java's built-in Zip classes to parse ZIP files.
+
+ Support for ZIP was added in Tika 0.2.
Propchange: tika/site/src/site/apt/0.5/formats.apt
------------------------------------------------------------------------------
svn:eol-style = native
Modified: tika/site/src/site/apt/0.5/gettingstarted.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/gettingstarted.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.5/gettingstarted.apt Mon Oct 10 22:12:19 2011
@@ -1,241 +1,241 @@
- --------------------------------
- Getting Started with Apache Tika
- --------------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Getting Started with Apache Tika
-
- This document describes how to build Apache Tika from sources and
- how to start using Tika in an application.
-
-Getting and building the sources
-
- To build Tika from sources you first need to either
- {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
- version control.
-
- Once you have the sources, you can build them using the
- {{{http://maven.apache.org/}Maven 2}} build system. Executing the
- following command in the base directory will build the sources
- and install the resulting artifacts in your local Maven repository.
-
----
-mvn install
----
-
- See the Maven documentation for more information about the available
- build options.
-
- Note that you need Java 5 or higher to build Tika.
-
-Build artifacts
-
- Starting with Tika 0.5, the build consists of a number of components
- and produces the following main binaries (x.y stands for the current
- Tika version number):
-
- [tika-core/target/tika-core-x.y.jar]
- Tika core library. Contains the core interfaces and classes of Tika,
- but none of the parser implementations. Depends only on Java 5.
-
- [tika-core/target/tika-core-x.y-jdk14.jar]
- Java 1.4 version of the Tika core library.
-
- [tika-parsers/target/tika-parsers-x.y.jar]
- Tika parsers. Collection of classes that implement the Tika Parser
- interface based on various external parser libraries.
-
- [tika-app/target/tika-app-x.y.jar]
- Tika application. Combines the above libraries and all the external
- parser libraries into a single runnable jar with a GUI and a command
- line interface.
-
-Using Tika as a Maven dependency
-
- Since the 0.5 release Tika has been split to components to give you
- more control over which parts of Tika you want to use in your application.
- The core library, tika-core, contains the key interfaces and classes, so
- you'll always want to include a dependency to it:
-
----
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-core</artifactId>
- <version>x.y</version> <!-- 0.5 or higher -->
- </dependency>
----
-
- This dependency only gives you basic Tika functionality without any of
- the parser libraries. If you want to use Tika to parse documents (instead
- of simply detecting document types, etc.), you also need the tika-parsers
- dependency:
-
----
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-parsers</artifactId>
- <version>x.y</version> <!-- same version as in tika-core -->
- </dependency>
----
-
- Note that adding this dependency will introduce a number of
- transitive dependencies to your project. You need to make sure that
- these dependencies won't conflict with your existing project dependencies.
- The listing below shows all the compile-scope dependencies of the
- current Tika parsers release (0.5, November 2009). You can use the
- command "mvn dependency:tree" to check the latest tree of dependencies on any
- one of Tika's core, parsers and app projects.
-
----
-org.apache.tika:tika-parent:pom:0.5
-org.apache.tika:tika-core:bundle:0.5
-\- junit:junit:jar:3.8.1:test
-org.apache.tika:tika-parsers:bundle:0.5
-+- org.apache.tika:tika-core:jar:0.5:compile
-+- org.apache.commons:commons-compress:jar:1.0:compile
-+- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
-| +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
-| \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
-+- org.apache.poi:poi:jar:3.5-FINAL:compile
-+- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:compile
-+- org.apache.poi:poi-ooxml:jar:3.5-FINAL:compile
-| +- org.apache.poi:ooxml-schemas:jar:1.0:compile
-| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
-| \- dom4j:dom4j:jar:1.6.1:compile
-| \- xml-apis:xml-apis:jar:1.0.b2:compile
-+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
-+- commons-logging:commons-logging:jar:1.1.1:compile
-+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
-+- asm:asm:jar:3.1:compile
-+- log4j:log4j:jar:1.2.14:compile
-+- junit:junit:jar:3.8.1:test
-+- org.mockito:mockito-core:jar:1.7:test
-| +- org.hamcrest:hamcrest-core:jar:1.1:test
-| \- org.objenesis:objenesis:jar:1.0:test
-\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
-org.apache.tika:tika-app:bundle:0.5
-\- org.apache.tika:tika-parsers:jar:0.5:provided
- +- org.apache.tika:tika-core:jar:0.5:provided
- +- org.apache.commons:commons-compress:jar:1.0:provided
- +- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:provided
- | +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:provided
- | \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:provided
- +- org.apache.poi:poi:jar:3.5-FINAL:provided
- +- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:provided
- +- org.apache.poi:poi-ooxml:jar:3.5-FINAL:provided
- | +- org.apache.poi:ooxml-schemas:jar:1.0:provided
- | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:provided
- | \- dom4j:dom4j:jar:1.6.1:provided
- | \- xml-apis:xml-apis:jar:1.0.b2:provided
- +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:provided
- +- commons-logging:commons-logging:jar:1.1.1:provided
- +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
- +- asm:asm:jar:3.1:provided
- +- log4j:log4j:jar:1.2.14:provided
- \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided
----
-
-Using Tika in an Ant project
-
- Unless you use a dependency manager tool like
- {{{http://ant.apache.org/ivy/}Apache Ivy}}, to use Tika in you application
- you can include the Tika jar files and the dependencies individually.
-
----
-<classpath>
- ... <!-- your other classpath entries -->
- <pathelement location="path/to/tika-core-0.5.jar"/>
- <pathelement location="path/to/tika-parsers-0.5.jar"/>
- <pathelement location="path/to/commons-logging-1.1.1.jar"/>
- <pathelement location="path/to/commons-compress-1.0.jar"/>
- <pathelement location="path/to/pdfbox-0.7.3.jar"/>
- <pathelement location="path/to/fontbox-0.1.0.jar"/>
- <pathelement location="path/to/jempbox-0.2.0.jar"/>
- <pathelement location="path/to/bcmail-jdk14-136.jar"/>
- <pathelement location="path/to/bcprov-jdk14-136.jar"/>
- <pathelement location="path/to/poi-3.5-beta6.jar"/>
- <pathelement location="path/to/poi-scratchpad-3.5-beta6.jar"/>
- <pathelement location="path/to/poi-ooxml-3.5-beta6.jar"/>
- <pathelement location="path/to/ooxml-schemas-1.0.jar"/>
- <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
- <pathelement location="path/to/dom4j-1.6.1.jar"/>
- <pathelement location="path/to/nekohtml-1.9.9.jar"/>
- <pathelement location="path/to/xercesImpl-2.8.1.jar"/>
- <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
- <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.jar"/>
- <pathelement location="path/to/asm-3.1.jar"/>
- <pathelement location="path/to/log4j-1.2.14.jar"/>
-</classpath>
----
-
- An easy way to gather all these libraries is to run
- "mvn dependency:copy-dependencies" in the Tika source directory.
- This will copy all Tika dependencies to the <<<target/dependencies>>>
- directory.
-
- Alternatively you can simply drop the entire tika-app jar to your
- classpath to get all of the above dependencies in a single archive.
-
-Using Tika as a command line utility
-
- The Tika application jar (tika-app-x.y.jar) can be used as a command
- line utility for extracting text content and metadata from all sorts of
- files. This runnable jar contains all the dependencies it needs, so
- you don't need to worry about classpath settings to run it.
-
- The usage instructions are shown below.
-
----
-usage: java -jar tika-app-x.y.jar [option] [file]
-
-Options:
- -? or --help Print this usage message
- -v or --verbose Print debug level messages
- -g or --gui Start the Apache Tika GUI
- -x or --xml Output XHTML content (default)
- -h or --html Output HTML content
- -t or --text Output plain text content
- -m or --metadata Output only metadata
-
-Description:
- Apache Tika will parse the file(s) specified on the
- command line and output the extracted text content
- or metadata to standard output.
-
- Instead of a file name you can also specify the URL
- of a document to be parsed.
-
- If no file name or URL is specified (or the special
- name "-" is used), then the standard input stream
- is parsed.
-
- Use the "--gui" (or "-g") option to start
- the Apache Tika GUI. You can drag and drop files
- from a normal file explorer to the GUI window to
- extract text content and metadata from the files.
----
-
- You can also use the jar as a component in a Unix pipeline or
- as an external tool in many scripting languages.
-
----
-# Check if an Internet resource contains a specific keyword
-curl http://.../document.doc \
- | java -jar tika-app-x.y.jar --text \
- | grep -q keyword
----
+ --------------------------------
+ Getting Started with Apache Tika
+ --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 5 or higher to build Tika.
+
+Build artifacts
+
+ Starting with Tika 0.5, the build consists of a number of components
+ and produces the following main binaries (x.y stands for the current
+ Tika version number):
+
+ [tika-core/target/tika-core-x.y.jar]
+ Tika core library. Contains the core interfaces and classes of Tika,
+ but none of the parser implementations. Depends only on Java 5.
+
+ [tika-core/target/tika-core-x.y-jdk14.jar]
+ Java 1.4 version of the Tika core library.
+
+ [tika-parsers/target/tika-parsers-x.y.jar]
+ Tika parsers. Collection of classes that implement the Tika Parser
+ interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-x.y.jar]
+ Tika application. Combines the above libraries and all the external
+ parser libraries into a single runnable jar with a GUI and a command
+ line interface.
+
+Using Tika as a Maven dependency
+
+ Since the 0.5 release Tika has been split into components to give you
+ more control over which parts of Tika you want to use in your application.
+ The core library, tika-core, contains the key interfaces and classes, so
+ you'll always want to include a dependency on it:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-core</artifactId>
+ <version>x.y</version> <!-- 0.5 or higher -->
+ </dependency>
+---
+
+ This dependency only gives you basic Tika functionality without any of
+ the parser libraries. If you want to use Tika to parse documents (instead
+ of simply detecting document types, etc.), you also need the tika-parsers
+ dependency:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parsers</artifactId>
+ <version>x.y</version> <!-- same version as in tika-core -->
+ </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project. You need to make sure that
+ these dependencies won't conflict with your existing project dependencies.
+ The listing below shows all the compile-scope dependencies of the
+ current Tika parsers release (0.5, November 2009). You can use the
+ command "mvn dependency:tree" to check the latest tree of dependencies on any
+ one of Tika's core, parsers and app projects.
+
+---
+org.apache.tika:tika-parent:pom:0.5
+org.apache.tika:tika-core:bundle:0.5
+\- junit:junit:jar:3.8.1:test
+org.apache.tika:tika-parsers:bundle:0.5
++- org.apache.tika:tika-core:jar:0.5:compile
++- org.apache.commons:commons-compress:jar:1.0:compile
++- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
+| +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
+| \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
++- org.apache.poi:poi:jar:3.5-FINAL:compile
++- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:compile
++- org.apache.poi:poi-ooxml:jar:3.5-FINAL:compile
+| +- org.apache.poi:ooxml-schemas:jar:1.0:compile
+| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
+| \- dom4j:dom4j:jar:1.6.1:compile
+| \- xml-apis:xml-apis:jar:1.0.b2:compile
++- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
++- commons-logging:commons-logging:jar:1.1.1:compile
++- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
++- asm:asm:jar:3.1:compile
++- log4j:log4j:jar:1.2.14:compile
++- junit:junit:jar:3.8.1:test
++- org.mockito:mockito-core:jar:1.7:test
+| +- org.hamcrest:hamcrest-core:jar:1.1:test
+| \- org.objenesis:objenesis:jar:1.0:test
+\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+org.apache.tika:tika-app:bundle:0.5
+\- org.apache.tika:tika-parsers:jar:0.5:provided
+ +- org.apache.tika:tika-core:jar:0.5:provided
+ +- org.apache.commons:commons-compress:jar:1.0:provided
+ +- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:provided
+ | +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:provided
+ | \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:provided
+ +- org.apache.poi:poi:jar:3.5-FINAL:provided
+ +- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:provided
+ +- org.apache.poi:poi-ooxml:jar:3.5-FINAL:provided
+ | +- org.apache.poi:ooxml-schemas:jar:1.0:provided
+ | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:provided
+ | \- dom4j:dom4j:jar:1.6.1:provided
+ | \- xml-apis:xml-apis:jar:1.0.b2:provided
+ +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:provided
+ +- commons-logging:commons-logging:jar:1.1.1:provided
+ +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
+ +- asm:asm:jar:3.1:provided
+ +- log4j:log4j:jar:1.2.14:provided
+ \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency management tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, you need to include the Tika
+ jar files and their dependencies individually to use Tika in your application.
+
+---
+<classpath>
+ ... <!-- your other classpath entries -->
+ <pathelement location="path/to/tika-core-0.5.jar"/>
+ <pathelement location="path/to/tika-parsers-0.5.jar"/>
+ <pathelement location="path/to/commons-compress-1.0.jar"/>
+ <pathelement location="path/to/pdfbox-0.8.0-incubating.jar"/>
+ <pathelement location="path/to/fontbox-0.8.0-incubator.jar"/>
+ <pathelement location="path/to/jempbox-0.8.0-incubator.jar"/>
+ <pathelement location="path/to/poi-3.5-FINAL.jar"/>
+ <pathelement location="path/to/poi-scratchpad-3.5-FINAL.jar"/>
+ <pathelement location="path/to/poi-ooxml-3.5-FINAL.jar"/>
+ <pathelement location="path/to/ooxml-schemas-1.0.jar"/>
+ <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
+ <pathelement location="path/to/dom4j-1.6.1.jar"/>
+ <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
+ <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.1.jar"/>
+ <pathelement location="path/to/commons-logging-1.1.1.jar"/>
+ <pathelement location="path/to/tagsoup-1.2.jar"/>
+ <pathelement location="path/to/asm-3.1.jar"/>
+ <pathelement location="path/to/log4j-1.2.14.jar"/>
+ <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
+</classpath>
+---
+
+ An easy way to gather all these libraries is to run
+ "mvn dependency:copy-dependencies" in the Tika source directory.
+ By default this copies the dependencies of each module to that module's
+ <<<target/dependency>>> directory.
+
+ Alternatively, you can simply add the tika-app jar to your
+ classpath to get all of the above dependencies in a single archive.
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-x.y.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app-x.y.jar [option] [file]
+
+Options:
+ -? or --help Print this usage message
+ -v or --verbose Print debug level messages
+ -g or --gui Start the Apache Tika GUI
+ -x or --xml Output XHTML content (default)
+ -h or --html Output HTML content
+ -t or --text Output plain text content
+ -m or --metadata Output only metadata
+
+Description:
+ Apache Tika will parse the file(s) specified on the
+ command line and output the extracted text content
+ or metadata to standard output.
+
+ Instead of a file name you can also specify the URL
+ of a document to be parsed.
+
+ If no file name or URL is specified (or the special
+ name "-" is used), then the standard input stream
+ is parsed.
+
+ Use the "--gui" (or "-g") option to start
+ the Apache Tika GUI. You can drag and drop files
+ from a normal file explorer to the GUI window to
+ extract text content and metadata from the files.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+ | java -jar tika-app-x.y.jar --text \
+ | grep -q keyword
+---
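
The pipeline at the end of gettingstarted.apt can be wrapped in a small shell function when the same check is needed more than once. What follows is a minimal sketch and not part of this commit: the names extract_text, has_keyword and TIKA_JAR are hypothetical, and only the "java -jar tika-app-x.y.jar" invocation and the --text option are taken from the usage message above.

```shell
#!/bin/sh
# Sketch only: extract_text, has_keyword and TIKA_JAR are hypothetical names.
# Replace x.y with the version of the tika-app jar you downloaded or built.
TIKA_JAR="${TIKA_JAR:-tika-app-x.y.jar}"

# Print the plain text content of a file (or URL) on standard output,
# using the --text option from the usage message.
extract_text() {
    java -jar "$TIKA_JAR" --text "$1"
}

# Succeed (exit status 0) if the extracted text contains the fixed string $2.
# The status of a pipeline is the status of its last command, here grep.
has_keyword() {
    extract_text "$1" | grep -qF -- "$2"
}

# Example: has_keyword path/to/document.doc keyword && echo "keyword found"
```

Keeping the extraction step in its own function means the keyword check does not depend on how the text is produced, so the jar location or output option can be changed in one place.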
Propchange: tika/site/src/site/apt/0.5/gettingstarted.apt
------------------------------------------------------------------------------
svn:eol-style = native
Modified: tika/site/src/site/apt/0.5/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/index.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/index.apt (original)
+++ tika/site/src/site/apt/0.5/index.apt Mon Oct 10 22:12:19 2011
@@ -1,100 +1,100 @@
- ---------------
- Apache Tika 0.5
- ---------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Apache Tika 0.5
-
- The most notable changes in Tika 0.5 over the previous release are:
-
- * Improved RDF/OWL mime detection using both MIME magic as well as
- pattern matching.
- ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})
-
- * An org.apache.tika.Tika facade class has been added to simplify
- common text extraction and type detection use cases.
- ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})
-
- * A new parse context argument was added to the Parser.parse() method.
- This context map can be used to pass things like a delegate parser
- or other settings to the parsing process. The previous parse() method
- signature has been deprecated and will be removed in Tika 1.0.
- ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})
-
- * A simple ngram-based language detection mechanism has been added
- along with predefined language profiles for 18 languages.
- ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})
-
- * The media type registry in Tika was synchronized with the MIME type
- configuration in the Apache HTTP Server. Tika now knows about 1274
- different media types and can detect 672 of those using 927 file
- extension and 280 magic byte patterns.
- ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})
-
- * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing
- PDF documents. This version is notably better than the 0.7.3 release
- used earlier.
- ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})
-
- The following people have contributed to Tika 0.5 by submitting or
- commenting on the issues resolved in this release:
-
- * Alex Baranov
-
- * Bart Hanssens
-
- * Benson Margulies
-
- * Chris A. Mattmann
-
- * Daan de Wit
-
- * Erik Hetzner
-
- * Frank Hellwig
-
- * Jeff Cadow
-
- * Joachim Zittmayr
-
- * Jukka Zitting
-
- * Julien Nioche
-
- * Ken Krugler
-
- * Maxim Valyanskiy
-
- * MRIT64
-
- * Paul Borgermans
-
- * Piotr B.
-
- * Robert Newson
-
- * Sascha Szott
-
- * Ted Dunning
-
- * Thilo Goetz
-
- * Uwe Schindler
-
- * Yuan-Fang Li
-
- See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
+ ---------------
+ Apache Tika 0.5
+ ---------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika 0.5
+
+ The most notable changes in Tika 0.5 over the previous release are:
+
+ * Improved RDF/OWL MIME type detection using both MIME magic and
+ pattern matching.
+ ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})
+
+ * An org.apache.tika.Tika facade class has been added to simplify
+ common text extraction and type detection use cases.
+ ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})
+
+ * A new parse context argument was added to the Parser.parse() method.
+ This context map can be used to pass things like a delegate parser
+ or other settings to the parsing process. The previous parse() method
+ signature has been deprecated and will be removed in Tika 1.0.
+ ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})
+
+ * A simple ngram-based language detection mechanism has been added
+ along with predefined language profiles for 18 languages.
+ ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})
+
+ * The media type registry in Tika was synchronized with the MIME type
+ configuration in the Apache HTTP Server. Tika now knows about 1274
+ different media types and can detect 672 of those using 927 file
+ extension and 280 magic byte patterns.
+ ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})
+
+ * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing
+ PDF documents. This version is notably better than the 0.7.3 release
+ used earlier.
+ ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})
+
+ The following people have contributed to Tika 0.5 by submitting or
+ commenting on the issues resolved in this release:
+
+ * Alex Baranov
+
+ * Bart Hanssens
+
+ * Benson Margulies
+
+ * Chris A. Mattmann
+
+ * Daan de Wit
+
+ * Erik Hetzner
+
+ * Frank Hellwig
+
+ * Jeff Cadow
+
+ * Joachim Zittmayr
+
+ * Jukka Zitting
+
+ * Julien Nioche
+
+ * Ken Krugler
+
+ * Maxim Valyanskiy
+
+ * MRIT64
+
+ * Paul Borgermans
+
+ * Piotr B.
+
+ * Robert Newson
+
+ * Sascha Szott
+
+ * Ted Dunning
+
+ * Thilo Goetz
+
+ * Uwe Schindler
+
+ * Yuan-Fang Li
+
+ See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
Propchange: tika/site/src/site/apt/0.5/index.apt
------------------------------------------------------------------------------
svn:eol-style = native