Posted to commits@tika.apache.org by ma...@apache.org on 2013/07/04 03:40:46 UTC
svn commit: r1499614 [7/7] - in /tika/site: publish/ publish/0.10/
publish/0.5/ publish/0.6/ publish/0.7/ publish/0.8/ publish/0.9/
publish/1.0/ publish/1.1/ publish/1.2/ publish/1.3/ publish/1.4/ src/site/
src/site/apt/ src/site/apt/1.3/ src/site/apt/...
Added: tika/site/src/site/apt/1.4/parser.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser.apt?rev=1499614&view=auto
==============================================================================
--- tika/site/src/site/apt/1.4/parser.apt (added)
+++ tika/site/src/site/apt/1.4/parser.apt Thu Jul 4 01:40:44 2013
@@ -0,0 +1,245 @@
+ --------------------
+ The Parser interface
+ --------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+The Parser interface
+
+ The
+ {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+ interface is the key concept of Apache Tika. It hides the complexity of
+ different file formats and parsing libraries while providing a simple and
+ powerful mechanism for client applications to extract structured text
+ content and metadata from all sorts of documents. All this is achieved
+ with a single method:
+
+---
+void parse(
+ InputStream stream, ContentHandler handler, Metadata metadata,
+ ParseContext context) throws IOException, SAXException, TikaException;
+---
+
+ The <<<parse>>> method takes the document to be parsed and related metadata
+ as input and outputs the results as XHTML SAX events and extra metadata.
+ The parse context argument is used to specify context information (like
+ the current locale) that is not related to any individual document.
+ The main criteria that lead to this design were:
+
+ [Streamed parsing] The interface should require neither the client
+ application nor the parser implementation to keep the full document
+ content in memory or spooled to disk. This allows even huge documents
+ to be parsed without excessive resource requirements.
+
+ [Structured content] A parser implementation should be able to
+ include structural information (headings, links, etc.) in the extracted
+ content. A client application can use this information for example to
+ better judge the relevance of different parts of the parsed document.
+
+ [Input metadata] A client application should be able to include metadata
+ like the file name or declared content type with the document to be
+ parsed. The parser implementation can use this information to better
+ guide the parsing process.
+
+ [Output metadata] A parser implementation should be able to return
+ document metadata in addition to document content. Many document
+ formats contain metadata like the name of the author that may be useful
+ to client applications.
+
+ [Context sensitivity] While the default settings and behaviour of Tika
+ parsers should work well for most use cases, there are still situations
+ where more fine-grained control over the parsing process is desirable.
+ It should be easy to inject such context-specific information to the
+ parsing process without breaking the layers of abstraction.
+
+ []
+
+ These criteria are reflected in the arguments of the <<<parse>>> method.
+
+* Document input stream
+
+ The first argument is an
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+ for reading the document to be parsed.
+
+ If this document stream cannot be read, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+ is passed up to the client application. If the stream can be read but
+ not parsed (for example if the document is corrupted), then the parser
+ throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+ The parser implementation will consume this stream but <will not close it>.
+ Closing the stream is the responsibility of the client application that
+ opened it in the first place. The recommended pattern for using streams
+ with the <<<parse>>> method is:
+
+---
+InputStream stream = ...; // open the stream
+try {
+ parser.parse(stream, ...); // parse the stream
+} finally {
+ stream.close(); // close the stream
+}
+---
+
+ Some document formats like the OLE2 Compound Document Format used by
+ Microsoft Office are best parsed as random access files. In such cases the
+ content of the input stream is automatically spooled to a temporary file
+ that gets removed once parsed. A future version of Tika may make it possible
+ to avoid this extra file if the input document is already a file in the
+ local file system. See
+ {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+ of this feature request.
+
+* XHTML SAX events
+
+ The parsed content of the document stream is returned to the client
+ application as a sequence of XHTML SAX events. XHTML is used to express
+ structured content of the document and SAX events enable streamed
+ processing. Note that the XHTML format is used here only to convey
+ structural information, not to render the documents for browsing!
+
+ The XHTML SAX events produced by the parser implementation are sent to a
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+ instance given to the <<<parse>>> method. If the content handler
+ fails to process an event, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+ is passed up to the client application.
+
+ The overall structure of the generated event stream is (with indenting
+ added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>...</title>
+ </head>
+ <body>
+ ...
+ </body>
+</html>
+---
+
+ Parser implementations typically use the
+ {{{api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+ utility class to generate the XHTML output.
+
+ Dealing with the raw SAX events can be a bit complex, so Apache Tika
+ comes with a number of utility classes that can be used to process and
+ convert the event stream to other representations.
+
+ For example, the
+ {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+ class can be used to extract just the body part of the XHTML output and
+ feed it either as SAX events to another content handler or as characters
+ to an output stream, a writer, or simply a string. The following code
+ snippet parses a document from the standard input stream and outputs the
+ extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
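
To make the body-extraction idea concrete without depending on Tika, here is a minimal JDK-only sketch. The names BodyTextSketch, BodyTextCollector and extractBodyText are hypothetical and are not part of the Tika API; the sketch parses an XHTML string with the JDK SAX parser instead of receiving events from a Tika parser, and it only illustrates how a handler can keep the text found inside the body element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class BodyTextSketch {

    /** Hypothetical handler (not a Tika class): collects text inside <body> only. */
    static final class BodyTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(localName)) {
                inBody = true;   // start collecting character events
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(localName)) {
                inBody = false;  // stop collecting once the body is closed
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Feeds the given XHTML through the JDK SAX parser and returns the body text. */
    static String extractBodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            BodyTextCollector collector = new BodyTextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

The title text is dropped because it arrives before the body element is opened, which is the same separation of head and body that the event stream structure above makes possible.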
+
+ Another useful class is
+ {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+ uses a background thread to parse the document and returns the extracted
+ text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+ ...; // read the document text using the reader
+} finally {
+ reader.close(); // the document stream is closed automatically
+}
+---
+
+* Document metadata
+
+ The third argument to the <<<parse>>> method is used to pass document
+ metadata both in and out of the parser. Document metadata is expressed
+ as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+ The following are some of the more interesting metadata properties:
+
+ [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+ the document.
+
+ A client application can set this property to allow the parser to use
+ file name heuristics to determine the format of the document.
+
+ The parser implementation may set this property if the file format
+ contains the canonical name of the file (for example the Gzip format
+ has a slot for the file name).
+
+ [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+ A client application can set this property based on, for example, an HTTP
+ Content-Type header. The declared content type may help the parser to
+ correctly interpret the document.
+
+ The parser implementation sets this property to the content type according
+ to which the document was parsed.
+
+ [Metadata.TITLE] The title of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit title field.
+
+ [Metadata.AUTHOR] The name of the author of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit author field.
+
+ []
+
+ Note that metadata handling is still being discussed by the Tika development
+ team, and it is likely that there will be some (backwards incompatible)
+ changes in metadata handling before Tika 1.0.
+
+* Parse context
+
+ The final argument to the <<<parse>>> method is used to inject
+ context-specific information to the parsing process. This is useful
+ for example when dealing with locale-specific date and number formats
+ in Microsoft Excel spreadsheets. Another important use of the parse
+ context is passing in the delegate parser instance to be used by
+ two-phase parsers like the
+ {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+ Some parser classes allow customization of the parsing process through
+ strategy objects in the parse context.
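
The general idea of injecting context can be sketched with plain JDK classes. The following is an illustration only: MiniContext, ContextSketch and formatNumber are hypothetical names, and the class-keyed lookup is an assumption about how such a context object can be built, not a copy of Tika's ParseContext. It shows a caller placing a Locale into the context and a parser-like method falling back to a default when nothing was supplied.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ContextSketch {

    /** Hypothetical miniature context (not Tika's ParseContext): one value per class key. */
    static final class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Formats a number with the locale found in the context, or Locale.ROOT if none was set. */
    static String formatNumber(double number, MiniContext context) {
        Locale locale = context.get(Locale.class, Locale.ROOT);
        return String.format(locale, "%.2f", number);
    }

    public static void main(String[] args) {
        MiniContext german = new MiniContext();
        german.set(Locale.class, Locale.GERMANY);
        System.out.println(formatNumber(1234.5, german));            // prints "1234,50"
        System.out.println(formatNumber(1234.5, new MiniContext())); // prints "1234.50"
    }
}
```

Keying the entries by class keeps the method signature stable: new kinds of context information can be added later without changing the arguments of the parsing method.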
+
+* Parser implementations
+
+ Apache Tika comes with a number of parser classes for parsing
+ {{{formats.html}various document formats}}. You can also extend Tika
+ with your own parsers, and of course any contributions to Tika are
+ warmly welcome.
+
+ The goal of Tika is to reuse existing parser libraries like
+ {{{http://www.pdfbox.org/}PDFBox}} or
+ {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+ of the parser classes in Tika are adapters to such external libraries.
+
+ Tika also contains some general purpose parser implementations that are
+ not targeted at any specific document formats. The most notable of these
+ is the {{{api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+ class that encapsulates all Tika functionality into a single parser that
+ can handle any type of document. This parser will automatically determine
+ the type of the incoming document based on various heuristics and will then
+ parse the document accordingly.
Added: tika/site/src/site/apt/1.4/parser_guide.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser_guide.apt?rev=1499614&view=auto
==============================================================================
--- tika/site/src/site/apt/1.4/parser_guide.apt (added)
+++ tika/site/src/site/apt/1.4/parser_guide.apt Thu Jul 4 01:40:44 2013
@@ -0,0 +1,135 @@
+ --------------------------------------------
+ Get Tika parsing up and running in 5 minutes
+ --------------------------------------------
+ Arturo Beltran
+ --------------------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Get Tika parsing up and running in 5 minutes
+
+ This page is a quick start guide showing how to add a new parser to Apache Tika.
+ By following the simple steps listed below, you can have your new parser running in only 5 minutes.
+
+%{toc|section=1|fromDepth=1}
+
+* {Getting Started}
+
+ The {{{gettingstarted.html}Getting Started}} document describes how to
+ build Apache Tika from sources and how to start using Tika in an application. Pay close attention
+ and follow the instructions in the "Getting and building the sources" section.
+
+
+* {Add your MIME-Type}
+
+ You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}}
+ so that Tika can map the file extension to its MIME-Type. You should add something like this:
+
+---
+ <mime-type type="application/hello">
+ <glob pattern="*.hi"/>
+ </mime-type>
+---
+
+* {Create your Parser class}
+
+ Now, you need to create your new parser. This is a class that must implement the Parser interface
+ offered by Tika. A very simple Tika Parser looks like this:
+
+---
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * @Author: Arturo Beltran
+ */
+package org.apache.tika.parser.hello;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.Set;
+
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
+import org.apache.tika.sax.XHTMLContentHandler;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+public class HelloParser implements Parser {
+
+ private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello"));
+ public static final String HELLO_MIME_TYPE = "application/hello";
+
+ public Set<MediaType> getSupportedTypes(ParseContext context) {
+ return SUPPORTED_TYPES;
+ }
+
+ public void parse(
+ InputStream stream, ContentHandler handler,
+ Metadata metadata, ParseContext context)
+ throws IOException, SAXException, TikaException {
+
+ metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
+ metadata.set("Hello", "World");
+
+ XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+ xhtml.startDocument();
+ xhtml.endDocument();
+ }
+
+ /**
+ * @deprecated This method will be removed in Apache Tika 1.0.
+ */
+ public void parse(
+ InputStream stream, ContentHandler handler, Metadata metadata)
+ throws IOException, SAXException, TikaException {
+ parse(stream, handler, metadata, new ParseContext());
+ }
+}
+---
+
+ Pay special attention to the definition of the SUPPORTED_TYPES static class
+ field in the parser class that defines what MIME-Types it supports.
+
+ The "parse" method is where you will do all your work: extract
+ the information from the resource and then set the metadata.
+
+* {List the new parser}
+
+ Finally, you should explicitly tell the AutoDetectParser to include your new
+ parser. This step is only needed if you want to use the AutoDetectParser functionality.
+ If you figure out the correct parser in a different way, it isn't needed.
+
+ List your new parser in:
+ {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}}
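
Files under META-INF/services follow the Java service provider convention of one fully qualified class name per line. Assuming the HelloParser class shown earlier, with its package unchanged, the line to add would look like this:

```
org.apache.tika.parser.hello.HelloParser
```

If your parser lives in a different package, list that fully qualified name instead.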
+
+
Modified: tika/site/src/site/apt/download.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/apt/download.apt (original)
+++ tika/site/src/site/apt/download.apt Thu Jul 4 01:40:44 2013
@@ -19,19 +19,19 @@
Download Apache Tika
- Apache Tika 1.3 is now available.
- See the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}}
+ Apache Tika 1.4 is now available.
+ See the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}}
file for more information on the list of updates in this initial release.
- * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.3-src.zip}apache-tika-1.3-src.zip}}
- (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.3-src.zip.asc}PGP signature}})\
- SHA1: <<<a80e45d1976e655381d6e93b50b9c7b118e9d6fc>>>\
- MD5: <<<ce6cf28866e64201775261e0b558f84e>>>
-
- * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.3.jar}tika-app-1.3.jar}}
- (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.3.jar.asc}PGP signature}})\
- SHA1: <<<fb5786dfe4fa19a651c9f6d9417336127b34ddc2>>>\
- MD5: <<<783dd0f77b2b2fe39fe957657d3c5005>>>
+ * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-1.4-src.zip}apache-tika-1.4-src.zip}}
+ (source archive, {{{http://www.apache.org/dist/tika/tika-1.4-src.zip.asc}PGP signature}})\
+ SHA1: <<<84ce9ebc104ca348a3cd8e95ec31a96169548c13>>>\
+ MD5: <<<6daa446b1dfb08888169d558263416d7>>>
+
+ * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.4.jar}tika-app-1.4.jar}}
+ (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.4.jar.asc}PGP signature}})\
+ SHA1: <<<e91c758149ce9ce799fff184e9bf3aabda394abc>>>\
+ MD5: <<<53936b30a84a933389ea959a36dd963e>>>
[]
Modified: tika/site/src/site/apt/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/apt/index.apt (original)
+++ tika/site/src/site/apt/index.apt Thu Jul 4 01:40:44 2013
@@ -23,7 +23,7 @@ Apache Tika - a content analysis toolkit
structured text content from various documents using existing parser
libraries. You can find the latest release on the
{{{./download.html}download page}}. See the
- {{{./1.2/gettingstarted.html}Getting Started}} guide for instructions on
+ {{{./1.4/gettingstarted.html}Getting Started}} guide for instructions on
how to start using Tika.
Tika is a project of the
@@ -32,6 +32,12 @@ Apache Tika - a content analysis toolkit
Latest News
+ [3 July 2013: Apache Tika Release]
+ Apache Tika 1.4 has been released! This release includes several important bugfixes
+ and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}}
+ file for a full list of changes in this release, and have a look at the download
+ page for more information on how to obtain Apache Tika 1.4.
+
[22 January 2013: Apache Tika Release]
Apache Tika 1.3 has been released! This release includes several important bugfixes
and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}}
Modified: tika/site/src/site/site.xml
URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/site.xml (original)
+++ tika/site/src/site/site.xml Thu Jul 4 01:40:44 2013
@@ -39,7 +39,15 @@
<item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/>
</menu>
<menu name="Documentation">
- <item name="Apache Tika 1.3" href="1.3/index.html">
+ <item name="Apache Tika 1.4" href="1.4/index.html">
+ <item name="Getting Started" href="1.4/gettingstarted.html"/>
+ <item name="Supported Formats" href="1.4/formats.html"/>
+ <item name="Parser API" href="1.4/parser.html"/>
+ <item name="Parser 5min Quick Start Guide" href="1.4/parser_guide.html"/>
+ <item name="Content and Language Detection" href="1.4/detection.html"/>
+ <item name="API Documentation" href="1.4/api/"/>
+ </item>
+ <item name="Apache Tika 1.3" href="1.3/index.html" collapse="true">
<item name="Getting Started" href="1.3/gettingstarted.html"/>
<item name="Supported Formats" href="1.3/formats.html"/>
<item name="Parser API" href="1.3/parser.html"/>
@@ -71,14 +79,6 @@
<item name="Content and Language Detection" href="1.0/detection.html"/>
<item name="API Documentation" href="1.0/api/"/>
</item>
- <item name="Apache Tika 0.10" href="0.10/index.html" collapse="true">
- <item name="Getting Started" href="0.10/gettingstarted.html"/>
- <item name="Supported Formats" href="0.10/formats.html"/>
- <item name="Parser API" href="0.10/parser.html"/>
- <item name="Parser 5min Quick Start Guide" href="0.10/parser_guide.html"/>
- <item name="Content and Language Detection" href="0.10/detection.html"/>
- <item name="API Documentation" href="0.10/api/"/>
- </item>
</menu>
<menu name="The Apache Software Foundation">
<item name="About" href="http://www.apache.org/foundation/"/>