Posted to commits@tika.apache.org by ma...@apache.org on 2013/07/04 03:40:46 UTC

svn commit: r1499614 [7/7] - in /tika/site: publish/ publish/0.10/ publish/0.5/ publish/0.6/ publish/0.7/ publish/0.8/ publish/0.9/ publish/1.0/ publish/1.1/ publish/1.2/ publish/1.3/ publish/1.4/ src/site/ src/site/apt/ src/site/apt/1.3/ src/site/apt/...

Added: tika/site/src/site/apt/1.4/parser.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser.apt?rev=1499614&view=auto
==============================================================================
--- tika/site/src/site/apt/1.4/parser.apt (added)
+++ tika/site/src/site/apt/1.4/parser.apt Thu Jul  4 01:40:44 2013
@@ -0,0 +1,245 @@
+                       --------------------
+                       The Parser interface
+                       --------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+The Parser interface
+
+   The
+   {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   interface is the key concept of Apache Tika. It hides the complexity of
+   different file formats and parsing libraries while providing a simple and
+   powerful mechanism for client applications to extract structured text
+   content and metadata from all sorts of documents. All this is achieved
+   with a single method:
+
+---
+void parse(
+    InputStream stream, ContentHandler handler, Metadata metadata,
+    ParseContext context) throws IOException, SAXException, TikaException;
+---
+
+   The <<<parse>>> method takes the document to be parsed and related metadata
+   as input and outputs the results as XHTML SAX events and extra metadata.
+   The parse context argument is used to specify context information (like
+   the current locale) that is not related to any individual document.
+   The main criteria that lead to this design were:
+
+   [Streamed parsing] The interface should require neither the client
+     application nor the parser implementation to keep the full document
+     content in memory or spooled to disk. This allows even huge documents
+     to be parsed without excessive resource requirements.
+
+   [Structured content] A parser implementation should be able to
+     include structural information (headings, links, etc.) in the extracted
+     content. A client application can use this information for example to
+     better judge the relevance of different parts of the parsed document.
+
+   [Input metadata] A client application should be able to include metadata
+     like the file name or declared content type with the document to be
+     parsed. The parser implementation can use this information to better
+     guide the parsing process.
+
+   [Output metadata] A parser implementation should be able to return
+     document metadata in addition to document content. Many document
+     formats contain metadata like the name of the author that may be useful
+     to client applications.
+
+   [Context sensitivity] While the default settings and behaviour of Tika
+     parsers should work well for most use cases, there are still situations
+     where more fine-grained control over the parsing process is desirable.
+     It should be easy to inject such context-specific information into the
+     parsing process without breaking the layers of abstraction.
+
+   []
+
+   These criteria are reflected in the arguments of the <<<parse>>> method.
+
+* Document input stream
+
+   The first argument is an
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+   for reading the document to be parsed.
+
+   If this document stream cannot be read, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+   is passed up to the client application. If the stream can be read but
+   not parsed (for example if the document is corrupted), then the parser
+   throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+   The parser implementation will consume this stream but <will not close it>.
+   Closing the stream is the responsibility of the client application that
+   opened it in the first place. The recommended pattern for using streams
+   with the <<<parse>>> method is:
+
+---
+InputStream stream = ...;      // open the stream
+try {
+    parser.parse(stream, ...); // parse the stream
+} finally {
+    stream.close();            // close the stream
+}
+---
+
+   Some document formats like the OLE2 Compound Document Format used by
+   Microsoft Office are best parsed as random access files. In such cases the
+   content of the input stream is automatically spooled to a temporary file
+   that gets removed once parsed. A future version of Tika may make it possible
+   to avoid this extra file if the input document is already a file in the
+   local file system. See
+   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+   of this feature request.
+
+* XHTML SAX events
+
+   The parsed content of the document stream is returned to the client
+   application as a sequence of XHTML SAX events. XHTML is used to express
+   structured content of the document and SAX events enable streamed
+   processing. Note that the XHTML format is used here only to convey
+   structural information, not to render the documents for browsing!
+
+   The XHTML SAX events produced by the parser implementation are sent to a
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   instance given to the <<<parse>>> method. If the content handler
+   fails to process an event, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+   is passed up to the client application.
+
+   The overall structure of the generated event stream is (with indenting
+   added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>...</title>
+  </head>
+  <body>
+    ...
+  </body>
+</html>
+---
+
+   Parser implementations typically use the
+   {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+   utility class to generate the XHTML output.
+
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika
+   comes with a number of utility classes that can be used to process and
+   convert the event stream to other representations.
+
+   For example, the
+   {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   class can be used to extract just the body part of the XHTML output and
+   feed it either as SAX events to another content handler or as characters
+   to an output stream, a writer, or simply a string. The following code
+   snippet parses a document from the standard input stream and outputs the
+   extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+   Another useful class is
+   {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}}, which
+   uses a background thread to parse the document and returns the extracted
+   text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+    ...;                  // read the document text using the reader
+} finally {
+    reader.close();       // the document stream is closed automatically
+}
+---
+
+* Document metadata
+
+   The third argument to the <<<parse>>> method is used to pass document
+   metadata both in and out of the parser. Document metadata is expressed
+   as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+   The following are some of the more interesting metadata properties:
+
+   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+    the document.
+
+    A client application can set this property to allow the parser to use
+    file name heuristics to determine the format of the document.
+
+    The parser implementation may set this property if the file format
+    contains the canonical name of the file (for example the Gzip format
+    has a slot for the file name).
+
+   [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+    A client application can set this property based on, for example, an HTTP
+    Content-Type header. The declared content type may help the parser to
+    correctly interpret the document.
+
+    The parser implementation sets this property to the content type according
+    to which the document was parsed.
+
+   [Metadata.TITLE] The title of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit title field.
+
+   [Metadata.AUTHOR] The name of the author of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit author field.
+
+   []
+
+   Note that metadata handling is still being discussed by the Tika development
+   team, and it is likely that there will be some (backwards incompatible)
+   changes in metadata handling before Tika 1.0.
+
+* Parse context
+
+   The final argument to the <<<parse>>> method is used to inject
+   context-specific information into the parsing process. This is useful
+   for example when dealing with locale-specific date and number formats
+   in Microsoft Excel spreadsheets. Another important use of the parse
+   context is passing in the delegate parser instance to be used by
+   two-phase parsers like the
+   {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+   Some parser classes allow customization of the parsing process through
+   strategy objects in the parse context.
+
+* Parser implementations
+
+   Apache Tika comes with a number of parser classes for parsing
+   {{{formats.html}various document formats}}. You can also extend Tika
+   with your own parsers, and of course any contributions to Tika are
+   warmly welcome.
+
+   The goal of Tika is to reuse existing parser libraries like
+   {{{http://www.pdfbox.org/}PDFBox}} or
+   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+   of the parser classes in Tika are adapters to such external libraries.
+
+   Tika also contains some general purpose parser implementations that are
+   not targeted at any specific document formats. The most notable of these
+   is the {{{apidocs/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+   class that encapsulates all Tika functionality into a single parser that
+   can handle any type of document. This parser will automatically determine
+   the type of the incoming document based on various heuristics and will then
+   parse the document accordingly.
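
The "XHTML SAX events" section of parser.apt above describes an event stream with a head and a body, and mentions that BodyContentHandler extracts just the body part. The following is a minimal sketch of that idea using only JDK SAX classes. It is not Tika's BodyContentHandler: the names BodyTextSketch, BodyTextCollector and extractBodyText are hypothetical, and the JDK's own SAX parser stands in for a Tika parser as the source of the XHTML events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of Apache Tika.
class BodyTextSketch {

    /** Collects character data that occurs inside the body element. */
    static class BodyTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than 0 while inside <body>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Feeds an XHTML string through SAX and returns only the body text. */
    static String extractBodyText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        BodyTextCollector collector = new BodyTextCollector();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

Because the collector only reacts to events as they arrive, it never holds the whole document in memory, which is the streamed-parsing property the design criteria ask for. With a real Tika parser, the same handler could presumably be passed as the second argument of parse.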

Added: tika/site/src/site/apt/1.4/parser_guide.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser_guide.apt?rev=1499614&view=auto
==============================================================================
--- tika/site/src/site/apt/1.4/parser_guide.apt (added)
+++ tika/site/src/site/apt/1.4/parser_guide.apt Thu Jul  4 01:40:44 2013
@@ -0,0 +1,135 @@
+                       --------------------------------------------
+                       Get Tika parsing up and running in 5 minutes
+                       --------------------------------------------
+					   Arturo Beltran
+					   --------------------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Get Tika parsing up and running in 5 minutes
+
+   This page is a quick start guide showing how to add a new parser to Apache Tika.
+   By following the simple steps listed below, your new parser can be running in only 5 minutes.
+
+%{toc|section=1|fromDepth=1}
+
+* {Getting Started}
+
+   The {{{gettingstarted.html}Getting Started}} document describes how to 
+   build Apache Tika from sources and how to start using Tika in an application. Pay close attention 
+   and follow the instructions in the "Getting and building the sources" section.
+   
+
+* {Add your MIME-Type}
+
+   You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}}
+   so that Tika can map the file extension to its MIME-Type. You should add something like this:
+   
+---
+ <mime-type type="application/hello">
+	<glob pattern="*.hi"/>
+ </mime-type>
+---
+
+* {Create your Parser class}
+
+   Now, you need to create your new parser. This is a class that must implement the Parser interface 
+   offered by Tika. A very simple Tika Parser looks like this:
+   
+---
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * 
+ * @Author: Arturo Beltran
+ */
+package org.apache.tika.parser.hello;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.Set;
+
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
+import org.apache.tika.sax.XHTMLContentHandler;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+public class HelloParser implements Parser {
+
+	private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello"));
+	public static final String HELLO_MIME_TYPE = "application/hello";
+	
+	public Set<MediaType> getSupportedTypes(ParseContext context) {
+		return SUPPORTED_TYPES;
+	}
+
+	public void parse(
+			InputStream stream, ContentHandler handler,
+			Metadata metadata, ParseContext context)
+			throws IOException, SAXException, TikaException {
+
+		metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
+		metadata.set("Hello", "World");
+
+		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+		xhtml.startDocument();
+		xhtml.endDocument();
+	}
+
+	/**
+	 * @deprecated This method will be removed in Apache Tika 1.0.
+	 */
+	public void parse(
+			InputStream stream, ContentHandler handler, Metadata metadata)
+			throws IOException, SAXException, TikaException {
+		parse(stream, handler, metadata, new ParseContext());
+	}
+}
+---
+   
+   Pay special attention to the definition of the SUPPORTED_TYPES static class 
+   field in the parser class that defines what MIME-Types it supports. 
+   
+   The "parse" method is where you will do all your work, that is, extract
+   the information from the resource and then set the metadata.
+
+* {List the new parser}
+
+   Finally, you should explicitly tell the AutoDetectParser to include your new 
+   parser. This step is only needed if you want to use the AutoDetectParser functionality. 
+   If you figure out the correct parser in a different way, it isn't needed. 
+   
+   List your new parser in:
+    {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}}
+   
+
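
For illustration, the "List the new parser" step of parser_guide.apt above would probably amount to adding one line to the services file. This sketch assumes the HelloParser class keeps the package org.apache.tika.parser.hello shown in the guide, and that the file uses the standard Java service-provider format (one fully qualified class name per line, with # starting a comment). The existing entries in that file are not shown here.

```text
# hypothetical addition to META-INF/services/org.apache.tika.parser.Parser
org.apache.tika.parser.hello.HelloParser
```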

Modified: tika/site/src/site/apt/download.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/apt/download.apt (original)
+++ tika/site/src/site/apt/download.apt Thu Jul  4 01:40:44 2013
@@ -19,19 +19,19 @@
 
 Download Apache Tika
 
-   Apache Tika 1.3 is now available.
-   See the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}}
+   Apache Tika 1.4 is now available.
+   See the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}}
    file for more information on the list of updates in this initial release.
 
-   * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.3-src.zip}apache-tika-1.3-src.zip}}
-     (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.3-src.zip.asc}PGP signature}})\
-     SHA1: <<<a80e45d1976e655381d6e93b50b9c7b118e9d6fc>>>\
-     MD5: <<<ce6cf28866e64201775261e0b558f84e>>>
-
-   * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.3.jar}tika-app-1.3.jar}}
-     (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.3.jar.asc}PGP signature}})\
-     SHA1: <<<fb5786dfe4fa19a651c9f6d9417336127b34ddc2>>>\
-     MD5: <<<783dd0f77b2b2fe39fe957657d3c5005>>>
+   * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-1.4-src.zip}apache-tika-1.4-src.zip}}
+     (source archive, {{{http://www.apache.org/dist/tika/tika-1.4-src.zip.asc}PGP signature}})\
+     SHA1: <<<84ce9ebc104ca348a3cd8e95ec31a96169548c13>>>\
+     MD5: <<<6daa446b1dfb08888169d558263416d7>>>
+
+   * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.4.jar}tika-app-1.4.jar}}
+     (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.4.jar.asc}PGP signature}})\
+     SHA1: <<<e91c758149ce9ce799fff184e9bf3aabda394abc>>>\
+     MD5: <<<53936b30a84a933389ea959a36dd963e>>>
 
    []
 

Modified: tika/site/src/site/apt/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/apt/index.apt (original)
+++ tika/site/src/site/apt/index.apt Thu Jul  4 01:40:44 2013
@@ -23,7 +23,7 @@ Apache Tika - a content analysis toolkit
    structured text content from various documents using existing parser
    libraries. You can find the latest release on the
    {{{./download.html}download page}}. See the
-   {{{./1.2/gettingstarted.html}Getting Started}} guide for instructions on
+   {{{./1.4/gettingstarted.html}Getting Started}} guide for instructions on
    how to start using Tika.
 
    Tika is a project of the
@@ -32,6 +32,12 @@ Apache Tika - a content analysis toolkit
 
 Latest News
 
+   [3 July 2013: Apache Tika Release]
+    Apache Tika 1.4 has been released! This release includes several important bugfixes
+    and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}}
+    file for a full list of changes in this release, and have a look at the download
+    page for more information on how to obtain Apache Tika 1.4.
+
    [22 January 2013: Apache Tika Release]
     Apache Tika 1.3 has been released! This release includes several important bugfixes
     and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}}

Modified: tika/site/src/site/site.xml
URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1499614&r1=1499613&r2=1499614&view=diff
==============================================================================
--- tika/site/src/site/site.xml (original)
+++ tika/site/src/site/site.xml Thu Jul  4 01:40:44 2013
@@ -39,7 +39,15 @@
       <item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/>
     </menu>
     <menu name="Documentation">
-      <item name="Apache Tika 1.3" href="1.3/index.html">
+      <item name="Apache Tika 1.4" href="1.4/index.html">
+        <item name="Getting Started" href="1.4/gettingstarted.html"/>
+        <item name="Supported Formats" href="1.4/formats.html"/>
+        <item name="Parser API" href="1.4/parser.html"/>
+        <item name="Parser 5min Quick Start Guide" href="1.4/parser_guide.html"/>
+        <item name="Content and Language Detection" href="1.4/detection.html"/>
+        <item name="API Documentation" href="1.4/api/"/>
+      </item>
+      <item name="Apache Tika 1.3" href="1.3/index.html" collapse="true">
         <item name="Getting Started" href="1.3/gettingstarted.html"/>
         <item name="Supported Formats" href="1.3/formats.html"/>
         <item name="Parser API" href="1.3/parser.html"/>
@@ -71,14 +79,6 @@
         <item name="Content and Language Detection" href="1.0/detection.html"/>
         <item name="API Documentation" href="1.0/api/"/>
       </item>
-      <item name="Apache Tika 0.10" href="0.10/index.html" collapse="true">
-        <item name="Getting Started" href="0.10/gettingstarted.html"/>
-        <item name="Supported Formats" href="0.10/formats.html"/>
-        <item name="Parser API" href="0.10/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="0.10/parser_guide.html"/>
-        <item name="Content and Language Detection" href="0.10/detection.html"/>
-        <item name="API Documentation" href="0.10/api/"/>
-      </item>
     </menu>
     <menu name="The Apache Software Foundation">
       <item name="About" href="http://www.apache.org/foundation/"/>