Posted to commits@tika.apache.org by ni...@apache.org on 2017/05/08 18:13:15 UTC

svn commit: r1794435 - /tika/site/src/site/apt/1.15/formats.apt

Author: nick
Date: Mon May  8 18:13:15 2017
New Revision: 1794435

URL: http://svn.apache.org/viewvc?rev=1794435&view=rev
Log:
EMF, and mention the existence of the NER, NLP and OR parsers

Modified:
    tika/site/src/site/apt/1.15/formats.apt

Modified: tika/site/src/site/apt/1.15/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.15/formats.apt?rev=1794435&r1=1794434&r2=1794435&view=diff
==============================================================================
--- tika/site/src/site/apt/1.15/formats.apt (original)
+++ tika/site/src/site/apt/1.15/formats.apt Mon May  8 18:13:15 2017
@@ -209,6 +209,9 @@ Supported Document Formats
 
    The {{{./api/org/apache/tika/parser/microsoft/WMFParser.html}WMFParser}}
    class extracts simple text from Microsoft WMF drawings.
+   The {{{./api/org/apache/tika/parser/microsoft/EMFParser.html}EMFParser}}
+   class extracts simple text from Microsoft EMF drawings, and also
+   exposes any other embedded resources or files.
 
 * {Video formats}
 
@@ -264,7 +267,7 @@ Supported Document Formats
    extract email messages from the Microsoft Outlook PST email format.
 
    The {{{./api/org/apache/tika/parser/microsoft/OutlookExtractor.html}OutlookExtractor}} (part of 
-   {{{./api/org/apache/tika/parser/microsoft/OfficeParser}OfficeParser}})
+   {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}})
    is able to extract email messages from the Microsoft Outlook MSG email
    format.
 
@@ -352,6 +355,36 @@ Supported Document Formats
    {{{http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml} digitalpreservation.gov}}
    for background on this format.
 
+* {Natural Language Processing}
+
+   Tika supports calling out to a number of Natural Language Processing and
+   Named Entity Recognition frameworks, tools and libraries. 
+
+   These can be used to support additional formats, or to gain extra information on 
+   existing formats. In many cases, additional tools, REST services or training
+   datasets are required to enable or power this support.
+
+   Details on the requirements and setup steps are generally given either in
+   the parser's javadocs, or on the {{{https://wiki.apache.org/tika/}Tika wiki}}.
+
+   The {{{./api/org/apache/tika/parser/sentiment/analysis/SentimentParser.html}SentimentParser}}
+   class classifies documents by their sentiment, using Apache
+   OpenNLP's Maximum Entropy Classifier.
+
+   {{{./api/org/apache/tika/parser/journal/JournalParser.html}JournalParser}} uses
+   Grobid (via a RESTful server) to extract additional metadata from the text of
+   journal publications. A number of other NLP and NER parsers are available in the
+   {{{./api/org/apache/tika/parser/ner/}ner package}}.
+
+* {Image and Video object recognition}
+
+   Tika supports calling out to a number of Object Recognition frameworks to
+   analyse the contents of images and videos. Large training datasets and/or
+   frameworks are generally required, often accessed via REST services. The
+   {{{./api/org/apache/tika/parser/recognition/}recognition package}} contains
+   most of these. Details on the requirements and setup steps are generally given
+   on the {{{https://wiki.apache.org/tika/}Tika wiki}}.
+
 
 Full list of Supported Formats