Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/08/09 17:20:27 UTC
[jira] [Closed] (NUTCH-463) Nutch powerpoint parser plugin fails to parse ppt with images
[ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-463.
-------------------------------
Resolution: Won't Fix
Parsing delegated to Tika
> Nutch powerpoint parser plugin fails to parse ppt with images
> -------------------------------------------------------------
>
> Key: NUTCH-463
> URL: https://issues.apache.org/jira/browse/NUTCH-463
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8.1
> Environment: Windows
> Reporter: W Fong
>
> With PowerPoint presentations that contain images, the parser seems to treat the images as text and tries to index them, which results in maxFieldLength being reached.
> The lines from the crawl log file for the powerpoint in question:
> Indexing [http://127.0.0.1/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
> Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
> maxFieldLength 10000 reached, ignoring following tokens
>
> The parser should extract only the text and skip the images.
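
The effect described in the report can be illustrated with a minimal sketch. It does not use Nutch or Lucene code. It only assumes that indexing keeps the first maxFieldLength tokens of a field and drops the rest, as the log line suggests. The function name index_tokens and the sample tokens are hypothetical.

```python
MAX_FIELD_LENGTH = 10000  # limit taken from the log line in the report


def index_tokens(tokens, max_field_length=MAX_FIELD_LENGTH):
    """Keep at most max_field_length tokens; report whether any were dropped."""
    kept = []
    truncated = False
    for token in tokens:
        if len(kept) >= max_field_length:
            truncated = True  # corresponds to "ignoring following tokens"
            break
        kept.append(token)
    return kept, truncated


# Hypothetical parser output: image bytes misread as 9990 junk tokens,
# followed by 50 tokens of real slide text.
junk_tokens = ["junk"] * 9990
text_tokens = ["word%d" % i for i in range(50)]

kept, truncated = index_tokens(junk_tokens + text_tokens)
real_words_indexed = [t for t in kept if t.startswith("word")]
```

Under these assumptions only the first 10 of the 50 real words fit under the limit, which is why a parser that skips image data would index more of the actual slide text.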
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira