Posted to dev@tika.apache.org by "Jonathan LI (JIRA)" <ji...@apache.org> on 2011/07/20 22:26:57 UTC

[jira] [Updated] (TIKA-684) Partial/Incomplete text extraction for certain Powerpoint files

     [ https://issues.apache.org/jira/browse/TIKA-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan LI updated TIKA-684:
-----------------------------

    Attachment: 2eebe3db1196aa8ea58c9be83965f0eb.ppt

Source file from Enron Sample Data Set - http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2

License: Creative Commons Attribution 3.0 United States License.

> Partial/Incomplete text extraction for certain Powerpoint files
> ---------------------------------------------------------------
>
>                 Key: TIKA-684
>                 URL: https://issues.apache.org/jira/browse/TIKA-684
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Jonathan LI
>         Attachments: 2eebe3db1196aa8ea58c9be83965f0eb.ppt
>
>
> An example file exhibiting the issue is attached.
> Tika throws an exception during text extraction from certain PowerPoint files.  In this example file, the extracted text stops at slide 37; the text from slides 38-40 is missing.
> Tested via both the Tika library and the Tika GUI. Apache POI (3.8 beta 3 and 3.7) has no trouble extracting text from this file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira