Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/08/17 09:32:00 UTC
[jira] [Commented] (TIKA-3526) i cant extract content from
attachments in the document
[ https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400288#comment-17400288 ]
Tim Allison commented on TIKA-3526:
-----------------------------------
Can you share example files please?
Also, how are you calling Tika?
> i cant extract content from attachments in the document
> -------------------------------------------------------
>
> Key: TIKA-3526
> URL: https://issues.apache.org/jira/browse/TIKA-3526
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: matcha007
> Priority: Major
>
> Office documents can contain other Office documents as attachments. Can the contents of the attachments be extracted, as shown in the table below?
>
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>
> 1. If we are using Tika incorrectly, please show us the correct way.
> {code:java}
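> File file = new File("XX");
> // Assumed: the original report does not show how inputStream and handler are created.
> InputStream inputStream = new FileInputStream(file);
> ContentHandler handler = new BodyContentHandler();
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);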
> {code}
>
> 2. We use Tika 1.20. We have also tried the latest version, 2.0, and the problem still exists.
>
> 3. If this is indeed a gap in the current version, please address it in a later version.
>
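For comparison with the snippet in the report, here is a minimal sketch of one common way to call Tika so that embedded documents are handed back to a parser. It is not a diagnosis of this issue, and nothing in the thread confirms it changes the results in the table. It assumes that tika-core and the full parsers module are on the classpath. The class name EmbeddedTextSketch and the method extractText are hypothetical names chosen for illustration. The key difference from the report is that the parser is registered in the ParseContext, which Tika's default embedded-document handling looks up when it encounters an attachment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name; a sketch, not the project's recommended implementation.
public class EmbeddedTextSketch {

    // Returns the text of the container document plus any embedded
    // documents that the registered parser is able to handle.
    static String extractText(InputStream in)
            throws IOException, SAXException, TikaException {
        // AutoDetectParser picks a parser based on the detected type,
        // instead of hard-coding OfficeParser.
        Parser parser = new AutoDetectParser();

        // Register the parser in the context so that embedded documents
        // are delegated to it rather than skipped.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // -1 disables the default write limit on the handler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the container document to parse.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

Running the class with the path of a container file as its only argument prints the extracted text, which makes it easy to compare against the output of the OfficeParser-only call in the report.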
--
This message was sent by Atlassian Jira
(v8.3.4#803005)