Posted to dev@tika.apache.org by "Md (JIRA)" <ji...@apache.org> on 2019/07/08 16:39:00 UTC
[jira] [Updated] (TIKA-2901) Tika extracting points from Chart
[ https://issues.apache.org/jira/browse/TIKA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Md updated TIKA-2901:
---------------------
Attachment: Chart_data_sample_text_possible_issue.docx.txt
Chart_data_sample_text_possible_issue.docx
> Tika extracting points from Chart
> ----------------------------------
>
> Key: TIKA-2901
> URL: https://issues.apache.org/jira/browse/TIKA-2901
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.21
> Reporter: Md
> Priority: Major
> Attachments: Chart_data_sample_text_possible_issue.docx, Chart_data_sample_text_possible_issue.docx.txt
>
>
> I am using Tika to extract content from *.docx and other files. I have noticed that Tika extracts the data points from charts and places them at the end of the extracted text.
> I am using the following code for extraction:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
> Please find attached the input file and the corresponding output from Tika.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)