Posted to dev@tika.apache.org by "Jason Guo (Jira)" <ji...@apache.org> on 2022/07/13 08:20:00 UTC
[jira] [Created] (TIKA-3816) Tika cannot parse the text in the table(Microsoft word)
Jason Guo created TIKA-3816:
-------------------------------
Summary: Tika cannot parse the text in the table(Microsoft word)
Key: TIKA-3816
URL: https://issues.apache.org/jira/browse/TIKA-3816
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.3.0
Environment: OS : Windows 10,
Software Platform : Java
Reporter: Jason Guo
Fix For: 2.4.2
Attachments: test1.docx
I am trying to parse a Microsoft Word document (.doc) that contains a table holding a select component and some text, but Tika does not extract the text in the table.
The code I am using to parse the document is below:
public static byte[] convertToByteArray(byte[] bytes) throws Exception {
    Tika tika = new Tika();
    if (bytes.length > tika.getMaxStringLength()) {
        tika.setMaxStringLength(bytes.length);
    }
    String result = tika.parseToString(new ByteArrayInputStream(bytes));
    byte[] rv = result.getBytes();
    return rv;
}
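A side note on the snippet above, separate from the table problem: result.getBytes() encodes with the JVM default charset, which was not necessarily UTF-8 on Windows before Java 18. Characters the default charset cannot represent are replaced during encoding, so extracted text can look damaged even when parsing succeeded. The following is a minimal sketch of an explicit UTF-8 round trip, assuming UTF-8 is the desired output encoding. The class and method names (ExtractedTextBytes, toUtf8, fromUtf8) are hypothetical and not part of Tika, and the sample string only stands in for the parseToString result.

```java
import java.nio.charset.StandardCharsets;

final class ExtractedTextBytes {
    // Hypothetical helper (not a Tika API): encode extracted text as UTF-8
    // regardless of the JVM's default charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset so the round trip is lossless.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the parser (assumption).
        String extracted = "Cell text: \u4e2d\u6587 caf\u00e9";
        byte[] encoded = toUtf8(extracted);
        System.out.println(fromUtf8(encoded).equals(extracted)); // prints true
    }
}
```

If the bytes are later decoded with a different charset, the same kind of corruption can reappear, so both sides need to agree on the encoding.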
The dependencies I am using are:
compile('org.apache.tika:tika-parsers-standard-package:2.3.0') {
    exclude group: 'org.apache.poi', module: 'poi-scratchpad'
    exclude group: 'org.apache.poi', module: 'poi'
    // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
}
compile 'org.apache.tika:tika-core:2.3.0'
compile 'org.apache.poi:poi-scratchpad:5.2.1'
compile 'org.apache.poi:poi:5.2.1'
--
This message was sent by Atlassian Jira
(v8.20.10#820010)