Posted to dev@tika.apache.org by "Jason Guo (Jira)" <ji...@apache.org> on 2022/07/13 08:20:00 UTC

[jira] [Created] (TIKA-3816) Tika cannot parse the text in the table(Microsoft word)

Jason Guo created TIKA-3816:
-------------------------------

             Summary: Tika cannot parse the text in the table(Microsoft word)
                 Key: TIKA-3816
                 URL: https://issues.apache.org/jira/browse/TIKA-3816
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.3.0
         Environment: OS : Windows 10,
Software Platform : Java
            Reporter: Jason Guo
             Fix For: 2.4.2
         Attachments: test1.docx

I am trying to parse a Microsoft Word document (.doc) that contains a table. The table holds a select component and some text, and Tika does not return the text that is inside the table.

The code I am using to parse the document is below:

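import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public static byte[] convertToByteArray(byte[] bytes) throws Exception {
    Tika tika = new Tika();
    // Raise the output limit so that parseToString does not truncate the text.
    if (bytes.length > tika.getMaxStringLength()) {
        tika.setMaxStringLength(bytes.length);
    }
    String result = tika.parseToString(new ByteArrayInputStream(bytes));

    // Note: getBytes() encodes with the platform default charset.
    byte[] rv = result.getBytes();
    return rv;
}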
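
To narrow down whether the table text is missing from the file or only from Tika's output, it can help to look at the document independently of Tika. The following is a minimal sketch using only the JDK, and it assumes the input is a .docx file (as the attachment name test1.docx suggests; it does not apply to the binary .doc format). A .docx file is a ZIP container whose main body lives in word/document.xml, where tables are w:tbl elements and text runs are w:t elements. The class name DocxTableProbe and its method names are hypothetical and are not part of Tika or POI.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; not part of Tika or POI.
class DocxTableProbe {
    // Main WordprocessingML namespace used in word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the content of every w:t element that has a w:tbl ancestor. */
    static List<String> fromXml(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> texts = new ArrayList<>();
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            if (isInsideTable(runs.item(i))) {
                texts.add(runs.item(i).getTextContent());
            }
        }
        return texts;
    }

    /** Locates word/document.xml inside the .docx ZIP container and probes it. */
    static List<String> fromDocx(byte[] docxBytes) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current ZIP entry.
                    return fromXml(new ByteArrayInputStream(zip.readAllBytes()));
                }
            }
        }
        // No main document part found, so this is probably not a .docx file.
        return new ArrayList<>();
    }

    // Walks up the parent chain looking for a w:tbl element.
    private static boolean isInsideTable(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && "tbl".equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }
}
```

Calling DocxTableProbe.fromDocx(bytes) with the same byte array that is passed to convertToByteArray lists the text runs stored inside tables. If the expected text shows up there but not in the Tika result, the problem is more likely on the parsing side than in the document itself.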

The dependencies I am using are:

compile('org.apache.tika:tika-parsers-standard-package:2.3.0') {
    exclude group: 'org.apache.poi', module: 'poi-scratchpad'
    exclude group: 'org.apache.poi', module: 'poi'
    // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
}
compile 'org.apache.tika:tika-core:2.3.0'

compile 'org.apache.poi:poi-scratchpad:5.2.1'
compile 'org.apache.poi:poi:5.2.1'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)