Posted to dev@tika.apache.org by "Jason Guo (Jira)" <ji...@apache.org> on 2022/07/13 08:22:00 UTC

[jira] [Updated] (TIKA-3816) Tika cannot parse the text in the table(Microsoft word)

     [ https://issues.apache.org/jira/browse/TIKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Guo updated TIKA-3816:
----------------------------
    Attachment: output.PNG

> Tika cannot parse the text in the table(Microsoft word)
> -------------------------------------------------------
>
>                 Key: TIKA-3816
>                 URL: https://issues.apache.org/jira/browse/TIKA-3816
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>         Environment: OS : Windows 10,
> Software Platform : Java
>            Reporter: Jason Guo
>            Priority: Major
>             Fix For: 2.4.2
>
>         Attachments: output.PNG, test1.docx
>
>
> I am trying to parse a Microsoft Word document (.doc) that contains a table holding a select component and some text.
> The code I am using to parse the document is below:
> public static byte[] convertToByteArray(byte[] bytes) throws Exception {
>     Tika tika = new Tika();
>     if (bytes.length > tika.getMaxStringLength()) {
>         tika.setMaxStringLength(bytes.length);
>     }
>     String result = tika.parseToString(new ByteArrayInputStream(bytes));
>     byte[] rv = result.getBytes();
>     return rv;
> }
> The dependencies I am using are:
> compile ('org.apache.tika:tika-parsers-standard-package:2.3.0') {
>     exclude group: 'org.apache.poi', module: 'poi-scratchpad'
>     exclude group: 'org.apache.poi', module: 'poi'
>     // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
> }
> compile 'org.apache.tika:tika-core:2.3.0'
> compile 'org.apache.poi:poi-scratchpad:5.2.1'
> compile 'org.apache.poi:poi:5.2.1'
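
When triaging a report like this, it can help to confirm that the table text is actually stored in the file, independently of Tika. The sketch below is one possible check, not part of Tika or of the reporter's code. It assumes the file is an OOXML .docx (like the attached test1.docx) and not a binary .doc, and that the body text sits in the word/document.xml entry as <w:t> runs. The class and method names (DocxTableTextCheck, readDocumentXml, textRuns) are hypothetical, chosen only for illustration, and only the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the raw text runs stored in a .docx file.
class DocxTableTextCheck {

    // Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>,
    // but not other elements such as <w:tbl>, <w:tc>, <w:tr> or <w:tab/>.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // Returns the content of word/document.xml, or null if the entry is absent
    // (for example when the bytes are not a .docx package).
    public static String readDocumentXml(byte[] docxBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current zip entry.
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    // Collects the text of every run, in document order. XML entities are not
    // decoded, and Word may split one visible phrase across several runs.
    public static List<String> textRuns(String documentXml) {
        List<String> runs = new ArrayList<>();
        Matcher m = TEXT_RUN.matcher(documentXml);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = Files.readAllBytes(Paths.get(args[0]));
        String xml = readDocumentXml(docx);
        if (xml == null) {
            System.out.println("No word/document.xml entry found; is this a .docx file?");
            return;
        }
        textRuns(xml).forEach(System.out::println);
    }
}
```

If the missing table text shows up in this listing but not in the string returned by tika.parseToString, the content is present in the file and the loss happens during parsing. If it is absent here too, the text is probably stored elsewhere in the package (for instance inside a content control's list items), which this simple sketch does not inspect.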



--
This message was sent by Atlassian Jira
(v8.20.10#820010)