Posted to dev@tika.apache.org by "Mane (JIRA)" <ji...@apache.org> on 2013/12/11 18:45:07 UTC

[jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files

    [ https://issues.apache.org/jira/browse/TIKA-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845571#comment-13845571 ] 

Mane edited comment on TIKA-1121 at 12/11/13 5:43 PM:
------------------------------------------------------

Also worth mentioning: I have a gibberish.txt file, 2 MB of garbage text, that the Tika socket server is unable to parse; it hangs. (I downloaded this file from one of the Tika-related tickets I found; I no longer have the link to it.)


was (Author: mane_genius):
Also worth mentioning: I have a gibberish.txt file, 2 MB of garbage text, that the Tika socket server is unable to parse and return text for. (I downloaded this file from one of the Tika-related tickets I found; I no longer have the link to it.)

> Socket server text parsing error on large text files
> ----------------------------------------------------
>
>                 Key: TIKA-1121
>                 URL: https://issues.apache.org/jira/browse/TIKA-1121
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.4
>         Environment: Ubuntu 10.04, 10.10, 12.04.02
>            Reporter: Dave Meikle
>            Assignee: Dave Meikle
>
> As reported on the user list[1], when using the tika-app socket server command with the -t switch to parse text, the process hangs on large text files.
> This occurs on Ubuntu 10.04, 10.10 and 12.04.02.
> [1]http://mail-archives.apache.org/mod_mbox/tika-user/201305.mbox/%3CCAGxBzUFxSJ4h5jWdeUX9HhD2FxtTQ1vsbM7u-VfSyGE9VmrQHQ@mail.gmail.com%3E




Re: [jira] [Comment Edited] (TIKA-1121) Socket server text parsing error on large text files

Posted by Raymond Wiker <rw...@gmail.com>.
I've been reading through some of the emails referenced, and it looks like the problem might be in the code on the client side.

In one of the emails from May 2013, the client-side code tries to write the entire file to Tika, and then to read the extracted text back. I had a similar problem with some files, and discovered that, for certain files, Tika started to write back extracted text before the entire file had been written. At some point, a deadlock situation arose where each side was waiting for the other to read what had been written to the socket.

I solved this by running the read part on the client side in a separate thread. This appears to work fine – I have seen no strange hangs even after feeding close to a million files in sizes up to 100MB through a single Tika process.
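
For illustration, here is a minimal Python sketch of that approach. It is not the client code from the thread; the function name extract_text and the host, port and timeout parameters are hypothetical, and it assumes the server treats the client closing its write side as end of input and closes the connection once it has sent all of its output.

```python
import socket
import threading


def extract_text(host, port, payload, timeout=60.0):
    """Send payload to a socket server and return everything it sends back.

    The response is read in a separate thread, so the server can never
    block on a full send buffer while this side is still writing.
    (Hypothetical helper; host, port and timeout are assumptions.)
    """
    chunks = []
    errors = []

    def reader(sock):
        try:
            while True:
                data = sock.recv(65536)
                if not data:  # server closed its side: response complete
                    break
                chunks.append(data)
        except OSError as exc:  # includes socket timeouts
            errors.append(exc)

    with socket.create_connection((host, port), timeout=timeout) as sock:
        thread = threading.Thread(target=reader, args=(sock,), daemon=True)
        thread.start()                  # start draining before writing
        sock.sendall(payload)           # write the whole document
        sock.shutdown(socket.SHUT_WR)   # signal end of input to the server
        thread.join()                   # wait until the response is complete

    if errors:
        raise errors[0]
    return b"".join(chunks)
```

The important detail is that the reader thread is started before the first byte is written. A client that calls sendall() first and only then starts reading can stall as soon as both socket buffers fill up, which matches the hang described above.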