Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2014/08/29 16:42:53 UTC

[jira] [Commented] (TIKA-1404) tika-server leaking temporary files when converting Word97 (doc)

    [ https://issues.apache.org/jira/browse/TIKA-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115294#comment-14115294 ] 

Nick Burch commented on TIKA-1404:
----------------------------------

Any chance you could re-test with a recent nightly build? (Or, failing that, wait about a week and try with 1.6, which'll hopefully be out by then.)

It's possible that the Apache POI updates included since 1.5 will solve this.

> tika-server leaking temporary files when converting Word97 (doc)
> ----------------------------------------------------------------
>
>                 Key: TIKA-1404
>                 URL: https://issues.apache.org/jira/browse/TIKA-1404
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.5
>         Environment: Linux (observed on CentOS 6.5 and SuSE SLES 11)
>            Reporter: Lukas Graf
>         Attachments: simple_word97.doc
>
>
> When converting Word97 documents (*.doc), tika-server reproducibly leaves behind temporary files.
> Steps to reproduce:
> - Start {{tika-app-1.5.jar}} in {{--server}} mode
> - Send a {{*.doc}} file to the server for conversion
> - Stop tika-server using CTRL+C or {{kill -15}}
> For example:
> {code}
> lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text
> # ...
> lukas@host:/tmp> ls -lah apache-tika-*
> ls: cannot access apache-tika-*: No such file or directory
> lukas@host:/tmp>
> lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
> Simple Word-97 Document
> Lorem Ipsum.
> lukas@host:/tmp> ls -lah apache-tika-*
> -rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 apache-tika-2457738389388821864.tmp
> # after conversion is done, tmp file handles are still open
> lukas@host:/tmp> lsof | grep tika
> java   29857   lukas   32r   REG   104,2  28628386  4571740 /home/lukas/tika-app-1.5.jar
> java   29857   lukas   85r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
> java   29857   lukas   86r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
> # stop tika-server...
> ^C
> lukas@host:~>
> # ...
> lukas@host:/tmp> lsof | grep tika
> lukas@host:/tmp>
> {code}
> No exceptions are thrown, and the plaintext is being extracted correctly from the document, but temporary files are still left behind every single time.
> This is obviously a major issue in a production environment that converts thousands of documents a day. Our temp directories are filling up rapidly, and we had to configure cron jobs to clean up after Tika on most of our production servers. I wasn't able to reproduce this issue using {{tika-app-1.5.jar}} in non-server mode. However, booting up a JVM for every single conversion is just too slow.
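
As an illustration of the cron-based cleanup workaround mentioned in the report, here is a minimal sketch, not the reporter's actual job. It assumes the leaked files match the apache-tika-*.tmp pattern seen in the listing above; the function names, the directory argument and the one-hour age threshold are hypothetical choices. Note that on Linux, unlinking a file the server still holds open only removes the directory entry; the disk space is released once the process closes the handle or exits.

```python
import time
from pathlib import Path


def find_stale_tika_temp_files(tmp_dir, max_age_seconds, now=None):
    """Return apache-tika-*.tmp files in tmp_dir whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(tmp_dir).glob("apache-tika-*.tmp"):
        # Only regular files directly inside tmp_dir are considered.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)


def remove_stale_tika_temp_files(tmp_dir, max_age_seconds, now=None):
    """Delete stale temp files and return the paths that were actually removed."""
    removed = []
    for path in find_stale_tika_temp_files(tmp_dir, max_age_seconds, now):
        try:
            path.unlink()
            removed.append(path)
        except OSError:
            # The file vanished or cannot be removed; skip it.
            continue
    return removed


if __name__ == "__main__":
    # Hypothetical settings: clean /tmp of files untouched for over an hour.
    for path in remove_stale_tika_temp_files("/tmp", max_age_seconds=3600):
        print(f"removed {path}")
```

The age threshold is there so that a temp file belonging to a conversion still in progress is left alone; a script like this could be run periodically from cron until the leak itself is fixed.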


