Posted to dev@tika.apache.org by "Lukas Graf (JIRA)" <ji...@apache.org> on 2014/08/29 16:36:53 UTC
[jira] [Created] (TIKA-1404) tika-server leaking temporary files when converting Word97 (doc)
Lukas Graf created TIKA-1404:
--------------------------------
Summary: tika-server leaking temporary files when converting Word97 (doc)
Key: TIKA-1404
URL: https://issues.apache.org/jira/browse/TIKA-1404
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.5
Environment: Linux (observed on CentOS 6.5 and SuSE SLES 11)
Reporter: Lukas Graf
Attachments: simple_word97.doc
When converting Word97 documents (*.doc), tika-server reproducibly leaves behind temporary files.
Steps to reproduce:
- Start {{tika-app-1.5.jar}} in {{--server}} mode
- Send a {{*.doc}} file to the server for conversion
- Stop tika-server using CTRL+C or {{kill -15}}
For example:
{code}
lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text
# ...
lukas@host:/tmp> ls -lah apache-tika-*
ls: cannot access apache-tika-*: No such file or directory
lukas@host:/tmp>
lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
Simple Word-97 Document
Lorem Ipsum.
lukas@host:/tmp> ls -lah apache-tika-*
-rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 apache-tika-2457738389388821864.tmp
# after conversion is done, tmp file handles are still open
lukas@host:/tmp> lsof | grep tika
java 29857 lukas 32r REG 104,2 28628386 4571740 /home/lukas/tika-app-1.5.jar
java 29857 lukas 85r REG 104,2 22528 8604717 /tmp/apache-tika-2457738389388821864.tmp
java 29857 lukas 86r REG 104,2 22528 8604717 /tmp/apache-tika-2457738389388821864.tmp
# stop tika-server...
^C
lukas@host:~>
# ...
lukas@host:/tmp> lsof | grep tika
lukas@host:/tmp>
{code}
No exceptions are thrown and the plain text is extracted correctly from the document, but temporary files are left behind every single time.
This is a major issue in a production environment that converts thousands of documents a day. Our temp directories fill up rapidly, and we had to configure cron jobs to clean up after Tika on most of our production servers. I wasn't able to reproduce this issue using {{tika-app-1.5.jar}} in non-server mode, but starting a JVM for every single conversion is too slow.
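For illustration, a cleanup job of the kind mentioned above might look roughly like the following. This is only a minimal sketch, not the actual script used on the servers: the function name {{cleanup_tika_tmp}}, the default directory and the 60-minute age threshold are assumptions, and it relies on the {{-maxdepth}} and {{-mmin}} options of GNU find as shipped on Linux.

```shell
#!/bin/sh
# Workaround sketch: delete leaked Tika temp files older than a given age.
# The function name, the default directory and the default age threshold
# are illustrative assumptions, not part of Tika.
cleanup_tika_tmp() {
    dir="${1:-/tmp}"        # directory to scan
    max_age_min="${2:-60}"  # only remove files unmodified for this many minutes

    # -maxdepth 1 : do not descend into subdirectories
    # -type f     : regular files only
    # -name ...   : only files matching the pattern seen in the listing above
    # -mmin +N    : last modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f -name 'apache-tika-*.tmp' \
        -mmin "+$max_age_min" -exec rm -f {} +
}
```

A script that defines this function and ends with a call such as {{cleanup_tika_tmp /tmp 60}} could be run periodically from cron. Note that as long as the server process still holds the file handles open, removing a file only unlinks it; on Linux the disk space is released once the handles are closed or the process exits.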