Posted to dev@tika.apache.org by "Lukas Graf (JIRA)" <ji...@apache.org> on 2014/08/29 16:36:53 UTC

[jira] [Created] (TIKA-1404) tika-server leaking temporary files when converting Word97 (doc)

Lukas Graf created TIKA-1404:
--------------------------------

             Summary: tika-server leaking temporary files when converting Word97 (doc)
                 Key: TIKA-1404
                 URL: https://issues.apache.org/jira/browse/TIKA-1404
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 1.5
         Environment: Linux (observed on CentOS 6.5 and SuSE SLES 11)
            Reporter: Lukas Graf
         Attachments: simple_word97.doc

When converting Word97 documents (*.doc), tika-server reproducibly leaves behind temporary files.

Steps to reproduce:

- Start {{tika-app-1.5.jar}} in {{--server}} mode
- Send a {{*.doc}} file to the server for conversion
- Stop tika-server using CTRL+C or {{kill -15}}

For example:

{code}
lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text

# ...

lukas@host:/tmp> ls -lah apache-tika-*
ls: cannot access apache-tika-*: No such file or directory
lukas@host:/tmp>
lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
Simple Word-97 Document
Lorem Ipsum.
lukas@host:/tmp> ls -lah apache-tika-*
-rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 apache-tika-2457738389388821864.tmp

# after conversion is done, tmp file handles are still open

lukas@host:/tmp> lsof | grep tika
java      29857       lukas   32r      REG              104,2  28628386  4571740 /home/lukas/tika-app-1.5.jar
java      29857       lukas   85r      REG              104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
java      29857       lukas   86r      REG              104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp

# stop tika-server...

^C
lukas@host:~>

# ...

lukas@host:/tmp> lsof | grep tika
lukas@host:/tmp>
{code}

No exceptions are thrown, and the plain text is extracted correctly from the document, but a temporary file is left behind on every single conversion.
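To check this over many conversions without watching {{/tmp}} by hand, the manual session above could be scripted roughly as follows. This is only a sketch in Python (not part of Tika): the function names {{leftover_temp_files}}, {{send_document}} and {{count_leaked}} are made up for illustration, the host and port are taken from the example session, and it assumes the server treats the end of the input stream as the end of the document and writes its temporary files to the directory being inspected.

```python
import glob
import os
import socket
import tempfile


def leftover_temp_files(tmp_dir=None):
    """Return the sorted paths of apache-tika-* files found in tmp_dir.

    Assumption: the server JVM uses the same temp directory as this
    script (the system default when tmp_dir is None).
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    return sorted(glob.glob(os.path.join(tmp_dir, "apache-tika-*")))


def send_document(path, host="127.0.0.1", port=8077, timeout=30.0):
    """Send a file over a raw TCP socket and return the reply as bytes.

    Roughly equivalent to `netcat host port < path`.
    """
    with open(path, "rb") as fh:
        payload = fh.read()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # Close the write side so the server sees the end of the input.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def count_leaked(path, tmp_dir=None, **kwargs):
    """Convert one document; return (reply, temp files that appeared)."""
    before = set(leftover_temp_files(tmp_dir))
    reply = send_document(path, **kwargs)
    after = set(leftover_temp_files(tmp_dir))
    return reply, sorted(after - before)
```

Calling {{count_leaked("simple_word97.doc", tmp_dir="/tmp")}} against a running server would be expected to return the extracted text together with any newly created temporary file. Other processes writing matching files to the same directory at the same time would skew the result.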

This is a major issue in a production environment that converts thousands of documents a day. Our temp directories are filling up rapidly, and we had to configure cron jobs to clean up after Tika on most of our production servers. I wasn't able to reproduce this issue using {{tika-app-1.5.jar}} in non-server mode, but booting up a JVM for every single conversion is too slow to be an alternative.
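A cleanup job of the kind mentioned above might look roughly like the following Python sketch. It is an illustration only, not the script actually deployed: the function name {{remove_stale_tika_temp_files}}, the one-hour age threshold and the {{apache-tika-*.tmp}} pattern (taken from the file name in the example session) are assumptions.

```python
import glob
import os
import tempfile
import time


def remove_stale_tika_temp_files(tmp_dir=None, max_age_seconds=3600, now=None):
    """Delete apache-tika-*.tmp files older than max_age_seconds.

    Returns the sorted list of paths that were removed. The age check
    is meant to avoid deleting files that belong to a conversion which
    is still in progress.
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(os.path.join(tmp_dir, "apache-tika-*.tmp")):
        try:
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
        except OSError:
            # The file vanished or could not be removed; skip it.
            continue
    return sorted(removed)
```

Since the server keeps the file handles open until it exits, removing a file on Linux only unlinks the directory entry. The disk space itself is not released until the tika-server process closes the handles or is restarted, so a job like this is a stopgap rather than a fix.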


