Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/04 21:44:43 UTC

[jira] [Comment Edited] (TIKA-1464) Too many open files in system when parsing thousands of files

    [ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196765#comment-14196765 ] 

Tim Allison edited comment on TIKA-1464 at 11/4/14 8:35 PM:
------------------------------------------------------------

On Windows 7 with Tika 1.7-SNAPSHOT, running a 4-thread process over a batch of 3k .msg files that have many attachments, the leak detector reports at most 12 descriptors open at a time.

The Windows task manager shows no more than 300 files open at a time for the full process.

{noformat}
12 descriptors are open
#1 ...file1.msg by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
...
#2 ...\Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#3 ...Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
	at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
	at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#4 ...file2.msg by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
....

#5 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
	at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)

....
#6 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
	at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
	at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

....
#7 ...file3.msg by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

#8 ...Local\Temp\apache-tika-877604211166291703.tmp by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
	at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
	at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

....

#9 ...Local\Temp\apache-tika-877604211166291703.tmp by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
	at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
	at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
	at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
...

#10 ...file4.msg by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
...

#11 ...Temp\apache-tika-5646769580530000299.tmp by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
	at java.io.FileInputStream.<init>(FileInputStream.java:147)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
	at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
...

#12 ...\Temp\apache-tika-5646769580530000299.tmp by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
	at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
	at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
	at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)

----


{noformat}
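
The trace above shows three handles per in-flight file (the source .msg stream, a stream on the spooled temp file, and a random-access handle on that same temp file), which with 4 worker threads gives the ceiling of 12. A minimal sketch of that pattern follows, using only the JDK. It is not Tika or POI code: the class BoundedHandlesSketch, the Tracked wrapper, and the methods processOne and run are hypothetical names made up for illustration, and the counter only stands in for what a real leak detector reports.

```java
import java.io.Closeable;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Tika or POI code: every name here is illustrative.
class BoundedHandlesSketch {
    static final AtomicInteger OPEN = new AtomicInteger();
    static final AtomicInteger PEAK = new AtomicInteger();

    // Counts a handle as open on construction and as released on close().
    static final class Tracked implements Closeable {
        private final Closeable delegate;

        Tracked(Closeable delegate) {
            this.delegate = delegate;
            PEAK.accumulateAndGet(OPEN.incrementAndGet(), Math::max);
        }

        @Override
        public void close() throws IOException {
            try {
                delegate.close();
            } finally {
                OPEN.decrementAndGet();
            }
        }
    }

    // Opens the same three handles per file as in the trace: the source stream,
    // a stream on the temp copy, and a random-access handle on the temp copy.
    static void processOne(Path source) throws IOException {
        Path tmp = Files.createTempFile("sketch-", ".tmp");
        try (Tracked in = new Tracked(new FileInputStream(source.toFile()))) {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (Tracked spooled = new Tracked(new FileInputStream(tmp.toFile()));
                 Tracked random = new Tracked(new RandomAccessFile(tmp.toFile(), "r"))) {
                // Detection and parsing would happen here.
            }
        } finally {
            // All handles on tmp are closed by now, so the delete also works on Windows.
            Files.deleteIfExists(tmp);
        }
    }

    // Processes `files` small files on `threads` workers and returns the peak handle count.
    static int run(int threads, int files) throws Exception {
        OPEN.set(0);
        PEAK.set(0);
        Path dir = Files.createTempDirectory("sketch-batch-");
        List<Path> created = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < files; i++) {
                Path f = Files.write(dir.resolve("file" + i + ".bin"), new byte[] {1, 2, 3});
                created.add(f);
                pending.add(pool.submit(() -> {
                    processOne(f);
                    return null;
                }));
            }
            for (Future<?> p : pending) {
                p.get(); // rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
        for (Path f : created) {
            Files.deleteIfExists(f);
        }
        Files.deleteIfExists(dir);
        return PEAK.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak open handles: " + run(4, 40));
    }
}
```

Because every handle is opened in a try-with-resources block, the peak can never exceed 3 x threads regardless of how many files are in the batch, and the open count returns to zero once the batch finishes. A genuine leak would instead show the count growing with the number of files processed.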


was (Author: tallison@mitre.org):
On Windows 7, on a batch of 3k msg files that have many of attachments, the most I can get with a 4 thread process is 12 descriptors open at a time according to the leak detector.

The Windows task manager shows no more than 300 files open at a time for the full process.


> Too many open files in system when parsing thousands of files
> -------------------------------------------------------------
>
>                 Key: TIKA-1464
>                 URL: https://issues.apache.org/jira/browse/TIKA-1464
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>         Environment: OS X 10.10, Windows 8.1 (probably all operating systems)
>            Reporter: Tim Barrett
>            Priority: Blocker
>              Labels: TooManyOpenFilesInSystem
>
> Our big data project parses many thousands of different kinds of files sequentially. Up to and including Tika 1.5 this has been trouble free and Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG files in roughly equal measure.
> We switched to Tika 1.6 last week, and this was a good enhancement for us: a number of MSOffice files that previously failed to parse now parse correctly under Tika 1.6.
> However, a "Too many open files in system" exception is raised once somewhere above 10,000 files have been parsed. On a Windows server this exception is not raised, but the system eventually slows to a crawl.
> Watching how the system handles the Apache Tika temp files, we see that they *are* being deleted from the file system, but lsof shows all of them as still held open by the running process that uses Tika. It appears that the files are deleted but their handles are never released.
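
The symptom in the description (temp files gone from the file system but still listed by lsof) matches how an open handle outlives the path it was opened from. A minimal JDK-only sketch of that behaviour follows, assuming a POSIX-like system such as OS X or Linux. The class DeletedButOpenSketch and the method readAfterDelete are hypothetical names for illustration, not anything from Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not Tika code: all names are illustrative.
class DeletedButOpenSketch {
    // Writes payload to a temp file, opens it, deletes the path while the
    // stream is still open, then reads the content back through the stream.
    static String readAfterDelete(String payload) throws IOException {
        Path tmp = Files.createTempFile("sketch-", ".tmp");
        try {
            Files.write(tmp, payload.getBytes(StandardCharsets.UTF_8));
            try (InputStream in = Files.newInputStream(tmp)) {
                Files.delete(tmp); // the path disappears, the handle does not
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } // closing the stream is what finally releases the descriptor
        } finally {
            Files.deleteIfExists(tmp); // cleanup if the delete above did not run
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterDelete("still readable"));
    }
}
```

Until the stream is closed, the process keeps a descriptor for the deleted file, so deleting temp files without closing every stream opened on them would still count against the open-file limit.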


