You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Luis Filipe Nassif (JIRA)" <ji...@apache.org> on 2014/09/06 18:55:28 UTC

[jira] [Created] (TIKA-1411) Temporary 7z file leak

Luis Filipe Nassif created TIKA-1411:
----------------------------------------

             Summary: Temporary 7z file leak
                 Key: TIKA-1411
                 URL: https://issues.apache.org/jira/browse/TIKA-1411
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.6
            Reporter: Luis Filipe Nassif


When working with a 7z file, the created TikaInputStream is not closed inside PackageParser. Also, it is prematurely wrapping the stream into a CloseShieldInputStream, so it will never be a TikaInputStream and always wrapped into a BufferedInputStream. Proposed change:

{code}
public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
       
        // Ensure that the stream supports the mark feature
        if (! TikaInputStream.isTikaInputStream(stream))
            stream = new BufferedInputStream(stream);
        
        
        TemporaryResources tmp = new TemporaryResources();
        ArchiveInputStream ais = null;
        try {
            ArchiveStreamFactory factory = context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory());
            // At the end we want to close the archive stream to release
            // any associated resources, but the underlying document stream
            // should not be closed
            ais = factory.createArchiveInputStream(new CloseShieldInputStream(stream));
            
        } catch (StreamingNotSupportedException sne) {
            // Most archive formats work on streams, but a few need files
            if (sne.getFormat().equals(ArchiveStreamFactory.SEVEN_Z)) {
                // Rework as a file, and wrap
                stream.reset();
                TikaInputStream tstream = TikaInputStream.get(stream, tmp);
                
                // Pending a fix for COMPRESS-269, this bit is a little nasty
                ais = new SevenZWrapper(new SevenZFile(tstream.getFile()));
                
            } else {
            	tmp.close();
                throw new TikaException("Unknown non-streaming format " + sne.getFormat(), sne);
            }
        } catch (ArchiveException e) {
        	tmp.close();
            throw new TikaException("Unable to unpack document stream", e);
        }

        MediaType type = getMediaType(ais);
        if (!type.equals(MediaType.OCTET_STREAM)) {
            metadata.set(CONTENT_TYPE, type.toString());
        }

        // Use the delegate parser to parse the contained document
        EmbeddedDocumentExtractor extractor = context.get(
                EmbeddedDocumentExtractor.class,
                new ParsingEmbeddedDocumentExtractor(context));

        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();

        try {
            ArchiveEntry entry = ais.getNextEntry();
            while (entry != null) {
                if (!entry.isDirectory()) {
                    parseEntry(ais, entry, extractor, xhtml);
                }
                entry = ais.getNextEntry();
            }
        } finally {
            ais.close();
            tmp.close();
        }

        xhtml.endDocument();
    }
{code}

I would be nice if TIKA-1246 (very simple) was resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)