Posted to user@tika.apache.org by Steven White <sw...@gmail.com> on 2016/02/08 19:37:53 UTC

Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out
if the OOM I'm getting is due to the way I'm using Tika or if it is an
issue with parsing XML files.

The following example code causes an OOM on the 7th iteration with
-Xmx2g.  The test passes with -Xmx4g.  The XML file I'm trying to parse
is 51 MB in size.  I do not see this issue with the other file types I
have tested so far.  Memory usage keeps growing with XML files, but
stays constant with the other file types.

    import java.io.File;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // The handler, parser, and metadata are fields, so they are
        // reused across every call to extract().
        private BodyContentHandler contentHandler = new BodyContentHandler(-1);
        private AutoDetectParser parser = new AutoDetectParser();
        private Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            TikaInputStream stream = null;
            try {
                stream = TikaInputStream.get(file);
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
            finally {
                if (stream != null) {
                    stream.close();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Extractor extractor = new Extractor();
        File file = new File("C:\\temp\\test.xml");
        for (int i = 0; i < 20; i++) {
            extractor.extract(file);
        }
    }

Any idea if this is an issue with XML files or if the issue is in my code?

Thanks

Steve

RE: Preventing OutOfMemory exception

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tika can fail catastrophically (permanent hangs, memory leaks, OOMs and other surprises).  These problems happen very, very rarely, and we fix them as soon as we can, but really bad things can happen – see, e.g., TIKA-1132, TIKA-1401, SOLR-7764, PDFBOX-2200 and [0] and [1].

Tika runs in the same jvm as Solr in Solr Cell.  The good news is that Tika works so well that no one has gotten around to putting it into its own jvm in Solr Cell.  I’m active on the Solr list and have shared the potential problems of running Tika in the same jvm several times over there (see, e.g., [2]).

So, the short answer is: with the exception of TIKA-1401, I don’t _know_ of specific vulnerabilities that would cause serious problems with Tika.  However, given what we’ve seen, I have little reason to believe that these issues won’t happen again…very, very rarely.

I added tika-batch, which you can run from the command line of tika-app, to handle these catastrophic failures.  You can also wrap your own solution via ForkParser or other methods; a sketch of the ForkParser route follows the links below.

[0] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] http://www.slideshare.net/gagravarr/whats-new-with-apache-tika
[2] http://mail-archives.apache.org/mod_mbox/lucene-dev/201507.mbox/%3CJIRA.12843538.1436367863000.133708.1436382786622@Atlassian.JIRA%3E
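
A minimal sketch of the ForkParser route, assuming Tika 1.x
(org.apache.tika.fork.ForkParser).  The parse runs in a forked jvm, so a
hang or OOM there cannot take down the caller.  Creating and closing the
ForkParser per call, as here, is only for brevity; in practice you would
keep one instance and reuse it:

    import java.io.File;
    import java.io.InputStream;

    import org.apache.tika.fork.ForkParser;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ForkExtractor {
        public static String extract(File file) throws Exception {
            ForkParser parser = new ForkParser(
                    ForkExtractor.class.getClassLoader(), new AutoDetectParser());
            try {
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                try (InputStream stream = TikaInputStream.get(file)) {
                    // The actual parse runs in a forked jvm managed by ForkParser.
                    parser.parse(stream, handler, metadata, new ParseContext());
                }
                return handler.toString();
            } finally {
                parser.close();  // shuts down the forked jvm
            }
        }
    }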

Re: Preventing OutOfMemory exception

Posted by Steven White <sw...@gmail.com>.
Thanks for the confirmation Tim.

This is production code, so ...

I'm a bit surprised that you suggest keeping the Tika code out-of-process
as a standalone application vs. using it directly from my app.  Are there
known issues with Tika that prevent it from being used in a long-running
process?  Does Solr use Tika as an out-of-process application?  See
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
(I will also ask this question on the Solr mailing list).

A bit of background about my application.  I am writing a file system
crawler that will run 24x7xN-days uninterrupted.  The application monitors
the file system once every N min., where N can be anywhere from 1 min. and
up, looking for new or updated files.  It then sends each file to Tika to
extract the raw text, and the raw text is then sent to Solr for indexing.
My file-system-crawler will not be recycled or stopped unless the OS has
to be restarted.  Thus, I expect it to run 24x7xN-days.  Finally, the file
system is expected to be busy: on average there will be 10 new files added
or updated per minute.  Overall, I'm expecting to make at least 10 calls
to Tika per minute.  (A rough sketch of this polling loop appears below.)

Steve
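
A rough sketch of the polling loop described above.  The helpers
scanForNewOrUpdatedFiles() and extractAndIndex() are hypothetical
stand-ins for the crawler's real scan and Tika-to-Solr steps:

    import java.io.File;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class CrawlerLoop {
        public static void main(String[] args) {
            final int intervalMinutes = 1;  // the "N min." from the description
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                for (File f : scanForNewOrUpdatedFiles()) {
                    try {
                        extractAndIndex(f);  // Tika extraction, then Solr indexing
                    } catch (Exception e) {
                        // One bad file must not stop a 24x7 crawler.
                        e.printStackTrace();
                    }
                }
            }, 0, intervalMinutes, TimeUnit.MINUTES);
        }

        // Hypothetical stubs; a real crawler would track file timestamps/state.
        private static List<File> scanForNewOrUpdatedFiles() {
            return Collections.emptyList();
        }

        private static void extractAndIndex(File f) throws Exception {
        }
    }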


RE: Preventing OutOfMemory exception

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Same parser is ok to reuse…should even be ok in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.

As a side note, if you are handling a bunch of files from the wild in a production environment, I encourage separating Tika into a separate jvm vs tying it into any post processing – consider tika-batch and writing separate text files for each file processed (not so efficient, but exceedingly robust).  If this is demo code or you know your document set well enough, you should be good to go with keeping Tika and your postprocessing steps in the same jvm.
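
Applied to the Extractor from the original post, that advice works out to
something like this sketch (shared parser, fresh handler and metadata per
file; untested, but the shape is the point):

    import java.io.File;
    import java.io.InputStream;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // Safe to share: Tika parsers are reusable, even across threads.
        private final AutoDetectParser parser = new AutoDetectParser();

        public String extract(File file) throws Exception {
            // Not safe to share: a handler appends across parses and a
            // Metadata object carries values over, so create both per file.
            BodyContentHandler contentHandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, contentHandler, metadata);
            }
            return contentHandler.toString();
        }
    }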

Re: Preventing OutOfMemory exception

Posted by Steven White <sw...@gmail.com>.
Thanks Tim!!  You helped me find the defect in my code.

Yes, I was using one BodyContentHandler.  When I changed my code to
create a new BodyContentHandler for each XML file I parse, I no longer
see the OOM.  It is weird that I saw this issue with XML files only.

For completeness, can you confirm whether I have an issue in re-using a
single instance of AutoDetectParser and Metadata throughout the life of
my application?  The reason I'm reusing a single instance is to cut down
on overhead (I have yet to time this).

Steve


RE: Preventing OutOfMemory exception

Posted by "Allison, Timothy B." <ta...@mitre.org>.
In your actual code, are you using one BodyContentHandler for all of your files?  Or are you creating a new BodyContentHandler for each file?  If the former, then, yes, there’s a problem with your code; if the latter, that’s not something I’ve seen before.

Re: Preventing OutOfMemory exception

Posted by Steven White <sw...@gmail.com>.
Hi Tim,

The code I showed is a minimal example demonstrating the issue I'm
running into, which is: memory keeps on growing.

In production, the loop that you see will read files off a file system
and parse them using logic close to what I showed.  I use
contentHandler.toString() to get back the raw text so I can save it.
Even if I get rid of that call, I run into OOM.

Note that if I test the exact same code against PDF or PPT or ODP or RTF
(I still have far more formats to test) I do *NOT* see the OOM issue even
when I increase the loop to 1000 -- memory usage remains steady and
stable.  This is why in my original email I asked if there is an issue
with XML files or with my code, such as failing to close / release
something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
    at java.lang.StringBuffer.append(StringBuffer.java:114)
    at java.io.StringWriter.write(StringWriter.java:106)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
    at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

Thanks

Steve


RE: Preventing OutOfMemory exception

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I’m not sure why you’d want to append document contents across documents into one handler.  Typically, you’d use a new ContentHandler and new Metadata object for each parse.  Calling “toString()” does not clear the content handler, so by your final loop you will have 20 copies of the extracted content in it.

File type shouldn’t make any difference here: you are appending a new copy of the extracted text with each loop regardless of format.  You might not be seeing the memory growth with the other file types simply because those files aren’t big enough and you are only doing 20 loops.

But the larger question…what are you trying to accomplish?
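
That accumulation is easy to demonstrate directly.  In this sketch, one
handler is reused on purpose (the bug from the thread) and a file path is
passed as the lone argument; the handler's buffer grows by one copy of
the extracted text per pass:

    import java.io.File;
    import java.io.InputStream;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class HandlerReuseDemo {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);  // reused: the bug
            File file = new File(args[0]);
            for (int i = 0; i < 5; i++) {
                try (InputStream stream = TikaInputStream.get(file)) {
                    parser.parse(stream, handler, new Metadata());
                }
                // Length grows linearly: one more copy of the text per pass.
                System.out.println("pass " + (i + 1) + ": "
                        + handler.toString().length() + " chars");
            }
        }
    }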
