Posted to user@tika.apache.org by "Baldwin, David" <Da...@bmc.com> on 2010/01/05 17:47:38 UTC

Memory Usage/needs for file sizes/types

I need to get a handle on how much memory Tika needs to tokenize different file types.  In other words, I need to find information on the required overhead (including copies of buffers made, if applicable) so that I can produce some kind of guidelines on the memory that users of the product I am working on, which uses Lucene/Tika, may need.



I realize there is a lot of context that could be provided, but first I want to find out whether anyone already has data/metrics on this.

Many thanks in advance!

Re: Memory Usage/needs for file sizes/types

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jan 22, 2010 at 7:59 PM, Baldwin, David <Da...@bmc.com> wrote:
> I want to make sure that I am really running in streaming mode.  I am doing all tests
> with one thread to establish baseline memory usage for different documents; then I will
> move on to multiple threads, which I would expect to scale to roughly n multiples.
>
> Can you tell me if streaming mode is more than just passing an InputStream to Tika?

Yes, you'll also want to stream the parse output. You can do this
either by processing the SAX events directly as they arrive, or by
using the ParsingReader class (or the new Tika.parse() methods in
Tika 0.5 and higher).

The problem with your code is that it's buffering the entire text
content of the document you're parsing into a single String.

BR,

Jukka Zitting
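
For illustration, here is a minimal sketch of the streaming approach described above, using ParsingReader so the extracted text is handed to Lucene as a Reader rather than collected into one String. The class name, the field name "contents", and the Lucene 2.x/3.x Field(String, Reader) constructor are assumptions for this sketch, not something taken from the thread:

	import java.io.IOException;
	import java.io.InputStream;
	import java.io.Reader;

	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.tika.parser.ParsingReader;

	public class StreamingDocumentBuilder
	{
		public Document buildDocument(InputStream is) throws IOException
		{
			// ParsingReader auto-detects the document type and parses it in a
			// background thread, exposing the extracted text as a Reader.
			Reader reader = new ParsingReader(is);
			// (In Tika 0.5 and higher, new Tika().parse(is) returns a similar Reader.)
			Document doc = new Document();
			// Lucene tokenizes straight from the Reader, so the full text is never
			// buffered as a single String. The field name "contents" is arbitrary.
			doc.add(new Field("contents", reader));
			return doc;
		}
	}

The reader should be closed once the document has been added to the index (for example after IndexWriter.addDocument()).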

RE: Memory Usage/needs for file sizes/types

Posted by "Baldwin, David" <Da...@bmc.com>.
Jukka,

Thanks for your response.

I want to make sure that I am really running in streaming mode.  I am doing all tests with one thread to establish baseline memory usage for different documents; then I will move on to multiple threads, which I would expect to scale to roughly n multiples.

Can you tell me if streaming mode is more than just passing an InputStream to Tika?

I am using Tika as demonstrated in the example below, using the AutoDetectParser and passing it an InputStream, which will actually be an instance of a ByteArrayInputStream or a FileInputStream:

	// The parser instance is reused across calls; AutoDetectParser picks the
	// appropriate parser based on the detected document type.
	private Parser m_parser = new AutoDetectParser();

	public String getText(InputStream is) throws DocumentHandlerException
	{
		Metadata metadata = new Metadata();
		// BodyContentHandler collects the extracted body text in memory, so the
		// whole document text ends up in the single String returned below.
		ContentHandler handler = new BodyContentHandler();
		try
		{
			m_parser.parse(is, handler, metadata);
			return handler.toString();
		}
		catch (Exception e)
		{
			throw new DocumentHandlerException("Cannot extract text from the document", e);
		}
	}

Thanks in advance,

David


-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Thursday, January 21, 2010 5:47 PM
To: tika-user@lucene.apache.org
Subject: Re: Memory Usage/needs for file sizes/types

Hi,

Sorry for the late response...

On Tue, Jan 5, 2010 at 5:47 PM, Baldwin, David <Da...@bmc.com> wrote:
> I need to get a handle on how much memory Tika needs to tokenize different
> file types. In other words, I need to find information on the required overhead
> (including copies of buffers made, if applicable) so that I can produce some
> kind of guidelines on the memory that users of the product I am working on,
> which uses Lucene/Tika, may need.

Assuming you use Tika in streaming mode, the memory use is moderate
and typically does not depend on the size of the document being
processed. Parsing complex documents like MS Office or PDF files can
require up to a few megabytes of memory, while simple formats like
plain text only need a few kilobytes.

In addition to the above estimates, you also need to take into account
the memory that the JVM needs for loading Tika and the relevant parser
library classes. In total, I'd estimate that you should get pretty far
with some 20 megabytes of memory for Tika, unless you have dozens or
more parsing tasks running concurrently.

BR,

Jukka Zitting
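
For illustration, a minimal sketch (not from this thread) of one way to keep the number of concurrent parsing tasks, and therefore the total footprint estimated above, bounded: a fixed-size thread pool caps how many documents are parsed at once. The pool size of 4 and the class/method names are arbitrary, a single AutoDetectParser is shared because Tika parsers are designed to be usable from multiple threads, and Java 8 lambdas are used for brevity:

	import java.io.InputStream;
	import java.util.concurrent.ExecutorService;
	import java.util.concurrent.Executors;
	import java.util.concurrent.Future;

	import org.apache.tika.metadata.Metadata;
	import org.apache.tika.parser.AutoDetectParser;
	import org.xml.sax.helpers.DefaultHandler;

	public class BoundedParsingService
	{
		// At most 4 documents are parsed at once, so the per-parse overhead
		// (up to a few megabytes for complex formats) stays bounded.
		private final ExecutorService pool = Executors.newFixedThreadPool(4);
		private final AutoDetectParser parser = new AutoDetectParser();

		public Future<Metadata> parseAsync(InputStream is)
		{
			return pool.submit(() -> {
				Metadata metadata = new Metadata();
				// Stream the SAX events through a no-op handler; only the metadata
				// is kept, so each task's memory use stays small.
				parser.parse(is, new DefaultHandler(), metadata);
				return metadata;
			});
		}
	}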

