Posted to dev@tika.apache.org by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org> on 2011/11/17 22:16:52 UTC
[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152329#comment-13152329 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

Hello,

I am using the following code.

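		import java.io.IOException;
		import java.io.InputStream;
		import java.io.OutputStream;
		import org.apache.tika.exception.TikaException;
		import org.apache.tika.metadata.Metadata;
		import org.apache.tika.parser.AutoDetectParser;
		import org.apache.tika.parser.ParseContext;
		import org.apache.tika.parser.Parser;
		import org.apache.tika.sax.BodyContentHandler;
		import org.xml.sax.SAXException;

		// The class name is a placeholder; the snippet as posted did not name the class.
		public class StreamingConverter {
		    private final ParseContext context = new ParseContext();
		    private final Parser parser = new AutoDetectParser();
		    private final OutputStream outputStream;   // a PipedOutputStream
		    private final InputStream fileInputStream;
		    private final String fileName;

		    public StreamingConverter(InputStream argIp, OutputStream argOutputStream, String fileName) {
		        this.context.set(Parser.class, parser);  // lets embedded documents use the same parser
		        this.outputStream = argOutputStream;
		        this.fileInputStream = argIp;
		        this.fileName = fileName;
		    }

		    public void convert() throws IOException, SAXException, TikaException {
		        Metadata metadata = new Metadata();
		        metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
		        // Extracted body text is written to outputStream as the parser emits it.
		        BodyContentHandler contentHandler = new BodyContentHandler(this.outputStream);
		        parser.parse(fileInputStream, contentHandler, metadata, context);
		    }
		}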

I parse this way because I want to attach a PipedInputStream to the PipedOutputStream so that the work can be pipelined.
While Tika reads the file, it writes the parsed content to the piped stream, and another thread picks up the text from the
pipe and starts processing it. The idea is that to parse a 30 MB file I should not have to wait for Tika to finish the whole
file; instead, it can keep parsing small chunks and hand them off for processing by other threads.
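
For reference, the hand-off described above can be sketched with the JDK alone. This is a minimal illustration, not the code from my project: the class name PipedHandoffSketch and the produce() method are made up for the example, and produce() only stands in for the thread that would call parser.parse(...) with a BodyContentHandler wrapped around the PipedOutputStream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PipedHandoffSketch {

    // Hypothetical stand-in for the parsing side: in the real setup this is
    // where parser.parse(...) would push extracted text into the pipe.
    static void produce(PipedOutputStream out, String text) throws IOException {
        // Closing the writer closes the pipe, which signals end-of-stream to the reader.
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Runs the producer on its own thread and consumes lines on the calling thread.
    static List<String> run(String text) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 8192);
        IOException[] failure = new IOException[1];

        // Writer and reader must live on different threads, otherwise the pipe can deadlock.
        Thread producer = new Thread(() -> {
            try {
                produce(out, text);
            } catch (IOException e) {
                failure[0] = e;
            }
        });
        producer.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // each line is available as soon as the producer has written it
            }
        }
        producer.join();
        if (failure[0] != null) {
            throw failure[0];
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("row 1\nrow 2\nrow 3\n"));
    }
}
```

Note that this arrangement can only overlap parsing with processing if the producing side emits text incrementally. If the producer holds everything back until it has finished, the consumer thread simply waits on the pipe and the elapsed time stays about the same.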

Still, I am not seeing much improvement in elapsed time. Do you have any suggestions about the way I am using Tika?
Is this a correct way to use it?

I am not using tika.parseToString() because it returns the entire parsed text at once, and until it returns the other threads would be blocked.

I hope this explains my issue. I would appreciate a response.


Thanks
Anirban

		


                
> Out of memory exception with Xlsx file less than 5 MB
> -----------------------------------------------------
>
>                 Key: TIKA-734
>                 URL: https://issues.apache.org/jira/browse/TIKA-734
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows Vista , JUnit test cases running in RAD, JVM heap memory - 500MB
>            Reporter: Anirban Mitra
>         Attachments: Sample BIG Excel 2007 File.xls
>
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a 5 MB file, and when I run my
> JUnit test cases they fail with an out-of-memory (heap) exception. Is there any resolution for this?
