Posted to dev@tika.apache.org by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org> on 2011/11/17 22:16:52 UTC
[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152329#comment-13152329 ]
Anirban Mitra commented on TIKA-734:
------------------------------------
Hello,
I am using the following code:
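// Fragment of the converter class; the class name TikaConverter is only illustrative.
public TikaConverter(InputStream argIp, OutputStream argOutputStream)
{
    this.context = new ParseContext();
    this.parser = new AutoDetectParser();
    this.context.set(Parser.class, parser);   // make the same parser available for embedded documents
    this.outputStream = argOutputStream;      // a PipedOutputStream
    this.fileInputStream = argIp;
}

public void convert() throws IOException, SAXException, TikaException
{
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);   // file name as a hint for type detection
    BodyContentHandler contentHandler = new BodyContentHandler(this.outputStream); // outputStream is a PipedOutputStream
    parser.parse(fileInputStream, contentHandler, metadata, context);
}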
The reason I parse this way is that I want a PipedInputStream attached to the PipedOutputStream, so that the work can be
pipelined. While Tika reads the file, it writes the parsed content to the piped stream, and another thread picks up the
text from the piped stream and starts processing it. The idea is that to parse a 30 MB file I should not have to wait for Tika
to parse the complete file; instead it can keep parsing small chunks of the file and hand them to other threads for processing.
Even so, I am not seeing much improvement in elapsed time. Do you have any suggestions on the way I am using Tika?
Is this a correct way of using it?
I am not using tika.parseToString() because it returns the whole parsed text at once, and until then the other threads would be blocked.
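To make the hand-off concrete, here is a minimal sketch of the piped-stream arrangement using only the JDK. It is not my actual code: the class name PipedHandoffSketch and the method transfer are made up for illustration, and the producer thread simply writes a string where the real code would call parser.parse(...) with a BodyContentHandler wrapping the PipedOutputStream.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

public class PipedHandoffSketch {

    // Sends `text` through a pipe: a producer thread writes it,
    // and the calling thread reads it back chunk by chunk.
    static String transfer(String text) throws Exception {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 8192);
        AtomicReference<IOException> producerError = new AtomicReference<>();

        Thread producer = new Thread(() -> {
            // Assumption: stand-in for the Tika parse call writing into the pipe.
            // Closing the writer closes pipeOut, which signals end of stream to the reader.
            try (Writer w = new OutputStreamWriter(pipeOut, StandardCharsets.UTF_8)) {
                w.write(text);
            } catch (IOException e) {
                producerError.set(e);
            }
        });
        producer.start();

        // Consumer side: picks up text as soon as the producer has written it.
        StringBuilder collected = new StringBuilder();
        try (Reader r = new InputStreamReader(pipeIn, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                collected.append(buf, 0, n);
            }
        }
        producer.join();
        if (producerError.get() != null) {
            throw producerError.get();
        }
        return collected.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(transfer("row 1\nrow 2\n"));
    }
}
```

In the real setup the producer and consumer must run on different threads, and the PipedOutputStream has to be closed once parsing finishes; otherwise the reading side never sees the end of the stream.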
I hope this explains my issue. I would appreciate a response.
Thanks
Anirban
> Out of memory exception with Xlsx file less than 5 MB
> -----------------------------------------------------
>
> Key: TIKA-734
> URL: https://issues.apache.org/jira/browse/TIKA-734
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: Windows Vista, JUnit test cases running in RAD, JVM heap memory - 500MB
> Reporter: Anirban Mitra
> Attachments: Sample BIG Excel 2007 File.xls
>
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a 5 MB file, and when I run my
> JUnit test cases, they fail with a heap out-of-memory exception. Is there any resolution for this?