Posted to dev@tika.apache.org by "Anirban Mitra (Created) (JIRA)" <ji...@apache.org> on 2011/09/28 15:21:45 UTC

[jira] [Created] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Out of memory exception with Xlsx file less than 5 MB
-----------------------------------------------------

                 Key: TIKA-734
                 URL: https://issues.apache.org/jira/browse/TIKA-734
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Windows Vista, JUnit test cases running in RAD, JVM heap memory - 500MB
            Reporter: Anirban Mitra


I am trying to parse and extract a pattern from Xlsx files. I tried a 5 MB file, and when I run my
JUnit test cases they fail with a heap out-of-memory exception. Is there a fix or workaround for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116517#comment-13116517 ] 

Nick Burch commented on TIKA-734:
---------------------------------

The 0.10 release vote is open for another few hours. You can try with the release candidate if you want, see <http://mail-archives.apache.org/mod_mbox/tika-user/201109.mbox/%3C0EBE3FBF-61B4-461E-A818-2FF936EE474D@jpl.nasa.gov%3E>
                

        

[jira] [Updated] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anirban Mitra updated TIKA-734:
-------------------------------

    Attachment: Sample BIG Excel 2007 File.xls

Hi,
The out-of-memory issue is resolved now, but we are seeing a serious performance problem with 10 concurrent users when parsing the attached 10 MB xlsx file: it takes around 15 minutes on average for 10 concurrent users to parse the document. After profiling the code with JProfiler, we found that AutoDetectParser.parse() takes most of the time, and many threads are waiting or blocked. I am using xmlbeans-2.3.0.jar and xml-apis-1.0.b2.jar. Any suggestions would be helpful.
Thanks
Anirban
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125385#comment-13125385 ] 

Nick Burch commented on TIKA-734:
---------------------------------

If you're using the AutoDetectParser, then you'd expect the parse method to take the time. Any chance you could dig in further and see what inside there is the performance bottleneck?
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152329#comment-13152329 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

Hello ,

I am using the following code.

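		import java.io.InputStream;
		import java.io.OutputStream;
		import org.apache.tika.metadata.Metadata;
		import org.apache.tika.parser.AutoDetectParser;
		import org.apache.tika.parser.ParseContext;
		import org.apache.tika.parser.Parser;
		import org.apache.tika.sax.BodyContentHandler;

		public class TikaConverter {  // class name is illustrative; the original snippet did not name it
			private final ParseContext context = new ParseContext();
			private final Parser parser = new AutoDetectParser();
			private final OutputStream outputStream;   // outputStream is a PipedOutputStream
			private final InputStream fileInputStream;
			private final String fileName;             // assumption: fileName was not defined in the original snippet

			public TikaConverter(InputStream argIp, OutputStream argOutputStream, String fileName) {
				// Make the same parser available for any embedded documents.
				this.context.set(Parser.class, parser);
				this.outputStream = argOutputStream;
				this.fileInputStream = argIp;
				this.fileName = fileName;
			}

			public void convert() throws Exception {
				Metadata metadata = new Metadata();
				metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
				// The handler writes the extracted body text to the piped output stream.
				BodyContentHandler contentHandler = new BodyContentHandler(this.outputStream);
				parser.parse(fileInputStream, contentHandler, metadata, context);
			}
		}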

I am parsing this way because I want to attach a PipedInputStream to the PipedOutputStream so that
I can process the output more efficiently. While Tika reads the file, it passes the parsed content to the piped stream, and another thread picks up
the text from the piped stream and starts processing it. The idea is that to parse a 30 MB file I do not need to wait for Tika
to parse the complete file; instead it can keep parsing small chunks of the file and hand them to other threads for processing.

Still, the time taken has not improved much. Do you have any suggestions about the way I am using Tika?
Is this a correct way of using Tika?

I am not using tika.parseToString() because it returns the whole parse result as a single string, and until then the other threads would be blocked.

I hope I have explained my issue clearly. I would appreciate a response.


Thanks
Anirban
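
A minimal sketch of the piped arrangement described in the message above, using only java.io so that it runs without Tika. The class name PipedExtractionSketch and the TextProducer interface are hypothetical; TextProducer stands in for the parser.parse(...) call that writes extracted text to the PipedOutputStream. What it tries to show is that the writing side should run on its own thread and close the output end, so that the reading thread can make progress and see end-of-stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PipedExtractionSketch {

    /** Hypothetical stand-in for the Tika parse call: writes extracted text to the sink. */
    public interface TextProducer {
        void writeTo(OutputStream sink) throws IOException;
    }

    /** Runs the producer on a worker thread and reads its output line by line on the calling thread. */
    public static List<String> consumeLines(TextProducer producer) {
        List<String> lines = new ArrayList<>();
        try (PipedInputStream in = new PipedInputStream();
             PipedOutputStream out = new PipedOutputStream(in)) {
            Thread worker = new Thread(() -> {
                // Closing the output end is what signals end-of-stream to the reader.
                try (OutputStream sink = out) {
                    producer.writeTo(sink);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            worker.start();
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                // Each line can be processed as soon as the producer has written it.
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            worker.join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = consumeLines(sink ->
            sink.write("row 1\nrow 2\n".getBytes(StandardCharsets.UTF_8)));
        System.out.println(lines);
    }
}
```

This arrangement only saves time when the producer emits text incrementally. If the producer reads the whole document before writing anything, the consumer still waits for it.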

		


                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121032#comment-13121032 ] 

Jukka Zitting commented on TIKA-734:
------------------------------------

Tika 0.10 is now available. If the problem still occurs, please attach an example file that can be used to reproduce the issue.
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116467#comment-13116467 ] 

Nick Burch commented on TIKA-734:
---------------------------------

Please re-test with a newer version of Tika (ideally 0.10 which should be out shortly)
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122614#comment-13122614 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

Thanks. I will let you know soon.


-- Anirban 



                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125305#comment-13125305 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

Hello ,

The memory issue is gone with the xlsx file, but we are seeing serious performance problems with concurrent users. A 10 MB file takes around 15 minutes for one user when
10 concurrent users hit the app.
I have given more details on the website. Please advise.

Thanks
Anirban


                

        

[jira] [Resolved] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Jukka Zitting (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-734.
--------------------------------

    Resolution: Cannot Reproduce

I can't reproduce the problem you're describing.

On my computer the following code (that parses the attached file 400 times in total using 20 concurrent threads to do so) completes in less than a minute and requires less than 200MB of memory (10MB per thread).

{code}
// One shared Tika instance, used from all threads.
final Tika tika = new Tika();
final File file = new File("Sample BIG Excel 2007 File.xls");
// 20 threads, each parsing the file 20 times (400 parses in total).
for (int i = 0; i < 20; i++) {
    new Thread(new Runnable() {
        public void run() {
            for (int j = 0; j < 20; j++) {
                try {
                    tika.parseToString(file);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }).start();
}
{code}
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116488#comment-13116488 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

Thank you very much, Nick.
How long do I need to wait for the 0.10 release? Can I get a copy of 0.10 to test my code?


                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152405#comment-13152405 ] 

Jukka Zitting commented on TIKA-734:
------------------------------------

Did you see the parse() method [1] that returns a java.io.Reader instead of a String? That should achieve the same thing you're doing.

Note however that only some of the parsers in Tika support such streaming. Others like the MS Office parser will in any case parse the entire input document or at least significant parts of it before starting to output any of the extracted content.

[1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html#parse(java.io.File)
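
A rough sketch of consuming a java.io.Reader in chunks, which is one way the Reader returned by the parse() method mentioned above could be processed incrementally. To keep it runnable without Tika on the classpath, a StringReader stands in for that Reader; the class name ReaderChunksSketch and the method forEachChunk are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

public class ReaderChunksSketch {

    /**
     * Reads text in chunks of at most chunkSize characters and hands each chunk
     * to the handler as soon as it has been read. Closes the reader when done.
     */
    public static void forEachChunk(Reader reader, int chunkSize, Consumer<String> handler) {
        char[] buffer = new char[chunkSize];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                if (n > 0) {
                    handler.accept(new String(buffer, 0, n));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: with Tika available, the reader would come from new Tika().parse(file).
        Reader reader = new StringReader("cell A1\tcell B1\n");
        forEachChunk(reader, 8, System.out::print);
    }
}
```

As the comment above points out, whether chunks arrive before the whole document has been parsed depends on the parser handling that file type.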
                

        

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

Posted by "Anirban Mitra (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125242#comment-13125242 ] 

Anirban Mitra commented on TIKA-734:
------------------------------------

The memory issue is gone now, but we are seeing a major performance hit in the application with concurrent users (10 users). Can you suggest something?
                