Posted to dev@tika.apache.org by "Stephen Duncan Jr (JIRA)" <ji...@apache.org> on 2010/09/30 17:27:33 UTC

[jira] Created: (TIKA-521) OutOfMemoryError Parsing XSLX File

OutOfMemoryError Parsing XSLX File
----------------------------------

                 Key: TIKA-521
                 URL: https://issues.apache.org/jira/browse/TIKA-521
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.7, 0.8
            Reporter: Stephen Duncan Jr
         Attachments: memory-test.xlsx

I have several XLSX files I'm trying to parse with Tika that are failing with an OutOfMemoryError even when using a large heap size. For instance, the attached 1.26MB Excel file fails using a 512MB heap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916518#action_12916518 ] 

Nick Burch commented on TIKA-521:
---------------------------------

Excel files really do munch memory. XLSX is worse than XLS, as processing the XML into objects takes lots of memory.

Some files are worse than others; it depends on the kinds of things in them. I'd suggest you just up your heap size.



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Sjoerd Smeets (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917819#action_12917819 ] 

Sjoerd Smeets commented on TIKA-521:
------------------------------------

I'm facing the same issue. Increasing the heap size to the maximum covers a certain number of XLSX files, but there are still a lot of files causing an OutOfMemoryError (> 10 MB XLSX files). The XSSFEventBasedExcelExtractor does process these files as we would like. What would be the drawback of using XSSFEventBasedExcelExtractor?



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918932#action_12918932 ] 

Maxim Valyanskiy commented on TIKA-521:
---------------------------------------

If plain text is enough for you, you can apply the patch from TIKA-511 and call ExtractorFactory.setAllThreadsPreferEventExtractors(true) before running Tika.



[jira] Updated: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Sjoerd Smeets (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sjoerd Smeets updated TIKA-521:
-------------------------------

    Attachment: tika-diff.txt
                tika-new-files.tar.bz2

Proposed patch



[jira] Updated: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Stephen Duncan Jr (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Duncan Jr updated TIKA-521:
-----------------------------------

    Attachment: memory-test.xlsx



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Sjoerd Smeets (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920032#action_12920032 ] 

Sjoerd Smeets commented on TIKA-521:
------------------------------------

Attached a proposed patch for bigger XLSX files. It has been tested with an XLSX spreadsheet of 70 MB and a heap size of 1024 MB. It should be able to handle bigger files, since it uses SAX parsing. However, using a smaller heap size for the test file resulted in an OutOfMemoryError when extracting the different parts of the XLSX document:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
	at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:118)
	at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:154)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:68)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
	at com.ravn.test.tika.XLSTester.parse(XLSTester.java:47)
	at com.ravn.test.TikaTester.main(TikaTester.java:39)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

The proposed patch is an attempt to generate the same information about an XLSX document as the XSSFExcelExtractorDecorator parser does. There are still some issues to look into, which are marked with TODO comments. Some advice on these matters would be welcome. Could someone check whether the proposed patch is acceptable? If so, I'll try to implement the TODO items and write some test cases. Maybe this can then become the default parser.

I also changed/created certain parts in POI in order to get the patch working. See https://issues.apache.org/bugzilla/show_bug.cgi?id=50076 for the proposed changes for POI.
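
The top frames of the trace above show the allocation failing in ByteArrayOutputStream.toByteArray, called from POI's ZipInputStreamZipEntrySource. That suggests the zip entries are copied into byte arrays when the package is opened from a stream. The JDK-only sketch below illustrates that general pattern next to a chunked read. It is not POI's code, and the class and method names (ZipEntryBufferingSketch, bufferFirstEntry, streamFirstEntry) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not taken from POI or Tika.
class ZipEntryBufferingSketch {

    /** Builds a one-entry zip archive in memory, for demonstration only. */
    static byte[] makeZip(String name, byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    /** Pattern A: copy the first entry fully into a byte array; heap use grows with entry size. */
    static byte[] bufferFirstEntry(InputStream zipStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            if (zip.getNextEntry() == null) {
                throw new IOException("empty archive");
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // toByteArray() allocates a second full-size copy of the entry.
            return buffer.toByteArray();
        }
    }

    /** Pattern B: read the first entry in fixed-size chunks and keep only a byte count. */
    static long streamFirstEntry(InputStream zipStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            if (zip.getNextEntry() == null) {
                throw new IOException("empty archive");
            }
            byte[] chunk = new byte[4096];
            long total = 0;
            int n;
            while ((n = zip.read(chunk)) != -1) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sheet = "<sheetData/>".getBytes(StandardCharsets.UTF_8);
        byte[] zip = makeZip("xl/worksheets/sheet1.xml", sheet);
        System.out.println(bufferFirstEntry(new ByteArrayInputStream(zip)).length);
        System.out.println(streamFirstEntry(new ByteArrayInputStream(zip)));
    }
}
```

Both methods see the same bytes, but only the first holds a whole entry on the heap at once, which is where a large worksheet part would hurt.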



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917900#action_12917900 ] 

Nick Burch commented on TIKA-521:
---------------------------------

It would need someone to work up a patch. We can't simply use XSSFEventBasedExcelExtractor, as that produces only limited plain text, whereas we want to generate HTML and include headers, footers, links, comments, etc.

So, we'd need code that is similar to XSSFEventBasedExcelExtractor, but which also does the additional work to include the extra parts we currently have.



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Sjoerd Smeets (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918047#action_12918047 ] 

Sjoerd Smeets commented on TIKA-521:
------------------------------------

Ok, I'll see if I can create a patch for this.



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Stephen Duncan Jr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916534#action_12916534 ] 

Stephen Duncan Jr commented on TIKA-521:
----------------------------------------

I have 7MB files that can't be handled when given 2GB of RAM; they required 3GB to process. I'm likely going to need to run on 32-bit Java, so increasing the heap size that high is not really an option. Besides, at the growth rate I see, a 20MB file might require 10GB of heap. That simply doesn't scale for reasonable file sizes. Meanwhile, the same 7MB file can be parsed using the alternate API with a 128MB heap. That should allow any reasonable file to be processed, assuming a reasonable 1GB heap size.
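
As a rough check on the 20MB estimate, the sketch below extrapolates from the one data point given (7MB file, 3GB heap). It assumes heap use grows linearly with file size, which the comment does not state; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: linear extrapolation from a single measurement.
class HeapEstimateSketch {

    /** Assumes required heap is proportional to file size (an assumption, not a measured law). */
    static double estimateHeapGb(double knownFileMb, double knownHeapGb, double targetFileMb) {
        return knownHeapGb / knownFileMb * targetFileMb;
    }

    public static void main(String[] args) {
        // 7MB needed 3GB, so estimate what 20MB would need under the linear assumption.
        double estimate = estimateHeapGb(7.0, 3.0, 20.0);
        System.out.println(String.format(Locale.ROOT, "%.1f GB", estimate));
    }
}
```

Under that assumption the result is a little under 9GB, the same order of magnitude as the 10GB figure; faster-than-linear growth would push it higher.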



[jira] Updated: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-521:
-----------------------------------

    Component/s: parser

- classify



[jira] Commented: (TIKA-521) OutOfMemoryError Parsing XSLX File

Posted by "Stephen Duncan Jr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916517#action_12916517 ] 

Stephen Duncan Jr commented on TIKA-521:
----------------------------------------

Using the POI API directly, with its event-based model, I was able to parse the file using less than 20MB of heap space (less than 64MB of heap size allocated). Can Tika be modified to use the event-based API when extracting text? Here's the sample code used:

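import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;

// Open the workbook with POI's event-based extractor instead of the full object model.
final String filePath = "C:\\Users\\stephen.duncan\\tmp\\memory-test.xlsx";
XSSFEventBasedExcelExtractor extractor = new XSSFEventBasedExcelExtractor(filePath);

// Extract the plain text and print it.
String text = extractor.getText();
System.out.println(text);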
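
To illustrate why an event-based model needs so little heap, here is a minimal JDK-only SAX sketch. It is not POI's implementation: the class names are hypothetical, the sample XML is a simplified, namespace-free stand-in for a worksheet part, and shared strings, styles and formatting are ignored. The handler only keeps the text it emits, never a tree of cell objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of event-based (SAX) text extraction.
class SaxCellTextSketch {

    /** Appends the text of every <v> element, one value per line, as events arrive. */
    static class ValueHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inValue = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default factory is not namespace-aware, so the element name is in qName.
            if ("v".equals(qName)) {
                inValue = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inValue) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("v".equals(qName)) {
                inValue = false;
                text.append('\n');
            }
        }
    }

    /** Streams the XML through the handler and returns the collected cell values. */
    static String extractValues(String sheetXml) throws Exception {
        ValueHandler handler = new ValueHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sheetData><row><c><v>1</v></c><c><v>2</v></c></row></sheetData>";
        System.out.print(extractValues(xml));
    }
}
```

A real extractor could write each value straight to an output stream instead of a StringBuilder, so memory would stay roughly constant regardless of sheet size.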
