Posted to user@poi.apache.org by Matt Rogghe <mr...@blazent.com> on 2010/12/08 23:45:07 UTC

XSSF text extraction

Howdy folks, running into an issue extracting text from an .xlsx file.

I've used both my own extractor and the built-in XSSFEventBasedExcelExtractor.  In either case the output contains only data from number or date columns.  All character/string columns come back null.  Example code using the latter:

    public static void main(String[] args) throws XmlException, OpenXML4JException, IOException
    {
        XSSFEventBasedExcelExtractor e = new XSSFEventBasedExcelExtractor(
                "F:\\test_file.xlsx");

        String a = e.getText();

        System.out.println(a);
    }

Output is similar to:
400403  11/16/10

The file has 49 columns with a lot of other text/character data.

I've used both the most recent POI 3.7 release and POI 3.7-beta1.

Interestingly, when I open the file in Excel 2007, save it without making any changes, and rerun the extraction utilities, they work fine.  I encountered something similar in HSSF with older versions of Excel files.  Is it possible this is a similar problem?

Has anyone else seen this issue?

I'm unable to upload the problem Excel file because it contains client data.  The file is also enormous (270MB once I unzip the .xlsx).  To troubleshoot this, could I send a portion of the extracted OOXML data instead, and if so, which portion do you need?

Thanks for any help,
-Matt

RE: XSSF text extraction

Posted by Matt Rogghe <mr...@blazent.com>.
Thanks Nick,

Tested out the nightly build and it works perfectly.

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Monday, December 13, 2010 12:09 AM
To: POI Users List
Subject: RE: XSSF text extraction

On Fri, 10 Dec 2010, Nick Burch wrote:
> Can you unzip the .xlsx file, and send through the start of one of the 
> sheet#.xml files?

Should be fixed in r1045020 - the event usermodel didn't support inline 
strings and it looks like that's what you had. (The usermodel and 
usermodel extractor did already though, it was just the event stuff that 
didn't)

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org




RE: XSSF text extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 10 Dec 2010, Nick Burch wrote:
> Can you unzip the .xlsx file, and send through the start of one of the 
> sheet#.xml files?

Should be fixed in r1045020 - the event usermodel didn't support inline 
strings and it looks like that's what you had. (The usermodel and 
usermodel extractor did already though, it was just the event stuff that 
didn't)

Nick
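
For anyone hitting the same symptom, here is a minimal sketch of what "inline strings" means at the XML level. It does not use POI's own handler; it is a plain JDK SAX parser, and the class name InlineStringSketch, the cellTexts helper and the sample XML are all made up for illustration. A cell with t="s" stores an index into the shared strings table in <v>, while a cell with t="inlineStr" carries its text directly in <is><t>, so a handler that only reads <v> sees nothing for such cells.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class InlineStringSketch {

    /** Collects one text value per cell, resolving shared and inline strings. */
    static List<String> cellTexts(String sheetXml, List<String> sharedStrings) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            String cellType;            // value of the t attribute on the current <c>
            boolean inValue;            // currently inside <v>
            boolean inInline;           // currently inside <is>
            StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("c")) {
                    cellType = atts.getValue("t");
                    text.setLength(0);
                } else if (qName.equals("v")) {
                    inValue = true;
                } else if (qName.equals("is")) {
                    inInline = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                // Inline strings keep their text under <is>, not in <v>
                if (inValue || inInline) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("v")) {
                    inValue = false;
                } else if (qName.equals("is")) {
                    inInline = false;
                } else if (qName.equals("c")) {
                    String raw = text.toString();
                    // t="s" means <v> holds an index into the shared strings table
                    out.add("s".equals(cellType) ? sharedStrings.get(Integer.parseInt(raw)) : raw);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical three-cell row: shared string, inline string, plain number
        String xml = "<worksheet><sheetData><row>"
            + "<c r=\"A1\" t=\"s\"><v>0</v></c>"
            + "<c r=\"B1\" t=\"inlineStr\"><is><t>server-01</t></is></c>"
            + "<c r=\"C1\"><v>400403</v></c>"
            + "</row></sheetData></worksheet>";
        System.out.println(cellTexts(xml, Arrays.asList("Hostname")));
    }
}
```

Run as written, this prints [Hostname, server-01, 400403]. Dropping the inInline branch reproduces the reported behaviour on this sample: the number survives and the inline string comes back empty.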



RE: XSSF text extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 9 Dec 2010, Matt Rogghe wrote:
> Using the usermodel (so you can feel my pain, I had to put the file and 
> an executable jar up on a 64-bit server.  The file's size was causing 
> java to consume 5-8GB of heap space in usermodel mode):
>
> getRichStringCellValue()
> getStringCellValue()
> Both return the name of the column header for all columns.
>
> c.getRawValue()
> Returns null for all columns.

Can you unzip the .xlsx file, and send through the start of one of the 
sheet#.xml files? From the beginning, past the header cells which do work, 
and into a row or two of the data cells that are being returned blank.

Nick
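
With a file this size, one way to grab just the start of a sheet part without unzipping everything is to read the first few KB of the entry straight from the .xlsx. A rough sketch using only java.util.zip follows; the class name SheetHead and the head helper are hypothetical, and xl/worksheets/sheet1.xml is the usual location of the first sheet but may differ in a given file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SheetHead {

    /** Returns up to maxBytes from the start of the named zip entry, decoded as UTF-8. */
    static String head(String xlsxPath, String entryName, int maxBytes) throws IOException {
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                byte[] buf = new byte[maxBytes];
                int total = 0;
                int n;
                // read() may return fewer bytes than asked for, so loop until full or EOF
                while (total < maxBytes && (n = in.read(buf, total, maxBytes - total)) > 0) {
                    total += n;
                }
                return new String(buf, 0, total, StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java SheetHead <file.xlsx>");
            return;
        }
        // Assumed entry name; list the zip entries first if the sheet lives elsewhere
        System.out.println(head(args[0], "xl/worksheets/sheet1.xml", 4096));
    }
}
```

The cut-off is by bytes, so the last character of the output may be garbled if it lands in the middle of a multi-byte sequence. Raise the limit if 4096 bytes does not reach past the header row into the data cells.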



RE: XSSF text extraction

Posted by Matt Rogghe <mr...@blazent.com>.
Using the usermodel (so you can feel my pain: I had to put the file and an executable jar up on a 64-bit server, because the file's size was causing Java to consume 5-8GB of heap space in usermodel mode):
getRichStringCellValue()
getStringCellValue()
Both return the name of the column header for all columns.

c.getRawValue()
Returns null for all columns.


Using the latest snapshot jars:
poi-3.8-beta1-20101209.jar
poi-ooxml-3.8-beta1-20101209.jar
poi-ooxml-schemas-3.8-beta1-20101209.jar

Same issue as before.  It outputs anything that is a date or a number, but no character/text/string fields.

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Wednesday, December 08, 2010 8:04 PM
To: POI Users List
Subject: Re: XSSF text extraction

On Wed, 8 Dec 2010, Matt Rogghe wrote:
> I've used both my own extractor and the built in 
> XSSFEventBasedExcelExtractor.  In either case the output yields only 
> data from number or date columns.  All character/string columns return 
> null.  Example code for using the latter:

If you use the usermodel based one, does that work properly?

Also, have you tried with a recent nightly snapshot build in case it has 
been fixed lately? (There have been some tweaks since 3.7 went out)

Nick



Re: XSSF text extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 8 Dec 2010, Matt Rogghe wrote:
> I've used both my own extractor and the built in 
> XSSFEventBasedExcelExtractor.  In either case the output yields only 
> data from number or date columns.  All character/string columns return 
> null.  Example code for using the latter:

If you use the usermodel based one, does that work properly?

Also, have you tried with a recent nightly snapshot build in case it has 
been fixed lately? (There have been some tweaks since 3.7 went out)

Nick
