You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Chris Bamford <cb...@mimecast.com> on 2012/11/01 15:01:01 UTC

Controlling text extraction from XLS files

Hi there,

I have been reading with interest the POI docs on processing Excel documents.
In particular I am interested in text extraction for which I am currently using Aperture
1.6.0 which includes POI 3.8-beta5.

I would like to know how to fine tune the process because in addition to expected text
(column titles, workbook names, etc.), it also extracts a lot of floating point numbers
which bear no visible relation to the numbers in the cells.  I can only think that perhaps
they are dates (or similar) in an internal format?
Anyway, for my application they are unwanted and need to be suppressed somehow.

Could someone kindly point me in the right direction?

Thanks,

- Chris

Re: Controlling text extraction from XLSX files

Posted by Chris Bamford <cb...@mimecast.com>.
Sorry Nick,

Just saw this from one of your earlier mails:

On 2 Nov 2012, at 15:04, Nick Burch wrote:

The low level file formats are very different, but there's an analagous extractor that uses SAX XML events rather than record events

Would I be able to use the DataFormatter with this extractor ?

Thanks again,

- Chris

Re: Controlling text extraction from XLS files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi Nick

On 4 Nov 2012, at 15:31, "Nick Burch" <ap...@gagravarr.org> wrote:

> If you want numbers that look like they do in Excel, you'll likely want to use DataFormatter to format the double values based on their formatting rules
That sounds ideal, will take a look. Incidentally, why isn't this already in use in the code we discussed? Seems odd to me (admittedly with very little experience of such things!) that you would want to extract raw numbers rather than the way they look in Excel ... :-)

Thanks again

Chris


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLS files

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 4 Nov 2012, Chris Bamford wrote:
> I have tried this 
> ReadMsOfficeFiles<http://codezrule.wordpress.com/2012/01/05/extract-text-from-ms-office-2007-files-docx-pptx-xlsx/> 
> program and I think I may have found the cause of my particular issue 
> i.e. text extraction of doubles gives very large scary looking numbers 
> ("9.2999999999999999E-2" instead of "0.093").

If you want numbers that look like they do in Excel, you'll likely want to 
use DataFormatter to format the double values based on their formatting 
rules. Otherwise, you'll get the closest floating point representation of 
the number, which isn't always exactly the same...

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLSX files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi Nick,

> 
> Is XSSFEventBasedExcelExtractor any good?
> 

Yes, it looks like the best of both worlds - a low memory event processor with formatting available at cell processing time :-)
Thanks again.

- Chris


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLSX files

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 7 Nov 2012, Chris Bamford wrote:
> The DataFormatter works well.  However, the XSSFExcelExtractor runs out 
> of memory when I try to process a reasonably large XLSX file - is there 
> a more memory efficient way of processing such files?

Is XSSFEventBasedExcelExtractor any good?

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLSX files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi Nick

The DataFormatter works well.  However, the XSSFExcelExtractor runs out of memory when I try to process a reasonably large XLSX file - is there a more memory efficient way of processing such files?

Thanks,

- Chris

On 4 Nov 2012, at 12:44, Chris Bamford wrote:

Hi Nick,

Thanks for the steer for XLSX files.
I have tried this ReadMsOfficeFiles<http://codezrule.wordpress.com/2012/01/05/extract-text-from-ms-office-2007-files-docx-pptx-xlsx/> program and I think I may have found the cause of my particular issue i.e. text extraction of doubles gives very large scary looking numbers ("9.2999999999999999E-2" instead of "0.093").

In XSSFExcelExtractor.getText():

...

// Rows and cells
for (Object rawR : sheet) {
Row row = (Row)rawR;
for(Iterator<Cell> ri = row.cellIterator(); ri.hasNext();) {
Cell cell = ri.next();

// Is it a formula one?
if(cell.getCellType() == Cell.CELL_TYPE_FORMULA && formulasNotResults) {
text.append(cell.getCellFormula());
} else if(cell.getCellType() == Cell.CELL_TYPE_STRING) {
text.append(cell.getRichStringCellValue().getString());
} else {
XSSFCell xc = (XSSFCell)cell;
text.append(xc.getRawValue());     // shouldn't this just be text.append(cell.toString()); ?
}

// Output the comment, if requested and exists
       Comment comment = cell.getCellComment();
if(includeCellComments && comment != null) {
   // Replace any newlines with spaces, otherwise it
   //  breaks the output
   String commentText = comment.getString().getString().replace('\n', ' ');
   text.append(" Comment by ").append(comment.getAuthor()).append(": ").append(commentText);
}

if(ri.hasNext())
text.append("\t");
}
text.append("\n");
}

...

The highlighted line spits out the raw double in all its glory rather than just the text equivalent.
As this class is designed to produce text it seems reasonable to me that toString() would be sufficient, what do you think?
I have a spreadsheet which exhibits the problem, would you like me to send it?  If so, how?

Thanks,

- Chris

On 2 Nov 2012, at 15:04, Nick Burch wrote:

On Fri, 2 Nov 2012, Chris Bamford wrote:
The XLS extraction is going great.  For XLSX can I use the same mechanism?

Similar. The low level file formats are very different, but there's an analagous extractor that uses SAX XML events rather than record events

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org<ma...@poi.apache.org>
For additional commands, e-mail: user-help@poi.apache.org<ma...@poi.apache.org>





<https://serviceB.mimecast.com/mimecast/click?account=C1A1&code=520a90a81be92c80dac7974a447e65bf>





        [ Our Blog<https://serviceB.mimecast.com/mimecast/click?account=C1A1&code=de5a5fb9c363a9315c48774b80382f74> ]   [ Twitter<https://serviceB.mimecast.com/mimecast/click?account=C1A1&code=0805caf78cbff352efabfad6b5793ee4> ]   [ YouTube<https://serviceB.mimecast.com/mimecast/click?account=C1A1&code=5a09c8486cc49449a18c8e13731c33aa> ]


        Chris Bamford
Senior Developer
m: +44 7860 405292
www.mimecast.com<https://serviceB.mimecast.com/mimecast/click?account=C1A1&code=20ad68c80d30750ecf11a8a1c4714c63>

2-8 Balfe Street, London, N1 9EG

        +44 (0) 207 843 2300





Disclaimer
The information contained in this communication from cbamford@mimecast.com<ma...@mimecast.com> sent at 2012-11-04 12:43:30 is confidential and may be legally privileged. It is intended solely for use by user@poi.apache.org<ma...@poi.apache.org> and others authorized to receive it. If you are not user@poi.apache.org<ma...@poi.apache.org> you are hereby notified that any disclosure, copying, distribution or taking action in reliance of the contents of this information is strictly prohibited and may be unlawful.

Mimecast Ltd. is a company registered in England and Wales with the company number 4698693 VAT No. GB 123 4197 34
Registered Office: 2 - 8 Balfe Street, Kings Cross London, N1 9EG
Email Address: info@mimecast.com<ma...@mimecast.com>

________________________________
This email message has been scanned for viruses by Mimecast.
Mimecast delivers a complete managed email solution from a single web based platform.
For more information please visit http://www.mimecast.com<http://www.mimecast.com/><http://www.mimecast.com/>

________________________________

Re: Controlling text extraction from XLS files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi Nick,

Thanks for the steer for XLSX files.
I have tried this ReadMsOfficeFiles<http://codezrule.wordpress.com/2012/01/05/extract-text-from-ms-office-2007-files-docx-pptx-xlsx/> program and I think I may have found the cause of my particular issue i.e. text extraction of doubles gives very large scary looking numbers ("9.2999999999999999E-2" instead of "0.093").

In XSSFExcelExtractor.getText():

...

// Rows and cells
for (Object rawR : sheet) {
Row row = (Row)rawR;
for(Iterator<Cell> ri = row.cellIterator(); ri.hasNext();) {
Cell cell = ri.next();

// Is it a formula one?
if(cell.getCellType() == Cell.CELL_TYPE_FORMULA && formulasNotResults) {
text.append(cell.getCellFormula());
} else if(cell.getCellType() == Cell.CELL_TYPE_STRING) {
text.append(cell.getRichStringCellValue().getString());
} else {
XSSFCell xc = (XSSFCell)cell;
text.append(xc.getRawValue());     // shouldn't this just be text.append(cell.toString()); ?
}

// Output the comment, if requested and exists
       Comment comment = cell.getCellComment();
if(includeCellComments && comment != null) {
   // Replace any newlines with spaces, otherwise it
   //  breaks the output
   String commentText = comment.getString().getString().replace('\n', ' ');
   text.append(" Comment by ").append(comment.getAuthor()).append(": ").append(commentText);
}

if(ri.hasNext())
text.append("\t");
}
text.append("\n");
}

...

The highlighted line spits out the raw double in all its glory rather than just the text equivalent.
As this class is designed to produce text it seems reasonable to me that toString() would be sufficient, what do you think?
I have a spreadsheet which exhibits the problem, would you like me to send it?  If so, how?

Thanks,

- Chris

On 2 Nov 2012, at 15:04, Nick Burch wrote:

On Fri, 2 Nov 2012, Chris Bamford wrote:
The XLS extraction is going great.  For XLSX can I use the same mechanism?

Similar. The low level file formats are very different, but there's an analagous extractor that uses SAX XML events rather than record events

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org<ma...@poi.apache.org>
For additional commands, e-mail: user-help@poi.apache.org<ma...@poi.apache.org>


Chris Bamford
Senior Developer

2 - 8 Balfe Street
Kings Cross,
London, N1 9EG

mobile +44 7860 405292
tel: +44 (0) 207 843 2300
web www.mimecast.com


The information contained in this communication from cbamford@mimecast.com is confidential and may be legally privileged. It is intended solely for use by user@poi.apache.org and others authorized to receive it. If you are not user@poi.apache.org you are hereby notified that any disclosure, copying, distribution or taking action in reliance of the contents of this information is strictly prohibited and may be unlawful.


Mimecast Ltd. is a company registered in England and Wales with the company number 4698693 VAT No. GB 123 4197 34
Registered Office:2 - 8 Balfe Street, Kings Cross London, N1 9EG Email Address: info@mimecast.com

This email message has been scanned for viruses by Mimecast.
Mimecast delivers a complete managed email solution from a single web based platform.
For more information please visit http://www.mimecast.com

Re: Controlling text extraction from XLS files

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 2 Nov 2012, Chris Bamford wrote:
> The XLS extraction is going great.  For XLSX can I use the same 
> mechanism?

Similar. The low level file formats are very different, but there's an 
analagous extractor that uses SAX XML events rather than record events

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLS files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi again Nick,

The XLS extraction is going great.  For XLSX can I use the same mechanism?

Thanks,

- Chris

On 1 Nov 2012, at 17:25, Chris Bamford wrote:

> Hi Nick,
> 
> On 1 Nov 2012, at 16:11, Nick Burch wrote:
> 
> Yes. Formulas in Excel (.xls) are stored with two parts. One is the last value (number or string) that the formula evaluated to. The other is the formula itself, stored in a parsed format. With the false flag, you get the last result Excel calculated returned. With true, we try to turn the parsed formula tokens back into a string. For your case, the formula is actually shared across multiple cells, and another bit of POI is objecting because the sharing is expected to be resolved but hasn't been
> 
> Thanks for the clear explanation.  It would seem to me that returning the last result Excel calculated (setting it to false) is likely to be what is expected most of the time.  It could well be good enough for what I need.
> 
> Cheers
> 
> - Chris
> 
> 
> Chris Bamford
> Senior Developer
> 
> 2 - 8 Balfe Street
> Kings Cross,
> London, N1 9EG
> 
> mobile +44 7860 405292
> tel: +44 (0) 207 843 2300
> web www.mimecast.com
> 
> 
> The information contained in this communication from cbamford@mimecast.com is confidential and may be legally privileged. It is intended solely for use by user@poi.apache.org and others authorized to receive it. If you are not user@poi.apache.org you are hereby notified that any disclosure, copying, distribution or taking action in reliance of the contents of this information is strictly prohibited and may be unlawful.
> 
> 
> Mimecast Ltd. is a company registered in England and Wales with the company number 4698693 VAT No. GB 123 4197 34
> Registered Office:2 - 8 Balfe Street, Kings Cross London, N1 9EG Email Address: info@mimecast.com
> 
> This email message has been scanned for viruses by Mimecast.
> Mimecast delivers a complete managed email solution from a single web based platform.
> For more information please visit http://www.mimecast.com

Re: Controlling text extraction from XLS files

Posted by Chris Bamford <cb...@mimecast.com>.
Hi Nick,

On 1 Nov 2012, at 16:11, Nick Burch wrote:

Yes. Formulas in Excel (.xls) are stored with two parts. One is the last value (number or string) that the formula evaluated to. The other is the formula itself, stored in a parsed format. With the false flag, you get the last result Excel calculated returned. With true, we try to turn the parsed formula tokens back into a string. For your case, the formula is actually shared across multiple cells, and another bit of POI is objecting because the sharing is expected to be resolved but hasn't been

Thanks for the clear explanation.  It would seem to me that returning the last result Excel calculated (setting it to false) is likely to be what is expected most of the time.  It could well be good enough for what I need.

Cheers

- Chris


Chris Bamford
Senior Developer

2 - 8 Balfe Street
Kings Cross,
London, N1 9EG

mobile +44 7860 405292
tel: +44 (0) 207 843 2300
web www.mimecast.com


The information contained in this communication from cbamford@mimecast.com is confidential and may be legally privileged. It is intended solely for use by user@poi.apache.org and others authorized to receive it. If you are not user@poi.apache.org you are hereby notified that any disclosure, copying, distribution or taking action in reliance of the contents of this information is strictly prohibited and may be unlawful.


Mimecast Ltd. is a company registered in England and Wales with the company number 4698693 VAT No. GB 123 4197 34
Registered Office:2 - 8 Balfe Street, Kings Cross London, N1 9EG Email Address: info@mimecast.com

This email message has been scanned for viruses by Mimecast.
Mimecast delivers a complete managed email solution from a single web based platform.
For more information please visit http://www.mimecast.com

Re: Controlling text extraction from XLS files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 1 Nov 2012, Chris Bamford wrote:
> Exception in thread "main" java.lang.RuntimeException: Coding Error: 
> Expected ExpPtg to be converted from Shared to Non-Shared Formula by 
> ValueRecordsAggregate, but it wasn't
>
> which I see is listed as a bug here:
>
>  https://issues.apache.org/bugzilla/show_bug.cgi?id=52158

Ah, yes. Some work is needed on that, but no-one has so far found the time 
to do it :(

> However, if I setFormulasNotResults(false) it runs OK and I see numbers 
> being displayed, not Strings like "D1+C3".  Is this what you'd expect?

Yes. Formulas in Excel (.xls) are stored with two parts. One is the last 
value (number or string) that the formula evaluated to. The other is the 
formula itself, stored in a parsed format. With the false flag, you get 
the last result Excel calculated returned. With true, we try to turn the 
parsed formula tokens back into a string. For your case, the formula is 
actually shared across multiple cells, and another bit of POI is objecting 
because the sharing is expected to be resolved but hasn't been

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Controlling text extraction from XLS files

Posted by Chris Bamford <cb...@mimecast.com>.
Thanks Nick!

The EventBasedExcelExtractor looks very good indeed.
A quick question on it, if I may.  If I use it to extract from an Excel doc with setFormulasNotResults(true) I get:

Exception in thread "main" java.lang.RuntimeException: Coding Error: Expected ExpPtg to be converted from Shared to Non-Shared Formula by ValueRecordsAggregate, but it wasn't

which I see is listed as a bug here:

  https://issues.apache.org/bugzilla/show_bug.cgi?id=52158

Now this spreadsheet was created by a customer with Excel, so we have no control over how it was created, nor can I provide it to you.

However, if I setFormulasNotResults(false) it runs OK and I see numbers being displayed, not Strings like "D1+C3".  Is this what you'd expect?

Thanks,

- Chris


On 1 Nov 2012, at 14:33, Nick Burch wrote:

On Thu, 1 Nov 2012, Chris Bamford wrote:
I would like to know how to fine tune the process

The simplest way is just to write some code yourself, which loops over the sheets / rows / columns, and performs the exact cell -> text transformation that your business rules need.

If you have some memory to burn, and can use usermodel, then the code is pretty easy to write. Take a look at ExcelExtractor for an example
http://svn.apache.org/repos/asf/poi/trunk/src/java/org/apache/poi/hssf/extractor/ExcelExtractor.java

If you want to do it in a very low memory way, the coding can be a little bit more involved, take a look at things like
http://svn.apache.org/repos/asf/poi/trunk/src/java/org/apache/poi/hssf/extractor/EventBasedExcelExtractor.java
and
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java

Finally, if you think that one of the text extractors that ships with Apache POI (or Apache Tika!) is doing something wrong for some cells, raise a bug and upload a sample file that shows the problem

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Chris Bamford
Senior Developer

2 - 8 Balfe Street
Kings Cross,
London, N1 9EG

mobile +44 7860 405292
tel: +44 (0) 207 843 2300
web www.mimecast.com


The information contained in this communication from cbamford@mimecast.com is confidential and may be legally privileged. It is intended solely for use by user@poi.apache.org and others authorized to receive it. If you are not user@poi.apache.org you are hereby notified that any disclosure, copying, distribution or taking action in reliance of the contents of this information is strictly prohibited and may be unlawful.


Mimecast Ltd. is a company registered in England and Wales with the company number 4698693 VAT No. GB 123 4197 34
Registered Office:2 - 8 Balfe Street, Kings Cross London, N1 9EG Email Address: info@mimecast.com

This email message has been scanned for viruses by Mimecast.
Mimecast delivers a complete managed email solution from a single web based platform.
For more information please visit http://www.mimecast.com

Re: Controlling text extraction from XLS files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 1 Nov 2012, Chris Bamford wrote:
> I would like to know how to fine tune the process

The simplest way is just to write some code yourself, which loops over the 
sheets / rows / columns, and performs the exact cell -> text 
transformation that your business rules need.

If you have some memory to burn, and can use usermodel, then the code is 
pretty easy to write. Take a look at ExcelExtractor for an example
http://svn.apache.org/repos/asf/poi/trunk/src/java/org/apache/poi/hssf/extractor/ExcelExtractor.java

If you want to do it in a very low memory way, the coding can be a little 
bit more involved, take a look at things like
http://svn.apache.org/repos/asf/poi/trunk/src/java/org/apache/poi/hssf/extractor/EventBasedExcelExtractor.java
and
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java

Finally, if you think that one of the text extractors that ships with 
Apache POI (or Apache Tika!) is doing something wrong for some cells, 
raise a bug and upload a sample file that shows the problem

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org