Posted to dev@tika.apache.org by Nick Burch <ap...@gagravarr.org> on 2013/10/08 15:14:19 UTC

Excel files with "holes" in the cell sequence

Hi All

The Excel file formats (.xls and .xlsx) are somewhat sparse formats, and 
where a cell has never been used it generally doesn't get written to the 
file. (Being a Microsoft format, there are exceptions to this...). 
Currently, if you parse a file with cells at A1 B1 F1 G1, then Tika will 
give you back a table with just 4 columns, squashing the gaps so that F1 
and G1 end up as the third and fourth columns.

Within POI, there is optional logic to detect these gaps and generate 
dummy cells to let you know that something was missed. So, if we wanted, 
we could detect and handle these without too much work.
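
To make the idea concrete, here is a minimal sketch in plain Java of what 
filling the gaps could look like for a single row. It does not use POI or 
Tika at all: the class and method names (GapFiller, columnIndex, padRow) 
are hypothetical, made up for illustration, and it assumes the cells of a 
row arrive in ascending column order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of POI or Tika.
final class GapFiller {

    /** Converts the column letters of a reference such as "F1" to a zero-based index (A=0, F=5, AA=26). */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (c - 'A' + 1);
        }
        if (col == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }

    /**
     * Returns the row's values with an empty string inserted for every
     * column that has no cell. Assumes refs are sorted by column and
     * that values is a parallel list of the same length.
     */
    static List<String> padRow(List<String> refs, List<String> values) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < refs.size(); i++) {
            int col = columnIndex(refs.get(i));
            while (out.size() < col) {
                out.add(""); // placeholder for a missing cell
            }
            out.add(values.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> refs = Arrays.asList("A1", "B1", "F1", "G1");
        List<String> values = Arrays.asList("a", "b", "f", "g");
        System.out.println(padRow(refs, values));
    }
}
```

For the A1 B1 F1 G1 example this gives a 7-entry row with three empty 
placeholders for C1, D1 and E1, rather than 4 adjacent columns. Whether 
the placeholders should become empty table cells in the output is exactly 
the open question.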

However, I'm not sure if that's something we should be doing or not? What 
do people think - should we be doing that level of processing before 
generating the SAX events, or would that be a step too far?

Nick

Re: Excel files with "holes" in the cell sequence

Posted by kevin slote <ks...@gmail.com>.
The last time I parsed spreadsheets with POI, I found a lot of
functionality for rendering the layout of the spreadsheet in CSS.  Does
anyone think that would be a worthwhile or feasible endeavor?  I would
love to become a committer to Tika.

