Posted to user@tika.apache.org by "Gangwal, Adish (IS Consultant)" <AG...@consultantemail.com> on 2012/01/26 22:21:00 UTC

RE: Excel Parser - Blank Cell


Sorry, I am attaching the example Excel file that I am trying to parse.

_____________________________________________
From: Gangwal, Adish (IS Consultant)
Sent: Friday, January 13, 2012 6:26 PM
To: 'user@tika.apache.org'
Subject: Excel Parser - Blank Cell

Hi,

When I parse an Excel file that has an empty cell, Tika doesn't output an extra tab character for it.

If there are three cells and the middle one is empty, it skips the middle cell and outputs only the 1st and 3rd cells with a tab.

In the example below, the first column, 'Flag', is empty, and we would like a tab character in its place, as in rows 1 and 2. In row 3, the text 'ID COST - LONG TERM INVESTMENTS' should have a tab before it.
I am attaching the example Excel sheet.

How can I tell Tika not to ignore the empty cells?

Note: if the cells contain white space, it correctly inserts tabs.


Example output -

                Flag            Description             Starting Balance                Debits          Credits         Net Activity            Ending Balance

1                               ASSETS EXCLUDING MARKET VALUE

2                               ID COST - SWAPS         2,502,043.770           196,996,488.330         197,527,735.400         -531,247.070            1,970,796.700

3               ID COST - LONG TERM INVESTMENTS         814,320,658.100         210,385,704.520         235,299,892.650         -24,914,188.130


RE: Excel Parser - Blank Cell

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 27 Jan 2012, Gangwal, Adish (IS Consultant) wrote:
> We want to use Tika as it supports different doc formats and not just
> xls or doc like POI. I think streamed parsing also makes Tika a lot
> faster and more efficient than POI at parsing even large docs of 15 MB
> or greater.

The streamed parsing of Excel files in Tika is powered by POI!

> I understand that Tika uses POI under the covers to parse Excel. So, is
> there some way to tell Tika (and in turn POI) to follow some Missing
> Cell Policy?

A missing cell policy won't help here, because you're doing streaming event
parsing.

It sounds like you have some very specific business requirements around
the minimum number of cells per row, missing and blank cell handling,
etc. Tika is never going to be able to do everything for everyone, so for
your specific case you may be best off writing your own custom parser and
dropping that into Tika. The XLS2CSVmra example in Apache POI is a good
basis for doing XLS -> CSV with full control over missing cells and
missing rows (you can set a minimum number of columns to output, for
example), and the XLSX2CSV example does something similar for XLSX -> CSV.
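
To illustrate the kind of control meant here, below is a minimal sketch
of padding a sparse row out to a minimum number of columns. It is plain
Java and does not use POI or Tika; the class name SparseRowWriter and the
method toTabSeparated are hypothetical names chosen for this example, and
a custom parser would have to collect the (column index, text) pairs from
the cell events itself.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of POI or Tika: renders one sparse row
// as tab-separated text, emitting an empty field for every missing cell.
final class SparseRowWriter {

    // cells: 0-based column index -> text, only for cells that exist.
    // minColumns: minimum number of fields to emit for the row.
    static String toTabSeparated(Map<Integer, String> cells, int minColumns) {
        TreeMap<Integer, String> sorted = new TreeMap<>(cells);
        int lastColumn = sorted.isEmpty() ? -1 : sorted.lastKey();
        int columns = Math.max(minColumns, lastColumn + 1);

        StringBuilder line = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            if (col > 0) {
                line.append('\t');      // separator between fields
            }
            String value = sorted.get(col);
            if (value != null) {
                line.append(value);     // missing cells stay empty
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> row = new TreeMap<>();
        row.put(0, "3");
        row.put(2, "ID COST - LONG TERM INVESTMENTS");
        // Make the tabs visible when printing the padded row.
        System.out.println(toTabSeparated(row, 4).replace("\t", "<TAB>"));
    }
}
```

With this approach the gap in column 1 still produces its own separator,
so the description text stays in the same column on every row.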

Nick

RE: Excel Parser - Blank Cell

Posted by "Gangwal, Adish (IS Consultant)" <AG...@consultantemail.com>.
Thanks Nick

We want to use Tika as it supports different doc formats and not just xls or doc like POI.
I think streamed parsing also makes Tika a lot faster and more efficient than POI at parsing even large docs of 15 MB or greater.

I understand that Tika uses POI under the covers to parse Excel. So, is there some way to tell Tika (and in turn POI) to follow some Missing Cell Policy?

This would help produce a text document in a very readable format when cells are missing.

Any direction is really appreciated.

-Adish



-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Friday, January 27, 2012 7:04 AM
To: 'user@tika.apache.org'
Subject: RE: Excel Parser - Blank Cell

On Thu, 26 Jan 2012, Gangwal, Adish (IS Consultant) wrote:
> When I parse an Excel file that has an empty cell, Tika doesn't output
> an extra tab character for it.
>
> If there are three cells and the middle one is empty, it skips the
> middle cell and outputs only the 1st and 3rd cells with a tab.

Tika itself doesn't generate tab characters; it generates XHTML table elements. It's the text content handler that does the tabs.

In general though, Tika will generate the text that is present.
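
As a rough illustration of that split between table elements and tabs, here is a small sketch of a SAX handler that turns table markup into tab-separated text. It uses only the JDK's SAX parser and is not Tika's actual text handler; the class name TableToText and the sample markup are assumptions made for this example. An empty <td/> that is present in the input produces a tab, while a cell that is absent from the input produces nothing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not Tika code): writes one line per <tr> and
// puts a tab between the <td> elements that are present in the input.
final class TableToText extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean firstCellInRow;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("tr")) {
            firstCellInRow = true;
        } else if (qName.equals("td")) {
            if (!firstCellInRow) {
                out.append('\t');   // one tab per cell after the first
            }
            firstCellInRow = false;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("tr")) {
            out.append('\n');       // end of row
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    static String convert(String xhtml) {
        TableToText handler = new TableToText();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        // Row 1 has an explicit empty cell; row 2 simply has fewer cells.
        String xhtml = "<table>"
                + "<tr><td>1</td><td/><td>ASSETS</td></tr>"
                + "<tr><td>3</td><td>ID COST</td></tr>"
                + "</table>";
        System.out.print(convert(xhtml).replace("\t", "<TAB>"));
    }
}
```

The handler can only place tabs for cells it is told about, which is why a cell that never appears in the event stream leaves no gap in the text output.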

If you're trying to generate a CSV or similar, and want full control over what shows up, missing cells, etc., then I'd suggest you look at using Apache POI directly.

Nick
