
Posted to dev@tika.apache.org by "Niall Pemberton (JIRA)" <ji...@apache.org> on 2007/11/26 03:10:43 UTC

[jira] Created: (TIKA-105) Excel parser implementation based on POI's Event API

Excel parser implementation based on POI's Event API
----------------------------------------------------

                 Key: TIKA-105
                 URL: https://issues.apache.org/jira/browse/TIKA-105
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Niall Pemberton
            Priority: Minor


Tika's existing ExcelParser implementation uses POI's HSSFWorkbook to extract text from an Excel file. POI also provides an alternative "Event API"[1] for processing Excel files - the advantage being that it has a much smaller memory footprint, but at the cost of a slightly more complex API.

I have written an alternative Excel parser implementation based on the Event API - if it's of interest to the Tika project I'll write a test case for it.


[1] http://poi.apache.org/hssf/how-to.html#event_api

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Dec 26, 2007 9:38 PM, Niall Pemberton <ni...@gmail.com> wrote:
> On Dec 26, 2007 7:19 PM, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> > When you say it includes the sheet name, you mean the name of each sheet
> > (tab) in the Excel file, right? Does it come out as bare text, or is it
> > encoded in a way that can be parsed (e.g. "{[Sheet: MySheet1]}")?  Or is
> > this configurable?
>
> Just plain text and not configurable ATM.

Having to use yet another parser on Tika output is something that we
should IMHO avoid as much as possible. A more reasonable way to make
the sheet structure available to clients that need it would be to use
the features of the XHTML output serialization.

How about something like this:

    <div class="sheet">
        <h1 class="sheet-title">....</h1>
        <p>...</p>
    </div>

or, if one wants to match Excel's screen representation more closely
(IMHO not a goal for Tika):

    <div class="sheet">
        <table>...</table>
        <p class="sheet-title">....</p>
    </div>

A client that needs the sheet content as structured data can then use
XPath queries like //div[@class='sheet'] or //*[@class='sheet-title']
to selectively extract the content of entire sheets or just their
titles.
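As a rough illustration of that client side, here is a minimal sketch using only the JDK's built-in DOM and XPath support. The class name SheetQuery, the helper select(), and the sample sheet names are made up for this example, and the sample markup leaves out the XHTML namespace for brevity; output that declares a namespace would need namespace-aware expressions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Tika.
class SheetQuery {

    // Evaluates an XPath expression against the markup and returns the
    // text content of every matching node, in document order.
    static List<String> select(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Sample input following the first markup proposal above;
        // the sheet names and cell text are invented.
        String xhtml =
                "<body>"
              + "<div class=\"sheet\"><h1 class=\"sheet-title\">Budget</h1><p>1 2 3</p></div>"
              + "<div class=\"sheet\"><h1 class=\"sheet-title\">Notes</h1><p>hello</p></div>"
              + "</body>";

        // Just the titles of all sheets.
        System.out.println(select(xhtml, "//*[@class='sheet-title']"));
        // The full text of each sheet, one entry per sheet.
        System.out.println(select(xhtml, "//div[@class='sheet']"));
    }
}
```

The same two expressions work against either of the markup variants above, since they match on the class attribute rather than on the element name or position.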

> > We have a need to read Excel files with more structure than the usual
> > unstructured text document.  At minimum, it would be great to be able to
> > know where one sheet ends and the next begins.  Is this something
> > that would be appropriate to support, or does that go beyond the generic
> > unstructured text parsing mission of Tika?
>
> I'll leave that for the Tika devs to comment on.

One of the stated goals for Tika is to support not only unstructured
but also structured text extraction. This goal was discussed at the
search roundtable in Amsterdam (see the followup thread at
http://markmail.org/message/ggihw2cns53t6ayl) and implemented on the
Parser API level by making the parsers output XHTML SAX events instead
of character streams (see TIKA-53).

Note however that the goal here is not to make Tika replace the native
Parser APIs, just produce structured enough output to satisfy the
needs of typical Tika clients.

I think Keith's need to distinguish sheet boundaries is within the
scope of Tika, but if one for example wants to find out detailed cell
formatting information they should instead be looking at the
underlying POI APIs.
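To make the "XHTML SAX events instead of character streams" idea concrete, here is a minimal sketch of how a parser might emit the sheet markup suggested earlier in this message. It uses only the JDK's SAX and TrAX classes; the class SheetEvents and its methods are hypothetical and are not taken from ExcelParser or ExcelEventParser, and the XHTML namespace is left out for brevity.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration, not part of Tika.
class SheetEvents {

    // Builds an attribute list holding a single class attribute,
    // or an empty list when cssClass is null.
    static AttributesImpl classAttribute(String cssClass) {
        AttributesImpl atts = new AttributesImpl();
        if (cssClass != null) {
            atts.addAttribute("", "class", "class", "CDATA", cssClass);
        }
        return atts;
    }

    // Emits <name class="...">text</name> as SAX events.
    static void element(ContentHandler handler, String name,
                        String cssClass, String text) throws SAXException {
        handler.startElement("", name, name, classAttribute(cssClass));
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", name, name);
    }

    // Emits one sheet: a div wrapping the title and the extracted text.
    static void emitSheet(ContentHandler handler, String title,
                          String text) throws SAXException {
        handler.startElement("", "div", "div", classAttribute("sheet"));
        element(handler, "h1", "sheet-title", title);
        element(handler, "p", null, text);
        handler.endElement("", "div", "div");
    }

    // Serializes the events to a string so the result can be inspected.
    static String render(String title, String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        serializer.startDocument();
        emitSheet(serializer, title, text);
        serializer.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Budget", "1 2 3"));
    }
}
```

Because the parser only talks to a ContentHandler, a client that wants plain text can plug in a handler that ignores the element events, while a client that needs sheet boundaries can act on the div and sheet-title events.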

BR,

Jukka Zitting

Re: [jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by Niall Pemberton <ni...@gmail.com>.

On Dec 26, 2007 7:19 PM, Keith R. Bennett <kb...@bbsinc.biz> wrote:
>
> Niall -
>
> When you say it includes the sheet name, you mean the name of each sheet
> (tab) in the Excel file, right?

Yes

> Does it come out as bare text, or is it
> encoded in a way that can be parsed (e.g. "{[Sheet: MySheet1]}")?  Or is
> this configurable?

Just plain text and not configurable ATM.

> We have a need to read Excel files with more structure than the usual
> unstructured text document.  At minimum, it would be great to be able to
> know where one sheet ends and the next begins.  Is this something
> that would be appropriate to support, or does that go beyond the generic
> unstructured text parsing mission of Tika?

I'll leave that for the Tika devs to comment on.

>  Also, based on your knowledge of
> Poi (I have none), how difficult is that to implement?  I may need to do it
> myself.

Very easy. Tika has two Excel parsers now: the original one
(ExcelParser) uses the simpler POI API, and the one I wrote
(ExcelEventParser) has a smaller memory footprint but uses the
slightly more complex POI Event API. I believe either of them could
easily be adapted to your needs, though.

Niall

> Thanks much,
> Keith
>
>
> JIRA jira@apache.org wrote:
> >
> >
> >     [
> > https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554021
> > ]
> >
> > Niall Pemberton commented on TIKA-105:
> > --------------------------------------
> >
>
> > The only functional difference between this implementation and ExcelParser
> > is that it also writes out the sheet name to the stream; this could easily
> > be added with a one-line change to ExcelParser, though.
> >
> >

Re: [jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.

Niall -

When you say it includes the sheet name, you mean the name of each sheet
(tab) in the Excel file, right?  Does it come out as bare text, or is it
encoded in a way that can be parsed (e.g. "{[Sheet: MySheet1]}")?  Or is
this configurable?

We have a need to read Excel files with more structure than the usual
unstructured text document.  At minimum, it would be great to be able to
know where one sheet ends and the next begins.  Is this something
that would be appropriate to support, or does that go beyond the generic
unstructured text parsing mission of Tika?  Also, based on your knowledge of
Poi (I have none), how difficult is that to implement?  I may need to do it
myself.

Thanks much,
Keith

JIRA jira@apache.org wrote:
> 
> 
>     [
> https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554021
> ] 
> 
> Niall Pemberton commented on TIKA-105:
> --------------------------------------
> 
> The only functional difference between this implementation and ExcelParser
> is that it also writes out the sheet name to the stream; this could easily
> be added with a one-line change to ExcelParser, though.
> 
> 

-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28TIKA-105%29-Excel-parser-implementation-based-on-POI%27s-Event-API-tp13942709p14505443.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

[jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553912 ] 

Jukka Zitting commented on TIKA-105:
------------------------------------

Looks good! I committed the class in revision 606141 so it'll be easier for people to try it out.

Is there any major functional difference (i.e. different text or metadata extracted) between this and our existing ExcelParser class? If not, I think we should probably make this one the default Excel parser and drop the other one.

Test cases would be very much welcome. :-)


[jira] Updated: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-105:
---------------------------------

    Attachment: ExcelEventParser.java


[jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Niall Pemberton (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554021 ] 

Niall Pemberton commented on TIKA-105:
--------------------------------------

Great, thanks - I have tested this quite a bit using Excel sheets from work, but I wanted to see if you were interested before creating test cases for Tika. I'll do that now, though - hopefully in the next couple of weeks.

The only functional difference between this implementation and ExcelParser is that it also writes out the sheet name to the stream; this could easily be added to ExcelParser with a one-line change, though.

Sorry about the ExcelUtils - it's a work in progress, mostly for TIKA-103 (cell formatting & date/number values). When I get time to finish it I plan to offer it to Tika (hopefully temporarily, until most of it gets into a POI release).


[jira] Resolved: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-105.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating
         Assignee: Jukka Zitting

I replaced the older ExcelParser with the ExcelEventParser in revision 613566. I also made some minor changes to the class (no info logging, JavaBean setter for listenForAllRecords, etc.).

Resolving this as fixed.


[jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553914 ] 

Jukka Zitting commented on TIKA-105:
------------------------------------

Note that the class had a single debug line referring to an ExcelUtils class that I don't have. For now I just commented that line out.
