Posted to dev@tika.apache.org by "Niall Pemberton (JIRA)" <ji...@apache.org> on 2008/03/21 15:19:25 UTC

[jira] Updated: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

     [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-132:
---------------------------------

    Attachment: TIKA-132-ExcelExtractor-refactor-v1.patch

Attaching a patch to refactor ExcelExtractor as per Jukka's suggestion. A few points to note:

 - Maintains "linked lists" of Rows and Cells (each Row/Cell holds a reference to the next Row/Cell)
 - Hyperlink support is currently commented out because it relies on unreleased POI features; the relevant lines are marked with FIXME
 - Empty sheets are ignored - is this OK?
 - Links still don't appear in the output when using the WriteOutContentHandler, because the link is an "href" attribute of an <a> element rather than character content - is this correct?
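
To illustrate the last point, here is a minimal sketch using plain SAX rather than Tika's own classes. TextOnlyHandler and LinkTextDemo are hypothetical names, and the sketch assumes a handler that, like a text-only writer, keeps only characters() events. Under that assumption the "href" value never reaches the output, because it travels in the Attributes of startElement():

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical text-only handler: keeps character data, ignores elements and attributes.
class TextOnlyHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class LinkTextDemo {
    // Fires the SAX events for <a href="...">label</a> and returns what the handler kept.
    static String render(String href, String label) {
        TextOnlyHandler handler = new TextOnlyHandler();
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", href);
        try {
            handler.startDocument();
            handler.startElement("", "a", "a", attrs); // href is carried here only
            handler.characters(label.toCharArray(), 0, label.length());
            handler.endElement("", "a", "a");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }
}
```

With this handler, only the link label survives; a handler that should show the target would have to inspect the attributes in startElement() itself.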

To try out the hyperlink support, uncomment the relevant lines and use a POI version built from the latest Subversion trunk.
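
For readers who don't want to open the patch, the per-sheet buffering idea can be sketched roughly as below. This is not the code from the attachment: SheetBuffer and its methods are hypothetical names, it uses sorted maps instead of the linked lists in the patch, and flush() returns plain strings where the real extractor would generate SAX events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sparse buffer for one sheet; not the classes from the attached patch.
class SheetBuffer {
    // One buffered cell: its text plus an optional link target that may arrive later.
    static final class Cell {
        final String text;
        String href; // stays null unless a hyperlink record for this cell is seen

        Cell(String text) {
            this.text = text;
        }
    }

    // row index -> (column index -> cell); only cells with content are stored.
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    // Called for each cell value record as it is read from the stream.
    void addCell(int row, int col, String text) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, new Cell(text));
    }

    // Called when a hyperlink record turns up after the cell records.
    void addHyperlink(int row, int col, String href) {
        TreeMap<Integer, Cell> cells = rows.get(row);
        Cell cell = (cells == null) ? null : cells.get(col);
        if (cell != null) {
            cell.href = href;
        }
    }

    // An empty buffer corresponds to an empty sheet, which can simply be skipped.
    boolean isEmpty() {
        return rows.isEmpty();
    }

    // Called at end of sheet: walks the cells in row/column order, one line per row.
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (TreeMap<Integer, Cell> cells : rows.values()) {
            StringBuilder line = new StringBuilder();
            for (Cell c : cells.values()) {
                if (line.length() > 0) {
                    line.append('\t');
                }
                line.append(c.href == null ? c.text : c.text + " <" + c.href + ">");
            }
            out.add(line.toString());
        }
        rows.clear(); // release the sheet's memory before the next sheet starts
        return out;
    }
}
```

Because nothing is output until flush(), a hyperlink record that arrives after the cell records can still be attached to its cell, and memory use is bounded by the contents of a single sheet.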

> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
>                 Key: TIKA-132
>                 URL: https://issues.apache.org/jira/browse/TIKA-132
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.1-incubating
>            Reporter: Niall Pemberton
>            Priority: Minor
>         Attachments: TIKA-132-ExcelExtractor-refactor-v1.patch
>
>
> In the Excel record stream, hyperlink records come at the end of the sheet, after the cell value records. This is a problem for the current streaming implementation of the Excel parser, since the hyperlink is not yet known when a cell is being processed and so cannot be output with it.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table of the contents of the sheet that is currently being parsed and would only generate the respective SAX events once the sheet has been parsed? Since we can focus on only the information that's relevant to Tika clients, the memory requirements should be moderate even for huge sheets (i.e. much less than the file size even for a single-sheet file). This should satisfy the low memory footprint requirements reasonably well while allowing us to generate more accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
