Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/03/26 18:17:26 UTC

[jira] Commented: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

    [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582353#action_12582353 ] 

Jukka Zitting commented on TIKA-132:
------------------------------------

Thanks! Applied your patch as-is in revision 641394.

Good point about empty sheets; I was wondering before how we could avoid exposing them. (A related thought: perhaps we should avoid outputting the <h1> tags for default sheet names like "Sheet 1".)

Re: WriteOutContentHandler; I think that's acceptable, as IMHO in default settings WriteOutContentHandler should output the actual text that's visible in the document. If a client wants to access extra information like the embedded links, it should use the SAX stream.

There are a few improvements I'd like to make:

- I'd replace the processCellValue flow on the "text" variable with a method call, as the case-if-if control flow may be a bit hard to follow, especially if we keep adding functionality to processCellValue.
- We should leverage existing java.util collections instead of creating our own linked lists. For example, a SortedMap of cell coordinates to cell values should fit our needs and reduce the amount of custom code in Tika.
- Cell formatting could be delegated to TikaExcelCell subclasses for better separation of concerns.
- The inner classes could be made package-private top-level classes to avoid bloating ExcelExtractor.
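
To illustrate the SortedMap suggestion above, here is a rough sketch of what I have in mind. It is only an assumption about how this could look, not code from the patch or from Tika: SparseSheet, CellKey, and all method names are hypothetical. The point is that a TreeMap keyed by row and column keeps cells in output order for free, and a hyperlink record that arrives after the cell values can still be matched to its cell before anything is emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch only; not the actual ExcelExtractor code. */
class SparseSheet {

    /** Cell coordinate, ordered by row first and then by column. */
    static final class CellKey implements Comparable<CellKey> {
        final int row;
        final int column;

        CellKey(int row, int column) {
            this.row = row;
            this.column = column;
        }

        @Override
        public int compareTo(CellKey other) {
            int byRow = Integer.compare(row, other.row);
            return byRow != 0 ? byRow : Integer.compare(column, other.column);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CellKey && compareTo((CellKey) o) == 0;
        }

        @Override
        public int hashCode() {
            return 31 * row + column;
        }
    }

    // Only non-empty cells are stored, so the table stays sparse.
    private final SortedMap<CellKey, String> text = new TreeMap<>();
    private final Map<CellKey, String> links = new TreeMap<>();

    void setText(int row, int column, String value) {
        text.put(new CellKey(row, column), value);
    }

    /** Hyperlinks may be recorded after the cell values they belong to. */
    void setLink(int row, int column, String url) {
        links.put(new CellKey(row, column), url);
    }

    /** An empty sheet can simply be skipped by the caller. */
    boolean isEmpty() {
        return text.isEmpty();
    }

    /** Cells in row/column order; a real parser would emit SAX events here. */
    List<String> render() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<CellKey, String> cell : text.entrySet()) {
            String url = links.get(cell.getKey());
            out.add(url == null ? cell.getValue() : cell.getValue() + " <" + url + ">");
        }
        return out;
    }
}
```

In this sketch the parser would call setText and setLink as records arrive, then call render (or the SAX equivalent) once the end of the sheet is reached, skipping the sheet entirely when isEmpty returns true.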

I'll follow up with respective commits directly in svn, but feel free to debate my changes if you prefer other solutions. I'll update svn accordingly until there's consensus.


> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
>                 Key: TIKA-132
>                 URL: https://issues.apache.org/jira/browse/TIKA-132
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.1-incubating
>            Reporter: Niall Pemberton
>            Priority: Minor
>         Attachments: TIKA-132-ExcelExtractor-refactor-v2.patch
>
>
> In the excel record stream, hyperlink records come at the end of the sheet, after the cell value records. This is a problem for the current streaming implementation of the excel parser since it means the hyperlink cannot be output when a cell is being processed.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table of the contents of the sheet that is currently being parsed and would only generate the respective SAX events once the sheet has been parsed? Since we can focus on only the information that's relevant to Tika clients, the memory requirements should be moderate even for huge sheets (i.e. much less than the file size even for a single-sheet file). This should satisfy the low memory footprint requirements reasonably well while allowing us to generate more accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.