Posted to dev@tika.apache.org by "Denis Kildishev (JIRA)" <ji...@apache.org> on 2013/07/02 12:45:19 UTC

[jira] [Updated] (TIKA-1140) Better table representation, cell spanning in Word Extractor

     [ https://issues.apache.org/jira/browse/TIKA-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Kildishev updated TIKA-1140:
----------------------------------

    Attachment: word_table.patch

This patch is one possible solution. Note that some of the code is based on the specification for the 2007 version of the doc format (in particular, the color and border type decoding), so some adaptation will likely be needed to support earlier versions of the format.
                
> Better table representation, cell spanning in Word Extractor
> ------------------------------------------------------------
>
>                 Key: TIKA-1140
>                 URL: https://issues.apache.org/jira/browse/TIKA-1140
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Denis Kildishev
>            Priority: Minor
>         Attachments: word_table.patch
>
>
> The current version of the Word Extractor has access to various table features, but most of them are not used. Possible improvements include support for borders, fixed cell widths, and cell spanning.
> Note that some of these features are already used in POI's HTML converter, so that code can be reused in Tika.
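
For illustration only, here is a minimal sketch of one way cell spanning could be derived from fixed cell widths. It is not taken from word_table.patch, and the class and method names (CellSpanSketch, buildEdges, colspans) are hypothetical, not Tika or POI API. It assumes each row is available as an array of cell widths in twips, collects every distinct cell boundary in the table, and counts how many of those grid columns each cell covers.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical helper, not part of Tika or POI: derives colspan values from cell widths. */
final class CellSpanSketch {

    /** Collects every distinct cell boundary across all rows, sorted ascending. */
    static int[] buildEdges(int[][] cellWidthsPerRow) {
        TreeSet<Integer> edges = new TreeSet<>();
        for (int[] row : cellWidthsPerRow) {
            int x = 0;
            edges.add(x);              // left border of the row
            for (int width : row) {
                x += width;
                edges.add(x);          // right border of each cell
            }
        }
        return edges.stream().mapToInt(Integer::intValue).toArray();
    }

    /**
     * Returns one colspan per cell: the number of grid columns between the
     * cell's left and right edge. Assumes rowWidths is one of the rows that
     * was passed to buildEdges, so both edges are present in the array.
     */
    static int[] colspans(int[] edges, int[] rowWidths) {
        int[] spans = new int[rowWidths.length];
        int left = 0;
        for (int i = 0; i < rowWidths.length; i++) {
            int right = left + rowWidths[i];
            int from = Arrays.binarySearch(edges, left);
            int to = Arrays.binarySearch(edges, right);
            spans[i] = to - from;
            left = right;
        }
        return spans;
    }
}
```

With rows of widths {1000, 1000, 1000} and {2000, 1000}, the first cell of the second row covers two grid columns, so it would be emitted with a colspan of 2. Vertical merging (rowspan) and borders are not covered by this sketch.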

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira