Posted to dev@tika.apache.org by "Radim Rehurek (JIRA)" <ji...@apache.org> on 2018/03/07 17:57:00 UTC

[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

    [ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389884#comment-16389884 ] 

Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:56 PM:
-------------------------------------------------------------

We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row).


was (Author: piskvorky):
We just hit this bug too.

I say "bug" because Excel spreadsheets are really tables with rows, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row).

> Excel 2010 parser missing cell values are not reported resulting in missing columns values
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1020
>                 URL: https://issues.apache.org/jira/browse/TIKA-1020
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: java 1.6 & 1.7 
>            Reporter: Neil Blue
>            Priority: Major
>              Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing cell value, that cell is not reported to the SAX handler. As a result, a missing value can leave the remaining data in the row misaligned.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the SAX handler receives the elements
> {code:title=Bar.java|borderStyle=solid}
> <tr><td>A</td><td>B</td><td>C</td></tr>
> <tr><td>1</td><td>2</td><td>3</td></tr>
> <tr><td>4</td><td>6</td></tr>
> <tr><td>7</td><td>8</td><td>9</td></tr>
> {code}
> As a result, the handler can detect that the third row has too few cell values, but it is ambiguous which columns are missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference for each cell, which could be passed to the SAX handler as an attribute.
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
> ***************
> *** 200,206 ****
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              xhtml.startElement("td");
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> --- 200,208 ----
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              AttributesImpl attributes = new AttributesImpl();
> !              attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef) ;
> !              xhtml.startElement("td",attributes);
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> {code} 
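
For illustration only, here is a minimal sketch of how a consumer could use a cellRef attribute like the one in the patch above to put each value back under its column. The class and method names (CellRefAligner, columnIndex, alignRow) are hypothetical and are not part of Tika or POI. The sketch assumes A1-style references such as "C3" and a row width that is known in advance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika or POI.
public class CellRefAligner {

    /** Converts the column letters of an A1-style reference ("B3", "AA10") to a zero-based index. */
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1; // -1 if the reference has no column letters
    }

    /** Places each value under the column named by its reference; unreported cells stay "". */
    public static List<String> alignRow(List<String> cellRefs, List<String> values, int width) {
        List<String> row = new ArrayList<>(Collections.nCopies(width, ""));
        for (int i = 0; i < cellRefs.size(); i++) {
            int col = columnIndex(cellRefs.get(i));
            if (col >= 0 && col < width) {
                row.set(col, values.get(i));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // The incomplete row from the example: only A3 and C3 are reported.
        List<String> row = alignRow(Arrays.asList("A3", "C3"), Arrays.asList("4", "6"), 3);
        System.out.println(row); // prints [4, , 6]
    }
}
```

With the references available, the value 6 lands in the third column instead of sliding into the second, which is the jumbled-record case described in the comment above.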


