Posted to dev@tika.apache.org by "Neil Blue (JIRA)" <ji...@apache.org> on 2012/11/08 12:25:11 UTC

[jira] [Created] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

Neil Blue created TIKA-1020:
-------------------------------

             Summary: Excel 2010 parser missing cell values are not reported resulting in missing columns values
                 Key: TIKA-1020
                 URL: https://issues.apache.org/jira/browse/TIKA-1020
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.2
         Environment: java 1.6 & 1.7 
            Reporter: Neil Blue


When parsing an Excel 2010 table, if a worksheet has a missing value, it is not reported to the SAX handler. As a result, a missing value can leave the remaining cells in that row misaligned with their columns.

For example given the table:

{code:title=Bar.java|borderStyle=solid}
A B C
1 2 3
4   6
7 8 9
{code}

the returned sax handler reports elements

{code:title=Bar.java|borderStyle=solid}
<tr><td>A</td><td>B</td><td>C</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>6</td></tr>
<tr><td>7</td><td>8</td><td>9</td></tr>
{code}

As a result, the handler can detect that the third row has incomplete cell values, but it is ambiguous which columns have missing data.
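
The ambiguity can be reproduced without Tika at all. The following is a minimal sketch only: the `render` helper is hypothetical and merely mimics the reported behaviour by skipping empty cells; it is not Tika's code. Two rows with the gap in different columns produce identical output.

```java
import java.util.Arrays;
import java.util.List;

public class MissingCellAmbiguity {

    // Hypothetical helper (not part of Tika): emits one <td> per non-empty
    // cell and silently skips null cells, as described in this report.
    static String render(List<String> row) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (String cell : row) {
            if (cell != null) {
                sb.append("<td>").append(cell).append("</td>");
            }
        }
        return sb.append("</tr>").toString();
    }

    public static void main(String[] args) {
        List<String> missingB = Arrays.asList("4", null, "6"); // gap in column B
        List<String> missingC = Arrays.asList("4", "6", null); // gap in column C

        System.out.println(render(missingB));
        System.out.println(render(missingC));
        // Both rows render identically, so a consumer cannot tell them apart.
        System.out.println(render(missingB).equals(render(missingC)));
    }
}
```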

As a possible fix, the Excel 2010 XML data contains the cell reference value, which could be passed to the SAX handler as an attribute.

{code:title=Bar.java|borderStyle=solid}
*** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
--- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
***************
*** 200,206 ****
  
         public void cell(String cellRef, String formattedValue) {
            try {
!              xhtml.startElement("td");
  
               // Main cell contents
               xhtml.characters(formattedValue);
--- 200,208 ----
  
         public void cell(String cellRef, String formattedValue) {
            try {
!              AttributesImpl attributes = new AttributesImpl();
!              attributes.addAttribute("", "cellRef", "cellRef", "CDATA", cellRef);
!              xhtml.startElement("td",attributes);
  
               // Main cell contents
               xhtml.characters(formattedValue);


{code} 
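
To illustrate how a consumer might use the `cellRef` attribute proposed in the patch above, here is a rough sketch of a SAX handler that pads skipped columns with empty strings. The class name `CellRefRowHandler`, the `columnIndex` helper and the sample input are assumptions made for illustration; the sketch also assumes the attribute holds an A1-style reference such as "C3".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer of the proposed "cellRef" attribute (not Tika code).
public class CellRefRowHandler extends DefaultHandler {

    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;
    private StringBuilder text;
    private int column;

    // Converts the column letters of an A1-style reference to a zero-based
    // index, e.g. "A3" -> 0, "C3" -> 2, "AA1" -> 26.
    static int columnIndex(String cellRef) {
        int index = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break; // reached the row number
            }
            index = index * 26 + (c - 'A' + 1);
        }
        return index - 1;
    }

    // Works with both namespace-aware and non-namespace-aware parsers.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String element = name(localName, qName);
        if ("tr".equals(element)) {
            currentRow = new ArrayList<>();
        } else if ("td".equals(element) && currentRow != null) {
            String ref = atts.getValue("cellRef");
            // Without the attribute, fall back to the next free position.
            column = (ref != null) ? columnIndex(ref) : currentRow.size();
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String element = name(localName, qName);
        if ("td".equals(element) && text != null) {
            while (currentRow.size() < column) {
                currentRow.add(""); // pad columns that were skipped
            }
            currentRow.add(text.toString());
            text = null;
        } else if ("tr".equals(element) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    public List<List<String>> getRows() {
        return rows;
    }

    public static List<List<String>> parse(String xhtml) throws Exception {
        CellRefRowHandler handler = new CellRefRowHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getRows();
    }

    public static void main(String[] args) throws Exception {
        // Sample input shaped like the output the patch would produce.
        String xhtml = "<table>"
                + "<tr><td cellRef=\"A3\">4</td><td cellRef=\"C3\">6</td></tr>"
                + "</table>";
        System.out.println(parse(xhtml));
    }
}
```

Note that this sketch only fills gaps before a reported cell. Missing cells at the end of a row remain undetected, because the handler is never told how wide the row should be.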

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494783#comment-13494783 ] 

Nick Burch commented on TIKA-1020:
----------------------------------

The current Tika behaviour is what I'd expect: you're getting text for the cells with real values, and things aren't cluttered by the missing cells/rows (of which there can be huge numbers in many Excel files). I'm not sure we want to be putting cell references, blank cells, etc. into the HTML.

If you have specific requirements in this area, e.g. you actually want to generate things like CSV files, then you're best off using Apache POI directly, which does provide optional ways to detect these missing cells/rows and lets you put in your own logic to handle them as your needs dictate.
                
> Excel 2010 parser missing cell values are not reported resulting in missing columns values
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1020
>                 URL: https://issues.apache.org/jira/browse/TIKA-1020
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: java 1.6 & 1.7 
>            Reporter: Neil Blue
>              Labels: newbie, patch
