Posted to dev@tika.apache.org by "Maxim Valyanskiy (JIRA)" <ji...@apache.org> on 2009/06/24 15:43:07 UTC

[jira] Created: (TIKA-250) XLS parser does not extract empty sheet names

XLS parser does not extract empty sheet names
---------------------------------------------

                 Key: TIKA-250
                 URL: https://issues.apache.org/jira/browse/TIKA-250
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Maxim Valyanskiy
            Priority: Minor


ExcelExtractor misses sheet titles if the sheet is empty. The fix is trivial; patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-250) XLS parser does not extract empty sheet names

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724114#action_12724114 ] 

Jukka Zitting commented on TIKA-250:
------------------------------------

The currentSheet.isEmpty() conditional was added explicitly to avoid outputting empty sheets. Most Excel files out there have the three default worksheets, but in the majority of cases only the first sheet contains anything, and it's cleaner if the empty extra sheets aren't included in the output.

Are there real-world cases where the name of an empty sheet is an important part of the extracted text content? I would assume that any essential sheets contain at least some content besides the sheet name.


[jira] Commented: (TIKA-250) XLS parser does not extract empty sheet names

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725112#action_12725112 ] 

Maxim Valyanskiy commented on TIKA-250:
---------------------------------------

Yes, there are real cases where we need to know the names of empty sheets. For example, we had a workbook in which each sheet represented a branch of the company, and some sheets were empty simply because the information had not been filled in yet. When we extracted text from those files, the names of those branches were missing, so later searches of our database for those names found nothing.
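
To make the trade-off in this thread concrete, here is a minimal Java sketch of the two behaviours: skipping empty sheets entirely versus still emitting their names. It is not Tika's ExcelExtractor code and not the attached patch. The Sheet class, the extract method, the includeEmptySheets flag, and the sheet names are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SheetNameSketch {

    /** Hypothetical stand-in for a parsed worksheet: a name plus its cell texts. */
    static final class Sheet {
        final String name;
        final List<String> cells;

        Sheet(String name, String... cells) {
            this.name = name;
            this.cells = Arrays.asList(cells);
        }

        boolean isEmpty() {
            return cells.isEmpty();
        }
    }

    /**
     * Builds the extracted text for a workbook.
     * With includeEmptySheets == false, empty sheets are dropped completely,
     * including their names. With true, the name is emitted even when the
     * sheet has no cells.
     */
    static String extract(List<Sheet> sheets, boolean includeEmptySheets) {
        StringBuilder out = new StringBuilder();
        for (Sheet sheet : sheets) {
            if (sheet.isEmpty() && !includeEmptySheets) {
                continue; // the sheet name never reaches the output
            }
            out.append(sheet.name).append('\n');
            for (String cell : sheet.cells) {
                out.append('\t').append(cell).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical workbook: one filled-in branch sheet, one still empty.
        List<Sheet> workbook = new ArrayList<>();
        workbook.add(new Sheet("Branch A", "Revenue", "100"));
        workbook.add(new Sheet("Branch B")); // not filled in yet

        System.out.println("Empty sheets skipped:");
        System.out.print(extract(workbook, false));
        System.out.println("Empty sheet names kept:");
        System.out.print(extract(workbook, true));
    }
}
```

In the first mode a full-text search for "Branch B" would find nothing, which matches the problem described above. In the second mode the name is indexed even though the sheet has no cells.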


[jira] Updated: (TIKA-250) XLS parser does not extract empty sheet names

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Valyanskiy updated TIKA-250:
----------------------------------

    Attachment: empty.patch


[jira] Resolved: (TIKA-250) XLS parser does not extract empty sheet names

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-250.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Fair enough, fix committed in revision 801432. Thanks for the patch and the rationale!
