Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/07/22 12:57:20 UTC

[jira] [Resolved] (TIKA-2025) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

     [ https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2025.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.14
                   2.0

Up to 15 digits are now extracted for numbers with the "General" format, contrary to the MS spec.  Beyond 15 digits, we use scientific notation with more significant digits than we had before.

{noformat}
        assertContains("123456789012345", xml);//15 digit number
        assertContains("123456789012346", xml);//15 digit formula
        assertContains("1.23456789012345E+15", xml);//16 digit number is treated as scientific notation
        assertContains("1.23456789012345E+15", xml);//16 digit formula, ditto
{noformat}
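
For readers who want to see the rule in isolation, here is a minimal sketch of the behaviour described above. It is not the code that went into Tika: the class name, method name and threshold constant are hypothetical, it only handles integer-valued cells, and the input values are chosen for illustration rather than taken from the attached spreadsheet.

```java
import java.util.Locale;

public class GeneralFormatSketch {
    // Hypothetical threshold: integers with at most 15 digits are printed in full.
    private static final double PLAIN_LIMIT = 1e15;

    /** Formats an integer-valued cell value; anything else is out of scope for this sketch. */
    public static String formatIntegerCell(double value) {
        if (Double.isInfinite(value) || value != Math.rint(value)) {
            throw new IllegalArgumentException("sketch handles integer-valued cells only");
        }
        if (Math.abs(value) < PLAIN_LIMIT) {
            // 15 digits or fewer: emit every digit
            return Long.toString((long) value);
        }
        // 16 digits or more: 1 leading digit + 14 decimals = 15 significant digits
        return String.format(Locale.ROOT, "%.14E", value);
    }

    public static void main(String[] args) {
        double card = 340229177292566d; // example value quoted in the original report

        // Six significant digits, as in the reported output: the trailing digits are lost
        System.out.println(String.format(Locale.ROOT, "%.5E", card)); // 3.40229E+14

        // Sketch of the new behaviour
        System.out.println(formatIntegerCell(card));               // 340229177292566
        System.out.println(formatIntegerCell(1234567890123450d));  // 1.23456789012345E+15
    }
}
```

The first line printed shows why the six-digit form is a problem for identifiers: the last nine digits of the original value cannot be recovered from it.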

Thank you, [~aeham.abushwashi] for noticing this and opening this issue!

Apologies for my delay. I thought I'd have to modify POI, but I found a way to do this at the Tika level.

> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2025
>                 URL: https://issues.apache.org/jira/browse/TIKA-2025
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.14
>
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the attached spreadsheet as 3.40229E+14, which clearly is not the desired output. 
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.
