Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/02/02 15:48:51 UTC

[jira] [Commented] (TIKA-2025) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

    [ https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850067#comment-15850067 ] 

ASF GitHub Bot commented on TIKA-2025:
--------------------------------------

GitHub user vulpes8 opened a pull request:

    https://github.com/apache/tika/pull/151

    fix for TIKA-2025 contributed by vulpes8

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vulpes8/tika fix/TIKA-2025

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #151
    
----
commit 0c5d609e0175dffb93c1a325c9a872c5e6945eb0
Author: Cataldo Mazzilli <ca...@studiostorti.com>
Date:   2017-02-02T14:32:42Z

    fix for TIKA-2025 contributed by vulpes8

----


> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2025
>                 URL: https://issues.apache.org/jira/browse/TIKA-2025
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.14
>
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 emits that sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the attached spreadsheet as 3.40229E+14, which is clearly not the desired output.
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets: credit card numbers, telephone numbers and product identifiers, to name a few.
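
The symptom described in the issue can be reproduced outside Tika with plain Java. The sketch below is not the code from the pull request and does not use POI. It only assumes that the cell value reaches the formatter as a double, which is how Excel stores numeric cells. The class and method names (LongDigitCellSketch, defaultRendering, plainRendering) are hypothetical and chosen for illustration.

```java
import java.math.BigDecimal;

// Hypothetical illustration only; not Tika's or POI's implementation.
public class LongDigitCellSketch {

    // Default Java rendering of a double. For large magnitudes this
    // switches to exponent notation, similar in spirit to the
    // scientific notation reported in the issue.
    static String defaultRendering(double cellValue) {
        return Double.toString(cellValue);
    }

    // Plain rendering: expand the double to its exact decimal value
    // and print it without an exponent. Suitable for whole numbers
    // that a double represents exactly (magnitude below 2^53).
    static String plainRendering(double cellValue) {
        return new BigDecimal(cellValue).toPlainString();
    }

    public static void main(String[] args) {
        // The credit card number quoted in the issue description.
        double cellValue = 340229177292566d;
        System.out.println(defaultRendering(cellValue)); // exponent form
        System.out.println(plainRendering(cellValue));   // all 15 digits
    }
}
```

The plain rendering keeps every digit here because a 15-digit integer fits exactly in a double. For fractional values the same call would print the full binary expansion, so a real fix would need to decide per cell format when to apply it.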



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)