Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/09/08 06:16:00 UTC

[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

    [ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710 ] 

Tilman Hausherr commented on TIKA-3544:
---------------------------------------

It seems to depend on the value:
{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="extended-properties:AppVersion" content="16.0300"/>
<meta name="protected" content="false"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="meta:last-author" content="Jitin Jindal"/>
<meta name="X-TIKA:digest:SHA256" content="7d1109045508e7fdc0148d9e9e7b16d01ce18ae0794f7381145e23973996c0b6"/>
<meta name="extended-properties:DocSecurityString" content="None"/>
<meta name="resourceName" content="Credit Card Numbers.xlsx"/>
<meta name="dcterms:modified" content="2021-09-07T20:57:34Z"/>
<meta name="Content-Length" content="500481"/>
<meta name="X-TIKA:digest:MD5" content="72c4c6777f1f9144542ddf5a059d2ffa"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<title/>
</head>
<body><div><h1>Payments - Payment Details</h1>
<table><tbody><tr>	<td>Payment Details</td></tr>
<tr>	<td>Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)</td></tr>
<tr>	<td>6,48019534464278E+15</td></tr>
<tr>	<td>30295201231669</td></tr>
<tr>	<td>30082494556063</td></tr>
<tr>	<td>344850003945824</td></tr>
<tr>	<td>3,58338792333363E+15</td></tr>
<tr>	<td>3,58738537059364E+15</td></tr>
<tr/>
</tbody></table>
<p>&amp;"Helvetica,Regular"&amp;12&amp;K000000&amp;P  </p>
<a href="http://www.getcreditcardnumbers.com/">http://www.getcreditcardnumbers.com/</a></div>
</body></html>
{noformat}
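
For illustration only, here is a minimal Java sketch of how a long whole number held as a double can be printed without an exponent. It assumes the affected cells are stored as numeric (double) values, which this thread does not confirm. The class name PlainDigits and the helper toPlainDigits are hypothetical and are not part of Tika or POI. Note that Double.toString uses exponent notation for every value of 10^7 or more, so it does not by itself reproduce the value-dependent output above, where the 14- and 15-digit numbers come out plain and only the 16-digit ones are shown with an exponent.

```java
import java.math.BigDecimal;

public class PlainDigits {

    // Hypothetical helper: render a whole-number double as plain digits.
    // new BigDecimal(double) converts the binary value exactly, and
    // toPlainString() never uses an exponent.
    static String toPlainDigits(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Both values are below 2^53, so a double holds them exactly.
        double sixteenDigits = 6480195344642784d;
        double fourteenDigits = 30295201231669d;

        // Default conversion: exponent notation, since the value is >= 10^7.
        System.out.println(Double.toString(sixteenDigits));

        // Plain digit strings.
        System.out.println(toPlainDigits(sixteenDigits));  // 6480195344642784
        System.out.println(toPlainDigits(fourteenDigits)); // 30295201231669
    }
}
```

This only works for whole numbers that a double can represent exactly. Identifiers longer than about 15 or 16 digits, or ones with leading zeros, would have to be stored as text in the spreadsheet to survive unchanged.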


> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 will emit that sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the attached spreadsheet as 6.480195344642784E15, which clearly is not the desired output.
> I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.


