Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/10/22 13:20:36 UTC

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179809#comment-14179809 ] 

Tim Allison edited comment on TIKA-1442 at 10/22/14 11:20 AM:
--------------------------------------------------------------

[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison file and recommend other statistics that would be useful for file comparison (TIKA-1332) and junk detection (TIKA-1443).

I added the following columns:
language id: language and confidence score
top10words
count of the top 10 words that are English stopwords (based on the list in Lucene's StandardAnalyzer). I need to make this language-specific: if the langid component says "so", we need to count the number of "so" stopwords.
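
To make the last column concrete, here is a minimal sketch of how a language-specific stopword count over the top 10 words might be computed. It is not the code behind the attached spreadsheet. The tokenization (lowercased `\w+` runs), the function names `top_words` and `stopword_count`, and the tiny English stopword set are all assumptions for illustration; a real version would load a full list per language.

```python
import re
from collections import Counter

# Assumption: a small illustrative subset of English stopwords,
# NOT the actual list shipped with Lucene's StandardAnalyzer.
STOPWORDS_BY_LANG = {
    "en": {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
           "if", "in", "is", "it", "of", "on", "or", "the", "to", "with"},
}


def top_words(text, n=10):
    """Return the n most frequent tokens in text, most frequent first."""
    # Assumption: tokens are lowercased runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def stopword_count(words, lang, stopwords_by_lang=STOPWORDS_BY_LANG):
    """Count how many of the given words are stopwords for lang.

    Returns None when no stopword list is available for that language,
    so a missing list is not confused with a count of zero.
    """
    stopwords = stopwords_by_lang.get(lang)
    if stopwords is None:
        return None
    return sum(1 for word in words if word in stopwords)
```

Keying the lists by the language code reported by the langid component lets the same column work for any language that has a list, and returning `None` for a missing list keeps "no stopwords found" distinct from "could not check".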

I renamed some of the column headers.  I finally had a chance to break out Manning and Schütze: "token overlap" is actually the Dice coefficient.
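
For reference, a minimal sketch of the Dice coefficient, assuming it is computed over the sets of unique tokens from each extraction: 2|X ∩ Y| / (|X| + |Y|). Whether the spreadsheet uses unique tokens or token counts is not stated here, so treat the set-based choice, the function name `dice_coefficient`, and the handling of two empty inputs as assumptions.

```python
def dice_coefficient(tokens_a, tokens_b):
    """Dice coefficient over the sets of unique tokens: 2|X & Y| / (|X| + |Y|)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Assumption: two empty token lists are treated as identical.
        return 1.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))
```

Under this set-based assumption, a value of 1.0 means the two outputs share the same vocabulary; token frequencies and token order can still differ between them.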

I added a vlookup column for [~tilman]'s notes. 

I cannot figure out why I'm getting different lang id confidence scores for a given file pair when the Dice coefficient is 1.0.  I need to look into this.

All a work in progress...


> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release.


