Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/10/30 15:22:27 UTC
[jira] [Comment Edited] (TIKA-1443) Add a junk text detector to Tika

    [ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982615#comment-14982615 ] 

Tim Allison edited comment on TIKA-1443 at 10/30/15 2:21 PM:
-------------------------------------------------------------

Doh... so much for that idea...

From Optimaize's [site|https://github.com/optimaize/language-detector]
bq.This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)



was (Author: tallison@mitre.org):
Doh... so much for that idea...

From Optimaize's [site|https://github.com/optimaize/language-detector]
{noformat}
This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)
{noformat}

> Add a junk text detector to Tika
> --------------------------------
>
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted text is junk.  This could be used as a component of TIKA-1332 or as a standalone detector.  See TIKA-1332 for some initial ideas of what statistics we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less "junky" documents or more "junky" documents.  This would also aid in prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set document-specific relevancy rankings or to avoid indexing a document.
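
To make the request above concrete, here is a rough sketch of one possible junk signal: the fraction of code points in the extracted text that look like extraction debris (U+FFFD replacement characters, non-whitespace control characters, private-use or unassigned code points). This is only an illustration and not an existing Tika API. The class name JunkTextSketch, the methods junkScore and isJunk, and the 0.3 threshold are all hypothetical, and the statistics discussed in TIKA-1332 may well be different.

```java
public final class JunkTextSketch {

    // Hypothetical cut-off; a real detector would need tuning against a corpus.
    static final double JUNK_THRESHOLD = 0.3;

    /**
     * Returns the fraction of code points that look like extraction debris.
     * Empty or null input yields 0.0 (no evidence of junk either way).
     */
    static double junkScore(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean debris =
                    cp == 0xFFFD  // Unicode replacement character
                    || (Character.isISOControl(cp) && !Character.isWhitespace(cp))
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED;
            if (debris) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Flags text whose debris fraction exceeds the (assumed) threshold. */
    static boolean isJunk(String text) {
        return junkScore(text) > JUNK_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isJunk("The quick brown fox jumps over the lazy dog."));
        System.out.println(isJunk("\uFFFD\uFFFD\u0001\uFFFD ab \uFFFD\uFFFD"));
    }
}
```

A score like this only catches encoding-level damage. Text that is well-formed Unicode but linguistically meaningless would pass, which is why a language-model or language-detection signal was being considered in the first place.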


