Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/06/26 16:18:24 UTC

[jira] [Commented] (TIKA-1332) Create "eval" code

    [ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682 ] 

Tim Allison commented on TIKA-1332:
-----------------------------------

To my mind, there are three families of things that can go wrong:

1) Parser can fail
    1a) throw an exception
    1b) hang forever

2) Fail to extract text and/or metadata from documents
    2a) nothing is extracted
    2b) some document components or attachments are not extracted: TIKA-1317 and TIKA-1228

3) Extract junk (mojibake, too many spaces in PDFs, failure to add a space between runs in .docx, etc.), in which case there are two options:
      3a) We can do better.
      3b) We can't...the document is just plain broken.

We can easily count and compare 1).  By "easily," I mean that I haven't fully worked it out yet, but it should be fairly straightforward; see the sketch below.
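
Something along these lines might be a starting point for 1a (purely illustrative: the directory walk, the unlimited BodyContentHandler, and the tally key are placeholders, not a proposed design); 1b would need a timeout/watchdog or a forked JVM on top of it:

{code:java}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.*;
import java.util.*;

public class ExceptionTally {
    public static void main(String[] args) throws Exception {
        // Tally "detected content type -> exception class" across a directory of test files.
        Map<String, Integer> counts = new TreeMap<>();
        AutoDetectParser parser = new AutoDetectParser();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path file : files) {
                Metadata metadata = new Metadata();
                try (InputStream is = Files.newInputStream(file)) {
                    // -1 = no write limit; here we only care about whether parsing blows up
                    parser.parse(is, new BodyContentHandler(-1), metadata, new ParseContext());
                } catch (Throwable t) {
                    String type = metadata.get(Metadata.CONTENT_TYPE);
                    String key = (type == null ? "unknown" : type)
                            + " -> " + t.getClass().getSimpleName();
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        counts.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
{code}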

Without a truth set or a comparison parser, we cannot easily measure 2a or 2b.  For 2a, if there is no text, maybe there really is no text (image-only PDFs, or a .docx that contains only images).  For 2b, we're really out of luck without other resources.
  
For 3), there's lots of room for work.  In short, I think we'd want to calculate how "languagey" the extracted text is.  Some indicators that occur to me (a rough sketch of a few of these follows the list):

 a) Type/token ratio or token entropy
 b) Average word length (with an exception for non-whitespace languages)
 c) Ratio of alphanumerics to total string length
 d) Analysis of language id confidence scores...if the string is long enough, you'd expect a langid component to return a very high score for the best language and then far lower scores for the 2nd and 3rd best languages.  If the langid component returns flat scores, then that might be an indicator that something didn't go well.  
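
For a) through c), the counting is cheap.  A rough sketch in plain Java (whitespace tokenization only, so it sidesteps the non-whitespace-language caveat in b; d) would depend on whichever langid component we pick):

{code:java}
import java.util.*;

public class JunkIndicators {

    // Prints indicators a) type/token ratio, b) average word length, and
    // c) ratio of alphanumeric chars to total length for one extracted string.
    public static void report(String extracted) {
        String trimmed = extracted.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Set<String> types = new HashSet<>(Arrays.asList(tokens));

        double typeTokenRatio = tokens.length == 0 ? 0.0 : (double) types.size() / tokens.length;

        long letterChars = 0;
        long alnumChars = 0;
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (Character.isLetter(c)) letterChars++;
            if (Character.isLetterOrDigit(c)) alnumChars++;
        }
        double avgWordLength = tokens.length == 0 ? 0.0 : (double) letterChars / tokens.length;
        double alnumRatio = extracted.isEmpty() ? 0.0 : (double) alnumChars / extracted.length();

        System.out.printf("type/token=%.3f avgWordLen=%.2f alnumRatio=%.3f%n",
                typeTokenRatio, avgWordLength, alnumRatio);
    }
}
{code}

Whatever we compute, the thresholds for flagging a document as junk would probably have to be calibrated per file type and language.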

What do you think?  Are there other things that can go wrong?  What else should we try to measure, in a supervised (not ideal), semi-supervised (better), or unsupervised (best) way?

> Create "eval" code
> ------------------
>
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of exceptions per file type, most common exceptions per file type, number of metadata items, total text extracted, etc).  We should also be able to compare one run against another.  Going forward, there's plenty of room to improve.
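
To make the run-vs-run comparison in the description concrete, here is one possible shape for a per-run summary (class and field names are hypothetical placeholders, not a proposed API):

{code:java}
import java.util.*;

// Hypothetical per-run summary; the eval run would fill these in while parsing the corpus.
public class RunSummary {
    Map<String, Integer> exceptionsByType = new TreeMap<>();
    long totalMetadataItems;
    long totalExtractedChars;

    // Print whatever changed between a baseline run and a candidate run.
    static void compare(RunSummary baseline, RunSummary candidate) {
        Set<String> keys = new TreeSet<>(baseline.exceptionsByType.keySet());
        keys.addAll(candidate.exceptionsByType.keySet());
        for (String key : keys) {
            int before = baseline.exceptionsByType.getOrDefault(key, 0);
            int after = candidate.exceptionsByType.getOrDefault(key, 0);
            if (before != after) {
                System.out.println(key + ": " + before + " -> " + after);
            }
        }
        System.out.println("metadata items: "
                + baseline.totalMetadataItems + " -> " + candidate.totalMetadataItems);
        System.out.println("extracted chars: "
                + baseline.totalExtractedChars + " -> " + candidate.totalExtractedChars);
    }
}
{code}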


