Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/26 15:41:13 UTC

[jira] [Updated] (TIKA-1302) Let's run Tika against a large batch of docs nightly

     [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1302:
------------------------------
    Attachment: wayback_exception_summaries.xlsx

[~anjackson], I'm attaching some summary stats on the exceptions file you posted.  Thank you for sharing.

In these summary stats, I took the literal exception message, and then I also pared it down to the chunk of text before the first ":".  Without the full stacktrace, this will conflate exceptions, but it still might be useful.
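For illustration, the paring-down step can be sketched roughly as below. This is a minimal Python sketch, not the code used to build the attached spreadsheet; the helper names (message_prefix, count_by_prefix) and the sample messages are hypothetical.

```python
from collections import Counter


def message_prefix(message: str) -> str:
    """Return the text before the first ':' (the whole message if there is none)."""
    return message.split(":", 1)[0].strip()


def count_by_prefix(messages):
    """Bin exception messages by their prefix and count each bin."""
    return Counter(message_prefix(m) for m in messages)


# Made-up messages, for illustration only.
sample_messages = [
    "org.apache.tika.exception.TikaException: Unexpected RuntimeException from parser A",
    "org.apache.tika.exception.TikaException: Unexpected RuntimeException from parser B",
    "java.io.IOException: stream closed",
]

prefix_counts = count_by_prefix(sample_messages)
for prefix, count in prefix_counts.most_common():
    print(count, prefix)
```

Note that the two TikaException messages land in the same bin even though they come from different parsers, which is the conflation mentioned above.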

I'm just getting started on the tika-eval code, but one of the things I've run into is that the literal exception message can be problematic if the task is to bin and count exception causes.  What I'm currently doing is truncating the message as I did with your data and then running group by on the full stacktrace.  One limitation of this, though, is that we can't easily compare exceptions across different versions of the software: stacktraces include line numbers, and if a line number shifts between versions, the "group by" outputs no longer match.
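One possible way around the line-number limitation, sketched here as an assumption and not as something tika-eval does, would be to strip the line numbers from each stack frame before grouping. The function name normalize_stacktrace, the regex, and the sample traces are all hypothetical.

```python
import re
from collections import Counter

# Matches the "(File.java:123)" part of a Java stack frame and captures the
# file name. Frames such as "(Native Method)" have no ":<digits>" and are
# left untouched.
_FRAME_LINE_NUMBER = re.compile(r"\(([^():]+):\d+\)")


def normalize_stacktrace(trace: str) -> str:
    """Drop line numbers from stack frames so traces compare across versions."""
    return _FRAME_LINE_NUMBER.sub(r"(\1)", trace)


# Two made-up traces that differ only in a line number, as might happen
# between two versions of the same code.
trace_v1 = (
    "java.lang.NullPointerException\n"
    "\tat org.example.Parser.parse(Parser.java:101)\n"
    "\tat org.example.Main.main(Main.java:12)"
)
trace_v2 = (
    "java.lang.NullPointerException\n"
    "\tat org.example.Parser.parse(Parser.java:105)\n"
    "\tat org.example.Main.main(Main.java:12)"
)

# The "group by" step: both traces now fall into the same bin.
groups = Counter(normalize_stacktrace(t) for t in (trace_v1, trace_v2))
```

The trade-off is that two genuinely different failure points in the same method would then be merged, so this would complement, not replace, grouping on the full stacktrace.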

On another note, govdocs1 gives us very few modern PDFs, (ppt|doc|xls)[xm], RTF, MSG, OpenOffice and multimedia files.  Other Tikis, what other formats do we need?  I might be willing to crawl for docs, but I don't have a good starting point or list of links, and the search engine APIs aren't as generous as they used to be.  So, how recent is your crawl data?  Would you be willing to share a list of links, or is it publicly available?

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>         Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 


