Posted to dev@tika.apache.org by "William Palmer (JIRA)" <ji...@apache.org> on 2014/05/19 11:23:38 UTC

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001532#comment-14001532 ] 

William Palmer commented on TIKA-1302:
--------------------------------------

This one might be worth a look: https://github.com/openplanets/format-corpus. Some of the files there are (intentionally) broken, and some are there as examples of format features (e.g. PDF with password, embedded fonts etc). If the license is not clear enough for any of the files then please raise an issue; I'm sure people will be glad to help.

Unfortunately, I can't share any of the web content I describe using in that blog post.

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)