Posted to dev@tika.apache.org by "Andrew Jackson (JIRA)" <ji...@apache.org> on 2014/11/13 14:43:34 UTC

[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757 ] 

Andrew Jackson edited comment on TIKA-1302 at 11/13/14 1:42 PM:
----------------------------------------------------------------

[~tallison@apache.org] I've created a download folder on our own site, and included a dump of about 1/8th of the SAX errors, here: http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

Looking through the SAX exceptions, they do seem to come from resources that Tika identifies as XML (application/*xml). That is, the exceptions do *not* seem to be coming from malformed HTML, which is consistent with the standard Tika configuration you described above (which I can confirm is what we ran with).

Unfortunately, I can't recover the full stack traces from that run, and because of the way we're doing the indexing it's not clear whether we'll be able to capture them in future runs. We'll look into it and hopefully be able to record the full error from now on. For now, you'll have to re-run the source item through Tika to reproduce the error - sorry about that.
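
As a rough sketch of that re-run step, the snippet below parses a saved copy of a source item and captures the full stack trace of any SAX failure. It deliberately uses the JDK's built-in SAX parser rather than Tika itself, on the assumption that the failures are ordinary XML well-formedness errors, so the exact messages may differ from what Tika reports. The class name SaxCheck and the method firstSaxError are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Tika, just a stand-in for the re-run.
final class SaxCheck {

    /** Returns the full stack trace of the first SAX error, or null if the bytes parse cleanly. */
    public static String firstSaxError(byte[] xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);

        DefaultHandler handler = new DefaultHandler() {
            // Resolve every external entity (e.g. a DTD) to an empty document,
            // so that checking archived files never triggers a network fetch.
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        };

        try {
            factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
            return null;
        } catch (SAXException e) {
            // Keep the whole trace, since that is what was lost in the original run.
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace, true));
            return trace.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        // Each argument is the path of a locally saved source item.
        for (String path : args) {
            String trace = firstSaxError(Files.readAllBytes(Paths.get(path)));
            System.out.println(path + ": " + (trace == null ? "parsed cleanly" : trace));
        }
    }
}
```

Running it as `java SaxCheck item1.xml item2.xml` (the file names are placeholders) prints either "parsed cleanly" or the captured trace for each file.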


> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 


