Posted to dev@tika.apache.org by Lewis John McGibbney <le...@apache.org> on 2018/12/01 01:14:13 UTC

Re: Resource Sharing Tika Corpus with Any23

Hi Tim,
Thanks for the reply... answers inline

On 2018/11/30 19:22:23, Tim Allison <ta...@apache.org> wrote: 
> I think that'd be great.  Some questions:
> 
> 1) Would you use the same input docs that we're using or would you
> need/want a new TB drive for your input/output?  

The same docs, I suspect. We *could* also contribute the documents we use in our test suite:
https://github.com/apache/any23/tree/master/test-resources/src/test/resources
However, this is not really necessary for us to run Any23. Any23 will only attempt extractions on a small subset of the documents in the corpus.

> How much space will
> you need for your eval framework including outputs?

I wouldn't imagine any more than maybe 5GB of disk space in all. Any23 has the ability to run Open Information Extraction (smart relationship extraction from text), and this tends to generate more triples. If we decided to turn this on, then it would probably get towards the 5GB mark. I wouldn't imagine any more than that though, Tim.

> 2) Would you be willing to coordinate with us and PDFBox and POI
> around release times?

I think so, yes. If anything, this would be an excellent thing for Any23. I think improved coordination and communication between the communities would be a very positive step.

> 3) Would you be running your processing every so often (around your
> releases) or would it be constant aside from our releases? 

Most likely the former. I am aware that the service is billed to someone's (your) card, so we would be looking to do only what is polite and acceptable. Running prior to releases, e.g. during review of a release candidate, would be really cool.

>  I ask
> because I'd like @Tobias Ospelt to have cycles for his fuzzing work
> when we're not getting ready for a release.
> 

That sounds fine to me. 
Thank you for the response. 

Re: Resource Sharing Tika Corpus with Any23

Posted by Tim Allison <ta...@apache.org>.
Sorry for my delay. Send me the usernames and email addresses
privately and I'll grant access.  We're coming up on a release cycle.