Posted to user@tika.apache.org by Mark Kerzner <ma...@gmail.com> on 2011/09/07 03:29:29 UTC

Testing Tika

Hi,

As part of testing my FreeEed <http://freeeed.org/> open source eDiscovery
engine, I am processing the 153 Enron PSTs found here
<http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>.

Naturally, I see a lot of errors and warnings. For example, I started with the
error described here <https://issues.apache.org/jira/browse/PDFBOX-1008>.
To fix that, I upgraded PDFBox from 1.5.0 to 1.6.0, since I am
building with Maven from the latest svn checkout anyway.

However, for the future, my question is: is there a more systematic way to
approach this? Is anybody interested in the results of all the testing that
I am doing, and if yes, how should I report my findings?

Thank you,
Mark

Re: Testing Tika

Posted by Julien Nioche <li...@gmail.com>.
Hi Mark

See
http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html
for comments on processing the Enron corpus with Tika. Some of the errors
that you are seeing are probably described there.

Julien

On 7 September 2011 02:29, Mark Kerzner <ma...@gmail.com> wrote:

> Hi,
>
> As part of testing my FreeEed <http://freeeed.org/> open source eDiscovery
> engine, I am processing the 153 Enron PSTs found here
> <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>.
>
> Naturally, I see a lot of errors and warnings. For example, I started with the
> error described here <https://issues.apache.org/jira/browse/PDFBOX-1008>.
> To fix that, I upgraded PDFBox from 1.5.0 to 1.6.0, since I am
> building with Maven from the latest svn checkout anyway.
>
> However, for the future, my question is: is there a more systematic way to
> approach this? Is anybody interested in the results of all the testing that
> I am doing, and if yes, how should I report my findings?
>
> Thank you,
> Mark
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Testing Tika

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi Mike,

My mistake. I thought this discussion was taking place on the dev list, not
the user list.
*Steve*



On Wed, Sep 7, 2011 at 11:30 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Sorry, I don't understand what this output is telling me?
>
> Ie these 5 files are Tika's sources.... but, what's wrong with them?
>
> I thought we were talking about certain emails from the Enron corpus
> where Tika hits an exception or fails to extract text...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Sep 7, 2011 at 1:04 PM, Steve Aulenbach <sa...@neoninc.org>
> wrote:
> > Hi Mike,
> > Here you go. I ran a quick analysis on revision 1166216 and saw the
> > following:
> >
> > Analysis Summary:
> >
> > Files: 510
> >
> > *** Warning *** File(s) Not Found 5:
> >
> > /tika-parsers/src/main/java/org/apache/tika/detect/ContainerAwareDetector.java
> >
> > /tika-parsers/src/main/java/org/apache/tika/detect/POIFSContainerDetector.java
> >
> > /tika-parsers/src/main/java/org/apache/tika/detect/ZipContainerDetector.java
> >
> > /tika-parsers/src/test/java/org/apache/tika/parser/chm/TestUtils.java
> >
> > /tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.chm.TestUtils.xml
> >
> > Thanks,
> > Steve
> >
> >
> > On Wed, Sep 7, 2011 at 6:29 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
> >> wrote:
> >>
> >> > Is anybody interested in the results of all the testing that
> >> > I am doing, and if yes, how should I report my findings?
> >>
> >> I'm interested!  This sounds great....
> >>
> >> Tika should strive to have no errors on any valid documents... so if
> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> >> characterize them, open issues, and get them fixed :)
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >
> >
>

Re: Testing Tika

Posted by Michael McCandless <lu...@mikemccandless.com>.
Sorry, I don't understand what this output is telling me.

I.e., these 5 files are Tika's sources, but what's wrong with them?

I thought we were talking about certain emails from the Enron corpus
where Tika hits an exception or fails to extract text...

Mike McCandless

http://blog.mikemccandless.com

On Wed, Sep 7, 2011 at 1:04 PM, Steve Aulenbach <sa...@neoninc.org> wrote:
> Hi Mike,
> Here you go. I ran a quick analysis on revision 1166216 and saw the
> following:
>
> Analysis Summary:
>
> Files: 510
>
> *** Warning *** File(s) Not Found 5:
>
> /tika-parsers/src/main/java/org/apache/tika/detect/ContainerAwareDetector.java
>
> /tika-parsers/src/main/java/org/apache/tika/detect/POIFSContainerDetector.java
>
> /tika-parsers/src/main/java/org/apache/tika/detect/ZipContainerDetector.java
>
> /tika-parsers/src/test/java/org/apache/tika/parser/chm/TestUtils.java
>
> /tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.chm.TestUtils.xml
>
> Thanks,
> Steve
>
>
> On Wed, Sep 7, 2011 at 6:29 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
>> wrote:
>>
>> > Is anybody interested in the results of all the testing that
>> > I am doing, and if yes, how should I report my findings?
>>
>> I'm interested!  This sounds great....
>>
>> Tika should strive to have no errors on any valid documents... so if
>> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
>> characterize them, open issues, and get them fixed :)
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>
>

Re: Testing Tika

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi Mike,

Here you go. I ran a quick analysis on revision 1166216 and saw the
following:

Analysis Summary:

Files: 510

*** Warning *** File(s) Not Found 5:

/tika-parsers/src/main/java/org/apache/tika/detect/ContainerAwareDetector.java

/tika-parsers/src/main/java/org/apache/tika/detect/POIFSContainerDetector.java

/tika-parsers/src/main/java/org/apache/tika/detect/ZipContainerDetector.java

/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestUtils.java

/tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.chm.TestUtils.xml

Thanks,
Steve



On Wed, Sep 7, 2011 at 6:29 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
>
> > Is anybody interested in the results of all the testing that
> > I am doing, and if yes, how should I report my findings?
>
> I'm interested!  This sounds great....
>
> Tika should strive to have no errors on any valid documents... so if
> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> characterize them, open issues, and get them fixed :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.
I get it from this site, http://www.edrm.net/resources/data-sets, where it
is much more complete. You can check there.

On Sat, Sep 17, 2011 at 2:08 AM, Albretch Mueller <lb...@gmail.com> wrote:

>  from a corpus analysis point of view, who owns this data?, how do we
> know it is the real thing?
> ~
>  I don't see any validation data by Enron Email Dataset
> (http://www.cs.cmu.edu/~enron/)
> ~
>  lbrtchx
>
> On 9/15/11, Mark Kerzner <ma...@gmail.com> wrote:
> > Mike,
> >
> > I certainly will do it. I am refactoring the code before I run those
> > tests again.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> That summary is nice, but, can you provide specifics on which docs
> >> caused problems for Tika?
> >>
> >> Ie, if a certain doc hits an exception, we should open a Jira issue
> >> and get it fixed...
> >>
> >> Thanks,
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <ma...@gmail.com>
> >> wrote:
> >> > The processing is complete, the summary found here.
> >> > Mark
> >> >
> >> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
> >> > <lu...@mikemccandless.com> wrote:
> >> >>
> >> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Is anybody interested in the results of all the testing that
> >> >> > I am doing, and if yes, how should I report my findings?
> >> >>
> >> >> I'm interested!  This sounds great....
> >> >>
> >> >> Tika should strive to have no errors on any valid documents... so if
> >> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> >> >> characterize them, open issues, and get them fixed :)
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >
> >> >
> >>
> >
>

Re: Testing Tika

Posted by Albretch Mueller <lb...@gmail.com>.
 From a corpus analysis point of view, who owns this data? How do we
know it is the real thing?
~
 I don't see any validation data for the Enron Email Dataset
(http://www.cs.cmu.edu/~enron/)
~
 lbrtchx

On 9/15/11, Mark Kerzner <ma...@gmail.com> wrote:
> Mike,
>
> I certainly will do it. I am refactoring the code before I run those tests
> again.
>
> Sincerely,
> Mark
>
> On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> That summary is nice, but, can you provide specifics on which docs
>> caused problems for Tika?
>>
>> Ie, if a certain doc hits an exception, we should open a Jira issue
>> and get it fixed...
>>
>> Thanks,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <ma...@gmail.com>
>> wrote:
>> > The processing is complete, the summary found here.
>> > Mark
>> >
>> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
>> >> wrote:
>> >>
>> >> > Is anybody interested in the results of all the testing that
>> >> > I am doing, and if yes, how should I report my findings?
>> >>
>> >> I'm interested!  This sounds great....
>> >>
>> >> Tika should strive to have no errors on any valid documents... so if
>> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
>> >> characterize them, open issues, and get them fixed :)
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >
>> >
>>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.
Mike,

I certainly will do it. I am refactoring the code before I run those tests
again.

Sincerely,
Mark

On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> That summary is nice, but, can you provide specifics on which docs
> caused problems for Tika?
>
> Ie, if a certain doc hits an exception, we should open a Jira issue
> and get it fixed...
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > The processing is complete, the summary found here.
> > Mark
> >
> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
> >> wrote:
> >>
> >> > Is anybody interested in the results of all the testing that
> >> > I am doing, and if yes, how should I report my findings?
> >>
> >> I'm interested!  This sounds great....
> >>
> >> Tika should strive to have no errors on any valid documents... so if
> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> >> characterize them, open issues, and get them fixed :)
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >
> >
>

Re: Testing Tika

Posted by Michael McCandless <lu...@mikemccandless.com>.
That summary is nice, but can you provide specifics on which docs
caused problems for Tika?

I.e., if a certain doc hits an exception, we should open a Jira issue
and get it fixed.

Thanks,

Mike McCandless

http://blog.mikemccandless.com
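
One way such per-document specifics could be collected is sketched below. This is only an illustration under stated assumptions: `FailureLog`, `Extractor`, and `Failure` are hypothetical names, and `Extractor` merely stands in for whatever call actually parses a document (for example a Tika parse); none of them are part of Tika's API or of FreeEed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness: records which documents fail, and how. */
final class FailureLog {

    /** Stand-in for the real extraction call; an assumption, not a Tika interface. */
    public interface Extractor {
        String extract(String docId) throws Exception;
    }

    /** One row per failing document. */
    public static final class Failure {
        public final String docId;
        public final String exceptionClass;
        public final String message;

        Failure(String docId, Throwable t) {
            this.docId = docId;
            this.exceptionClass = t.getClass().getName();
            this.message = String.valueOf(t.getMessage());
        }

        /** Tab-separated line that can be pasted into a report or an issue. */
        @Override
        public String toString() {
            return docId + "\t" + exceptionClass + "\t" + message;
        }
    }

    /** Runs the extractor over every document and keeps going after a failure. */
    public static List<Failure> run(List<String> docIds, Extractor extractor) {
        List<Failure> failures = new ArrayList<>();
        for (String docId : docIds) {
            try {
                extractor.extract(docId);
            } catch (Exception e) {
                failures.add(new Failure(docId, e));
            }
        }
        return failures;
    }
}
```

Catching the exception per document rather than around the whole loop means one bad file does not hide the failures that come after it, and each recorded row names the document that would need to be attached to an issue.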

On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <ma...@gmail.com> wrote:
> The processing is complete, the summary found here.
> Mark
>
> On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
>> wrote:
>>
>> > Is anybody interested in the results of all the testing that
>> > I am doing, and if yes, how should I report my findings?
>>
>> I'm interested!  This sounds great....
>>
>> Tika should strive to have no errors on any valid documents... so if
>> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
>> characterize them, open issues, and get them fixed :)
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.
The processing is complete; the summary can be found here
<http://shmsoft.blogspot.com/2011/09/freeeed-used-to-process-complete-enron.html>.

Mark

On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
>
> > Is anybody interested in the results of all the testing that
> > I am doing, and if yes, how should I report my findings?
>
> I'm interested!  This sounds great....
>
> Tika should strive to have no errors on any valid documents... so if
> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> characterize them, open issues, and get them fixed :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: Testing Tika

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <ma...@gmail.com> wrote:

> Is anybody interested in the results of all the testing that
> I am doing, and if yes, how should I report my findings?

I'm interested!  This sounds great....

Tika should strive to have no errors on any valid documents... so if
you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
characterize them, open issues, and get them fixed :)

Mike McCandless

http://blog.mikemccandless.com
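
A rough sketch of what "characterizing" failures before opening issues might look like: group the failing documents by a signature so that each distinct problem is reported once. The signature used here (exception class plus top stack frame) is an assumption chosen for illustration, and `FailureGroups` is a hypothetical helper, not something Tika, POI, or PDFBox provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: buckets failing documents by failure signature. */
final class FailureGroups {

    /** Signature = exception class plus the top stack frame, when there is one. */
    static String signature(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        String where = frames.length > 0
                ? frames[0].getClassName() + "." + frames[0].getMethodName()
                : "(no stack trace)";
        return t.getClass().getName() + " at " + where;
    }

    /** Maps each signature to the documents that produced it, keys in sorted order. */
    static Map<String, List<String>> group(Map<String, Throwable> failuresByDoc) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, Throwable> e : failuresByDoc.entrySet()) {
            groups.computeIfAbsent(signature(e.getValue()), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }
}
```

Each key in the resulting map is a candidate for one issue, with the listed documents as reproduction cases. Two unrelated bugs can share a top frame, so the grouping is a starting point for triage rather than a definitive classification.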