Posted to dev@poi.apache.org by Dominik Stadler <do...@gmx.at> on 2016/01/13 20:08:30 UTC

Using CommonCrawl for POI regression-mass-testing

Hi,

FYI, I am playing with CommonCrawl data for some talk that I plan to do in
2016. As part of this I built a small framework that lets me run the POI
integration-test framework on a large number of documents that I extracted
from a number of CommonCrawl runs. This is somewhat similar to what Tim is
doing for Tika, but it focuses on POI-related documents.

I tried to use this as a huge regression check: in this case I compared
releases 3.13 and 3.14-beta1. In the future I can fairly easily run this
against newer versions to check for any new regressions.


Some statistics:

* Overall I processed 829356 POI-related documents

* 687506 documents did process fine in both versions!

* 140699 documents caused parsing errors in both versions. Many of these
are actually invalid documents, wrong file-types, incorrect mime-types, ...
so the actual error rate would be much lower, but it is currently not
overly useful to look at these errors without first sorting out all the
false positives.

* 845 documents failed in POI 3.13 and now work in 3.14-beta1, so we made
more documents succeed now, yay!

* And finally 306 documents did fail in POI-3.14-beta1 while they processed
fine with POI-3.13.


However these potential regressions have the following causes:

** approx. 280 of these were caused because we do more checks for HSLF now
** 19 were OOMs that happen in my framework with large documents due to
parallel processing
** One document fails date-parsing where I don't see how it worked
before; maybe this is also caused by more testing now
** 5 documents failed due to the new support for multi-part formats and
locale IDs
** One document showed an NPE in HSLFTextParagraph

So only the last two look like actual regressions; I will commit fixes
together with reproducing files for these two shortly.



I store the results in a database, so I can query them in
various ways:

E.g., attached is the list of the top 100 exception messages for the failed
files.
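Such a "top N exception messages" aggregation can be sketched in plain Java as well; the messages, names, and class below are invented for illustration (in the real setup this is a grouping query against the database):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the "top N exception messages" report: group the
// recorded messages, count them, and keep the most frequent ones.
public class TopExceptions {
    public static List<Map.Entry<String, Long>> topMessages(List<String> messages, int n) {
        return messages.stream()
                .collect(Collectors.groupingBy(m -> m, Collectors.counting()))
                .entrySet().stream()
                // most frequent message first
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> msgs = Arrays.asList(
                "Unable to read entire header", "Invalid header signature",
                "Invalid header signature", "Unexpected record type");
        System.out.println(topMessages(msgs, 2));
    }
}
```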

Let me know if you would like to get a full stacktrace and document for any
of those or if you have suggestions for additional queries/checks that we
could add here!

Dominik.

Re: Using CommonCrawl for POI regression-mass-testing

Posted by Andreas Beeker <ki...@apache.org>.
Hi Dominik,

I'd like to have the X/HSLF files, i.e. a few from the group of ~280 and the one with the NPE.

Thank you for your efforts!

Andi.




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Using CommonCrawl for POI regression-mass-testing

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Hi,

>>wow, nice slides! 
Thank you.

>>I am not working as sophisticated as you on this one, but rather  focused on finding regressions and catastrophic failures of POI for now, 
As part of tika-batch, which you can now run through tika-app as: java -jar tika-app.jar <input_directory> <output_directory>, there's multithreaded, robust running of Tika that will restart the child process on OOMs and permanent hangs. If you haven't already reinvented this wheel, any interest in using tika-batch?  I'm more than happy to show you how to upgrade versions of POI and then run them via tika-batch. I can also write a POI-only parser that you could run through tika-batch without involving our wrappers around POI; this would give you the control that you want.
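The restart-on-failure idea described here can be sketched with a plain ProcessBuilder loop. This is only a conceptual illustration, not tika-batch's actual implementation; the retry limit and the decision rule are invented:

```java
import java.io.IOException;

// Conceptual sketch (not tika-batch itself): run a worker in a child JVM and
// restart it when it dies abnormally (e.g. OOM), up to a retry limit.
public class RobustRunner {
    // Pure decision helper: restart on any non-zero exit while retries remain.
    static boolean shouldRestart(int exitCode, int attempt, int maxAttempts) {
        return exitCode != 0 && attempt < maxAttempts;
    }

    static void runWithRestarts(String[] command, int maxAttempts)
            throws IOException, InterruptedException {
        int attempt = 0;
        while (true) {
            attempt++;
            Process child = new ProcessBuilder(command).inheritIO().start();
            int exit = child.waitFor();
            if (!shouldRestart(exit, attempt, maxAttempts)) {
                break; // success, or retries exhausted
            }
            System.err.println("child exited with " + exit + ", restarting");
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // e.g. a "java -jar worker.jar ..." command line (hypothetical)
            runWithRestarts(args, 3);
        }
    }
}
```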



>>because the large number of failures is hard to sort into actual failures and other things. I think one of the next steps will be to filter out the obvious cases,  >>wrong mime-types and HTML-pages, which seem to be quite common to see if I can get down the list of actual failures to a more manageable size.

Y, that's my next step: to run Tika and/or "file" against the corpus.  I've done this already here and there, but I'd like to have a centralized table.  For PDFs and a few file types within POI, I've created lists of files (e.g. pdf_files.txt), and I can tell tika-batch to process only the files in that list and then tell tika-eval to process only those files.

As for mime-types, IIRC, you're using the file suffix in the URL?  If you process the full WARC instead of your far more efficient/faster/smarter method, you can retrieve the mime type that the server reported for the file.

>>I did not know about the 1MB limit in CommonCrawl, but again for the current regression testing this is not a big issue, the files will likely simply fail in both versions of POI. 
Right, I completely agree.  At least with PDFs, there are two common exceptions that are triggered by truncated files.  For MSAccess files, there was one exception that was always triggered by truncated files. 


>>It might become interesting later on, but one could try to re-download the file from the original source if it is possible to detect that it was cut and it is still available at the original URL...
Y, that's in my plans as well.  For PDFs, I simply moved every file that was ~1MB into a "probably_truncated" folder, but there were a number of others that were truncated not by CommonCrawl's limit but because of the vagaries of crawling.

As for whether it is possible to detect truncation, the ~1MB is a good hint, the handful of exceptions that are commonly thrown for truncation is another...and, if you're processing the full WARC, there is a flag for "truncated", which Common Crawl sets when it recognizes that a file has been truncated.
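A cheap size-based check along these lines could look as follows; the exact cutoff and tolerance values are assumptions for illustration, so in practice it would be combined with the known exception patterns and, when processing full WARCs, the truncated flag:

```java
// Heuristic sketch for flagging probably-truncated crawl payloads: files whose
// size is suspiciously close to the ~1MB cutoff. The exact limit and the
// tolerance are assumed values for illustration, not taken from Common Crawl.
public class TruncationCheck {
    static final long LIMIT = 1_000_000L;   // assumed ~1MB cutoff
    static final long TOLERANCE = 16_384L;  // assumed slack around the cutoff

    static boolean probablyTruncated(long sizeInBytes) {
        return Math.abs(sizeInBytes - LIMIT) <= TOLERANCE;
    }

    public static void main(String[] args) {
        System.out.println(probablyTruncated(999_500));  // near the cutoff: true
        System.out.println(probablyTruncated(120_000));  // clearly below: false
    }
}
```

Files matching the heuristic would go into a "probably_truncated" folder for re-downloading rather than being counted as parser failures.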




>>I thought about possible code pieces to share; part of my code is located in the project https://github.com/centic9/CommonCrawlDocumentDownload, which I enhanced to also download from newer crawls, not only the "old index" from a few years back.

Very cool.  I'll take a look.


>>The processing with POI and populating of the database is done in a separate project which I did not publish (yet), again the handling is done in multiple steps, first actually running against POI, writing results to a JSON-file again. And then writing the results to the database in a second step. This makes it possible to "fix" database writing without the very lengthy processing (more than 12 hours for the 180G worth of POI-relevant files on my laptop).

Y, completely agree about the need for a db.  Out of curiosity, which are you using?  I've had great luck with h2, but things started getting slow in the eval code once the DB processes more than 2 million files on my current hardware. There are probably parameters I should configure better and/or I should redesign the structure.

If you have any interest in using/contributing to tika-eval, I'll be happy to document it on our wiki.  I would be thrilled to get your feedback (and patches :) ).




Re: Using CommonCrawl for POI regression-mass-testing

Posted by Dominik Stadler <do...@gmx.at>.
Hi,

wow, nice slides! I am not working as sophisticatedly as you on this one,
but am rather focused on finding regressions and catastrophic failures of
POI for now, because the large number of failures is hard to sort into
actual failures and other things. I think one of the next steps will be to
filter out the obvious cases, i.e. wrong mime-types and HTML pages, which
seem to be quite common, to see if I can get the list of actual failures
down to a more manageable size.
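One low-tech way to pre-filter such false positives, without pulling in a full detector, is sniffing the first bytes of each file. The two signatures below are the standard OLE2 and ZIP magics; everything else about this sketch (names, the HTML check) is an illustrative assumption:

```java
import java.util.Arrays;

// Sketch: classify a file's first bytes so obvious non-POI content (e.g. HTML
// pages served with an office mime-type) can be filtered out before parsing.
public class MagicSniffer {
    // OLE2 / Compound Document signature (HSSF, HWPF, HSLF, ...)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static String sniff(byte[] head) {
        if (head.length >= 8 && Arrays.equals(Arrays.copyOf(head, 8), OLE2)) {
            return "ole2";   // classic binary Office formats
        }
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "zip";    // OOXML (and any other zip) starts like this
        }
        String start = new String(head).trim().toLowerCase();
        if (start.startsWith("<!doctype") || start.startsWith("<html")) {
            return "html";   // mislabeled web page, skip it
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff("<html><body>hi</body></html>".getBytes())); // prints "html"
    }
}
```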

I did not know about the 1MB limit in CommonCrawl, but again, for the
current regression testing this is not a big issue; the files will likely
simply fail in both versions of POI. It might become interesting later on,
but one could try to re-download a file from the original source if it is
possible to detect that it was cut off and it is still available at the
original URL...

I thought about possible code pieces to share; part of my code is located
in the project https://github.com/centic9/CommonCrawlDocumentDownload,
which I enhanced to also download from newer crawls, not only the "old
index" from a few years back.

It's split into a multi-step process: first retrieving the list of URLs and
their position in the crawl as a large JSON file, then using that
information to actually download the files.
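The first step, getting the list of URLs from the crawl, goes through Common Crawl's CDX index server. The sketch below only builds such a query URL; the index name is an example and the parameter set reflects the public CDX API, so see the linked project for the real logic:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of step 1: building a query against Common Crawl's CDX index server
// to list captures matching a URL pattern as JSON lines. The index name is an
// example; each crawl publishes its own (e.g. "CC-MAIN-2015-48-index").
public class CdxQuery {
    static String buildQuery(String indexName, String urlPattern)
            throws UnsupportedEncodingException {
        return "https://index.commoncrawl.org/" + indexName
                + "?url=" + URLEncoder.encode(urlPattern, "UTF-8")
                + "&output=json"; // one JSON record per matching capture
    }

    public static void main(String[] args) throws Exception {
        // e.g. all .xls files on example.com
        System.out.println(buildQuery("CC-MAIN-2015-48-index", "example.com/*.xls"));
    }
}
```

Each returned record carries the WARC filename, offset, and length, which is exactly the "position in the crawl" needed for the download step.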

The processing with POI and populating of the database is done in a
separate project which I did not publish (yet); the handling is again done
in multiple steps: first actually running against POI and writing results
to a JSON file, then writing the results to the database in a second step.
This makes it possible to "fix" database writing without repeating the very
lengthy processing (more than 12 hours for the 180G worth of POI-relevant
files on my laptop).
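That two-step handling could be sketched like this; the record format and all names are invented for illustration (the real, unpublished project writes JSON and then a database):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the two-step result handling: step 1 records one line per
// processed document, step 2 re-reads those lines for loading elsewhere.
// The tab-separated format here stands in for the real JSON output.
public class ResultPipeline {
    // Step 1: append one "status<TAB>file<TAB>message" line per document.
    static void record(Path results, String status, String file, String message)
            throws IOException {
        String line = status + "\t" + file + "\t" + message + System.lineSeparator();
        Files.write(results, line.getBytes(), StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
    }

    // Step 2: parse the results back; in the real setup this feeds the DB
    // insert, which can be re-run without repeating the 12h POI processing.
    static List<String[]> load(Path results) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(results)) {
            rows.add(line.split("\t", 3));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("poi-results", ".tsv");
        record(tmp, "OK", "book1.xls", "");
        record(tmp, "FAIL", "deck.ppt", "NPE in HSLFTextParagraph");
        System.out.println(load(tmp).size() + " rows");
        Files.deleteIfExists(tmp);
    }
}
```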

Dominik.



RE: Using CommonCrawl for POI regression-mass-testing

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Sweet!  Please feel free to make any use that you can out of [0].

Y, I’m storing results in a db as well (h2) and using that to dump reports along the lines of [1]…note I’m using POI to generate xlsx files now ☺.

Is there any way we could collaborate on the eval code?  My active dev (when I have a chance) is on the TIKA-1302 branch of my Tika github fork.  The goal is to eventually contribute that as a tika-eval module.

If you wanted access to our vm, I’d be more than happy to grant access so we can collaborate on the corpus and the eval stuff.

Oh, as for Common Crawl, as you already know, in addition to the incorrect mime types, etc., one of the big things to be aware of is that they truncate their files at 1MB, which is a big problem for file formats that tend to be bigger than that.  Are you pulling only non-truncated files?

Again, this is fantastic!  What can we share/collaborate on?

Cheers,

           Tim


[0] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] https://issues.apache.org/jira/secure/attachment/12782054/reports_pdfbox_1_8_11-rc1.zip
