Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2022/04/27 12:00:01 UTC

preliminary regression results from 2.4.0

The preliminary regression results for 2.4.0 are here:
https://corpora.tika.apache.org/base/reports/tika-2.4.0-reports.tgz

We have some new exceptions caused by the new http parser; in many of
these cases the files are truncated or malformed.  I view this as a
good thing.

Some files are now newly identified as dgn7 and dgn8.

Many more files that were identified as tika-ooxml or tika ole are now
being identified as the more specific xlsx, docx, etc., which is good.

The ppt that Tilman identified is a new exception in 2.4.0 as well,
and we need to fix that.

Once we fix the ppt issue, I'll rerun the regression tests.  Please
let me know if you see anything else.

Best,

            Tim

Re: preliminary regression results from 2.4.0

Posted by Tim Allison <ta...@apache.org>.
Y, I think this is an improvement because it was identified as xhtml
by the earlier version of Tika, and it is now correctly being parsed
by the rfc822 parser...and y, it is broken.

There were a number of other files that are now correctly identified
as http-response, but we're getting less text because the files are
truncated and the http-response parser is throwing an exception.

On Wed, Apr 27, 2022 at 2:59 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> On 27.04.2022 at 14:00, Tim Allison wrote:
> > Once we fix the ppt issue, I'll rerun the regression tests.  Please
> > let me know if you see anything else.
>
> commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM
>
> has lost a bit of extracted text, but that "mail" is broken.
>
> Tilman
>

Re: preliminary regression results from 2.4.0

Posted by Tilman Hausherr <TH...@t-online.de>.
On 27.04.2022 at 14:00, Tim Allison wrote:
> Once we fix the ppt issue, I'll rerun the regression tests.  Please
> let me know if you see anything else.

commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM

has lost a bit of extracted text, but that "mail" is broken.

Tilman