Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/03/05 00:25:45 UTC

1.26?

All,
  I ran the regression tests on a million docs from our corpus.  The
results are here:
https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
  Ready to start the release process?  Anything we want to backport
into 1.26?  Other modifications?

         Cheers,

             Tim

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
Will take a look.  Thank you!

On Tue, Mar 23, 2021 at 2:58 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> On 23.03.2021 at 17:31, Tim Allison wrote:
> > Reports are available here:
> > https://corpora.tika.apache.org/base/reports/1_25_v_1_26.tgz
>
>
> govdocs1/966/966679.pdf
>
> claims to have 360 attachments more than last time. I don't see a single
> attachment, and when I run tika-app with "--extract" I get nothing???
>
>
> There are also some NPEs for BMP files; this seems to be a Java bug.
>
>
> Tilman
>

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
I bumped the maximum recursion depth recently.  When I temporarily
reverted it to a max depth of 10, I got 653 attachments, which matches
neither 1.25 nor 1.26-SNAPSHOT, but is smaller than both.
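
As a rough illustration of why a recursion-depth cap changes the
attachment count, here is a toy model. It is only a sketch, not Tika's
actual code: the class DepthLimitSketch, the Node type, and the methods
countEmbedded and chain are all hypothetical names made up for this
example.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthLimitSketch {

    // Hypothetical stand-in for a document that may contain embedded documents.
    public static final class Node {
        final List<Node> children = new ArrayList<>();

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Counts embedded documents below 'node'. Children are not visited
    // once 'depth' has reached 'maxDepth', so deeper items go uncounted.
    public static int countEmbedded(Node node, int depth, int maxDepth) {
        if (depth >= maxDepth) {
            return 0;
        }
        int count = 0;
        for (Node child : node.children) {
            count += 1 + countEmbedded(child, depth + 1, maxDepth);
        }
        return count;
    }

    // Builds a container with 'levels' documents nested one inside the other.
    public static Node chain(int levels) {
        Node root = new Node();
        Node current = root;
        for (int i = 0; i < levels; i++) {
            Node next = new Node();
            current.add(next);
            current = next;
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = chain(5);
        // Same file, two different caps, two different counts.
        System.out.println("cap 3:  " + countEmbedded(root, 0, 3));
        System.out.println("cap 10: " + countEmbedded(root, 0, 10));
    }
}
```

With five levels of nesting, a cap of 3 reports three embedded documents
and a cap of 10 reports all five, so the same file can yield different
counts purely from a change in the cap.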

On Tue, Mar 23, 2021 at 3:51 PM Tim Allison <ta...@apache.org> wrote:
>
> The govdocs file has 1290 MACRO (javascript) "attachments" with Tika
> 1.26-SNAPSHOT and 930 with Tika 1.25.  I have no idea why there are
> more macros in the more recent version of Tika, but there are
> "attachments" broadly speaking.
>
> I'll look into the NPEs.  If those are a Java bug, I don't think those
> are a blocker.
>
> Still working on the open office document issues...
> LIBRE_OFFICE-45041-0.ods is showing some weird behavior.
>
> On Tue, Mar 23, 2021 at 2:58 PM Tilman Hausherr <TH...@t-online.de> wrote:
> >
> > On 23.03.2021 at 17:31, Tim Allison wrote:
> > > Reports are available here:
> > > https://corpora.tika.apache.org/base/reports/1_25_v_1_26.tgz
> >
> >
> > govdocs1/966/966679.pdf
> >
> > claims to have 360 attachments more than last time. I don't see a single
> > attachment, and when I run tika-app with "--extract" I get nothing???
> >
> >
> > There are also some NPEs for BMP files; this seems to be a Java bug.
> >
> >
> > Tilman
> >

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
The govdocs file has 1290 MACRO (javascript) "attachments" with Tika
1.26-SNAPSHOT and 930 with Tika 1.25.  I have no idea why there are
more macros in the more recent version of Tika, but there are
"attachments" broadly speaking.

I'll look into the NPEs.  If those are a Java bug, I don't think those
are a blocker.

Still working on the open office document issues...
LIBRE_OFFICE-45041-0.ods is showing some weird behavior.

On Tue, Mar 23, 2021 at 2:58 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> On 23.03.2021 at 17:31, Tim Allison wrote:
> > Reports are available here:
> > https://corpora.tika.apache.org/base/reports/1_25_v_1_26.tgz
>
>
> govdocs1/966/966679.pdf
>
> claims to have 360 attachments more than last time. I don't see a single
> attachment, and when I run tika-app with "--extract" I get nothing???
>
>
> There are also some NPEs for BMP files; this seems to be a Java bug.
>
>
> Tilman
>

Re: 1.26?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 23.03.2021 at 17:31, Tim Allison wrote:
> Reports are available here:
> https://corpora.tika.apache.org/base/reports/1_25_v_1_26.tgz


govdocs1/966/966679.pdf

claims to have 360 attachments more than last time. I don't see a single 
attachment, and when I run tika-app with "--extract" I get nothing???


There are also some NPEs for BMP files; this seems to be a Java bug.


Tilman


Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
Reports are available here:
https://corpora.tika.apache.org/base/reports/1_25_v_1_26.tgz

I haven't looked carefully yet, but it looks like we need a tweak to
TIKA-3325...there are a couple of handfuls of new "potential zip bomb"
exceptions.

Will look deeper...

On Mon, Mar 22, 2021 at 2:19 PM Tim Allison <ta...@apache.org> wrote:
>
> Unless there are objections, I'll start the regression tests for 1.26-rc1.
>
> Thank you, Tilman, for the help on TIKA-3332.  I look forward to
> seeing the diffs!
>
> Cheers,
>
>          Tim
>
> On Fri, Mar 19, 2021 at 10:45 AM Tim Allison <ta...@apache.org> wrote:
> >
> > All,
> >
> > Will rerun regression tests against the 1M sample on Monday (Eastern
> > US) and respin 1.26-rc1 if I don't find any surprises.  Let me know if
> > there are any other blockers on 1.26 or other features you want to add
> > or cherrypick from main.
> >
> > Cheers,
> >
> >    Tim
> >
> > On Tue, Mar 9, 2021 at 6:30 AM Tim Allison <ta...@apache.org> wrote:
> > >
> > > Wait...looks like the next version of PDFBox is coming out very soon.
> > > Let's hold 1.26 until that is out.
> > >
> > > On Tue, Mar 9, 2021 at 5:53 AM Tim Allison <ta...@apache.org> wrote:
> > > >
> > > > Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...
> > > >
> > > > On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
> > > > >
> > > > > All,
> > > > >   James Ahlborn modified jackcess-crypt for us and made a new release.
> > > > > I've made the upgrade in our branch_1x, and I ran a comparison of the
> > > > > msaccess files we have in our corpus:
> > > > > https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
> > > > >   No surprises if we upgrade to the latest Jackcess.
> > > > >   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
> > > > >
> > > > >             Cheers,
> > > > >
> > > > >                      Tim
> > > > >
> > > > > On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> > > > > >
> > > > > > All,
> > > > > >   I ran the regression tests on a million docs from our corpus.  The
> > > > > > results are here:
> > > > > > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> > > > > >   Ready to start the release process?  Anything we want to backport
> > > > > > into 1.26?  Other modifications?
> > > > > >
> > > > > >          Cheers,
> > > > > >
> > > > > >              Tim

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
Unless there are objections, I'll start the regression tests for 1.26-rc1.

Thank you, Tilman, for the help on TIKA-3332.  I look forward to
seeing the diffs!

Cheers,

         Tim

On Fri, Mar 19, 2021 at 10:45 AM Tim Allison <ta...@apache.org> wrote:
>
> All,
>
> Will rerun regression tests against the 1M sample on Monday (Eastern
> US) and respin 1.26-rc1 if I don't find any surprises.  Let me know if
> there are any other blockers on 1.26 or other features you want to add
> or cherrypick from main.
>
> Cheers,
>
>    Tim
>
> On Tue, Mar 9, 2021 at 6:30 AM Tim Allison <ta...@apache.org> wrote:
> >
> > Wait...looks like the next version of PDFBox is coming out very soon.
> > Let's hold 1.26 until that is out.
> >
> > On Tue, Mar 9, 2021 at 5:53 AM Tim Allison <ta...@apache.org> wrote:
> > >
> > > Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...
> > >
> > > On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
> > > >
> > > > All,
> > > >   James Ahlborn modified jackcess-crypt for us and made a new release.
> > > > I've made the upgrade in our branch_1x, and I ran a comparison of the
> > > > msaccess files we have in our corpus:
> > > > https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
> > > >   No surprises if we upgrade to the latest Jackcess.
> > > >   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
> > > >
> > > >             Cheers,
> > > >
> > > >                      Tim
> > > >
> > > > On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> > > > >
> > > > > All,
> > > > >   I ran the regression tests on a million docs from our corpus.  The
> > > > > results are here:
> > > > > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> > > > >   Ready to start the release process?  Anything we want to backport
> > > > > into 1.26?  Other modifications?
> > > > >
> > > > >          Cheers,
> > > > >
> > > > >              Tim

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
All,

Will rerun regression tests against the 1M sample on Monday (Eastern
US) and respin 1.26-rc1 if I don't find any surprises.  Let me know if
there are any other blockers on 1.26 or other features you want to add
or cherrypick from main.

Cheers,

   Tim

On Tue, Mar 9, 2021 at 6:30 AM Tim Allison <ta...@apache.org> wrote:
>
> Wait...looks like the next version of PDFBox is coming out very soon.
> Let's hold 1.26 until that is out.
>
> On Tue, Mar 9, 2021 at 5:53 AM Tim Allison <ta...@apache.org> wrote:
> >
> > Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...
> >
> > On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
> > >
> > > All,
> > >   James Ahlborn modified jackcess-crypt for us and made a new release.
> > > I've made the upgrade in our branch_1x, and I ran a comparison of the
> > > msaccess files we have in our corpus:
> > > https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
> > >   No surprises if we upgrade to the latest Jackcess.
> > >   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
> > >
> > >             Cheers,
> > >
> > >                      Tim
> > >
> > > On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> > > >
> > > > All,
> > > >   I ran the regression tests on a million docs from our corpus.  The
> > > > results are here:
> > > > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> > > >   Ready to start the release process?  Anything we want to backport
> > > > into 1.26?  Other modifications?
> > > >
> > > >          Cheers,
> > > >
> > > >              Tim

Re: 1.26?

Posted by Konstantin Gribov <gr...@gmail.com>.
+1, a new PDFBox release is a fairly good reason to wait a bit.

-- 
Best regards,
Konstantin Gribov.


On Tue, Mar 9, 2021 at 2:30 PM Tim Allison <ta...@apache.org> wrote:

> Wait...looks like the next version of PDFBox is coming out very soon.
> Let's hold 1.26 until that is out.
>
> On Tue, Mar 9, 2021 at 5:53 AM Tim Allison <ta...@apache.org> wrote:
> >
> > Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...
> >
> > On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
> > >
> > > All,
> > >   James Ahlborn modified jackcess-crypt for us and made a new release.
> > > I've made the upgrade in our branch_1x, and I ran a comparison of the
> > > msaccess files we have in our corpus:
> > > https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
> > >   No surprises if we upgrade to the latest Jackcess.
> > >   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
> > >
> > >             Cheers,
> > >
> > >                      Tim
> > >
> > > > On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> > > >
> > > > All,
> > > >   I ran the regression tests on a million docs from our corpus.  The
> > > > results are here:
> > > > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> > > >   Ready to start the release process?  Anything we want to backport
> > > > into 1.26?  Other modifications?
> > > >
> > > >          Cheers,
> > > >
> > > >              Tim
>

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
Wait...looks like the next version of PDFBox is coming out very soon.
Let's hold 1.26 until that is out.

On Tue, Mar 9, 2021 at 5:53 AM Tim Allison <ta...@apache.org> wrote:
>
> Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...
>
> On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
> >
> > All,
> >   James Ahlborn modified jackcess-crypt for us and made a new release.
> > I've made the upgrade in our branch_1x, and I ran a comparison of the
> > msaccess files we have in our corpus:
> > https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
> >   No surprises if we upgrade to the latest Jackcess.
> >   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
> >
> >             Cheers,
> >
> >                      Tim
> >
> > On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> > >
> > > All,
> > >   I ran the regression tests on a million docs from our corpus.  The
> > > results are here:
> > > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> > >   Ready to start the release process?  Anything we want to backport
> > > into 1.26?  Other modifications?
> > >
> > >          Cheers,
> > >
> > >              Tim

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
Or Tuesday, it turns out.  1.26-rc1 should be ready fairly soon...

On Fri, Mar 5, 2021 at 1:46 PM Tim Allison <ta...@apache.org> wrote:
>
> All,
>   James Ahlborn modified jackcess-crypt for us and made a new release.
> I've made the upgrade in our branch_1x, and I ran a comparison of the
> msaccess files we have in our corpus:
> https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
>   No surprises if we upgrade to the latest Jackcess.
>   Unless there are objections, I'll roll a 1.26-rc1 on Monday.
>
>             Cheers,
>
>                      Tim
>
> On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
> >
> > All,
> >   I ran the regression tests on a million docs from our corpus.  The
> > results are here:
> > https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
> >   Ready to start the release process?  Anything we want to backport
> > into 1.26?  Other modifications?
> >
> >          Cheers,
> >
> >              Tim

Re: 1.26?

Posted by Tim Allison <ta...@apache.org>.
All,
  James Ahlborn modified jackcess-crypt for us and made a new release.
I've made the upgrade in our branch_1x, and I ran a comparison of the
msaccess files we have in our corpus:
https://corpora.tika.apache.org/base/reports/tika_1_25_v_1_26_msaccess_reports.tgz
  No surprises if we upgrade to the latest Jackcess.
  Unless there are objections, I'll roll a 1.26-rc1 on Monday.

            Cheers,

                     Tim

On Thu, Mar 4, 2021 at 7:25 PM Tim Allison <ta...@apache.org> wrote:
>
> All,
>   I ran the regression tests on a million docs from our corpus.  The
> results are here:
> https://corpora.tika.apache.org/base/reports/tika-1-25-v-1-26.tgz
>   Ready to start the release process?  Anything we want to backport
> into 1.26?  Other modifications?
>
>          Cheers,
>
>              Tim