Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2022/04/26 11:07:04 UTC

1.28.2 regression results

Reports are here:
https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz

I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
think both are related to the underlying parsers being stricter (which
is good), but we need to change our code to handle these cases more
robustly.

Let me know if you see anything else.

Re: 1.28.2 regression results

Posted by Tilman Hausherr <TH...@t-online.de>.
> Let me know if you see anything else.

The JDK 11 and JDK 17 builds fail because of a dependency convergence error.
I don't know whether this really matters, i.e. would the JDK 8 build still
be OK for people using Tika on JDK 11 and 17?

Tilman


Re: 1.28.2 regression results

Posted by Tilman Hausherr <TH...@t-online.de>.
On 26.04.2022 at 21:45, Tim Allison wrote:
> I should clarify that I fixed the two regressions that I had
> identified in the release candidate.  The regression results that I
> shared were run with 1.x before those fixes.

Ah ok, but then the tests should be run again after the fixes in case 
something got broken by the fix (it happened in the pdfbox project).  If 
nothing got broken, then there's still the satisfaction of having very 
small result files :-)

Also suspicious:

bug_trackers/TIKA/TIKA-2215-0.ppt


Tilman



Re: 1.28.2 regression results

Posted by Tim Allison <ta...@apache.org>.
I should clarify that I fixed the two regressions that I had
identified in the release candidate.  The regression results that I
shared were run with 1.x before those fixes.

Still, let's fix the dependency convergence, and please let me know if
there's anything else you find in the regression reports!


Re: 1.28.2 regression results

Posted by Tim Allison <ta...@apache.org>.
Hi Tilman,

  Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
related to TIKA-3734.  The updated junrar (7.5.0) is swallowing a
(new) exception on this file and stopping the parse without throwing
an exception.  The earlier version of junrar (7.4.1) did not find a
problem with the file.
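
A minimal sketch of the failure mode described above, using hypothetical names only (none of these classes or methods are junrar or Tika APIs). It contrasts a reader that swallows an exception and stops early with one that propagates it. The swallowing variant returns a shorter result with no exception recorded, which is presumably why a file like this would surface as a content diff rather than as a new exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SwallowedExceptionSketch {

    /** Hypothetical stand-in for a "corrupt entry" error; not a junrar class. */
    static class CorruptEntryException extends Exception {
        CorruptEntryException(String message) {
            super(message);
        }
    }

    /** Hypothetical entry reader: fails on any entry whose name starts with "bad". */
    static String readEntry(String name) throws CorruptEntryException {
        if (name.startsWith("bad")) {
            throw new CorruptEntryException("cannot read " + name);
        }
        return "text of " + name;
    }

    /** Swallows the failure and stops: the caller gets a short result and no error. */
    static List<String> parseSwallowing(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            try {
                out.add(readEntry(name));
            } catch (CorruptEntryException e) {
                break; // the exception is lost here; nothing reaches the caller
            }
        }
        return out;
    }

    /** Propagates the failure: the caller can record an exception for the file. */
    static List<String> parsePropagating(List<String> names) throws CorruptEntryException {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            out.add(readEntry(name));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "bad.bin", "c.txt");
        System.out.println("swallowing: " + parseSwallowing(names).size() + " entries read");
        try {
            parsePropagating(names);
        } catch (CorruptEntryException e) {
            System.out.println("propagating: " + e.getMessage());
        }
    }
}
```

In the swallowing variant the entries after the bad one are never read, so the extracted content shrinks without any exception being counted for the file.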

  My Ubuntu package util throws an exception on this file, and I think
the file is just kind of wonky.

  I'm going to fix the dependency convergence issues.  Is there anything else?

      Best,

                 Tim

On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> On 26.04.2022 at 13:07, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> >
> > I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> > think both are related to the underlying parsers being stricter (which
> > is good), but we need to change our code to handle these cases more
> > robustly.
> >
> > Let me know if you see anything else.
>
> What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4? This is
> also a rar file and the last entry in content_diffs_no_exceptions.xlsx.
> Is that related to TIKA-3734?
>
> Tilman
>

Re: 1.28.2 regression results

Posted by Tilman Hausherr <TH...@t-online.de>.
On 26.04.2022 at 13:07, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
>
> I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> think both are related to the underlying parsers being stricter (which
> is good), but we need to change our code to handle these cases more
> robustly.
>
> Let me know if you see anything else.

What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4? This is
also a rar file and the last entry in content_diffs_no_exceptions.xlsx.
Is that related to TIKA-3734?

Tilman