Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2023/01/19 15:15:45 UTC

next release?

All,
  I'm thinking we should cut a release in the next week or so.  I can
start the regression tests next week (possibly late in the week).  I
think that the changes move us into the "minor" version update, so
2.7.0.
  WDYT?  Are there any imminent releases of our dependencies that we
should wait for?  Anything else we'd want to get into the next
release?
  Thank you!

     Best,

             Tim

Re: next release?

Posted by Tim Allison <ta...@apache.org>.
Yes.  I've looked at a few others, and this is exactly what's going on.

On Wed, Feb 1, 2023 at 2:20 PM Tim Allison <ta...@apache.org> wrote:
>
> Hi Tilman,
>
>   Thank you for raising this.  I noticed this, looked at a few and
> then failed to document what this diff means.  Sorry!
>
>    You're right that this has to do with TIKA-3962.  The issue is that
> we are now correctly handling attachments within emls as attachments
> rather than inlining the contents.  So this means that the main email
> will have less content, but the content still should show up in an
> attachment.  The challenge from an eval perspective is that there is
> no attachment in 2.6.0 to which to map the new attachment in
> 2.7.0-prerc1.
>
>   I attached the json output for 2.6.0 and 2.7.0 on
> https://issues.apache.org/jira/browse/TIKA-3962.  It looks, btw, like
> we fixed TIKA-2680 while we were at it. :D
>
>   I'm going to look at a few more files.  If I find any problems, I'll
> cancel the vote.
>
> Thank you, again.
>
>            Best,
>
>                      Tim

Re: next release?

Posted by Tim Allison <ta...@apache.org>.
Hi Tilman,

  Thank you for raising this.  I noticed this, looked at a few and
then failed to document what this diff means.  Sorry!

   You're right that this has to do with TIKA-3962.  The issue is that
we are now correctly handling attachments within emls as attachments
rather than inlining the contents.  So this means that the main email
will have less content, but the content still should show up in an
attachment.  The challenge from an eval perspective is that there is
no attachment in 2.6.0 to which to map the new attachment in
2.7.0-prerc1.
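
To make the mapping problem concrete, here is a minimal sketch in Python.  It is not tika-eval code: the record layout (a list of per-document dicts, main email first), the content key, and the sample text are all assumptions made for illustration.  It shows how a container-only comparison reports lost words even though a comparison over the container plus its attachments reports none.

```python
from collections import Counter
import re

# Assumed name of the field holding extracted text in each record.
CONTENT_KEY = "X-TIKA:content"


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def container_counts(records):
    """Token counts for the first record only (the main email)."""
    return Counter(tokens(records[0].get(CONTENT_KEY, "")))


def all_counts(records):
    """Token counts over the main email plus every attachment record."""
    total = Counter()
    for rec in records:
        total.update(tokens(rec.get(CONTENT_KEY, "")))
    return total


# Hypothetical extracts of the same .eml under the two versions.
old = [{CONTENT_KEY: "see the attached report quarterly revenue figures"}]
new = [
    {CONTENT_KEY: "see the attached report"},    # main email
    {CONTENT_KEY: "quarterly revenue figures"},  # now a separate attachment
]

# Counter subtraction keeps only tokens that are more frequent on the left.
missing_in_container = container_counts(old) - container_counts(new)
missing_overall = all_counts(old) - all_counts(new)
```

Here `missing_in_container` holds the three attachment words, while `missing_overall` is empty, which is the pattern I'd expect for these files.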

  I attached the json output for 2.6.0 and 2.7.0 on
https://issues.apache.org/jira/browse/TIKA-3962.  It looks, btw, like
we fixed TIKA-2680 while we were at it. :D

  I'm going to look at a few more files.  If I find any problems, I'll
cancel the vote.

Thank you, again.

           Best,

                     Tim

On Tue, Jan 31, 2023 at 10:46 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> There is a block of "message/rfc822" files where TOP_10_MORE_IN_A has
> meaningful words, but TOP_10_MORE_IN_B is empty:
>
> bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip-1.mbox
> bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip
> bug_trackers/MOZILLA/153480-240296/MOZILLA-207156-4.zip-1.mbox
> commoncrawl3/V7/V73N7J3RSMYSQ7N5SEWKUOUCTSTJQCZM
> commoncrawl3/XD/XD7LX2GJWA7GZTCPKC3XYPJ5WYHWMCW2
> bug_trackers/TIKA/TIKA-2680-1.eml
> commoncrawl3/FH/FHAPPENOGJUVCIBTEFHDVYKXJAYEE77O
> govdocs1/446/446030.tmp
> govdocs1/330/330112.tmp
> govdocs1/994/994741.tmp
> bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> bug_trackers/MOZILLA/1240554-1312466/MOZILLA-1261295-0.zip
> bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
>
>
> I tested the one from https://issues.apache.org/jira/browse/TIKA-2680
> and I did get text results, so I'm wondering what the problem is, or if
> there is any problem at all. Or is this related to the changes in
> TIKA-3962?
>
> Tilman

Re: next release?

Posted by Tilman Hausherr <TH...@t-online.de>.
There is a block of "message/rfc822" files where TOP_10_MORE_IN_A has 
meaningful words, but TOP_10_MORE_IN_B is empty:

bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip-1.mbox
bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip
bug_trackers/MOZILLA/153480-240296/MOZILLA-207156-4.zip-1.mbox
commoncrawl3/V7/V73N7J3RSMYSQ7N5SEWKUOUCTSTJQCZM
commoncrawl3/XD/XD7LX2GJWA7GZTCPKC3XYPJ5WYHWMCW2
bug_trackers/TIKA/TIKA-2680-1.eml
commoncrawl3/FH/FHAPPENOGJUVCIBTEFHDVYKXJAYEE77O
govdocs1/446/446030.tmp
govdocs1/330/330112.tmp
govdocs1/994/994741.tmp
bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
bug_trackers/MOZILLA/1240554-1312466/MOZILLA-1261295-0.zip
bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW


I tested the one from https://issues.apache.org/jira/browse/TIKA-2680
and I did get text results, so I'm wondering what the problem is, or if
there is any problem at all. Or is this related to the changes in
TIKA-3962?
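
For anyone reading the reports, this is roughly how I understand the two columns, as a small Python sketch.  It is my reading of the column names and not the actual tika-eval implementation; the function name `top_more` and the sample strings are made up.

```python
from collections import Counter
import re


def top_more(text_a, text_b, n=10):
    """Return up to n (token, surplus) pairs for tokens that occur more
    often in text_a than in text_b, largest surplus first."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    surplus = a - b  # keeps only tokens with a positive difference
    return surplus.most_common(n)


# Hypothetical extracts: A still has the text, B has lost most of it.
extract_a = "build failed build failed again"
extract_b = "build"

more_in_a = top_more(extract_a, extract_b)
more_in_b = top_more(extract_b, extract_a)
```

With these inputs `more_in_a` is non-empty and `more_in_b` is an empty list, which matches what the listed files show.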

Tilman

On 31.01.2023 18:40, Tim Allison wrote:
> The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
> https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz
>
> Some observations:
> * Many fewer "common words" in svg files because they are now
> correctly identified as svg+xml files and getting parsed by the xml
> parser.  We're no longer treating these as text and including all the
> tags.  There are a couple of handfuls of "now svg" files that are
> causing exceptions in the xml parser. Overall, I think this diff from
> 2.6.0 is good.
> * Our change in the charset detector has some improvements and some
> regressions.  Overall, I still think we made the right call.
> * Surprisingly, I don't see many diffs in the number of attachments in
> rfc822 files.  I thought there would be more.
>
> I'll start the release process now.  Please do take a look and let me
> know if you see any issues.  I'm happy to respin an rc2 if necessary.
>
> Thank you, all!
>
> Cheers,
>
>              Tim


Re: next release?

Posted by Tim Allison <ta...@apache.org>.
The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz

Some observations:
* Many fewer "common words" in svg files because they are now
correctly identified as svg+xml files and getting parsed by the xml
parser.  We're no longer treating these as text and including all the
tags.  There are a couple of handfuls of "now svg" files that are
causing exceptions in the xml parser. Overall, I think this diff from
2.6.0 is good.
* Our change in the charset detector has some improvements and some
regressions.  Overall, I still think we made the right call.
* Surprisingly, I don't see many diffs in the number of attachments in
rfc822 files.  I thought there would be more.
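
To illustrate the svg point, here is a rough Python sketch.  It uses the Python standard library as a stand-in and is not Tika's text or xml parser; the sample svg and the function names are invented for the example.  Treating the file as text counts the tag and attribute names as words, while parsing it as xml keeps only the text nodes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical svg file content.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="40">
  <title>Sales chart</title>
  <rect x="0" y="0" width="120" height="40" fill="white"/>
  <text x="10" y="25">Total revenue</text>
</svg>"""


def words_as_plain_text(doc):
    """Tokenize the raw markup, as if the file were plain text."""
    return re.findall(r"[a-z]+", doc.lower())


def words_as_xml(doc):
    """Parse as xml and tokenize only the text nodes."""
    root = ET.fromstring(doc)
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return re.findall(r"[a-z]+", text.lower())


plain_words = words_as_plain_text(SVG)
xml_words = words_as_xml(SVG)
```

`xml_words` contains just the four words from the title and text elements, whereas `plain_words` also picks up names such as `rect`, `width` and `fill`, hence the drop in "common words".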

I'll start the release process now.  Please do take a look and let me
know if you see any issues.  I'm happy to respin an rc2 if necessary.

Thank you, all!

Cheers,

            Tim

On Mon, Jan 30, 2023 at 11:14 AM Tim Allison <ta...@apache.org> wrote:
>
> All,
>   After I fix TIKA-3962, I'll start the regression tests in
> preparation for a 2.7.0 release.  Please let me know if there are any
> blockers or if you're working on something that you want to get into
> the next release.
>   Thank you!
>
>      Best,
>
>          Tim

Re: next release?

Posted by Tim Allison <ta...@apache.org>.
All,
  After I fix TIKA-3962, I'll start the regression tests in
preparation for a 2.7.0 release.  Please let me know if there are any
blockers or if you're working on something that you want to get into
the next release.
  Thank you!

     Best,

         Tim

On Thu, Jan 19, 2023 at 10:15 AM Tim Allison <ta...@apache.org> wrote:
>
> All,
>   I'm thinking we should cut a release in the next week or so.  I can
> start the regression tests next week (possibly late in the week).  I
> think that the changes move us into the "minor" version update, so
> 2.7.0.
>   WDYT?  Are there any imminent releases of our dependencies that we
> should wait for?  Anything else we'd want to get into the next
> release?
>   Thank you!
>
>      Best,
>
>              Tim