Posted to dev@tika.apache.org by lo...@gmail.com on 2018/12/01 00:39:31 UTC

Re: 1.20?

Hi,
On Wed, 21 Nov 2018 at 13:00, Tim Allison <ta...@apache.org> wrote:

> Dave,
>   Should I try to get the Docker plugin working again?
>

That would be great. I think I may have gone down the wrong path building
an image at package time, as there doesn't seem to be an easy way to
publish it under an Apache-labelled org on Docker Hub unless it builds from
source.

I have some time over the weekend, so I could update it to where I got to and
see what you think.

Cheers,
Dave

Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
Thank you, again, Luís Filipe Nassif!  There's no point in having
reports unless we pay attention to them :P.  I reverted junrar to
where it was in 1.19.1. I also reverted jackcess based on the reports.

All,
  On the theory that it isn't a great idea to push to production on a
Friday, I'm going to let the recent changes rest over the weekend.
I'll rerun some tests on a subset of the regression corpus on Monday
and then roll rc1.  If anyone wants to kick the tires on the recent
version changes, including parsers that depend on the upgraded guava,
that'd be great!

Onward!

Cheers,

           Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison <ta...@apache.org> wrote:
>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <ta...@apache.org> wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <lf...@gmail.com> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in the likely-broken folder, but it seems Tika was able to extract some text from
>> > them before. Do you know whether those files are really broken, and why Tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > On Thu, 13 Dec 2018 at 13:02, Tim Allison <ta...@apache.org> wrote:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was inserting (incorrectly, IMHO) -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <ta...@apache.org> wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM <lo...@gmail.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <ta...@apache.org> wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have gone down the wrong path building
>> > > > > an image at package time, as there doesn't seem to be an easy way to
>> > > > > publish it under an Apache-labelled org on Docker Hub unless it builds from
>> > > > > source.
>> > > > >
>> > > > > I have some time over the weekend, so I could update it to where I got to and
>> > > > > see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >

Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
Reports on mp4s, junrar, msaccess and a random subset of the
regression corpus are available here:
http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz



Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
Let me actually take a look before answering. Sorry!


Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
 Thank you for reading the reports!!!

The files are very likely broken.  I can take a look.  The change was
probably because of an "upgrade" to junrar.  Should I revert to the
version we used in 1.19.1?

Re: 1.20?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files
in the likely-broken folder, but it seems Tika was able to extract some text from
them before. Do you know whether those files are really broken, and why Tika
extracted text from them before?

Thank you,
Luis


Re: 1.20?

Posted by Chris Mattmann <ma...@apache.org>.
Roll forward! Yay!


Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".
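
A minimal sketch of that check, assuming the report rows have been loaded as
dictionaries keyed by the column names quoted above. The "file" key and the
sample values are hypothetical, made up only to illustrate the comparison;
they are not taken from the actual reports.

```python
def looks_like_duplication(row):
    """True when both extracts share the same vocabulary size but differ in
    total token count, which is the pattern described for ppt->ppt."""
    same_vocab = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    different_volume = row["num_tokens_a"] != row["num_tokens_b"]
    return same_vocab and different_volume


# Hypothetical rows; real values would come from the report tables.
rows = [
    {"file": "deck1.ppt", "num_unique_tokens_a": 120, "num_unique_tokens_b": 120,
     "num_tokens_a": 900, "num_tokens_b": 450},
    {"file": "deck2.ppt", "num_unique_tokens_a": 80, "num_unique_tokens_b": 75,
     "num_tokens_a": 300, "num_tokens_b": 280},
    {"file": "deck3.ppt", "num_unique_tokens_a": 50, "num_unique_tokens_b": 50,
     "num_tokens_a": 200, "num_tokens_b": 200},
]

flagged = [r["file"] for r in rows if looks_like_duplication(r)]
print(flagged)
```

Rows where the unique-token counts also differ (deck2.ppt here) are the ones
that would still need a closer look, since repetition alone cannot explain them.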

I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was inserting (incorrectly, IMHO) -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2
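
As a rough illustration, a cell in the "token: count | token: count" form
shown above can be split into a dictionary so the style tokens are easy to
total. The helper names and the list of style words are assumptions made for
this sketch; only the example string comes from the report value quoted above.

```python
def parse_token_diffs(cell):
    """Parse 'token: count | token: count' into a {token: count} dict."""
    counts = {}
    for entry in cell.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon so tokens such as '0,0,0' stay intact.
        token, _, count = entry.rpartition(":")
        counts[token.strip()] = int(count)
    return counts


# Assumed list of CSS-like words; colour triplets are detected separately.
STYLE_WORDS = {"rgb", "color", "font", "background", "bold", "weight", "family"}


def is_style_token(token):
    """True for a known style word or a comma-separated run of digits."""
    return token in STYLE_WORDS or all(p.isdigit() for p in token.split(","))


cell = ("rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | "
        "147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2")
counts = parse_token_diffs(cell)
style_total = sum(n for tok, n in counts.items() if is_style_token(tok))
print(style_total, "of", sum(counts.values()))  # prints "60 of 60"
```

For this example every one of the top ten tokens is style markup, which is
consistent with the "lost" content being markup rather than document text.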

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.

Re: 1.20?

Posted by Tim Allison <ta...@apache.org>.
Any blockers on 1.20?  I'm going to kick off the regression tests shortly.