Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/23 15:07:23 UTC

release Tika 1.10?

All,
  With the fix of TIKA-1690, I think it makes sense to roll a new release (1.10) in the next week or so.  I'd like to get TIKA-1667 (upgrade poi) in before the release.  Are there any other blockers on 1.10?

       Cheers,

                Tim


Re: release Tika 1.10?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I could definitely roll it next week, Tim. +1 from me.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: release Tika 1.10?

Posted by David Meikle <lo...@gmail.com>.
Hey,
> On 28 Jul 2015, at 19:08, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape for 1.10...from my perspective

Been running some tests locally on a private set I have and it is looking good here too.

Will start rolling this today!

Cheers,
Dave

Re: release Tika 1.10?

Posted by Oleg Tikhonov <ol...@apache.org>.
Thanks!
+1

BR,
Oleg

On Tue, Aug 4, 2015 at 5:37 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> +1

Re: release Tika 1.10?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1


RE: release Tika 1.10?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Just finished the run against ~2.8 million docs (4.8 million including attachments) from a combination of govdocs1 and Common Crawl.  I compared 1.9 with trunk.

Most looks good.

Some highlights:
* Thanks to Andrew Jackson and TIKA-1678, we're now getting better metadata out of ~1300 of the 550k PDFs. This appears to be far more common in Common Crawl PDFs than in govdocs1 PDFs.
* No significant changes found in the handful of msg files...I wanted to check after the work on TIKA-1238.
* Thanks to Andreas Beeker and TIKA-1046/POI 54332, there are far fewer PPT exceptions.
* There are a few more files in Common Crawl that are now incorrectly identified as RFC rather than text (TIKA-1602), but this is a tiny handful (4 documents in total across CC and govdocs1).

A regret:
This run used the digesting parser for both container and embedded files.  This causes some truncated (=corrupt) package files to throw an exception before they otherwise would.  The opposite happens, too (more embedded files when using the digester), but this is extremely rare. As a result, many more truncated gz, x-xz and x-archive files yield fewer attachments in Tika 1.10-SNAPSHOT than in Tika 1.9.
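
One way to picture why a digest step can surface truncation earlier is the sketch below. It is not Tika's DigestingParser: it uses only java.util.zip and java.security.MessageDigest, the class and method names are hypothetical, and it assumes the early exception comes from the digester having to consume the entire stream before anything is passed on to a parser. A lenient reader keeps whatever it decompressed before the stream broke, while the digest-first reader ends up with nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; none of these names come from Tika.
final class TruncatedDigestSketch {

    /** Builds about 250 KB of mildly compressible text. */
    static byte[] samplePayload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Gzips a payload in memory. */
    static byte[] gzip(byte[] payload) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Lenient reader: counts the bytes decompressed before the stream breaks. */
    static int readLeniently(byte[] gzBytes) {
        int total = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            // Truncated input: keep whatever was read so far.
        }
        return total;
    }

    /**
     * Digest-first reader: the whole stream must be consumed to finish the
     * digest, so a truncated stream fails before any content is handed on.
     * Returns -1 in that case (an assumption made for this sketch).
     */
    static int readDigestFirst(byte[] gzBytes) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int total = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
                total += n;
            }
        } catch (IOException e) {
            return -1; // digest cannot be completed, nothing reaches the parser
        }
        md.digest();
        return total;
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload();
        byte[] whole = gzip(payload);
        byte[] cut = Arrays.copyOf(whole, whole.length / 2); // simulate truncation
        System.out.println("payload bytes:           " + payload.length);
        System.out.println("lenient, intact gz:      " + readLeniently(whole));
        System.out.println("lenient, truncated gz:   " + readLeniently(cut));
        System.out.println("digest-first, truncated: " + readDigestFirst(cut));
    }
}
```

On an intact file both readers see the full payload; only the truncated input separates them, which is consistent with the attachment-count difference described above showing up for corrupt package files.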

With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape for 1.10...from my perspective.

             Best,

                       Tim

RE: release Tika 1.10?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I just kicked off a run against our current CommonCrawl slice and govdocs1.  I should have time to see if there are any surprises this evening or early tomorrow.  So, Tuesday night UK time would be great.  Thank you!

-----Original Message-----
From: David Meikle [mailto:loompa@gmail.com] 
Sent: Sunday, July 26, 2015 10:50 AM
To: dev@tika.apache.org
Subject: Re: release Tika 1.10?


> On 23 Jul 2015, at 14:07, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
>  With the fix of TIKA-1690, I think it makes sense to roll a new release (1.10) in the next week or so.  I'd like to get TIKA-1667 (upgrade poi) in before the release.  Are there any other blockers on 1.10?

+1 from me too.  As discussed on private, I will roll the release on Tuesday night (UK Time) to give people time to shout for other candidates.

Cheers,
Dave

Re: release Tika 1.10?

Posted by David Meikle <lo...@gmail.com>.
> On 23 Jul 2015, at 14:07, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
>  With the fix of TIKA-1690, I think it makes sense to roll a new release (1.10) in the next week or so.  I'd like to get TIKA-1667 (upgrade poi) in before the release.  Are there any other blockers on 1.10?

+1 from me too.  As discussed on private, I will roll the release on Tuesday night (UK Time) to give people time to shout for other candidates.

Cheers,
Dave