Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/07/28 06:22:02 UTC
[VOTE] Apache Tika 1.6 release candidate #1
Hi Folks,
A candidate for the Tika 1.6 release is available at:
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.6/
The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.
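For anyone checking the archive locally, here is a minimal Java sketch of the SHA-1 comparison. The class name Sha1Check and its two command-line arguments are illustrative assumptions and are not part of the Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Returns the lowercase hex SHA-1 digest of everything read from the stream. */
    public static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (whatever name it was saved under)
        // args[1]: expected checksum, e.g. the value quoted in the vote mail
        String expected = args[1].trim().toLowerCase();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println((actual.equals(expected) ? "OK " : "MISMATCH ") + actual);
        }
    }
}
```

Run it with the path of the downloaded archive as the first argument and the checksum from this mail as the second.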
A Maven staging repository is available at:
https://repository.apache.org/content/repositories/orgapachetika-1003/
Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.
[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because...
Thank you!
Cheers,
Chris
P.S. Here is my +1!
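
The pass condition stated above can be sketched as a small check. This is one reading of the rule (at least three binding +1 votes, and more binding +1 than binding -1, with +0 votes not counted); the class and method names are hypothetical.

```java
public class VoteTally {

    /**
     * Returns true when at least three binding +1 votes were cast and they
     * outnumber the binding -1 votes. +0 votes are deliberately not an input.
     */
    public static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        System.out.println(passes(3, 0)); // three +1, no -1
        System.out.println(passes(2, 0)); // too few +1 votes
        System.out.println(passes(3, 3)); // no majority
    }
}
```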
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Nick Burch <ni...@apache.org>.
Another quick thought on the artifacts in
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ - as well as
needing to ditch original-tika-app.jar, shouldn't we have the Tika
Server standalone jar in there too as another released + easily
downloadable jar?
Thanks
Nick
On 28/07/14 05:22, Mattmann, Chris A (3980) wrote:
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
RE: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Allison, Timothy B. wrote:
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
> at java.lang.String.checkBounds(String.java:371)
> at java.lang.String.<init>(String.java:415)
> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
Any chance you could raise a POI bug for this? We're probably going to do
the next POI beta release within a week, so if you hurry it might even get
fixed in that... :)
Nick
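
The top frames of the quoted trace are consistent with a negative (or overflowing) length or offset reaching the java.lang.String constructor, which is the kind of value String.checkBounds rejects. Below is a minimal sketch of that failure mode with a defensive check, assuming a hypothetical 4-byte little-endian length prefix. It is not POI's Ole10Native code, and the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedString {

    /**
     * Reads a 4-byte little-endian length at offset, then that many
     * ISO-8859-1 bytes. Lengths that are negative or run past the end of
     * the buffer are rejected before the String constructor is called.
     */
    public static String read(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            throw new IllegalArgumentException("no room for a length field at offset " + offset);
        }
        int len = ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        int start = offset + 4;
        if (len < 0 || len > data.length - start) {
            // Without this guard, new String(data, start, len, ...) would throw
            // StringIndexOutOfBoundsException for such a length.
            throw new IllegalArgumentException("implausible string length " + len);
        }
        return new String(data, start, len, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] ok = {3, 0, 0, 0, 'a', 'b', 'c'};
        System.out.println(read(ok, 0));

        // A corrupt length field holding the value seen in the stack trace.
        byte[] bad = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(-369073454).array();
        try {
            read(bad, 0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing early with a descriptive message makes a corrupt embedded object easier to diagnose than a bounds error from deep inside String.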
RE: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Nick,
Just to be clear -- that wasn't a veiled complaint that you hadn't cut the 3.11-beta! I really just have not had a chance to start the run with my local build of poi-trunk.
Thank you, as always!
Best,
Tim
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: Thursday, July 31, 2014 3:06 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.6 release candidate #1
On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
> On a related note, I did some digging on the one regression I found in
> the pptx, and that will be solved if we wait for POI 3.11 beta 1. I
> haven't yet had a chance to rerun on the random sample with the updated
> POI...
I'm currently on a train to France, but fingers crossed I'll be able to
upload the POI 3.11 beta 1 artifacts for you to test with before I run out
of English mobile phone signal...
Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
> On a related note, I did some digging on the one regression I found in
> the pptx, and that will be solved if we wait for POI 3.11 beta 1. I
> haven't yet had a chance to rerun on the random sample with the updated
> POI...
I'm currently on a train to France, but fingers crossed I'll be able to
upload the POI 3.11 beta 1 artifacts for you to test with before I run out
of English mobile phone signal...
Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,
On a related note, I did some digging on the one regression I found in the pptx, and that will be solved if we wait for POI 3.11 beta 1. I haven't yet had a chance to rerun on the random sample with the updated POI...
Best,
Tim
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Thursday, July 31, 2014 2:30 PM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
Guys, based on all the comments here, I am going to roll another
RC #2 to address:
- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates
I'll roll another RC #2 probably on Monday.
Thanks!
Cheers,
Chris
P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates post branch to 1.6 into the new 1.6 RC #2.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Sergey Beryozkin <sb...@gmail.com>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>> at java.lang.String.checkBounds(String.java:371)
>>>> at java.lang.String.<init>(String.java:415)
>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>> [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ok thanks
Sent from my iPhone
On Aug 31, 2014, at 1:35 PM, "Tyler Palsulich" <tp...@gmail.com> wrote:
>> Commit it to trunk and then yes
> Already in there (thanks, Nick!).
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Tyler Palsulich <tp...@gmail.com>.
>Commit it to trunk and then yes
Already in there (thanks, Nick!).
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Commit it to trunk and then yes
Sent from my iPhone
> On Aug 31, 2014, at 1:11 PM, "Tyler Palsulich" <tp...@gmail.com> wrote:
>
> Can we get TIKA-1404 in 1.6? Simple, but significant, fix.
>
> Tyler
> On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Ugh, sorry. Maven release plugin issues, going to have to clean some
>> stuff up here. Don't mind me folks.
>>
>> -----Original Message-----
>> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Sunday, August 31, 2014 12:37 PM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>> OK RC #2 coming up shortly, just brought the branch up to date in
>>> r1621623. Also cleaned up JIRA.
>>>
>>> Here goes..
>>>
>>> -----Original Message-----
>>> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>>> Date: Thursday, July 31, 2014 11:29 AM
>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>>> Guys, based on all the comments here, I am going to roll another
>>>> RC #2 to address:
>>>>
>>>> - Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>>>> - Dave's Lingo24 API plugin for translate
>>>> - Nick's POI updates
>>>>
>>>> I'll roll another RC #2 probably on Monday.
>>>>
>>>> Thanks!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. When I do, I'll diff trunk against the branch and then roll any
>>>> trunk updates post branch to 1.6 into the new 1.6 RC #2.
>>>>
>>>> -----Original Message-----
>>>> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>> Date: Monday, July 28, 2014 11:45 AM
>>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>>> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>>>> thread for a few weeks about getting 1.6 out. Do you have a patch right
>>>>> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>>>> to get it in. If you don't have a patch yet, would you mind terribly if
>>>>> we pushed out 1.6, which already today has a ton of great updates, then
>>>>> shortly thereafter rolled a 1.7 (or did so when you finished with
>>>>> TIKA-1367)?
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sergey Beryozkin <sb...@gmail.com>
>>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>>> Date: Monday, July 28, 2014 11:38 AM
>>>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>
>>>>>> +0 given that it appears that the tika-parsers dependencies
>>>>>> documentation issue has been pushed away. I'm getting confused why.
>>>>>>
>>>>>> Thanks. Sergey
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>>>>>
>>>>>>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>>>>>> +1
>>>>>>>
>>>>>>> OSX 10.9.3, Java 1.7
>>>>>>>
>>>>>>> Tyler
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>>>>> Windows 7, Java 1.7
>>>>>>>>
>>>>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>>>>
>>>>>>>> There was one regression:
>>>>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>>>>
>>>>>>>> Stacktrace:
>>>>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>>>> at java.lang.String.checkBounds(String.java:371)
>>>>>>>> at java.lang.String.<init>(String.java:415)
>>>>>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>>>>> To: dev@tika.apache.org
>>>>>>>> Cc: user@tika.apache.org
>>>>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>>>>
>>>>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>>>>
>>>>>>>>
>>>>>>>> The release candidate is a zip archive of the sources in:
>>>>>>>>
>>>>>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>>>>
>>>>>>>> The SHA1 checksum of the archive is
>>>>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>>>>
>>>>>>>> A Maven staging repository is available at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>>>>
>>>>>>>>
>>>>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>>>>> least three +1 Tika PMC votes are cast.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>>>>>> [ ] -1 Do not release this package because...
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> P.S. Here is my +1!
>>
>>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Tyler Palsulich <tp...@gmail.com>.
Can we get TIKA-1404 in 1.6? Simple, but significant, fix.
Tyler
On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
chris.a.mattmann@jpl.nasa.gov> wrote:
> Ugh, sorry. Maven release plugin issues, going to have to clean some
> stuff up here. Don't mind me folks.
>
> -----Original Message-----
> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Sunday, August 31, 2014 12:37 PM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
> >OK RC #2 coming up shortly, just brought the branch up to date in
> >r1621623. Also cleaned up JIRA.
> >
> >Here goes..
> >
> >-----Original Message-----
> >From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
> >Date: Thursday, July 31, 2014 11:29 AM
> >To: "dev@tika.apache.org" <de...@tika.apache.org>
> >Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
> >
> >>Guys, based on all the comments here, I am going to roll another
> >>RC #2 to address:
> >>
> >>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
> >>- Dave's Lingo24 API plugin for translate
> >>- Nick's POI updates
> >>
> >>I'll roll another RC #2 probably on Monday.
> >>
> >>Thanks!
> >>
> >>Cheers,
> >>Chris
> >>
> >>P.S. When I do, I'll diff trunk against the branch and then roll any
> >>trunk updates post branch to 1.6 into the new 1.6 RC #2.
> >>
> >>-----Original Message-----
> >>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
> >>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> >>Date: Monday, July 28, 2014 11:45 AM
> >>To: "dev@tika.apache.org" <de...@tika.apache.org>
> >>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
> >>
> >>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
> >>>thread for a few weeks about getting 1.6 out. Do you have a patch right
> >>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
> >>>to get it in. If you don't have a patch yet, would you mind terribly if
> >>>we pushed out 1.6, which already today has a ton of great updates, then
> >>>shortly thereafter rolled a 1.7 (or did so when you finished with
> >>>TIKA-1367)?
> >>>
> >>>Cheers,
> >>>Chris
> >>>
> >>>
> >>>-----Original Message-----
> >>>From: Sergey Beryozkin <sb...@gmail.com>
> >>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> >>>Date: Monday, July 28, 2014 11:38 AM
> >>>To: "dev@tika.apache.org" <de...@tika.apache.org>
> >>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
> >>>
> >>>>+0 given that it appears that the tika-parsers dependencies
> >>>>documentation issue has been pushed away. I'm getting confused why.
> >>>>
> >>>>Thanks. Sergey
> >>>>
> >>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
> >>>>
> >>>>On 28/07/14 17:16, Tyler Palsulich wrote:
> >>>>> +1
> >>>>>
> >>>>> OSX 10.9.3, Java 1.7
> >>>>>
> >>>>> Tyler
> >>>>>
> >>>>>
> >>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
> >>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
> >>>>>> Windows 7, Java 1.7
> >>>>>>
> >>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
> >>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
> >>>>>> 10,413 docs. There were several improvements in text extraction for PDFs
> >>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
> >>>>>>
> >>>>>> There was one regression:
> >>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
> >>>>>>
> >>>>>> Stacktrace:
> >>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
> >>>>>> at java.lang.String.checkBounds(String.java:371)
> >>>>>> at java.lang.String.<init>(String.java:415)
> >>>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
> >>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
> >>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
> >>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
> >>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
> >>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> >>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
> >>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> >>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
> >>>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
> >>>>>> Sent: Monday, July 28, 2014 12:22 AM
> >>>>>> To: dev@tika.apache.org
> >>>>>> Cc: user@tika.apache.org
> >>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
> >>>>>>
> >>>>>> Hi Folks,
> >>>>>>
> >>>>>> A candidate for the Tika 1.6 release is available at:
> >>>>>>
> >>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
> >>>>>>
> >>>>>>
> >>>>>> The release candidate is a zip archive of the sources in:
> >>>>>>
> >>>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
> >>>>>>
> >>>>>> The SHA1 checksum of the archive is
> >>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
> >>>>>>
> >>>>>> A Maven staging repository is available at:
> >>>>>>
> >>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
> >>>>>>
> >>>>>>
> >>>>>> Please vote on releasing this package as Apache Tika 1.6.
> >>>>>> The vote is open for the next 72 hours and passes if a majority of at
> >>>>>> least three +1 Tika PMC votes are cast.
> >>>>>>
> >>>>>> [ ] +1 Release this package as Apache Tika 1.6
> >>>>>> [ ] -1 Do not release this package because...
> >>>>>>
> >>>>>> Thank you!
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>>
> >>>>>> P.S. Here is my +1!
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >
>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ugh, sorry. Maven release plugin issues, going to have to clean some
stuff up here. Don't mind me folks.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Sunday, August 31, 2014 12:37 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>OK RC #2 coming up shortly, just brought the branch up to date in
>r1621623. Also cleaned up JIRA.
>
>Here goes..
>
>-----Original Message-----
>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>Date: Thursday, July 31, 2014 11:29 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Guys, based on all the comments here, I am going to roll another
>>RC #2 to address:
>>
>>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>>- Dave's Lingo24 API plugin for translate
>>- Nick's POI updates
>>
>>I'll roll another RC #2 probably on Monday.
>>
>>Thanks!
>>
>>Cheers,
>>Chris
>>
>>P.S. When I do, I'll diff trunk against the branch and then roll any
>>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>>
>>-----Original Message-----
>>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, July 28, 2014 11:45 AM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>>to get it in. If you don't have a patch yet, would you mind terribly if
>>>we pushed out 1.6, which already today has a ton of great updates, then
>>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>>TIKA-1367)?
>>>
>>>Cheers,
>>>Chris
>>>
>>>
>>>-----Original Message-----
>>>From: Sergey Beryozkin <sb...@gmail.com>
>>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>Date: Monday, July 28, 2014 11:38 AM
>>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>>>+0 given that it appears that the tika-parsers dependencies
>>>>documentation issue has been pushed away. I'm getting confused why.
>>>>
>>>>Thanks. Sergey
>>>>
>>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>>>
>>>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>>>> +1
>>>>>
>>>>> OSX 10.9.3, Java 1.7
>>>>>
>>>>> Tyler
>>>>>
>>>>>
>>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>>> Windows 7, Java 1.7
>>>>>>
>>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>>
>>>>>> There was one regression:
>>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>>
>>>>>> Stacktrace:
>>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>> at java.lang.String.checkBounds(String.java:371)
>>>>>> at java.lang.String.<init>(String.java:415)
>>>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>>> To: dev@tika.apache.org
>>>>>> Cc: user@tika.apache.org
>>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>>
>>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>>
>>>>>>
>>>>>> The release candidate is a zip archive of the sources in:
>>>>>>
>>>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>>
>>>>>> The SHA1 checksum of the archive is
>>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>>
>>>>>> A Maven staging repository is available at:
>>>>>>
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>>
>>>>>>
>>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>>> least three +1 Tika PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>>>> [ ] -1 Do not release this package because...
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> P.S. Here is my +1!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>
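The regression quoted above fails inside the String constructor with an index of -369073454. One plausible reading (an assumption, not something confirmed in this thread) is that a 32-bit length field read from the embedded OLE stream was interpreted as a negative number and then used as a string length. The sketch below shows the general defensive pattern of validating a length prefix before using it. The class LengthPrefixedReader and its method are hypothetical illustrations and are not POI or Tika code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not how POI's Ole10Native is implemented.
final class LengthPrefixedReader {

    /**
     * Reads a 4-byte little-endian length at the given offset, followed by
     * that many single-byte characters. Rejects negative or oversized lengths
     * with a checked exception instead of letting the String constructor fail.
     */
    static String readString(byte[] data, int offset) throws IOException {
        if (offset < 0 || data.length - offset < 4) {
            throw new IOException("Truncated length field at offset " + offset);
        }
        int length = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
        int start = offset + 4;
        if (length < 0 || length > data.length - start) {
            throw new IOException("Implausible string length " + length
                    + " with only " + (data.length - start) + " bytes remaining");
        }
        return new String(data, start, length, StandardCharsets.ISO_8859_1);
    }
}
```

Throwing a checked IOException gives the calling parser a chance to skip the one malformed embedded object and keep processing the rest of the document, rather than aborting on an unchecked exception.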
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
OK RC #2 coming up shortly, just brought the branch up to date in
r1621623. Also cleaned up JIRA.
Here goes..
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Date: Thursday, July 31, 2014 11:29 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>Guys, based on all the comments here, I am going to roll another
>RC #2 to address:
>
>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>- Dave's Lingo24 API plugin for translate
>- Nick's POI updates
>
>I'll roll another RC #2 probably on Monday.
>
>Thanks!
>
>Cheers,
>Chris
>
>P.S. When I do, I'll diff trunk against the branch and then roll any
>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>
>-----Original Message-----
>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:45 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>to get it in. If you don't have a patch yet, would you mind terribly if
>>we pushed out 1.6, which already today has a ton of great updates, then
>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>TIKA-1367)?
>>
>>Cheers,
>>Chris
>>
>>
>>-----Original Message-----
>>From: Sergey Beryozkin <sb...@gmail.com>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, July 28, 2014 11:38 AM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>+0 given that it appears that the tika-parsers dependencies
>>>documentation issue has been pushed away. I'm getting confused why.
>>>
>>>Thanks. Sergey
>>>
>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>
>>>>> There was one regression:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>
>>>>> Stacktrace:
>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>> at java.lang.String.checkBounds(String.java:371)
>>>>> at java.lang.String.<init>(String.java:415)
>>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>> To: dev@tika.apache.org
>>>>> Cc: user@tika.apache.org
>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>
>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>
>>>>>
>>>>> The release candidate is a zip archive of the sources in:
>>>>>
>>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>
>>>>> The SHA1 checksum of the archive is
>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>
>>>>> A Maven staging repository is available at:
>>>>>
>>>>>
>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>> least three +1 Tika PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>>> [ ] -1 Do not release this package because...
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> P.S. Here is my +1!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Guys, based on all the comments here, I am going to roll an RC #2
to address:
- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate.
- Nick's POI updates.
I'll probably roll RC #2 on Monday.
Thanks!
Cheers,
Chris
P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates made after the 1.6 branch was cut into the new 1.6 RC #2.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Sergey Beryozkin <sb...@gmail.com>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>> at java.lang.String.checkBounds(String.java:371)
>>>> at java.lang.String.<init>(String.java:415)
>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>> [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 29/07/14 13:14, Nick Burch wrote:
> On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
>> This is not an issue that should block the release, I was careful not
>> to vote with a minus one. I've become a bit impatient, but no one
>> really blocks me from completing this pure documentation effort
>> myself, I was hoping that someone would do it first :-).
>
> Given that this is a documentation / website enhancement, I don't see
> any reason why we couldn't post the details for 1.6 (and even perhaps
> 1.5!) to the site in a few weeks' time, irrespective of when the 1.6
> release goes out :)
Yes, you are right,
Cheers, Sergey
>
> Cheers
> Nick
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
> This is not an issue that should block the release, I was careful not to
> vote with a minus one. I've become a bit impatient, but no one really
> blocks me from completing this pure documentation effort myself, I was
> hoping that someone would do it first :-).
Given that this is a documentation / website enhancement, I don't see any
reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to
the site in a few weeks' time, irrespective of when the 1.6 release goes
out :)
Cheers
Nick
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thank you Sergey! OK, I will proceed. Thanks for your contributions
to Tika, and yes, we'll get there.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <sb...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 3:16 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>Hi Chris,
>
>This is not an issue that should block the release, I was careful not to
>vote with a minus one. I've become a bit impatient, but no one really
>blocks me from completing this pure documentation effort myself, I was
>hoping that someone would do it first :-).
>
>Please go ahead with the release as planned, thanks for offering the
>chance to delay the release, but I can not go for it, we'll get there as
>far as the documentation is concerned :-)
>
>Thanks, Sergey
>
>On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:
>> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>> thread for a few weeks about getting 1.6 out. Do you have a patch right
>> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>> to get it in. If you don't have a patch yet, would you mind terribly if
>> we pushed out 1.6, which already today has a ton of great updates, then
>> shortly thereafter rolled a 1.7 (or did so when you finished with
>> TIKA-1367)?
>>
>> Cheers,
>> Chris
>>
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <sb...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Monday, July 28, 2014 11:38 AM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>> +0 given that it appears that the tika-parsers dependencies
>>> documentation issue has been pushed away. I'm getting confused why.
>>>
>>> Thanks. Sergey
>>>
>>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>
>>>>> There was one regression:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>
>>>>> Stacktrace:
>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>> at java.lang.String.checkBounds(String.java:371)
>>>>> at java.lang.String.<init>(String.java:415)
>>>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>> To: dev@tika.apache.org
>>>>> Cc: user@tika.apache.org
>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>
>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>
>>>>>
>>>>> The release candidate is a zip archive of the sources in:
>>>>>
>>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>
>>>>> The SHA1 checksum of the archive is
>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>
>>>>> A Maven staging repository is available at:
>>>>>
>>>>>
>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>> least three +1 Tika PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>>> [ ] -1 Do not release this package because...
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> P.S. Here is my +1!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris,
This is not an issue that should block the release; I was careful not to
vote with a minus one. I've become a bit impatient, but nothing really
stops me from completing this pure documentation effort myself. I was
just hoping that someone would do it first :-).
Please go ahead with the release as planned. Thanks for offering the
chance to delay the release, but I can't accept that; we'll get there as
far as the documentation is concerned :-)
Thanks, Sergey
On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:
> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
> thread for a few weeks about getting 1.6 out. Do you have a patch right
> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
> to get it in. If you don't have a patch yet, would you mind terribly if
> we pushed out 1.6, which already today has a ton of great updates, then
> shortly thereafter rolled a 1.7 (or did so when you finished with
> TIKA-1367)?
>
> Cheers,
> Chris
>
>
> -----Original Message-----
> From: Sergey Beryozkin <sb...@gmail.com>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Monday, July 28, 2014 11:38 AM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>> +0 given that it appears that the tika-parsers dependencies
>> documentation issue has been pushed away. I'm getting confused why.
>>
>> Thanks. Sergey
>>
>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out
>>>> of
>>>> range: -369073454
>>>> at java.lang.String.checkBounds(String.java:371)
>>>> at java.lang.String.<init>(String.java:415)
>>>> at
>>>>
>>>> org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:
>>>> 114)
>>>> at
>>>>
>>>> org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>> at
>>>>
>>>> org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(
>>>> Ole10Native.java:91)
>>>> at
>>>>
>>>> org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(
>>>> Ole10Native.java:63)
>>>> at
>>>>
>>>> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbe
>>>> ddedOLE(AbstractOOXMLExtractor.java:250)
>>>> at
>>>>
>>>> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbe
>>>> ddedParts(AbstractOOXMLExtractor.java:199)
>>>> at
>>>>
>>>> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(A
>>>> bstractOOXMLExtractor.java:115)
>>>> at
>>>>
>>>> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXML
>>>> ExtractorFactory.java:112)
>>>> at
>>>>
>>>> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.jav
>>>> a:82)
>>>> at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Tika 1.6
>>>> [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
thread for a few weeks about getting 1.6 out. Do you have a patch right
now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
to get it in. If you don't have a patch yet, would you mind terribly if
we pushed out 1.6, which already today has a ton of great updates, then
shortly thereafter rolled a 1.7 (or did so when you finished with
TIKA-1367)?
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <sb...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:38 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>+0 given that it appears that the tika-parsers dependencies
>documentation issue has been pushed away. I'm getting confused why.
>
>Thanks. Sergey
>
>[1] https://issues.apache.org/jira/browse/TIKA-1367
>
>On 28/07/14 17:16, Tyler Palsulich wrote:
>> +1
>>
>> OSX 10.9.3, Java 1.7
>>
>> Tyler
>>
>>
>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>><ta...@mitre.org>
>> wrote:
>>
>>> +1
>>>
>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>> Windows 7, Java 1.7
>>>
>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000
>>> docs (all formats) plus all available msoffice-x files in govdocs1,
>>> yielding 10,413 docs. There were several improvements in text
>>> extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt,
>>> 1 doc and 1 pdf).
>>>
>>> There was one regression:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>
>>> Stacktrace:
>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>> at java.lang.String.checkBounds(String.java:371)
>>> at java.lang.String.<init>(String.java:415)
>>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>
>>>
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Monday, July 28, 2014 12:22 AM
>>> To: dev@tika.apache.org
>>> Cc: user@tika.apache.org
>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>> Hi Folks,
>>>
>>> A candidate for the Tika 1.6 release is available at:
>>>
>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>
>>> The SHA1 checksum of the archive is
>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>
>>> A Maven staging repository is available at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>
>>>
>>> Please vote on releasing this package as Apache Tika 1.6.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 Tika PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Tika 1.6
>>> [ ] -1 Do not release this package because...
>>>
>>> Thank you!
>>>
>>> Cheers,
>>> Chris
>>>
>>> P.S. Here is my +1!
>>>
>>>
>>>
>>>
>>>
>>>
>>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Sergey Beryozkin <sb...@gmail.com>.
+0 given that it appears that the tika-parsers dependencies
documentation issue has been pushed away. I'm getting confused why.
Thanks. Sergey
[1] https://issues.apache.org/jira/browse/TIKA-1367
On 28/07/14 17:16, Tyler Palsulich wrote:
> +1
>
> OSX 10.9.3, Java 1.7
>
> Tyler
>
>
> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
> wrote:
>
>> +1
>>
>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>> Windows 7, Java 1.7
>>
>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>> 10,413 docs. There were several improvements in text extraction for PDFs
>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>
>> There was one regression:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>
>> Stacktrace:
>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>> at java.lang.String.checkBounds(String.java:371)
>> at java.lang.String.<init>(String.java:415)
>> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>
>>
>> -----Original Message-----
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>> Sent: Monday, July 28, 2014 12:22 AM
>> To: dev@tika.apache.org
>> Cc: user@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>
>> Hi Folks,
>>
>> A candidate for the Tika 1.6 release is available at:
>>
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>
>>
>> The release candidate is a zip archive of the sources in:
>>
>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>
>> The SHA1 checksum of the archive is
>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>
>> A Maven staging repository is available at:
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>
>>
>> Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.6
>> [ ] -1 Do not release this package because...
>>
>> Thank you!
>>
>> Cheers,
>> Chris
>>
>> P.S. Here is my +1!
>>
>>
>>
>>
>>
>>
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Tyler Palsulich <tp...@gmail.com>.
+1
OSX 10.9.3, Java 1.7
Tyler
On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:
> +1
>
> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
> Windows 7, Java 1.7
>
> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
> (all formats) plus all available msoffice-x files in govdocs1, yielding
> 10,413 docs. There were several improvements in text extraction for PDFs
> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
> at java.lang.String.checkBounds(String.java:371)
> at java.lang.String.<init>(String.java:415)
> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
> Sent: Monday, July 28, 2014 12:22 AM
> To: dev@tika.apache.org
> Cc: user@tika.apache.org
> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>
RE: [VOTE] Apache Tika 1.6 release candidate #1
Posted by "Allison, Timothy B." <ta...@mitre.org>.
+1
Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7
I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. There were several improvements in text extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
at java.lang.String.checkBounds(String.java:371)
at java.lang.String.<init>(String.java:415)
at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
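For readers unfamiliar with this failure mode: a negative value in "String index out of range" is often the result of a length (or offset plus length) read from corrupt or unexpected binary data and passed straight to a String constructor. The sketch below is illustrative only and is not POI's code; the class name, the 4-byte little-endian length prefix, and the Latin-1 encoding are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI or Tika.
final class LengthPrefixedString {

    /** Trusts the 4-byte little-endian length prefix found at offset. */
    static String readUnchecked(byte[] data, int offset) {
        int len = ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        // A negative or oversized len makes this constructor throw an
        // IndexOutOfBoundsException (StringIndexOutOfBoundsException is a subclass).
        return new String(data, offset + 4, len, StandardCharsets.ISO_8859_1);
    }

    /** Same read, but rejects lengths that cannot fit in the remaining bytes. */
    static String readChecked(byte[] data, int offset) {
        int len = ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (len < 0 || len > data.length - offset - 4) {
            throw new IllegalArgumentException("Implausible string length: " + len);
        }
        return new String(data, offset + 4, len, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] good = {2, 0, 0, 0, 'h', 'i'};
        System.out.println(readChecked(good, 0));

        // 0xFFFFFFFF decodes to -1 as a signed int.
        byte[] corrupt = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 'x'};
        try {
            readUnchecked(corrupt, 0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked read failed: " + e);
        }
    }
}
```

Whether the regression above has this exact cause would need to be confirmed against the POI source; the point is only that validating the length before building the string turns an obscure index error into a clear parse failure.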
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1
Hi Folks,
A candidate for the Tika 1.6 release is available at:
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.6/
The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.
A Maven staging repository is available at:
https://repository.apache.org/content/repositories/orgapachetika-1003/
Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.
[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because...
Thank you!
Cheers,
Chris
P.S. Here is my +1!
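For anyone checking the release candidate, the SHA1 checksum quoted in the vote mail can be compared against a locally downloaded archive. The following is a minimal sketch using only the JDK; the class name is made up for the example, and the archive path and expected checksum are supplied by the caller on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for verifying a downloaded archive.
final class Sha1Check {

    /** Streams the input and returns its SHA-1 digest as lowercase hex. */
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded archive
        // args[1]: expected checksum copied from the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```

A checksum match only shows the download is intact; verifying the release signature is a separate step.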
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Tyler Palsulich <tp...@gmail.com>.
Hi All,
After the recent NPE that Chris found (
https://issues.apache.org/jira/browse/TIKA-1378), we should roll an RC#2.
Tyler
On Wed, Jul 30, 2014 at 10:55 AM, Nick Burch <ap...@gagravarr.org> wrote:
> On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
>
>> A candidate for the Tika 1.6 release is available at:
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>
>
> Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
> release that it shouldn't be
>
>
>
> Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>
> Otherwise I'm +1
>
> Nick
>
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
> A candidate for the Tika 1.6 release is available at:
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
release that it shouldn't be.
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
Otherwise I'm +1
Nick
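As a reading aid for the pass condition quoted above ("a majority of at least three +1 Tika PMC votes"), here is one possible interpretation as code: at least three +1 votes, and more +1 than -1 votes. The class name and the encoding of votes as integers are assumptions for the example, and deciding which votes are binding PMC votes is left to the caller.

```java
// Hypothetical tally helper; pass in only the binding (PMC) votes.
final class ReleaseVote {

    /**
     * votes: +1, 0 or -1 per voter (a "+0" counts as 0).
     * Passes with at least three +1 votes and more +1 than -1 votes.
     */
    static boolean passes(int[] votes) {
        int plus = 0;
        int minus = 0;
        for (int v : votes) {
            if (v > 0) {
                plus++;
            } else if (v < 0) {
                minus++;
            }
        }
        return plus >= 3 && plus > minus;
    }

    public static void main(String[] args) {
        // Illustrative tally only, not the actual result of this vote.
        System.out.println(passes(new int[] {1, 1, 1, 0}));
    }
}
```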
Re: [VOTE] Apache Tika 1.6 release candidate #1
Posted by Oleg Tikhonov <ol...@apache.org>.
[x] +1 Release this package as Apache Tika 1.6.
Tested on the following systems:
1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC
2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux
Thanks,
Oleg
On Mon, Jul 28, 2014 at 7:22 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>