Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/07/28 06:22:02 UTC

[VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

    http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.
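
One way to check a downloaded copy of the archive against the digest above is sketched here in Python. Only the digest value comes from this message; the helper names are made up for illustration, and the archive file name in the usage note is hypothetical because the email does not state it.

```python
import hashlib

# Digest quoted in the vote email.
EXPECTED_SHA1 = "076ad343be56a540a4c8e395746fa4fda5b5b6d3"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA1):
    """True if the file's SHA-1 equals the expected digest (case-insensitive)."""
    return sha1_of_file(path).lower() == expected.strip().lower()
```

For example, `matches_expected("tika-1.6-src.zip")` would return True only if that (hypothetically named) file hashes to the digest quoted above.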

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...
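
The passing rule above can be read as: at least three +1 votes from PMC members, and more PMC +1 votes than PMC -1 votes. That reading is an assumption drawn from the wording of this email, not a statement of official policy. A minimal Python sketch of that reading follows; the function name and the ballot representation are hypothetical.

```python
def vote_passes(ballots, minimum_plus_ones=3):
    """Decide a vote under the assumed rule.

    `ballots` is an iterable of (value, is_pmc) pairs, where value is
    +1, 0 or -1. Only ballots cast by PMC members are counted.
    """
    binding = [value for value, is_pmc in ballots if is_pmc]
    plus = sum(1 for value in binding if value > 0)
    minus = sum(1 for value in binding if value < 0)
    # Needs the minimum number of +1s and a strict majority over -1s.
    return plus >= minimum_plus_ones and plus > minus
```

Non-PMC votes and +0 votes are still worth casting as feedback; in this sketch they simply do not affect the result.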

Thank you!

Cheers,
Chris

P.S. Here is my +1!






Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Nick Burch <ni...@apache.org>.
Another quick thought on the artifacts in 
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ - as well as 
needing to ditch original-tika-app.jar, shouldn't we have the Tika 
Server standalone jar in there too as another released + easily 
downloadable jar?

Thanks
Nick

On 28/07/14 05:22, Mattmann, Chris A (3980) wrote:
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>      [ ] +1 Release this package as Apache Tika 1.6
>      [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>


RE: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Allison, Timothy B. wrote:
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
> 	at java.lang.String.checkBounds(String.java:371)
> 	at java.lang.String.<init>(String.java:415)
> 	at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
> 	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
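
A negative index like the one in this trace is what you would expect if a length field read from the embedded object were interpreted as a signed 32-bit integer and then used without a range check. That is only a guess at the failure mode, not a description of the actual POI code. The Python sketch below shows the kind of bounds check that turns such a value into a clear error; `read_length_prefixed` is a hypothetical helper, and the little-endian layout is an assumption.

```python
import struct


def read_length_prefixed(data, offset=0):
    """Read a little-endian signed 32-bit length, then that many bytes.

    Hypothetical helper for illustration only. It rejects lengths that are
    negative or that would run past the end of the buffer.
    """
    if offset < 0 or offset + 4 > len(data):
        raise ValueError("buffer too short for length field")
    (length,) = struct.unpack_from("<i", data, offset)
    start = offset + 4
    if length < 0 or start + length > len(data):
        raise ValueError("implausible length %d at offset %d" % (length, offset))
    return data[start:start + length]
```

Feeding it a buffer whose first four bytes decode to -369073454 raises `ValueError` rather than attempting to slice with a negative length.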

Any chance you could raise a POI bug for this? We're probably going to do 
the next POI beta release within a week, so if you hurry it might even get 
fixed in that... :)

Nick

RE: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Nick,
  Just to be clear -- that wasn't a veiled complaint that you hadn't cut the 3.11-beta!  I really just have not had a chance to start the run with my local build of poi-trunk.
  Thank you, as always!

            Best,

                    Tim

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Thursday, July 31, 2014 3:06 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.6 release candidate #1

On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
>  On a related note, I did some digging on the one regression I found in 
> the pptx, and that will be solved if we wait for POI 3.11 beta 1.  I 
> haven't yet had a chance to rerun on the random sample with the updated 
> POI...

I'm currently on a train to France, but fingers crossed I'll be able to 
upload the POI 3.11 beta 1 artifacts for you to test with before I run out 
of English mobile phone signal...

Nick

RE: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
>  On a related note, I did some digging on the one regression I found in 
> the pptx, and that will be solved if we wait for POI 3.11 beta 1.  I 
> haven't yet had a chance to rerun on the random sample with the updated 
> POI...

I'm currently on a train to France, but fingers crossed I'll be able to 
upload the POI 3.11 beta 1 artifacts for you to test with before I run out 
of English mobile phone signal...

Nick

RE: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,
  On a related note, I did some digging on the one regression I found in the pptx, and that will be solved if we wait for POI 3.11 beta 1.  I haven't yet had a chance to rerun on the random sample with the updated POI...  

         Best,

                   Tim

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, July 31, 2014 2:30 PM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

Guys, based on all the comments here, I am going to roll another
RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates post branch to 1.6 into the new 1.6 RC #2.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Sergey Beryozkin <sb...@gmail.com>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>><ta...@mitre.org>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>          at java.lang.String.checkBounds(String.java:371)
>>>>          at java.lang.String.<init>(String.java:415)
>>>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>>      [ ] +1 Release this package as Apache Tika 1.6
>>>>      [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ok thanks 

Sent from my iPhone

On Aug 31, 2014, at 1:35 PM, "Tyler Palsulich" <tp...@gmail.com> wrote:

>> Commit it to trunk and then yes
> Already in there (thanks, Nick!).

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Tyler Palsulich <tp...@gmail.com>.
>Commit it to trunk and then yes
Already in there (thanks, Nick!).

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Commit it to trunk and then yes 

Sent from my iPhone

> On Aug 31, 2014, at 1:11 PM, "Tyler Palsulich" <tp...@gmail.com> wrote:
> 
> Can we get TIKA-1404 in 1.6? Simple, but significant, fix.
> 
> Tyler
> On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
> chris.a.mattmann@jpl.nasa.gov> wrote:
> 
>> Ugh, sorry. Maven release plugin issues, going to have to clean some
>> stuff up here. Don't mind me folks.
>> 
>> -----Original Message-----
>> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Sunday, August 31, 2014 12:37 PM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>> 
>>> OK RC #2 coming up shortly, just brought the branch up to date in
>>> r1621623. Also cleaned up JIRA.
>>> 
>>> Here goes..
>>> 
>> 
>> 

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Tyler Palsulich <tp...@gmail.com>.
Can we get TIKA-1404 in 1.6? Simple, but significant, fix.

Tyler
On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Ugh, sorry. Maven release plugin issues, going to have to clean some
> stuff up here. Don't mind me folks.
>
> -----Original Message-----
> From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Sunday, August 31, 2014 12:37 PM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
> >OK RC #2 coming up shortly, just brought the branch up to date in
> >r1621623. Also cleaned up JIRA.
> >
> >Here goes..
> >
>
>

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ugh, sorry. Maven release plugin issues, going to have to clean some
stuff up here. Don't mind me folks.

-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Sunday, August 31, 2014 12:37 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>OK RC #2 coming up shortly, just brought the branch up to date in
>r1621623. Also cleaned up JIRA.
>
>Here goes..
>
>-----Original Message-----
>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>Date: Thursday, July 31, 2014 11:29 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Guys, based on all the comments here, I am going to roll another
>>RC #2 to address:
>>
>>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>>- Dave's Lingo24 API plugin for translate
>>- Nick's POI updates
>>
>>I'll roll another RC #2 probably on Monday.
>>
>>Thanks!
>>
>>Cheers,
>>Chris
>>
>>P.S. When I do, I'll diff trunk against the branch and then roll any
>>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>>
>>-----Original Message-----
>>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, July 28, 2014 11:45 AM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>>to get it in. If you don't have a patch yet, would you mind terribly if
>>>we pushed out 1.6, which already today has a ton of great updates, then
>>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>>TIKA-1367)?
>>>
>>>Cheers,
>>>Chris
>>>
>>>
>>>-----Original Message-----
>>>From: Sergey Beryozkin <sb...@gmail.com>
>>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>Date: Monday, July 28, 2014 11:38 AM
>>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>>>+0 given that it appears that the tika-parsers dependencies
>>>>documentation issue has been pushed away. I'm getting confused why.
>>>>
>>>>Thanks. Sergey
>>>>
>>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>>>
>>>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>>>> +1
>>>>>
>>>>> OSX 10.9.3, Java 1.7
>>>>>
>>>>> Tyler
>>>>>
>>>>>
>>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>>><ta...@mitre.org>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>>> Windows 7, Java 1.7
>>>>>>
>>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>>
>>>>>> There was one regression:
>>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>>
>>>>>> Stacktrace:
>>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>>          at java.lang.String.checkBounds(String.java:371)
>>>>>>          at java.lang.String.<init>(String.java:415)
>>>>>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>>> To: dev@tika.apache.org
>>>>>> Cc: user@tika.apache.org
>>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>>
>>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>>
>>>>>>
>>>>>> The release candidate is a zip archive of the sources in:
>>>>>>
>>>>>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>>
>>>>>> The SHA1 checksum of the archive is
>>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>>
>>>>>> A Maven staging repository is available at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>>
>>>>>>
>>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>>> least three +1 Tika PMC votes are cast.
>>>>>>
>>>>>>      [ ] +1 Release this package as Apache Tika 1.6
>>>>>>      [ ] -1 Do not release this package because...
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> P.S. Here is my +1!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
OK RC #2 coming up shortly, just brought the branch up to date in
r1621623. Also cleaned up JIRA.

Here goes..

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Date: Thursday, July 31, 2014 11:29 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Guys, based on all the comments here, I am going to roll another
>RC #2 to address:
>
>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>- Dave's Lingo24 API plugin for translate
>- Nick's POI updates
>
>I'll roll another RC #2 probably on Monday.
>
>Thanks!
>
>Cheers,
>Chris
>
>P.S. When I do, I'll diff trunk against the branch and then roll any
>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>
>-----Original Message-----
>From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:45 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>to get it in. If you don't have a patch yet, would you mind terribly if
>>we pushed out 1.6, which already today has a ton of great updates, then
>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>TIKA-1367)?
>>
>>Cheers,
>>Chris
>>
>>
>>-----Original Message-----
>>From: Sergey Beryozkin <sb...@gmail.com>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, July 28, 2014 11:38 AM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>+0 given that it appears that the tika-parsers dependencies
>>>documentation issue has been pushed away. I'm getting confused why.
>>>
>>>Thanks. Sergey
>>>
>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>><ta...@mitre.org>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>
>>>>> There was one regression:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>
>>>>> Stacktrace:
>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>          at java.lang.String.checkBounds(String.java:371)
>>>>>          at java.lang.String.<init>(String.java:415)
>>>>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>> To: dev@tika.apache.org
>>>>> Cc: user@tika.apache.org
>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>
>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>
>>>>>
>>>>> The release candidate is a zip archive of the sources in:
>>>>>
>>>>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>
>>>>> The SHA1 checksum of the archive is
>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>
>>>>> A Maven staging repository is available at:
>>>>>
>>>>> 
>>>>>https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>> least three +1 Tika PMC votes are cast.
>>>>>
>>>>>      [ ] +1 Release this package as Apache Tika 1.6
>>>>>      [ ] -1 Do not release this package because...
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> P.S. Here is my +1!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Guys, based on all the comments here, I am going to roll another
RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates post branch to 1.6 into the new 1.6 RC #2.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Mattmann>, Chris Mattmann <Ch...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Sergey Beryozkin <sb...@gmail.com>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>><ta...@mitre.org>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>          at java.lang.String.checkBounds(String.java:371)
>>>>          at java.lang.String.<init>(String.java:415)
>>>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>>      [ ] +1 Release this package as Apache Tika 1.6
>>>>      [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 29/07/14 13:14, Nick Burch wrote:
> On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
>> This is not an issue that should block the release, I was careful not
>> to vote with a minus one. I've become a bit impatient, but no one
>> really blocks me from completing this pure documentation effort
>> myself, I was hoping that someone would do it first :-).
>
> Given that this is a documentation / website enhancement, I don't see
> any reason why we couldn't post the details for 1.6 (and even perhaps
> 1.5!) to the site in a few weeks time, irrespective of when the 1.6
> release goes out :)
Yes, you are right,

Cheers, Sergey
>
> Cheers
> Nick



Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
> This is not an issue that should block the release, I was careful not to 
> vote with a minus one. I've become a bit impatient, but no one really 
> blocks me from completing this pure documentation effort myself, I was 
> hoping that someone would do it first :-).

Given that this is a documentation / website enhancement, I don't see any 
reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to 
the site in a few weeks time, irrespective of when the 1.6 release goes 
out :)

Cheers
Nick

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thank you Sergey! OK, I will proceed. Thanks for your contributions
to Tika, and yes, we'll get there.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Sergey Beryozkin <sb...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 3:16 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Hi Chris,
>
>This is not an issue that should block the release, I was careful not to
>vote with a minus one. I've become a bit impatient, but no one really
>blocks me from completing this pure documentation effort myself, I was
>hoping that someone would do it first :-).
>
>Please go ahead with the release as planned, thanks for offering the
>chance to delay the release, but I can not go for it, we'll get there as
>far as the documentation is concerned :-)
>
>Thanks, Sergey
>
>On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:
>> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>> thread for a few weeks about getting 1.6 out. Do you have a patch right
>> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>> to get it in. If you don't have a patch yet, would you mind terribly if
>> we pushed out 1.6, which already today has a ton of great updates, then
>> shortly thereafter rolled a 1.7 (or did so when you finished with
>> TIKA-1367)?
>>
>> Cheers,
>> Chris
>>
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <sb...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Monday, July 28, 2014 11:38 AM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>> +0 given that it appears that the tika-parsers dependencies
>>> documentation issue has been pushed away. I'm getting confused why.
>>>
>>> Thanks. Sergey
>>>
>>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>> <ta...@mitre.org>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>
>>>>> There was one regression:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>
>>>>> Stacktrace:
>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>           at java.lang.String.checkBounds(String.java:371)
>>>>>           at java.lang.String.<init>(String.java:415)
>>>>>           at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>>           at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>>           at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>>> To: dev@tika.apache.org
>>>>> Cc: user@tika.apache.org
>>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A candidate for the Tika 1.6 release is available at:
>>>>>
>>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>>
>>>>>
>>>>> The release candidate is a zip archive of the sources in:
>>>>>
>>>>>       http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>>
>>>>> The SHA1 checksum of the archive is
>>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>>
>>>>> A Maven staging repository is available at:
>>>>>
>>>>> 
>>>>>https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>> least three +1 Tika PMC votes are cast.
>>>>>
>>>>>       [ ] +1 Release this package as Apache Tika 1.6
>>>>>       [ ] -1 Do not release this package because...
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> P.S. Here is my +1!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris,

This is not an issue that should block the release, I was careful not to 
vote with a minus one. I've become a bit impatient, but no one really 
blocks me from completing this pure documentation effort myself, I was 
hoping that someone would do it first :-).

Please go ahead with the release as planned, thanks for offering the 
chance to delay the release, but I can not go for it, we'll get there as 
far as the documentation is concerned :-)

Thanks, Sergey

On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:
> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
> thread for a few weeks about getting 1.6 out. Do you have a patch right
> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
> to get it in. If you don't have a patch yet, would you mind terribly if
> we pushed out 1.6, which already today has a ton of great updates, then
> shortly thereafter rolled a 1.7 (or did so when you finished with
> TIKA-1367)?
>
> Cheers,
> Chris
>
>
> -----Original Message-----
> From: Sergey Beryozkin <sb...@gmail.com>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Monday, July 28, 2014 11:38 AM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>> +0 given that it appears that the tika-parsers dependencies
>> documentation issue has been pushed away. I'm getting confused why.
>>
>> Thanks. Sergey
>>
>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>> <ta...@mitre.org>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000
>>>> docs (all formats) plus all available msoffice-x files in govdocs1,
>>>> yielding 10,413 docs.  There were several improvements in text
>>>> extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt,
>>>> 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>           at java.lang.String.checkBounds(String.java:371)
>>>>           at java.lang.String.<init>(String.java:415)
>>>>           at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>           at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>>           at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>>           at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>>           at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>> Sent: Monday, July 28, 2014 12:22 AM
>>>> To: dev@tika.apache.org
>>>> Cc: user@tika.apache.org
>>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>>
>>>> Hi Folks,
>>>>
>>>> A candidate for the Tika 1.6 release is available at:
>>>>
>>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>>
>>>>
>>>> The release candidate is a zip archive of the sources in:
>>>>
>>>>       http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>>
>>>> The SHA1 checksum of the archive is
>>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>>
>>>> A Maven staging repository is available at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>>
>>>>
>>>> Please vote on releasing this package as Apache Tika 1.6.
>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>> least three +1 Tika PMC votes are cast.
>>>>
>>>>       [ ] +1 Release this package as Apache Tika 1.6
>>>>       [ ] -1 Do not release this package because...
>>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> P.S. Here is my +1!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
thread for a few weeks about getting 1.6 out. Do you have a patch right
now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
to get it in. If you don't have a patch yet, would you mind terribly if
we pushed out 1.6, which already today has a ton of great updates, then
shortly thereafter rolled a 1.7 (or did so when you finished with
TIKA-1367)?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Sergey Beryozkin <sb...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, July 28, 2014 11:38 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>+0 given that it appears that the tika-parsers dependencies
>documentation issue has been pushed away. I'm getting confused why.
>
>Thanks. Sergey
>
>[1] https://issues.apache.org/jira/browse/TIKA-1367
>
>On 28/07/14 17:16, Tyler Palsulich wrote:
>> +1
>>
>> OSX 10.9.3, Java 1.7
>>
>> Tyler
>>
>>
>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>><ta...@mitre.org>
>> wrote:
>>
>>> +1
>>>
>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>> Windows 7, Java 1.7
>>>
>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000
>>> docs (all formats) plus all available msoffice-x files in govdocs1,
>>> yielding 10,413 docs.  There were several improvements in text
>>> extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt,
>>> 1 doc and 1 pdf).
>>>
>>> There was one regression:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>
>>> Stacktrace:
>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>          at java.lang.String.checkBounds(String.java:371)
>>>          at java.lang.String.<init>(String.java:415)
>>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>
>>>
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>> Sent: Monday, July 28, 2014 12:22 AM
>>> To: dev@tika.apache.org
>>> Cc: user@tika.apache.org
>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>> Hi Folks,
>>>
>>> A candidate for the Tika 1.6 release is available at:
>>>
>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>
>>> The SHA1 checksum of the archive is
>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>
>>> A Maven staging repository is available at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>
>>>
>>> Please vote on releasing this package as Apache Tika 1.6.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 Tika PMC votes are cast.
>>>
>>>      [ ] +1 Release this package as Apache Tika 1.6
>>>      [ ] -1 Do not release this package because...
>>>
>>> Thank you!
>>>
>>> Cheers,
>>> Chris
>>>
>>> P.S. Here is my +1!
>>>
>>>
>>>
>>>
>>>
>>>
>>


Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Sergey Beryozkin <sb...@gmail.com>.
+0, given that the tika-parsers dependencies documentation issue [1]
appears to have been pushed out. I'm confused as to why.

Thanks. Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367

On 28/07/14 17:16, Tyler Palsulich wrote:
> +1
>
> OSX 10.9.3, Java 1.7
>
> Tyler
>
>
> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
> wrote:
>
>> +1
>>
>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>> Windows 7, Java 1.7
>>
>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>> 10,413 docs.  There were several improvements in text extraction for PDFs
>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>
>> There was one regression:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>
>> Stacktrace:
>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>          at java.lang.String.checkBounds(String.java:371)
>>          at java.lang.String.<init>(String.java:415)
>>          at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>          at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>          at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>          at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>          at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>
>>
>> -----Original Message-----
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>> Sent: Monday, July 28, 2014 12:22 AM
>> To: dev@tika.apache.org
>> Cc: user@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>
>> Hi Folks,
>>
>> A candidate for the Tika 1.6 release is available at:
>>
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>
>>
>> The release candidate is a zip archive of the sources in:
>>
>>      http://svn.apache.org/repos/asf/tika/tags/1.6/
>>
>> The SHA1 checksum of the archive is
>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>
>> A Maven staging repository is available at:
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>
>>
>> Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>>      [ ] +1 Release this package as Apache Tika 1.6
>>      [ ] -1 Do not release this package because...
>>
>> Thank you!
>>
>> Cheers,
>> Chris
>>
>> P.S. Here is my +1!
>>
>>
>>
>>
>>
>>
>

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Tyler Palsulich <tp...@gmail.com>.
+1

OSX 10.9.3, Java 1.7

Tyler


On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> +1
>
> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
> Windows 7, Java 1.7
>
> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
> (all formats) plus all available msoffice-x files in govdocs1, yielding
> 10,413 docs.  There were several improvements in text extraction for PDFs
> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>         at java.lang.String.checkBounds(String.java:371)
>         at java.lang.String.<init>(String.java:415)
>         at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>         at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>         at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>         at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
> Sent: Monday, July 28, 2014 12:22 AM
> To: dev@tika.apache.org
> Cc: user@tika.apache.org
> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
>     http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>     [ ] +1 Release this package as Apache Tika 1.6
>     [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>

RE: [VOTE] Apache Tika 1.6 release candidate #1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs.  There were several improvements in text extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
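
A comparison like the one described above can be sketched roughly as follows. This is only an illustration, not the harness that was actually used for the run: the class name, the map layout (file name to "parsed without exception") and all file names except 268620.pptx are assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RunComparison {

    /** Files that parsed cleanly in the baseline run but threw in the candidate run. */
    public static Set<String> regressions(Map<String, Boolean> baselineOk,
                                          Map<String, Boolean> candidateOk) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Boolean> entry : baselineOk.entrySet()) {
            Boolean now = candidateOk.get(entry.getKey());
            // Only count files present in both runs: fine before, failing now.
            if (entry.getValue() && now != null && !now) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Files that threw in the baseline run but parse cleanly in the candidate run. */
    public static Set<String> fixes(Map<String, Boolean> baselineOk,
                                    Map<String, Boolean> candidateOk) {
        // A fix is a regression seen from the other direction.
        return regressions(candidateOk, baselineOk);
    }

    public static void main(String[] args) {
        // Hypothetical per-file outcomes: true means "parsed without exception".
        Map<String, Boolean> tika15 = new TreeMap<>();
        Map<String, Boolean> tika16 = new TreeMap<>();
        tika15.put("268620.pptx", true);
        tika16.put("268620.pptx", false);
        tika15.put("example.pdf", false);
        tika16.put("example.pdf", true);
        tika15.put("example.doc", true);
        tika16.put("example.doc", true);

        System.out.println("regressions: " + regressions(tika15, tika16));
        System.out.println("fixes: " + fixes(tika15, tika16));
    }
}
```

Recording one boolean per file keeps the diff cheap even for a corpus of ten thousand documents; comparing the extracted text itself (for example the PDF spacing changes) would need a separate step.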

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx 

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
	at java.lang.String.checkBounds(String.java:371)
	at java.lang.String.<init>(String.java:415)
	at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
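
The trace is consistent with a length value read from the embedded OLE data being negative by the time it reaches the String constructor. That is a guess from the trace alone, not a confirmed diagnosis of the POI code. The sketch below reproduces that kind of failure in isolation and shows one defensive check; the class and method names are hypothetical, and the ISO-8859-1 charset is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class NegativeLengthSketch {

    /**
     * Decodes length bytes starting at offset, returning null instead of
     * throwing when the declared length cannot be valid for the buffer.
     */
    public static String decodeChecked(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            return null; // declared size is inconsistent with the buffer
        }
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {'T', 'i', 'k', 'a'};
        int bogusLength = -369073454; // the value reported in the stack trace

        try {
            // Unchecked: the String constructor rejects the negative length.
            new String(data, 0, bogusLength, StandardCharsets.ISO_8859_1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked: " + e.getClass().getName());
        }

        System.out.println("checked, bogus length: " + decodeChecked(data, 0, bogusLength));
        System.out.println("checked, valid length: " + decodeChecked(data, 0, data.length));
    }
}
```

Whether a caller should skip the embedded object, log it, or fail the whole parse when the length is invalid is a separate design question that the sketch does not answer.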


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

    http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.
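
For anyone checking the archive before voting, a minimal sketch of the SHA1 comparison is shown here. The expected value is the one quoted above; the class name is hypothetical and the path to the downloaded archive is whatever is passed as the first argument.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Returns the SHA-1 digest of the given bytes as lowercase hex. */
    public static String sha1Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Sha1Check <path-to-archive>");
            return;
        }
        String expected = "076ad343be56a540a4c8e395746fa4fda5b5b6d3";
        // Reads the whole file into memory, which is fine for a sketch.
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equals(expected)
                ? "checksum OK"
                : "checksum MISMATCH: " + actual);
    }
}
```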

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...
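
One reading of the pass condition above, written out as a small sketch: the vote passes when at least three binding +1 votes are cast and they outnumber the binding -1 votes. The class name is hypothetical, and treating +0 votes and non-PMC votes as not counted is an assumption of this sketch.

```java
public class VoteTally {

    /**
     * Returns true when there are at least three binding +1 votes
     * and more binding +1 votes than binding -1 votes.
     */
    public static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        System.out.println("3 for, 0 against: " + passes(3, 0));
        System.out.println("2 for, 0 against: " + passes(2, 0));
        System.out.println("4 for, 4 against: " + passes(4, 4));
    }
}
```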

Thank you!

Cheers,
Chris

P.S. Here is my +1!






Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Tyler Palsulich <tp...@gmail.com>.
Hi All,

After the recent NPE that Chris found (
https://issues.apache.org/jira/browse/TIKA-1378), we should roll an RC#2.

Tyler


On Wed, Jul 30, 2014 at 10:55 AM, Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
>
>> A candidate for the Tika 1.6 release is available at:
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>
>
> Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
> release that it shouldn't be
>
>
>
>  Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>
> Otherwise I'm +1
>
> Nick
>

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
> A candidate for the Tika 1.6 release is available at:
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
release that it shouldn't be.


> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.

Otherwise I'm +1

Nick

Re: [VOTE] Apache Tika 1.6 release candidate #1

Posted by Oleg Tikhonov <ol...@apache.org>.
[x] +1 Release this package as Apache Tika 1.6.

Tested on the following systems:
1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC
2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux

Thanks,
Oleg



On Mon, Jul 28, 2014 at 7:22 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
>     http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>     [ ] +1 Release this package as Apache Tika 1.6
>     [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>