Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/08/01 21:20:37 UTC

RE: [VOTE] Release Apache POI 3.11 Beta 1

Rat checked out, successful build on linux.

+1... with one reservation


I just ran a freshly updated Tika trunk with the RC for POI 3.11 Beta 1 against a random selection of ~10k files from govdocs1, covering many formats.  There aren't many office-x files, but there are some, and I made sure to include every one in the govdocs1 corpus within the ~10k files.

When comparing with Tika 1.5:
1) There are no new exceptions
2) There are 15 fewer exceptions (some pdf, but mostly POI)
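
A comparison like the one above boils down to a set difference over per-file failures. The following is a minimal sketch, not the tooling actually used for this run. It assumes (hypothetically) that each run has been reduced to a map from file name to the exception class it threw; the class name ExceptionDiff, its methods, and the sample file names are all made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of Tika or POI.
final class ExceptionDiff {

    // Files that threw in the candidate run but not in the baseline run.
    static Set<String> newlyFailing(Map<String, String> baseline, Map<String, String> candidate) {
        Set<String> result = new TreeSet<>(candidate.keySet());
        result.removeAll(baseline.keySet());
        return result;
    }

    // Files that threw in the baseline run but no longer throw in the candidate run.
    static Set<String> fixed(Map<String, String> baseline, Map<String, String> candidate) {
        return newlyFailing(candidate, baseline);
    }

    public static void main(String[] args) {
        // Made-up sample data: file name -> exception class name.
        Map<String, String> baseline = new TreeMap<>();
        baseline.put("example-1.pptx", "java.lang.IllegalStateException");
        baseline.put("example-2.pdf", "java.io.IOException");

        Map<String, String> candidate = new TreeMap<>();
        candidate.put("example-2.pdf", "java.io.IOException");

        System.out.println("new exceptions:   " + newlyFailing(baseline, candidate));
        System.out.println("fixed exceptions: " + fixed(baseline, candidate));
    }
}
```

Note that this sketch only compares which files failed. A file that fails in both runs with a different exception type would not show up in either set.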

The regression I reported on the Tika dev list (http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx) is indeed fixed by POI 3.11 Beta 1.

When I manually compared files with < 90% token overlap, I found improvements in POI's handling of rounding and that the newer version of POI is no longer incorrectly adding a "_" to some numbers in an xls file.
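
For readers unfamiliar with this kind of check, here is a minimal sketch of one way a token-overlap score between two extracted texts could be computed. It is not the code used for the comparison above. The tokenization rule (lowercase, split on anything that is not a letter or digit), the Jaccard-style score, and the names TokenOverlap, tokens, and overlap are all assumptions made for illustration; only the 90% threshold comes from the message.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Tika or POI.
final class TokenOverlap {

    // Lowercase the text and split on runs of characters that are neither letters nor digits.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard-style overlap: |A intersect B| / |A union B|.
    // Two texts with no tokens at all are treated as identical.
    static double overlap(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the text extracted by two versions.
        String oldText = "Total 12.35 approved by reviewer";
        String newText = "Total 12.4 approved by reviewer";
        double score = overlap(oldText, newText);
        System.out.println(score < 0.9 ? "below 90%: compare manually" : "ok");
    }
}
```

Because the score works on sets of distinct tokens, it ignores word order and repetition. That keeps it cheap for screening ~10k files, at the cost of missing differences that only reorder text.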

I found one regression in the handling of an xlsx file:
http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx 

Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the comments.  This suggests that the issue is with POI, but I haven't had a chance to dig in, and unfortunately, I don't think I will have a chance until Monday.


      Best,

                Tim




-----Original Message-----
From: Nick Burch [mailto:nick@apache.org] 
Sent: Friday, August 01, 2014 5:33 AM
To: dev@poi.apache.org
Subject: [VOTE] Release Apache POI 3.11 Beta 1

Hi All

It has been almost half a year since our last release, so as previously 
discussed it seems time for another beta.

The release candidate for this release is available from:
   https://dist.apache.org/repos/dist/dev/poi/3.11-beta1-RC1/

And the tag in SVN from which it was built is:
   https://svn.apache.org/repos/asf/poi/tags/REL_3_11_BETA1

As with all Apache release votes, please check that not only does the
code work, and no major breakages have occurred since the last
release, but also that packaging is correct, license headers and
notices exist etc.

The vote will be open for 72 hours, until the end of Sunday 3rd August. 
(It's a slightly shorter vote than normal, as Apache Tika is waiting on a 
bug fix in the release before they roll Tika 1.6!)

The vote options are:
  +1  - I support this release
   0  - I don't object to this release, but I haven't checked it
  -1  - There's a problem with the release, and that is ....

Votes are welcomed (and encouraged) from everyone, committer or not!

Thanks
Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Tika regression test on POI 3.11 Beta 1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Great to hear!  Maybe we just need to update something on the Tika side to grab the cell comments: http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java ?

The token check is my primordial TIKA-1302 code, not yet released.

To see the results you'll need to apply Nick's patch with my slight modifications in TIKA-1380 to Tika, build it, and then run it: java -jar tika-app....jar.  Drop the xlsx file in and you'll see no comment text, whereas you will with Tika 1.5's app jar.

-----Original Message-----
From: Andreas Beeker [mailto:andreas.beeker@gmx.de] 
Sent: Friday, August 01, 2014 4:59 PM
To: dev@tika.apache.org
Cc: POI Developers List
Subject: Tika regression test on POI 3.11 Beta 1

Hi Tim,

(thread base [1])

> I found one regression in the handling of an xlsx file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx

> Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the comments.  This suggests that the issue is with POI, but I haven't had a chance to dig in, and unfortunately, I don't think I will have a chance until Monday.


A quick check on the mentioned file [2] didn't turn up any problems with the extraction of cell comments.
I used trunk, which hasn't changed much since Beta 1, and tried it on Windows with JDK 1.6.0_45 / 1.7.0_45.

I haven't used Tika and its unit tests before. Could you point out how I can reproduce the differences in the token check?

By comments you mean cell comments, right?

Andi.


[1] http://apache-poi.1045710.n5.nabble.com/VOTE-Release-Apache-POI-3-11-Beta-1-td5716184.html
[2] http://pastebin.com/CDNkhRNz




Tika regression test on POI 3.11 Beta 1

Posted by Andreas Beeker <an...@gmx.de>.
Hi Tim,

(thread base [1])

> I found one regression in the handling of an xlsx file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx

> Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the comments.  This suggests that the issue is with POI, but I haven't had a chance to dig in, and unfortunately, I don't think I will have a chance until Monday.


A quick check on the mentioned file [2] didn't turn up any problems with the extraction of cell comments.
I used trunk, which hasn't changed much since Beta 1, and tried it on Windows with JDK 1.6.0_45 / 1.7.0_45.

I haven't used Tika and its unit tests before. Could you point out how I can reproduce the differences in the token check?

By comments you mean cell comments, right?

Andi.


[1] http://apache-poi.1045710.n5.nabble.com/VOTE-Release-Apache-POI-3-11-Beta-1-td5716184.html
[2] http://pastebin.com/CDNkhRNz



Re: [VOTE] Release Apache POI 3.11 Beta 1

Posted by Dominik Stadler <do...@gmx.at>.
+1

I ran some test suites for a number of projects using POI that I
maintain, mostly XLS/XLSX: some reading, some writing, a bit of
formulas, some styling and images. Everything looks good compared to
3.10, which I am using currently.

Thanks... Dominik.

On Sun, Aug 3, 2014 at 4:09 PM, Andreas Beeker <an...@gmx.de> wrote:
> On 03.08.2014 14:24, Nick Burch wrote:
>>
>>
>> Tika used to look up cell comments manually in the xlsx extractor, but
>> that logic has now been moved into the POI xlsx event handler. My hunch is
>> there's something not quite right in that, that's probably the place to look
>> + write unit test for!
>>
>
> Have you noticed my patch in TIKA-1380 [1] ... or is it still - with the
> patch applied - not working?
>
> [1] https://issues.apache.org/jira/browse/TIKA-1380
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>



Re: [VOTE] Release Apache POI 3.11 Beta 1

Posted by Andreas Beeker <an...@gmx.de>.
On 03.08.2014 14:24, Nick Burch wrote:
>
> Tika used to look up cell comments manually in the xlsx extractor, but that logic has now been moved into the POI xlsx event handler. My hunch is there's something not quite right in that, that's probably the place to look + write unit test for!
>

Have you noticed my patch in TIKA-1380 [1] ... or is it still - with the patch applied - not working?

[1] https://issues.apache.org/jira/browse/TIKA-1380



RE: [VOTE] Release Apache POI 3.11 Beta 1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, many thanks to Andi for his patch to Tika!

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Sunday, August 03, 2014 8:24 AM
To: POI Developers List
Subject: RE: [VOTE] Release Apache POI 3.11 Beta 1

On Fri, 1 Aug 2014, Allison, Timothy B. wrote:
> I found one regression in the handling of an xlsx file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx
>
> Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, 
> whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the 
> comments.  This suggests that the issue is with POI, but I haven't had a 
> chance to dig in, and unfortunately, I don't think I will have a chance 
> until Monday.

Tika used to look up cell comments manually in the xlsx extractor, but 
that logic has now been moved into the POI xlsx event handler. My hunch is 
there's something not quite right in that, that's probably the place to 
look + write unit test for!

Nick



RE: [VOTE] Release Apache POI 3.11 Beta 1

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 1 Aug 2014, Allison, Timothy B. wrote:
> I found one regression in the handling of an xlsx file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx
>
> Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, 
> whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the 
> comments.  This suggests that the issue is with POI, but I haven't had a 
> chance to dig in, and unfortunately, I don't think I will have a chance 
> until Monday.

Tika used to look up cell comments manually in the xlsx extractor, but 
that logic has now been moved into the POI xlsx event handler. My hunch is 
there's something not quite right in that, that's probably the place to 
look + write unit test for!

Nick


