Posted to dev@tika.apache.org by Andreas Beeker <an...@gmx.de> on 2014/08/01 22:59:26 UTC

Tika regression test on POI 3.11 Beta 1

Hi Tim,

(thread base [1])

> I found one regression in the handling of an xlsx file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/598/598948.xlsx

> Tika 1.6 w/ POI 3.11 Beta 1 is not extracting the comments in this file, whereas Tika 1.5 (and Tika 1.6 w/ POI 3.10-Final) did extract the comments.  This suggests that the issue is with POI, but I haven't had a chance to dig in, and unfortunately, I don't think I will have a chance until Monday.


A quick check of the mentioned file [2] didn't turn up any problems with the extraction of cell comments.
I used trunk, which hasn't changed much since Beta 1, and tried it on Windows with JDK 1.6.0_45 and 1.7.0_45.

I haven't used Tika and its unit tests before. Could you point out how I can reproduce the differences in the token check?

By comments you mean cell comments, right?
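
To check that a file carries cell comments at all, independently of POI and Tika, one can read the comments part straight out of the xlsx zip container. The following is only a minimal sketch using the JDK alone. It assumes the conventional part name xl/comments*.xml (producers may name the part differently), and the class name XlsxCommentDump and the sample() helper are made up for illustration; sample() builds just the one zip entry, not a valid workbook.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XlsxCommentDump {
    // SpreadsheetML main namespace used by the comments part.
    static final String NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    /** Returns one "cellRef: text" string per cell comment found in xl/comments*.xml. */
    static List<String> cellComments(InputStream xlsx) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be untrusted.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        List<String> out = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                // Assumption: comment parts follow the usual xl/commentsN.xml naming.
                if (!e.getName().matches("xl/comments\\d*\\.xml")) continue;
                // Buffer the entry so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                for (int n; (n = zip.read(chunk)) != -1; ) buf.write(chunk, 0, n);
                NodeList comments = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()))
                        .getElementsByTagNameNS(NS, "comment");
                for (int i = 0; i < comments.getLength(); i++) {
                    Element c = (Element) comments.item(i);
                    // A comment's text may be split over several <t> runs; join them.
                    StringBuilder text = new StringBuilder();
                    NodeList runs = c.getElementsByTagNameNS(NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    out.add(c.getAttribute("ref") + ": " + text);
                }
            }
        }
        return out;
    }

    /** Hypothetical in-memory sample: a zip holding only a comments part. */
    static byte[] sample() throws Exception {
        String xml = "<comments xmlns=\"" + NS + "\"><authors><author>A</author></authors>"
                + "<commentList><comment ref=\"B2\" authorId=\"0\"><text>"
                + "<r><t>Check </t></r><r><t>this value</t></r>"
                + "</text></comment></commentList></comments>";
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("xl/comments1.xml"));
            zos.write(xml.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, dump that file; otherwise use the built-in sample.
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(sample())) {
            for (String line : cellComments(in)) System.out.println(line);
        }
    }
}
```

Run against the downloaded 598948.xlsx, this would list whatever the comments part contains, which helps separate "the file has no cell comments" from "the extractor is not emitting them".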

Andi.


[1] http://apache-poi.1045710.n5.nabble.com/VOTE-Release-Apache-POI-3-11-Beta-1-td5716184.html
[2] http://pastebin.com/CDNkhRNz



RE: Tika regression test on POI 3.11 Beta 1

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Great to hear!  Maybe we just need to update something on the Tika side to grab the cell comments: http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java ?

The token check is my primordial TIKA-1302 code, not yet released.

To see the results, you'll need to apply Nick's patch with my slight modifications in TIKA-1380 to Tika, build it, and then run it: java -jar tika-app....jar.  Drop the xlsx file in and you'll see no comment text, whereas you will with Tika 1.5's app jar.
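
For a rough idea of what a token-level comparison between the two runs could look like, here is a small sketch. It is not the TIKA-1302 code; the class name TokenDiff, its methods, and the sample strings are all hypothetical. It counts tokens in the text extracted by the older build and reports those that appear less often (or not at all) in the text from the newer build.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TokenDiff {
    /** Lower-cases the text, splits on non-letter/non-digit runs, and counts tokens. */
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> m = new TreeMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) m.merge(tok, 1, Integer::sum);
        }
        return m;
    }

    /** Tokens that occur more often in 'before' than in 'after', with the shortfall. */
    static Map<String, Integer> missing(String before, String after) {
        Map<String, Integer> a = counts(after);
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts(before).entrySet()) {
            int lost = e.getValue() - a.getOrDefault(e.getKey(), 0);
            if (lost > 0) out.put(e.getKey(), lost);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the text extracted by the two builds.
        String withComments = "Total 42 Comment by reviewer: check total";
        String withoutComments = "Total 42";
        System.out.println(missing(withComments, withoutComments));
    }
}
```

A non-empty result for a file flags it as a candidate regression, such as comment text that one build emits and the other does not.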

-----Original Message-----
From: Andreas Beeker [mailto:andreas.beeker@gmx.de] 
Sent: Friday, August 01, 2014 4:59 PM
To: dev@tika.apache.org
Cc: POI Developers List
Subject: Tika regression test on POI 3.11 Beta 1



