Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/15 21:49:01 UTC

bug in parse-tika or Tika RTFParser?

Hi,

For some time (in 2.x) we have had this test commented out while
waiting for TIKA-748 to be resolved. That issue has now been resolved,
but I'm getting some confusing output when trying to resurrect the
test!

So @line 105 we do

String text = parse.getText();
assertEquals("The quick brown fox jumps over the lazy dog", text.trim());

But I was wanting to implement the suggested test for title e.g.

String title = parse.getTitle();
String text = parse.getText();
assertEquals("test rft document", title);
assertEquals("The quick brown fox jumps over the lazy dog", text.trim());

The test fails on the 2nd assertion with the following

Testcase: testIt took 5.668 sec
	FAILED
null expected:<[The quick brown fox jumps over the lazy dog]> but
was:<[test rft document]>
junit.framework.ComparisonFailure: null expected:<[The quick brown fox
jumps over the lazy dog]> but was:<[test rft document]>
	at org.apache.nutch.parse.tika.TestRTFParser.testIt(TestRTFParser.java:)

So this looks like parse.getText() returns the same (in this instance)
as parse.getTitle()... which smells like rotting herring to me.

Any immediate thoughts whether this is a known problem in the Tika RTF
parser, parse-tika's DomContentUtils class or somewhere in between?

Thank you

Lewis

-- 
Lewis

Re: bug in parse-tika or Tika RTFParser?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Ken,

On Thu, Aug 16, 2012 at 1:28 AM, Ken Krugler
<kk...@transpac.com> wrote:

> For many Tika parsers, the text you get back from the document starts with
> the title (if any), and then contains the body.

For clarity, the document we are testing can be found here [0].
The title field contains the text "test rft document" and the subject
field contains "subject tests"; the text field then contains "The quick
brown fox…". However, I'm not sure whether it's the structure of the
document that is throwing this one off.

There is no doubt about it, doing parse.getText() definitely returns
the text contained within the title field.

>
> So I'm wondering if what you're seeing in the test failure is that the
> parse.getText() result is actually "test rtf document\nThe quick brown fox…"
>
> -- Ken

[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/parse-tika/sample/test.rtf

Re: bug in parse-tika or Tika RTFParser?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Lewis,

[Moving to the dev list]

For many Tika parsers, the text you get back from the document starts with the title (if any), and then contains the body.

So I'm wondering if what you're seeing in the test failure is that the parse.getText() result is actually "test rtf document\nThe quick brown fox…"
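
One quick way to check this is sketched below. It is only an illustration, not code from Nutch or Tika: the class name TitlePrefixCheck and both helpers are hypothetical, and the "combined" string is an assumed getText() result rather than real parser output. The first helper makes line breaks visible so a title/body split shows up in the test log, and the second strips a leading title before the body is compared.

```java
public class TitlePrefixCheck {

    /** Removes a leading title (and the whitespace after it) from extracted text, if present. */
    static String stripLeadingTitle(String text, String title) {
        String trimmed = text.trim();
        if (title != null && !title.isEmpty() && trimmed.startsWith(title)) {
            return trimmed.substring(title.length()).trim();
        }
        return trimmed;
    }

    /** Replaces newline, carriage return and tab characters with visible escape sequences. */
    static String visible(String text) {
        return text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        String title = "test rft document";
        String body = "The quick brown fox jumps over the lazy dog";

        // Assumption: what getText() would return if the title is prepended to the body.
        String combined = title + "\n" + body;

        // Prints: test rft document\nThe quick brown fox jumps over the lazy dog
        System.out.println(visible(combined));

        // Prints: true
        System.out.println(stripLeadingTitle(combined, title).equals(body));
    }
}
```

If stripLeadingTitle() returns an empty string for the real getText() output, then the text holds nothing but the title and the body is missing altogether, which would point at a different problem from the one I describe above.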

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr