Posted to user@nutch.apache.org by Enrico Triolo <en...@gmail.com> on 2006/03/13 16:45:39 UTC

Can't parse html on some urls

Hi, I used Nutch to fetch and index only this page:

http://www.althack.com/index.php?option=com_content&task=view&id=24&Itemid=27

When I run a query that matches this document, the hit is returned
correctly, but I can't get 'clean' content: the extracted text still
includes the HTML markup (*and* the content).
If I perform the same operation on other URLs, everything works as expected.

Here's the code I use to extract the content:

NutchBean bean = ...; // instantiate the bean

// perform the query
...

// take the first hit and load its details
Hit hit = hits.getHit(0);
HitDetails details = bean.getDetails(hit);

// extract the parsed text of the document
String content = bean.getParseText(details).getText();
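
To make the symptom easier to reproduce, here is a minimal sketch of a check that is independent of Nutch. The class name MarkupCheck, the method containsMarkup and the regular expression are assumptions made up for illustration, not part of the Nutch API. It only tests whether a string still contains something shaped like an HTML tag, so it is a rough heuristic rather than a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): a rough heuristic that reports
// whether a string still contains something shaped like an HTML tag.
final class MarkupCheck {

    // Matches an opening, closing or self-closing tag such as
    // <td class="x">, </p> or <br/>. A lone '<' followed by a space
    // (as in "3 < 4") does not match.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    // Returns true if the text appears to contain at least one HTML tag.
    static boolean containsMarkup(String text) {
        return text != null && TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        // Sample inputs only; in the real case the argument would be the
        // 'content' string obtained from getParseText above.
        System.out.println(containsMarkup("<td class=\"x\">Hello</td>"));
        System.out.println(containsMarkup("Hello world, 3 < 4"));
    }
}
```

Calling containsMarkup(content) on the string extracted above should show whether the text for a given URL came back with tags still in it.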

I guess it is a problem in the parsing routine?

Cheers,
Enrico