Posted to user@nutch.apache.org by Enrico Triolo <en...@gmail.com> on 2006/03/13 16:45:39 UTC
Can't parse html on some urls
Hi, I used Nutch to fetch and index only this page:
http://www.althack.com/index.php?option=com_content&task=view&id=24&Itemid=27
When I run a query that matches this document, the hit is returned correctly,
but I can't get the 'clean' text content: the parse text still contains the
raw HTML markup (*and* the content).
If I perform the same operation on other URLs, everything works as expected.
Here's the code I use to extract the content:
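NutchBean bean = ...; // Instantiate the bean
// Perform the query
...
// Take the first hit and look up its stored details
Hit hit = hits.getHit(0);
HitDetails details = bean.getDetails(hit);
// Parse text for the hit; expected to be plain text without markup
String content = bean.getParseText(details).getText();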
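To confirm that the extracted text really still contains markup, a check along these lines could be run on `content`. This is only a minimal sketch: `MarkupCheck` and `looksLikeHtml` are hypothetical names of my own, not part of Nutch, and the regular expression is a rough heuristic for tag-shaped substrings rather than a real HTML parser.

```java
import java.util.regex.Pattern;

public class MarkupCheck {
    // Heuristic: something shaped like an opening or closing tag,
    // e.g. <p>, </div>, <br/>, <a href="...">. Not a real HTML parser.
    private static final Pattern TAG =
        Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    // Returns true if the text contains at least one tag-shaped substring.
    public static boolean looksLikeHtml(String text) {
        return text != null && TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("<p>Hello</p>"));           // true
        System.out.println(looksLikeHtml("Hello, 1 < 2 and 3 > 2")); // false
    }
}
```

Calling `MarkupCheck.looksLikeHtml(content)` on the string extracted above would show whether the parse text for a given hit was stored with tags left in.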
I guess it is a problem in the parsing routine?
Cheers,
Enrico