Posted to user@nutch.apache.org by mohammad_108 <mo...@yahoo.com> on 2009/02/08 14:05:30 UTC

Extracting the whole text of HTML documents when crawling

I am quite new to Nutch. After some effort, I managed to install
Cygwin, Tomcat, and Nutch. I ran a crawl of apache.org and got a large
number of files, but I don't even know how to read them. I have realized
that they are index files and that I need to learn Lucene; however, I am
not familiar with Lucene or Java either.
I want to crawl the web for a keyword, extract the plain text of each
HTML document (with the markup stripped), and concatenate the text of the
documents. I don't know how to do this.
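
To make the goal concrete, here is a minimal sketch, in Python and outside of Nutch, of the kind of output I mean. It assumes the pages are already available as HTML strings, so it says nothing about how to get them out of the Nutch crawl data, which is the part I am missing. The names TextExtractor, html_to_text, and concatenate_matching are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces and collapse runs of whitespace.
        # Simplification: a word split by inline markup gets a space in it.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of one HTML document with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def concatenate_matching(pages, keyword):
    """Concatenate the text of every page whose text contains the keyword."""
    texts = (html_to_text(page) for page in pages)
    return "\n\n".join(t for t in texts if keyword.lower() in t.lower())
```

For example, calling concatenate_matching on a list of page strings with the keyword "nutch" would return the text of the matching pages separated by blank lines.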