Posted to user@nutch.apache.org by Jim McHale <mc...@googlemail.com> on 2008/07/21 12:58:04 UTC

Using Nutch to Index Web Documents Excluding HTML?

Hi Nutch User List,

I was wondering whether anyone has used Nutch to fetch, parse and
index only documents of types other than HTML (e.g. PDF, MS Word or
MS Excel)?

I've been looking into ways of potentially implementing this. My
initial idea was to disable the HTML MIME type in Nutch in order to
'ignore' this type of content. However, it quickly dawned on me that
if I don't fetch the HTML pages then I will not be able to get the URL
links to the other documents contained in the websites specified in my
urls-nutch.txt file.

The only other option I thought of was to index HTML along with all
the other document file types but exclude the HTML MIME type from any
search results. I guess that this could give me the flexibility of
including HTML at a later stage, but otherwise it leaves me with an
index that is much larger than it 'needs' to be.
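
To make the idea concrete, here is a minimal sketch of the decision I
have in mind: HTML pages would still be fetched and parsed so that their
links can be followed, and a check on the content type would decide
whether a document is kept for the index or shown in results. This is
plain Java and does not use any Nutch API; the class name
MimeTypeIndexPolicy, the method names and the list of excluded types are
all hypothetical, and where such a check would actually plug into Nutch
is exactly what I am unsure about.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides by MIME type whether a
// fetched document should be kept for indexing / search results.
public class MimeTypeIndexPolicy {

    // Assumed list of types that are crawled for their links only.
    private static final Set<String> EXCLUDED =
            Set.of("text/html", "application/xhtml+xml");

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    public static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True if the document should be indexed. Documents with a missing
    // content type are skipped here; that is an assumption, not a rule.
    public static boolean shouldIndex(String contentType) {
        String type = normalize(contentType);
        return !type.isEmpty() && !EXCLUDED.contains(type);
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=UTF-8",
            "application/pdf",
            "application/msword",
            "application/vnd.ms-excel"
        };
        for (String s : samples) {
            System.out.println(s + " -> index? " + shouldIndex(s));
        }
    }
}
```

Applied at indexing time, a check like this would keep the index small;
applied at search time, it would match the "index everything, hide HTML"
option above.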

Is there a way of excluding HTML that I am missing? Does anyone have
experience of doing something like this or an opinion they would like
to share?

Thanks in advance,
Jim