Posted to user@nutch.apache.org by Foo Bar <fo...@yahoo.com> on 2008/05/19 06:45:39 UTC

How to "add a site" to Nutch?

Hello!

I am brand new to Nutch, so please excuse my ignorance on these things.

Basically, if I want to provide an 'Add Site' link for my customers,
how would I go about implementing it? I assume I should add the URL to a
file in the urls/ directory, and possibly to the crawl-urlfilter.txt file
as well.
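
To make that first step concrete, here is a rough sketch of what I imagine
the 'Add Site' handler doing. It is only an illustration: add_site is a
made-up helper, not something that ships with Nutch, and the seed file name
in the usage line is hypothetical. It also assumes the filter file ends with
the catch-all "-." line from the stock crawl-urlfilter.txt, so the new
accept rule has to go in front of it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): register a new site for crawling.
# Usage: add_site URL SEED_FILE FILTER_FILE
add_site() {
    url=$1; seeds=$2; filter=$3

    # The filter file must already exist; we only add a rule to it.
    [ -f "$filter" ] || return 1

    # Host part of the URL: strip the scheme, then everything from the
    # first "/" or ":" onwards.
    host=$(printf '%s\n' "$url" | sed -e 's|^[a-zA-Z]*://||' -e 's|[/:].*$||')
    [ -n "$host" ] || return 1

    # Append the URL to the seed list unless it is already listed.
    touch "$seeds"
    if ! grep -qxF -- "$url" "$seeds"; then
        printf '%s\n' "$url" >> "$seeds"
    fi

    # Accept rule in the style of the stock crawl-urlfilter.txt entry,
    # with the dots in the host name escaped.
    escaped=$(printf '%s\n' "$host" | sed 's|\.|\\.|g')
    rule="+^http://([a-z0-9]*\\.)*${escaped}/"

    # Nothing more to do if the rule is already present.
    if grep -qxF -- "$rule" "$filter"; then
        return 0
    fi

    # Insert the rule before the first catch-all "-." line, or append it
    # if there is no such line. The rule is passed through the environment
    # so awk does not reinterpret its backslashes.
    RULE="$rule" awk '
        $0 == "-." && !inserted { print ENVIRON["RULE"]; inserted = 1 }
        { print }
        END { if (!inserted) print ENVIRON["RULE"] }
    ' "$filter" > "$filter.tmp" && mv "$filter.tmp" "$filter"
}
```

I would call it as something like
add_site http://www.example.com/ urls/seeds.txt conf/crawl-urlfilter.txt
and then still have to crawl and merge, which is the part I am stuck on.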

And then I should re-index somehow, or - ideally - just do a crawl of
that new site, with the results somehow merged into the old crawl results.

But that seems to be astonishingly difficult to get working.

After googling 're-crawling' and 'index updating' for a while, it appears
that there is no single way to do this that works for everyone. There are
some scripts posted, but they only seem to work for some people, and I am
no exception: the basic re-crawl/re-index scripts I could find don't work
for me for one reason or another.

I tried to use the command tools for merging crawl results, something
like this:

    nutch org.apache.nutch.crawl.CrawlDbMerger ...
    nutch org.apache.nutch.segment.SegmentMerger ...
    nutch org.apache.nutch.crawl.LinkDbMerger ...

And that seemed to work OK, but then I had to merge the indexes:

    nutch org.apache.nutch.indexer.IndexMerger ...

Maybe I am doing something wrong there, but both my old and my new crawl
directories contain an index/ and an indexes/ directory, so now I don't
know how to use the IndexMerger.

For what it's worth: I'm using Nutch 0.9.

Thank you very much!