Posted to dev@nutch.apache.org by Li Li <fa...@gmail.com> on 2014/05/05 09:53:25 UTC

some questions about nutch?

I have designed a vertical spider and am interested in Nutch's
architecture. After reading some introductions, I have some questions.
1. Why does Nutch 2.x use third-party databases such as HBase/Cassandra?
    As far as I know, Nutch 1.x stores its data in HDFS and manages it
itself. Using a NoSQL store like HBase makes the data easier to manage,
but will it become slower? In 1.x, all changes to the urldb are executed
by MapReduce as batch operations, while in HBase/Cassandra access is
random.
2. How does Nutch store data in 1.x and 2.x?
    In my design, there are web pages and links between them. How does
Nutch 1.x store them? What are the HBase/Cassandra table definitions in 2.x?
3. One cycle of Nutch is serial; why is it not concurrent?
    One cycle of a crawl includes:
    1. select URLs to be fetched
    2. fetch these URLs
    3. extract child URLs
    4. updatedb
4. How are the most 'important' URLs selected from the urldb?
    After a few cycles, there are millions of URLs waiting to be fetched.
How are the most 'important' ones chosen to be fetched first? Does Nutch
support custom plugins to do this?
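
To make questions 3 and 4 concrete, here is a minimal sketch of the cycle as I
understand it. It is not Nutch code: the in-memory urldb dict, the toy web
graph, the function names, and the scoring rule (one point per inlink) are all
assumptions made up for illustration.

```python
import heapq

def select_top_n(urldb, n):
    """Step 1: pick the n unfetched URLs with the highest score."""
    unfetched = [(url, rec["score"]) for url, rec in urldb.items()
                 if not rec["fetched"]]
    best = heapq.nlargest(n, unfetched, key=lambda pair: pair[1])
    return [url for url, _ in best]

def run_cycle(urldb, web, n):
    """One serial cycle: select -> fetch -> extract -> updatedb."""
    batch = select_top_n(urldb, n)                     # 1. select
    pages = {url: web.get(url, []) for url in batch}   # 2. fetch (toy: dict lookup)
    for url, outlinks in pages.items():                # 3. extract child URLs
        urldb[url]["fetched"] = True
        for child in outlinks:                         # 4. updatedb
            rec = urldb.setdefault(child, {"score": 0.0, "fetched": False})
            rec["score"] += 1.0  # assumed scoring rule: count inlinks
    return batch

# Hypothetical web graph: page -> list of outlinks.
web = {"a": ["c", "d"], "b": ["c"]}
urldb = {
    "a": {"score": 1.0, "fetched": False},
    "b": {"score": 1.0, "fetched": False},
}

first = run_cycle(urldb, web, n=2)   # fetches the two seeds
second = run_cycle(urldb, web, n=1)  # "c" has two inlinks, "d" only one
```

In this sketch each step waits for the previous one to finish, which is the
serial behaviour question 3 asks about, and the score-based selection in
step 1 is the kind of logic I would like to customize in question 4.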