Posted to dev@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2012/08/22 10:59:54 UTC

Two questions about Nutch

Hi everyone,
      I have two questions that have confused me for weeks. If anyone here
can help, thank you very much!
      First, I know that Nutch does not store the HTTP code itself. Instead,
it encodes the outcome as a single status byte. Suppose Nutch fetches a bad
link whose HTTP status is not 200 (e.g. 203, 307, 404 ...), or fetches a link
that is denied by robots.txt or throttled by the website because of
too-frequent fetching. How can we distinguish between these conditions from
that status byte (e.g. db_status_gone, db_redir_temp)?
      Second, I know a little about the ranking and scoring mechanism in
Nutch. I understand that LinkRank is the main algorithm, but it is just a
single score factor in Nutch's index system. What other factors affect
indexing and search in Nutch? The webgraph has not yet been ported to the
GORA-based API in Nutch 2.0. What results should we expect if we index and
search with Nutch 2.0?



--
View this message in context: http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

RE: Two questions about Nutch

Posted by Markus Jelsma <ma...@openindex.io>.
Hi weishenyun,

See inline:

Markus
 
 
-----Original message-----
> From:weishenyun <wl...@yahoo.com.cn>
> Sent: Wed 22-Aug-2012 11:02
> To: dev@nutch.apache.org
> Subject: Two questions about Nutch
> 
> Hi everyone,
>       I have two questions that have confused me for weeks. If anyone here
> can help, thank you very much!
>       First, I know that Nutch does not store the HTTP code itself. Instead,
> it encodes the outcome as a single status byte. Suppose Nutch fetches a bad
> link whose HTTP status is not 200 (e.g. 203, 307, 404 ...), or fetches a link
> that is denied by robots.txt or throttled by the website because of
> too-frequent fetching. How can we distinguish between these conditions from
> that status byte (e.g. db_status_gone, db_redir_temp)?

Only in the fetcher can you distinguish between HTTP status codes and non-HTTP statuses, such as being denied by robots.txt or a problem with the robots crawl delay.
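
To illustrate why the distinction is only available at fetch time, here is a minimal sketch of a many-to-one mapping from fetch outcomes to a stored status. The outcome names, the status names, and the mapping itself are assumptions made up for illustration; they are not Nutch's actual tables.

```python
# Hypothetical fetch-time outcomes mapped to a coarser stored status.
# Both the names and the mapping are illustrative assumptions, not Nutch's code.
FETCH_OUTCOME_TO_DB_STATUS = {
    "http_200": "db_fetched",
    "http_301": "db_redir_perm",
    "http_307": "db_redir_temp",
    "http_404": "db_gone",
    "robots_denied": "db_gone",
    "crawl_delay_too_long": "db_gone",
}


def to_db_status(outcome):
    """Collapse a detailed fetch outcome into the coarse stored status."""
    return FETCH_OUTCOME_TO_DB_STATUS[outcome]


def possible_outcomes(db_status):
    """List every fetch outcome that could have produced a stored status."""
    return sorted(
        outcome
        for outcome, status in FETCH_OUTCOME_TO_DB_STATUS.items()
        if status == db_status
    )


if __name__ == "__main__":
    # Several different outcomes share one stored status, so the stored
    # status alone cannot tell them apart.
    print(possible_outcomes("db_gone"))
```

Under these assumptions a 404 and a robots denial end up with the same stored value, so anything that needs the original reason has to record it while fetching.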

>       Second, I know a little about the ranking and scoring mechanism in
> Nutch. I understand that LinkRank is the main algorithm, but it is just a
> single score factor in Nutch's index system. What other factors affect
> indexing and search in Nutch?

We also use LinkRank to aggregate a score per host, and use that host score to select a master host when deduplicating hosts. Among the duplicates, the host with the highest score prevails and the others are removed.
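
A rough sketch of that idea, not our actual code: the function names are hypothetical, and summing page scores is only an assumed way of aggregating a score per host.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_scores(page_scores):
    """Aggregate per-URL scores into one score per host.

    Summation is an assumption; any other aggregate could be used.
    """
    totals = defaultdict(float)
    for url, score in page_scores.items():
        totals[urlsplit(url).hostname] += score
    return dict(totals)


def pick_master(duplicate_hosts, scores):
    """Return (master, removed) for a group of hosts known to be duplicates."""
    # The host with the highest aggregated score prevails.
    master = max(duplicate_hosts, key=lambda host: scores.get(host, 0.0))
    removed = [host for host in duplicate_hosts if host != master]
    return master, removed


if __name__ == "__main__":
    pages = {
        "http://www.example.org/a": 0.5,
        "http://www.example.org/b": 0.25,
        "http://example.org/a": 0.5,
    }
    scores = host_scores(pages)
    print(pick_master(["example.org", "www.example.org"], scores))
```

How the group of duplicate hosts is detected in the first place is outside the scope of this sketch.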

> The webgraph has not yet been ported to the GORA-based API in Nutch 2.0.
> What results should we expect if we index and search with Nutch 2.0?

You would still get decent or even good search results if you configure your weights properly. Keep in mind that LinkRank is not meant for scoring URLs within a domain or host but across domains, so it is more of an internet-scale scoring algorithm.

We don't use LinkRank for our site search services.

