Posted to dev@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2012/08/22 10:59:54 UTC
Two questions about Nutch
Hi everyone here:
I have two questions that have confused me for weeks. If anyone here can
help me, thanks so much!
First, I understand that Nutch does not store the HTTP status code itself.
Instead, it encodes it as a single status byte. Suppose Nutch fetches a bad link
whose HTTP status is not 200 (e.g. 203, 307, 404 ...), or fetches a link that
is denied by robots or throttled by the website because of frequent fetching. How
can we distinguish between these conditions from that status byte (e.g.
db_status_gone, db_redir_temp)?
Second, I know a little about the ranking and scoring mechanism in Nutch. I
understand that LinkRank is the main algorithm, but it is
just a single score factor in Nutch's index system. What other
factors affect indexing and search in Nutch? Also, the webgraph has not yet been
ported to the GORA-based API in Nutch 2.0. What results can we expect if we index
and search in Nutch 2.0?
--
View this message in context: http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
RE: Two questions about Nutch
Posted by Markus Jelsma <ma...@openindex.io>.
Hi weishenyun,
See inline:
Markus
-----Original message-----
> From:weishenyun <wl...@yahoo.com.cn>
> Sent: Wed 22-Aug-2012 11:02
> To: dev@nutch.apache.org
> Subject: Two questions about Nutch
>
> Hi everyone here:
> I have two questions that have confused me for weeks. If anyone here can
> help me, thanks so much!
> First, I understand that Nutch does not store the HTTP status code itself.
> Instead, it encodes it as a single status byte. Suppose Nutch fetches a bad link
> whose HTTP status is not 200 (e.g. 203, 307, 404 ...), or fetches a link that
> is denied by robots or throttled by the website because of frequent fetching. How
> can we distinguish between these conditions from that status byte (e.g.
> db_status_gone, db_redir_temp)?
Only in the fetcher can you distinguish between HTTP status codes and non-HTTP conditions, such as being denied by robots or a problem with the robots crawl delay.
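
To illustrate why the detail is only visible at fetch time, here is a minimal sketch of how several different conditions can collapse into the same coarse status. The class name, the enum values and the mapping itself are assumptions made up for this example; they are not Nutch's actual status table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: these names and this mapping are assumptions,
// not Nutch's actual status table.
final class FetchStatusSketch {

    // Coarse outcome of the kind that survives after the fetch step.
    enum CoarseStatus { FETCHED, REDIR_PERM, REDIR_TEMP, GONE, RETRY }

    // Collapse an HTTP response code into a coarse status (assumed mapping).
    static CoarseStatus fromHttpCode(int code) {
        if (code >= 200 && code < 300) return CoarseStatus.FETCHED;
        if (code == 301) return CoarseStatus.REDIR_PERM;
        if (code >= 300 && code < 400) return CoarseStatus.REDIR_TEMP;
        if (code == 404 || code == 410) return CoarseStatus.GONE;
        return CoarseStatus.RETRY;
    }

    // A robots.txt denial has no HTTP code; here it is assumed to end up as GONE.
    static CoarseStatus fromRobotsDenied() {
        return CoarseStatus.GONE;
    }

    public static void main(String[] args) {
        // Keep the fetch-time detail next to the coarse status it collapses into.
        Map<String, CoarseStatus> outcomes = new LinkedHashMap<>();
        for (int code : new int[] {200, 203, 307, 404}) {
            outcomes.put("HTTP " + code, fromHttpCode(code));
        }
        outcomes.put("robots denied", fromRobotsDenied());
        outcomes.forEach((detail, status) -> System.out.println(detail + " -> " + status));
    }
}
```

In this sketch a 404 and a robots denial both end up as GONE, so once only the coarse status is kept there is nothing left to tell them apart. Anything you want to know about the exact cause would have to be logged or recorded during the fetch itself.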
> Second, I know a little about the ranking and scoring mechanism in Nutch. I
> understand that LinkRank is the main algorithm, but it is
> just a single score factor in Nutch's index system. What other
> factors affect indexing and search in Nutch?
We also use LinkRank to aggregate a score per host, and use that host score to select a master host when deduplicating hosts. Among the duplicates, the host with the highest score prevails and the others are removed.
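
The host selection described above could be sketched roughly as follows. This is not the actual implementation: the class and method names are hypothetical, and summing page scores per host is an assumption, since the aggregation function is not stated here.

```java
import java.net.URI;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of host-level score aggregation and master-host selection.
final class HostDedupSketch {

    // Aggregate per-URL scores into one score per host (assumed: plain sum).
    static Map<String, Double> aggregateByHost(Map<String, Double> pageScores) {
        Map<String, Double> hostScores = new HashMap<>();
        for (Map.Entry<String, Double> e : pageScores.entrySet()) {
            String host = URI.create(e.getKey()).getHost();
            if (host == null) continue; // skip URLs without a host part
            hostScores.merge(host, e.getValue(), Double::sum);
        }
        return hostScores;
    }

    // Among hosts known to be duplicates, keep the one with the highest score.
    // Hosts without a score count as 0.0; the first host wins a tie.
    static String pickMaster(Collection<String> duplicateHosts, Map<String, Double> hostScores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String host : duplicateHosts) {
            double score = hostScores.getOrDefault(host, 0.0);
            if (score > bestScore) {
                best = host;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> pageScores = new HashMap<>();
        pageScores.put("http://www.example.org/a", 0.5);
        pageScores.put("http://www.example.org/b", 0.25);
        pageScores.put("http://example.org/", 0.5);

        Map<String, Double> hostScores = aggregateByHost(pageScores);
        String master = pickMaster(List.of("example.org", "www.example.org"), hostScores);
        System.out.println("master host: " + master);
    }
}
```

How the duplicate hosts are detected in the first place is outside this sketch; it only shows the "highest host score prevails" step.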
> Also, the webgraph has not yet been
> ported to the GORA-based API in Nutch 2.0. What results can we expect if we index
> and search in Nutch 2.0?
You would still get decent or even good search results if you configure your weights properly. Keep in mind that LinkRank is not meant for scoring URLs within a single domain or host but across domains, so it is more of an internet-scale scoring algorithm.
We don't use LinkRank for our site search services.