Posted to user@nutch.apache.org by "tech.notyet@foxmail.com" <te...@foxmail.com> on 2013/11/07 07:22:34 UTC

What does the "host" table do in Nutch 2.2.1?

Hi everyone,
I'm new to Nutch and am using Nutch 2.2.1 with HBase as the datastore. After finishing a whole round of crawling, I found two tables, "host" and "webpage", in HBase. I am not sure what the "host" table is for. In which step (inject, generate, fetch, parse, updatedb, updatehostdb) does it get involved? And what does the data stored in the "host" table actually mean? Can anyone share some information? Thanks a lot!


Edward 

Re: What does the "host" table do in Nutch 2.2.1?

Posted by Talat UYARER <ta...@agmlab.com>.
You are welcome, Edward.

If you use the crawl shell script, you can look at
https://github.com/apache/nutch/blob/branch-2.2.1/src/bin/crawl. It does
not call any host function. I tried again, and it does not create the host table.

I think it is created by your second run (the step-by-step one), because
UPDATEHOSTDB needs the host table.

I am not familiar with that, but you are right: 2.x stores everything in the webpage table.

If you want to set special per-host values such as maxThreads, crawlDelay, or
mincrawlDelay, you will need the host table. Otherwise, you don't need it.

Talat
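
Editor's note: the idea of per-host values overriding crawler-wide settings can be sketched in a few lines of Python. This is only an illustration of the concept described above, not Nutch code. The class name `HostSettings`, the function `settings_for`, the in-memory `host_table` dictionary, and all numeric values are assumptions made up for the example.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional
from urllib.parse import urlparse


@dataclass
class HostSettings:
    """Hypothetical per-host values; None means "no override for this host"."""
    max_threads: Optional[int] = None
    crawl_delay: Optional[float] = None      # seconds
    min_crawl_delay: Optional[float] = None  # seconds


# Arbitrary crawler-wide defaults chosen for this sketch (not Nutch's values).
DEFAULTS = HostSettings(max_threads=1, crawl_delay=5.0, min_crawl_delay=0.0)

# Stand-in for an optional "host" table: rows exist only for hosts
# that need special treatment.
host_table: Dict[str, HostSettings] = {
    "slow.example.org": HostSettings(crawl_delay=30.0),
}


def settings_for(url: str) -> HostSettings:
    """Return the defaults, with any per-host overrides applied on top."""
    host = urlparse(url).hostname or ""
    override = host_table.get(host)
    if override is None:
        # No row for this host: the crawler-wide defaults apply unchanged.
        return DEFAULTS
    merged = {}
    for f in fields(HostSettings):
        value = getattr(override, f.name)
        merged[f.name] = value if value is not None else getattr(DEFAULTS, f.name)
    return HostSettings(**merged)
```

A host without a row simply gets the defaults, which matches the point that the table is only needed when some host requires special values.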



Re: Re: What does the "host" table do in Nutch 2.2.1?

Posted by "tech.notyet@foxmail.com" <te...@foxmail.com>.
Thanks, Talat.

I run Nutch 2.2.1 in two ways. One is to run the CRAWL command directly, and the "host" table exists in HBase after the CRAWL command finishes. The other is to run the commands step by step: I start with INJECT, then GENERATE, FETCH, PARSE, UPDATEDB, and UPDATEHOSTDB.
Watching HBase change through these steps, I noticed that the "host" table first shows up after GENERATE, with no content, and it stays
empty until UPDATEHOSTDB runs.

Compared to Nutch 1.7, I think the "webpage" table in Nutch 2.2.1 plays the same role as "CrawlDb" + "LinkDb" in 1.7. Am I right? What really confused me
is the "host" table. So, as you said, I can ignore the "host" table in most cases, right?

Best regards,
Edward
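
Editor's note: the comparison above (one "webpage" table playing the role of CrawlDb plus LinkDb) can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the actual Nutch schema. The field names `status`, `fetch_time`, and `inlinks`, the helper `to_webpage_rows`, and the sample URLs are all invented for the illustration.

```python
from typing import Dict, List

# Simplified 1.x-style layout: two separate stores, both keyed by URL.
crawl_db: Dict[str, dict] = {
    "http://example.org/": {"status": "fetched", "fetch_time": "2013-11-07"},
}
link_db: Dict[str, List[str]] = {
    "http://example.org/": ["http://example.com/links.html"],
}


def to_webpage_rows(crawl: Dict[str, dict], links: Dict[str, List[str]]) -> Dict[str, dict]:
    """Combine crawl state and inlinks into one row per URL."""
    rows: Dict[str, dict] = {}
    for url in set(crawl) | set(links):
        row = dict(crawl.get(url, {}))          # copy the crawl-state fields
        row["inlinks"] = list(links.get(url, []))  # attach the link data
        rows[url] = row
    return rows


# Simplified 2.x-style layout: a single table holding everything per URL.
webpage = to_webpage_rows(crawl_db, link_db)
```

In this model a single lookup by URL returns both the crawl state and the inlinks, which is the sense in which one table replaces the two stores.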


Re: What does the "host" table do in Nutch 2.2.1?

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Edward,

The host table is used for host-based configuration such as maxThreads,
crawlDelay, mincrawlDelay, etc. But this table is optional.

In normal usage the host table is not created. Can you explain how you start
your crawler?

Talat

