Posted to user@nutch.apache.org by bhavin pandya <bv...@gmail.com> on 2009/12/09 10:22:35 UTC

How to get all the crawled pages for a particular domain

Hi,

I have set up Nutch 1.0 on a cluster of 3 nodes.

We are running two applications.

1. A Nutch-based search application.
We have successfully crawled approx. 25M pages on the 3 nodes.
It is working as expected.

2. An application which needs to extract some information
for a particular domain.
To date this application has used a Heritrix-based crawler: it crawls
the given domain, then our algorithms go through the pages and extract
the required information.

Since we are already crawling with Nutch in distributed mode, we don't
want to recrawl with another tool like Heritrix for the 2nd application.
I want to reuse the same crawled data for the 2nd application as well.

But the extraction algorithms require all the crawled pages for a
particular domain, to extract all relevant information about that
domain.

I thought that if, by writing some plugin in Nutch, I could feed the
Nutch crawled data to the 2nd application, it would really save us
work, money and effort by not recrawling.

But how do I get all the crawled pages for a particular domain in my
plugin? Where should I look in the Nutch code?

Any pointer / idea in this direction would really help.

Thanks.
Bhavin




-- 
- Bhavin
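[Editorial sketch] One way to approach the question above, under assumptions: Nutch 1.0 ships a SegmentReader (exposed as `bin/nutch readseg -dump`) that can dump segment content as text, after which the records can be filtered by the URL's host. The host check itself is simple; the following is a minimal, self-contained illustration of that filtering step, not Nutch's API (class and method names here are invented for the example):

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

public class DomainFilter {
    // True if the URL's host is exactly the given domain, or a subdomain of it.
    // The "." prefix in endsWith prevents false matches like notexample.com.
    static boolean inDomain(String url, String domain) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) return false;
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: skip it
        }
    }

    // Keep only the URLs that belong to the given domain.
    static List<String> filterByDomain(List<String> urls, String domain) {
        return urls.stream()
                   .filter(u -> inDomain(u, domain))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> crawled = List.of(
            "http://www.example.com/page1",
            "http://example.com/page2",
            "http://other.org/page3");
        System.out.println(filterByDomain(crawled, "example.com"));
        // prints [http://www.example.com/page1, http://example.com/page2]
    }
}
```

The same predicate could run inside a Hadoop map task over the segments instead of over a text dump, which avoids materializing the full dump for a 25M-page crawl.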

Re: How to get all the crawled pages for a particular domain

Posted by Dennis Kubes <ku...@apache.org>.
There is a domain-url filter.  Is that what you were looking for?

Dennis
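[Editorial sketch] Dennis is presumably referring to the urlfilter-domain plugin. A hedged sketch of how it might be enabled; the plugin id, property name, and file name here are from memory of Nutch 1.x and should be checked against conf/nutch-default.xml:

```xml
<!-- nutch-site.xml: add urlfilter-domain to the enabled plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
<!-- file listing the domains to keep, one per line (e.g. example.com) -->
<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
</property>
```

Note that a URL filter restricts which URLs get fetched at crawl time; pulling already-crawled pages for one domain out of existing segments still requires reading and filtering the segment data.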


Re: How to get all the crawled pages for a particular domain

Posted by Yves Petinot <yv...@snooth.com>.
Hi Bhavin,

Other Nutch users may comment on this, but it seems to me that working 
on top of the nutchbase branch might allow you to perform that type of 
processing quite easily.

-y
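[Editorial sketch] For context on why the nutchbase branch suits this use case: it stores pages in a single table keyed by the URL with the host labels reversed, so every page of a domain (and its subdomains) sorts into one contiguous key range that can be scanned directly. A toy illustration of that keying idea follows; this is not nutchbase's actual code, just the concept:

```java
import java.net.URI;

public class ReversedHostKey {
    // Builds a storage key with the host labels reversed, so that all
    // pages of a domain and its subdomains sort into one contiguous range.
    static String key(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) sb.append('.');
        }
        sb.append(':').append(u.getScheme());
        if (u.getRawPath() != null) sb.append(u.getRawPath());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(key("http://www.example.com/page1"));
        // prints com.example.www:http/page1
    }
}
```

With keys like these, "all pages under example.com" becomes a range scan from `com.example` to `com.examplf` rather than a full-table filter.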
