Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/07 13:56:35 UTC

crawling a list of urls

Hello,

I have a case where I need to crawl a list of exact URLs, somewhere
in the range of 1 to 1.5M URLs.

I have written those URLs into numerous files under /home/urls, i.e.
/home/urls/1, /home/urls/2, and so on.

Then, using the crawl command, I am crawling to depth=1.
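
For reference, the invocation I have in mind is roughly the following (the
crawl directory is just a placeholder, not a path I have settled on):

  bin/nutch crawl /home/urls -dir /home/crawl -depth 1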

Are there any recommendations or general guidelines that I should
follow when setting up Nutch just to fetch and index a list of URLs?


Best Regards,
C.B.

Re: crawling a list of urls

Posted by Cam Bazz <ca...@gmail.com>.
Thank you, Lewis, this has been very informative, especially regarding
deleting documents.

Best.

On Thu, Jul 7, 2011 at 6:51 PM, lewis john mcgibbney
<le...@gmail.com> wrote:
> See comments below
>
> On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <ca...@gmail.com> wrote:
>
>> Hello Lewis,
>>
>> Pardon me for the brief description. I have a set of URLs, namely
>> product URLs, in the range of millions.
>>
>
> Firstly (this is just a suggestion): I assume that you wish Nutch to
> fetch the full page content. Ensure that http.content.limit is set to an
> appropriate value to allow this.
>
>
>>
>> So I want to write my URLs to a flat file and have Nutch crawl them
>> to depth = 1.
>>
>
> As you describe, you have various seed directories, so I assume that
> crawling a large set of seeds will be a recurring task. IMHO, rather than
> running the jobs manually, I would write a bash script to do this for me;
> this will also enable you to schedule a once-a-day update of your crawldb,
> linkdb, Solr index and so forth. There are plenty of scripts which have
> been tested and used throughout the community here:
> http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
>
>
>> However, I might remove URLs from this list, or add new ones. I also
>> would like Nutch to revisit each site once a day.
>>
>
> Check out nutch-site.xml for the crawldb fetch intervals; these values can
> be used to accommodate the dynamism of various pages. Once you have removed
> URLs (this is going to be a laborious and extremely tedious task if done
> manually), you would simply run your script again.
>
>> I would like removed URLs to be deleted, and new ones to be re-injected
>> each time Nutch starts.
>>
>
> With regard to deleting URLs from your crawldb (and subsequently the index),
> I am not sure about this exactly. Can you justify completely deleting the
> URLs from the data store? What happens if you add the URL again the next
> day? I'm not sure this is a sustainable method for maintaining your data
> store/index.
>
>>
>> Best Regards,
>> -C.B.
>>
>> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
>> <le...@gmail.com> wrote:
>> > Hi C.B.,
>> >
>> > This is way too vague. We really require more information regarding
>> > roughly what kind of results you wish to get. It would be a near-impossible
>> > task for anyone to try and specify a solution to this open-ended question.
>> >
>> > Please elaborate
>> >
>> > Thank you
>> >
>> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <ca...@gmail.com> wrote:
>> >
>> >> Hello,
>> >>
>> >> I have a case where I need to crawl a list of exact URLs, somewhere
>> >> in the range of 1 to 1.5M URLs.
>> >>
>> >> I have written those URLs into numerous files under /home/urls, i.e.
>> >> /home/urls/1, /home/urls/2, and so on.
>> >>
>> >> Then, using the crawl command, I am crawling to depth=1.
>> >>
>> >> Are there any recommendations or general guidelines that I should
>> >> follow when setting up Nutch just to fetch and index a list of URLs?
>> >>
>> >>
>> >> Best Regards,
>> >> C.B.
>> >>
>> >
>> >
>> >
>> > --
>> > *Lewis*
>> >
>>
>
>
>
> --
> *Lewis*
>

Re: crawling a list of urls

Posted by lewis john mcgibbney <le...@gmail.com>.
See comments below

On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello Lewis,
>
> Pardon me for the brief description. I have a set of URLs, namely
> product URLs, in the range of millions.
>

Firstly (this is just a suggestion): I assume that you wish Nutch to
fetch the full page content. Ensure that http.content.limit is set to an
appropriate value to allow this.
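
For example, something along these lines in nutch-site.xml (the value is
only an illustration; -1 removes the limit entirely, or use a byte count
large enough for your biggest product pages):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Do not truncate fetched content.</description>
  </property>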


>
> So I want to write my URLs to a flat file and have Nutch crawl them
> to depth = 1.
>

As you describe, you have various seed directories, so I assume that
crawling a large set of seeds will be a recurring task. IMHO, rather than
running the jobs manually, I would write a bash script to do this for me;
this will also enable you to schedule a once-a-day update of your crawldb,
linkdb, Solr index and so forth. There are plenty of scripts which have
been tested and used throughout the community here:
http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
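
As a very rough sketch only (untested, local mode, one fetch cycle per run,
placeholder paths and Solr URL, and the command syntax can differ slightly
between Nutch versions), such a script could look like this:

  #!/bin/bash
  # Daily re-crawl of a fixed seed list to depth 1 (illustrative sketch).
  NUTCH=/opt/nutch/bin/nutch        # placeholder Nutch install path
  CRAWL=/home/crawl                 # crawldb, linkdb and segments live here
  SEEDS=/home/urls                  # your seed URL files
  SOLR=http://localhost:8983/solr/  # placeholder Solr URL

  # (Re-)inject the current seed list into the crawldb
  $NUTCH inject $CRAWL/crawldb $SEEDS

  # One generate/fetch/parse/updatedb cycle is equivalent to depth=1
  $NUTCH generate $CRAWL/crawldb $CRAWL/segments
  SEGMENT=`ls -d $CRAWL/segments/2* | tail -1`
  $NUTCH fetch $SEGMENT
  $NUTCH parse $SEGMENT             # assumes fetcher.parse=false
  $NUTCH updatedb $CRAWL/crawldb $SEGMENT

  # Rebuild the linkdb and push the new segment to Solr
  $NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments
  $NUTCH solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $SEGMENT

Cron it once a day and it will pick up whatever is in /home/urls at that
point.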


> However, I might remove URLs from this list, or add new ones. I also
> would like Nutch to revisit each site once a day.
>

Check out nutch-site.xml for the crawldb fetch intervals; these values can
be used to accommodate the dynamism of various pages. Once you have removed
URLs (this is going to be a laborious and extremely tedious task if done
manually), you would simply run your script again.
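
If a blanket one-day interval is what you want, I believe the relevant
property is db.fetch.interval.default (in seconds); something like the
following in nutch-site.xml, but do check it against the nutch-default.xml
shipped with your version:

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>Re-fetch pages roughly once a day.</description>
  </property>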

> I would like removed URLs to be deleted, and new ones to be re-injected
> each time Nutch starts.
>

With regard to deleting URLs from your crawldb (and subsequently the index),
I am not sure about this exactly. Can you justify completely deleting the
URLs from the data store? What happens if you add the URL again the next
day? I'm not sure this is a sustainable method for maintaining your data
store/index.
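
If you do end up needing to purge URLs, one possibility (which I have not
tried at this scale) is to add exclusion patterns for the removed URLs to
regex-urlfilter.txt and then rewrite the crawldb with the filters applied,
for example:

  # Rebuild the crawldb, dropping anything the current URL filters reject
  bin/nutch mergedb /home/crawl/crawldb_filtered /home/crawl/crawldb -filter
  mv /home/crawl/crawldb /home/crawl/crawldb_old
  mv /home/crawl/crawldb_filtered /home/crawl/crawldb

The corresponding documents would still have to be removed from the Solr
index separately.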

>
> Best Regards,
> -C.B.
>
> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
> <le...@gmail.com> wrote:
> > Hi C.B.,
> >
> > This is way too vague. We really require more information regarding
> > roughly what kind of results you wish to get. It would be a near-impossible
> > task for anyone to try and specify a solution to this open-ended question.
> >
> > Please elaborate
> >
> > Thank you
> >
> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <ca...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> I have a case where I need to crawl a list of exact URLs, somewhere
> >> in the range of 1 to 1.5M URLs.
> >>
> >> I have written those URLs into numerous files under /home/urls, i.e.
> >> /home/urls/1, /home/urls/2, and so on.
> >>
> >> Then, using the crawl command, I am crawling to depth=1.
> >>
> >> Are there any recommendations or general guidelines that I should
> >> follow when setting up Nutch just to fetch and index a list of URLs?
> >>
> >>
> >> Best Regards,
> >> C.B.
> >>
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*

Re: crawling a list of urls

Posted by Cam Bazz <ca...@gmail.com>.
Hello Lewis,

Pardon me for the brief description. I have a set of URLs, namely
product URLs, in the range of millions.

So I want to write my URLs to a flat file and have Nutch crawl them
to depth = 1.

However, I might remove URLs from this list, or add new ones. I also
would like Nutch to revisit each site once a day.

I would like removed URLs to be deleted, and new ones to be re-injected
each time Nutch starts.

Best Regards,
-C.B.

On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
<le...@gmail.com> wrote:
> Hi C.B.,
>
> This is way too vague. We really require more information regarding roughly
> what kind of results you wish to get. It would be a near-impossible task for
> anyone to try and specify a solution to this open-ended question.
>
> Please elaborate
>
> Thank you
>
> On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <ca...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a case where I need to crawl a list of exact URLs, somewhere
>> in the range of 1 to 1.5M URLs.
>>
>> I have written those URLs into numerous files under /home/urls, i.e.
>> /home/urls/1, /home/urls/2, and so on.
>>
>> Then, using the crawl command, I am crawling to depth=1.
>>
>> Are there any recommendations or general guidelines that I should
>> follow when setting up Nutch just to fetch and index a list of URLs?
>>
>>
>> Best Regards,
>> C.B.
>>
>
>
>
> --
> *Lewis*
>

Re: crawling a list of urls

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi C.B.,

This is way too vague. We really require more information regarding roughly
what kind of results you wish to get. It would be a near-impossible task for
anyone to try and specify a solution to this open-ended question.

Please elaborate

Thank you

On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> I have a case where I need to crawl a list of exact URLs, somewhere
> in the range of 1 to 1.5M URLs.
>
> I have written those URLs into numerous files under /home/urls, i.e.
> /home/urls/1, /home/urls/2, and so on.
>
> Then, using the crawl command, I am crawling to depth=1.
>
> Are there any recommendations or general guidelines that I should
> follow when setting up Nutch just to fetch and index a list of URLs?
>
>
> Best Regards,
> C.B.
>



-- 
*Lewis*