Posted to user@nutch.apache.org by Sourajit Basak <so...@gmail.com> on 2012/12/05 18:09:54 UTC

fetcher partitioning

As I understand it, Nutch partitions URLs by host, IP, or domain. Is it
possible to partition based on URL patterns?

For example, my company, a publishing house, is planning to expose its
content at URLs like http://host/publicationA, http://host/publicationB, etc.
We wish to partition the fetching by URL pattern, so that /publicationA/*
goes to one thread, /publicationB/* to another, and so on.

This would not only help us index the content faster but also let us test
the throughput of the site, a side benefit that comes at no extra work.

We could modify the URLPartitioner, but it does not seem to be plug and play
like the FetchSchedule, so that would mean changes to the core.

Any suggestions?

Best,
Sourajit

Re: fetcher partitioning

Posted by Sourajit Basak <so...@gmail.com>.

https://issues.apache.org/jira/browse/NUTCH-1504

On Mon, Dec 10, 2012 at 5:16 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> -----Original message-----
> > From: Sourajit Basak <so...@gmail.com>
> > Sent: Mon 10-Dec-2012 12:17
> > To: user@nutch.apache.org
> > Subject: Re: fetcher partitioning
> >
> > Markus,
> > I will open an issue.
> >
> > But I am confused now. Does the partitioner have no effect on the
> > fetchers ?
>
> The partitioner decides which record ends up in which fetch list. When
> running locally, there is always one fetch list and one mapper to ingest
> that fetch list.
>
> > Even if we allot 10 threads to the fetcher (all urls belonging to the
> > same host), will each thread fetch its items simultaneously ?
>
> That depends on the queue mode used. The fetcher organizes URL's in
> queues, and threads will just pick the next URL to fetch. URL's are either
> queued by host, ip or domain. See nutch-default for descriptions on which
> queue to use and how many threads per queue to set up.

RE: fetcher partitioning

Posted by Markus Jelsma <ma...@openindex.io>.


 
 
-----Original message-----
> From:Sourajit Basak <so...@gmail.com>
> Sent: Mon 10-Dec-2012 12:17
> To: user@nutch.apache.org
> Subject: Re: fetcher partitioning
> 
> Markus,
> I will open an issue.
> 
> But I am confused now. Does the partitioner have no effect on the fetchers
> ?

The partitioner decides which record ends up in which fetch list. When running locally, there is always one fetch list and one mapper to ingest that fetch list.
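
To illustrate why a partitioner shows no visible effect in local mode, here is a minimal sketch of hash partitioning by host. It is not Nutch's actual URLPartitioner: the class and method names (HostHashPartitionSketch, partitionFor) are made up for this example, and it assumes plain string hashing of the host name. With a single fetch list, every URL maps to index 0 whatever the partitioning rule is.

```java
import java.net.URI;

/** Illustrative only: hash partitioning by host, not Nutch's real partitioner. */
final class HostHashPartitionSketch {

    /** Maps a URL to a fetch-list index in the range [0, numFetchLists). */
    static int partitionFor(String url, int numFetchLists) {
        String host = URI.create(url).getHost();
        String key = (host != null) ? host : url; // fall back to the whole URL
        // Mask the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numFetchLists;
    }

    public static void main(String[] args) {
        String a = "http://host/publicationA/page1";
        String b = "http://host/publicationB/page2";
        // Local mode: one fetch list, so both URLs land in partition 0.
        System.out.println(partitionFor(a, 1) + " " + partitionFor(b, 1));
        // Several fetch lists: URLs of the same host still share a partition.
        System.out.println(partitionFor(a, 4) == partitionFor(b, 4));
    }
}
```

The second line printed is true because both URLs share the host, which is why a host-based rule cannot separate /publicationA from /publicationB.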

> Even if we allot 10 threads to the fetcher (all urls belonging to the same
> host), will each thread fetch its items simultaneously ?

That depends on the queue mode used. The fetcher organizes URLs in queues, and threads just pick the next URL to fetch. URLs are queued by host, IP, or domain. See nutch-default for descriptions of which queue mode to use and how many threads per queue to set up.
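
As a rough illustration of queueing by host or domain, the sketch below groups URLs into per-key queues in the way this answer describes. The names (FetchQueueSketch, QueueMode, queueKey, buildQueues) are hypothetical and this is not the fetcher's real code. The domain rule naively takes the last two host labels, and the IP mode is left out because it would need a DNS lookup. In nutch-default.xml the related settings are fetcher.queue.mode, fetcher.threads.per.queue and fetcher.threads.fetch.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Illustrative only: groups URLs into fetch queues keyed by host or domain. */
final class FetchQueueSketch {

    enum QueueMode { BY_HOST, BY_DOMAIN }

    /** Returns the queue key for a URL under the given mode. */
    static String queueKey(String url, QueueMode mode) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return url; // no host component: use the URL itself as the key
        }
        if (mode == QueueMode.BY_HOST) {
            return host;
        }
        // Naive domain rule (an assumption): keep the last two labels.
        String[] labels = host.split("\\.");
        int n = labels.length;
        return (n <= 2) ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Builds one FIFO queue per key, preserving first-seen key order. */
    static Map<String, Queue<String>> buildQueues(List<String> urls, QueueMode mode) {
        Map<String, Queue<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(queueKey(url, mode), k -> new ArrayDeque<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://host/publicationA/1",
                "http://host/publicationB/1",
                "http://www.example.com/x");
        Map<String, Queue<String>> queues = buildQueues(urls, QueueMode.BY_HOST);
        System.out.println(queues.keySet()); // [host, www.example.com]
    }
}
```

Both publication URLs end up in the single queue for "host", so how many of them are fetched at once depends on the threads-per-queue setting rather than on the partitioner.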

> What is queue mode?

Re: fetcher partitioning

Posted by Sourajit Basak <so...@gmail.com>.

Markus,
I will open an issue.

But I am confused now. Does the partitioner have no effect on the fetchers?
Even if we allot 10 threads to the fetcher (with all URLs belonging to the
same host), will each thread fetch its items simultaneously? What is queue
mode?

Best,
Sourajit

On Mon, Dec 10, 2012 at 4:23 PM, Markus Jelsma
<ma...@openindex.io>wrote:

> Sourajit,
>
> Looks fine at a first glance. A partitioner does not partition between
> threads, only mappers. It also makes little sense because in the fetcher
> number of threads can be set plus the queue mode.
>
> Can you open an issue and attach your patch?
>
> Thanks,
>

RE: fetcher partitioning

Posted by Markus Jelsma <ma...@openindex.io>.

Sourajit,

Looks fine at first glance. Note that a partitioner does not partition between threads, only between mappers. Partitioning across threads also makes little sense, because the fetcher already lets you set the number of threads and the queue mode.

Can you open an issue and attach your patch? 

Thanks,

 
 
-----Original message-----
> From:Sourajit Basak <so...@gmail.com>
> Sent: Mon 10-Dec-2012 10:55
> To: user@nutch.apache.org
> Cc: Markus Jelsma <ma...@openindex.io>
> Subject: Re: fetcher partitioning
> 
> Could anyone review this patch for using a pluggable custom partitioner ?
> For the time, I have just copied over HashPartitioner impl. Need to understand a bit more about Hadoop's partitioning.
> 
> Can the group also comment if this RandomPartioner will distribute urls from the same host across different fetcher threads ? Running in local mode, doesn't seem to have any affect. 
> 
> (My cluster is undergoing routine maintenance; need to wait for testing in distributed mode)
> 
> Best,
> Sourajit

Re: fetcher partitioning

Posted by Sourajit Basak <so...@gmail.com>.

Could anyone review this patch for using a pluggable custom partitioner?
For the time being, I have just copied over the HashPartitioner impl. I need
to understand a bit more about Hadoop's partitioning.

Can the group also comment on whether this RandomPartioner will distribute
URLs from the same host across different fetcher threads? Running in local
mode, it doesn't seem to have any effect.

(My cluster is undergoing routine maintenance; I need to wait to test in
distributed mode.)

Best,
Sourajit

On Thu, Dec 6, 2012 at 11:21 AM, Sourajit Basak <so...@gmail.com>wrote:

> Ok. Give me some time.

Re: fetcher partitioning

Posted by Sourajit Basak <so...@gmail.com>.

Ok. Give me some time.

On Thu, Dec 6, 2012 at 12:07 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> -----Original message-----
> > From: Sourajit Basak <so...@gmail.com>
> > Sent: Wed 05-Dec-2012 18:16
> > To: user@nutch.apache.org
> > Subject: fetcher partitioning
> >
> > Per my understanding, Nutch partitions urls based on either host, ip or
> > domain. Is it possible to partition based on url patterns ?
> >
> > For e.g my company, a publishing house, is planning to expose its content
> > like http://host/publicationA, http://host/publicationB. etc. We wish to
> > partition the fetching based on url patterns like /publicationA/* to a
> > thread, /publicationB/* to another, etc.
> >
> > This will not only help us expedite indexing the content but also test
> > the throughput of the site, though the second is an additional benefit
> > we get by doing no extra work.
> >
> > We can attempt to modify the URLPartitioner, but that does not seem to be
> > plug and play like the FetchSchedule. And would mean changes to the core.
>
> Indeed, you have to modify the partitioner to make this happen. You are
> free to do so but you can also make it pluggable as fetch schedule via
> config and provide a patch so it can be added to the Nutch sources.
>
> > Any suggestions ?
> >
> > Best,
> > Sourajit

RE: fetcher partitioning

Posted by Markus Jelsma <ma...@openindex.io>.


 
 
-----Original message-----
> From:Sourajit Basak <so...@gmail.com>
> Sent: Wed 05-Dec-2012 18:16
> To: user@nutch.apache.org
> Subject: fetcher partitioning
> 
> Per my understanding, Nutch partitions urls based on either host, ip or
> domain. Is it possible to partition based on url patterns ?
> 
> For e.g my company, a publishing house, is planning to expose its content
> like http://host/publicationA, http://host/publicationB. etc. We wish to
> partition the fetching based on url patterns like /publicationA/* to a
> thread, /publicationB/* to another, etc.
> 
> This will not only help us expedite indexing the content but also test the
> throughput of the site, though the second is an additional benefit we get
> by doing no extra work.
> 
> We can attempt to modify the URLPartitioner, but that does not seem to be
> plug and play like the FetchSchedule. And would mean changes to the core.

Indeed, you have to modify the partitioner to make this happen. You are free to do so, but you could also make it pluggable via config, like the fetch schedule, and provide a patch so it can be added to the Nutch sources.

> 
> Any suggestions ?
> 
> Best,
> Sourajit
>
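
For readers wondering what a URL-pattern partitioning rule could look like, here is a minimal sketch that keys on the host plus the first path segment. It is an assumption-laden illustration, not the patch discussed in this thread and not Nutch code: PathPrefixPartitionSketch and its methods are hypothetical names, and the class deliberately avoids Hadoop types so it runs on its own. A real implementation would sit behind Hadoop's Partitioner interface, whose getPartition method takes a key, a value and the number of partitions. As noted elsewhere in the thread, such a rule only spreads URLs across fetch lists (mappers), not across fetcher threads.

```java
import java.net.URI;

/** Illustrative only: partitions by host plus first path segment. */
final class PathPrefixPartitionSketch {

    /** Returns the first path segment, e.g. "publicationA" for http://host/publicationA/x. */
    static String firstPathSegment(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.length() <= 1) {
            return ""; // empty path or just "/"
        }
        int end = path.indexOf('/', 1);
        return (end < 0) ? path.substring(1) : path.substring(1, end);
    }

    /** Maps a URL to a partition index in the range [0, numPartitions). */
    static int getPartition(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        String key = (host == null ? "" : host) + "/" + firstPathSegment(url);
        // Mask the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(firstPathSegment("http://host/publicationA/issue1")); // publicationA
        // URLs under the same publication always share a partition.
        System.out.println(getPartition("http://host/publicationA/issue1", 8)
                == getPartition("http://host/publicationA/issue2", 8)); // true
    }
}
```

Two different publications may still hash to the same partition; a rule that must keep them apart would need an explicit prefix-to-partition mapping instead of a hash.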