Posted to user@nutch.apache.org by SUJIT PAL <su...@comcast.net> on 2012/02/22 04:45:35 UTC

[nutchgora] - proposal to support distributed indexing

Hi,

I need to move our SOLR-based search platform to a distributed setup, and therefore need to be able to write to multiple SOLR servers from Nutch (I am working on the nutchgora branch, so this may be specific to that branch). Here is what I think I need to do...

Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it converts each WebPage to a NutchDocument and then passes the NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter adds the NutchDocument to a queue, and when the commit size is exceeded it writes out the queue and does a commit (plus a final one in the shutdown step).
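
For context, that single-server behaviour boils down to something like the sketch below. This is a simplified paraphrase rather than the actual SolrWriter source; CommonsHttpSolrServer is the SolrJ 3.x client class, and the commitSize field stands in for whatever the commit-size setting is called in the configuration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Simplified view of the existing flow: buffer documents, flush and commit
// whenever the buffer reaches commitSize, and flush once more at shutdown.
public class SimpleSolrWriter {

  private final SolrServer server;
  private final List<SolrInputDocument> inputDocs = new ArrayList<SolrInputDocument>();
  private final int commitSize;

  public SimpleSolrWriter(String serverUrl, int commitSize) throws IOException {
    this.server = new CommonsHttpSolrServer(serverUrl);
    this.commitSize = commitSize;
  }

  public void write(SolrInputDocument doc) throws IOException, SolrServerException {
    inputDocs.add(doc);
    if (inputDocs.size() >= commitSize) {
      flush();
    }
  }

  public void close() throws IOException, SolrServerException {
    flush(); // the extra commit in the shutdown step
  }

  private void flush() throws IOException, SolrServerException {
    if (!inputDocs.isEmpty()) {
      server.add(inputDocs);
      server.commit();
      inputDocs.clear();
    }
  }
}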

My proposal is to allow the SolrConstants.SERVER_URL parameter to be a comma-separated list of URLs. The SolrWriter would split this parameter on "," to build an array of server URLs and an equally sized array of inputDocs queues. For each document it would run the page URL through a hash-mod partitioner and append the document to the inputDocs queue for that partition.

My pages would then be spread across a number of SOLR servers, which I can query in a distributed fashion (according to the SOLR docs it is advisable to distribute documents randomly, so that the (unreliable) per-server idf values do not skew scores from any one server too much).
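
To make the idea concrete, here is a minimal sketch of what I have in mind (hypothetical class and method names; the real change would live inside SolrWriter):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch of the proposal: one Solr server and one buffer per partition,
// with each document routed by a hash-mod of its page URL.
public class PartitionedSolrWriter {

  private final SolrServer[] servers;
  private final List<List<SolrInputDocument>> inputDocs;

  public PartitionedSolrWriter(String serverUrlList) throws IOException {
    // SolrConstants.SERVER_URL supplied as a comma-separated list
    String[] urls = serverUrlList.split(",");
    servers = new SolrServer[urls.length];
    inputDocs = new ArrayList<List<SolrInputDocument>>(urls.length);
    for (int i = 0; i < urls.length; i++) {
      servers[i] = new CommonsHttpSolrServer(urls[i].trim());
      inputDocs.add(new ArrayList<SolrInputDocument>());
    }
  }

  public void write(String pageUrl, SolrInputDocument doc)
      throws IOException, SolrServerException {
    // hash-mod partitioning; mask the sign bit so the index is never negative
    int partition = (pageUrl.hashCode() & Integer.MAX_VALUE) % servers.length;
    inputDocs.get(partition).add(doc);
    // when a partition's buffer exceeds the commit size, add + commit against
    // servers[partition] and clear the buffer, exactly as the single-server
    // writer does today
  }
}

One thing to keep in mind is that the hash has to stay stable across indexing runs, otherwise a re-crawled page could land on a different shard than its earlier version.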

Is this a reasonable way to go about this? Or is there a simpler method I am overlooking?

TIA for any help you can provide.

-sujit


Re: [nutchgora] - proposal to support distributed indexing

Posted by Markus Jelsma <ma...@openindex.io>.
In that case the algorithm doesn't matter as you still need to reindex the 
corpus if you upgrade to 4.x.

Cheers!

> Thanks Marcus, I guess I'll probably still need to build nutch side
> partitioning for myself since I am on Solr 3.5, it would be throw-away
> code, to be changed when I get on to 4.x.
> 
> -sujit
> 
> On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:
> > Hi,
> > 
> > We're in the process of testing Solr trunk's cloud features that recently
> > includes initial work for distributed indexing. With it, there is no need
> > anymore for doing the partitioning client site because Solr will forward
> > the input documents to the proper shard. Solr uses the MurMur hashing
> > algorithm to decide the target shard so i would stick to that in any
> > case.
> > 
> > Anyway, with Solr being able to handle incoming documents on any node,
> > and distributing them appropriately there is no need anymore for hashing
> > at all. What we do need to to select a target server from a pool per
> > batch. Committing is not needed if soft autocommitting is enabled, quite
> > useful for Solr's new NRT features.
> > 
> > If Solr 4.0 is released in the coming months (and that's what it looks
> > like) i would suggest to patch Nutch to allow for a list of Solr server
> > URL's instead of doing partitioning on the client site.
> > 
> > In our case we don't even need a pool of Solr servers in Nutch to select
> > from because we pass the documents through a proxy that is aware of
> > running and offline servers.
> > 
> > Markus
> > 
> >> Thanks Julien and Lewis.
> >> 
> >> Being able to specify the partitioner class sounds good - I am thinking
> >> that perhaps they could all be impls of the Hadoop
> >> org.apache.hadoop.mapreduce.Partitioner interface.
> >> 
> >> Would it be okay if I annotated NUTCH-945 saying that I am working on
> >> providing a patch for the NutchGora branch initially (I haven't looked
> >> at the head code yet, its likely to be slightly different), and then
> >> try to port the change over to the head?
> >> 
> >> -sujit
> >> 
> >> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
> >>> Hi.
> >>> 
> >>> There was an issue [0] opened for this some time ago and it looks that
> >>> apart from the (bare minimal) description, there has been no work done
> >>> on it.
> >>> 
> >>> Would be a real nice feature to have.
> >>> 
> >>> [0] https://issues.apache.org/jira/browse/NUTCH-945
> >>> 
> >>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
> >>> 
> >>> lists.digitalpebble@gmail.com> wrote:
> >>>> Hi Sujit,
> >>>> 
> >>>> Sounds good. A nice way of doing it would be to make so that people
> >>>> can define how to partition over the SOLR instances in the way they
> >>>> want e.g. consistent hashing, URL range or crawldb metadata by taking
> >>>> a class name as parameter. Does not need to be pluggable I think. I
> >>>> had implemented something along these lines some time ago for a
> >>>> customer but could not release it open source.
> >>>> 
> >>>> Feel free to open a JIRA  to comment on this issue and attach a patch.
> >>>> 
> >>>> Thanks
> >>>> 
> >>>> Julien
> >>>> 
> >>>> On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> I need to move the SOLR based search platform to a distributed setup,
> >>>>> and therefore need to be able to write to multiple SOLR servers from
> >>>>> Nutch (working on the nutchgora branch, so this may be specific to
> >>>>> this
> >>>> 
> >>>> branch).
> >>>> 
> >>>>> Here is what I think I need to do...
> >>>>> 
> >>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
> >>>>> it converts the WebPage to a NutchDocument, then passes the
> >>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
> >>>>> case). The
> >>>> 
> >>>> SolrWriter
> >>>> 
> >>>>> adds the NutchDocument to a queue and when the commit size is
> >>>>> exceeded, writes out the queue and does a commit (and another one in
> >>>>> the shutdown step).
> >>>>> 
> >>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> >>>>> comma-separated list of URLs. The SolrWriter splits this parameter by
> >>>>> "," and creates an array of server URLs and the same size array of
> >>>>> inputDocs queue. It then takes the URL, runs it through a hashMod
> >>>>> partitioner and writes it out to the inputDocs queue pointed by the
> >>>>> partition.
> >>>>> 
> >>>>> Then my pages get split up into a number of SOLR servers, where I can
> >>>>> query them in a distributed fashion (according to the SOLR docs, it
> >>>>> is advisable to do this in a random manner to make sure the
> >>>>> (unreliable) idf values do not influence scores from one server too
> >>>>> much).
> >>>>> 
> >>>>> Is this a reasonable way to go about this? Or is there a simpler
> >>>>> method I am overlooking?
> >>>>> 
> >>>>> TIA for any help you can provide.
> >>>>> 
> >>>>> -sujit
> >>>> 
> >>>> --
> >>>> *
> >>>> *Open Source Solutions for Text Engineering
> >>>> 
> >>>> http://digitalpebble.blogspot.com/
> >>>> http://www.digitalpebble.com
> >>>> http://twitter.com/digitalpebble

Re: [nutchgora] - proposal to support distributed indexing

Posted by SUJIT PAL <su...@comcast.net>.
Thanks Markus. I guess I'll still need to build Nutch-side partitioning for myself since I am on Solr 3.5; it would be throw-away code, to be replaced when I move to 4.x.

-sujit

On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:

> Hi,
> 
> We're in the process of testing Solr trunk's cloud features that recently 
> includes initial work for distributed indexing. With it, there is no need 
> anymore for doing the partitioning client site because Solr will forward the 
> input documents to the proper shard. Solr uses the MurMur hashing algorithm to 
> decide the target shard so i would stick to that in any case.
> 
> Anyway, with Solr being able to handle incoming documents on any node, and 
> distributing them appropriately there is no need anymore for hashing at all. 
> What we do need to to select a target server from a pool per batch.  
> Committing is not needed if soft autocommitting is enabled, quite useful for 
> Solr's new NRT features.
> 
> If Solr 4.0 is released in the coming months (and that's what it looks like) i 
> would suggest to patch Nutch to allow for a list of Solr server URL's instead 
> of doing partitioning on the client site.
> 
> In our case we don't even need a pool of Solr servers in Nutch to select from 
> because we pass the documents through a proxy that is aware of running and 
> offline servers.
> 
> Markus
> 
>> Thanks Julien and Lewis.
>> 
>> Being able to specify the partitioner class sounds good - I am thinking
>> that perhaps they could all be impls of the Hadoop
>> org.apache.hadoop.mapreduce.Partitioner interface.
>> 
>> Would it be okay if I annotated NUTCH-945 saying that I am working on
>> providing a patch for the NutchGora branch initially (I haven't looked at
>> the head code yet, its likely to be slightly different), and then try to
>> port the change over to the head?
>> 
>> -sujit
>> 
>> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
>>> Hi.
>>> 
>>> There was an issue [0] opened for this some time ago and it looks that
>>> apart from the (bare minimal) description, there has been no work done on
>>> it.
>>> 
>>> Would be a real nice feature to have.
>>> 
>>> [0] https://issues.apache.org/jira/browse/NUTCH-945
>>> 
>>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
>>> 
>>> lists.digitalpebble@gmail.com> wrote:
>>>> Hi Sujit,
>>>> 
>>>> Sounds good. A nice way of doing it would be to make so that people can
>>>> define how to partition over the SOLR instances in the way they want
>>>> e.g. consistent hashing, URL range or crawldb metadata by taking a
>>>> class name as parameter. Does not need to be pluggable I think. I had
>>>> implemented something along these lines some time ago for a customer
>>>> but could not release it open source.
>>>> 
>>>> Feel free to open a JIRA  to comment on this issue and attach a patch.
>>>> 
>>>> Thanks
>>>> 
>>>> Julien
>>>> 
>>>> On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:
>>>>> Hi,
>>>>> 
>>>>> I need to move the SOLR based search platform to a distributed setup,
>>>>> and therefore need to be able to write to multiple SOLR servers from
>>>>> Nutch (working on the nutchgora branch, so this may be specific to
>>>>> this
>>>> 
>>>> branch).
>>>> 
>>>>> Here is what I think I need to do...
>>>>> 
>>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
>>>>> it converts the WebPage to a NutchDocument, then passes the
>>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
>>>>> case). The
>>>> 
>>>> SolrWriter
>>>> 
>>>>> adds the NutchDocument to a queue and when the commit size is exceeded,
>>>>> writes out the queue and does a commit (and another one in the shutdown
>>>>> step).
>>>>> 
>>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
>>>>> comma-separated list of URLs. The SolrWriter splits this parameter by
>>>>> "," and creates an array of server URLs and the same size array of
>>>>> inputDocs queue. It then takes the URL, runs it through a hashMod
>>>>> partitioner and writes it out to the inputDocs queue pointed by the
>>>>> partition.
>>>>> 
>>>>> Then my pages get split up into a number of SOLR servers, where I can
>>>>> query them in a distributed fashion (according to the SOLR docs, it is
>>>>> advisable to do this in a random manner to make sure the (unreliable)
>>>>> idf values do not influence scores from one server too much).
>>>>> 
>>>>> Is this a reasonable way to go about this? Or is there a simpler method
>>>>> I am overlooking?
>>>>> 
>>>>> TIA for any help you can provide.
>>>>> 
>>>>> -sujit
>>>> 
>>>> --
>>>> *
>>>> *Open Source Solutions for Text Engineering
>>>> 
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble


Re: [nutchgora] - proposal to support distributed indexing

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

We're in the process of testing Solr trunk's cloud features, which recently gained
initial work on distributed indexing. With it, there is no need anymore to do the
partitioning on the client side because Solr will forward the input documents to the
proper shard. Solr uses the MurmurHash algorithm to decide the target shard, so I
would stick to that in any case.

Anyway, with Solr able to accept incoming documents on any node and distribute them
appropriately, there is no need for hashing at all. What we do need is to select a
target server from a pool for each batch. Committing is not needed if soft autocommit
is enabled, which is quite useful for Solr's new NRT features.

If Solr 4.0 is released in the coming months (and that's what it looks like), I would
suggest patching Nutch to accept a list of Solr server URLs instead of doing the
partitioning on the client side.

In our case we don't even need a pool of Solr servers in Nutch to select from 
because we pass the documents through a proxy that is aware of running and 
offline servers.

Markus

> Thanks Julien and Lewis.
> 
> Being able to specify the partitioner class sounds good - I am thinking
> that perhaps they could all be impls of the Hadoop
> org.apache.hadoop.mapreduce.Partitioner interface.
> 
> Would it be okay if I annotated NUTCH-945 saying that I am working on
> providing a patch for the NutchGora branch initially (I haven't looked at
> the head code yet, its likely to be slightly different), and then try to
> port the change over to the head?
> 
> -sujit
> 
> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
> > Hi.
> > 
> > There was an issue [0] opened for this some time ago and it looks that
> > apart from the (bare minimal) description, there has been no work done on
> > it.
> > 
> > Would be a real nice feature to have.
> > 
> > [0] https://issues.apache.org/jira/browse/NUTCH-945
> > 
> > On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
> > 
> > lists.digitalpebble@gmail.com> wrote:
> >> Hi Sujit,
> >> 
> >> Sounds good. A nice way of doing it would be to make so that people can
> >> define how to partition over the SOLR instances in the way they want
> >> e.g. consistent hashing, URL range or crawldb metadata by taking a
> >> class name as parameter. Does not need to be pluggable I think. I had
> >> implemented something along these lines some time ago for a customer
> >> but could not release it open source.
> >> 
> >> Feel free to open a JIRA  to comment on this issue and attach a patch.
> >> 
> >> Thanks
> >> 
> >> Julien
> >> 
> >> On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:
> >>> Hi,
> >>> 
> >>> I need to move the SOLR based search platform to a distributed setup,
> >>> and therefore need to be able to write to multiple SOLR servers from
> >>> Nutch (working on the nutchgora branch, so this may be specific to
> >>> this
> >> 
> >> branch).
> >> 
> >>> Here is what I think I need to do...
> >>> 
> >>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
> >>> it converts the WebPage to a NutchDocument, then passes the
> >>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
> >>> case). The
> >> 
> >> SolrWriter
> >> 
> >>> adds the NutchDocument to a queue and when the commit size is exceeded,
> >>> writes out the queue and does a commit (and another one in the shutdown
> >>> step).
> >>> 
> >>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> >>> comma-separated list of URLs. The SolrWriter splits this parameter by
> >>> "," and creates an array of server URLs and the same size array of
> >>> inputDocs queue. It then takes the URL, runs it through a hashMod
> >>> partitioner and writes it out to the inputDocs queue pointed by the
> >>> partition.
> >>> 
> >>> Then my pages get split up into a number of SOLR servers, where I can
> >>> query them in a distributed fashion (according to the SOLR docs, it is
> >>> advisable to do this in a random manner to make sure the (unreliable)
> >>> idf values do not influence scores from one server too much).
> >>> 
> >>> Is this a reasonable way to go about this? Or is there a simpler method
> >>> I am overlooking?
> >>> 
> >>> TIA for any help you can provide.
> >>> 
> >>> -sujit
> >> 
> >> --
> >> *
> >> *Open Source Solutions for Text Engineering
> >> 
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble

Re: [nutchgora] - proposal to support distributed indexing

Posted by SUJIT PAL <su...@comcast.net>.
I have uploaded a patch for NUTCH-945. It works locally, as described in the JIRA.

-sujit


On Feb 23, 2012, at 10:35 PM, SUJIT PAL wrote:

> Hi Lewis,
> 
> Ok, thanks, I will attach the patch to NUTCH-945 after I am done with it, and update this thread as well...
> 
> -sujit
> 
> On Feb 23, 2012, at 3:43 AM, Lewis John Mcgibbney wrote:
> 
>> Hi Sujit,
>> 
>> 
>> On Wed, Feb 22, 2012 at 6:16 PM, SUJIT PAL <su...@comcast.net> wrote:
>> 
>>> Being able to specify the partitioner class sounds good - I am thinking
>>> that perhaps they could all be impls of the Hadoop
>>> org.apache.hadoop.mapreduce.Partitioner interface.
>>> 
>> 
>> Sounds good!
>> 
>> 
>>> 
>>> Would it be okay if I annotated NUTCH-945 saying that I am working on
>>> providing a patch for the NutchGora branch initially (I haven't looked at
>>> the head code yet, its likely to be slightly different), and then try to
>>> port the change over to the head?
>>> 
>> 
>> Yes please fire ahead and if you are able to implement this feature then
>> please attach your patch and we can hopefully review. Based on Markus'
>> comments I think that although things over @ Solr development 4.X are scope
>> for change in the 'near' future, I think this would be useful for people in
>> the meantime.
>> 
>> Thank you
> 


Re: [nutchgora] - proposal to support distributed indexing

Posted by SUJIT PAL <su...@comcast.net>.
Hi Lewis,

Ok, thanks, I will attach the patch to NUTCH-945 after I am done with it, and update this thread as well...

-sujit

On Feb 23, 2012, at 3:43 AM, Lewis John Mcgibbney wrote:

> Hi Sujit,
> 
> 
> On Wed, Feb 22, 2012 at 6:16 PM, SUJIT PAL <su...@comcast.net> wrote:
> 
>> Being able to specify the partitioner class sounds good - I am thinking
>> that perhaps they could all be impls of the Hadoop
>> org.apache.hadoop.mapreduce.Partitioner interface.
>> 
> 
> Sounds good!
> 
> 
>> 
>> Would it be okay if I annotated NUTCH-945 saying that I am working on
>> providing a patch for the NutchGora branch initially (I haven't looked at
>> the head code yet, its likely to be slightly different), and then try to
>> port the change over to the head?
>> 
> 
> Yes please fire ahead and if you are able to implement this feature then
> please attach your patch and we can hopefully review. Based on Markus'
> comments I think that although things over @ Solr development 4.X are scope
> for change in the 'near' future, I think this would be useful for people in
> the meantime.
> 
> Thank you


Re: [nutchgora] - proposal to support distributed indexing

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Sujit,


On Wed, Feb 22, 2012 at 6:16 PM, SUJIT PAL <su...@comcast.net> wrote:

> Being able to specify the partitioner class sounds good - I am thinking
> that perhaps they could all be impls of the Hadoop
> org.apache.hadoop.mapreduce.Partitioner interface.
>

Sounds good!


>
> Would it be okay if I annotated NUTCH-945 saying that I am working on
> providing a patch for the NutchGora branch initially (I haven't looked at
> the head code yet, its likely to be slightly different), and then try to
> port the change over to the head?
>

Yes, please fire ahead, and if you are able to implement this feature then please
attach your patch so we can review it. Based on Markus' comments, although things
over at Solr 4.x development are likely to change in the near future, I think this
would still be useful for people in the meantime.

Thank you

Re: [nutchgora] - proposal to support distributed indexing

Posted by SUJIT PAL <su...@comcast.net>.
Thanks Julien and Lewis.

Being able to specify the partitioner class sounds good - I am thinking that perhaps they could all be impls of the Hadoop org.apache.hadoop.mapreduce.Partitioner interface. 
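
For example, the hash-mod strategy could be expressed like this (a sketch only; in the new mapreduce API Partitioner is actually an abstract class to extend, and the Text-keyed-by-URL assumption may not match the actual key/value types in the nutchgora reducer):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash-mod strategy expressed as a Hadoop Partitioner, keyed on the page URL.
public class UrlHashModPartitioner extends Partitioner<Text, Writable> {

  @Override
  public int getPartition(Text url, Writable doc, int numPartitions) {
    // mask the sign bit so the partition index is never negative
    return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Consistent hashing or URL-range strategies would then just be alternative subclasses.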

Would it be okay if I annotated NUTCH-945 saying that I am working on providing a patch for the NutchGora branch initially (I haven't looked at the head code yet; it's likely to be slightly different), and then try to port the change over to the head?

-sujit

On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:

> Hi.
> 
> There was an issue [0] opened for this some time ago and it looks that
> apart from the (bare minimal) description, there has been no work done on
> it.
> 
> Would be a real nice feature to have.
> 
> [0] https://issues.apache.org/jira/browse/NUTCH-945
> 
> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
> 
>> Hi Sujit,
>> 
>> Sounds good. A nice way of doing it would be to make so that people can
>> define how to partition over the SOLR instances in the way they want e.g.
>> consistent hashing, URL range or crawldb metadata by taking a class name as
>> parameter. Does not need to be pluggable I think. I had implemented
>> something along these lines some time ago for a customer but could not
>> release it open source.
>> 
>> Feel free to open a JIRA  to comment on this issue and attach a patch.
>> 
>> Thanks
>> 
>> Julien
>> 
>> On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:
>> 
>>> Hi,
>>> 
>>> I need to move the SOLR based search platform to a distributed setup, and
>>> therefore need to be able to write to multiple SOLR servers from Nutch
>>> (working on the nutchgora branch, so this may be specific to this
>> branch).
>>> Here is what I think I need to do...
>>> 
>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
>>> converts the WebPage to a NutchDocument, then passes the NutchDocument to
>>> the appropriate NutchIndexWriter (SolrWriter in this case). The
>> SolrWriter
>>> adds the NutchDocument to a queue and when the commit size is exceeded,
>>> writes out the queue and does a commit (and another one in the shutdown
>>> step).
>>> 
>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
>>> comma-separated list of URLs. The SolrWriter splits this parameter by ","
>>> and creates an array of server URLs and the same size array of inputDocs
>>> queue. It then takes the URL, runs it through a hashMod partitioner and
>>> writes it out to the inputDocs queue pointed by the partition.
>>> 
>>> Then my pages get split up into a number of SOLR servers, where I can
>>> query them in a distributed fashion (according to the SOLR docs, it is
>>> advisable to do this in a random manner to make sure the (unreliable) idf
>>> values do not influence scores from one server too much).
>>> 
>>> Is this a reasonable way to go about this? Or is there a simpler method I
>>> am overlooking?
>>> 
>>> TIA for any help you can provide.
>>> 
>>> -sujit
>>> 
>>> 
>> 
>> 
>> --
>> *
>> *Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>> 
> 
> 
> 
> -- 
> *Lewis*


Re: [nutchgora] - proposal to support distributed indexing

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi.

There was an issue [0] opened for this some time ago and it looks like, apart
from the (bare-minimum) description, no work has been done on it.

Would be a real nice feature to have.

[0] https://issues.apache.org/jira/browse/NUTCH-945

On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Sujit,
>
> Sounds good. A nice way of doing it would be to make so that people can
> define how to partition over the SOLR instances in the way they want e.g.
> consistent hashing, URL range or crawldb metadata by taking a class name as
> parameter. Does not need to be pluggable I think. I had implemented
> something along these lines some time ago for a customer but could not
> release it open source.
>
> Feel free to open a JIRA  to comment on this issue and attach a patch.
>
> Thanks
>
> Julien
>
> On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:
>
> > Hi,
> >
> > I need to move the SOLR based search platform to a distributed setup, and
> > therefore need to be able to write to multiple SOLR servers from Nutch
> > (working on the nutchgora branch, so this may be specific to this
> branch).
> > Here is what I think I need to do...
> >
> > Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
> > converts the WebPage to a NutchDocument, then passes the NutchDocument to
> > the appropriate NutchIndexWriter (SolrWriter in this case). The
> SolrWriter
> > adds the NutchDocument to a queue and when the commit size is exceeded,
> > writes out the queue and does a commit (and another one in the shutdown
> > step).
> >
> > My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> > comma-separated list of URLs. The SolrWriter splits this parameter by ","
> > and creates an array of server URLs and the same size array of inputDocs
> > queue. It then takes the URL, runs it through a hashMod partitioner and
> > writes it out to the inputDocs queue pointed by the partition.
> >
> > Then my pages get split up into a number of SOLR servers, where I can
> > query them in a distributed fashion (according to the SOLR docs, it is
> > advisable to do this in a random manner to make sure the (unreliable) idf
> > values do not influence scores from one server too much).
> >
> > Is this a reasonable way to go about this? Or is there a simpler method I
> > am overlooking?
> >
> > TIA for any help you can provide.
> >
> > -sujit
> >
> >
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Lewis

Re: [nutchgora] - proposal to support distributed indexing

Posted by Julien Nioche <li...@gmail.com>.
Hi Sujit,

Sounds good. A nice way of doing it would be to make it so that people can
define how to partition over the SOLR instances in whatever way they want,
e.g. consistent hashing, URL ranges or crawldb metadata, by taking a class
name as a parameter. It does not need to be pluggable, I think. I had
implemented something along these lines some time ago for a customer but
could not release it as open source.
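
For instance, the writer could resolve the strategy from the job configuration roughly like this ("solr.index.partitioner.class" is a made-up property name for illustration, not an existing Nutch setting, and UrlHashModPartitioner is the hypothetical default strategy sketched earlier in this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

// Resolve the partitioning strategy from a configuration property so users
// can swap in consistent hashing, URL ranges, etc. without code changes.
public class PartitionerFactory {

  public static Object createPartitioner(Configuration conf) {
    Class<?> clazz = conf.getClass("solr.index.partitioner.class",
        UrlHashModPartitioner.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}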

Feel free to open a JIRA to comment on this issue and attach a patch.

Thanks

Julien

On 22 February 2012 03:45, SUJIT PAL <su...@comcast.net> wrote:

> Hi,
>
> I need to move the SOLR based search platform to a distributed setup, and
> therefore need to be able to write to multiple SOLR servers from Nutch
> (working on the nutchgora branch, so this may be specific to this branch).
> Here is what I think I need to do...
>
> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
> converts the WebPage to a NutchDocument, then passes the NutchDocument to
> the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter
> adds the NutchDocument to a queue and when the commit size is exceeded,
> writes out the queue and does a commit (and another one in the shutdown
> step).
>
> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> comma-separated list of URLs. The SolrWriter splits this parameter by ","
> and creates an array of server URLs and the same size array of inputDocs
> queue. It then takes the URL, runs it through a hashMod partitioner and
> writes it out to the inputDocs queue pointed by the partition.
>
> Then my pages get split up into a number of SOLR servers, where I can
> query them in a distributed fashion (according to the SOLR docs, it is
> advisable to do this in a random manner to make sure the (unreliable) idf
> values do not influence scores from one server too much).
>
> Is this a reasonable way to go about this? Or is there a simpler method I
> am overlooking?
>
> TIA for any help you can provide.
>
> -sujit
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble