Posted to dev@lucene.apache.org by Soheb Mahmood <so...@gmail.com> on 2011/01/26 17:29:14 UTC

Distributed Indexing

Hello,

We are going to implement distributed indexing for Solr - without the
use of SolrCloud (so it can be scaled up easily). We have a February
deadline to get this done, so we need to get cracking ;)

So far, we've had a look at the Solr classes and thought about
distributed indexing on Solr, and we have come up with these ideas:

1. We plan to modify SimplePostTool to accommodate posting to specific
shards. We are going to add an optional system property that lets the
user specify the list of shards to index to, for example:
java -Durl=http://localhost:7574/solr/collection1/update
-Dshards=localhost:8983/solr,localhost:7574/solr -jar post.jar <list of
XML files>

We also plan to modify server request processing to handle distributed
indexing. We are looking at CommonsHttpSolrServer.java for ways to
accomplish this.

With all these changes, we realise that we are only modifying the Java
version, and that the client libraries for other languages (e.g. Perl)
would need to be updated to accommodate our changes. We were wondering
if there was a simple way of applying the changes we wrote in Java
across all the other languages.

2. We are going to make an interface to handle distributed writing. We
plan for it to sit between the Solr server and the shards - if no shards
are specified, then the post.jar tool will work exactly the same way it
does now. However, if the user specifies shards for post.jar, then we
want a class implementing our interface to kick into action.

3. We plan to test our results by acceptance testing (running Solr and
checking it works ourselves) and by writing a test class.
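To make point 1 concrete, here is a rough sketch of the property handling we have in mind. All class and method names here are invented for illustration (the real SimplePostTool works differently): if -Dshards is absent, behaviour is unchanged; if present, the tool posts to each listed shard.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed post.jar behaviour; names are
// illustrative, not existing Solr code.
class ShardAwarePost {
    /** Expand the -Durl / -Dshards pair into the list of update URLs to post to. */
    static List<String> targets(String url, String shardsProp) {
        if (shardsProp == null || shardsProp.isEmpty()) {
            return Collections.singletonList(url);   // unchanged single-server behaviour
        }
        List<String> urls = new ArrayList<>();
        for (String shard : shardsProp.split(",")) {
            urls.add("http://" + shard.trim() + "/update");
        }
        return urls;                                  // post to each listed shard
    }
}
```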

Does anyone have any comments to share?

Thanks,
Soheb Mahmood


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Distributed Indexing

Posted by Soheb Mahmood <so...@gmail.com>.
Hey guys!

On Thu, 2011-01-27 at 10:04 +1300, Todd Nine wrote:
> Just throwing in my 2 cents.  If you're on a tight deadline have you
> had a look at Solandra?  We were already using Cassandra, so it was
> incredibly easy to get a scalable Solr installation up and running.
> 

In short: we are doing this implementation for a university group
project.

In long: you could be forgiven for thinking we are implementing this
feature to use it ourselves (for some sort of business), but as I said
above, we are actually doing it for a group project. We are UCL
students who have been given the task of contributing something to the
Apache Solr open source project.

Our higher-level goal is to add native distributed indexing to Solr so
that it indirectly benefits SolrCloud in the future. We are hoping to
implement this and get it contributed into the Apache Solr project.
*crosses fingers*

> Hi Soheb,
> 
> Sounds good! A few things I thought of:
> 
> With regard to #1, would the list of shards to index to (if present)
> be exclusive or would we assume that the shard the update request was
> sent to should also be included? For example, say, using the example
> you gave, an update request was sent like so:
> java -Durl=http://localhost:7574/solr/collection1/update
> -Dshards=localhost:8983/solr -jar post.jar <list of XML files>
> 
> should the documents be indexed exclusively to the 'shards list' (ie.
> just localhost:8983/solr) or the 'shards list' & the server the
> request was sent to? So specifying something like this:
> java -Durl=http://localhost:7574/solr/collection1/update
> -Dshards=localhost:7574/solr -jar post.jar <list of XML files>
> would be equivalent to:
> java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar
> <list of XML files>

I reckon we should stick exclusively to the list the user specifies. I
would personally find it strange behaviour if documents were also
indexed to the shard the request was sent to.

For example, a user like me (as thick as a brick) might try to index a
document on shard localhost:8983 when in fact the request was pointed
at localhost:7574. If the document got indexed on both, the user would
get terribly confused and think he had accidentally broken Solr.

My ideal situation is: unless I... err... I mean unless the "user"
explicitly specifies the shard he wants to index to, documents
shouldn't be indexed at that shard.

> For a default interface to decide which shard to index to, we were
> thinking of using either a simple hash function on the document's
> uniqueKey modulo the number of shards specified in the list (as
> mentioned here:
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Indexing) or
> some sort of round robin method, indexing a document to each shard in
> turn, until there are no more documents left to index.

Good point, I completely missed that out last time. 
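The two candidate default policies quoted above could be sketched like this (the interface and class names are made up for the example, not existing Solr API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the two candidate default distribution policies.
interface ShardDistributionPolicy {
    /** Index (into the shards list) that the given document should go to. */
    int shardFor(String uniqueKey, int numShards);
}

// Hash of the document's uniqueKey modulo the number of shards,
// as suggested on the DistributedSearch wiki page.
class HashedDistributionPolicy implements ShardDistributionPolicy {
    public int shardFor(String uniqueKey, int numShards) {
        // Mask off the sign bit rather than using Math.abs, which
        // overflows for Integer.MIN_VALUE.
        return (uniqueKey.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}

// Round robin: each document goes to the next shard in turn.
class RoundRobinDistributionPolicy implements ShardDistributionPolicy {
    private final AtomicInteger next = new AtomicInteger();
    public int shardFor(String uniqueKey, int numShards) {
        return next.getAndIncrement() % numShards;
    }
}
```

The hash policy keeps a given uniqueKey on the same shard across requests; round robin balances load but gives no such guarantee.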

> Also, how will we deal with failures? Should we simply return a list
> of all documents which weren't indexed or have a retry period after
> the initial indexing?

Well, I was thinking of something along the lines of what download
managers do - either retry if the shard is busy, or fail if the shard
is somehow inaccessible. The documents that failed should then be spat
out by Solr.
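That retry-or-fail idea might look something like this. The Poster interface is a stand-in for the real HTTP call, and every name here is an assumption for illustration only:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the download-manager-style retry idea; not real Solr code.
class RetrySketch {
    interface Poster { void post(String shardUrl, String doc) throws IOException; }

    static class ShardBusyException extends IOException {}

    /** Returns the docs that could not be indexed, to be reported back to the user. */
    static List<String> indexAll(Poster poster, String shardUrl,
                                 List<String> docs, int maxRetries) {
        List<String> failed = new ArrayList<>();
        for (String doc : docs) {
            boolean ok = false;
            for (int attempt = 0; attempt <= maxRetries && !ok; attempt++) {
                try {
                    poster.post(shardUrl, doc);   // stand-in for the real HTTP POST
                    ok = true;
                } catch (ShardBusyException e) {
                    // shard busy: loop round and retry (a real impl would back off)
                } catch (IOException e) {
                    break;                        // shard unreachable: fail this doc now
                }
            }
            if (!ok) failed.add(doc);
        }
        return failed;
    }
}
```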

Are we planning on having a GUI front-end to this? I mean not now,
given we have two weeks to do this, but is one of the possible future
goals to implement a front-end UI so that the user can index documents
painlessly? If so, I suggest we also consider having an XML output, or
some kind of output that can easily be parsed into XML.

Soheb






Re: Distributed Indexing

Posted by Todd Nine <to...@spidertracks.com>.
Just throwing in my 2 cents. If you're on a tight deadline, have you had a
look at Solandra? We were already using Cassandra, so it was incredibly
easy to get a scalable Solr installation up and running.

On 27 January 2011 08:17, Alex Cowell <al...@gmail.com> wrote:


Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
Hi Soheb,

Sounds good! A few things I thought of:

With regard to #1, would the list of shards to index to (if present) be
exclusive or would we assume that the shard the update request was sent to
should also be included? For example, say, using the example you gave, an
update request was sent like so:
java -Durl=http://localhost:7574/solr/collection1/update
-Dshards=localhost:8983/solr -jar post.jar <list of XML files>

should the documents be indexed exclusively to the 'shards list' (ie. just
localhost:8983/solr) or the 'shards list' & the server the request was sent
to? So specifying something like this:
java -Durl=http://localhost:7574/solr/collection1/update
-Dshards=localhost:7574/solr -jar post.jar <list of XML files>
would be equivalent to:
java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar <list
of XML files>

For a default interface to decide which shard to index to, we were thinking
of using either a simple hash function on the document's uniqueKey modulo
the number of shards specified in the list (as mentioned here:
http://wiki.apache.org/solr/DistributedSearch#Distributed_Indexing) or some
sort of round robin method, indexing a document to each shard in turn, until
there are no more documents left to index.

Also, how will we deal with failures? Should we simply return a list of all
documents which weren't indexed or have a retry period after the initial
indexing?

Regards,

Alex


On Wed, Jan 26, 2011 at 4:29 PM, Soheb Mahmood <so...@gmail.com> wrote:

Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
Hi Yonik and Upayavira,

Thank you both for your insightful responses. We now have a much better
understanding of how to implement distributed indexing, although no doubt
more issues will emerge along the way.

Just to clarify (and for critique), our approach goes something like this:
We will use a DistributedUpdateRequestHandler to process an update request
when a 'shards' parameter is present in the URL (as with distributed
search). For example

http://localhost:8983/solr/collection1/update?shards=localhost:8983/solr,localhost:7574/solr

will index the docs across both servers specified. Of course, as Yonik
suggested, this could easily be extended (by using a different URL or
additional params) to handle an entire cluster or a logical shard.

The server would then use the information received from the request
handler to add the documents to the index. To do this, a
ShardPolicy/ShardDistributionPolicy would be consulted (perhaps
specified in solrconfig.xml, with a default if none is given), which
would decide which shard to send each document to. The documents would
then be forwarded on to their respective shards to be indexed.
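As a sketch of that handler-side flow (with invented names, a hash-on-uniqueKey stand-in for the consulted policy, and doc IDs standing in for whole documents): group the incoming docs by the shard the policy assigns them to, so each group can be forwarded to its shard's update handler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative grouping step; not real Solr code.
class DistributionSketch {
    static Map<String, List<String>> partition(List<String> shards,
                                               List<String> uniqueKeys) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (String shard : shards) {
            byShard.put(shard, new ArrayList<>());
        }
        for (String key : uniqueKeys) {
            // Stand-in for the consulted ShardDistributionPolicy.
            int i = (key.hashCode() & Integer.MAX_VALUE) % shards.size();
            byShard.get(shards.get(i)).add(key);
        }
        return byShard;
    }
}
```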

We'll be sure to keep the mailing list posted on our progress.

Thanks,

Alex

Re: Distributed Indexing

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Jan 28, 2011 at 7:55 AM, Upayavira <uv...@odoko.co.uk> wrote:
>
> On Thu, 27 Jan 2011 16:01 +0000, "Alex Cowell" <al...@gmail.com> wrote:
>
> Making it easy for clients I think is key... one should be able to
> update any node in the solr cluster and have solr take care of the
> hard part about updating all relevant shards.  This will most likely
> involve an update processor.  This approach allows all existing update
> methods (including things like CSV file upload) to still work
> correctly.
>
> Does that then imply that distributed indexing would become the default
> method of indexing?

Should be possible I think.  Seems nice to be able to use the exact
same update command w/o worrying if it's a cluster or not.

> What if a user, for some reason, wanted to only target
> one specific node in a cluster?

Yeah, that should always be possible too.
We could either utilize different URLs, or use a parameter.

> Surely it would be just the same as distributed search. If you provide a
> 'shards' request parameter, your content is distributed amongst shards. If
> you don't, it goes directly to the host you are posting to. Control remains
> in the hand of the person accessing Solr.

People sometimes set up a request handler like /search that sets
shards to the correct value by default.
Now with SolrCloud, one can also just use distrib=true and not worry
about the shards param at all.

Off the top of my head, the following make sense to me:
- target cluster (solr decides what shards the docs belong on, via
solr hash on id, or user supplied hash on id)
- target single solr core (i.e. a single physical shard)
- target logical shard (i.e. solr does not decide what shard these
docs live on, but does handle the distrib update to all replicas)

-Yonik
http://lucidimagination.com



Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
On Thu, 27 Jan 2011 16:01 +0000, "Alex Cowell"
<al...@gmail.com> wrote:

  Making it easy for clients I think is key... one should be able to
  update any node in the solr cluster and have solr take care of the
  hard part about updating all relevant shards.  This will most likely
  involve an update processor.  This approach allows all existing
  update methods (including things like CSV file upload) to still work
  correctly.

  Does that then imply that distributed indexing would become the
  default method of indexing? What if a user, for some reason, wanted
  to only target one specific node in a cluster? Wouldn't a message
  need to be sent to the server ahead of the documents stating what
  method of indexing to use, or is this behaviour not necessary?


Surely it would be just the same as distributed search. If you
provide a 'shards' request parameter, your content is distributed
amongst shards. If you don't, it goes directly to the host you
are posting to. Control remains in the hands of the person
accessing Solr.

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
>
> Making it easy for clients I think is key... one should be able to
> update any node in the solr cluster and have solr take care of the
> hard part about updating all relevant shards.  This will most likely
> involve an update processor.  This approach allows all existing update
> methods (including things like CSV file upload) to still work
> correctly.
>

Does that then imply that distributed indexing would become the default
method of indexing? What if a user, for some reason, wanted to only target
one specific node in a cluster? Wouldn't a message need to be sent to the
server ahead of the documents stating what method of indexing to use or is
this behaviour not necessary?

Alex

Re: Distributed Indexing

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Feb 14, 2011 at 10:04 AM, Alex Cowell <al...@gmail.com> wrote:
> There seem to be some nuances which we have yet to encounter/discover like
> the way you've implemented the processCommit() method to wait for all the
> adds/deletes to complete before sending the commits. Are these things which
> you were aware of in advance that would need to be dealt with?

Yeah, it really just has to do with the fact that I was using multiple
threads to send update commands to other nodes.
This means that if you do an add and then a delete of the same doc,
those could get reordered to do the delete first, and then the add.
And the commit at the end could sneak in front of some adds and
deletes still in progress on other threads.

For true distributed indexing, I think we'll want a version number
somehow (perhaps based on timestamp by default) so updates can be
ordered, and all nodes can agree on the ordering.  For example, one
client could update node A with doc X, and a different client could
update node B with doc X.  If that happens very close together, we
need all shard replicas to agree which doc X will win.
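The version-number idea could be sketched like this. All names and fields here are invented for illustration: each update carries a version (timestamp-based by default), and every replica applies the same last-write-wins rule, so all copies of doc X converge on the same winner regardless of arrival order.

```java
import java.util.HashMap;
import java.util.Map;

// A versioned update; the version would be assigned once, e.g. at the
// first node that receives the update.
class VersionedDoc {
    final String id;
    final long version;
    final String body;
    VersionedDoc(String id, long version, String body) {
        this.id = id; this.version = version; this.body = body;
    }
}

// One replica's index, applying last-write-wins by version.
class ReplicaSketch {
    private final Map<String, VersionedDoc> index = new HashMap<>();

    /** Apply an update only if it is newer than what we already have. */
    boolean apply(VersionedDoc update) {
        VersionedDoc current = index.get(update.id);
        if (current != null && current.version >= update.version) {
            return false;              // stale or duplicate: ignore
        }
        index.put(update.id, update);
        return true;
    }

    String bodyOf(String id) {
        VersionedDoc d = index.get(id);
        return d == null ? null : d.body;
    }
}
```

Even if two replicas see the two conflicting updates in opposite orders, both end up holding the higher-versioned document.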

-Yonik
http://lucidimagination.com



Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
I've uploaded a patch of what we've done so far:

https://issues.apache.org/jira/browse/SOLR-2358

It's still very much a work in progress and there are some obvious
issues being resolved at the moment (such as the inefficient method of
waiting for all the docs to be processed before distributing them in
one batch, and handling shard replicas), but any feedback is welcome.

As it stands, you can distribute add and commit requests using the
HashedDistributionPolicy by simply specifying a 'shards' request parameter.
Using a user-specified distribution policy (either as a param in the URL or
defined in the solrconfig, as Upayavira suggested) is in the works as well.
Regarding that, I figure the priority for determining which policy to use
would be (highest to lowest):

1. Param in the URL
2. Specified in the solrconfig
3. Hard-coded default to fall back on

That way if a user changed their mind about which distribution policy they
wanted to use, they could override the default policy with their chosen one
as a request parameter.
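In code, that priority order amounts to a simple fall-through (the method name and the default class name are assumptions for this sketch):

```java
// Sketch of the proposed policy resolution order; not real Solr code.
class PolicyResolution {
    static String resolvePolicy(String urlParam, String solrconfigValue) {
        if (urlParam != null) return urlParam;                // 1. param in the URL
        if (solrconfigValue != null) return solrconfigValue;  // 2. solrconfig
        return "solr.HashedDistributionPolicy";               // 3. hard-coded default
    }
}
```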

The code has only been acceptance tested so far. There is a test
class, but it's a bit messy, so once it's tidied up and improved a
little more I'll include it in the next patch.


> I haven't had time to follow all of this discussion, but this issue might
> help:
> https://issues.apache.org/jira/browse/SOLR-2355
>

Thanks - very interesting! It's reassuring to see our implementation has
been following a similar structure.

There seem to be some nuances which we have yet to encounter/discover like
the way you've implemented the processCommit() method to wait for all the
adds/deletes to complete before sending the commits. Are these things which
you were aware of in advance that would need to be dealt with?

Alex

Re: Distributed Indexing

Posted by Yonik Seeley <yo...@lucidimagination.com>.
I haven't had time to follow all of this discussion, but this issue might help:
https://issues.apache.org/jira/browse/SOLR-2355

It's an implementation of the basic
http://localhost:8983/solr/update/csv?shards=shard1,shard2...

-Yonik
http://lucidimagination.com

On Mon, Feb 7, 2011 at 8:55 AM, Upayavira <uv...@odoko.co.uk> wrote:



Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
Surely you want to be implementing an UpdateRequestProcessor,
rather than a RequestHandler.

The ContentStreamHandlerBase, in its handleRequestBody method,
gets an UpdateRequestProcessor and uses it to process the
request. What we need is for that handleRequestBody method to, as
you have suggested, check the shards parameter and, if necessary,
call a different UpdateRequestProcessor (a
DistributedUpdateRequestProcessor).

I don't think we really need it to be configurable at this point.
The ContentStreamHandlerBase could just use a single hardwired
implementation. If folks want a choice of
DistributedUpdateRequestProcessors, it can be added later.

For configuration, the DistributedUpdateRequestProcessor should
get its config from the parent RequestHandler. The configuration
I'm most interested in is the DistributionPolicy. And that can be
done with a distributionPolicyClass=solr.IDHashDistributionPolicy
request parameter, which could potentially be configured in
solrconfig.xml as an invariant, or provided in the request by the
user if necessary.

So, I'd avoid another "thing" that needs to be configured unless
there are real benefits to it (which there don't seem to me to be
right now).
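A rough sketch of how that distributionPolicyClass request parameter
could be resolved via reflection, falling back to the default policy.
The interface and class names here are placeholders, not real Solr
classes:

```java
// Placeholder types for the sketch.
interface DistributionPolicy {}

class IDHashDistributionPolicy implements DistributionPolicy {}

// Load the policy named by the request parameter, or fall back to the
// default if the class cannot be found or instantiated.
class PolicyLoader {
    static DistributionPolicy load(String className) {
        try {
            return (DistributionPolicy) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return new IDHashDistributionPolicy();
        }
    }
}
```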

Upayavira

On Sun, 06 Feb 2011 23:08 +0000, "Alex Cowell"
<al...@gmail.com> wrote:

--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
Hey,

We're making good progress, but our DistributedUpdateRequestHandler is
having a bit of an identity crisis, so we thought we'd ask what other
people's opinions are. The current situation is as follows:

We've added a method to ContentStreamHandlerBase to check if an update
request is distributed or not (based on the presence/validity of the
'shards' parameter). So a non-distributed request will proceed as normal,
while a distributed request will be passed on to the
DistributedUpdateRequestHandler to deal with.

The reason this choice is made in the ContentStreamHandlerBase is so that
the DistributedUpdateRequestHandler can use the URL the request came in on
to determine where to distribute update requests. Eg. an update request is
sent to:
http://localhost:8983/solr/update/csv?shards=shard1,shard2...
then the DistributedUpdateRequestHandler knows to send requests to:
shard1/update/csv
shard2/update/csv

Alternatively, if the request wasn't distributed, it would simply be handled
by whichever request handler "/update/csv" uses.
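As a rough sketch of the routing just described (the class and method names here are hypothetical, not actual Solr code), deriving per-shard target URLs from the incoming handler path and the shards parameter might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not actual Solr code) illustrating the routing above:
// given the handler path the request arrived on (e.g. "/update/csv") and the
// value of the "shards" parameter, build one target URL per shard.
public class ShardUrlBuilder {

    public static List<String> buildShardUrls(String handlerPath, String shardsParam) {
        List<String> urls = new ArrayList<String>();
        if (shardsParam == null || shardsParam.trim().length() == 0) {
            return urls; // not a distributed request
        }
        for (String shard : shardsParam.split(",")) {
            shard = shard.trim();
            if (shard.length() == 0) continue;
            // shards are often given without a scheme, e.g. "localhost:8983/solr"
            String base = shard.startsWith("http://") ? shard : "http://" + shard;
            urls.add(base + handlerPath);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : buildShardUrls("/update/csv",
                "localhost:8983/solr,localhost:7574/solr")) {
            System.out.println(url);
        }
    }
}
```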

Herein lies the problem. The DistributedUpdateRequestHandler is not really a
request handler in the same way as the CSVRequestHandler or
XmlUpdateRequestHandlers are. If anything, it's more like a "plugin" for the
various existing update request handlers, to allow them to deal with
distributed requests - a "distributor" if you will. It isn't designed to be
able to receive and handle requests directly.

We would like this "DistributedUpdateRequestHandler" to be defined in the
solrconfig to allow flexibility for setting up multiple different
DistributedUpdateRequestHandlers with different ShardDistributionPolicies
etc., and also to allow us to get the appropriate instance from the core in
the code. There seem to be two paths for doing this:

1. Leave it as an implementation of SolrRequestHandler and hope the user
doesn't directly send update requests to it (i.e. a request to
http://localhost:8983/solr/<distrib update handler path> would most likely
cripple something). So it would be defined in the solrconfig something like:
<requestHandler name="distrib-update"
class="solr.DistributedUpdateRequestHandler" />

2. Create a new plugin type for the solrconfig, say
"updateRequestDistributor" which would involve creating a new interface for
the DistributedUpdateRequestHandler to implement, then registering it with
the core. It would be defined in the solrconfig something like:
<updateRequestDistributor name="distrib-update"
class="solr.DistributedUpdateRequestHandler">
  <lst name="defaults">
    <str name="policy">solr.HashedDistributionPolicy</str>
  </lst>
</updateRequestDistributor>

This would mean that it couldn't directly receive requests, but that an
instance could still easily be retrieved from the core to handle the
distribution of update requests.

Any thoughts on the above issue (or a more succinct, descriptive name for
the class) are most welcome!

Alex

Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
I'm saying that deterministic policies are a requirement that
*some* people will want. Others might want a random spread. Thus,
I'd have deterministic based on ID and random as the two initial
implementations.

Upayavira
NB. In case folks haven't worked it out already, I have been
tasked to mentor this group of students in this work, and had the
fortune to be able to point them to a task I've already thought a
lot about myself, but had no time to do :-)
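The two initial implementations suggested above, deterministic by ID and random, could be sketched roughly like this (an illustrative sketch only; the interface and class names echo the thread's naming, not the actual SOLR-2341 patch):

```java
import java.util.Random;

// Illustrative sketch of the ShardDistributionPolicy idea discussed in this
// thread; names mirror the discussion but are not the actual patch's API.
public class ShardPolicySketch {

    interface ShardDistributionPolicy {
        /** Return the index (0..numShards-1) of the shard a document goes to. */
        int shardFor(String docId, int numShards);
    }

    /** Deterministic: the same document id always maps to the same shard. */
    static class HashedDistributionPolicy implements ShardDistributionPolicy {
        public int shardFor(String docId, int numShards) {
            // mask the sign bit so the modulus is never negative
            return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
        }
    }

    /** Random spread: no guarantee where a given id ends up. */
    static class RandomDistributionPolicy implements ShardDistributionPolicy {
        private final Random random = new Random();
        public int shardFor(String docId, int numShards) {
            return random.nextInt(numShards);
        }
    }

    public static void main(String[] args) {
        ShardDistributionPolicy hashed = new HashedDistributionPolicy();
        // deterministic: repeated calls agree
        System.out.println(hashed.shardFor("doc-1", 4) == hashed.shardFor("doc-1", 4)); // prints true
    }
}
```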

On Sun, 06 Feb 2011 21:57 +0000, "William Mayor"
<ma...@williammayor.co.uk> wrote:

  Hi

  Good call about the policies being deterministic, should've
  thought of that earlier.

  We've changed the patch to include this and I've removed the
  random assignment one (for obvious reasons).

  Take a look and let me know what's to do.
  (https://issues.apache.org/jira/browse/SOLR-2341)

  Cheers

  William
On Thu, Feb 3, 2011 at 5:00 PM, Upayavira <uv...@odoko.co.uk>
wrote:


On Thu, 03 Feb 2011 15:12 +0000, "Alex Cowell"
<al...@gmail.com> wrote:

  Hi all,
  Just a couple of questions that have arisen.
  1. For handling non-distributed update requests (shards param
  is not present or is invalid), our code currently
  * assumes the user would like the data indexed, so gets the
    request handler assigned to "/update"
  * executes the request using core.execute() for the SolrCore
    associated with the original request

  Is this what we want it to do and is using core.execute() from
  within a request handler a valid method of passing on the
  update request?


Take a look at how it is done in
handler.component.SearchHandler.handleRequestBody(). I'd say try
to follow a similar approach as far as possible. E.g. it is the
SearchHandler that does much of the work, branching depending on
whether it found a shards parameter.

  2. We have partially implemented an update processor which
  actually generates and sends the split update requests to each
  specified shard (as designated by the policy). As it stands,
  the code shares a lot in common with the HttpCommComponent
  class used for distributed search. Should we look at "opening
  up" the HttpCommComponent class so it could be used by our
  request handler as well or should we continue with our current
  implementation and worry about that later?


I agree that you are going to want to implement an
UpdateRequestProcessor. However, it would seem to me that, unlike
search, you're not going to want to bother with the existing
processor and associated component chain, you're going to want to
replace the processor with a distributed version.

As to the HttpCommComponent, I'd suggest you make your own
educated decision. How similar is the class? Could one serve both
needs effectively?

  3. Our update processor uses a
  MultiThreadedHttpConnectionManager to send parallel updates to
  shards, can anyone give some appropriate values to be used for
  the defaultMaxConnectionsPerHost and maxTotalConnections
  params? Won't the values used for distributed search be a
  little high for distributed indexing?


You are right, these will likely be lower for distributed
indexing, however I'd suggest not worrying about it for now, as
it is easy to tweak later.

Upayavira

---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source

--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


Re: Distributed Indexing

Posted by William Mayor <ma...@williammayor.co.uk>.
Hi

Good call about the policies being deterministic, should've thought of that
earlier.

We've changed the patch to include this and I've removed the random
assignment one (for obvious reasons).

Take a look and let me know what's to do. (
https://issues.apache.org/jira/browse/SOLR-2341)

Cheers

William

On Thu, Feb 3, 2011 at 5:00 PM, Upayavira <uv...@odoko.co.uk> wrote:

>
>  On Thu, 03 Feb 2011 15:12 +0000, "Alex Cowell" <al...@gmail.com> wrote:
>
> Hi all,
>
> Just a couple of questions that have arisen.
>
> 1. For handling non-distributed update requests (shards param is not
> present or is invalid), our code currently
>
>    - assumes the user would like the data indexed, so gets the request
>    handler assigned to "/update"
>    - executes the request using core.execute() for the SolrCore associated
>    with the original request
>
> Is this what we want it to do and is using core.execute() from within a
> request handler a valid method of passing on the update request?
>
>
> Take a look at how it is done in
> handler.component.SearchHandler.handleRequestBody(). I'd say try to follow
> a similar approach as far as possible. E.g. it is the SearchHandler that does much
> of the work, branching depending on whether it found a shards parameter.
>
>
> 2. We have partially implemented an update processor which actually
> generates and sends the split update requests to each specified shard (as
> designated by the policy). As it stands, the code shares a lot in common
> with the HttpCommComponent class used for distributed search. Should we look
> at "opening up" the HttpCommComponent class so it could be used by our
> request handler as well or should we continue with our current
> implementation and worry about that later?
>
>
> I agree that you are going to want to implement an UpdateRequestProcessor.
> However, it would seem to me that, unlike search, you're not going to want
> to bother with the existing processor and associated component chain, you're
> going to want to replace the processor with a distributed version.
>
> As to the HttpCommComponent, I'd suggest you make your own educated
> decision. How similar is the class? Could one serve both needs effectively?
>
>
> 3. Our update processor uses a MultiThreadedHttpConnectionManager to send
> parallel updates to shards, can anyone give some appropriate values to be
> used for the defaultMaxConnectionsPerHost and maxTotalConnections params?
> Won't the values used for distributed search be a little high for
> distributed indexing?
>
>
> You are right, these will likely be lower for distributed indexing, however
> I'd suggest not worrying about it for now, as it is easy to tweak later.
>
> Upayavira
>
>  ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>

Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
On Thu, 03 Feb 2011 15:12 +0000, "Alex Cowell"
<al...@gmail.com> wrote:

  Hi all,
  Just a couple of questions that have arisen.
  1. For handling non-distributed update requests (shards param
  is not present or is invalid), our code currently
  * assumes the user would like the data indexed, so gets the
    request handler assigned to "/update"
  * executes the request using core.execute() for the SolrCore
    associated with the original request

  Is this what we want it to do and is using core.execute() from
  within a request handler a valid method of passing on the
  update request?


Take a look at how it is done in
handler.component.SearchHandler.handleRequestBody(). I'd say try
to follow a similar approach as far as possible. E.g. it is the
SearchHandler that does much of the work, branching depending on
whether it found a shards parameter.

  2. We have partially implemented an update processor which
  actually generates and sends the split update requests to each
  specified shard (as designated by the policy). As it stands,
  the code shares a lot in common with the HttpCommComponent
  class used for distributed search. Should we look at "opening
  up" the HttpCommComponent class so it could be used by our
  request handler as well or should we continue with our current
  implementation and worry about that later?


I agree that you are going to want to implement an
UpdateRequestProcessor. However, it would seem to me that, unlike
search, you're not going to want to bother with the existing
processor and associated component chain, you're going to want to
replace the processor with a distributed version.

As to the HttpCommComponent, I'd suggest you make your own
educated decision. How similar is the class? Could one serve both
needs effectively?

  3. Our update processor uses a
  MultiThreadedHttpConnectionManager to send parallel updates to
  shards, can anyone give some appropriate values to be used for
  the defaultMaxConnectionsPerHost and maxTotalConnections
  params? Won't the values used for distributed search be a
  little high for distributed indexing?


You are right, these will likely be lower for distributed
indexing, however I'd suggest not worrying about it for now, as
it is easy to tweak later.
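For illustration, the parallel fan-out of updates to shards raised in question 3 could be sketched with only the JDK, standing in for the MultiThreadedHttpConnectionManager approach; the per-shard HTTP POST is stubbed out and all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stdlib-only sketch of sending updates to several shards in parallel with a
// bounded degree of concurrency. A real implementation would use an HTTP
// client with per-host and total connection limits; the POST is stubbed here.
public class ParallelUpdateSketch {

    static String sendToShard(String shardUrl) {
        // placeholder for the actual HTTP POST of the update to this shard
        return "OK " + shardUrl;
    }

    public static List<String> fanOut(List<String> shardUrls, int maxParallel)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String url : shardUrls) {
                futures.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return sendToShard(url);
                    }
                }));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks; rethrows any per-shard failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fanOut(
                Arrays.asList("shard1/update", "shard2/update"), 2));
    }
}
```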

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
Hi all,

Just a couple of questions that have arisen.

1. For handling non-distributed update requests (shards param is not present
or is invalid), our code currently

   - assumes the user would like the data indexed, so gets the request
   handler assigned to "/update"
   - executes the request using core.execute() for the SolrCore associated
   with the original request

Is this what we want it to do and is using core.execute() from within a
request handler a valid method of passing on the update request?

2. We have partially implemented an update processor which actually
generates and sends the split update requests to each specified shard (as
designated by the policy). As it stands, the code shares a lot in common
with the HttpCommComponent class used for distributed search. Should we look
at "opening up" the HttpCommComponent class so it could be used by our
request handler as well or should we continue with our current
implementation and worry about that later?

3. Our update processor uses a MultiThreadedHttpConnectionManager to send
parallel updates to shards, can anyone give some appropriate values to be
used for the defaultMaxConnectionsPerHost and maxTotalConnections params?
Won't the values used for distributed search be a little high for
distributed indexing?

Thanks,

Alex

Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.

On Tue, 01 Feb 2011 19:52 -0800, "Lance Norskog" <go...@gmail.com>
wrote:
> Another use case is that N indexers operate independently, all pulling
> data from the  same database. Each has a separate query to get the
> documents in its policy.

But surely in this case, you are externalising the policy, and Solr
doesn't need to know about it? I.e. your indexers are deciding what goes
in what shard, not Solr?

Upayavira

> On Tue, Feb 1, 2011 at 12:38 PM, Upayavira <uv...@odoko.co.uk> wrote:
> >
> > On Tue, 01 Feb 2011 19:04 +0000, "Alex Cowell" <al...@gmail.com> wrote:
> >
> > I noticed there is a comment in the
> > org.apache.solr.servlet.DirectSolrConnection class which reads, "//Find a
> > way to turn List<ContentStream> into File/SolrDocument". Did anyone find a
> > way to do this?
> >
> > Turns out that comment was left over from some experimenting one of our team
> > was doing. But I suppose the question still stands.
> >
> > Addressing the "retrieve the unique ID from the document" issue, does it
> > matter if the unique ID you do the hash on is the actual uniqueKey of the
> > document? Surely as long as you generate some value unique for each document
> > to index (for example, the name of the doc/stream + the current time) it
> > would still distribute the documents as we expect?
> >
> >
> > Well, one requirement I've heard for this is for it to be deterministic.
> > That is, a document will always go to the same shard, and you can work out
> > at any point in time where a particular document is.
> >
> > Once you've parsed the document to a SolrInputDocument, surely you can get
> > the ID/uniqueKey out? I'll do some digging tomorrow AM.
> >
> > Upayavira
> >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Distributed Indexing

Posted by Lance Norskog <go...@gmail.com>.
Another use case is that N indexers operate independently, all pulling
data from the  same database. Each has a separate query to get the
documents in its policy.

On Tue, Feb 1, 2011 at 12:38 PM, Upayavira <uv...@odoko.co.uk> wrote:
>
> On Tue, 01 Feb 2011 19:04 +0000, "Alex Cowell" <al...@gmail.com> wrote:
>
> I noticed there is a comment in the
> org.apache.solr.servlet.DirectSolrConnection class which reads, "//Find a
> way to turn List<ContentStream> into File/SolrDocument". Did anyone find a
> way to do this?
>
> Turns out that comment was left over from some experimenting one of our team
> was doing. But I suppose the question still stands.
>
> Addressing the "retrieve the unique ID from the document" issue, does it
> matter if the unique ID you do the hash on is the actual uniqueKey of the
> document? Surely as long as you generate some value unique for each document
> to index (for example, the name of the doc/stream + the current time) it
> would still distribute the documents as we expect?
>
>
> Well, one requirement I've heard for this is for it to be deterministic.
> That is, a document will always go to the same shard, and you can work out
> at any point in time where a particular document is.
>
> Once you've parsed the document to a SolrInputDocument, surely you can get
> the ID/uniqueKey out? I'll do some digging tomorrow AM.
>
> Upayavira
>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source



-- 
Lance Norskog
goksron@gmail.com



Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
On Tue, 01 Feb 2011 19:04 +0000, "Alex Cowell"
<al...@gmail.com> wrote:

  I noticed there is a comment in the
  org.apache.solr.servlet.DirectSolrConnection class which
  reads, "//Find a way to turn List<ContentStream> into
  File/SolrDocument". Did anyone find a way to do this?

  Turns out that comment was left over from some experimenting
  one of our team was doing. But I suppose the question still
  stands.
  Addressing the "retrieve the unique ID from the document"
  issue, does it matter if the unique ID you do the hash on is
  the actual uniqueKey of the document? Surely as long as you
  generate some value unique for each document to index (for
  example, the name of the doc/stream + the current time) it
  would still distribute the documents as we expect?


Well, one requirement I've heard for this is for it to be
deterministic. That is, a document will always go to the same
shard, and you can work out at any point in time where a
particular document is.

Once you've parsed the document to a SolrInputDocument, surely
you can get the ID/uniqueKey out? I'll do some digging tomorrow
AM.

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
>
> I noticed there is a comment in the
> org.apache.solr.servlet.DirectSolrConnection class which reads, "//Find a
> way to turn List<ContentStream> into File/SolrDocument". Did anyone find a
> way to do this?
>

Turns out that comment was left over from some experimenting one of our team
was doing. But I suppose the question still stands.

Addressing the "retrieve the unique ID from the document" issue, does it
matter if the unique ID you do the hash on is the actual uniqueKey of the
document? Surely as long as you generate some value unique for each document
to index (for example, the name of the doc/stream + the current time) it
would still distribute the documents as we expect?

Alex

Re: Distributed Indexing

Posted by Alex Cowell <al...@gmail.com>.
>
> Your code looks fine to me, except it should take in a SolrDocument
> object or list of, rather than strings. Then, for your Hash version, you
> can take a hash of the "id" field.
>

> As far as I can see I have access to a List<ContentStream> that
> represents all of the files being POSTed. Do I want to open these
> streams, get the info and then stream them out? This seems wasteful.


I noticed there is a comment in the
org.apache.solr.servlet.DirectSolrConnection class which reads, "//Find a
way to turn List<ContentStream> into File/SolrDocument". Did anyone find a
way to do this?

Re: Distributed Indexing

Posted by William Mayor <ma...@williammayor.co.uk>.
Hello

Thanks for your prompt reply.

In regards to using a SolrDocument instead of Strings (and I agree
that List<String> doesn't seem to be the best way of going) how do I
get reference to a SolrDoc?

As far as I can see I have access to a List<ContentStream> that
represents all of the files being POSTed. Do I want to open these
streams, get the info and then stream them out? This seems wasteful.

I had instead thought that the DistributedUpdateRequestHandler would
take this List<ContentStream>, create some kind of mapping between each
stream and a unique id and then pass the ids to the policy.

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira <uv...@odoko.co.uk> wrote:
> On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
> <ma...@williammayor.co.uk> wrote:
>> Hi Guys
>>
>> I've had a go at creating the ShardDistributionPolicy interface and a
>> few implementations. I've created a patch
>> (https://issues.apache.org/jira/browse/SOLR-2341) let me know what
>> needs doing.
>
>
>> Currently I assume that the documents passed to the policy will be
>> represented by some kind of identifier and that one needs only to
>> match the ID with a shard. This is better (I think) than reading the
>> document from the POST and figuring out some kind of unique
>> identifier?
>
> Your code looks fine to me, except it should take in a SolrDocument
> object or list of, rather than strings. Then, for your Hash version, you
> can take a hash of the "id" field.
>
>> A question we've had about this is who decides what policy to use and
>> where do they specify? I'm inclined to think that the user (the person
>> POSTing data) does not mind what policy is used but the administrator
>> might. This leads me to think that the policy should be set in the
>> solr config file? My colleagues disagree that the user will not mind
>> and would rather see the policy be specified in the url. We've noticed
>> that request handlers can be specified in both so should we adopt this
>> idea instead (and as a kind of compromise :) ).
>
> To stick with Solr conventions, you would specify the
> ShardDistributionPolicy in the solrconfig.xml, within the configuration
> of your DistributedUpdateRequestHandler, so in that sense, it is hidden
> from your users and managed by the administrator.
>
> However, if you follow this approach, an administrator could expose
> multiple policies by having multiple DistributedUpdateRequestHandler
> definitions in solrconfig.xml, with different URLs.
>
> To give you an example, but for search rather than indexing:
>
>  <requestHandler name="/dismax" class="solr.SearchHandler"
>  default="true">
>    <!-- default values for query parameters -->
>     <lst name="defaults">
>       <str name="defType">dismax</str>
>     </lst>
>  </requestHandler>
>
> This will configure requests to http://localhost:8983/solr/dismax?q=blah
>
> to be handled by the dismax query parser.
>
> More relevant to you:
>
>  <requestHandler name="/distrib" class="solr.SearchHandler"
>  default="true">
>    <!-- default values for query parameters -->
>     <lst name="defaults">
>       <str
>       name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
>     </lst>
>  </requestHandler>
>
> This would, by default, distribute all queries to
> http://localhost:8983/solr/distrib?q=blah across two Solr instances at
> the URLs described.
>
> For now, I'd say see if you can add a
> distributionPolicyClass="org.apache.solr.blah" to define the class that
> this updateRequestHandler is going to use.
>
> To everyone else who got this far - please chip in if you see better
> ways of doing this.
>
> Upayavira
>
>> All the best
>>
>> William
>>
>> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <uv...@odoko.co.uk> wrote:
>> > Lance,
>> >
>> > Firstly, we're proposing a ShardDistributionPolicy interface for which
>> > there is a default (mod of the doc ID) but other implementations are
>> > possible. Another easy implementation would be a randomised or round
>> > robin one.
>> >
>> > As to threading, the first task would be to put all of the source
>> > documents into "buckets", one bucket per shard, using the above
>> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
>> > of the documents in a "bucket" could be sent to the relevant shard for
>> > indexing (which would be nothing more than a normal HTTP post (or solrj
>> > call?)).
>> >
>> > As to whether this would be single threaded or multithreaded, I would
>> > guess we would aim to do it the same as the distributed search code
>> > (which I have not yet reviewed). However, it could presumably be
>> > single-threaded, but use asynchronous HTTP.
>> >
>> > Regards, Upayavira
>> >
>> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <go...@gmail.com>
>> > wrote:
>> >> I would suggest that a DistributedUpdateRequestHandler run
>> >> single-threaded, doing only one document at a time. If I want more
>> >> than one, I run it twice or N times with my own program.
>> >>
>> >> Also, this should have a policy object which decides exactly how
>> >> documents are distributed. There are different techniques for
>> >> different use cases.
>> >>
>> >> Lance
>> >>
>> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <so...@gmail.com>
>> >> wrote:
>> >> > Hello Yonik,
>> >> >
>> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> >> Making it easy for clients I think is key... one should be able to
>> >> >> update any node in the solr cluster and have solr take care of the
>> >> >> hard part about updating all relevant shards.  This will most likely
>> >> >> involve an update processor.  This approach allows all existing update
>> >> >> methods (including things like CSV file upload) to still work
>> >> >> correctly.
>> >> >>
>> >> >> Also post.jar is really just for testing... a command-line replacement
>> >> >> for "curl" for those who may not have it.  It's not really a
>> >> >> recommended way for updating Solr servers in production.
>> >> >
>> >> > OK, I've abandoned the post.jar tool idea in favour of a
>> >> > DistributedUpdateRequestProcessor class (I've been looking into other
>> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> >> > are used/what data they store - hence why I've taken some time to
>> >> > respond).
>> >> >
>> >> > My big question now is: is it necessary to have a Factory class for
>> >> > DistributedUpdateRequestProcessor? I've seen this pattern in many
>> >> > places, from RunUpdateProcessorFactory (where the factory class is
>> >> > only a few lines of code) to SignatureUpdateProcessorFactory. At first
>> >> > I thought it would be good design to include one (in a generic sense),
>> >> > but on reflection the DistributedUpdateRequestHandler would only be
>> >> > running once, taking in all the requests, so it seems rather pointless
>> >> > to write one.
>> >> >
>> >> > That is my "burning" question for now. I have got a few more questions,
>> >> > but I'm sure that when I look further into the code, I'll either have
>> >> > more or all of my questions are answered.
>> >> >
>> >> > Many thanks!
>> >> >
>> >> > Soheb Mahmood
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
<ma...@williammayor.co.uk> wrote:
> Hi Guys
> 
> I've had a go at creating the ShardDistributionPolicy interface and a
> few implementations. I've created a patch
> (https://issues.apache.org/jira/browse/SOLR-2341) let me know what
> needs doing.


> Currently I assume that the documents passed to the policy will be
> represented by some kind of identifier and that one needs only to
> match the ID with a shard. This is better (I think) than reading the
> document from the POST and figuring out some kind of unique
> identifier?

Your code looks fine to me, except it should take in a SolrDocument
object or list of, rather than strings. Then, for your Hash version, you
can take a hash of the "id" field.

> A question we've had about this is who decides what policy to use and
> where do they specify? I'm inclined to think that the user (the person
> POSTing data) does not mind what policy is used but the administrator
> might. This leads me to think that the policy should be set in the
> solr config file? My colleagues disagree that the user will not mind
> and would rather see the policy be specified in the url. We've noticed
> that request handlers can be specified in both so should we adopt this
> idea instead (and as a kind of compromise :) ).

To stick with Solr conventions, you would specify the
ShardDistributionPolicy in the solrconfig.xml, within the configuration
of your DistributedUpdateRequestHandler, so in that sense, it is hidden
from your users and managed by the administrator.

However, if you follow this approach, an administrator could expose
multiple policies by having multiple DistributedUpdateRequestHandler
definitions in solrconfig.xml, with different URLs.

To give you an example, but for search rather than indexing:

  <requestHandler name="/dismax" class="solr.SearchHandler"
  default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="defType">dismax</str>
     </lst>
  </requestHandler>

This will configure requests to http://localhost:8983/solr/dismax?q=blah

to be handled by the dismax query parser.

More relevant to you:

  <requestHandler name="/distrib" class="solr.SearchHandler"
  default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str
       name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
     </lst>
  </requestHandler>

This would, by default, distribute all queries to
http://localhost:8983/solr/distrib?q=blah across two Solr instances at
the URLs described.

For now, I'd say see if you can add a
distributionPolicyClass="org.apache.solr.blah" to define the class that
this updateRequestHandler is going to use.
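A sketch of how such a distributionPolicyClass attribute might be resolved at handler init time; Solr would normally load plugins through its resource loader, but plain reflection keeps the sketch self-contained, and all names here are illustrative:

```java
// Illustrative sketch of resolving a configured policy class name by
// reflection. Solr plugins are normally loaded via the core's resource
// loader; plain Class.forName is used here so the example stands alone.
public class PolicyLoader {

    public interface ShardDistributionPolicy {
        int shardFor(String docId, int numShards);
    }

    public static class HashedDistributionPolicy implements ShardDistributionPolicy {
        public int shardFor(String docId, int numShards) {
            return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
        }
    }

    /** Instantiate the policy named by the (hypothetical) config attribute. */
    public static ShardDistributionPolicy loadPolicy(String className) throws Exception {
        return (ShardDistributionPolicy) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // in real life this string would come from the solrconfig attribute
        String configured = HashedDistributionPolicy.class.getName();
        ShardDistributionPolicy policy = loadPolicy(configured);
        System.out.println(policy.getClass().getSimpleName()); // prints HashedDistributionPolicy
    }
}
```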

To everyone else who got this far - please chip in if you see better
ways of doing this.

Upayavira

> All the best
> 
> William
> 
> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <uv...@odoko.co.uk> wrote:
> > Lance,
> >
> > Firstly, we're proposing a ShardDistributionPolicy interface for which
> > there is a default (mod of the doc ID) but other implementations are
> > possible. Another easy implementation would be a randomised or round
> > robin one.
> >
> > As to threading, the first task would be to put all of the source
> > documents into "buckets", one bucket per shard, using the above
> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
> > of the documents in a "bucket" could be sent to the relevant shard for
> > indexing (which would be nothing more than a normal HTTP post (or solrj
> > call?)).
> >
> > As to whether this would be single threaded or multithreaded, I would
> > guess we would aim to do it the same as the distributed search code
> > (which I have not yet reviewed). However, it could presumably be
> > single-threaded, but use asynchronous HTTP.
> >
> > Regards, Upayavira
> >
> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <go...@gmail.com>
> > wrote:
> >> I would suggest that a DistributedRequestUpdateHandler run
> >> single-threaded, doing only one document at a time. If I want more
> >> than one, I run it twice or N times with my own program.
> >>
> >> Also, this should have a policy object which decides exactly how
> >> documents are distributed. There are different techniques for
> >> different use cases.
> >>
> >> Lance
> >>
> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <so...@gmail.com>
> >> wrote:
> >> > Hello Yonik,
> >> >
> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> >> >> Making it easy for clients I think is key... one should be able to
> >> >> update any node in the solr cluster and have solr take care of the
> >> >> hard part about updating all relevant shards.  This will most likely
> >> >> involve an update processor.  This approach allows all existing update
> >> >> methods (including things like CSV file upload) to still work
> >> >> correctly.
> >> >>
> >> >> Also post.jar is really just for testing... a command-line replacement
> >> >> for "curl" for those who may not have it.  It's not really a
> >> >> recommended way for updating Solr servers in production.
> >> >
> >> > OK, I've abandoned the post.jar tool idea in favour of a
> >> > DistributedUpdateRequestProcessor class (I've been looking into other
> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
> >> > are used/what data they store - hence why I've taken some time to
> >> > respond).
> >> >
> >> > My big question now is that is it necessary to have a Factory class for
> >> > DistributedUpdateRequestProcessor? I've seen this lots of times, as in
> >> > RunUpdateProcessorFactory (where the factory class was only a few lines
> >> > of code) to SignatureUpdateProcessorFactory? At first I was thinking it
> >> > would be a good design idea to include it in (in a generic sense), but
> >> > then I thought harder and I thought that the
> >> > DistributedUpdateRequestHandler would only be running once, taking in all
> >> > the requests, so it seems sort of pointless to write one in.
> >> >
> >> > That is my "burning" question for now. I have got a few more questions,
> >> > but I'm sure that when I look further into the code, I'll either have
> >> > more or all of my questions are answered.
> >> >
> >> > Many thanks!
> >> >
> >> > Soheb Mahmood
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> >
> >
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source




Re: Distributed Indexing

Posted by William Mayor <ma...@williammayor.co.uk>.
Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a
few implementations. I've created a patch
(https://issues.apache.org/jira/browse/SOLR-2341); let me know what
needs doing.

Currently I assume that the documents passed to the policy will be
represented by some kind of identifier and that one needs only to
match the ID with a shard. This is better (I think) than reading the
document from the POST and figuring out some kind of unique
identifier?

A question we've had about this is who decides what policy to use, and
where do they specify it? I'm inclined to think that the user (the person
POSTing data) does not mind what policy is used, but the administrator
might. This leads me to think that the policy should be set in the
solr config file. My colleagues disagree, thinking the user will mind,
and would rather see the policy specified in the URL. We've noticed
that request handlers can be specified in both, so should we adopt this
idea instead, as a kind of compromise? :)

All the best

William

On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <uv...@odoko.co.uk> wrote:
> Lance,
>
> Firstly, we're proposing a ShardDistributionPolicy interface for which
> there is a default (mod of the doc ID) but other implementations are
> possible. Another easy implementation would be a randomised or round
> robin one.
>
> As to threading, the first task would be to put all of the source
> documents into "buckets", one bucket per shard, using the above
> ShardDistributionPolicy to assign documents to buckets/shards. Then all
> of the documents in a "bucket" could be sent to the relevant shard for
> indexing (which would be nothing more than a normal HTTP post (or solrj
> call?)).
>
> As to whether this would be single threaded or multithreaded, I would
> guess we would aim to do it the same as the distributed search code
> (which I have not yet reviewed). However, it could presumably be
> single-threaded, but use asynchronous HTTP.
>
> Regards, Upayavira
>
> On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <go...@gmail.com>
> wrote:
>> I would suggest that a DistributedRequestUpdateHandler run
>> single-threaded, doing only one document at a time. If I want more
>> than one, I run it twice or N times with my own program.
>>
>> Also, this should have a policy object which decides exactly how
>> documents are distributed. There are different techniques for
>> different use cases.
>>
>> Lance
>>
>> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <so...@gmail.com>
>> wrote:
>> > Hello Yonik,
>> >
>> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> Making it easy for clients I think is key... one should be able to
>> >> update any node in the solr cluster and have solr take care of the
>> >> hard part about updating all relevant shards.  This will most likely
>> >> involve an update processor.  This approach allows all existing update
>> >> methods (including things like CSV file upload) to still work
>> >> correctly.
>> >>
>> >> Also post.jar is really just for testing... a command-line replacement
>> >> for "curl" for those who may not have it.  It's not really a
>> >> recommended way for updating Solr servers in production.
>> >
>> > OK, I've abandoned the post.jar tool idea in favour of a
>> > DistributedUpdateRequestProcessor class (I've been looking into other
>> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> > are used/what data they store - hence why I've taken some time to
>> > respond).
>> >
>> > My big question now is that is it necessary to have a Factory class for
>> > DistributedUpdateRequestProcessor? I've seen this lots of times, as in
>> > RunUpdateProcessorFactory (where the factory class was only a few lines
>> > of code) to SignatureUpdateProcessorFactory? At first I was thinking it
>> > would be a good design idea to include it in (in a generic sense), but
>> > then I thought harder and I thought that the
>> > DistributedUpdateRequestHandler would only be running once, taking in all
>> > the requests, so it seems sort of pointless to write one in.
>> >
>> > That is my "burning" question for now. I have got a few more questions,
>> > but I'm sure that when I look further into the code, I'll either have
>> > more or all of my questions are answered.
>> >
>> > Many thanks!
>> >
>> > Soheb Mahmood
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>
>
>



Re: Distributed Indexing

Posted by Soheb Mahmood <so...@gmail.com>.
(I'm sending this on behalf of William, a guy on our team working on
ShardDistributionPolicy):

Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a
few implementations. I've created a patch
(https://issues.apache.org/jira/browse/SOLR-2341); let me know what
needs doing.

Currently I assume that the documents passed to the policy will be
represented by some kind of identifier and that one needs only to
match the ID with a shard. This is better (I think) than reading the
document from the POST and figuring out some kind of unique
identifier?

A question we've had about this is who decides what policy to use, and
where do they specify it? I'm inclined to think that the user (the person
POSTing data) does not mind what policy is used, but the administrator
might. This leads me to think that the policy should be set in the
solr config file. My colleagues disagree, thinking the user will mind,
and would rather see the policy specified in the URL. We've noticed
that request handlers can be specified in both, so should we adopt this
idea instead, as a kind of compromise? :)

All the best

William

> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <uv...@odoko.co.uk> wrote:
> > Lance,
> >
> > Firstly, we're proposing a ShardDistributionPolicy interface for
> which
> > there is a default (mod of the doc ID) but other implementations are
> > possible. Another easy implementation would be a randomised or round
> > robin one.
> >
> > As to threading, the first task would be to put all of the source
> > documents into "buckets", one bucket per shard, using the above
> > ShardDistributionPolicy to assign documents to buckets/shards. Then
> all
> > of the documents in a "bucket" could be sent to the relevant shard
> for
> > indexing (which would be nothing more than a normal HTTP post (or
> solrj
> > call?)).
> >
> > As to whether this would be single threaded or multithreaded, I
> would
> > guess we would aim to do it the same as the distributed search code
> > (which I have not yet reviewed). However, it could presumably be
> > single-threaded, but use asynchronous HTTP.
> >
> > Regards, Upayavira
> >
> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <go...@gmail.com>
> > wrote:
> >> I would suggest that a DistributedRequestUpdateHandler run
> >> single-threaded, doing only one document at a time. If I want more
> >> than one, I run it twice or N times with my own program.
> >>
> >> Also, this should have a policy object which decides exactly how
> >> documents are distributed. There are different techniques for
> >> different use cases.
> >>
> >> Lance
> >>
> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood
> <so...@gmail.com>
> >> wrote:
> >> > Hello Yonik,
> >> >
> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> >> >> Making it easy for clients I think is key... one should be able
> to
> >> >> update any node in the solr cluster and have solr take care of
> the
> >> >> hard part about updating all relevant shards.  This will most
> likely
> >> >> involve an update processor.  This approach allows all existing
> update
> >> >> methods (including things like CSV file upload) to still work
> >> >> correctly.
> >> >>
> >> >> Also post.jar is really just for testing... a command-line
> replacement
> >> >> for "curl" for those who may not have it.  It's not really a
> >> >> recommended way for updating Solr servers in production.
> >> >
> >> > OK, I've abandoned the post.jar tool idea in favour of a
> >> > DistributedUpdateRequestProcessor class (I've been looking into
> other
> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how
> they
> >> > are used/what data they store - hence why I've taken some time to
> >> > respond).
> >> >
> >> > My big question now is that is it necessary to have a Factory
> class for
> >> > DistributedUpdateRequestProcessor? I've seen this lots of times,
> as in
> >> > RunUpdateProcessorFactory (where the factory class was only a few
> lines
> >> > of code) to SignatureUpdateProcessorFactory? At first I was
> thinking it
> >> > would be a good design idea to include it in (in a generic
> sense), but
> >> > then I thought harder and I thought that the
> >> > DistributedUpdateRequestHandler would only be running once, taking
> in all
> >> > the requests, so it seems sort of pointless to write one in.
> >> >
> >> > That is my "burning" question for now. I have got a few more
> questions,
> >> > but I'm sure that when I look further into the code, I'll either
> have
> >> > more or all of my questions are answered.
> >> >
> >> > Many thanks!
> >> >
> >> > Soheb Mahmood
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >>
> >>
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> >
> >
> >




Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
Lance,

Firstly, we're proposing a ShardDistributionPolicy interface for which
there is a default (mod of the doc ID) but other implementations are
possible. Another easy implementation would be a randomised or round
robin one.

As to threading, the first task would be to put all of the source
documents into "buckets", one bucket per shard, using the above
ShardDistributionPolicy to assign documents to buckets/shards. Then all
of the documents in a "bucket" could be sent to the relevant shard for
indexing (which would be nothing more than a normal HTTP post (or solrj
call?)).
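
To make the bucketing concrete, here is a minimal sketch in Java. The
interface name comes from this thread, but the method signature, the
hash-mod default, and the helper class are purely illustrative, not code
from the actual patch:

```java
import java.util.*;

// Illustrative shard policy: maps a document (by its ID) to a shard index.
interface ShardDistributionPolicy {
    int shardFor(String docId, int numShards);
}

// Default policy from the discussion: modulo of the doc ID (here its hash,
// since IDs need not be numeric). floorMod keeps the result non-negative.
class HashModPolicy implements ShardDistributionPolicy {
    public int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }
}

// Bucketing step: one bucket per shard; each bucket would then be sent to
// its shard in a single HTTP post (or solrj call).
class Bucketer {
    static Map<Integer, List<String>> bucket(List<String> docIds,
                                             ShardDistributionPolicy policy,
                                             int numShards) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String id : docIds) {
            buckets.computeIfAbsent(policy.shardFor(id, numShards),
                                    k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }
}
```

The key property is determinism: the same ID always lands in the same
bucket, which is what makes deletes and updates routable later.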

As to whether this would be single threaded or multithreaded, I would
guess we would aim to do it the same as the distributed search code
(which I have not yet reviewed). However, it could presumably be
single-threaded, but use asynchronous HTTP.

Regards, Upayavira

On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <go...@gmail.com>
wrote:
> I would suggest that a DistributedRequestUpdateHandler run
> single-threaded, doing only one document at a time. If I want more
> than one, I run it twice or N times with my own program.
> 
> Also, this should have a policy object which decides exactly how
> documents are distributed. There are different techniques for
> different use cases.
> 
> Lance
> 
> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <so...@gmail.com>
> wrote:
> > Hello Yonik,
> >
> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> >> Making it easy for clients I think is key... one should be able to
> >> update any node in the solr cluster and have solr take care of the
> >> hard part about updating all relevant shards.  This will most likely
> >> involve an update processor.  This approach allows all existing update
> >> methods (including things like CSV file upload) to still work
> >> correctly.
> >>
> >> Also post.jar is really just for testing... a command-line replacement
> >> for "curl" for those who may not have it.  It's not really a
> >> recommended way for updating Solr servers in production.
> >
> > OK, I've abandoned the post.jar tool idea in favour of a
> > DistributedUpdateRequestProcessor class (I've been looking into other
> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
> > are used/what data they store - hence why I've taken some time to
> > respond).
> >
> > My big question now is that is it necessary to have a Factory class for
> > DistributedUpdateRequestProcessor? I've seen this lots of times, as in
> > RunUpdateProcessorFactory (where the factory class was only a few lines
> > of code) to SignatureUpdateProcessorFactory? At first I was thinking it
> > would be a good design idea to include it in (in a generic sense), but
> > then I thought harder and I thought that the
> > DistributedUpdateRequestHandler would only be running once, taking in all
> > the requests, so it seems sort of pointless to write one in.
> >
> > That is my "burning" question for now. I have got a few more questions,
> > but I'm sure that when I look further into the code, I'll either have
> > more or all of my questions are answered.
> >
> > Many thanks!
> >
> > Soheb Mahmood
> >
> >
> >
> >
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source




Re: Distributed Indexing

Posted by Lance Norskog <go...@gmail.com>.
I would suggest that a DistributedRequestUpdateHandler run
single-threaded, doing only one document at a time. If I want more
than one, I run it twice or N times with my own program.

Also, this should have a policy object which decides exactly how
documents are distributed. There are different techniques for
different use cases.

Lance

On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <so...@gmail.com> wrote:
> Hello Yonik,
>
> On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> Making it easy for clients I think is key... one should be able to
>> update any node in the solr cluster and have solr take care of the
>> hard part about updating all relevant shards.  This will most likely
>> involve an update processor.  This approach allows all existing update
>> methods (including things like CSV file upload) to still work
>> correctly.
>>
>> Also post.jar is really just for testing... a command-line replacement
>> for "curl" for those who may not have it.  It's not really a
>> recommended way for updating Solr servers in production.
>
> OK, I've abandoned the post.jar tool idea in favour of a
> DistributedUpdateRequestProcessor class (I've been looking into other
> classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
> are used/what data they store - hence why I've taken some time to
> respond).
>
> My big question now is that is it necessary to have a Factory class for
> DistributedUpdateRequestProcessor? I've seen this lots of times, as in
> RunUpdateProcessorFactory (where the factory class was only a few lines
> of code) to SignatureUpdateProcessorFactory? At first I was thinking it
> would be a good design idea to include it in (in a generic sense), but
> then I thought harder and I thought that the
> DistributedUpdateRequestHandler would only be running once, taking in all
> the requests, so it seems sort of pointless to write one in.
>
> That is my "burning" question for now. I have got a few more questions,
> but I'm sure that when I look further into the code, I'll either have
> more or all of my questions are answered.
>
> Many thanks!
>
> Soheb Mahmood
>
>
>
>



-- 
Lance Norskog
goksron@gmail.com



Re: Distributed Indexing

Posted by Soheb Mahmood <so...@gmail.com>.
Hello Yonik,

On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> Making it easy for clients I think is key... one should be able to
> update any node in the solr cluster and have solr take care of the
> hard part about updating all relevant shards.  This will most likely
> involve an update processor.  This approach allows all existing update
> methods (including things like CSV file upload) to still work
> correctly.
> 
> Also post.jar is really just for testing... a command-line replacement
> for "curl" for those who may not have it.  It's not really a
> recommended way for updating Solr servers in production.

OK, I've abandoned the post.jar tool idea in favour of a
DistributedUpdateRequestProcessor class (I've been looking into other
classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
are used/what data they store - hence why I've taken some time to
respond). 

My big question now is: is it necessary to have a Factory class for
DistributedUpdateRequestProcessor? I've seen this pattern lots of times,
from RunUpdateProcessorFactory (where the factory class was only a few
lines of code) to SignatureUpdateProcessorFactory. At first I thought it
would be a good design idea to include one (in a generic sense), but
then I thought harder: the DistributedUpdateRequestHandler would only be
running once, taking in all the requests, so it seems rather pointless
to write one.
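
For what it's worth, the usual reason even trivial factories exist is
that the factory is configured once from solrconfig.xml while a fresh
processor is created for each update request, so per-request state lives
in the processor. A pared-down sketch of the pattern (simplified names,
not Solr's actual signatures, which also take the query request and
response objects):

```java
// Simplified sketch of the update processor chain pattern.
abstract class UpdateProcessor {
    protected final UpdateProcessor next;  // next link in the chain
    UpdateProcessor(UpdateProcessor next) { this.next = next; }
    void processAdd(String doc) { if (next != null) next.processAdd(doc); }
}

// Configured once; asked for a new processor on every request.
abstract class UpdateProcessorFactory {
    abstract UpdateProcessor getInstance(UpdateProcessor next);
}

class LoggingProcessorFactory extends UpdateProcessorFactory {
    UpdateProcessor getInstance(UpdateProcessor next) {
        return new UpdateProcessor(next) {
            void processAdd(String doc) {
                System.out.println("indexing " + doc);  // per-request work
                super.processAdd(doc);
            }
        };
    }
}
```

So whether a factory is needed here comes down to whether the
distributed processor must hold any per-request state (such as the
per-shard buckets), rather than how often the handler runs.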

That is my "burning" question for now. I have a few more questions, but
I'm sure that when I look further into the code, I'll either have more
questions or find all of them answered.

Many thanks!

Soheb Mahmood




Re: Distributed Indexing

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Jan 26, 2011 at 11:29 AM, Soheb Mahmood <so...@gmail.com> wrote:
> We were wondering if there was a simple way of
> applying these changes we wrote in Java across all the other languages.

Making it easy for clients I think is key... one should be able to
update any node in the solr cluster and have solr take care of the
hard part about updating all relevant shards.  This will most likely
involve an update processor.  This approach allows all existing update
methods (including things like CSV file upload) to still work
correctly.

Also post.jar is really just for testing... a command-line replacement
for "curl" for those who may not have it.  It's not really a
recommended way for updating Solr servers in production.

-Yonik
http://www.lucidimagination.com



Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
Another point that will need some thought, as has been alluded to, is
error handling.

Currently, as I understand it, if you post 500 documents to Solr, and
one has an error, the whole batch will fail.

Leaving aside whether that is the best behaviour, it is a behaviour that
will be impossible to mimic in a distributed indexing scenario (without
effectively implementing distributed transactions).

I guess the simplest would be to find a way to report back to the user
that documents for these shards succeeded, documents for those shards
failed, and here is the error. The issue here is that when Solr returns
an error, it doesn't return error XML; it returns (for example) a
Tomcat stack trace, i.e. HTML. Perhaps all we can do here is to embed
that HTML as CDATA in the XML that the distributed request handler
returns to its client.
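
As a purely illustrative sketch (no such response format exists today;
every name here is an assumption), the per-shard report might look like:

```xml
<response>
  <lst name="shard-results">
    <lst name="http://localhost:8983/solr">
      <str name="status">success</str>
    </lst>
    <lst name="http://localhost:7983/solr">
      <str name="status">failed</str>
      <!-- the raw container error page, wrapped so the XML stays well-formed -->
      <str name="error"><![CDATA[<html>... Tomcat stack trace ...</html>]]></str>
    </lst>
  </lst>
</response>
```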

Then, worst case the client could fix the error and repost everything.
All documents would be re-indexed across all shards, but in the long
run, there's no big issue with that.

Upayavira

On Fri, 28 Jan 2011 12:24 +0000, "Upayavira" <uv...@odoko.co.uk> wrote:
> Hi Soheb,
> 
> On Wed, 26 Jan 2011 16:29 +0000, "Soheb Mahmood"
> <so...@gmail.com> wrote:
> 
> > We are going to implement distributed indexing for Solr - without the
> > use of SolrCloud (so it can be easily up-scaled). We have a deadline by
> > February to get this done, so we need to get cracking ;) 
> 
> :-)
>  
> > So far, we've had a look at the solr classes and thought about
> > distributed indexing on Solr, and we have come up with these ideas:
> > 
> > 1. We plan to modify SimplePostTool to accommodate posting to specific
> > shards. We are going to add an optional system property to allow the
> > user to specify a list of shards to index to Solr.
> > Example of this being "java
> > -Durl=http://localhost:7574/solr/collection1/update
> > -Dshards=localhost:8983/solr,localhost:7574/solr -jar post.jar <list of
> > XML files>"
> 
> As Yonik says, the SimplePostTool is really for testing. The shard
> information must be contained within the URL, and processed by an
> UpdateRequestHandler (called DistributedUpdateRequestHandler?). That
> way, you can embed that data into the solrconfig.xml file as an
> invariant or a default, or later it can be derived from Zookeeper in
> SolrCloud.
> 
> > We also plan to modify server request processing to handle distributed
> > indexing. We are looking at CommonsHttpSolrServer.java for ways to
> > accomplish this.
> > 
> > With all these changes, we realise that we are only modifying the Java
> > version, and that other languages need to be updated to accommodate our
> > changes (e.g. perl). We were wondering if there was a simple way of
> > applying these changes we wrote in Java across all the other languages.
> 
> If you add this support to Solr itself, it is then the responsibility of
> each client library to worry about supporting it.
> 
> You should only be focussing on the Solr DistributedUpdateHandler code
> rather than on any client libraries (other than the code you use as your
> test harness).
> 
> > 2. We are going to make an interface to handle distributed writing. We
> > plan for it to sit between the Solr server and the shards - if no shards
> > are specified, then the post.jar tool will work exactly the same way it
> > does now. However, if the user specifies shards for post.jar, then we
> > want a class that has extended our interface to kick into action. 
> 
> The interface you need will be a ShardPolicy or some such. You will hand
> it a document and a number of shards (or a list of them), and it will tell
> you which shard that document should go in. This interface will then
> allow for pluggable shard policies, whether a simple modulo on the
> document ID (for deterministic indexing) or a simple round-robin (for
> random indexing).
> 
> You'll then need to split the documents you've gathered from the post
> request to the UpdateRequestHandler, and forward them to whichever
> shards the ShardPolicy suggested.
> 
> > 3. We plan to test our results by acceptance testing (we run Solr and
> > see if it works ourselves) and writing a test class.
> 
> Sounds great.
> 
> Upayavira
> --- 
> Enterprise Search Consultant at Sourcesense UK, 
> Making Sense of Open Source
> 
> 
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source




Re: Distributed Indexing

Posted by Upayavira <uv...@odoko.co.uk>.
Hi Soheb,

On Wed, 26 Jan 2011 16:29 +0000, "Soheb Mahmood"
<so...@gmail.com> wrote:

> We are going to implement distributed indexing for Solr - without the
> use of SolrCloud (so it can be easily up-scaled). We have a deadline by
> February to get this done, so we need to get cracking ;) 

:-)
 
> So far, we've had a look at the solr classes and thought about
> distributed indexing on Solr, and we have come up with these ideas:
> 
> 1. We plan to modify SimplePostTool to accommodate posting to specific
> shards. We are going to add an optional system property to allow the
> user to specify a list of shards to index to Solr.
> Example of this being "java
> -Durl=http://localhost:7574/solr/collection1/update
> -Dshards=localhost:8983/solr,localhost:7574/solr -jar post.jar <list of
> XML files>"

As Yonik says, the SimplePostTool is really for testing. The shard
information must be contained within the URL, and processed by an
UpdateRequestHandler (called DistributedUpdateRequestHandler?). That
way, you can embed that data into the solrconfig.xml file as an
invariant or a default, or later it can be derived from Zookeeper in
SolrCloud.

> We also plan to modify server request processing to handle distributed
> indexing. We are looking at CommonsHttpSolrServer.java for ways to
> accomplish this.
> 
> With all these changes, we realise that we are only modifying the Java
> version, and that other languages need to be updated to accommodate our
> changes (e.g. perl). We were wondering if there was a simple way of
> applying these changes we wrote in Java across all the other languages.

If you add this support to Solr itself, it is then the responsibility of
each client library to worry about supporting it.

You should only be focussing on the Solr DistributedUpdateHandler code
rather than on any client libraries (other than the code you use as your
test harness).

> 2. We are going to make an interface to handle distributed writing. We
> plan for it to sit between the Solr server and the shards - if no shards
> are specified, then the post.jar tool will work exactly the same way it
> does now. However, if the user specifies shards for post.jar, then we
> want a class that has extended our interface to kick into action. 

The interface you need will be a ShardPolicy or some such. You will hand
it a document and a number of shards (or a list of them), and it will
tell you which shard that document should go in. This interface will then
allow for pluggable shard policies, whether a simple modulo on the
document ID (for deterministic indexing) or a simple round-robin (for
random indexing).

You'll then need to split the documents you've gathered from the post
request to the UpdateRequestHandler, and forward them to whichever
shards the ShardPolicy suggested.
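
A round-robin policy, for instance, needs no document data at all. A
minimal sketch (the class name and signature are assumptions, not code
from any patch):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin ShardPolicy: ignores the document entirely and
// simply cycles through the shard indexes, giving an even spread.
class RoundRobinShardPolicy {
    private final AtomicInteger counter = new AtomicInteger();

    /** Returns the index (into the shard list) for the next document. */
    int shardFor(String doc, int numShards) {
        return Math.floorMod(counter.getAndIncrement(), numShards);
    }
}
```

The trade-off versus the modulo-on-ID policy is that routing is no longer
deterministic, so later updates or deletes of a document cannot be sent
to a single known shard.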

> 3. We plan to test our results by acceptance testing (we run Solr and
> see if it works ourselves) and writing a test class.

Sounds great.

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source

