Posted to solr-user@lucene.apache.org by "Aleksander M. Stensby" <al...@integrasco.no> on 2009/06/09 13:07:47 UTC

Sharding strategy

Hi all,
I'm trying to figure out how to shard our index as it is growing rapidly  
and we want to make our solution scalable.
So, we have documents that are most commonly sorted by their date. My  
initial thought is to shard the index by date, but I wonder if you have  
any input on this and how to best solve this...

I know that the most frequent queries will be executed against the
"latest" shard. But let's say we shard by year: how do we best handle the
situation that will occur at the beginning of a new year? (Some of the
data will be in the latest shard, but most of it will be in the
second-to-last shard.)
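
To make the yearly scheme concrete, here is a minimal sketch, assuming one shard per calendar year. The shard naming ("shard_2009") and both function names are hypothetical and do not come from the thread. It shows why a query window that straddles New Year ends up touching two shards.

```python
from datetime import date


def shard_for(doc_date: date) -> str:
    """Return the (hypothetical) name of the yearly shard holding a document."""
    return f"shard_{doc_date.year}"


def shards_for_range(start: date, end: date) -> list[str]:
    """List the yearly shards a date-range query touches, newest first."""
    if start > end:
        raise ValueError("start must not be after end")
    return [f"shard_{year}" for year in range(end.year, start.year - 1, -1)]
```

For example, a query covering 2008-12-01 through 2009-01-15 maps to the 2009 shard followed by the 2008 shard, even though in early January almost all matching documents live in the older one.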

Would it be stupid to have a "latest" shard with duplicate data (always  
consisting of the last 6 months or something like that) and maintain that  
index in addition to the regular yearly shards? Anyone else facing a
similar situation with a good solution?
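
The rolling "latest" shard idea can be sketched as a simple routing check. This is only an illustration under stated assumptions: "the last 6 months" is approximated as 183 days, and the names LATEST_WINDOW, latest_cutoff and can_use_latest_only are made up for the example.

```python
from datetime import date, timedelta

# Assumption: "last 6 months" approximated as a fixed 183-day window.
LATEST_WINDOW = timedelta(days=183)


def latest_cutoff(today: date) -> date:
    """Oldest document date still duplicated into the rolling "latest" shard."""
    return today - LATEST_WINDOW


def can_use_latest_only(query_start: date, today: date) -> bool:
    """True if a query starting at query_start is fully covered by the "latest" shard."""
    return query_start >= latest_cutoff(today)
```

With this check, a mid-January query for the previous December stays on the "latest" shard alone, while a query reaching back to the previous March has to go to the yearly shards. The price is keeping the duplicate index trimmed and in sync.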

Any input would be greatly appreciated :)

Cheers,
  Aleksander



-- 
Aleksander M. Stensby
Lead software developer and system architect
Integrasco A/S
www.integrasco.no
http://twitter.com/Integrasco

Please consider the environment before printing all or any of this e-mail

Re: Sharding strategy

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Hi Otis,
thanks for your reply!
You could say I'm lucky (and I totally agree since I've made the choice of  
ordering the data that way:p).
What you describe is what I've thought about doing and I'm happy to read  
that you approve. It is always nice to know that you are not doing things  
completely off - that's what I love about this mailing list!

I've implemented a sharded "yellow pages" that builds up the shards
parameter, so it will obviously be easy to search two shards to get past
the beginning-of-the-year situation. I just thought it might be a bit
stupid to search for 1% of the data in the "latest" shard and the rest in
shard n-1. How much of a performance decrease do you reckon I will get
from searching two shards instead of one?
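
For readers who have not seen such a lookup, here is a rough sketch of what a "yellow pages" could look like. Solr's distributed search takes a comma-separated `shards` request parameter of host:port/path entries; everything else here is an assumption for illustration: the registry contents, the example.com hosts, the `date` sort field, and the function names. It is not the implementation mentioned in the message.

```python
from urllib.parse import urlencode

# Hypothetical "yellow pages": year -> address of the Solr instance holding it.
SHARD_REGISTRY = {
    2008: "solr-2008.example.com:8983/solr",
    2009: "solr-2009.example.com:8983/solr",
}


def shards_param(years):
    """Join the addresses for the given years into a comma-separated `shards` value."""
    missing = [y for y in years if y not in SHARD_REGISTRY]
    if missing:
        raise KeyError(f"no shard registered for years: {missing}")
    return ",".join(SHARD_REGISTRY[y] for y in years)


def query_string(q, years):
    """Build a URL-encoded query string; the `date` sort field is an assumption."""
    return urlencode({"q": q, "sort": "date desc", "shards": shards_param(years)})
```

Searching one shard or two then only changes the list of years passed in, for example `query_string("solr", [2009, 2008])` during the first weeks of 2009.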

Anyways, thanks for confirming things, Otis!

Cheers,
  Aleksander




On Wed, 10 Jun 2009 07:51:16 +0200, Otis Gospodnetic  
<ot...@yahoo.com> wrote:

>
> Aleksander,
>
> In a sense you are lucky you have time-ordered data.  That makes it very  
> easy to shard and cheaper to search - you know exactly which shards you  
> need to query.  The beginning of the year situation should also be  
> easy.  Do start with the latest shard for the current year, and go to  
> next shard only if you have to (e.g. if you don't get enough results  
> from the first shard).
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>




Re: Sharding strategy

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Aleksander,

In a sense you are lucky you have time-ordered data.  That makes it very easy to shard and cheaper to search - you know exactly which shards you need to query.  The beginning of the year situation should also be easy.  Do start with the latest shard for the current year, and go to next shard only if you have to (e.g. if you don't get enough results from the first shard).
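
The "newest shard first, next shard only if needed" approach could be sketched roughly as follows. The `search_shard` callable is a hypothetical stand-in for whatever actually queries a single shard (for instance an HTTP request) and is assumed to return at most `limit` hits, newest first. The shortcut is only valid when results are sorted by date, because every document in a newer shard is then guaranteed to rank ahead of every document in an older one.

```python
from typing import Callable, List


def search_newest_first(
    shards: List[str],
    search_shard: Callable[[str, int], List[dict]],
    rows: int,
) -> List[dict]:
    """Query shards in the given (newest-first) order until `rows` hits are collected."""
    results: List[dict] = []
    for shard in shards:
        needed = rows - len(results)
        if needed <= 0:
            break  # enough results already, older shards are never contacted
        results.extend(search_shard(shard, needed))
    return results[:rows]
```

In the beginning-of-the-year case this means the second shard is only contacted when the current year's shard cannot fill the requested page on its own.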

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


