Posted to solr-user@lucene.apache.org by tomasv <da...@gmail.com> on 2014/07/19 02:18:31 UTC

shards as subset of All Shards

Hello, This is kind of weird, but here goes:

We are setting up a document repository (SOLR4). This will be a large (to
us) repository of approximately 500B documents. The documents are based on
"people".

Once all my documents are uploaded, we will receive new (follow-up)
information on our "people" every month (or so).

Our client-facing application has two modes: "all inclusive data" and "recent
data".
We want the "recent data" mode to query against the follow-up information
only. We want the "all inclusive" mode to query against the
initial load AND the follow-up data.

We currently have 30 shards with 2 replicas of each shard (60 shards total)
in a SOLR cloud setup including a Zookeeper. This is currently hosting our
data in what will become the "all inclusive" query.

What is the best approach to a requirement such as this? (Probably not
clear enough??)
(I'm a newbie so please bear with my questions! :-) )
1. Should we create two separate collections ("initial" and "followup")? And
then have the front end app query against each collection as needed?
2. Is it possible to index the follow-up records to specific shards and then
query those specific shards when the client is in "follow up" mode? Will an
"all inclusive" query include the followup shards?
3. Is it possible for one collection to be a subset of a larger collection?

I realize this is quite "fuzzy", but any insights are appreciated.

-tomas

--
View this message in context: http://lucene.472066.n3.nabble.com/shards-as-subset-of-All-Shards-tp4147998.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: shards as subset of All Shards

Posted by IJ <ja...@gmail.com>.

Here is one potential design approach:

1. Create a single collection (instead of two collections).
Let your schema have a "RecordType" field which can take the values of
either "initial" or "follow-up" for documents that are indexed into this
collection.

2. Let there be 30 shards - just like you have it. However - implement a
document co-location strategy in your indexing - so that a single customer's
records (both "initial" and "follow-up") always get indexed into the same
single shard.

Read this link on "Document Routing" to learn more about how to implement
this -
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-DocumentRouting

3. When your search App queries the Collection - use the
_route_=<customer Id / Name> parameter to force searches on the correct shard.

Such a design ensures that your queries don't get distributed across all
nodes / shards on your system - which could cause latency issues of its own.
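
As a rough illustration of the three steps above, here is a minimal Python
sketch. It assumes the collection uses Solr's compositeId router, where a
document ID of the form "<shard key>!<document id>" keeps all documents that
share a shard key on the same shard. The host, the collection name "people",
and the helper names are hypothetical; only the "RecordType" field and the
_route_ parameter come from the design above.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical host
COLLECTION = "people"                      # hypothetical collection name


def routed_id(customer_id: str, doc_id: str) -> str:
    """Build a compositeId-style ID: '<shard key>!<document id>'."""
    return f"{customer_id}!{doc_id}"


def make_doc(customer_id: str, doc_id: str, record_type: str) -> dict:
    """Build a document tagged as 'initial' or 'follow-up'."""
    if record_type not in ("initial", "follow-up"):
        raise ValueError(f"unexpected RecordType: {record_type!r}")
    return {"id": routed_id(customer_id, doc_id), "RecordType": record_type}


def search_url(customer_id: str, query: str, recent_only: bool = False) -> str:
    """Build a select URL routed to the shard holding this customer's records."""
    params = [
        ("q", query),
        ("_route_", f"{customer_id}!"),  # shard key with trailing '!'
        ("wt", "json"),
    ]
    if recent_only:
        # "recent data" mode: restrict results to follow-up records only
        params.append(("fq", 'RecordType:"follow-up"'))
    return f"{SOLR_BASE}/{COLLECTION}/select?" + urlencode(params)
```

In "all inclusive" mode you would call search_url(customer, query) with no
filter; in "recent data" mode you would pass recent_only=True. If a search is
not tied to one customer, the _route_ parameter would have to be dropped and
the query would fan out to all shards, with only the RecordType filter applied.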


--
View this message in context: http://lucene.472066.n3.nabble.com/shards-as-subset-of-All-Shards-tp4147998p4148038.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: shards as subset of All Shards

Posted by Jack Krupansky <ja...@basetechnology.com>.

"500B" - as in 500,000,000,000? Really?

-- Jack Krupansky
