Posted to solr-user@lucene.apache.org by wwang525 <ww...@gmail.com> on 2015/06/18 20:05:28 UTC

How to do a Data sharding for data in a database table

Hi,

We would probably like to shard the data since the response time for
demanding queries against > 10M records exceeds 1 second even in a
single-request scenario.

I have not done any data sharding before. What are some recommended ways to
do data sharding? For example, maybe by a criterion with a list of specific
values?





--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by Jack Krupansky <ja...@gmail.com>.
10M doesn't sound too demanding.

How complex are your queries?

How complex is your data - like number of fields and size, like very large
documents?

Are you sure you have enough RAM to fully cache your index?

Are your queries compute-bound or I/O bound? If I/O-bound, get more RAM. If
compute-bound, sharding may help, but have to examine query complexity
first.


-- Jack Krupansky

On Thu, Jun 18, 2015 at 2:05 PM, wwang525 <ww...@gmail.com> wrote:

> Hi,
>
> We would probably like to shard the data since the response time for
> demanding queries against > 10M records exceeds 1 second even in a
> single-request scenario.
>
> I have not done any data sharding before. What are some recommended ways to
> do data sharding? For example, maybe by a criterion with a list of specific
> values?
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

RE: How to do a Data sharding for data in a database table

Posted by Carlos Maroto <CM...@searchtechnologies.com>.
As stated previously, using Field Collapsing (group parameters) tends to significantly slow down queries.  In my experience, search response gets even worse when:
- Requesting facets, which more often than not I do in my query formulation
- Asking for the facet counts to be on the groups via the group.facet=true parameter (way worse in some of my use cases that had a lot of distinct values for at least one of the facets)
- Queries are matching many hits, i.e. high individual counts (hundreds of thousands or more in our case) and total group counts (in the few thousands)

Also stated by someone, switching to the CollapsingQParserPlugin will likely reduce the response time significantly given its different implementation.  Using the CollapsingQParserPlugin means that you:

1- Have to change how the query gets created (see the sketch after this list)
2- May need to change how you consume the Solr response (depending on what you are using today)
3- Will not have the total number of individual hits (the before-collapsing count) because the numFound returned with the CollapsingQParserPlugin represents the total number of groups (like group.ngroups does)
4- May have an issue with facet value counts not being exact in the CollapsingQParserPlugin response
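
As a minimal sketch of the query change (standard Solr syntax; the field name
HotelCode is borrowed from a query later in this thread):

  # Field Collapsing via the group parameters:
  /select?q=*:*&group=true&group.field=HotelCode&group.ngroups=true

  # The same grouping via the CollapsingQParserPlugin (a filter query):
  /select?q=*:*&fq={!collapse field=HotelCode}

With the collapse filter, numFound already counts groups, so group.ngroups is
not needed (point 3 above).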

With respect to sharding, there are multiple considerations.  The most relevant given your need for grouping is to implement custom routing of documents to shards so that all members of a group are indexed in the same shard, if you can.  Otherwise your grouping across shards will have some issues (particularly with counts, I believe.)
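
In SolrCloud, one way to get that co-location is the compositeId router:
prefix each document id with the group key so all documents sharing the key
hash to the same shard. A hypothetical sketch (id values invented for
illustration):

  id = HTL123!booking-0001
  id = HTL123!booking-0002   <- routed to the same shard as the doc above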

CARLOS MAROTO       
http://www.searchtechnologies.com/
M +1 626 354 7750

-----Original Message-----
From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org] 
Sent: Friday, June 19, 2015 12:08 PM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Also, since you are tuning for relative times, you can tune on the smaller index.   Surely, you will want to test at scale.   But tuning query, analyzer or schema options is usually easier to do on a smaller index.   If you get a 3x improvement at small scale, it may only be 2.5x at full scale.

E.g. storing the group field as doc values is one option that can help grouping performance in some cases (at least according to this list, I haven't tried it yet).

The number of distinct values of the grouping field is important as well.  If there are very many, you may want to try CollapsingQParserPlugin.     

The point being, some of these options may require reindexing!   So, again, it is a much easier and faster process to tune on a smaller index.

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, June 19, 2015 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

Do be aware that turning on &debug=query adds a load. I've seen the debug component take 90% of the query time. (to be fair it usually takes a much smaller percentage).

But you'll see a section at the end of the response if you set debug=all with the time each component took so you'll have a sense of the relative time used by each component.
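
The timing section looks roughly like this (illustrative values; JSON
response format):

  "debug": {
    "timing": {
      "time": 140.0,
      "prepare": { "time": 2.0 },
      "process": {
        "time": 138.0,
        "query": { "time": 60.0 },
        "facet": { "time": 70.0 },
        "debug": { "time": 8.0 }
      }
    }
  }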

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang <ww...@gmail.com> wrote:
> As for now, the index size is 6.5 M records, and the performance is 
> good enough. I will re-build the index for all the records (14 M) and 
> test it again with debug turned on.
>
> Thanks
>
>
> On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
> <er...@gmail.com>
> wrote:
>
>> First and most obvious thing to try:
>>
>> bq: the Solr was started with maximal 4G for JVM, and index size is < 
>> 2G
>>
>> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is 
>> very loosely coupled to JVM requirements. It's quite possible that 
>> you're spending all your time in GC cycles. Consider gathering GC 
>> characteristics, see:
>> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>>
>> As Charles says, on the face of it the system you describe should 
>> handle quite a load, so it feels like things can be tuned and you 
>> won't have to resort to sharding.
>> Sharding inevitably imposes some overhead so it's best to go there last.
>>
>> From my perspective, this is, indeed, an XY problem. You're assuming 
>> that sharding is your solution. But you really haven't identified the 
>> _problem_ other than "queries are too slow". Let's nail down the 
>> reason queries are taking a second before jumping into sharding. I've 
>> just spent too much of my life fixing the wrong thing ;)
>>
>> It would be useful to see a couple of sample queries so we can get a 
>> feel for how complex they are. Especially if you append, as Charles 
>> mentions, "debug=true"
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles 
>> <Ch...@tiaa-cref.org> wrote:
>> > Grouping does tend to be expensive.   Our regular queries typically
>> return in 10-15ms while the grouping queries take 60-80ms in a test 
>> environment (< 1M docs).
>> >
>> > This is ok for us, since we wrote our app to take the grouping 
>> > queries
>> out of the critical path (async query in parallel with two primary queries
>> and some work in middle tier).   But this approach is unlikely to work for
>> most cases.
>> >
>> > -----Original Message-----
>> > From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
>> > Sent: Friday, June 19, 2015 9:52 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: RE: How to do a Data sharding for data in a database table
>> >
>> > Hi Wenbin,
>> >
>> > To me, your instance appears well provisioned.  Likewise, your 
>> > analysis
>> of test vs. production performance makes a lot of sense.  Perhaps 
>> your time would be well spent tuning the query performance for your 
>> app before resorting to sharding?
>> >
>> > To that end, what do you see when you set debugQuery=true?   Where does
>> solr spend the time?   My guess would be in the grouping and sorting steps,
>> but which?   Sometimes the schema details matter for performance.   Folks on
>> this list can help with that.
>> >
>> > -Charlie
>> >
>> > -----Original Message-----
>> > From: Wenbin Wang [mailto:wwang525@gmail.com]
>> > Sent: Friday, June 19, 2015 7:55 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: How to do a Data sharding for data in a database table
>> >
>> > I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound 
>> > or
>> computer disk bound. In addition, the Solr was started with maximal 
>> 4G for JVM, and index size is < 2G. In a typical test, I made sure 
>> enough free RAM of 10G was available. I have not tuned any parameter 
>> in the configuration, it is default configuration.
>> >
>> > The number of fields for each record is around 10, and the number 
>> > of
>> results to be returned per page is 30. So the response time should 
>> not be affected by network traffic, and it is tested in the same 
>> machine. The query has a list of 4 search parameters, and each 
>> parameter takes a list of values or date range. The results will also 
>> be grouped and sorted. The response time of a typical single request 
>> is around 1 second. It can be > 1 second with more demanding requests.
>> >
>> > In our production environment, we have 64 cores, and we need to 
>> > support >
>> > 300 concurrent users, that is about 300 concurrent requests per second.
>> Each core will have to process about 5 requests per second. The 
>> response time under this load will not be 1 second any more. My 
>> estimate is that an average of 200 ms response time of a single 
>> request would be able to handle
>> > 300 concurrent users in production. There is no plan to increase 
>> > the
>> total number of cores 5 times.
>> >
>> > In a previous test, a search index around 6M data size was able to
>> handle >
>> > 5 requests per second in each core of my 8-core machine.
>> >
>> > By doing data sharding from one single index of 13M to 2 indexes of
>> > 6 or
>> 7 M/each, I am expecting much faster response time that can meet the 
>> demand of production environment. That is the motivation of doing data sharding.
>> > However, I am also open to solution that can improve the 
>> > performance of
>> the  index of 13M to 14M size so that I do not need to do a data sharding.
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
>> erickerickson@gmail.com>
>> > wrote:
>> >
>> >> You've repeated your original statement. Shawn's observation is 
>> >> that 10M docs is a very small corpus by Solr standards. You either 
>> >> have very demanding document/search combinations or you have a 
>> >> poorly tuned Solr installation.
>> >>
>> >> On reasonable hardware I expect 25-50M documents to have 
>> >> sub-second response time.
>> >>
>> >> So what we're trying to do is be sure this isn't an "XY" problem, 
>> >> from Hossman's apache page:
>> >>
>> >> Your question appears to be an "XY Problem" ... that is: you are 
>> >> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y"
>> >> without giving more details about the "X" so that we can 
>> >> understand the full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>> >>
>> >> So again, how would you characterize your documents? How many fields?
>> >> What do queries look like? How much physical memory on the machine?
>> >> How much memory have you allocated to the JVM?
>> >>
>> >> You might review:
>> >> http://wiki.apache.org/solr/UsingMailingLists
>> >>
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
>> >> > The query without load is still under 1 second. But under load, 
>> >> > response
>> >> time
>> >> > can be much longer due to the queued up query.
>> >> >
>> >> > We would like to shard the data to something like 6 M / shard, 
>> >> > which will still give a under 1 second response time under load.
>> >> >
>> >> > What are some best practice to shard the data? for example, we 
>> >> > could
>> >> shard
>> >> > the data by date range, but that is pretty dynamic, and we could 
>> >> > shard
>> >> data
>> >> > by some other properties, but if the data is not evenly 
>> >> > distributed, you
>> >> may
>> >> > not be able shard it anymore.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-d
>> >> ata- in-a-database-table-tp4212765p4212803.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >
>> > *******************************************************************
>> > ****** This e-mail may contain confidential or privileged 
>> > information.
>> > If you are not the intended recipient, please notify the sender
>> immediately and then delete it.
>> >
>> > TIAA-CREF
>> > *******************************************************************
>> > ******
>> >
>> > *******************************************************************
>> > ****** This e-mail may contain confidential or privileged 
>> > information.
>> > If you are not the intended recipient, please notify the sender
>> immediately and then delete it.
>> >
>> > TIAA-CREF
>> > *******************************************************************
>> > ******
>>

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************

RE: How to do a Data sharding for data in a database table

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Also, since you are tuning for relative times, you can tune on the smaller index.   Surely, you will want to test at scale.   But tuning query, analyzer or schema options is usually easier to do on a smaller index.   If you get a 3x improvement at small scale, it may only be 2.5x at full scale.

E.g. storing the group field as doc values is one option that can help grouping performance in some cases (at least according to this list, I haven't tried it yet).

The number of distinct values of the grouping field is important as well.  If there are very many, you may want to try CollapsingQParserPlugin.     

The point being, some of these options may require reindexing!   So, again, it is a much easier and faster process to tune on a smaller index.
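
Enabling doc values is a schema.xml change on the field definition, e.g. (a
sketch; the field name is borrowed from this thread, and as noted above it
requires a reindex):

  <field name="HotelCode" type="string" indexed="true" stored="true"
         docValues="true"/>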

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, June 19, 2015 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

Do be aware that turning on &debug=query adds a load. I've seen the debug component take 90% of the query time. (to be fair it usually takes a much smaller percentage).

But you'll see a section at the end of the response if you set debug=all with the time each component took so you'll have a sense of the relative time used by each component.

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang <ww...@gmail.com> wrote:
> As for now, the index size is 6.5 M records, and the performance is 
> good enough. I will re-build the index for all the records (14 M) and 
> test it again with debug turned on.
>
> Thanks
>
>
> On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
> <er...@gmail.com>
> wrote:
>
>> First and most obvious thing to try:
>>
>> bq: the Solr was started with maximal 4G for JVM, and index size is < 
>> 2G
>>
>> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is 
>> very loosely coupled to JVM requirements. It's quite possible that 
>> you're spending all your time in GC cycles. Consider gathering GC 
>> characteristics, see:
>> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>>
>> As Charles says, on the face of it the system you describe should 
>> handle quite a load, so it feels like things can be tuned and you 
>> won't have to resort to sharding.
>> Sharding inevitably imposes some overhead so it's best to go there last.
>>
>> From my perspective, this is, indeed, an XY problem. You're assuming 
>> that sharding is your solution. But you really haven't identified the 
>> _problem_ other than "queries are too slow". Let's nail down the 
>> reason queries are taking a second before jumping into sharding. I've 
>> just spent too much of my life fixing the wrong thing ;)
>>
>> It would be useful to see a couple of sample queries so we can get a 
>> feel for how complex they are. Especially if you append, as Charles 
>> mentions, "debug=true"
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles 
>> <Ch...@tiaa-cref.org> wrote:
>> > Grouping does tend to be expensive.   Our regular queries typically
>> return in 10-15ms while the grouping queries take 60-80ms in a test 
>> environment (< 1M docs).
>> >
>> > This is ok for us, since we wrote our app to take the grouping 
>> > queries
>> out of the critical path (async query in parallel with two primary queries
>> and some work in middle tier).   But this approach is unlikely to work for
>> most cases.
>> >
>> > -----Original Message-----
>> > From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
>> > Sent: Friday, June 19, 2015 9:52 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: RE: How to do a Data sharding for data in a database table
>> >
>> > Hi Wenbin,
>> >
>> > To me, your instance appears well provisioned.  Likewise, your 
>> > analysis
>> of test vs. production performance makes a lot of sense.  Perhaps 
>> your time would be well spent tuning the query performance for your 
>> app before resorting to sharding?
>> >
>> > To that end, what do you see when you set debugQuery=true?   Where does
>> solr spend the time?   My guess would be in the grouping and sorting steps,
>> but which?   Sometimes the schema details matter for performance.   Folks on
>> this list can help with that.
>> >
>> > -Charlie
>> >
>> > -----Original Message-----
>> > From: Wenbin Wang [mailto:wwang525@gmail.com]
>> > Sent: Friday, June 19, 2015 7:55 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: How to do a Data sharding for data in a database table
>> >
>> > I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound 
>> > or
>> computer disk bound. In addition, the Solr was started with maximal 
>> 4G for JVM, and index size is < 2G. In a typical test, I made sure 
>> enough free RAM of 10G was available. I have not tuned any parameter 
>> in the configuration, it is default configuration.
>> >
>> > The number of fields for each record is around 10, and the number 
>> > of
>> results to be returned per page is 30. So the response time should 
>> not be affected by network traffic, and it is tested in the same 
>> machine. The query has a list of 4 search parameters, and each 
>> parameter takes a list of values or date range. The results will also 
>> be grouped and sorted. The response time of a typical single request 
>> is around 1 second. It can be > 1 second with more demanding requests.
>> >
>> > In our production environment, we have 64 cores, and we need to 
>> > support >
>> > 300 concurrent users, that is about 300 concurrent requests per second.
>> Each core will have to process about 5 requests per second. The 
>> response time under this load will not be 1 second any more. My 
>> estimate is that an average of 200 ms response time of a single 
>> request would be able to handle
>> > 300 concurrent users in production. There is no plan to increase 
>> > the
>> total number of cores 5 times.
>> >
>> > In a previous test, a search index around 6M data size was able to
>> handle >
>> > 5 requests per second in each core of my 8-core machine.
>> >
>> > By doing data sharding from one single index of 13M to 2 indexes of 
>> > 6 or
>> 7 M/each, I am expecting much faster response time that can meet the 
>> demand of production environment. That is the motivation of doing data sharding.
>> > However, I am also open to solution that can improve the 
>> > performance of
>> the  index of 13M to 14M size so that I do not need to do a data sharding.
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
>> erickerickson@gmail.com>
>> > wrote:
>> >
>> >> You've repeated your original statement. Shawn's observation is 
>> >> that 10M docs is a very small corpus by Solr standards. You either 
>> >> have very demanding document/search combinations or you have a 
>> >> poorly tuned Solr installation.
>> >>
>> >> On reasonable hardware I expect 25-50M documents to have 
>> >> sub-second response time.
>> >>
>> >> So what we're trying to do is be sure this isn't an "XY" problem, 
>> >> from Hossman's apache page:
>> >>
>> >> Your question appears to be an "XY Problem" ... that is: you are 
>> >> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y"
>> >> without giving more details about the "X" so that we can 
>> >> understand the full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>> >>
>> >> So again, how would you characterize your documents? How many fields?
>> >> What do queries look like? How much physical memory on the machine?
>> >> How much memory have you allocated to the JVM?
>> >>
>> >> You might review:
>> >> http://wiki.apache.org/solr/UsingMailingLists
>> >>
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
>> >> > The query without load is still under 1 second. But under load, 
>> >> > response
>> >> time
>> >> > can be much longer due to the queued up query.
>> >> >
>> >> > We would like to shard the data to something like 6 M / shard, 
>> >> > which will still give a under 1 second response time under load.
>> >> >
>> >> > What are some best practice to shard the data? for example, we 
>> >> > could
>> >> shard
>> >> > the data by date range, but that is pretty dynamic, and we could 
>> >> > shard
>> >> data
>> >> > by some other properties, but if the data is not evenly 
>> >> > distributed, you
>> >> may
>> >> > not be able shard it anymore.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-d
>> >> ata- in-a-database-table-tp4212765p4212803.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >
>> > *******************************************************************
>> > ****** This e-mail may contain confidential or privileged 
>> > information.
>> > If you are not the intended recipient, please notify the sender
>> immediately and then delete it.
>> >
>> > TIAA-CREF
>> > *******************************************************************
>> > ******
>> >
>> > *******************************************************************
>> > ****** This e-mail may contain confidential or privileged 
>> > information.
>> > If you are not the intended recipient, please notify the sender
>> immediately and then delete it.
>> >
>> > TIAA-CREF
>> > *******************************************************************
>> > ******
>>

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
bq: Does Solr automatically load the search index into memory after the index
is built?

No. That's what the autowarm counts on your queryResultCache
and filterCache are intended to facilitate. Also, after every commit
a newSearcher event is fired, and any warmup queries you have configured
in the newSearcher section of your solrconfig.xml file are executed;
configure those so as to load whatever low-level caches you expect
to need.
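
A sketch of that warmup configuration in solrconfig.xml (the query values
here are placeholders, not recommendations):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="fq">GatewayCode:YYZ</str></lst>
    </arr>
  </listener>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str></lst>
    </arr>
  </listener>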

What have you looked for to try to answer this question before you posted
the question? The top two Google responses outline this in some detail.

Best,
Erick

On Thu, Jul 2, 2015 at 8:41 AM, wwang525 <ww...@gmail.com> wrote:
> Hi,
>
> I worked with other search solutions before, and cache management is
> important in boosting performance. Apart from the caches generated by
> users' requests, loading the search index into memory is the very first
> step after the index is built. This ensures search results are retrieved
> from memory, and not from disk I/O.
>
> The observation is that if the search index has not been accessed for a long
> time, performance degrades greatly because the OS swaps the search index
> out of memory to disk.
>
> Does Solr automatically load the search index into memory after the index is
> built? Otherwise, is there any tool or command that can accomplish this
> task?
>
> Regards
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4215398.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
Hi,

I worked with other search solutions before, and cache management is
important in boosting performance. Apart from the caches generated by
users' requests, loading the search index into memory is the very first
step after the index is built. This ensures search results are retrieved
from memory, and not from disk I/O.

The observation is that if the search index has not been accessed for a long
time, performance degrades greatly because the OS swaps the search index
out of memory to disk.

Does Solr automatically load the search index into memory after the index is
built? Otherwise, is there any tool or command that can accomplish this
task?

Regards




--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4215398.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
bq: The index size is only 1 M records. A 10-times increase in record size
(> 10M) will likely bring the total response time to > 1 second

This is an extrapolation you simply cannot make. Plus you cannot really tell
anything from just a few queries about system performance. In fact you must
disregard the first few queries due to loading Lucene indexes into memory.

Plus you cannot extrapolate from just a few queries. Part of the time
is loading the low-level Lucene caches for querying. And I'm assuming that
the times you're reporting are QTimes, but if they're not then there's the
time spent assembling the response packet (i.e. reading/decompressing the
data to get the stored data), which is almost entirely independent of the
number of docs.

In short, I don't have faith that your test methodology is reliable (although
kudos for having methodology at all, lots of people don't!). And I'm 99.99%
sure that you can't rely on the calculation that 10X the number of docs is
10X the response time.
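
A simple way to respect that in a test run (a sketch; $Q stands for the query
URL under test):

  # discard the first few requests (index/cache loading), then time the rest
  for i in 1 2 3; do curl -s "$Q" > /dev/null; done
  for i in $(seq 1 20); do
    curl -s -o /dev/null -w '%{time_total}\n' "$Q"
  done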

Best,
Erick

On Tue, Jun 30, 2015 at 2:51 PM, wwang525 <ww...@gmail.com> wrote:
> Hi All,
>
> I did many tests with very consistent test results. Each query was executed
> after re-indexing, and only one request was sent to query the index. I
> disabled filterCache and queryResultCache for this test based on Erick's
> recommendation.
>
> The test document was posted to this email list earlier. Briefly, the query
> without grouping and faceting took about 60 ms, and grouping on top of the
> same query adds about 15 ms. However, faceting adds an additional 70 ms,
> bringing it to about 140 ms.
>
> The index size is only 1 M records. A 10-times increase in record size
> (> 10M) will likely bring the total response time to > 1 second for these two
> queries. My goal is to make the query as performant as possible so that we
> can achieve a < 1 second response time under load.
>
> Is a 50 ms to 60 ms response time (single request scenario) a bit too long
> for 1M records with Solr? Is the faceting taking too long (70 ms) to
> process?
>
> Thanks
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4215019.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
Hi All,

I did many tests with very consistent test results. Each query was executed
after re-indexing, and only one request was sent to query the index. I
disabled filterCache and queryResultCache for this test based on Erick's
recommendation.

The test document was posted to this email list earlier. Briefly, the query
without grouping and faceting took about 60 ms, and grouping on top of the
same query adds about 15 ms. However, faceting adds an additional 70 ms,
bringing it to about 140 ms.

The index size is only 1 M records. A 10-times increase in record size
(> 10M) will likely bring the total response time to > 1 second for these two
queries. My goal is to make the query as performant as possible so that we
can achieve a < 1 second response time under load.

Is a 50 ms to 60 ms response time (single request scenario) a bit too long
for 1M records with Solr? Is the faceting taking too long (70 ms) to
process?

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4215019.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
Test_results_round_2.doc
<http://lucene.472066.n3.nabble.com/file/n4215016/Test_results_round_2.doc>  



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4215016.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
I'd set  filterCache and queryResultCache to zero (size and autowarm count)

Leave documentCache alone IMO, as it's used to hold documents read off disk
as they pass through various query components, and it doesn't autowarm anyway.
I'd think taking it out would skew your results because of multiple
decompressions.
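
In solrconfig.xml that zeroing would look something like this (a sketch for
the duration of the investigation only; the cache names and classes are the
standard ones from the example config):

  <filterCache class="solr.FastLRUCache" size="0" initialSize="0"
               autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="0" initialSize="0"
                    autowarmCount="0"/>
  <!-- documentCache left alone, e.g.: -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>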

Best,
Erick

On Tue, Jun 30, 2015 at 10:29 AM, wwang525 <ww...@gmail.com> wrote:
> Hi,
>
> I am currently investigating the queries with a much smaller index size (1M)
> to see the effect of grouping and faceting on performance degradation. This
> will allow me to do a lot of tests in a short period of time.
>
> However, it looks like the query is executed much faster the second time.
> This is tested after re-indexing, and not immediately executed again. It
> looks like it may be due to auto warming during or after re-indexing?
>
> I would like to get the response profile (query, faceting etc) for the same
> query in two separate requests without any cache or warming so that I get a
> good average number and not much fluctuation. What are the settings that I
> need to disable (temporarily) just for the purpose of the investigation? In
> the solrconfig.xml, I can see filterCache, queryResultCache, documentCache
> etc. I am not sure what need to be disabled to facilitate my work.
>
> I understand that cache and warming setting will be very helpful in load
> test later on. However, if I can optimize the query in a single request
> scenario, the performance will be in a much better shape with all the cache
> and warming setting during a load test scenario.
>
> Thanks,
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4214968.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
Hi,

I am currently investigating the queries with a much smaller index size (1M)
to see the effect of grouping and faceting on performance degradation. This
will allow me to do a lot of tests in a short period of time.

However, it looks like the query is executed much faster the second time.
This is tested after re-indexing, and not immediately executed again. It
looks like it may be due to auto warming during or after re-indexing?

I would like to get the response profile (query, faceting etc) for the same
query in two separate requests without any cache or warming so that I get a
good average number and not much fluctuation. What are the settings that I
need to disable (temporarily) just for the purpose of the investigation? In
the solrconfig.xml, I can see filterCache, queryResultCache, documentCache
etc. I am not sure what need to be disabled to facilitate my work.

I understand that cache and warming setting will be very helpful in load
test later on. However, if I can optimize the query in a single request
scenario, the performance will be in a much better shape with all the cache
and warming setting during a load test scenario.

Thanks,



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4214968.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, indeed it does. Never mind ;)

I guess the thing I'd be looking at is garbage
collection, here's a very good writeup:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
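
For gathering those GC characteristics, a typical set of HotSpot flags for
that era of Solr/Java would be (a sketch; heap sizes and paths are
placeholders):

  java -Xms8g -Xmx8g \
       -verbose:gc -Xloggc:logs/gc.log \
       -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -jar start.jar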

Kind of a shot in the dark, but it's possible.
Good luck!
Erick

On Thu, Jun 25, 2015 at 3:26 PM, Wenbin Wang <ww...@gmail.com> wrote:
> Hi Guys,
>
> I have no problem changing it to 2. However, we are talking about two
> different applications.
>
> The Solr 4.7 has two applications: example and example-DIH. The application
> example-DIH is the one I started with since it works with database.
>
> The example-DIH has the default setting to 4.
>
> Regards,
>
>
>
>
> On Thu, Jun 25, 2015 at 1:27 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 6/25/2015 10:27 AM, Wenbin Wang wrote:
>> > To clarify the work:
>> >
>> > We are very early in the investigative phase, and the indexing is NOT
>> done
>> > continuously.
>> >
>> > I indexed the data once through Admin UI, and tested the query. If I need
>> to
>> > index again, I can use curl or through the Admin UI.
>> >
>> > Solr 4.7 seems to have a default setting of maxWarmingSearchers at 4.
>>
>> The example configs that come with Solr have been setting
>> maxWarmingSearchers to 2 for the entire time I've been using Solr, which
>> started five years ago with version 1.4.0.  That is the value that we
>> see most often.  I have never seen an example config with 4, which is
>> part of how Erick knows that your config has been modified.  Most people
>> will not change that value unless they see an error message in their
>> logs about maxWarmingSearchers, and normally when that error message
>> appears, they are committing too frequently.  Adjusting
>> maxWarmingSearchers is rarely the proper fix ... either committing less
>> frequently or reducing the time required for each commit is the right
>> way to fix it.  Reducing the commit time is not always easy, but
>> reducing or eliminating cache autowarming will often take care of it.
>> Erick mentioned this already.
>>
>>
>> http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
>>
>> More information than you probably wanted to know: The default
>> maxWarmingSearchers value in the code (if you do not specify it in your
>> config) is Integer.MAX_VALUE -- a little over 2 billion.  If the config
>> doesn't specify, then there effectively is no limit.
>>
>> Thanks,
>> Shawn
>>
>>

Re: How to do a Data sharding for data in a database table

Posted by Wenbin Wang <ww...@gmail.com>.
Hi Guys,

I have no problem changing it to 2. However, we are talking about two
different applications.

The Solr 4.7 has two applications: example and example-DIH. The application
example-DIH is the one I started with since it works with database.

The example-DIH has the default setting to 4.

Regards,




On Thu, Jun 25, 2015 at 1:27 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 6/25/2015 10:27 AM, Wenbin Wang wrote:
> > To clarify the work:
> >
> > We are very early in the investigative phase, and the indexing is NOT
> done
> > continuously.
> >
> > I indexed the data once through Admin UI, and tested the query. If I need
> to
> > index again, I can use curl or through the Admin UI.
> >
> > Solr 4.7 seems to have a default setting of maxWarmingSearchers at 4.
>
> The example configs that come with Solr have been setting
> maxWarmingSearchers to 2 for the entire time I've been using Solr, which
> started five years ago with version 1.4.0.  That is the value that we
> see most often.  I have never seen an example config with 4, which is
> part of how Erick knows that your config has been modified.  Most people
> will not change that value unless they see an error message in their
> logs about maxWarmingSearchers, and normally when that error message
> appears, they are committing too frequently.  Adjusting
> maxWarmingSearchers is rarely the proper fix ... either committing less
> frequently or reducing the time required for each commit is the right
> way to fix it.  Reducing the commit time is not always easy, but
> reducing or eliminating cache autowarming will often take care of it.
> Erick mentioned this already.
>
>
> http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
>
> More information than you probably wanted to know: The default
> maxWarmingSearchers value in the code (if you do not specify it in your
> config) is Integer.MAX_VALUE -- a little over 2 billion.  If the config
> doesn't specify, then there effectively is no limit.
>
> Thanks,
> Shawn
>
>

Re: How to do a Data sharding for data in a database table

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/25/2015 10:27 AM, Wenbin Wang wrote:
> To clarify the work:
>
> We are very early in the investigative phase, and the indexing is NOT done
> continuously.
>
> I indexed the data once through Admin UI, and tested the query. If I need to
> index again, I can use curl or through the Admin UI.
>
> Solr 4.7 seems to have a default setting of maxWarmingSearchers at 4.

The example configs that come with Solr have been setting
maxWarmingSearchers to 2 for the entire time I've been using Solr, which
started five years ago with version 1.4.0.  That is the value that we
see most often.  I have never seen an example config with 4, which is
part of how Erick knows that your config has been modified.  Most people
will not change that value unless they see an error message in their
logs about maxWarmingSearchers, and normally when that error message
appears, they are committing too frequently.  Adjusting
maxWarmingSearchers is rarely the proper fix ... either committing less
frequently or reducing the time required for each commit is the right
way to fix it.  Reducing the commit time is not always easy, but
reducing or eliminating cache autowarming will often take care of it. 
Erick mentioned this already.

http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F

More information than you probably wanted to know: The default
maxWarmingSearchers value in the code (if you do not specify it in your
config) is Integer.MAX_VALUE -- a little over 2 billion.  If the config
doesn't specify, then there effectively is no limit.
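
For reference, the commonly shipped settings look like this in solrconfig.xml
(times are illustrative; openSearcher=false keeps hard commits from
triggering searcher warming):

  <maxWarmingSearchers>2</maxWarmingSearchers>

  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>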

Thanks,
Shawn


Re: How to do a Data sharding for data in a database table

Posted by Wenbin Wang <ww...@gmail.com>.
To clarify the work:

We are very early in the investigative phase, and the indexing is NOT done
continuously.

I indexed the data once through Admin UI, and tested the query. If I need to
index again, I can use curl or through the Admin UI.

Solr 4.7 seems to have a default setting of maxWarmingSearchers at 4.

In an earlier email, I shared the statistics with debugQuery=true, and the
time spent on both query processing and faceting. I will try setting
debug=all to see if there is any additional information.







On Thu, Jun 25, 2015 at 10:53 AM, Erick Erickson <er...@gmail.com>
wrote:

> You're missing the point. One of the things that can really affect
> response time is too-frequent commits. The fact that the commit
> configurations have been commented out indicates that the commits
> are happening either manually (curl, HTTP request or the like) _or_
> you have, say, a SolrJ client that does a commit. Or, your index never
> changes.
>
> The fact that the maxWarmingSearchers setting is 4 rather than the
> default 2 indicates that someone did change the config file. The fact
> that the autoCommit is all commented out additionally points to
> someone modifying it as these are not default settings.
>
> So again,
> 1> are commits happening from some client?
> or
> 2> does your index just never change?
>
> And you haven't posted the results of issuing queries with
> &debug=all either, this will show the time taken by various Solr
> Solr components and may point to where the slowdown is coming from.
>
> Best,
> Erick
>
> On Thu, Jun 25, 2015 at 9:48 AM, Wenbin Wang <ww...@gmail.com> wrote:
> > Hi Erick,
> >
> > The configuration is largely the default one, and I have not made much
> > change. I am also quite new to Solr although I have a lot of experience
> in
> > other search products.
> >
> > The whole list of fields need to be retrieved, so I do not have much of a
> > choice. The total size of the index files is about 1.2 G. I am not sure
> if
> > this is a reasonable size for 14 M records in Solr. One field that could
> be
> > removed is hotel name which can be retrieved/matched by mid-tier
> > application based on hotelcode (in the search index).
> >
> > You mentioned maxWarmingSearchers and commented out configuration of
> > "commit". That seems more related to indexing performance, and may not be
> > related to query performance? Actually, these were out-of-box default
> > configuration that I have not changed.
> >
> > Obviously the 1 second response time with a single request does not
> > translate well in a concurrent users scenario. Do you see any necessary
> > changes on the configuration files to make query perform faster?
> >
> > Thanks,
> >
> > On Thu, Jun 25, 2015 at 8:38 AM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> bq: Try not to store fields as much as possible.
> >>
> >> Why? Storing fields certainly adds lots of size to the _disk_ files, but
> >> has
> >> much less effect on memory requirements than one might think. The
> >> *.fdt and *.fdx files in your index are used for the stored data, and
> >> they're
> >> only read for the top N docs returned (30 in this case). And since the
> >> stored
> >> data is decompressed in 16K blocks, you'll only really pay a performance
> >> penalty if you have very large documents. The memory requirements for
> >> stored fields is pretty much governed by the documentCache.
> >>
> >> How are you committing? your solrconfig file has all commits commented
> out
> >> and it also has maxWarmingSearchers set to 4. Based on this scanty
> >> evidence,
> >> I'm guessing that you're committing from a client, and committing far
> >> too often. If
> >> that's true, your performance is probably largely governed by loading
> >> low-level
> >> caches.
> >>
> >> Your autowarming numbers in filterCache and queryResultCache are, on the
> >> face of it, far too large.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jun 25, 2015 at 8:12 AM, wwang525 <ww...@gmail.com> wrote:
> >> > schema.xml <
> http://lucene.472066.n3.nabble.com/file/n4213864/schema.xml>
> >> > solrconfig.xml
> >> > <http://lucene.472066.n3.nabble.com/file/n4213864/solrconfig.xml>
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
You're missing the point. One of the things that can really affect
response time is too-frequent commits. The fact that the commit
configurations have been commented out indicates that the commits
are happening either manually (curl, HTTP request or the like) _or_
you have, say, a SolrJ client that does a commit. Or, your index never
changes.

The fact that the maxWarmingSearchers setting is 4 rather than the
default 2 indicates that someone did change the config file. The fact
that the autoCommit is all commented out additionally points to
someone modifying it as these are not default settings.

So again,
1> are commits happening from some client?
or
2> does your index just never change?
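
For reference, a client-side commit of the kind meant in 1> typically looks
like this (the core name db-mssql is taken from a query URL elsewhere in
this thread):

  curl 'http://localhost:8983/solr/db-mssql/update?commit=true'

or, from SolrJ:

  solrServer.commit();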

And you haven't posted the results of issuing queries with
&debug=all either, this will show the time taken by various Solr
Solr components and may point to where the slowdown is coming from.

Best,
Erick

On Thu, Jun 25, 2015 at 9:48 AM, Wenbin Wang <ww...@gmail.com> wrote:
> Hi Erick,
>
> The configuration is largely the default one, and I have not made much
> change. I am also quite new to Solr although I have a lot of experience in
> other search products.
>
> The whole list of fields need to be retrieved, so I do not have much of a
> choice. The total size of the index files is about 1.2 G. I am not sure if
> this is a reasonable size for 14 M records in Solr. One field that could be
> removed is hotel name which can be retrieved/matched by mid-tier
> application based on hotelcode (in the search index).
>
> You mentioned maxWarmingSearchers and commented out configuration of
> "commit". That seems more related to indexing performance, and may not be
> related to query performance? Actually, these were out-of-box default
> configuration that I have not changed.
>
> Obviously the 1 second response time with a single request does not
> translate well in a concurrent users scenario. Do you see any necessary
> changes on the configuration files to make query perform faster?
>
> Thanks,
>
> On Thu, Jun 25, 2015 at 8:38 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> bq: Try not to store fields as much as possible.
>>
>> Why? Storing fields certainly adds lots of size to the _disk_ files, but
>> has
>> much less effect on memory requirements than one might think. The
>> *.fdt and *.fdx files in your index are used for the stored data, and
>> they're
>> only read for the top N docs returned (30 in this case). And since the
>> stored
>> data is decompressed in 16K blocks, you'll only really pay a performance
>> penalty if you have very large documents. The memory requirements for
>> stored fields is pretty much governed by the documentCache.
>>
>> How are you committing? your solrconfig file has all commits commented out
>> and it also has maxWarmingSearchers set to 4. Based on this scanty
>> evidence,
>> I'm guessing that you're committing from a client, and committing far
>> too often. If
>> that's true, your performance is probably largely governed by loading
>> low-level
>> caches.
>>
>> Your autowarming numbers in filterCache and queryResultCache are, on the
>> face of it, far too large.
>>
>> Best,
>> Erick
>>
>> On Thu, Jun 25, 2015 at 8:12 AM, wwang525 <ww...@gmail.com> wrote:
>> > schema.xml <http://lucene.472066.n3.nabble.com/file/n4213864/schema.xml>
>> > solrconfig.xml
>> > <http://lucene.472066.n3.nabble.com/file/n4213864/solrconfig.xml>
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: How to do a Data sharding for data in a database table

Posted by Wenbin Wang <ww...@gmail.com>.
Hi Erick,

The configuration is largely the default one, and I have not made much
change. I am also quite new to Solr although I have a lot of experience in
other search products.

The whole list of fields need to be retrieved, so I do not have much of a
choice. The total size of the index files is about 1.2 G. I am not sure if
this is a reasonable size for 14 M records in Solr. One field that could be
removed is hotel name which can be retrieved/matched by mid-tier
application based on hotelcode (in the search index).

You mentioned maxWarmingSearchers and commented out configuration of
"commit". That seems more related to indexing performance, and may not be
related to query performance? Actually, these were out-of-box default
configuration that I have not changed.

Obviously the 1 second response time with a single request does not
translate well in a concurrent users scenario. Do you see any necessary
changes on the configuration files to make query perform faster?

Thanks,

On Thu, Jun 25, 2015 at 8:38 AM, Erick Erickson <er...@gmail.com>
wrote:

> bq: Try not to store fields as much as possible.
>
> Why? Storing fields certainly adds lots of size to the _disk_ files, but
> has
> much less effect on memory requirements than one might think. The
> *.fdt and *.fdx files in your index are used for the stored data, and
> they're
> only read for the top N docs returned (30 in this case). And since the
> stored
> data is decompressed in 16K blocks, you'll only really pay a performance
> penalty if you have very large documents. The memory requirements for
> stored fields is pretty much governed by the documentCache.
>
> How are you committing? your solrconfig file has all commits commented out
> and it also has maxWarmingSearchers set to 4. Based on this scanty
> evidence,
> I'm guessing that you're committing from a client, and committing far
> too often. If
> that's true, your performance is probably largely governed by loading
> low-level
> caches.
>
> Your autowarming numbers in filterCache and queryResultCache are, on the
> face of it, far too large.
>
> Best,
> Erick
>
> On Thu, Jun 25, 2015 at 8:12 AM, wwang525 <ww...@gmail.com> wrote:
> > schema.xml <http://lucene.472066.n3.nabble.com/file/n4213864/schema.xml>
> > solrconfig.xml
> > <http://lucene.472066.n3.nabble.com/file/n4213864/solrconfig.xml>
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
bq: Try not to store fields as much as possible.

Why? Storing fields certainly adds lots of size to the _disk_ files, but has
much less effect on memory requirements than one might think. The
*.fdt and *.fdx files in your index are used for the stored data, and they're
only read for the top N docs returned (30 in this case). And since the stored
data is decompressed in 16K blocks, you'll only really pay a performance
penalty if you have very large documents. The memory requirements for
stored fields is pretty much governed by the documentCache.

How are you committing? your solrconfig file has all commits commented out
and it also has maxWarmingSearchers set to 4. Based on this scanty evidence,
I'm guessing that you're committing from a client, and committing far
too often. If
that's true, your performance is probably largely governed by loading low-level
caches.

Your autowarming numbers in filterCache and queryResultCache are, on the
face of it, far too large.
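
Something more moderate would look like (sizes here are illustrative starting
points, not universal recommendations):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="16"/>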

Best,
Erick

On Thu, Jun 25, 2015 at 8:12 AM, wwang525 <ww...@gmail.com> wrote:
> schema.xml <http://lucene.472066.n3.nabble.com/file/n4213864/schema.xml>
> solrconfig.xml
> <http://lucene.472066.n3.nabble.com/file/n4213864/solrconfig.xml>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
schema.xml <http://lucene.472066.n3.nabble.com/file/n4213864/schema.xml>  
solrconfig.xml
<http://lucene.472066.n3.nabble.com/file/n4213864/solrconfig.xml>  



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by William Bell <bi...@gmail.com>.
1GB is too small a starting heap. Try setting the same value for both:

 -Xms8196m -Xmx8196m

We use 12GB for these on a similarly sized index and it works well.

Send schema.xml and solrconfig.xml.

Try not to store fields as much as possible.

On Wed, Jun 24, 2015 at 8:08 AM, wwang525 <ww...@gmail.com> wrote:

> Hi All,
>
> I built the Solr index with 14 M records.
>
> I have > 20 G RAM in my local machine, and the Solr instance was started
> with -Xms1024m -Xmx8196m
>
> The following query:
>
>
> http://localhost:8983/solr/db-mssql/select?q=*:*&fq=GatewayCode:(YYZ)&fq=DestCode:(CUN)&fq=Duration:(5
> OR 6 OR 7 OR 8)&fq=DateDep:([20150610 TO
>
> 20150810])&facet=true&facet.field=DestCode&facet.field=DateDep&facet.field=GatewayCode&facet.field=HotelName&facet.sort=count&facet.limit=40&facet.mincount=1&rows=30&group=true&group.field=HotelCode&group.ngroups=true&group.facet=true&debugQuery=true
>
> The response found a total of 98105 matching base records. These records
> were grouped at the HotelCode level to give ngroups: 143; however, the query
> only retrieved the first base record of each group, and only 30 groups were
> returned.
>
> The performance statistics:
>
> Total response time in solr.log: 1791 ms
> From the query response page: the query took 764 ms and facet took 1007 ms.
> Debug took 13 ms
>
> This is a typical query that business need. Previously, I was testing the
> data size of 6 M and no faceted search, the typical response time at single
> request scenario was around 200 ms.
>
> Please let me know if additional information is needed.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213648.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
Hi All,

I built the Solr index with 14 M records.

I have > 20 G RAM in my local machine, and the Solr instance was started
with -Xms1024m -Xmx8196m

The following query:

http://localhost:8983/solr/db-mssql/select?q=*:*&fq=GatewayCode:(YYZ)&fq=DestCode:(CUN)&fq=Duration:(5
OR 6 OR 7 OR 8)&fq=DateDep:([20150610 TO
20150810])&facet=true&facet.field=DestCode&facet.field=DateDep&facet.field=GatewayCode&facet.field=HotelName&facet.sort=count&facet.limit=40&facet.mincount=1&rows=30&group=true&group.field=HotelCode&group.ngroups=true&group.facet=true&debugQuery=true
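
For readability, the same query broken out one parameter per line:

  q=*:*
  fq=GatewayCode:(YYZ)
  fq=DestCode:(CUN)
  fq=Duration:(5 OR 6 OR 7 OR 8)
  fq=DateDep:([20150610 TO 20150810])
  facet=true
  facet.field=DestCode
  facet.field=DateDep
  facet.field=GatewayCode
  facet.field=HotelName
  facet.sort=count
  facet.limit=40
  facet.mincount=1
  rows=30
  group=true
  group.field=HotelCode
  group.ngroups=true
  group.facet=true
  debugQuery=true

(group.ngroups=true and group.facet=true are the parameters called out as
expensive earlier in this thread.)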

The response found a total of 98105 matching base records. These records were
grouped at the HotelCode level to give ngroups: 143; however, the query only
retrieved the first base record of each group, and only 30 groups were
returned.

The performance statistics:

Total response time in solr.log: 1791 ms
From the query response page: the query took 764 ms and facet took 1007 ms.
Debug took 13 ms

This is a typical query that business need. Previously, I was testing the
data size of 6 M and no faceted search, the typical response time at single
request scenario was around 200 ms.

Please let me know if additional information is needed.

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213648.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
Do be aware that turning on &debug=query adds load. I've seen the debug
component take 90% of the query time (to be fair, it usually takes a much
smaller percentage).

But if you set debug=all, you'll see a section at the end of the response
with the time each component took, so you'll have a sense of the relative
time used by each component.
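
For illustration, debug=timing returns just the timing section. An
abridged, illustrative sketch of the XML response shape, plugging in the
numbers reported earlier in this thread:

  http://localhost:8983/solr/db-mssql/select?q=*:*&rows=0&debug=timing

  <lst name="debug">
    <lst name="timing">
      <double name="time">1791.0</double>
      <lst name="process">
        <lst name="query"><double name="time">764.0</double></lst>
        <lst name="facet"><double name="time">1007.0</double></lst>
        <lst name="debug"><double name="time">13.0</double></lst>
      </lst>
    </lst>
  </lst>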

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang <ww...@gmail.com> wrote:
> As for now, the index size is 6.5 M records, and the performance is good
> enough. I will re-build the index for all the records (14 M) and test it
> again with debug turned on.
>
> Thanks
>
>
> On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> First and most obvious thing to try:
>>
>> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>>
>> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
>> loosely coupled to JVM requirements. It's quite possible that you're
>> spending
>> all your time in GC cycles. Consider gathering GC characteristics, see:
>> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>>
>> As Charles says, on the face of it the system you describe should handle
>> quite
>> a load, so it feels like things can be tuned and you won't have to
>> resort to sharding.
>> Sharding inevitably imposes some overhead so it's best to go there last.
>>
>> From my perspective, this is, indeed, an XY problem. You're assuming
>> that sharding
>> is your solution. But you really haven't identified the _problem_ other
>> than
>> "queries are too slow". Let's nail down the reason queries are taking
>> a second before
>> jumping into sharding. I've just spent too much of my life fixing the
>> wrong thing ;)
>>
>> It would be useful to see a couple of sample queries so we can get a
>> feel for how complex they
>> are. Especially if you append, as Charles mentions, "debug=true"
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
>> <Ch...@tiaa-cref.org> wrote:
>> > Grouping does tend to be expensive.   Our regular queries typically
>> return in 10-15ms while the grouping queries take 60-80ms in a test
>> environment (< 1M docs).
>> >
>> > This is ok for us, since we wrote our app to take the grouping queries
>> out of the critical path (async query in parallel with two primary queries
>> and some work in middle tier).   But this approach is unlikely to work for
>> most cases.
>> >
>> > -----Original Message-----
>> > From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
>> > Sent: Friday, June 19, 2015 9:52 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: RE: How to do a Data sharding for data in a database table
>> >
>> > Hi Wenbin,
>> >
>> > To me, your instance appears well provisioned.  Likewise, your analysis
>> of test vs. production performance makes a lot of sense.  Perhaps your time
>> would be well spent tuning the query performance for your app before
>> resorting to sharding?
>> >
>> > To that end, what do you see when you set debugQuery=true?   Where does
>> solr spend the time?   My guess would be in the grouping and sorting steps,
>> but which?   Sometime the schema details matter for performance.   Folks on
>> this list can help with that.
>> >
>> > -Charlie
>> >
>> > -----Original Message-----
>> > From: Wenbin Wang [mailto:wwang525@gmail.com]
>> > Sent: Friday, June 19, 2015 7:55 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: How to do a Data sharding for data in a database table
>> >
>> > I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or
>> computer disk bound. In addition, the Solr was started with maximal 4G for
>> JVM, and index size is < 2G. In a typical test, I made sure enough free RAM
>> of 10G was available. I have not tuned any parameter in the configuration,
>> it is default configuration.
>> >
>> > The number of fields for each record is around 10, and the number of
>> results to be returned per page is 30. So the response time should not be
>> affected by network traffic, and it is tested in the same machine. The
>> query has a list of 4 search parameters, and each parameter takes a list of
>> values or date range. The results will also be grouped and sorted. The
>> response time of a typical single request is around 1 second. It can be > 1
>> second with more demanding requests.
>> >
>> > In our production environment, we have 64 cores, and we need to support >
>> > 300 concurrent users, that is about 300 concurrent request per second.
>> Each core will have to process about 5 request per second. The response
>> time under this load will not be 1 second any more. My estimate is that an
>> average of 200 ms response time of a single request would be able to handle
>> > 300 concurrent users in production. There is no plan to increase the
>> total number of cores 5 times.
>> >
>> > In a previous test, a search index around 6M data size was able to
>> handle >
>> > 5 request per second in each core of my 8-core machine.
>> >
>> > By doing data sharding from one single index of 13M to 2 indexes of 6 or
>> 7 M/each, I am expecting much faster response time that can meet the demand
>> of production environment. That is the motivation of doing data sharding.
>> > However, I am also open to solution that can improve the performance of
>> the  index of 13M to 14M size so that I do not need to do a data sharding.
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
>> erickerickson@gmail.com>
>> > wrote:
>> >
>> >> You've repeated your original statement. Shawn's observation is that
>> >> 10M docs is a very small corpus by Solr standards. You either have
>> >> very demanding document/search combinations or you have a poorly tuned
>> >> Solr installation.
>> >>
>> >> On reasonable hardware I expect 25-50M documents to have sub-second
>> >> response time.
>> >>
>> >> So what we're trying to do is be sure this isn't an "XY" problem, from
>> >> Hossman's apache page:
>> >>
>> >> Your question appears to be an "XY Problem" ... that is: you are
>> >> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y"
>> >> without giving more details about the "X" so that we can understand
>> >> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>> >>
>> >> So again, how would you characterize your documents? How many fields?
>> >> What do queries look like? How much physical memory on the machine?
>> >> How much memory have you allocated to the JVM?
>> >>
>> >> You might review:
>> >> http://wiki.apache.org/solr/UsingMailingLists
>> >>
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
>> >> > The query without load is still under 1 second. But under load,
>> >> > response
>> >> time
>> >> > can be much longer due to the queued up query.
>> >> >
>> >> > We would like to shard the data to something like 6 M / shard, which
>> >> > will still give a under 1 second response time under load.
>> >> >
>> >> > What are some best practice to shard the data? for example, we could
>> >> shard
>> >> > the data by date range, but that is pretty dynamic, and we could
>> >> > shard
>> >> data
>> >> > by some other properties, but if the data is not evenly distributed,
>> >> > you
>> >> may
>> >> > not be able shard it anymore.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
>> >> in-a-database-table-tp4212765p4212803.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >
>>

Re: How to do a Data sharding for data in a database table

Posted by Wenbin Wang <ww...@gmail.com>.
For now, the index size is 6.5 M records, and the performance is good
enough. I will rebuild the index with all the records (14 M) and test it
again with debug turned on.

Thanks


On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson <er...@gmail.com>
wrote:

> First and most obvious thing to try:
>
> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>
> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
> loosely coupled to JVM requirements. It's quite possible that you're
> spending
> all your time in GC cycles. Consider gathering GC characteristics, see:
> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>
> As Charles says, on the face of it the system you describe should handle
> quite
> a load, so it feels like things can be tuned and you won't have to
> resort to sharding.
> Sharding inevitably imposes some overhead so it's best to go there last.
>
> From my perspective, this is, indeed, an XY problem. You're assuming
> that sharding
> is your solution. But you really haven't identified the _problem_ other
> than
> "queries are too slow". Let's nail down the reason queries are taking
> a second before
> jumping into sharding. I've just spent too much of my life fixing the
> wrong thing ;)
>
> It would be useful to see a couple of sample queries so we can get a
> feel for how complex they
> are. Especially if you append, as Charles mentions, "debug=true"
>
> Best,
> Erick
>
> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
> <Ch...@tiaa-cref.org> wrote:
> > Grouping does tend to be expensive.   Our regular queries typically
> return in 10-15ms while the grouping queries take 60-80ms in a test
> environment (< 1M docs).
> >
> > This is ok for us, since we wrote our app to take the grouping queries
> out of the critical path (async query in parallel with two primary queries
> and some work in middle tier).   But this approach is unlikely to work for
> most cases.
> >
> > -----Original Message-----
> > From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
> > Sent: Friday, June 19, 2015 9:52 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: How to do a Data sharding for data in a database table
> >
> > Hi Wenbin,
> >
> > To me, your instance appears well provisioned.  Likewise, your analysis
> of test vs. production performance makes a lot of sense.  Perhaps your time
> would be well spent tuning the query performance for your app before
> resorting to sharding?
> >
> > To that end, what do you see when you set debugQuery=true?   Where does
> solr spend the time?   My guess would be in the grouping and sorting steps,
> but which?   Sometime the schema details matter for performance.   Folks on
> this list can help with that.
> >
> > -Charlie
> >
> > -----Original Message-----
> > From: Wenbin Wang [mailto:wwang525@gmail.com]
> > Sent: Friday, June 19, 2015 7:55 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to do a Data sharding for data in a database table
> >
> > I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or
> computer disk bound. In addition, the Solr was started with maximal 4G for
> JVM, and index size is < 2G. In a typical test, I made sure enough free RAM
> of 10G was available. I have not tuned any parameter in the configuration,
> it is default configuration.
> >
> > The number of fields for each record is around 10, and the number of
> results to be returned per page is 30. So the response time should not be
> affected by network traffic, and it is tested in the same machine. The
> query has a list of 4 search parameters, and each parameter takes a list of
> values or date range. The results will also be grouped and sorted. The
> response time of a typical single request is around 1 second. It can be > 1
> second with more demanding requests.
> >
> > In our production environment, we have 64 cores, and we need to support >
> > 300 concurrent users, that is about 300 concurrent request per second.
> Each core will have to process about 5 request per second. The response
> time under this load will not be 1 second any more. My estimate is that an
> average of 200 ms response time of a single request would be able to handle
> > 300 concurrent users in production. There is no plan to increase the
> total number of cores 5 times.
> >
> > In a previous test, a search index around 6M data size was able to
> handle >
> > 5 request per second in each core of my 8-core machine.
> >
> > By doing data sharding from one single index of 13M to 2 indexes of 6 or
> 7 M/each, I am expecting much faster response time that can meet the demand
> of production environment. That is the motivation of doing data sharding.
> > However, I am also open to solution that can improve the performance of
> the  index of 13M to 14M size so that I do not need to do a data sharding.
> >
> >
> >
> >
> >
> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> >> You've repeated your original statement. Shawn's observation is that
> >> 10M docs is a very small corpus by Solr standards. You either have
> >> very demanding document/search combinations or you have a poorly tuned
> >> Solr installation.
> >>
> >> On reasonable hardware I expect 25-50M documents to have sub-second
> >> response time.
> >>
> >> So what we're trying to do is be sure this isn't an "XY" problem, from
> >> Hossman's apache page:
> >>
> >> Your question appears to be an "XY Problem" ... that is: you are
> >> dealing with "X", you are assuming "Y" will help you, and you are
> asking about "Y"
> >> without giving more details about the "X" so that we can understand
> >> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
> >>
> >> So again, how would you characterize your documents? How many fields?
> >> What do queries look like? How much physical memory on the machine?
> >> How much memory have you allocated to the JVM?
> >>
> >> You might review:
> >> http://wiki.apache.org/solr/UsingMailingLists
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
> >> > The query without load is still under 1 second. But under load,
> >> > response
> >> time
> >> > can be much longer due to the queued up query.
> >> >
> >> > We would like to shard the data to something like 6 M / shard, which
> >> > will still give a under 1 second response time under load.
> >> >
> >> > What are some best practice to shard the data? for example, we could
> >> shard
> >> > the data by date range, but that is pretty dynamic, and we could
> >> > shard
> >> data
> >> > by some other properties, but if the data is not evenly distributed,
> >> > you
> >> may
> >> > not be able shard it anymore.
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
> >> in-a-database-table-tp4212765p4212803.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
>

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
First and most obvious thing to try:

bq: the Solr was started with maximal 4G for JVM, and index size is < 2G

Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
loosely coupled to JVM requirements. It's quite possible that you're spending
all your time in GC cycles. Consider gathering GC characteristics, see:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
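
A sketch of both changes, assuming the Solr 5.x bin/solr script, where -m
sets the heap and -a passes extra JVM flags (heap size and GC log path
are illustrative):

  bin/solr start -m 12g \
    -a "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/solr/logs/gc.log"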

As Charles says, on the face of it the system you describe should handle quite
a load, so it feels like things can be tuned and you won't have to
resort to sharding.
Sharding inevitably imposes some overhead so it's best to go there last.

From my perspective, this is, indeed, an XY problem. You're assuming that
sharding is your solution. But you really haven't identified the _problem_
other than "queries are too slow". Let's nail down the reason queries are
taking a second before jumping into sharding. I've just spent too much of
my life fixing the wrong thing ;)

It would be useful to see a couple of sample queries so we can get a feel
for how complex they are. Especially if you append, as Charles mentions,
"debug=true".

Best,
Erick

On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
<Ch...@tiaa-cref.org> wrote:
> Grouping does tend to be expensive.   Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment (< 1M docs).
>
> This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier).   But this approach is unlikely to work for most cases.
>
> -----Original Message-----
> From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org]
> Sent: Friday, June 19, 2015 9:52 AM
> To: solr-user@lucene.apache.org
> Subject: RE: How to do a Data sharding for data in a database table
>
> Hi Wenbin,
>
> To me, your instance appears well provisioned.  Likewise, your analysis of test vs. production performance makes a lot of sense.  Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding?
>
> To that end, what do you see when you set debugQuery=true?   Where does solr spend the time?   My guess would be in the grouping and sorting steps, but which?   Sometime the schema details matter for performance.   Folks on this list can help with that.
>
> -Charlie
>
> -----Original Message-----
> From: Wenbin Wang [mailto:wwang525@gmail.com]
> Sent: Friday, June 19, 2015 7:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to do a Data sharding for data in a database table
>
> I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is < 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration.
>
> The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be > 1 second with more demanding requests.
>
> In our production environment, we have 64 cores, and we need to support >
> 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle
> 300 concurrent users in production. There is no plan to increase the total number of cores 5 times.
>
> In a previous test, a search index around 6M data size was able to handle >
> 5 request per second in each core of my 8-core machine.
>
> By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding.
> However, I am also open to solution that can improve the performance of the  index of 13M to 14M size so that I do not need to do a data sharding.
>
>
>
>
>
> On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> You've repeated your original statement. Shawn's observation is that
>> 10M docs is a very small corpus by Solr standards. You either have
>> very demanding document/search combinations or you have a poorly tuned
>> Solr installation.
>>
>> On reasonable hardware I expect 25-50M documents to have sub-second
>> response time.
>>
>> So what we're trying to do is be sure this isn't an "XY" problem, from
>> Hossman's apache page:
>>
>> Your question appears to be an "XY Problem" ... that is: you are
>> dealing with "X", you are assuming "Y" will help you, and you are asking about "Y"
>> without giving more details about the "X" so that we can understand
>> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>
>> So again, how would you characterize your documents? How many fields?
>> What do queries look like? How much physical memory on the machine?
>> How much memory have you allocated to the JVM?
>>
>> You might review:
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>>
>> Best,
>> Erick
>>
>> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
>> > The query without load is still under 1 second. But under load,
>> > response
>> time
>> > can be much longer due to the queued up query.
>> >
>> > We would like to shard the data to something like 6 M / shard, which
>> > will still give a under 1 second response time under load.
>> >
>> > What are some best practice to shard the data? for example, we could
>> shard
>> > the data by date range, but that is pretty dynamic, and we could
>> > shard
>> data
>> > by some other properties, but if the data is not evenly distributed,
>> > you
>> may
>> > not be able shard it anymore.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
>> in-a-database-table-tp4212765p4212803.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>
>

RE: How to do a Data sharding for data in a database table

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Grouping does tend to be expensive.   Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment (< 1M docs).

This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier).   But this approach is unlikely to work for most cases.
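
A sketch of that split, reusing field names from this thread (the exact
parameters are illustrative); the client fires both requests concurrently
and renders documents and facets without waiting for the group counts:

  # primary query: documents + facets, no grouping
  http://localhost:8983/solr/db-mssql/select?q=*:*&fq=GatewayCode:(YYZ)&rows=30&facet=true&facet.field=HotelName

  # fired in parallel: group counts only, no documents, no facets
  http://localhost:8983/solr/db-mssql/select?q=*:*&fq=GatewayCode:(YYZ)&rows=0&group=true&group.field=HotelCode&group.ngroups=true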

-----Original Message-----
From: Reitzel, Charles [mailto:Charles.Reitzel@tiaa-cref.org] 
Sent: Friday, June 19, 2015 9:52 AM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Hi Wenbin,

To me, your instance appears well provisioned.  Likewise, your analysis of test vs. production performance makes a lot of sense.  Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding?   

To that end, what do you see when you set debugQuery=true?   Where does solr spend the time?   My guess would be in the grouping and sorting steps, but which?   Sometime the schema details matter for performance.   Folks on this list can help with that.

-Charlie

-----Original Message-----
From: Wenbin Wang [mailto:wwang525@gmail.com]
Sent: Friday, June 19, 2015 7:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is < 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration.

The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be > 1 second with more demanding requests.

In our production environment, we have 64 cores, and we need to support >
300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle
300 concurrent users in production. There is no plan to increase the total number of cores 5 times.

In a previous test, a search index around 6M data size was able to handle >
5 request per second in each core of my 8-core machine.

By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding.
However, I am also open to solution that can improve the performance of the  index of 13M to 14M size so that I do not need to do a data sharding.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <er...@gmail.com>
wrote:

> You've repeated your original statement. Shawn's observation is that 
> 10M docs is a very small corpus by Solr standards. You either have 
> very demanding document/search combinations or you have a poorly tuned 
> Solr installation.
>
> On reasonable hardware I expect 25-50M documents to have sub-second 
> response time.
>
> So what we're trying to do is be sure this isn't an "XY" problem, from 
> Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are 
> dealing with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand 
> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> So again, how would you characterize your documents? How many fields? 
> What do queries look like? How much physical memory on the machine? 
> How much memory have you allocated to the JVM?
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
> > The query without load is still under 1 second. But under load, 
> > response
> time
> > can be much longer due to the queued up query.
> >
> > We would like to shard the data to something like 6 M / shard, which 
> > will still give a under 1 second response time under load.
> >
> > What are some best practice to shard the data? for example, we could
> shard
> > the data by date range, but that is pretty dynamic, and we could 
> > shard
> data
> > by some other properties, but if the data is not evenly distributed, 
> > you
> may
> > not be able shard it anymore.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
> in-a-database-table-tp4212765p4212803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: How to do a Data sharding for data in a database table

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.
Hi Wenbin,

To me, your instance appears well provisioned.  Likewise, your analysis of test vs. production performance makes a lot of sense.  Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding?   

To that end, what do you see when you set debugQuery=true?   Where does Solr spend the time?   My guess would be in the grouping and sorting steps, but which?   Sometimes the schema details matter for performance.   Folks on this list can help with that.

-Charlie

-----Original Message-----
From: Wenbin Wang [mailto:wwang525@gmail.com] 
Sent: Friday, June 19, 2015 7:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is < 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration.

The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be > 1 second with more demanding requests.

In our production environment, we have 64 cores, and we need to support >
300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle
300 concurrent users in production. There is no plan to increase the total number of cores 5 times.

In a previous test, a search index around 6M data size was able to handle >
5 request per second in each core of my 8-core machine.

By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding.
However, I am also open to solution that can improve the performance of the  index of 13M to 14M size so that I do not need to do a data sharding.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <er...@gmail.com>
wrote:

> You've repeated your original statement. Shawn's observation is that 
> 10M docs is a very small corpus by Solr standards. You either have 
> very demanding document/search combinations or you have a poorly tuned 
> Solr installation.
>
> On reasonable hardware I expect 25-50M documents to have sub-second 
> response time.
>
> So what we're trying to do is be sure this isn't an "XY" problem, from 
> Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are 
> dealing with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand 
> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> So again, how would you characterize your documents? How many fields? 
> What do queries look like? How much physical memory on the machine? 
> How much memory have you allocated to the JVM?
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
> > The query without load is still under 1 second. But under load, 
> > response
> time
> > can be much longer due to the queued up query.
> >
> > We would like to shard the data to something like 6 M / shard, which 
> > will still give a under 1 second response time under load.
> >
> > What are some best practice to shard the data? for example, we could
> shard
> > the data by date range, but that is pretty dynamic, and we could 
> > shard
> data
> > by some other properties, but if the data is not evenly distributed, 
> > you
> may
> > not be able shard it anymore.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
> in-a-database-table-tp4212765p4212803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to do a Data sharding for data in a database table

Posted by Wenbin Wang <ww...@gmail.com>.
I have enough RAM (30G) and hard disk (1000G). It is not I/O-bound or
disk-bound. In addition, Solr was started with a maximum of 4G for the
JVM, and the index size is < 2G. In a typical test, I made sure at least
10G of free RAM was available. I have not tuned any parameters in the
configuration; it is the default configuration.

The number of fields in each record is around 10, and the number of
results returned per page is 30, so the response time should not be
affected by network traffic; it is also tested on the same machine. The
query has a list of 4 search parameters, and each parameter takes a list
of values or a date range. The results are also grouped and sorted. The
response time of a typical single request is around 1 second, and it can
be more than 1 second for more demanding requests.

In our production environment, we have 64 cores, and we need to support
more than 300 concurrent users, that is, about 300 concurrent requests
per second. Each core will have to process about 5 requests per second,
and the response time under this load will no longer be 1 second. My
estimate is that an average single-request response time of 200 ms would
be able to handle 300 concurrent users in production. There is no plan to
increase the total number of cores 5 times.
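
(Roughly, assuming each request is served serially by a single core:
300 requests/s / 64 cores ≈ 4.7 requests/s per core, so the per-request
budget is about 1000 ms / 4.7 ≈ 213 ms, hence the ~200 ms target.)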

In a previous test, a search index of around 6M records was able to handle
more than 5 requests per second on each core of my 8-core machine.

By sharding the single 13M index into 2 indexes of 6 or 7 M each, I am
expecting a much faster response time that can meet the demands of the
production environment. That is the motivation for doing data sharding.
However, I am also open to solutions that improve the performance of the
13M-14M index so that I do not need to shard at all.





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <er...@gmail.com>
wrote:

> You've repeated your original statement. Shawn's
> observation is that 10M docs is a very small corpus
> by Solr standards. You either have very demanding
> document/search combinations or you have a poorly
> tuned Solr installation.
>
> On reasonable hardware I expect 25-50M documents to have
> sub-second response time.
>
> So what we're trying to do is be sure this isn't
> an "XY" problem, from Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> So again, how would you characterize your documents? How many
> fields? What do queries look like? How much physical memory on the
> machine? How much memory have you allocated to the JVM?
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
> > The query without load is still under 1 second. But under load, response
> time
> > can be much longer due to the queued up query.
> >
> > We would like to shard the data to something like 6 M / shard, which will
> > still give a under 1 second response time under load.
> >
> > What are some best practice to shard the data? for example, we could
> shard
> > the data by date range, but that is pretty dynamic, and we could shard
> data
> > by some other properties, but if the data is not evenly distributed, you
> may
> > not be able shard it anymore.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: How to do a Data sharding for data in a database table

Posted by Erick Erickson <er...@gmail.com>.
You've repeated your original statement. Shawn's
observation is that 10M docs is a very small corpus
by Solr standards. You either have very demanding
document/search combinations or you have a poorly
tuned Solr installation.

On reasonable hardware I expect 25-50M documents to have
sub-second response time.

So what we're trying to do is be sure this isn't
an "XY" problem, from Hossman's apache page:

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

So again, how would you characterize your documents? How many
fields? What do queries look like? How much physical memory on the
machine? How much memory have you allocated to the JVM?

You might review:
http://wiki.apache.org/solr/UsingMailingLists


Best,
Erick

On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <ww...@gmail.com> wrote:
> The query without load is still under 1 second. But under load, response time
> can be much longer due to the queued up query.
>
> We would like to shard the data to something like 6 M / shard, which will
> still give a under 1 second response time under load.
>
> What are some best practice to shard the data? for example, we could shard
> the data by date range, but that is pretty dynamic, and we could shard data
> by some other properties, but if the data is not evenly distributed, you may
> not be able shard it anymore.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to do a Data sharding for data in a database table

Posted by wwang525 <ww...@gmail.com>.
The query without load is still under 1 second. But under load, the
response time can be much longer due to queued-up queries.

We would like to shard the data to something like 6 M / shard, which
should still give an under-1-second response time under load.

What are some best practices for sharding the data? For example, we could
shard the data by date range, but that is pretty dynamic; we could shard
the data by some other properties, but if the data is not evenly
distributed, you may not be able to shard it evenly.
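
As a concrete sketch (SolrCloud's Collections API; the collection name,
shard count, and routing key below are illustrative), shards can be
created up front with documents spread evenly by hash, or routed
explicitly with a compositeId prefix so that related documents stay on
the same shard:

  http://localhost:8983/solr/admin/collections?action=CREATE&name=packages&numShards=2&replicationFactor=1&maxShardsPerNode=2

  # compositeId routing: prefix the uniqueKey with a routing key, e.g.
  #   id = YYZ!12345   (all YYZ! documents hash to the same shard)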



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
Sent from the Solr - User mailing list archive at Nabble.com.