Posted to solr-user@lucene.apache.org by Aman Tandon <am...@gmail.com> on 2017/09/12 03:06:37 UTC

Provide suggestion on indexing performance

Hi,

We want to know about indexing performance in the scenarios below.
Assume 10 string fields and a total of 10 million documents.

1) indexed=true, stored=true
2) indexed=true, docValues=true

Which one should we prefer in terms of indexing performance? Please share
your experience.

With regards,
Aman Tandon

Re: Provide suggestion on indexing performance

Posted by Aman Tandon <am...@gmail.com>.
Hi Shawn,

Thanks for your reply, this is really helpful. I will try this out to see
the performance with the docValues.

With regards,
Aman Tandon

Re: Provide suggestion on indexing performance

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/11/2017 9:06 PM, Aman Tandon wrote:
> We want to know about the indexing performance in the below mentioned
> scenarios, consider the total number of 10 string fields and total number
> of documents are 10 million.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.

There are several settings in the schema for each field, things like
indexed, stored, docValues, multiValued, and others.  You should base
your choices on what you need Solr to do.  Choosing these settings based
purely on desired indexing speed may result in Solr not doing what you
want it to do.

When the indexing system sends data to Solr with several threads or
processes, Solr is *usually* capable of indexing data faster than most
systems can supply it.  The more settings you disable on a field, the
faster Solr will be able to index.
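
As a rough illustration of feeding Solr from several threads, here is a
minimal Python sketch. The helper names (batches, index_parallel) and the
send_batch callable are hypothetical, not part of any Solr client. In real
use send_batch would post one batch of documents to the collection's update
handler and return how many documents it sent.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def index_parallel(docs, send_batch, workers=4, batch_size=1000):
    """Send batches through `send_batch` from several threads.

    `send_batch` is a caller-supplied (hypothetical) function that takes a
    list of documents and returns the number of documents it sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits every batch up front, so this sketch suits
        # data sets that fit in memory.
        return sum(pool.map(send_batch, batches(docs, batch_size)))
```

The worker count and batch size are guesses to tune, not recommendations.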

It is not possible to provide precise numbers, because performance
depends on many factors, some of which you may not even know until you
build a production system.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

All that said ... docValues MIGHT be a little bit faster than stored,
because stored data is compressed, and the compression takes CPU time. 
On a fully populated production system, that statement might turn out to
be wrong.  There may be factors that result in stored fields working
better.  The best way to decide is to try it both ways with all your data.
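
One way to compare the two runs is to time each full indexing pass and look
at documents per second. The sketch below is only a measuring wrapper. The
measure function and the index_fn callable are hypothetical names, and
index_fn is assumed to do the whole job, including the final commit.

```python
import time


def measure(label, index_fn, num_docs):
    """Run `index_fn` once and report elapsed time and throughput."""
    start = time.perf_counter()
    index_fn()  # assumed to index all documents and commit
    elapsed = time.perf_counter() - start
    rate = num_docs / elapsed if elapsed > 0 else float("inf")
    return {"label": label, "seconds": elapsed, "docs_per_sec": rate}
```

Running each variant more than once, against a freshly emptied index, should
give numbers that are easier to trust.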

Thanks,
Shawn


Re: Provide suggestion on indexing performance

Posted by Aman Tandon <am...@gmail.com>.
Hi Tom,

Thanks for your suggestion and the information.

I will try this out to test and will share the results.

Re: Provide suggestion on indexing performance

Posted by "Sreenivas.T" <sr...@gmail.com>.
I agree with Tom. Doc values and stored fields exist for different
reasons. Doc values are another index structure that gets built for faster
sorting/faceting.

Re: Provide suggestion on indexing performance

Posted by Tom Evans <te...@googlemail.com>.
On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon <am...@gmail.com> wrote:
> Hi,
>
> We want to know about the indexing performance in the below mentioned
> scenarios, consider the total number of 10 string fields and total number
> of documents are 10 million.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.
>
> With regards,
> Aman Tandon

Your question doesn't make much sense. You turn on stored when you
need to retrieve the original contents of the fields after searching,
and you use docValues to speed up faceting, sorting and grouping.
Using docValues to retrieve values during search is more expensive
than simply using stored values, so if your primary aim is retrieving
field values, use stored=true.
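
To make the two scenarios concrete, here is a small Python sketch that builds
Schema API "add-field" payloads for both. The field names (attr_0 to attr_9)
are made up, the schema is assumed to define a field type called "string",
and scenario 2 is assumed to mean stored=false, which the question does not
actually say. The sketch only builds the JSON; it does not send anything.

```python
import json


def schema_payload(stored, doc_values, count=10):
    """Build an "add-field" payload for `count` indexed string fields."""
    fields = [
        {
            "name": f"attr_{i}",  # hypothetical field name
            "type": "string",     # assumes the schema has this type
            "indexed": True,
            "stored": stored,
            "docValues": doc_values,
        }
        for i in range(count)
    ]
    return {"add-field": fields}


# Scenario 1: indexed=true, stored=true
scenario_1 = schema_payload(stored=True, doc_values=False)
# Scenario 2: indexed=true, docValues=true (stored=false is an assumption)
scenario_2 = schema_payload(stored=False, doc_values=True)

if __name__ == "__main__":
    print(json.dumps(scenario_1, indent=2))
```

Each payload would go to a separate test collection so the two index builds
can be compared on the same data.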

Secondly, the only way to answer performance questions for your schema
and data is to try it out. Generate 10 million docs, store them in a
file (e.g. as CSV), and then use the post tool to try different schema
and query options.
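
Generating the test data could look roughly like the Python sketch below. The
function name, the column names (id, attr_0 to attr_9) and the random
lowercase values are all placeholders. Real data will have a different
distribution of values, which can change the results.

```python
import csv
import random
import string


def write_test_docs(path, num_docs, num_fields=10, value_len=12, seed=42):
    """Write `num_docs` rows of random string fields to a CSV file."""
    rng = random.Random(seed)  # fixed seed so both runs index identical data
    header = ["id"] + [f"attr_{i}" for i in range(num_fields)]
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for doc_id in range(num_docs):
            values = [
                "".join(rng.choices(string.ascii_lowercase, k=value_len))
                for _ in range(num_fields)
            ]
            writer.writerow([doc_id] + values)
    return header


if __name__ == "__main__":
    write_test_docs("docs.csv", 10_000_000)
```

The resulting file can then be fed to the post tool once per schema variant.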

Cheers

Tom