Posted to dev@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2013/11/05 11:47:35 UTC

DocValue on Strings slow and OOM

Hi

We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small" 
documents loaded. The documents have the following fields:
* a_dlng_doc_sto
* b_dlng_doc_sto
* c_dstr_doc_sto
* timestamp_lng_ind_sto
* d_lng_ind_sto
From schema.xml:
     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" stored="true" required="true" docValues="true"/>
     <dynamicField name="*_lng_ind_sto" type="long" indexed="true" stored="true"/>
     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
...
     <fieldType name="dstring" class="solr.StrField" sortMissingLast="true" docValuesFormat="Disk"/>
     <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>

We execute queries of the following form:
* q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
* facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
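
For anyone who wants to reproduce the two requests, here is a minimal sketch that builds the query string for the form above using only the JDK. The class and method names (FacetRequestSketch, buildQueryString) and the sample values are hypothetical and not part of Solr or of our code; only the field names and parameters are taken from this mail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: builds the URL query string for
// the facet request described above. Only facetField differs between the
// fast and the slow request.
final class FacetRequestSketch {

    static String buildQueryString(long from, long to, long[] dValues, String facetField) {
        // d_lng_ind_sto:(a OR b OR ... OR n)
        StringBuilder ids = new StringBuilder();
        for (int i = 0; i < dValues.length; i++) {
            if (i > 0) {
                ids.append(" OR ");
            }
            ids.append(dValues[i]);
        }
        String q = "timestamp_lng_ind_sto:[" + from + " TO " + to + "]"
                + " AND d_lng_ind_sto:(" + ids + ")";

        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("facet", "true");
        params.put("facet.field", facetField);
        params.put("facet.zeros", "false");
        params.put("facet.mincount", "1");

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (out.length() > 0) {
                out.append('&');
            }
            out.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return out.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 3L}; // made-up sample values
        System.out.println(buildQueryString(100L, 200L, ids, "a_dlng_doc_sto"));
        System.out.println(buildQueryString(100L, 200L, ids, "c_dstr_doc_sto"));
    }
}
```

The two printed strings differ only in the facet.field value, which is the whole difference between the fast and the slow request.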

For example, executing a query with values for x, y, a, b ... and n that hits 
only 6 documents in total (out of the 12 billion):
* With <field>=a_dlng_doc_sto (long docvalue) the query responds fairly 
quickly (< 2 sec)
* With <field>=c_dstr_doc_sto (string docvalue) the query responds very 
slowly (> 100 sec), and only if we give the Solr nodes a lot of Xmx. If 
Xmx is too low we get OOM on the involved Solr nodes and never see a 
response
The c_dstr_doc_sto strings are all about 10-15 chars, so they are not very 
long strings.

Is it a known issue that there is such a big difference between facet 
searches on longs and strings? And that memory usage seems to be very 
different as well?
If yes, has it been optimized after 4.4.0?

Regards, Per Steffensen

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

Please note that, for now, this problem is no longer relevant for us: 
we will change our c-field from type string (docValue) to 
type long (docValue). Faceting on huge numbers of long 
docValues seems to perform very well, except for 
https://issues.apache.org/jira/browse/SOLR-5444, but we have handled 
that now.

I would like to help verify that the string-faceting problem this 
thread has been about has been fixed in 4.5.1, i.e. that things 
perform better and without huge memory usage. To do that, I would 
really like to be able to deploy 4.5.1 on top of 
my 12 billion documents indexed with 4.4.0. Can anyone confirm that I 
ought to be able to do that? I tried briefly but ran into problems. 
When trying to start Solr it says:

[2013-11-08 17:45:48,829]ERROR [coreLoadExecutor-4-thread-19] [logid: ] - org.apache.solr.common.SolrException.log(SolrException.java:119) -null:org.apache.solr.common.SolrException: Unable to create core: mycoll_shard13_replica1
         at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:934)
         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:566)
         at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
         at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:625)
         at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:256)
         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:555)
         ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1477)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1589)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:821)
         ... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format: 12, input=MMapIndexInput(path="/usr/lib/solr/data/mycoll_shard13_replica1/data/index/_1k63_Disk_0.dvdm")
         at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readNumericEntry(Lucene45DocValuesProducer.java:207)
         at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readFields(Lucene45DocValuesProducer.java:120)
         at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.<init>(Lucene45DocValuesProducer.java:85)
         at org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.<init>(DiskDocValuesProducer.java:31)
         at org.apache.lucene.codecs.diskdv.DiskDocValuesFormat.fieldsProducer(DiskDocValuesFormat.java:56)
         at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:215)
         at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:300)
         at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:140)
         at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
         at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
         at org.apache.lucene.index.ReadersAndLiveDocs.getReadOnlyClone(ReadersAndLiveDocs.java:217)
         at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
         at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:379)
         at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
         at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:41)
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1443)
         ... 15 more

Besides that, see comments below

On 11/14/13 7:54 PM, Joel Bernstein wrote:
> Per,
>
> As you are seeing there are different implementations for calculating 
> facets for numeric fields and string fields. The numeric fields I 
> believe are using an int-to-int or long-to-int hashmap to hold the 
> facet counts. This map grows as values are added to it. The String 
> version uses an int array the size of the number of distinct values in 
> the field to hold the facet counts. So if you have a very large number 
> of distinct values in the field, you'll have a very large array.
Do not think this part is a problem
> Also the distinct values themselves are held in memory in the 
> fieldCache for string fields.
Yes, that is probably a problem

Also note 
https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png 
and my comments on it in a mail earlier in this thread.
>
> So, basically as you are seeing you'll take up a much larger memory 
> footprint when faceting on a high cardinality string field than 
> on a high cardinality numeric field.
>
> There are docvalues faceting implementations that will kick-in on a 
> field that has docvalues. You can try setting the on disk flag
Believe I did that for my string field "c_dstr_doc_sto"?
From schema.xml:
     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" stored="true" required="true" docValues="true"/>
     <dynamicField name="*_lng_ind_sto" type="long" indexed="true" stored="true"/>
     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
...
     <fieldType name="dstring" class="solr.StrField" sortMissingLast="true" docValuesFormat="Disk"/>
     <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>

Did I miss something?
> and this will test memory and performance.
>
> Joel
>
>
>
>
>
> On Thu, Nov 14, 2013 at 8:13 AM, Per Steffensen <steff@designware.dk> wrote:
>
>     If anyone is following this one, just an update. We are not going
>     to upgrade to 4.5.1 in order to see if the String facet
>     performance problem has been fixed. Instead we have made a few
>     hacks around our data so that we can store the c-field
>     (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So now we only
>     need to struggle with long-facet performance. There is a
>     performance issue with facets on longs though, but I will tell
>     about in another mailing-thread - need your input on what solution
>     you prefer.
>
https://issues.apache.org/jira/browse/SOLR-5444
>
>
>     Regards, Per Steffensen
>

Re: DocValue on Strings slow and OOM

Posted by Joel Bernstein <jo...@gmail.com>.

Per,

As you are seeing there are different implementations for calculating
facets for numeric fields and string fields. The numeric fields I believe
are using an int-to-int or long-to-int hashmap to hold the facet counts.
This map grows as values are added to it. The String version uses an int
array the size of the number of distinct values in the field to hold the
facet counts. So if you have a very large number of distinct values in the
field, you'll have a very large array. Also the distinct values themselves
are held in memory in the fieldCache for string fields.

So, basically as you are seeing you'll take up a much larger memory
footprint when faceting on a high cardinality string field than on a
high cardinality numeric field.
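
To make the difference concrete, here is a rough sketch of the two counting strategies as described above. It is an illustration only, not the actual Solr code: the class and method names are made up, and a plain java.util.HashMap stands in for whatever primitive map the numeric path really uses.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; names are hypothetical and this is not Solr code.
final class FacetCountStructures {

    // Hash-based counting: one entry per distinct value that actually
    // occurs in the matching documents, so the map grows with the hits.
    static Map<Long, Integer> countWithMap(long[] valuesOfMatchingDocs) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long v : valuesOfMatchingDocs) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Ordinal-based counting: one int slot per distinct value in the
    // whole field, allocated up front regardless of how many docs match.
    static int[] countWithOrdinalArray(int[] ordsOfMatchingDocs, int distinctValuesInField) {
        int[] counts = new int[distinctValuesInField];
        for (int ord : ordsOfMatchingDocs) {
            counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Six matching documents in both cases (sample data is made up).
        Map<Long, Integer> byMap = countWithMap(new long[] {7L, 7L, 42L, 7L, 42L, 99L});
        System.out.println("map entries: " + byMap.size());

        int[] byArray = countWithOrdinalArray(new int[] {3, 3, 10, 3, 10, 500_000}, 1_000_000);
        System.out.println("array slots: " + byArray.length);
    }
}
```

With six hits the map holds three entries, while the array has a slot for every distinct value in the field, which is why cardinality matters for the string path even when very few documents match.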

There are docvalues faceting implementations that will kick-in on a field
that has docvalues. You can try setting the on disk flag and this will test
memory and performance.

Joel


On Thu, Nov 14, 2013 at 8:13 AM, Per Steffensen <st...@designware.dk> wrote:

>  If anyone is following this one, just an update. We are not going to
> upgrade to 4.5.1 in order to see if the String facet performance problem
> has been fixed. Instead we have made a few hacks around our data so that we
> can store the c-field (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So
> now we only need to struggle with long-facet performance. There is a
> performance issue with facets on longs though, but I will tell about in
> another mailing-thread - need your input on what solution you prefer.
>
> Regards, Per Steffensen
>
>
> On 11/6/13 12:15 PM, Per Steffensen wrote:
>
> On 11/6/13 11:43 AM, Robert Muir wrote:
>
> Before lucene 4.5 docvalues were loaded entirely into RAM.
>
> I'm not going to waste time debugging any old code releases here, you
> should upgrade to the latest release!
>
>  Ok, thanks!
>
> I do not consider it a bug (just a performance issue), so no debugging
> needed.
> It is just that we do not want to spend time upgrading to 4.5 if there is
> not a justified hope/explanation that it will probably make things
> better. But I guess there is.
>
> One short question: Will 4.5 index things differently (compared to 4.4)
> for documents with fields like I showed earlier? Im basically asking if we
> need to reindex the 12billion documents again after upgrading to 4.5, or if
> we ought to be able to deploy 4.5 on top of the already indexed documents.
>
> Regards, Per Steffensen
>
>
>

-- 
Joel Bernstein
Search Engineer at Heliosearch

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

If anyone is following this one, just an update. We are not going to 
upgrade to 4.5.1 in order to see if the String facet performance problem 
has been fixed. Instead we have made a few hacks around our data so that 
we can store the c-field (c_dstr_doc_sto) as long instead 
(c_dlng_doc_sto). So now we only need to struggle with long-facet 
performance. There is a performance issue with facets on longs though, 
but I will describe it in another mailing-thread, where I need your 
input on which solution you prefer.

Regards, Per Steffensen

On 11/6/13 12:15 PM, Per Steffensen wrote:
> On 11/6/13 11:43 AM, Robert Muir wrote:
>> Before lucene 4.5 docvalues were loaded entirely into RAM.
>>
>> I'm not going to waste time debugging any old code releases here, you
>> should upgrade to the latest release!
> Ok, thanks!
>
> I do not consider it a bug (just a performance issue), so no debugging 
> needed.
> It is just that we do not want to spend time upgrading to 4.5 if there 
> is not a justified hope/explanation that it will probably make things 
> better. But I guess there is.
>
> One short question: Will 4.5 index things differently (compared to 
> 4.4) for documents with fields like I showed earlier? Im basically 
> asking if we need to reindex the 12billion documents again after 
> upgrading to 4.5, or if we ought to be able to deploy 4.5 on top of 
> the already indexed documents.
>
> Regards, Per Steffensen

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

On 11/6/13 11:43 AM, Robert Muir wrote:
> Before lucene 4.5 docvalues were loaded entirely into RAM.
>
> I'm not going to waste time debugging any old code releases here, you
> should upgrade to the latest release!
Ok, thanks!

I do not consider it a bug (just a performance issue), so no debugging 
needed.
It is just that we do not want to spend time upgrading to 4.5 if there 
is not a justified hope/explanation that it will probably make things 
better. But I guess there is.

One short question: Will 4.5 index things differently (compared to 4.4) 
for documents with fields like I showed earlier? I'm basically asking if 
we need to reindex the 12 billion documents again after upgrading to 4.5, 
or if we ought to be able to deploy 4.5 on top of the already indexed 
documents.

Regards, Per Steffensen

Re: DocValue on Strings slow and OOM

Posted by Robert Muir <rc...@gmail.com>.

Before lucene 4.5 docvalues were loaded entirely into RAM.

I'm not going to waste time debugging any old code releases here, you
should upgrade to the latest release!

On Wed, Nov 6, 2013 at 4:58 AM, Per Steffensen <st...@designware.dk> wrote:
> Forget about the quoted comment at the bottom below. It is not true. Both the
> fast/efficient and the slow/memory-consuming query follow the
> getTermCounts-path.
>
> But I have identified another place where they take different paths in the
> code. In SimpleFacets.getTermCounts you will find the code below. I have
> pointed out where the two queries go.
>     if (params.getFieldBool(field, GroupParams.GROUP_FACET, false)) {
>       counts = getGroupedCounts(searcher, docs, field, multiToken,
> offset,limit, mincount, missing, sort, prefix);
>     } else {
>       assert method != null;
>       switch (method) {
>         case ENUM:
>           assert TrieField.getMainValuePrefix(ft) == null;
>           counts = getFacetTermEnumCounts(searcher, docs, field, offset,
> limit, mincount,missing,sort,prefix);
>           break;
>         case FCS:
>           assert !multiToken;
>           if (ft.getNumericType() != null && !sf.multiValued()) {
> *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
>             // force numeric faceting
>             if (prefix != null && !prefix.isEmpty()) {
>               throw new SolrException(ErrorCode.BAD_REQUEST,
> FacetParams.FACET_PREFIX + " is not supported on numeric types");
>             }
>             counts = NumericFacets.getCounts(searcher, docs, field, offset,
> limit, mincount, missing, sort);
>           } else {
>             PerSegmentSingleValuedFaceting ps = new
> PerSegmentSingleValuedFaceting(searcher, docs, field, offset,limit,
> mincount, missing, sort, prefix);
>             Executor executor = threads == 0 ? directExecutor :
> facetExecutor;
>             ps.setNumThreads(threads);
>             counts = ps.getFacetCounts(executor);
>           }
>           break;
>         case FC:
>           if (sf.hasDocValues()) {
> *** ---> The slow/memory-consuming query (facet.field=c_dstr_doc_sto) goes
> here
>             counts = DocValuesFacets.getCounts(searcher, docs, field,
> offset,limit, mincount, missing, sort, prefix);
>           } else if (multiToken || TrieField.getMainValuePrefix(ft) != null)
> {
>             UnInvertedField uif = UnInvertedField.getUnInvertedField(field,
> searcher);
>             counts = uif.getCounts(searcher, docs, offset, limit,
> mincount,missing,sort,prefix);
>           } else {
>             counts = getFieldCacheCounts(searcher, docs, field,
> offset,limit, mincount, missing, sort, prefix);
>           }
>           break;
>         default:
>           throw new AssertionError();
>       }
>     }
>
> I also believe I have found where the huge memory allocation is done. Did a
> memory dump while the slow/memory-consuming c_dstr_doc_sto-query was going
> on (plenty of time to do that - 100+ secs). It seems that a lot of memory is
> allocated under SlowCompositeReaderWrapper.cachedOrdMaps which holds
> HashMaps containing MultiDocValues$OrdinalMaps as values, and those
> MultiDocValues$OrdinalMaps have a field ordDeltas-array of
> MonotonicAppendingLongBuffers ... bla bla ... containing Packed64 containing
> long-arrays.
> See
> https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png
>
> SlowCompositeReaderWrapper and all this memory-allocation does not seem to
> be part of the fast a_dlng_doc_sto-query.
>
> Does this information provide any leads on how to fix
> response-time/memory-consumption issue? Maybe it helps telling if going to
> 4.5 will fix the issue?
>
> Regards, Per Steffensen
>
>
> On 11/5/13 1:47 PM, Per Steffensen wrote:
>
> Looking at threaddumps
>
> It seems like one of the major differences in what is done for
> c_dstr_doc_sto vs a_dlng_doc_sto is in SimpleFacets.getFacetFieldCounts,
> where c_dstr_doc_sto takes the "getTermCounts"-path and a_dlng_doc_sto takes
> the "getListedTermCounts"-path.
>
>             String termList = localParams == null ? null :
> localParams.get(CommonParams.TERMS);
>             if (termList != null) {
>               res.add(key, getListedTermCounts(facetValue, termList));
>             } else {
>               res.add(key, getTermCounts(facetValue));
>             }
>
> getTermCounts seems to do a lot more and to be a lot more complex than
> getListedTermCounts
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

It seems like NumericFacets.getCounts is using the FieldCache. This is 
what we wanted to avoid by using doc-values in the first place, because 
we have so often seen the FieldCache make us go OOM. 
We were told that if we used doc-values the FieldCache would not be 
used. But then again, if those kinds of doc-value queries with 
docValuesFormat="Disk" still use enormous amounts of memory 
(linearly dependent on the number of documents managed by the Solr node), 
it is not worth much anyway compared to FieldCache. And if it makes us 
end up with 100+ sec response times (on billions of documents all in all, 
but only a limited number hit by the query), it is not worth much either.

Will someone please help clarify:
* Will this perform significantly better in 4.5+ (vs 4.4)? Is 100+ 
secs expected for a facet search that hits only 6 documents among 12 
billion in total, when facet.field is set to a field like c_dstr_doc_sto?
* Will doc-values (docValuesFormat="Disk") still use memory that is 
linearly dependent on the total number of documents handled by the 
Solr node when doing facet searches with facet.field set to one of 
those doc-values fields?

Any help is very much appreciated!

Regards, Per Steffensen

On 11/6/13 10:58 AM, Per Steffensen wrote:
>           if (ft.getNumericType() != null && !sf.multiValued()) {
> *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
>             // force numeric faceting
>             if (prefix != null && !prefix.isEmpty()) {
>               throw new SolrException(ErrorCode.BAD_REQUEST, 
> FacetParams.FACET_PREFIX + " is not supported on numeric types");
>             }
>             counts = NumericFacets.getCounts(searcher, docs, field, 
> offset, limit, mincount, missing, sort);
>           } else {

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

Forget about the quoted comment at the bottom below. It is not true. Both 
the fast/efficient and the slow/memory-consuming query follow the 
getTermCounts-path.

But I have identified another place where they take different paths in 
the code. In SimpleFacets.getTermCounts you will find the code below. I 
have pointed out where the two queries go.
     if (params.getFieldBool(field, GroupParams.GROUP_FACET, false)) {
       counts = getGroupedCounts(searcher, docs, field, multiToken, offset, limit, mincount, missing, sort, prefix);
     } else {
       assert method != null;
       switch (method) {
         case ENUM:
           assert TrieField.getMainValuePrefix(ft) == null;
           counts = getFacetTermEnumCounts(searcher, docs, field, offset, limit, mincount, missing, sort, prefix);
           break;
         case FCS:
           assert !multiToken;
           if (ft.getNumericType() != null && !sf.multiValued()) {
             // *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
             // force numeric faceting
             if (prefix != null && !prefix.isEmpty()) {
               throw new SolrException(ErrorCode.BAD_REQUEST, FacetParams.FACET_PREFIX + " is not supported on numeric types");
             }
             counts = NumericFacets.getCounts(searcher, docs, field, offset, limit, mincount, missing, sort);
           } else {
             PerSegmentSingleValuedFaceting ps = new PerSegmentSingleValuedFaceting(searcher, docs, field, offset, limit, mincount, missing, sort, prefix);
             Executor executor = threads == 0 ? directExecutor : facetExecutor;
             ps.setNumThreads(threads);
             counts = ps.getFacetCounts(executor);
           }
           break;
         case FC:
           if (sf.hasDocValues()) {
             // *** ---> The slow/memory-consuming query (facet.field=c_dstr_doc_sto) goes here
             counts = DocValuesFacets.getCounts(searcher, docs, field, offset, limit, mincount, missing, sort, prefix);
           } else if (multiToken || TrieField.getMainValuePrefix(ft) != null) {
             UnInvertedField uif = UnInvertedField.getUnInvertedField(field, searcher);
             counts = uif.getCounts(searcher, docs, offset, limit, mincount, missing, sort, prefix);
           } else {
             counts = getFieldCacheCounts(searcher, docs, field, offset, limit, mincount, missing, sort, prefix);
           }
           break;
         default:
           throw new AssertionError();
       }
     }
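
Boiled down, the branch selection in the excerpt above can be sketched as a small standalone function. This is a simplification for discussion, not Solr code: the names FacetPathSketch, Method, Path and choose are made up, the grouped-facet branch is left out, and how method gets its value for each field is not shown in the excerpt, so the sketch simply takes it as an input.

```java
// Simplified, hypothetical model of the switch in SimpleFacets.getTermCounts
// as quoted above. It only names the implementation each branch delegates to.
final class FacetPathSketch {

    enum Method { ENUM, FCS, FC }

    enum Path {
        TERM_ENUM,                  // getFacetTermEnumCounts
        NUMERIC_FACETS,             // NumericFacets.getCounts
        PER_SEGMENT_SINGLE_VALUED,  // PerSegmentSingleValuedFaceting
        DOC_VALUES_FACETS,          // DocValuesFacets.getCounts
        UNINVERTED_FIELD,           // UnInvertedField.getCounts
        FIELD_CACHE                 // getFieldCacheCounts
    }

    static Path choose(Method method, boolean numeric, boolean multiValued,
                       boolean hasDocValues, boolean multiTokenOrTriePrefix) {
        switch (method) {
            case ENUM:
                return Path.TERM_ENUM;
            case FCS:
                return (numeric && !multiValued)
                        ? Path.NUMERIC_FACETS
                        : Path.PER_SEGMENT_SINGLE_VALUED;
            case FC:
                if (hasDocValues) {
                    return Path.DOC_VALUES_FACETS;
                }
                return multiTokenOrTriePrefix ? Path.UNINVERTED_FIELD : Path.FIELD_CACHE;
            default:
                throw new AssertionError(method);
        }
    }

    public static void main(String[] args) {
        // a_dlng_doc_sto: single-valued long with docValues, observed under FCS
        System.out.println(choose(Method.FCS, true, false, true, false));
        // c_dstr_doc_sto: single-valued string with docValues, observed under FC
        System.out.println(choose(Method.FC, false, false, true, false));
    }
}
```

The point of the sketch is that hasDocValues is only consulted in the FC branch, so two docValues fields can still end up in entirely different implementations.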

I also believe I have found where the huge memory allocation is done. 
I did a memory dump while the slow/memory-consuming c_dstr_doc_sto-query 
was running (plenty of time to do that: 100+ secs). It seems that a lot 
of memory is allocated under SlowCompositeReaderWrapper.cachedOrdMaps, 
which holds HashMaps containing MultiDocValues$OrdinalMaps as values. 
Those MultiDocValues$OrdinalMaps have an ordDeltas field, an array of 
MonotonicAppendingLongBuffers ... bla bla ... containing Packed64 
containing long-arrays.
See 
https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png

SlowCompositeReaderWrapper and all this memory-allocation does not seem 
to be part of the fast a_dlng_doc_sto-query.
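
My reading of the dump, stated as an assumption rather than a fact about the Lucene implementation, is that these structures map each segment's term ordinals to ordinals in one merged, index-wide term dictionary. A toy version of such a mapping is sketched below; the class name and the plain int arrays are made up for illustration, and the real classes use packed buffers instead.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Toy model of a segment-ordinal to global-ordinal mapping. Hypothetical
// illustration only; not how Lucene stores it.
final class GlobalOrdinalSketch {

    // Input: one sorted, duplicate-free term array per segment.
    // Output: map[segment][segmentOrd] = ordinal in the merged dictionary.
    static int[][] buildOrdinalMap(String[][] sortedTermsPerSegment) {
        TreeSet<String> merged = new TreeSet<>();
        for (String[] segmentTerms : sortedTermsPerSegment) {
            merged.addAll(Arrays.asList(segmentTerms));
        }
        String[] global = merged.toArray(new String[0]);

        int[][] map = new int[sortedTermsPerSegment.length][];
        for (int s = 0; s < sortedTermsPerSegment.length; s++) {
            String[] segmentTerms = sortedTermsPerSegment[s];
            map[s] = new int[segmentTerms.length];
            for (int ord = 0; ord < segmentTerms.length; ord++) {
                map[s][ord] = Arrays.binarySearch(global, segmentTerms[ord]);
            }
        }
        return map;
    }

    // Number of mapping entries: one per (segment, distinct term) pair.
    static long entryCount(int[][] map) {
        long n = 0;
        for (int[] perSegment : map) {
            n += perSegment.length;
        }
        return n;
    }

    public static void main(String[] args) {
        String[][] segments = {
            {"apple", "cherry"},
            {"banana", "cherry", "date"}
        };
        int[][] map = buildOrdinalMap(segments);
        System.out.println(Arrays.toString(map[0]));
        System.out.println(Arrays.toString(map[1]));
        System.out.println("entries: " + entryCount(map));
    }
}
```

If that reading is right, the size of the mapping depends on the number of distinct terms per segment and not on how many documents the query hits, which would fit a cost that shows up even for a 6-document result.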

Does this information provide any leads on how to fix the 
response-time/memory-consumption issue? Maybe it helps tell whether 
going to 4.5 will fix the issue?

Regards, Per Steffensen

On 11/5/13 1:47 PM, Per Steffensen wrote:
> Looking at threaddumps
>
> It seems like one of the major differences in what is done for 
> c_dstr_doc_sto vs a_dlng_doc_sto is in 
> SimpleFacets.getFacetFieldCounts, where c_dstr_doc_sto takes the 
> "getTermCounts"-path and a_dlng_doc_sto takes the 
> "getListedTermCounts"-path.
>
>             String termList = localParams == null ? null : 
> localParams.get(CommonParams.TERMS);
>             if (termList != null) {
>               res.add(key, getListedTermCounts(facetValue, termList));
>             } else {
>               res.add(key, getTermCounts(facetValue));
>             }
>
> getTermCounts seems to do a lot more and to be a lot more complex than 
> getListedTermCounts

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

Looking at thread dumps:

It seems like one of the major differences in what is done for 
c_dstr_doc_sto vs a_dlng_doc_sto is in SimpleFacets.getFacetFieldCounts, 
where c_dstr_doc_sto takes the "getTermCounts"-path and a_dlng_doc_sto 
takes the "getListedTermCounts"-path.

             String termList = localParams == null ? null : localParams.get(CommonParams.TERMS);
             if (termList != null) {
               res.add(key, getListedTermCounts(facetValue, termList));
             } else {
               res.add(key, getTermCounts(facetValue));
             }

getTermCounts seems to do a lot more and to be a lot more complex than 
getListedTermCounts.

On 11/5/13 11:47 AM, Per Steffensen wrote:
> Hi
>
> We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small" 
> documents loaded. The documents have the following fields
> * a_dlng_doc_sto
> * b_dlng_doc_sto
> * c_dstr_doc_sto
> * timestamp_lng_ind_sto
> * d_lng_ind_sto
> From schema.xml
>     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" 
> stored="true" required="true" docValues="true"/>
>     <dynamicField name="*_lng_ind_sto" type="long" indexed="true" 
> stored="true"/>
>     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" 
> stored="true" required="true" docValues="true"/>
> ...
>     <fieldType name="dstring" class="solr.StrField" 
> sortMissingLast="true" docValuesFormat="Disk"/>
>     <fieldType name="dlng" class="solr.TrieLongField" 
> precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>
>
> We execute queries on the following format:
> * q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
> * facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
>
> F.ex executing a query with values for x, y, a, b ... and n that hits 
> only 6 documents (out of the 12billion) total
> * With <field>=a_dlng_doc_sto (long docvalue) the query responds 
> fairly quick (< 2 sec)
> * With <field>=c_dstr_doc_sto (string docvalue) the query responds 
> very slowly (> 100 sec) and only if we give the Solr-nodes a lot of 
> Xmx. If Xmx is too low we experience OOM on involved Solr-nodes and 
> never see a response
> c_dstr_doc_sto strings are all about 10-15 chars, so it is not very 
> long strings
>
> Is it a known issue that there is such a big difference between facet 
> searches on longs and strings? And that memory usage seems to very 
> different, also?
> If yes, has it been optimized after 4.4.0?
>
> Regards, Per Steffensen

Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

Thanks for all the help, guys!

Just to clarify: everything is working functionality-wise, and we have 
tests showing that.

It is just that two similar queries (hitting the same number of rows, 
only 6 among 12 billion in this example, and resulting in the same 
number of facet-groups etc.) perform very differently depending 
on the type of the facet.field. It is fast (< 2 secs) and efficient when 
the facet.field is
     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" 
stored="true" required="true" docValues="true"/>
     <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" 
positionIncrementGap="0" docValuesFormat="Disk"/>
But it is very slow (> 100 secs) and memory-consuming (eating GBs) when 
the facet.field is
     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" 
stored="true" required="true" docValues="true"/>
     <fieldType name="dstring" class="solr.StrField" 
sortMissingLast="true" docValuesFormat="Disk"/>

We use docValuesFormat="Disk" because we have so much data that 
everything will never fit in memory. Are you saying that this does not 
work before 4.5? But how does that explain the huge difference in 
response time and memory consumption? I guess that if it does not work 
in 4.4, it does not work for either of the types?
Just a side question: we never have more than one value per field. Would 
we benefit from adding multiValued=false to our field declarations?

Regards, Per Steffensen

On 11/5/13 11:44 PM, Shawn Heisey wrote:
> On 11/5/2013 11:56 AM, Erick Erickson wrote:
>> Hmmm, what I'm referring to is this bit:
>>
>> <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>
>>
>> The docValuesFormat="Disk" bit isn't supported until 4.5, which 
>> doesn't seem clear in either place. Feel free to disagree of course :).
>>
>>
>
>
> I'm pretty sure that the disk format was supported from 4.2, when 
> docvalues first came to Solr.  Not sure about earlier.  Here's someone 
> with it working on 4.2.1:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201304.mbox/%3C51766344.5060706@gmail.com%3E 
>
>
> Something that wasn't supported that far back (and as far as I know 
> still isn't supported) is upgrading Solr with an existing index that 
> uses the disk format.
>
> Thanks,
> Shawn
>
>



Re: DocValue on Strings slow and OOM

Posted by Shawn Heisey <so...@elyograg.org>.

On 11/5/2013 11:56 AM, Erick Erickson wrote:
> Hmmm, what I'm referring to is this bit:
>
> <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>
>
> The docValuesFormat="Disk" bit isn't supported until 4.5, which 
> doesn't seem clear in either place. Feel free to disagree of course :).
>
>

I'm pretty sure that the disk format was supported from 4.2, when 
docvalues first came to Solr.  Not sure about earlier.  Here's someone 
with it working on 4.2.1:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201304.mbox/%3C51766344.5060706@gmail.com%3E

Something that wasn't supported that far back (and as far as I know 
still isn't supported) is upgrading Solr with an existing index that 
uses the disk format.

Thanks,
Shawn


Re: DocValue on Strings slow and OOM

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, what I'm referring to is this bit:

<fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>

The docValuesFormat="Disk" bit isn't supported until 4.5, which doesn't
seem clear in either place. Feel free to disagree of course :).


On Tue, Nov 5, 2013 at 11:43 AM, Cassandra Targett <ca...@gmail.com> wrote:

> On Tue, Nov 5, 2013 at 3:27 PM, Erick Erickson <er...@gmail.com> wrote:
> > Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
> > about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
> > kind of looks like you can do that with 4.2....
> >
>
> It's in the Solr Ref Guide:
> https://cwiki.apache.org/confluence/display/solr/DocValues, fixed for
> 4.5

Re: DocValue on Strings slow and OOM

Posted by Cassandra Targett <ca...@gmail.com>.

On Tue, Nov 5, 2013 at 3:27 PM, Erick Erickson <er...@gmail.com> wrote:
> Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
> about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
> kind of looks like you can do that with 4.2....
>

It's in the Solr Ref Guide:
https://cwiki.apache.org/confluence/display/solr/DocValues, fixed for
4.5


Re: DocValue on Strings slow and OOM

Posted by Erick Erickson <er...@gmail.com>.

Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
kind of looks like you can do that with 4.2....

Or am I off base here? I'm going from CHANGES.txt about LUCENE-5124

Erick


On Tue, Nov 5, 2013 at 9:46 AM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Nov 5, 2013 at 9:42 AM, Per Steffensen <st...@designware.dk> wrote:
> > On 11/5/13 3:30 PM, Robert Muir wrote:
> >>
> >> If you are querying on a field, you should index it!
> >
> > Believe I do that. Query looks like this "timestamp_lng_ind_sto:[x TO y] AND
> > d_lng_ind_sto:(a OR b OR ... OR n)" and both "timestamp_lng_ind_sto" and
> > "d_lng_ind_sto" are indexed.
> > Please elaborate!
> >
>
> solr faceting often runs queries behind the scenes. please, only turn
> off indexed=true if you are really really sure you do not need it.
>
> and use 4.5.0 if you have memory concerns.

Re: DocValue on Strings slow and OOM

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Nov 5, 2013 at 9:42 AM, Per Steffensen <st...@designware.dk> wrote:
> On 11/5/13 3:30 PM, Robert Muir wrote:
>>
>> If you are querying on a field, you should index it!
>
> Believe I do that. Query looks like this "timestamp_lng_ind_sto:[x TO y] AND
> d_lng_ind_sto:(a OR b OR ... OR n)" and both "timestamp_lng_ind_sto" and
> "d_lng_ind_sto" are indexed.
> Please elaborate!
>

Solr faceting often runs queries behind the scenes. Please only turn
off indexed=true if you are really, really sure you do not need it.

And use 4.5.0 if you have memory concerns.
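
A rough sketch of what that advice could look like in schema.xml, assuming
the dynamic field definitions from the original post stay otherwise
unchanged: the two docValues fields used for faceting keep docValues="true"
and get indexed="true" as well. Whether this removes the slowdown on 4.4.0
is not confirmed anywhere in this thread.

     <!-- Assumption: same definitions as in the original post,
          with only indexed flipped from "false" to "true" -->
     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="true"
                   stored="true" required="true" docValues="true"/>
     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="true"
                   stored="true" required="true" docValues="true"/>

Changing indexed on an existing field only affects documents added
afterwards, so documents already in the index would have to be reindexed
for the change to apply to them.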


Re: DocValue on Strings slow and OOM

Posted by Per Steffensen <st...@designware.dk>.

On 11/5/13 3:30 PM, Robert Muir wrote:
> If you are querying on a field, you should index it!

I believe I do that. The query looks like this: "timestamp_lng_ind_sto:[x TO y] 
AND d_lng_ind_sto:(a OR b OR ... OR n)", and both "timestamp_lng_ind_sto" 
and "d_lng_ind_sto" are indexed.
Please elaborate!

I facet/group on fields that are indexed=false and docValues=true, but 
that is the case for both of the facet.fields "a_dlng_doc_sto" and 
"c_dstr_doc_sto", so that shouldn't explain the big difference between 
faceting on the long field and faceting on the string field.




Re: DocValue on Strings slow and OOM

Posted by Robert Muir <rc...@gmail.com>.

If you are querying on a field, you should index it!

On Tue, Nov 5, 2013 at 5:47 AM, Per Steffensen <st...@designware.dk> wrote:
> Hi
>
> We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small" documents
> loaded. The documents have the following fields:
> * a_dlng_doc_sto
> * b_dlng_doc_sto
> * c_dstr_doc_sto
> * timestamp_lng_ind_sto
> * d_lng_ind_sto
> From schema.xml
>     <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false"
> stored="true" required="true" docValues="true"/>
>     <dynamicField name="*_lng_ind_sto" type="long" indexed="true"
> stored="true"/>
>     <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
> stored="true" required="true" docValues="true"/>
> ...
>     <fieldType name="dstring" class="solr.StrField" sortMissingLast="true"
> docValuesFormat="Disk"/>
>     <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
> positionIncrementGap="0" docValuesFormat="Disk"/>
>
> We execute queries on the following format:
> * q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
> * facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
>
> For example, executing a query with values for x, y, a, b ... and n that hits
> only 6 documents (out of the 12 billion) in total:
> * With <field>=a_dlng_doc_sto (long docvalue) the query responds fairly
> quickly (< 2 sec)
> * With <field>=c_dstr_doc_sto (string docvalue) the query responds very
> slowly (> 100 sec) and only if we give the Solr-nodes a lot of Xmx. If Xmx
> is too low we experience OOM on involved Solr-nodes and never see a response
> c_dstr_doc_sto strings are all about 10-15 chars, so they are not very long
> strings
>
> Is it a known issue that there is such a big difference between facet
> searches on longs and strings? And that memory usage seems to be very
> different, also?
> If yes, has it been optimized after 4.4.0?
>
> Regards, Per Steffensen
