Posted to dev@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2013/11/05 11:47:35 UTC
DocValue on Strings slow and OOM
Hi
We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small"
documents loaded. The documents have the following fields
* a_dlng_doc_sto
* b_dlng_doc_sto
* c_dstr_doc_sto
* timestamp_lng_ind_sto
* d_lng_ind_sto
From schema.xml
<dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false"
stored="true" required="true" docValues="true"/>
<dynamicField name="*_lng_ind_sto" type="long" indexed="true"
stored="true"/>
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
stored="true" required="true" docValues="true"/>
...
<fieldType name="dstring" class="solr.StrField"
sortMissingLast="true" docValuesFormat="Disk"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0" docValuesFormat="Disk"/>
We execute queries of the following form:
* q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
* facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
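For readers who want to reproduce the request shape, here is a minimal sketch of how the two parameter groups above could be assembled. The class and method names (FacetQuerySketch, buildQ, buildParams) are hypothetical and not part of Solr or SolrJ; only the field names and facet parameters come from the message.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.StringJoiner;

public class FacetQuerySketch {
    // Builds the q value: a timestamp range AND a disjunction of d-values.
    static String buildQ(long from, long to, long... dValues) {
        StringJoiner ors = new StringJoiner(" OR ", "(", ")");
        for (long d : dValues) {
            ors.add(Long.toString(d));
        }
        return "timestamp_lng_ind_sto:[" + from + " TO " + to
                + "] AND d_lng_ind_sto:" + ors;
    }

    // Builds the full parameter string; the q value is URL-encoded.
    static String buildParams(String facetField, long from, long to, long... dValues) {
        try {
            String q = URLEncoder.encode(buildQ(from, to, dValues), "UTF-8");
            return "q=" + q + "&facet=true&facet.field=" + facetField
                    + "&facet.zeros=false&facet.mincount=1";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQ(100L, 200L, 1L, 2L, 3L));
        System.out.println(buildParams("c_dstr_doc_sto", 100L, 200L, 1L, 2L, 3L));
    }
}
```

Swapping the first argument of buildParams between a_dlng_doc_sto and c_dstr_doc_sto gives the two requests compared below.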
For example, executing a query with values for x, y, a, b ... and n that hits
only 6 documents in total (out of the 12 billion)
* With <field>=a_dlng_doc_sto (long docvalue) the query responds fairly
quickly (< 2 sec)
* With <field>=c_dstr_doc_sto (string docvalue) the query responds very
slowly (> 100 sec) and only if we give the Solr-nodes a lot of Xmx. If
Xmx is too low we experience OOM on the involved Solr-nodes and never see a
response
The c_dstr_doc_sto strings are all about 10-15 chars, so they are not very
long strings
Is it a known issue that there is such a big difference between facet
searches on longs and strings? And that memory usage seems to be very
different as well?
If yes, has it been optimized after 4.4.0?
Regards, Per Steffensen
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
Please note that this problem is no longer relevant for us: we will
change our c-field from being of type string (docValue) to being of type
long (docValue). Faceting on huge numbers of long docValues seems to
perform very well - except for
https://issues.apache.org/jira/browse/SOLR-5444, but we have handled
that now.
I would still like to help verify that the string-faceting problem this
thread has been about has been fixed in 4.5.1 - that things perform
better with no huge memory usage. In order to do that I would really
like to be able to deploy 4.5.1 on top of my 12 billion documents
indexed with 4.4.0. Can anyone confirm that I ought to be able to do
that? I tried briefly but ran into problems. When trying to start Solr
it says
[2013-11-08 17:45:48,829]ERROR [coreLoadExecutor-4-thread-19] [logid: ] - org.apache.solr.common.SolrException.log(SolrException.java:119) -null:org.apache.solr.common.SolrException: Unable to create core: mycoll_shard13_replica1
at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:934)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:566)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:625)
at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:256)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:555)
... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1477)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1589)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:821)
... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format: 12, input=MMapIndexInput(path="/usr/lib/solr/data/mycoll_shard13_replica1/data/index/_1k63_Disk_0.dvdm")
at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readNumericEntry(Lucene45DocValuesProducer.java:207)
at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.readFields(Lucene45DocValuesProducer.java:120)
at org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer.<init>(Lucene45DocValuesProducer.java:85)
at org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.<init>(DiskDocValuesProducer.java:31)
at org.apache.lucene.codecs.diskdv.DiskDocValuesFormat.fieldsProducer(DiskDocValuesFormat.java:56)
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:215)
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:300)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:140)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
at org.apache.lucene.index.ReadersAndLiveDocs.getReadOnlyClone(ReadersAndLiveDocs.java:217)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:379)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:41)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1443)
... 15 more
Besides that, see comments below
On 11/14/13 7:54 PM, Joel Bernstein wrote:
> Per,
>
> As you are seeing there are different implementations for calculating
> facets for numeric fields and string fields. The numeric fields I
> believe are using an int-to-int or long-to-int hashmap to hold the
> facet counts. This map grows as values are added to it. The String
> version uses an int array the size of the number of distinct values in
> the field to hold the facet counts. So if you have a very large number
> of distinct values in the field, you'll have a very large array.
Do not think this part is a problem
> Also the distinct values themselves are held in memory in the
> fieldCache for string fields.
Yes, that is probably a problem
Also note
https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png
and my comments on it in a mail earlier in this thread.
>
> So, basically, as you are seeing, you'll take up a much larger memory
> footprint when faceting on a high cardinality string field than
> on a high cardinality numeric field.
>
> There are docvalues faceting implementations that will kick-in on a
> field that has docvalues. You can try setting the on disk flag
Believe I did that for my string field "c_dstr_doc_sto"?
From schema.xml
<dynamicField name="*_dstr_doc_sto" type="dstring"
indexed="false" stored="true" required="true" docValues="true"/>
<dynamicField name="*_lng_ind_sto" type="long" indexed="true"
stored="true"/>
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
stored="true" required="true" docValues="true"/>
...
<fieldType name="dstring" class="solr.StrField"
sortMissingLast="true" docValuesFormat="Disk"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0" docValuesFormat="Disk"/>
Did I miss something?
> and this will test memory and performance.
>
> Joel
>
> On Thu, Nov 14, 2013 at 8:13 AM, Per Steffensen <steff@designware.dk
> <ma...@designware.dk>> wrote:
>
> If anyone is following this one, just an update. We are not going
> to upgrade to 4.5.1 in order to see if the String facet
> performance problem has been fixed. Instead we have made a few
> hacks around our data so that we can store the c-field
> (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So now we only
> need to struggle with long-facet performance. There is a
> performance issue with facets on longs though, but I will tell
> about in another mailing-thread - need your input on what solution
> you prefer.
>
https://issues.apache.org/jira/browse/SOLR-5444
>
>
> Regards, Per Steffensen
>
Re: DocValue on Strings slow and OOM
Posted by Joel Bernstein <jo...@gmail.com>.
Per,
As you are seeing there are different implementations for calculating
facets for numeric fields and string fields. The numeric fields I believe
are using an int-to-int or long-to-int hashmap to hold the facet counts.
This map grows as values are added to it. The String version uses an int
array the size of the number of distinct values in the field to hold the
facet counts. So if you have a very large number of distinct values in the
field, you'll have a very large array. Also the distinct values themselves
are held in memory in the fieldCache for string fields.
So, basically, as you are seeing, you'll take up a much larger memory
footprint when faceting on a high cardinality string field than on a
high cardinality numeric field.
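The difference between the two counting strategies described above can be sketched as follows. This is a simplified illustration, not Solr's actual code: the class and method names are hypothetical, and it assumes the string path works on per-document ordinals into the field's sorted set of distinct values.

```java
import java.util.HashMap;
import java.util.Map;

public class FacetCountingSketch {
    // Numeric-style counting: the map only grows with the values actually
    // seen in the matching documents.
    static Map<Long, Integer> countWithMap(long[] valuesOfMatchingDocs) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long v : valuesOfMatchingDocs) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // String-style counting: one int slot per distinct value (ordinal) in the
    // whole field, no matter how few documents match.
    static int[] countWithOrdinalArray(int[] ordsOfMatchingDocs, int distinctValuesInField) {
        int[] counts = new int[distinctValuesInField];
        for (int ord : ordsOfMatchingDocs) {
            counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, Integer> byMap = countWithMap(new long[] {7L, 7L, 42L});
        int[] byArray = countWithOrdinalArray(new int[] {3, 3, 9}, 1_000_000);
        System.out.println(byMap.size() + " map entries vs " + byArray.length + " array slots");
    }
}
```

With three matching documents the map holds two entries, while the array is sized by the field's cardinality (one million in this made-up example) regardless of the hit count.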
There are docvalues faceting implementations that will kick-in on a field
that has docvalues. You can try setting the on disk flag and this will test
memory and performance.
Joel
On Thu, Nov 14, 2013 at 8:13 AM, Per Steffensen <st...@designware.dk> wrote:
> If anyone is following this one, just an update. We are not going to
> upgrade to 4.5.1 in order to see if the String facet performance problem
> has been fixed. Instead we have made a few hacks around our data so that we
> can store the c-field (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So
> now we only need to struggle with long-facet performance. There is a
> performance issue with facets on longs though, but I will tell about in
> another mailing-thread - need your input on what solution you prefer.
>
> Regards, Per Steffensen
>
>
> On 11/6/13 12:15 PM, Per Steffensen wrote:
>
> On 11/6/13 11:43 AM, Robert Muir wrote:
>
> Before lucene 4.5 docvalues were loaded entirely into RAM.
>
> I'm not going to waste time debugging any old code releases here, you
> should upgrade to the latest release!
>
> Ok, thanks!
>
> I do not consider it a bug (just a performance issue), so no debugging
> needed.
> It is just that we do not want to spend time upgrading to 4.5 if there is
> not a justified hope/explanation that it will probably make things
> better. But I guess there is.
>
> One short question: Will 4.5 index things differently (compared to 4.4)
> for documents with fields like I showed earlier? I'm basically asking if we
> need to reindex the 12 billion documents again after upgrading to 4.5, or if
> we ought to be able to deploy 4.5 on top of the already indexed documents.
>
> Regards, Per Steffensen
>
>
>
--
Joel Bernstein
Search Engineer at Heliosearch
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
If anyone is following this one, just an update. We are not going to
upgrade to 4.5.1 in order to see if the String facet performance problem
has been fixed. Instead we have made a few hacks around our data so that
we can store the c-field (c_dstr_doc_sto) as long instead
(c_dlng_doc_sto). So now we only need to struggle with long-facet
performance. There is a performance issue with facets on longs too,
but I will describe it in another mailing-thread - I need your input on
which solution you prefer.
Regards, Per Steffensen
On 11/6/13 12:15 PM, Per Steffensen wrote:
> On 11/6/13 11:43 AM, Robert Muir wrote:
>> Before lucene 4.5 docvalues were loaded entirely into RAM.
>>
>> I'm not going to waste time debugging any old code releases here, you
>> should upgrade to the latest release!
> Ok, thanks!
>
> I do not consider it a bug (just a performance issue), so no debugging
> needed.
> It is just that we do not want to spend time upgrading to 4.5 if there
> is not a justified hope/explanation that it will probably make things
> better. But I guess there is.
>
> One short question: Will 4.5 index things differently (compared to
> 4.4) for documents with fields like I showed earlier? I'm basically
> asking if we need to reindex the 12 billion documents again after
> upgrading to 4.5, or if we ought to be able to deploy 4.5 on top of
> the already indexed documents.
>
> Regards, Per Steffensen
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
On 11/6/13 11:43 AM, Robert Muir wrote:
> Before lucene 4.5 docvalues were loaded entirely into RAM.
>
> I'm not going to waste time debugging any old code releases here, you
> should upgrade to the latest release!
Ok, thanks!
I do not consider it a bug (just a performance issue), so no debugging
needed.
It is just that we do not want to spend time upgrading to 4.5 if there
is not a justified hope/explanation that it will probably make things
better. But I guess there is.
One short question: Will 4.5 index things differently (compared to 4.4)
for documents with fields like I showed earlier? I'm basically asking if
we need to reindex the 12 billion documents again after upgrading to 4.5,
or if we ought to be able to deploy 4.5 on top of the already indexed
documents.
Regards, Per Steffensen
Re: DocValue on Strings slow and OOM
Posted by Robert Muir <rc...@gmail.com>.
Before lucene 4.5 docvalues were loaded entirely into RAM.
I'm not going to waste time debugging any old code releases here, you
should upgrade to the latest release!
On Wed, Nov 6, 2013 at 4:58 AM, Per Steffensen <st...@designware.dk> wrote:
> Forget about the quoted comment at the bottom below. It is not true. Both the
> fast/efficient and the slow/memory-consuming query follow the
> getTermCounts-path.
>
> But I have identified another place where they take different paths in the
> code. In SimpleFacets.getTermCounts you will find the code below. I have
> pointed out where the two queries go.
> if (params.getFieldBool(field, GroupParams.GROUP_FACET, false)) {
> counts = getGroupedCounts(searcher, docs, field, multiToken,
> offset,limit, mincount, missing, sort, prefix);
> } else {
> assert method != null;
> switch (method) {
> case ENUM:
> assert TrieField.getMainValuePrefix(ft) == null;
> counts = getFacetTermEnumCounts(searcher, docs, field, offset,
> limit, mincount,missing,sort,prefix);
> break;
> case FCS:
> assert !multiToken;
> if (ft.getNumericType() != null && !sf.multiValued()) {
> *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
> // force numeric faceting
> if (prefix != null && !prefix.isEmpty()) {
> throw new SolrException(ErrorCode.BAD_REQUEST,
> FacetParams.FACET_PREFIX + " is not supported on numeric types");
> }
> counts = NumericFacets.getCounts(searcher, docs, field, offset,
> limit, mincount, missing, sort);
> } else {
> PerSegmentSingleValuedFaceting ps = new
> PerSegmentSingleValuedFaceting(searcher, docs, field, offset,limit,
> mincount, missing, sort, prefix);
> Executor executor = threads == 0 ? directExecutor :
> facetExecutor;
> ps.setNumThreads(threads);
> counts = ps.getFacetCounts(executor);
> }
> break;
> case FC:
> if (sf.hasDocValues()) {
> *** ---> The slow/memory-consuming query (facet.field=c_dstr_doc_sto) goes
> here
> counts = DocValuesFacets.getCounts(searcher, docs, field,
> offset,limit, mincount, missing, sort, prefix);
> } else if (multiToken || TrieField.getMainValuePrefix(ft) != null)
> {
> UnInvertedField uif = UnInvertedField.getUnInvertedField(field,
> searcher);
> counts = uif.getCounts(searcher, docs, offset, limit,
> mincount,missing,sort,prefix);
> } else {
> counts = getFieldCacheCounts(searcher, docs, field,
> offset,limit, mincount, missing, sort, prefix);
> }
> break;
> default:
> throw new AssertionError();
> }
> }
>
> I also believe I have found where the huge memory allocation is done. Did a
> memory dump while the slow/memory-consuming c_dstr_doc_sto-query was going
> on (plenty of time to do that - 100+ secs). It seems that a lot of memory is
> allocated under SlowCompositeReaderWrapper.cachedOrdMaps which holds
> HashMaps containing MultiDocValues$OrdinalMaps as values, and those
> MultiDocValues$OrdinalMaps have a field ordDeltas-array of
> MonotonicAppendingLongBuffers ... bla bla ... containing Packed64 containing
> long-arrays.
> See
> https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png
>
> SlowCompositeReaderWrapper and all this memory-allocation does not seem to
> be part of the fast a_dlng_doc_sto-query.
>
> Does this information provide any leads on how to fix
> response-time/memory-consumption issue? Maybe it helps telling if going to
> 4.5 will fix the issue?
>
> Regards, Per Steffensen
>
>
> On 11/5/13 1:47 PM, Per Steffensen wrote:
>
> Looking at threaddumps
>
> It seems like one of the major differences in what is done for
> c_dstr_doc_sto vs a_dlng_doc_sto is in SimpleFacets.getFacetFieldCounts,
> where c_dstr_doc_sto takes the "getTermCounts"-path and a_dlng_doc_sto takes
> the "getListedTermCounts"-path.
>
> String termList = localParams == null ? null :
> localParams.get(CommonParams.TERMS);
> if (termList != null) {
> res.add(key, getListedTermCounts(facetValue, termList));
> } else {
> res.add(key, getTermCounts(facetValue));
> }
>
> getTermCounts seems to do a lot more and to be a lot more complex than
> getListedTermCounts
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
It seems like NumericFacets.getCounts is using the FieldCache. This is
what we wanted to avoid by using doc-values in the first place - because
we have experienced so many times that the FieldCache makes us go OOM.
We were told that if we used doc-values the FieldCache would not be
used. But then again, if those kinds of doc-value queries with
docValuesFormat="Disk" still use enormous amounts of memory
(linearly dependent on the number of documents managed by the Solr-node),
it is not worth much anyway compared to the FieldCache. And if it makes
us end up with 100+ secs response-times (on billions of documents all in
all, but only a limited number hit by the query), it is not worth much
either.
Will someone please help clarify
* Will this perform significantly better in 4.5+ (vs 4.4)? Is 100+
secs expected for a facet search that hits only 6 documents among 12
billion in total, when facet.field is set to a field like c_dstr_doc_sto?
* Will doc-values (docValuesFormat="Disk") still use memory that is
linearly dependent on the total number of documents handled by the
Solr-node, when doing facet searches with facet.field set to one of
those doc-values fields?
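To make "linearly dependent" concrete, here is a back-of-the-envelope sketch. It rests on two assumptions that are not stated in this thread: that the 12 billion documents are spread evenly over the 6 nodes, and (as a hypothetical worst case) that every document holds a different string, so a structure with one int per distinct value has one int per document. The class and method names are made up for illustration.

```java
public class FacetMemoryEstimate {
    // Documents per node, assuming an even spread across nodes (an assumption).
    static long docsPerNode(long totalDocs, int nodes) {
        return totalDocs / nodes;
    }

    // Bytes for one int counter per distinct value (array header ignored).
    static long countsArrayBytes(long distinctValues) {
        return distinctValues * Integer.BYTES;
    }

    public static void main(String[] args) {
        long perNode = docsPerNode(12_000_000_000L, 6);
        // Hypothetical worst case: every document holds a different string.
        System.out.println(perNode + " docs per node");
        System.out.println(countsArrayBytes(perNode) + " bytes of counters");
    }
}
```

Under those assumptions each node would need roughly 8 GB for the counters alone, before any term dictionaries or ordinal mappings; the real figure depends on the actual cardinality of the field.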
Any help is very appreciated!
Regards, Per Steffensen
On 11/6/13 10:58 AM, Per Steffensen wrote:
> if (ft.getNumericType() != null && !sf.multiValued()) {
> *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
> // force numeric faceting
> if (prefix != null && !prefix.isEmpty()) {
> throw new SolrException(ErrorCode.BAD_REQUEST,
> FacetParams.FACET_PREFIX + " is not supported on numeric types");
> }
> counts = NumericFacets.getCounts(searcher, docs, field,
> offset, limit, mincount, missing, sort);
> } else {
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
Forget about the quoted comment at the bottom below. It is not true. Both
the fast/efficient and the slow/memory-consuming query follow the
getTermCounts-path.
But I have identified another place where they take different paths in
the code. In SimpleFacets.getTermCounts you will find the code below. I
have pointed out where the two queries go.
if (params.getFieldBool(field, GroupParams.GROUP_FACET, false)) {
  counts = getGroupedCounts(searcher, docs, field, multiToken,
      offset, limit, mincount, missing, sort, prefix);
} else {
  assert method != null;
  switch (method) {
    case ENUM:
      assert TrieField.getMainValuePrefix(ft) == null;
      counts = getFacetTermEnumCounts(searcher, docs, field,
          offset, limit, mincount, missing, sort, prefix);
      break;
    case FCS:
      assert !multiToken;
      if (ft.getNumericType() != null && !sf.multiValued()) {
        // *** ---> The fast/efficient query (facet.field=a_dlng_doc_sto) goes here
        // force numeric faceting
        if (prefix != null && !prefix.isEmpty()) {
          throw new SolrException(ErrorCode.BAD_REQUEST,
              FacetParams.FACET_PREFIX + " is not supported on numeric types");
        }
        counts = NumericFacets.getCounts(searcher, docs, field,
            offset, limit, mincount, missing, sort);
      } else {
        PerSegmentSingleValuedFaceting ps = new PerSegmentSingleValuedFaceting(
            searcher, docs, field, offset, limit, mincount, missing, sort, prefix);
        Executor executor = threads == 0 ? directExecutor : facetExecutor;
        ps.setNumThreads(threads);
        counts = ps.getFacetCounts(executor);
      }
      break;
    case FC:
      if (sf.hasDocValues()) {
        // *** ---> The slow/memory-consuming query (facet.field=c_dstr_doc_sto) goes here
        counts = DocValuesFacets.getCounts(searcher, docs, field,
            offset, limit, mincount, missing, sort, prefix);
      } else if (multiToken || TrieField.getMainValuePrefix(ft) != null) {
        UnInvertedField uif = UnInvertedField.getUnInvertedField(field, searcher);
        counts = uif.getCounts(searcher, docs, offset, limit,
            mincount, missing, sort, prefix);
      } else {
        counts = getFieldCacheCounts(searcher, docs, field,
            offset, limit, mincount, missing, sort, prefix);
      }
      break;
    default:
      throw new AssertionError();
  }
}
I also believe I have found where the huge memory allocation is done.
I did a memory dump while the slow/memory-consuming c_dstr_doc_sto-query
was going on (plenty of time to do that - 100+ secs). It seems that a lot
of memory is allocated under SlowCompositeReaderWrapper.cachedOrdMaps,
which holds HashMaps containing MultiDocValues$OrdinalMaps as values,
and those MultiDocValues$OrdinalMaps have an ordDeltas field, an array of
MonotonicAppendingLongBuffers ... bla bla ... containing Packed64
containing long-arrays.
See
https://dl.dropboxusercontent.com/u/25718039/mem-dump-while-searching-on-facet.field-c_dstr_doc_sto.png
SlowCompositeReaderWrapper and all this memory-allocation does not seem
to be part of the fast a_dlng_doc_sto-query.
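Why a mapping like the one in the dump costs memory independent of the number of hits can be illustrated with a simplified sketch of a segment-to-global ordinal map. This is an assumption-laden toy, not Lucene's OrdinalMap (which, per the dump above, stores deltas in packed buffers); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class GlobalOrdinalSketch {
    // For each segment, maps segment-local ordinal -> global ordinal.
    // Each segment's term list must be sorted and distinct.
    static int[][] buildSegmentToGlobal(List<List<String>> sortedTermsPerSegment) {
        // Merge all per-segment dictionaries into one sorted, distinct list.
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> terms : sortedTermsPerSegment) {
            merged.addAll(terms);
        }
        List<String> global = new ArrayList<>(merged);

        int[][] segToGlobal = new int[sortedTermsPerSegment.size()][];
        for (int s = 0; s < segToGlobal.length; s++) {
            List<String> terms = sortedTermsPerSegment.get(s);
            segToGlobal[s] = new int[terms.size()];
            for (int ord = 0; ord < terms.size(); ord++) {
                // Position of the term in the merged list is its global ordinal.
                segToGlobal[s][ord] = Collections.binarySearch(global, terms.get(ord));
            }
        }
        return segToGlobal;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("apple", "cherry"),
                Arrays.asList("banana", "cherry", "date"));
        for (int[] mapping : buildSegmentToGlobal(segments)) {
            System.out.println(Arrays.toString(mapping));
        }
    }
}
```

The map holds one entry per (segment, local ordinal) pair, so its size follows the sum of distinct values over all segments. It has to be built before any counting can happen, which would be consistent with a query that hits only 6 documents still being slow and memory-hungry on a high-cardinality string field.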
Does this information provide any leads on how to fix the
response-time/memory-consumption issue? Maybe it helps to tell whether
going to 4.5 will fix the issue?
Regards, Per Steffensen
On 11/5/13 1:47 PM, Per Steffensen wrote:
> Looking at threaddumps
>
> It seems like one of the major differences in what is done for
> c_dstr_doc_sto vs a_dlng_doc_sto is in
> SimpleFacets.getFacetFieldCounts, where c_dstr_doc_sto takes the
> "getTermCounts"-path and a_dlng_doc_sto takes the
> "getListedTermCounts"-path.
>
> String termList = localParams == null ? null :
> localParams.get(CommonParams.TERMS);
> if (termList != null) {
> res.add(key, getListedTermCounts(facetValue, termList));
> } else {
> res.add(key, getTermCounts(facetValue));
> }
>
> getTermCounts seems to do a lot more and to be a lot more complex than
> getListedTermCounts
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
Looking at threaddumps
It seems like one of the major differences in what is done for
c_dstr_doc_sto vs a_dlng_doc_sto is in SimpleFacets.getFacetFieldCounts,
where c_dstr_doc_sto takes the "getTermCounts"-path and a_dlng_doc_sto
takes the "getListedTermCounts"-path.
String termList = localParams == null ? null : localParams.get(CommonParams.TERMS);
if (termList != null) {
  res.add(key, getListedTermCounts(facetValue, termList));
} else {
  res.add(key, getTermCounts(facetValue));
}
getTermCounts seems to do a lot more and to be a lot more complex than
getListedTermCounts
On 11/5/13 11:47 AM, Per Steffensen wrote:
> Hi
>
> We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small"
> documents loaded. The documents have the following fields
> * a_dlng_doc_sto
> * b_dlng_doc_sto
> * c_dstr_doc_sto
> * timestamp_lng_ind_sto
> * d_lng_ind_sto
> From schema.xml
> <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false"
> stored="true" required="true" docValues="true"/>
> <dynamicField name="*_lng_ind_sto" type="long" indexed="true"
> stored="true"/>
> <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
> stored="true" required="true" docValues="true"/>
> ...
> <fieldType name="dstring" class="solr.StrField"
> sortMissingLast="true" docValuesFormat="Disk"/>
> <fieldType name="dlng" class="solr.TrieLongField"
> precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>
>
> We execute queries on the following format:
> * q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
> * facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
>
> F.ex executing a query with values for x, y, a, b ... and n that hits
> only 6 documents (out of the 12billion) total
> * With <field>=a_dlng_doc_sto (long docvalue) the query responds
> fairly quick (< 2 sec)
> * With <field>=c_dstr_doc_sto (string docvalue) the query responds
> very slowly (> 100 sec) and only if we give the Solr-nodes a lot of
> Xmx. If Xmx is too low we experience OOM on involved Solr-nodes and
> never see a response
> c_dstr_doc_sto strings are all about 10-15 chars, so it is not very
> long strings
>
> Is it a known issue that there is such a big difference between facet
> searches on longs and strings? And that memory usage seems to very
> different, also?
> If yes, has it been optimized after 4.4.0?
>
> Regards, Per Steffensen
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
Thanks for all the help, guys!
Just to clarify. Everything is working functionality-wise - we have
tests showing that.
It is just that two similar queries (hitting the same number of rows
(only 6 among 12 billion in this example) and resulting in the same
number of facet-groups etc.) perform very differently depending
on the type of the facet.field. It is fast (< 2 secs) and efficient when
the facet.field is
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
stored="true" required="true" docValues="true"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0" docValuesFormat="Disk"/>
But it is very slow (> 100 secs) and memory-consuming (eating GBs) when
the facet.field is
<dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false"
stored="true" required="true" docValues="true"/>
<fieldType name="dstring" class="solr.StrField"
sortMissingLast="true" docValuesFormat="Disk"/>
We use docValuesFormat="Disk" because we have so much data that
everything will never fit in memory. Are you saying that this does not
work before 4.5? But how does that explain the huge difference in
response-time and memory-consumption? I guess, if it does not work in 4.4,
that it does not work for either of the types?
Just a side-question: We never have more than one value per field. Would
we benefit from adding multiValued=false to our field-declarations?
Regards, Per Steffensen
On 11/5/13 11:44 PM, Shawn Heisey wrote:
> On 11/5/2013 11:56 AM, Erick Erickson wrote:
>> Hmmm, what I'm referring to is this bit:
>>
>> <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>
>>
>> The docValuesFormat="Disk" bit isn't supported until 4.5, which
>> doesn't seem clear in either place. Feel free to disagree of course :).
>
>
> I'm pretty sure that the disk format was supported from 4.2, when
> docvalues first came to Solr. Not sure about earlier. Here's someone
> with it working on 4.2.1:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201304.mbox/%3C51766344.5060706@gmail.com%3E
>
>
> Something that wasn't supported that far back (and as far as I know
> still isn't supported) is upgrading Solr with an existing index that
> uses the disk format.
>
> Thanks,
> Shawn
>
Re: DocValue on Strings slow and OOM
Posted by Shawn Heisey <so...@elyograg.org>.
On 11/5/2013 11:56 AM, Erick Erickson wrote:
> Hmmm, what I'm referring to is this bit:
>
> <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>
>
> The docValuesFormat="Disk" bit isn't supported until 4.5, which
> doesn't seem clear in either place. Feel free to disagree of course :).
I'm pretty sure that the disk format was supported from 4.2, when
docvalues first came to Solr. Not sure about earlier. Here's someone
with it working on 4.2.1:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201304.mbox/%3C51766344.5060706@gmail.com%3E
Something that wasn't supported that far back (and as far as I know
still isn't supported) is upgrading Solr with an existing index that
uses the disk format.
Thanks,
Shawn
Re: DocValue on Strings slow and OOM
Posted by Erick Erickson <er...@gmail.com>.
Hmmm, what I'm referring to is this bit:
<fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk"/>
The docValuesFormat="Disk" bit isn't supported until 4.5, which doesn't
seem clear in either place. Feel free to disagree of course :).
On Tue, Nov 5, 2013 at 11:43 AM, Cassandra Targett <ca...@gmail.com> wrote:
> On Tue, Nov 5, 2013 at 3:27 PM, Erick Erickson <er...@gmail.com>
> wrote:
> > Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
> > about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
> > kind of looks like you can do that with 4.2....
> >
>
> It's in the Solr Ref Guide:
> https://cwiki.apache.org/confluence/display/solr/DocValues, fixed for
> 4.5
Re: DocValue on Strings slow and OOM
Posted by Cassandra Targett <ca...@gmail.com>.
On Tue, Nov 5, 2013 at 3:27 PM, Erick Erickson <er...@gmail.com> wrote:
> Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
> about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
> kind of looks like you can do that with 4.2....
>
It's in the Solr Ref Guide:
https://cwiki.apache.org/confluence/display/solr/DocValues, fixed for
4.5
Re: DocValue on Strings slow and OOM
Posted by Erick Erickson <er...@gmail.com>.
Hmmmm. I was just looking at the DocValues Wiki page. Should I add a bit
about docValuesFormat supporting "Disk" as a 4.5 plus feature? Currently it
kind of looks like you can do that with 4.2....
Or am I off base here? I'm going from CHANGES.txt about LUCENE-5124
Erick
On Tue, Nov 5, 2013 at 9:46 AM, Robert Muir <rc...@gmail.com> wrote:
> On Tue, Nov 5, 2013 at 9:42 AM, Per Steffensen <st...@designware.dk>
> wrote:
> > On 11/5/13 3:30 PM, Robert Muir wrote:
> >>
> >> If you are querying on a field, you should index it!
> >
> > Believe I do that. Query looks like this "timestamp_lng_ind_sto:[x TO y] AND
> > d_lng_ind_sto:(a OR b OR ... OR n)" and both "timestamp_lng_ind_sto" and
> > "d_lng_ind_sto" are indexed.
> > Please elaborate!
> >
>
> Solr faceting often runs queries behind the scenes. Please only turn
> off indexed=true if you are really, really sure you do not need it.
>
> And use 4.5.0 if you have memory concerns.
>
Re: DocValue on Strings slow and OOM
Posted by Robert Muir <rc...@gmail.com>.
On Tue, Nov 5, 2013 at 9:42 AM, Per Steffensen <st...@designware.dk> wrote:
> On 11/5/13 3:30 PM, Robert Muir wrote:
>>
>> If you are querying on a field, you should index it!
>
> Believe I do that. Query looks like this "timestamp_lng_ind_sto:[x TO y] AND
> d_lng_ind_sto:(a OR b OR ... OR n)" and both "timestamp_lng_ind_sto" and
> "d_lng_ind_sto" are indexed.
> Please elaborate!
>
Solr faceting often runs queries behind the scenes. Please only turn
off indexed=true if you are really, really sure you do not need it.
And use 4.5.0 if you have memory concerns.
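A minimal sketch of what that advice could look like, applied to the
schema.xml fragment from the original post. It assumes the only change is
setting indexed="true" on the two docValues dynamic fields. The thread does
not confirm that this alone explains the slow string faceting, and existing
documents would have to be re-indexed for the change to take effect.

```xml
<!-- Sketch only: same dynamic fields as in the original post, but with
     indexed="true" so that any queries run behind the scenes by faceting
     can use the inverted index. -->
<dynamicField name="*_dstr_doc_sto" type="dstring" indexed="true"
              stored="true" required="true" docValues="true"/>
<dynamicField name="*_dlng_doc_sto" type="dlng" indexed="true"
              stored="true" required="true" docValues="true"/>
```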
Re: DocValue on Strings slow and OOM
Posted by Per Steffensen <st...@designware.dk>.
On 11/5/13 3:30 PM, Robert Muir wrote:
> If you are querying on a field, you should index it!
Believe I do that. Query looks like this "timestamp_lng_ind_sto:[x TO y]
AND d_lng_ind_sto:(a OR b OR ... OR n)" and both "timestamp_lng_ind_sto"
and "d_lng_ind_sto" are indexed.
Please elaborate!
I facet/group on fields that are indexed=false and docValues=true, but
that is the case for both of the facet.fields "a_dlng_doc_sto" and
"c_dstr_doc_sto", so it shouldn't explain the big difference between
faceting on the long field and faceting on the string field.
Re: DocValue on Strings slow and OOM
Posted by Robert Muir <rc...@gmail.com>.
If you are querying on a field, you should index it!
On Tue, Nov 5, 2013 at 5:47 AM, Per Steffensen <st...@designware.dk> wrote:
> Hi
>
> We have a 6-Solr-node (release 4.4.0) setup with 12 billion "small" documents
> loaded. The documents have the following fields
> * a_dlng_doc_sto
> * b_dlng_doc_sto
> * c_dstr_doc_sto
> * timestamp_lng_ind_sto
> * d_lng_ind_sto
> From schema.xml
> <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false"
> stored="true" required="true" docValues="true"/>
> <dynamicField name="*_lng_ind_sto" type="long" indexed="true"
> stored="true"/>
> <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false"
> stored="true" required="true" docValues="true"/>
> ...
> <fieldType name="dstring" class="solr.StrField" sortMissingLast="true"
> docValuesFormat="Disk"/>
> <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
> positionIncrementGap="0" docValuesFormat="Disk"/>
>
> We execute queries on the following format:
> * q=timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n)
> * facet=true&facet.field=<field>&facet.zeros=false&facet.mincount=1
>
> For example, executing a query with values for x, y, a, b ... and n that hits
> only 6 documents (out of the 12 billion) total
> * With <field>=a_dlng_doc_sto (long docvalue) the query responds fairly
> quickly (< 2 sec)
> * With <field>=c_dstr_doc_sto (string docvalue) the query responds very
> slowly (> 100 sec) and only if we give the Solr-nodes a lot of Xmx. If Xmx
> is too low we experience OOM on involved Solr-nodes and never see a response
> c_dstr_doc_sto strings are all about 10-15 chars, so they are not very long
> strings
>
> Is it a known issue that there is such a big difference between facet
> searches on longs and strings? And that memory usage seems to be very
> different, also?
> If yes, has it been optimized after 4.4.0?
>
> Regards, Per Steffensen