Posted to solr-user@lucene.apache.org by oferiko <of...@gmail.com> on 2010/07/19 01:43:10 UTC

filter query on timestamp slowing query???

I have a query that runs much slower when I try to filter it.
The field is of type pdate (solr.DateField) and the filter is, for example,
timestamp:[2010-01-01T00:00:00Z TO NOW] (to look only for documents since
Jan 1st).
If I don't use the filter, the query returns pretty fast, but adding the
filter (either as an fq or as part of the query itself) slows the query a
lot.

any idea anyone???
thanks for the help
-- 
View this message in context: http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p977280.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.
Sorry for not mentioning it: we use Solr 1.4.1.

Thanks again for any ideas.

RE: filter query on timestamp slowing query???

Posted by Jonathan Rochkind <ro...@jhu.edu>.
britske wrote:
>> *If* you could query on internal docid (I'm not sure that it's available
>> out-of-the-box, or if you can at all)
>> your original problem, quoted below, could imo be simplified to asking for
>> the last docid inserted (that match the other criteria from your use-case)
>> and in the next call filter from that docid forward.
>
>that sounds great, is there really a way to do that?

I don't know about internal docids, but there's no reason you can't use the same technique with timestamps, if you want to take the two-query approach of remembering the last document from 30 minutes ago.

Query for the latest timestamp by sorting by timestamp descending with rows=1; the row you get back has the greatest timestamp.

30 minutes later, query with a filter for timestamps greater than the one you remembered (in pseudo-syntax, fq=timestamp>that_one_we_remembered).

Would this be any slower with timestamps than with docids? I don't think so, but there's one way to find out.

Also, with any sorting, you probably want to include a warming query that sorts by the field you are going to sort on. I haven't figured out yet whether a warming query that sorts on a field will also help speed up later range queries (rather than just later sorts) on that field, but I'm thinking it might.
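
The two-query approach described above could be sketched roughly as follows. This is only an illustration in Python, assuming a field named timestamp as in this thread; the helper names are hypothetical, and the exclusive open-ended range {... TO *} is a stand-in for "greater than" that is worth checking against your Solr version.

```python
from urllib.parse import urlencode


def latest_timestamp_params(q):
    """Query string for the newest matching document (hypothetical helper)."""
    return urlencode([
        ("q", q),
        ("sort", "timestamp desc"),  # newest first
        ("rows", 1),                 # only the top row is needed
        ("fl", "timestamp"),         # only the timestamp is needed
    ])


def newer_than_params(q, remembered):
    """Query string for documents strictly newer than a remembered timestamp."""
    # Curly braces make the range exclusive, so the remembered
    # document itself is not returned again.
    return urlencode([
        ("q", q),
        ("fq", "timestamp:{%s TO *}" % remembered),
    ])


first = latest_timestamp_params("myStrField:foo")
later = newer_than_params("myStrField:foo", "2010-07-23T19:42:00Z")
```

The first string is sent once to learn the newest timestamp; the second is sent 30 minutes later with that remembered value.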

Jonathan

Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.

britske wrote:
> 
> just wanted to mention a possible other route, which might be entirely
> hypothetical :-)
> 
> *If* you could query on internal docid (I'm not sure that it's available
> out-of-the-box, or if you can at all)
> your original problem, quoted below, could imo be simplified to asking for
> the last docid inserted (that match the other criteria from your use-case)
> and in the next call filter from that docid forward.
> 
that sounds great, is there really a way to do that? 


britske wrote:
> 
>>Every 30 minutes, i ask the index what are the documents that were added to
>>it, since the last time i queried it, that match a certain criteria.
>>From time to time, once a week or so, i ask the index for ALL the documents
>>that match that criteria. (i also do this for not only one query, but
>>several)
>>This is why i need the timestamp filter.
> 
> Again, I'm not entirely sure that querying / filtering on internal docids
> is possible (perhaps someone can comment) but if it is, it would perhaps be
> more performant.
> Big IF, I know.
> 
> Geert-Jan
> 
> 2010/7/23 Chris Hostetter <ho...@fucit.org>
> 
>> : On top of using trie dates, you might consider separating the timestamp
>> : portion and the type portion of the fq into separate fq parameters --
>> : that will allow them to be stored in the filter cache separately. So
>> : for instance, if you include "type:x OR type:y" in queries a lot, but
>> : with different date ranges, then when you make a new query, the set for
>> : "type:x OR type:y" can be pulled from the filter cache and intersected
>>
>> definitely ... that's the one big thing that jumped out at me once you
>> showed us *how* you were constructing these queries.
>>
>>
>>
>> -Hoss
>>
>>
> 
> 
That's also something that I'll integrate into my testing environment,
thanks.

Re: filter query on timestamp slowing query???

Posted by Geert-Jan Brits <gb...@gmail.com>.
Just wanted to mention another possible route, which might be entirely
hypothetical :-)

*If* you could query on internal docid (I'm not sure that it's available
out-of-the-box, or whether you can at all),
your original problem, quoted below, could IMO be simplified to asking for
the last docid inserted (that matches the other criteria from your use case)
and, in the next call, filtering from that docid forward.

>Every 30 minutes, i ask the index what are the documents that were added to
>it, since the last time i queried it, that match a certain criteria.
>From time to time, once a week or so, i ask the index for ALL the documents
>that match that criteria. (i also do this for not only one query, but
>several)
>This is why i need the timestamp filter.

Again, I'm not entirely sure that querying / filtering on internal docids is
possible (perhaps someone can comment), but if it is, it would perhaps be
more performant.
Big IF, I know.

Geert-Jan


RE: filter query on timestamp slowing query???

Posted by Chris Hostetter <ho...@fucit.org>.
: On top of using trie dates, you might consider separating the timestamp
: portion and the type portion of the fq into separate fq parameters --
: that will allow them to be stored in the filter cache separately. So
: for instance, if you include "type:x OR type:y" in queries a lot, but
: with different date ranges, then when you make a new query, the set for
: "type:x OR type:y" can be pulled from the filter cache and intersected

definitely ... that's the one big thing that jumped out at me once you 
showed us *how* you were constructing these queries.  



-Hoss


RE: filter query on timestamp slowing query???

Posted by Jonathan Rochkind <ro...@jhu.edu>.
> and a typical query would be:
> fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000

On top of using trie dates, you might consider separating the timestamp portion and the type portion of the fq into separate fq parameters -- that will allow them to be stored in the filter cache separately. So for instance, if you include "type:x OR type:y" in queries a lot, but with different date ranges, then when you make a new query, the set for "type:x OR type:y" can be pulled from the filter cache and intersected with the other result set, so that portion won't have to be run again. That's probably not where your slowness is coming from, but it shouldn't hurt.

Multiple fq's are essentially AND'd together, so whenever you have an fq made of separate clauses AND'd together, you can always split them into multiple fq's; it won't affect the result set, but it will affect the caching possibilities.
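
As a rough sketch of what splitting the filter looks like on the client side, the Python below builds the example request with two fq parameters instead of one. The function name is hypothetical; the field names and values come from the query quoted above.

```python
from urllib.parse import urlencode


def build_params(q, date_from, types):
    """Build a query string with the date range and type filter as separate fq's."""
    type_clause = " OR ".join("type:%s" % t for t in types)
    params = [
        ("q", q),
        # Each fq is cached on its own, so the type filter can be
        # reused even when the date range changes.
        ("fq", "timestamp:[%s TO NOW]" % date_from),
        ("fq", type_clause),
        ("fl", "id,type,timestamp,score"),
        ("start", 0),
        ("rows", 2000),
    ]
    return urlencode(params)


query_string = build_params(
    '"Coca Cola" pepsi -"dr pepper"',
    "2010-07-07T00:00:00Z",
    ["x", "y"],
)
```

Passing a list of pairs to urlencode keeps the repeated fq key, which a plain dict would not allow.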

Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.
I'm in the process of indexing my demo data to test that. I'll have more
valid data on whether or not it made a difference in a few days.
Thanks


On 23/07/2010, at 19:42, "Jonathan Rochkind [via Lucene]" <
ml-node+990234-2085494904-316247@n3.nabble.com> wrote:

> > and a typical query would be:
> > fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000
>
> My understanding is that this is essentially what the solr 1.4 trie date
> fields are made for, I'd use them, should speed things up.  Not sure where
> the best documentation for them is, but see:
>
> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/





RE: filter query on timestamp slowing query???

Posted by Jonathan Rochkind <ro...@jhu.edu>.
> and a typical query would be:
> fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&
> rows=2000

My understanding is that this is essentially what the Solr 1.4 trie date fields are made for. I'd use them; they should speed things up. I'm not sure where the best documentation for them is, but see:

http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/



Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.
I don't specify any sort order, and I do request the score, so results are
ordered by score.

My schema consists of these fields:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="timestamp" type="pdate" indexed="true" stored="true" default="NOW" multiValued="false"/> (changing now to tdate)
<field name="type" type="string" indexed="true" stored="true" required="true" />
<field name="contents" type="text" indexed="true" stored="false" termVectors="true" />

and a typical query would be:
fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000

Thanks again for your time.

Re: filter query on timestamp slowing query???

Posted by Chris Hostetter <ho...@fucit.org>.
: You are correct, first of all i haven't moved yet to the TrieDateField, but i
: am still waiting to find out a bit more information about it, and there's
: not a lot of info, other than in the xml file.

In general, TrieFields are a way of trading disk space for range query
speed.  They are explained fairly well if you look at the docs...

http://lucene.apache.org/solr/api/org/apache/solr/schema/TrieField.html
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/NumericRangeQuery.html

...although I realize now that TrieDateField's docs don't actually
link to TrieField, where the explanation is provided.
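
To make the disk-space-for-speed trade more concrete, here is a rough Python illustration of the idea (it is not Lucene's actual term encoding). Each value is indexed once per precision level, with progressively more low-order bits dropped, so a wide range can be covered by a few coarse terms plus a handful of fine ones at the edges. The function name and the precision step of 8 are assumptions chosen for the example.

```python
def trie_terms(value, precision_step=8, bits=64):
    """Return one (shift, prefix) pair per precision level for a value."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


# Two made-up values 1000 apart (think: one second, in milliseconds).
a = 5 << 16
b = a + 1000

terms_a = trie_terms(a)
terms_b = trie_terms(b)

# At the finest levels the terms differ, but the coarser levels are
# shared, which is what lets a range query match many values at once.
shared = [t for t in terms_a if t in terms_b]
```

The extra terms per value are the disk-space cost; a smaller precision step means more terms per value and fewer terms to visit per range query.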

As for your use case...

: I'll explain my use case, so you'll know a bit more. I have an  index that's
: being updated regularly, (every second i have 10 to 50 new documents, most
: of them are small)
: 
: Every 30 minutes, i ask the index what are the documents that were added to
: it, since the last time i queried it, that match a certain criteria.
: From time to time, once a week or so, i ask the index for ALL the documents
: that match that criteria. (i also do this for not only one query, but
: several)
: This is why i need the timestamp filter.
: 
: The queries that don't have any time range, take a few seconds to finish,
: while the ones with time range, take a few minutes.
: Hope that helps understanding my situation, and i am open to any suggestion
: how to change the way things work, if it will improve performance.

You keep saying you run "simple queries" and gave an example of
"myStrField:foo", and you say you "ask the index what are the documents
that were added to it, since the last time i queried it" ... but you've
never given any concrete example of a full Solr request that incorporates
this timestamp filtering so we can see *exactly* what your requests look
like.  Even with an index the size you are describing, and even with the
slower performance of DateField compared to TrieDateField, I find it hard
to believe that a query for "myStrField:foo" would go from a few seconds
to several minutes by adding an fq range query for a span of ~30 minutes.
Are you by any chance also *sorting* the documents by that timestamp field
when you do this?

My best guess is that either:

  a) your "raw query performance" is generally really bad, but you don't
notice when you do your "simple queries" because of Solr's
queryResultCache -- but this can't help when you add an fq that differs
on every request, so you see the bad performance then.  If this is the
situation, I have no real suggestions.

  b) when you do your individual requests that filter by your timestamp
field you are also sorting by your timestamp field -- a field you don't
ever sort on in any other queries, so the FieldCache needed for sorting
needs to be built before those queries can be returned.  If you stop
sorting on this timestamp field (or add a newSearcher warming query that
does the same sort) then the problem should go away.



-Hoss


Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.

Chris Hostetter-3 wrote:
> 
> : updating your index between queries, so you may be reloading
> : your cache every time. Are you updating or adding documents
> : between queries and if so, how?
> : 
> : If this is vaguely on target, have you tried firing up warmup queries
> : after you update your index that involve your timestamp?
> 
> Based on the use case, I'm not sure that will really help -- it sounds
> like the range query is always based on the exact timestamp of the most
> recent doc from the last time this particular query was run -- which means
> by definition that the timestamp changes every time the query is
> run, so caching it is useless.
> 
> Skimming the thread, it seems like the OP isn't using TrieDateField
> (when asked if he was, he posted a followup about precision if he did use
> it -- implying he is not currently).  Switching to TrieDateField is
> probably the only improvement likely to make a significant
> difference in speeding up these one-time queries.
> 
> I've also seen no explanation of how big the index is, or what the OP's
> definition of "slow" is (how fast are the queries with and w/o these
> filters?).  That type of information is fairly critical to being able to
> offer performance suggestions.
> 
> I'm also suspicious of the entire line of questioning -- it smells like
> there might be an XY Problem here.  Knowing the ultimate goal that
> led to this timestamp-based filter query approach might help us suggest
> an alternate/better/faster solution.
> 
> 
> 
> -Hoss
> 
> 
> 
You are correct. First of all, I haven't moved to TrieDateField yet; I
am still waiting to find out a bit more about it, and there's
not a lot of info other than in the XML file.
Second, I also think caching is not my problem, as the queries usually
have different time ranges.
The index is pretty big: right now we have around 700M documents, and its
size on disk is about 600GB.
More than half of the documents are pretty short, 10-20 words; the others
are around 300 words.

I'll explain my use case, so you'll know a bit more. I have an index that's
being updated regularly (every second I get 10 to 50 new documents, most
of them small).

Every 30 minutes, I ask the index which documents were added to
it since the last time I queried it and match certain criteria.
From time to time, once a week or so, I ask the index for ALL the documents
that match those criteria. (I do this not for just one query, but for
several.)
This is why I need the timestamp filter.

The queries that don't have any time range take a few seconds to finish,
while the ones with a time range take a few minutes.
I hope that helps in understanding my situation. I am open to any suggestion
on how to change the way things work, if it will improve performance.

Thank you guys

Re: filter query on timestamp slowing query???

Posted by Chris Hostetter <ho...@fucit.org>.
: updating your index between queries, so you may be reloading
: your cache every time. Are you updating or adding documents
: between queries and if so, how?
: 
: If this is vaguely on target, have you tried firing up warmup queries
: after you update your index that involve your timestamp?

Based on the use case, I'm not sure that will really help -- it sounds
like the range query is always based on the exact timestamp of the most
recent doc from the last time this particular query was run -- which means
by definition that the timestamp changes every time the query is
run, so caching it is useless.

Skimming the thread, it seems like the OP isn't using TrieDateField
(when asked if he was, he posted a followup about precision if he did use
it -- implying he is not currently).  Switching to TrieDateField is
probably the only improvement likely to make a significant
difference in speeding up these one-time queries.

I've also seen no explanation of how big the index is, or what the OP's
definition of "slow" is (how fast are the queries with and w/o these
filters?).  That type of information is fairly critical to being able to
offer performance suggestions.

I'm also suspicious of the entire line of questioning -- it smells like
there might be an XY Problem here.  Knowing the ultimate goal that
led to this timestamp-based filter query approach might help us suggest
an alternate/better/faster solution.



-Hoss


Re: filter query on timestamp slowing query???

Posted by Erick Erickson <er...@gmail.com>.
Here's my guess, and it's only a guess. I'm inferring that you're
updating your index between queries, so you may be reloading
your cache every time. Are you updating or adding documents
between queries and if so, how?

If this is vaguely on target, have you tried firing up warmup queries
after you update your index that involve your timestamp?

Best
Erick



On Mon, Jul 19, 2010 at 10:18 AM, oferiko <of...@gmail.com> wrote:

>
> 1. I query my index once every 30 minutes. I save the timestamp of the
> newest returned document. Next time I query for documents with a timestamp
> between the timestamp I saved from the previous query and NOW.
>
> 2. Sad to say it is not optimized; I'm at 60% of the disk space, and waiting
> for another disk to be added before I can optimize.
>
> 3. It is a simple myStrField:foo
>
> thanks for helping
>

Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.
1. I query my index once every 30 minutes. I save the timestamp of the newest
returned document. Next time I query for documents with a timestamp between
the timestamp I saved from the previous query and NOW.

2. Sad to say it is not optimized; I'm at 60% of the disk space, and waiting
for another disk to be added before I can optimize.

3. It is a simple myStrField:foo

Thanks for helping

Re: filter query on timestamp slowing query???

Posted by Ahmet Arslan <io...@yahoo.com>.
> my goal is to run a query and limit it with the timestamp
> of the last document i found.

I didn't understand this part.

> will TrieDateField give me this precision? 

You should use tdate instead of pdate for faster date range queries and date faceting. Please see the comments in the schema.xml file.

> i also see slow queries when using a filter on a field that
> is a simple  string(StrField), that has only 3 types of values, don't
> understand what  might cause it

Is your index optimized?

Is this query also a range query? Or is it something like myStrField:foo?



      

Re: filter query on timestamp slowing query???

Posted by oferiko <of...@gmail.com>.
My goal is to run a query and limit it by the timestamp of the last
document I found. Will TrieDateField give me this precision? Is there any
other way to achieve that?

I also see slow queries when using a filter on a field that is a simple
string (StrField) with only 3 distinct values; I don't understand what
might cause it.

Re: filter query on timestamp slowing query???

Posted by Ahmet Arslan <io...@yahoo.com>.
> I have a query that seems to be running much slower when i try to filter it.
> the field is of type pdate (solr.DateField) and the filter is for example
> timestamp:[2010-01-01T00:00:00Z TO NOW] (to look only for documents since
> Jan 1st.
> if i don't use the filter, the query returns pretty fast, but adding the
> filter (either as a filter or as part of the query itself) slows the query a
> lot.

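This is expected. A timestamp field with millisecond precision will produce a lot of unique terms; the value will probably be unique for every document.

To speed up range queries:

1-) use tdate (solr.TrieDateField)
2-) use fq=timestamp:[2010-01-01T00:00:00Z TO NOW/MINUTE+1MINUTE]

Rounding NOW to the minute keeps the filter string identical for every request made within the same minute, so the filter can be reused from the filter cache instead of being recomputed.

The idea for 2 comes from http://search-lucene.com/m/rj7LLb20Nl1/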
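
One practical detail when sending such a filter as a URL parameter: the literal "+" in NOW/MINUTE+1MINUTE has to be percent-encoded, otherwise it is decoded as a space. A minimal Python sketch, assuming the field name timestamp from this thread and a hypothetical helper name:

```python
from urllib.parse import urlencode


def rounded_range_fq(start):
    """Range filter whose upper bound is the start of the next minute."""
    return "timestamp:[%s TO NOW/MINUTE+1MINUTE]" % start


# urlencode percent-encodes the "+" in the date math as %2B and
# turns the spaces around TO into "+".
query_string = urlencode([
    ("q", "myStrField:foo"),
    ("fq", rounded_range_fq("2010-01-01T00:00:00Z")),
])
```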