Posted to solr-user@lucene.apache.org by Aki Balogh <ak...@marketmuse.com> on 2015/10/23 14:19:26 UTC

Does docValues impact termfreq ?

Hello,

In our solr application, we use a Function Query (termfreq) very heavily.
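
For reference, a minimal sketch of the kind of request we make (the
collection, field and term names here are illustrative only):

  /solr/articles/select?q=*:*&rows=500&fl=id,tf:termfreq(body,'solr')

We get termfreq(field,'term') back as a pseudo-field on every document in
the result set and then sum the values up on the client side.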

Index time and disk space are not important, but we're looking to improve
performance on termfreq at query time.
I've been reading up on docValues. Would this be a way to improve
performance?

I had read that Lucene uses Field Cache for Function Queries, so
performance may not be affected.


And, any general suggestions for improving query performance on Function
Queries?

Thanks,
Aki

Re: Does docValues impact termfreq ?

Posted by Erick Erickson <er...@gmail.com>.
Do be aware that docValues can only be used for non-text types,
i.e. numerics, strings and the like. Specifically, docValues are
_not_ possible for solr.TextField, and docValues don't support
analysis chains because the underlying primitive types don't. You'll
get an error if you try to specify docValues on a solr.TextField
type.
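
For illustration, docValues belong on a primitive field type, roughly
like this in schema.xml (field and type names here are made up):

  <field name="term_s" type="string" indexed="true" stored="false"
         docValues="true"/>

whereas putting docValues="true" on a field whose type is based on
solr.TextField is rejected with an error.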

Does that change the discussion?

Best,
Erick


Re: Does docValues impact termfreq ?

Posted by Emir Arnautovic <em...@sematext.com>.
Hi Aki,
IMO this is underuse of Solr (not to mention SolrCloud). I would
recommend doing in-memory document parsing (if you need something from
the Lucene/Solr analysis classes, use it) and using some other
cache-like solution to store term/total frequency pairs (you can try
Redis).

That way you will have updatable, fast total frequency lookups.
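
A minimal sketch of that idea (Python with redis-py; the crude regex
tokenizer below is only a stand-in for a real Lucene/Solr analysis
chain, and the key name is made up):

import re
import redis

r = redis.Redis(host="localhost", port=6379)

def index_document(text):
    # Increment the corpus-wide count of every token in this document.
    for token in re.findall(r"\w+", text.lower()):
        r.hincrby("total_term_freq", token, 1)

def total_term_freq(term):
    # Fast, updatable lookup of the corpus-wide count for one term.
    count = r.hget("total_term_freq", term.lower())
    return int(count) if count else 0

index_document("Solr scales well, and Solr is fast")
print(total_term_freq("solr"))   # -> 2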

Thanks,
Emir

On 26.10.2015 14:43, Aki Balogh wrote:
> Hi Emir,
>
> This is correct. This is the only way we use the index.
>
> Thanks,
> Aki

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Does docValues impact termfreq ?

Posted by Scott Stults <ss...@opensourceconnections.com>.
Aki, does the sumtotaltermfreq function do what you need?
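
For reference, a sketch of what that might look like (collection and
field names are illustrative): totaltermfreq(field,'term'), alias ttf,
returns the index-wide count of a single term, while
sumtotaltermfreq(field), alias sttf, sums totaltermfreq over every term
in the field. Since the value is the same for every document, a single
row is enough to read it back:

  /solr/articles/select?q=*:*&rows=1&fl=total:ttf(body,'solr')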


On Mon, Oct 26, 2015 at 9:43 AM, Aki Balogh <ak...@marketmuse.com> wrote:

> Hi Emir,
>
> This is correct. This is the only way we use the index.
>
> Thanks,
> Aki



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnautovic@sematext.com> wrote:

> If I got it right, you are using term query, use function to get TF as
> score, iterate all documents in results and sum up total number of
> occurrences of specific term in index? Is this only way you use index or
> this is side functionality?
>
> Thanks,
> Emir

Re: Does docValues impact termfreq ?

Posted by Emir Arnautovic <em...@sematext.com>.
If I got it right: you run a term query, use a function query to return
TF as the score, iterate over all documents in the results, and sum up
the total number of occurrences of the specific term in the index? Is
this the only way you use the index, or is this side functionality?
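
In other words, something along these lines? (A rough Python sketch; the
Solr URL, collection, field and term are placeholders, and paging is
omitted.)

import requests

SOLR = "http://localhost:8983/solr/articles/select"

def corpus_term_count(term, rows=500):
    params = {
        "q": "body:%s" % term,                      # term query
        "fl": "tf:termfreq(body,'%s')" % term,      # per-document TF
        "rows": rows,
        "wt": "json",
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    # Sum the per-document term frequencies on the client side.
    return sum(d.get("tf", 0) for d in docs)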

Thanks,
Emir

On 24.10.2015 22:28, Aki Balogh wrote:
> Certainly, yes. I'm just doing a word count, ie how often does a specific
> term come up in the corpus?

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Certainly, yes. I'm just doing a word count, i.e. how often does a specific
term come up in the corpus?
On Oct 24, 2015 4:20 PM, "Upayavira" <uv...@odoko.co.uk> wrote:

> yes, but what do you want to do with the TF? What problem are you
> solving with it? If you are able to share that...
>
> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> > Yes, sorry, I am not being clear.
> >
> > We are not even doing scoring, just getting the raw TF values. We're
> > doing this in solr because it can scale well.
> >
> > But with large corpora, retrieving the word counts takes some time, in
> > part because solr is splitting up word count by document and generating
> > a large request. We then get the request and just sum it all up. I'm
> > wondering if there's a more direct way.
> >
> > On Oct 24, 2015 4:00 PM, "Upayavira" <uv...@odoko.co.uk> wrote:
> > > Can you explain more what you are using TF for? Because it sounds
> > > rather like scoring. You could disable field norms and IDF and
> > > scoring would be mostly TF, no?
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > > > Thanks, let me think about that.
> > > >
> > > > We're using termfreq to get the TF score, but we don't know which
> > > > term we'll need the TF for. So we'd have to do a corpuswide summing
> > > > of termfreq for each potential term across all documents in the
> > > > corpus. It seems like it'd require some development work to compute
> > > > that, and our code would be fragile.
> > > >
> > > > Let me think about that more.
> > > >
> > > > It might make sense to just move to solrcloud, it's the right
> > > > architectural decision anyway.
> > > >
> > > > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <uv...@odoko.co.uk> wrote:
> > > > > If you just want word length, then do work during indexing -
> > > > > index a field for the word length. Then, I believe you can do
> > > > > faceting - e.g. with the json faceting API I believe you can do a
> > > > > sum() calculation on a field rather than the more traditional
> > > > > count.
> > > > >
> > > > > Thinking aloud, there might be an easier way - index a field that
> > > > > is the same for all documents, and facet on it. Instead of
> > > > > counting the number of documents, calculate the sum() of your
> > > > > word count field.
> > > > >
> > > > > I *think* that should work.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > > > Hi Jack,
> > > > > >
> > > > > > I'm just using solr to get word count across a large number of
> > > > > > documents.
> > > > > >
> > > > > > It's somewhat non-standard, because we're ignoring relevance,
> > > > > > but it seems to work well for this use case otherwise.
> > > > > >
> > > > > > My understanding then is:
> > > > > > 1) since termfreq is pre-processed and fetched, there's no good
> > > > > > way to speed it up (except by caching earlier calculations)
> > > > > >
> > > > > > 2) there's no way to have solr sum up all of the termfreqs
> > > > > > across all documents in a search and just return one number for
> > > > > > total termfreqs
> > > > > >
> > > > > > Are these correct?
> > > > > >
> > > > > > Thanks,
> > > > > > Aki
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > > > > <ja...@gmail.com> wrote:
> > > > > > > That's what a normal query does - Lucene takes all the terms
> > > > > > > used in the query and sums them up for each document in the
> > > > > > > response, producing a single number, the score, for each
> > > > > > > document. That's the way Solr is designed to be used. You
> > > > > > > still haven't elaborated why you are trying to use Solr in a
> > > > > > > way other than it was intended.
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh
> > > > > > > <aki@marketmuse.com> wrote:
> > > > > > > > Gotcha - that's disheartening.
> > > > > > > >
> > > > > > > > One idea: when I run termfreq, I get all of the termfreqs
> > > > > > > > for each document one-by-one.
> > > > > > > >
> > > > > > > > Is there a way to have solr sum it up before creating the
> > > > > > > > request, so I only receive one number in the response?
> > > > > > > >
> > > > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira
> > > > > > > > <uv...@odoko.co.uk> wrote:
> > > > > > > > > If you mean using the term frequency function query, then
> > > > > > > > > I'm not sure there's a huge amount you can do to improve
> > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > The term frequency is a number that is used often, so it
> > > > > > > > > is stored in the index pre-calculated. Perhaps, if your
> > > > > > > > > data is not changing, optimising your index would reduce
> > > > > > > > > it to one segment, and thus might ever so slightly speed
> > > > > > > > > the aggregation of term frequencies, but I doubt it'd
> > > > > > > > > make enough difference to make it worth doing.
> > > > > > > > >
> > > > > > > > > Upayavira
> > > > > > > > >
> > > > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > > > Thanks, Jack. I did some more research and found
> > > > > > > > > > similar results.
> > > > > > > > > >
> > > > > > > > > > In our application, we are making multiple (think: 50)
> > > > > > > > > > concurrent requests to calculate term frequency on a
> > > > > > > > > > set of documents in "real-time". The faster that
> > > > > > > > > > results return, the better.
> > > > > > > > > >
> > > > > > > > > > Most of these requests are unique, so cache only helps
> > > > > > > > > > slightly.
> > > > > > > > > >
> > > > > > > > > > This analysis is happening on a single solr instance.
> > > > > > > > > >
> > > > > > > > > > Other than moving to solr cloud and splitting out the
> > > > > > > > > > processing onto multiple servers, do you have any
> > > > > > > > > > suggestions for what might speed up termfreq at query
> > > > > > > > > > time?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Aki
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
> > > > > > > > > > <ja...@gmail.com> wrote:
> > > > > > > > > > > Term frequency applies only to the indexed terms of a
> > > > > > > > > > > tokenized field. DocValues is really just a copy of
> > > > > > > > > > > the original source text and is not tokenized into
> > > > > > > > > > > terms.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you could explain how exactly you are using
> > > > > > > > > > > term frequency in function queries. More importantly,
> > > > > > > > > > > what is so "heavy" about your usage? Generally,
> > > > > > > > > > > moderate use of a feature is much more advisable to
> > > > > > > > > > > heavy usage, unless you don't care about performance.
> > > > > > > > > > >
> > > > > > > > > > > -- Jack Krupansky
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh
> > > > > > > > > > > <aki@marketmuse.com> wrote:
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > In our solr application, we use a Function Query
> > > > > > > > > > > > (termfreq) very heavily.
> > > > > > > > > > > >
> > > > > > > > > > > > Index time and disk space are not important, but
> > > > > > > > > > > > we're looking to improve performance on termfreq at
> > > > > > > > > > > > query time.
> > > > > > > > > > > > I've been reading up on docValues. Would this be a
> > > > > > > > > > > > way to improve performance?
> > > > > > > > > > > >
> > > > > > > > > > > > I had read that Lucene uses Field Cache for
> > > > > > > > > > > > Function Queries, so performance may not be
> > > > > > > > > > > > affected.
> > > > > > > > > > > >
> > > > > > > > > > > > And, any general suggestions for improving query
> > > > > > > > > > > > performance on Function Queries?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Aki

Re: Does docValues impact termfreq ?

Posted by Upayavira <uv...@odoko.co.uk>.
Yes, but what do you want to do with the TF? What problem are you
solving with it? If you are able to share that...


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're doing
this in Solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in part
because Solr is splitting the word counts up by document and generating a
large response. We then take that response and sum it all up ourselves. I'm
wondering if there's a more direct way.
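
A minimal sketch of that per-document approach - fetch termfreq via a
function query and sum on the client - for illustration only (the core name,
field name, and term below are placeholders, not taken from this thread):

import requests

SOLR = "http://localhost:8983/solr/corpus/select"  # placeholder core name

def total_termfreq(term, field="body", rows=10000):
    # Ask Solr for the per-document term frequency via a function query
    # aliased into the field list, then sum the values client-side.
    # rows must cover all matching documents (or page through them).
    params = {
        "q": "%s:%s" % (field, term),
        "fl": "id,tf:termfreq(%s,'%s')" % (field, term),
        "rows": rows,
        "wt": "json",
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    return sum(doc.get("tf", 0) for doc in docs)

print(total_termfreq("lucene"))

The expensive part is exactly what is described above: every matching
document's frequency travels over the wire before the client can add them up.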

Re: Does docValues impact termfreq ?

Posted by Upayavira <uv...@odoko.co.uk>.
Can you explain more about what you are using TF for? Because it sounds
rather like scoring. You could disable field norms and IDF, and then scoring
would be mostly TF, no?

Upayavira


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which term
we'll need the TF for. So we'd have to do a corpus-wide summing of termfreq
for each potential term across all documents in the corpus. It seems like
it'd require some development work to compute that, and our code would be
fragile.

Let me think about that more.

It might make sense to just move to SolrCloud; it's the right architectural
decision anyway.



Re: Does docValues impact termfreq ?

Posted by Upayavira <uv...@odoko.co.uk>.
If you just want word counts, then do the work during indexing - index a
field for the word count. Then, I believe you can do faceting - e.g.
with the JSON Facet API you can do a sum() calculation on a
field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that is the
same for all documents, and facet on it. Instead of counting the number
of documents, calculate the sum() of your word count field.

I *think* that should work.

Upayavira
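
A rough sketch of that suggestion, assuming a recent enough Solr for the
JSON Facet API and a per-document word-count field populated at index time
(the `wordcount` field and `corpus` core names are illustrative only):

import requests

SOLR = "http://localhost:8983/solr/corpus/select"  # placeholder core name

# Sum a pre-indexed per-document word count with the JSON Facet API,
# instead of fetching per-document values and summing them client-side.
params = {
    "q": "*:*",   # or any query that selects the documents of interest
    "rows": 0,    # only the aggregation is needed, not the documents
    "wt": "json",
    "json.facet": '{"total_words": "sum(wordcount)"}',
}
resp = requests.get(SOLR, params=params).json()
print(resp["facets"]["total_words"])

This returns a single number per request, which is the one-number response
being asked for elsewhere in the thread.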


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Hi Jack,

I'm just using Solr to get word counts across a large number of documents.

It's somewhat non-standard, because we're ignoring relevance, but it seems
to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-computed at index time and simply fetched, there's
no good way to speed it up (except by caching earlier calculations)

2) there's no way to have Solr sum up all of the termfreqs across all
documents in a search and just return one number for the total termfreqs


Are these correct?

Thanks,
Aki



Re: Does docValues impact termfreq ?

Posted by Jack Krupansky <ja...@gmail.com>.
That's what a normal query does - Lucene takes all the terms used in the
query and sums up their contributions for each document in the response,
producing a single number, the score, for each document. That's the way Solr
is designed to be used. You still haven't elaborated on why you are trying to
use Solr in a way other than how it was intended.

-- Jack Krupansky
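
For illustration, a plain query already returns that one aggregated score per
matching document (core and field names below are placeholders):

import requests

SOLR = "http://localhost:8983/solr/corpus/select"  # placeholder core name

params = {
    "q": "body:(solr lucene)",  # multi-term query
    "fl": "id,score",           # one combined score per document
    "wt": "json",
}
for doc in requests.get(SOLR, params=params).json()["response"]["docs"]:
    print(doc["id"], doc["score"])

Note that by default that score also folds in IDF and field norms, which is
why it is not a raw term-frequency total.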


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for each document
one-by-one.

Is there a way to have Solr sum it up on the server side, so I
only receive one number in the response?



Re: Does docValues impact termfreq ?

Posted by Upayavira <uv...@odoko.co.uk>.
If you mean using the term frequency function query, then I'm not sure
there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is stored in
the index pre-calculated. Perhaps, if your data is not changing,
optimising your index would reduce it to one segment, and thus might
ever so slightly speed the aggregation of term frequencies, but I doubt
it'd make enough difference to make it worth doing.

Upayavira
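
If the index really is static, the forced merge down to one segment can be
triggered through the update handler - a minimal sketch (core name is a
placeholder):

import requests

# Force-merge ("optimize") the index down to a single segment. Only worth
# doing on an index that is no longer being updated.
requests.get(
    "http://localhost:8983/solr/corpus/update",  # placeholder core name
    params={"optimize": "true", "maxSegments": "1", "wt": "json"},
)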


Re: Does docValues impact termfreq ?

Posted by Aki Balogh <ak...@marketmuse.com>.
Thanks, Jack. I did some more research and found similar results.

In our application, we are making multiple (think: 50) concurrent requests
to calculate term frequency on a set of documents in "real-time". The
faster the results return, the better.

Most of these requests are unique, so caching only helps slightly.

This analysis is happening on a single Solr instance.

Other than moving to SolrCloud and splitting the processing out onto
multiple servers, do you have any suggestions for what might speed up
termfreq at query time?

Thanks,
Aki



Re: Does docValues impact termfreq ?

Posted by Jack Krupansky <ja...@gmail.com>.
Term frequency applies only to the indexed terms of a tokenized field.
DocValues is really just a copy of the original source text and is not
tokenized into terms.

Maybe you could explain how exactly you are using term frequency in
function queries. More importantly, what is so "heavy" about your usage?
Generally, moderate use of a feature is much more advisable than heavy usage,
unless you don't care about performance.

-- Jack Krupansky
