Posted to solr-user@lucene.apache.org by Andy Pickler <an...@gmail.com> on 2013/04/01 23:32:13 UTC

Top 10 Terms in Index (by date)

Our company has an application that is "Facebook-like" for usage by
enterprise customers.  We'd like to do a report of "top 10 terms entered by
users over (some time period)".  With that in mind I'm using the
DataImportHandler to put all the relevant data from our database into a
Solr 'content' field:

<field name="content" type="text_general" indexed="true" stored="false"
multiValued="false" required="true" termVectors="true"/>

Along with the content is the 'dateCreated' for that content:

<field name="dateCreated" type="tdate" indexed="true" stored="false"
multiValued="false" required="true"/>

I'm struggling with the TermVectorComponent documentation to understand how
I can put together a query that answers the 'report' mentioned above.  For
each document I need each term counted however many times it is entered
(content of "I think what I think" would report 'think' as used twice).
Does anyone have any insight as to whether I'm headed in the right
direction and then what my query would be?

Thanks,
Andy Pickler

Re: Top 10 Terms in Index (by date)

Posted by Andy Pickler <an...@gmail.com>.
A key problem with those approaches, as well as Lucene's HighFreqTerms class
(http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html),
is that none of them seems able to combine with a date range query, which
is essential in my scenario.  I'm kinda thinking that what I'm
asking to do just isn't supported by Lucene or Solr, and that I'll have to
pursue another avenue.  If anyone has any other suggestions, I'm all ears.
I'm starting to wonder if I need to have some nightly batch job that
executes against my database and builds up "that day's top terms" in a
table or something.

Thanks,
Andy Pickler
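
For what it's worth, the nightly batch idea above could look roughly like the
Python sketch below. It is only an illustration under assumptions: the rows
stand in for (dateCreated, content) pairs read from the database, the dates
are made up, the regex tokenizer is a crude substitute for a real analyzer,
and the names daily_term_counts and top_terms are hypothetical.

```python
import re
from collections import Counter
from datetime import date


def tokenize(text):
    # Lowercase and keep runs of letters; a stand-in for a real analyzer.
    return re.findall(r"[a-z]+", text.lower())


def daily_term_counts(rows):
    """Build one Counter of term occurrences per day from (date, content) rows."""
    per_day = {}
    for created, content in rows:
        per_day.setdefault(created, Counter()).update(tokenize(content))
    return per_day


def top_terms(per_day, start, end, n=10):
    """Sum the per-day counters for start <= day <= end and return the top n."""
    total = Counter()
    for day, counts in per_day.items():
        if start <= day <= end:
            total.update(counts)
    return total.most_common(n)


# Sample rows: the texts come from this thread, the dates are invented.
rows = [
    (date(2013, 3, 30), "I think, therefore I am like you"),
    (date(2013, 4, 1), "You think too much"),
    (date(2013, 4, 1), "I think that I think much as you"),
]
per_day = daily_term_counts(rows)
report = top_terms(per_day, date(2013, 4, 1), date(2013, 4, 2), n=3)
```

Storing one counter per day keeps the report cheap: any date range is just a
sum over the stored days, at the cost of day-level granularity.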

On Tue, Apr 2, 2013 at 7:16 AM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:

> Oh, I see, essentially you want to get the sum of the term frequencies for
> every term in a subset of documents (instead of the document frequency as
> the FacetComponent would give you). I don't know of an easy, out-of-the-box
> solution for this. I know the TermVectorComponent will give you the tf for
> every term in a document, but I'm not sure if you can filter or sort on it.
> Maybe you can do something like:
> https://issues.apache.org/jira/browse/LUCENE-2393
> or what's suggested here:
> http://search-lucene.com/m/of5Fn1PUOHU/
> but I have never used something like that.
>
> Tomás

Re: Top 10 Terms in Index (by date)

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
Oh, I see, essentially you want to get the sum of the term frequencies for
every term in a subset of documents (instead of the document frequency as
the FacetComponent would give you). I don't know of an easy, out-of-the-box
solution for this. I know the TermVectorComponent will give you the tf for
every term in a document, but I'm not sure if you can filter or sort on it.
Maybe you can do something like:
https://issues.apache.org/jira/browse/LUCENE-2393
or what's suggested here:
http://search-lucene.com/m/of5Fn1PUOHU/
but I have never used something like that.

Tomás
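
One way to act on the TermVectorComponent idea above is to request per-document
term frequencies for a date-filtered result set and add them up on the client.
The sketch below is hedged accordingly: tv, tv.tf and tv.fl are
TermVectorComponent parameters, but the helper names are hypothetical, the
request handler that has the component enabled is not shown, and parsing the
response into {doc id: {term: tf}} is assumed to have happened already. As far
as I understand, term vectors come back only for the documents on the returned
page, so rows (or paging) has to cover the whole date range.

```python
from collections import Counter
from urllib.parse import urlencode


def term_vector_params(start, end, rows=100):
    """Build a query string asking for per-document tf on 'content'."""
    params = {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",  # restrict to the date range
        "rows": rows,
        "tv": "true",         # enable the TermVectorComponent for this request
        "tv.tf": "true",      # include term frequency per document
        "tv.fl": "content",   # field that was indexed with termVectors="true"
        "wt": "json",
    }
    return urlencode(params)


def sum_term_frequencies(per_doc_tf):
    """Add up term frequencies over all documents.

    per_doc_tf is assumed to be {doc_id: {term: tf}}, already extracted
    from the response by code not shown here.
    """
    total = Counter()
    for tf in per_doc_tf.values():
        total.update(tf)
    return total


query_string = term_vector_params("2013-03-01T00:00:00Z", "2013-04-01T00:00:00Z")
totals = sum_term_frequencies({
    "post1": {"i": 2, "think": 1, "you": 1},
    "reply2": {"i": 2, "think": 2, "you": 1},
})
```

Pulling term vectors for every matching document can be heavy on a large
index, so this is more a proof of concept than something to run for each
report.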



On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler <an...@gmail.com> wrote:

> I need "total number of occurrences" across all documents for each term.
> Imagine this...
>
> Post #1: "I think, therefore I am like you"
> Reply #1: "You think too much"
> Reply #2: "I think that I think much as you"
>
> Each of those "documents" is put into 'content'.  Pretending I don't have
> stop words, the top term query (not considering dateCreated in this
> example) would result in something like...
>
> "think": 4
> "I": 4
> "you": 3
> "much": 2
> ...
>
> Thus, just a "number of documents" approach doesn't work, because if a word
> occurs more than one time in a document it needs to be counted that many
> times.  That seemed to rule out faceting like you mentioned as well as the
> TermsComponent (which as I understand also only counts "documents").
>
> Thanks,
> Andy Pickler

Re: Top 10 Terms in Index (by date)

Posted by Andy Pickler <an...@gmail.com>.
I need "total number of occurrences" across all documents for each term.
Imagine this...

Post #1: "I think, therefore I am like you"
Reply #1: "You think too much"
Reply #2: "I think that I think much as you"

Each of those "documents" is put into 'content'.  Pretending I don't have
stop words, the top term query (not considering dateCreated in this
example) would result in something like...

"think": 4
"I": 4
"you": 3
"much": 2
...

Thus, just a "number of documents" approach doesn't work, because if a word
occurs more than one time in a document it needs to be counted that many
times.  That seemed to rule out faceting like you mentioned as well as the
TermsComponent (which as I understand also only counts "documents").

Thanks,
Andy Pickler
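
The difference between the two kinds of count can be checked on the three
example posts above with a few lines of Python. This is only a sketch: the
regex tokenizer is an assumption standing in for the real analysis chain, and
it lowercases everything, so "I" shows up as "i".

```python
import re
from collections import Counter

posts = [
    "I think, therefore I am like you",
    "You think too much",
    "I think that I think much as you",
]


def tokenize(text):
    # Lowercase and keep runs of letters; no stop words are removed.
    return re.findall(r"[a-z]+", text.lower())


total_tf = Counter()  # every occurrence counts (what the report needs)
doc_freq = Counter()  # a term counts once per document (what faceting gives)
for post in posts:
    tokens = tokenize(post)
    total_tf.update(tokens)
    doc_freq.update(set(tokens))

top_terms = total_tf.most_common(4)
# total_tf["think"] is 4, while doc_freq["think"] is only 3
```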

On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:

> So you have one document per user comment? Why not use faceting plus
> filtering on the "dateCreated" field? That would count "number of
> documents" for each term (so, in your case, if a term is used twice in one
> comment it would only count once). Is that what you are looking for?
>
> Tomás

Re: Top 10 Terms in Index (by date)

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
So you have one document per user comment? Why not use faceting plus
filtering on the "dateCreated" field? That would count "number of
documents" for each term (so, in your case, if a term is used twice in one
comment it would only count once). Is that what you are looking for?

Tomás
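
For reference, the faceting-plus-filter request suggested above could be put
together along these lines. It is a minimal sketch: facet, facet.field and
facet.limit are standard faceting parameters, the function name facet_params
and the example dates are made up, and the counts that come back are document
counts per term, not total occurrences.

```python
from urllib.parse import urlencode


def facet_params(start, end, limit=10):
    """Build a query string that facets on 'content' within a date range."""
    params = {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",  # filter on the date range
        "rows": 0,               # only the facet counts are needed
        "facet": "true",
        "facet.field": "content",
        "facet.limit": limit,    # top N terms by document count
        "wt": "json",
    }
    return urlencode(params)


facet_query = facet_params("2013-03-01T00:00:00Z", "2013-04-01T00:00:00Z")
```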


On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler <an...@gmail.com> wrote:

> Our company has an application that is "Facebook-like" for usage by
> enterprise customers.  We'd like to do a report of "top 10 terms entered by
> users over (some time period)".  With that in mind I'm using the
> DataImportHandler to put all the relevant data from our database into a
> Solr 'content' field:
>
> <field name="content" type="text_general" indexed="true" stored="false"
> multiValued="false" required="true" termVectors="true"/>
>
> Along with the content is the 'dateCreated' for that content:
>
> <field name="dateCreated" type="tdate" indexed="true" stored="false"
> multiValued="false" required="true"/>
>
> I'm struggling with the TermVectorComponent documentation to understand how
> I can put together a query that answers the 'report' mentioned above.  For
> each document I need each term counted however many times it is entered
> (content of "I think what I think" would report 'think' as used twice).
>  Does anyone have any insight as to whether I'm headed in the right
> direction and then what my query would be?
>
> Thanks,
> Andy Pickler
>