Posted to solr-user@lucene.apache.org by David Larochelle <dl...@cyber.law.harvard.edu> on 2013/05/16 21:29:07 UTC

Aggregate word counts over a subset of documents

Is there a way to get aggregate word counts over a subset of documents?

For example given the following data:

  [
    {
      "id": "1",
      "category": "cat1",
      "includes": "The green car."
    },
    {
      "id": "2",
      "category": "cat1",
      "includes": "The red car."
    },
    {
      "id": "3",
      "category": "cat2",
      "includes": "The black car."
    }
  ]

I'd like to be able to get total term frequency counts per category. e.g.

<category name="cat1">
   <int name="the">2</int>
   <int name="car">2</int>
   <int name="green">1</int>
   <int name="red">1</int>
</category>
<category name="cat2">
   <int name="the">1</int>
   <int name="car">1</int>
   <int name="black">1</int>
</category>

I was initially hoping to do this within Solr and I tried using the
TermFrequencyComponent. This gives term frequencies for individual
documents and term frequencies for the entire index but doesn't seem to
help with subsets. For example, TermFrequencyComponent would tell me that
car occurs 3 times over all documents in the index and 1 time in document 1
but not that it occurs 2 times over cat1 documents and 1 time over cat2
documents.

Is there a good way to use Solr/Lucene to gather aggregate results like
this? I've been focusing on just using Solr with XML files but I could
certainly write Java code if necessary.
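For reference, the aggregation described above can be computed client-side from the sample data, outside Solr entirely. The following is a minimal Python sketch with a naive lowercase tokenizer (real Solr analysis chains would tokenize differently):

```python
import re
from collections import Counter, defaultdict

docs = [
    {"id": "1", "category": "cat1", "includes": "The green car."},
    {"id": "2", "category": "cat1", "includes": "The red car."},
    {"id": "3", "category": "cat2", "includes": "The black car."},
]

# One Counter of term frequencies per category value.
counts = defaultdict(Counter)
for doc in docs:
    # Crude tokenizer: lowercase, letters only (drops punctuation).
    tokens = re.findall(r"[a-z]+", doc["includes"].lower())
    counts[doc["category"]].update(tokens)

print(counts["cat1"])  # the=2, car=2, green=1, red=1
print(counts["cat2"])  # the=1, car=1, black=1
```

This is fine for toy data, but for a large index it means pulling every document out of Solr, which is what a server-side answer would avoid.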

Thanks,

David

Re: Aggregate word counts over a subset of documents

Posted by David Larochelle <dl...@cyber.law.harvard.edu>.
Jason,

Thanks so much for your suggestion. This seems to do what I need.

--

David


Re: Aggregate word counts over a subset of documents

Posted by Jason Hellman <jh...@innoventsolutions.com>.
David,

A pivot facet may provide these results with the following syntax:

facet.pivot=category,includes

We would presume that includes is a tokenized field, and thus a set of facet values would be rendered from the terms resulting from that tokenization. These would be nested under each category, and, of course, the entire set of documents considered for these facets is constrained by the current query.
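A request using this parameter can be sketched as follows; the host, port, and core name ("collection1") are assumptions to adjust for your installation:

```python
from urllib.parse import urlencode

# Build a pivot-facet request URL against a hypothetical local Solr core.
params = urlencode({
    "q": "*:*",          # match all documents (or any constraining query)
    "rows": 0,           # facet counts only; no document results needed
    "facet": "true",
    "facet.pivot": "category,includes",  # nest includes-term counts under category
})
url = "http://localhost:8983/solr/collection1/select?" + params
print(url)
```

The pivot response nests the counts of each term in includes under each category value, which is the shape of output asked for above.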

I think this maps to your requirement.

Jason
