Posted to solr-user@lucene.apache.org by David Larochelle <dl...@cyber.law.harvard.edu> on 2013/05/23 03:49:32 UTC

Fast faceting over large number of distinct terms

I'm trying to quickly obtain cumulative word frequency counts over all
documents matching a particular query.

I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5 GB
and has around 350,000 documents.

My schema includes the following fields:

<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false" />
<field name="media_id" type="int" indexed="true" stored="true"
required="true" multiValued="false" />
<field name="story_text"  type="text_general" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true" />


story_text is used to store free-form text obtained by crawling
newspapers and blogs.

Running faceted searches with the fc or fcs methods fails with the error
"Too many values for UnInvertedField faceting on field story_text"
http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

Running faceted search with the 'enum' method succeeds but takes a very
long time.
http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

The frustrating thing is that even if the query only returns a few
hundred documents, it still takes 10 minutes or longer to get the
cumulative word count results.

Eventually we're hoping to build a system that will return results in a few
seconds and scale to hundreds of millions of documents.
Is there any way to get this level of performance out of Solr/Lucene?

Thanks,

David

Re: Fast faceting over large number of distinct terms

Posted by David Larochelle <dl...@cyber.law.harvard.edu>.
Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Only
including the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
returning counts for terms in the top N documents matching a particular
query.

Also, is there an efficient way to add up term counts on the client side?
I thought of using the TermVectorComponent to get document-level frequency
counts and then using something like Hadoop to add them up. However, I
couldn't find any documentation on using the results of a Solr query to
feed a MapReduce operation.
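
(For concreteness, here's a rough sketch of the kind of client-side
aggregation I have in mind, assuming a term vector request handler at
/tvrh and JSON output with json.nl=map; the handler path, field name,
and exact response shape are assumptions and will vary by setup:)

# Rough sketch: sum per-document tf counts from the TermVectorComponent
# into collection-wide counts for the documents matching a query.
# Assumes /tvrh exposes the TermVectorComponent and that wt=json with
# json.nl=map yields nested dicts; the response shape differs between
# Solr versions, so treat the parsing as approximate.
from collections import Counter
import requests

SOLR = "http://localhost:8983/solr"  # placeholder base URL

def aggregate_term_counts(query, rows=200, field="story_text"):
    params = {
        "q": query,
        "rows": rows,
        "tv": "true",
        "tv.tf": "true",
        "tv.fl": field,
        "wt": "json",
        "json.nl": "map",
    }
    resp = requests.get(SOLR + "/tvrh", params=params).json()
    totals = Counter()
    for key, doc in resp.get("termVectors", {}).items():
        if not isinstance(doc, dict):
            continue  # skip metadata entries such as uniqueKeyFieldName
        for term, info in doc.get(field, {}).items():
            totals[term] += info.get("tf", 0)
    return totals

# e.g. aggregate_term_counts("story_text:iraq").most_common(100)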

--

David


On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Here's a possibility:
>
> At index time extract important terms (and/or phrases) from this
> story_text and store top N of them in a separate field (which will be
> much smaller/shorter).  Then facet on that.  Or just retrieve it and
> manually parse and count in the client if that turns out to be faster.
> I did this in the previous decade before Solr was available and it
> worked well.  I limited my counting to top N (200?) hits.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Wed, May 22, 2013 at 10:54 PM, David Larochelle
> <dl...@cyber.law.harvard.edu> wrote:
> > The goal of the system is to obtain data that can be used to generate
> word
> > clouds so that users can quickly get a sense of the aggregate contents of
> > all documents matching a particular query. For example, a user might want
> > to see a word cloud of all documents discussing 'Iraq' in a particular
> new
> > papers.
> >
> > Faceting on story_text gives counts of individual words rather than
> entire
> > text strings. I think this is because of the tokenization that happens
> > automatically as part of the text_general type. I'm happy to look at
> > alternatives to faceting but I wasn't able to find one that
> > provided aggregate word counts for just the documents matching a
> particular
> > query rather than an individual documents  or the entire index.
> >
> > --
> >
> > David
> >
> >
> > On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
> > brendan.grainger@gmail.com> wrote:
> >
> >> Hi David,
> >>
> >> Out of interest, what are you trying to accomplish by faceting over the
> >> story_text field? Is it generally the case that the story_text field
> will
> >> contain values that are repeated or categorize your documents somehow?
> >>  From your description: "story_text is used to store free form text
> >> obtained by crawling new papers and blogs", it doesn't seem that way, so
> >> I'm not sure faceting is what you want in this situation.
> >>
> >> Cheers,
> >> Brendan
> >>
> >>
> >> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
> >> dlarochelle@cyber.law.harvard.edu> wrote:
> >>
> >> > I'm trying to quickly obtain cumulative word frequency counts over all
> >> > documents matching a particular query.
> >> >
> >> > I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is
> 2.5
> >> GB
> >> > and has around ~350,000 documents.
> >> >
> >> > My schema includes the following fields:
> >> >
> >> > <field name="id" type="string" indexed="true" stored="true"
> >> required="true"
> >> > multiValued="false" />
> >> > <field name="media_id" type="int" indexed="true" stored="true"
> >> > required="true" multiValued="false" />
> >> > <field name="story_text"  type="text_general" indexed="true"
> >> stored="true"
> >> > termVectors="true" termPositions="true" termOffsets="true" />
> >> >
> >> >
> >> > story_text is used to store free form text obtained by crawling new
> >> papers
> >> > and blogs.
> >> >
> >> > Running faceted searches with the fc or fcs methods fails with the
> error
> >> > "Too many values for UnInvertedField faceting on field story_text"
> >> >
> >> >
> >>
> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >> >
> >> > Running faceted search with the 'enum' method succeeds but takes a
> very
> >> > long time.
> >> >
> >> >
> >>
> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >> > <
> >> >
> >>
> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >> > >
> >> >
> >> > The frustrating thing is even if the query only returns a few hundred
> >> > documents, it still takes 10 minutes or longer to get the cumulative
> word
> >> > count results.
> >> >
> >> > Eventually we're hoping to build a system that will return results in
> a
> >> few
> >> > seconds and scale to hundreds of millions of documents.
> >> > Is there anyway to get this level of performance out of Solr/Lucene?
> >> >
> >> > Thanks,
> >> >
> >> > David
> >> >
> >>
> >>
> >>
> >> --
> >> Brendan Grainger
> >> www.kuripai.com
> >>
>

Re: Fast faceting over large number of distinct terms

Posted by Walter Underwood <wu...@wunderwood.org>.
I would fetch the term vectors for the top N documents and add them up myself. You could even scale the term counts by the relevance score for the document. That would avoid problems with analyzing ten documents where only the first three were really good matches.
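
(A minimal sketch of the adding-up step, assuming you've already parsed
each top-N hit's relevance score and its term-frequency map out of the
term vector response; the names here are illustrative only:)

# Sketch of score-weighted term aggregation over the top N hits.
# Each entry pairs a document's relevance score with its term -> tf map,
# so a marginal tenth hit contributes less than a strong first hit.
from collections import Counter

def weighted_term_counts(scored_docs):
    """scored_docs: iterable of (score, {term: tf}) pairs."""
    totals = Counter()
    for score, term_freqs in scored_docs:
        for term, tf in term_freqs.items():
            totals[term] += tf * score
    return totals

# Example: a strong match outweighs a weak one.
docs = [(9.2, {"iraq": 4, "baghdad": 2}), (1.1, {"iraq": 1, "oil": 3})]
print(weighted_term_counts(docs).most_common(5))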

I did something similar in a different engine for a kNN classifier.

wunder

On May 22, 2013, at 8:12 PM, Otis Gospodnetic wrote:

> Here's a possibility:
> 
> At index time extract important terms (and/or phrases) from this
> story_text and store top N of them in a separate field (which will be
> much smaller/shorter).  Then facet on that.  Or just retrieve it and
> manually parse and count in the client if that turns out to be faster.
> I did this in the previous decade before Solr was available and it
> worked well.  I limited my counting to top N (200?) hits.
> 
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
> 
> On Wed, May 22, 2013 at 10:54 PM, David Larochelle
> <dl...@cyber.law.harvard.edu> wrote:
>> The goal of the system is to obtain data that can be used to generate word
>> clouds so that users can quickly get a sense of the aggregate contents of
>> all documents matching a particular query. For example, a user might want
>> to see a word cloud of all documents discussing 'Iraq' in a particular new
>> papers.
>> 
>> Faceting on story_text gives counts of individual words rather than entire
>> text strings. I think this is because of the tokenization that happens
>> automatically as part of the text_general type. I'm happy to look at
>> alternatives to faceting but I wasn't able to find one that
>> provided aggregate word counts for just the documents matching a particular
>> query rather than an individual documents  or the entire index.
>> 
>> --
>> 
>> David
>> 
>> 
>> On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
>> brendan.grainger@gmail.com> wrote:
>> 
>>> Hi David,
>>> 
>>> Out of interest, what are you trying to accomplish by faceting over the
>>> story_text field? Is it generally the case that the story_text field will
>>> contain values that are repeated or categorize your documents somehow?
>>> From your description: "story_text is used to store free form text
>>> obtained by crawling new papers and blogs", it doesn't seem that way, so
>>> I'm not sure faceting is what you want in this situation.
>>> 
>>> Cheers,
>>> Brendan
>>> 
>>> 
>>> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
>>> dlarochelle@cyber.law.harvard.edu> wrote:
>>> 
>>>> I'm trying to quickly obtain cumulative word frequency counts over all
>>>> documents matching a particular query.
>>>> 
>>>> I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is 2.5
>>> GB
>>>> and has around ~350,000 documents.
>>>> 
>>>> My schema includes the following fields:
>>>> 
>>>> <field name="id" type="string" indexed="true" stored="true"
>>> required="true"
>>>> multiValued="false" />
>>>> <field name="media_id" type="int" indexed="true" stored="true"
>>>> required="true" multiValued="false" />
>>>> <field name="story_text"  type="text_general" indexed="true"
>>> stored="true"
>>>> termVectors="true" termPositions="true" termOffsets="true" />
>>>> 
>>>> 
>>>> story_text is used to store free form text obtained by crawling new
>>> papers
>>>> and blogs.
>>>> 
>>>> Running faceted searches with the fc or fcs methods fails with the error
>>>> "Too many values for UnInvertedField faceting on field story_text"
>>>> 
>>>> 
>>> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>>>> 
>>>> Running faceted search with the 'enum' method succeeds but takes a very
>>>> long time.
>>>> 
>>>> 
>>> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>>>> <
>>>> 
>>> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>>>>> 
>>>> 
>>>> The frustrating thing is even if the query only returns a few hundred
>>>> documents, it still takes 10 minutes or longer to get the cumulative word
>>>> count results.
>>>> 
>>>> Eventually we're hoping to build a system that will return results in a
>>> few
>>>> seconds and scale to hundreds of millions of documents.
>>>> Is there anyway to get this level of performance out of Solr/Lucene?
>>>> 
>>>> Thanks,
>>>> 
>>>> David
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Brendan Grainger
>>> www.kuripai.com
>>> 

--
Walter Underwood
wunder@wunderwood.org




Re: Fast faceting over large number of distinct terms

Posted by Otis Gospodnetic <ot...@gmail.com>.
Here's a possibility:

At index time extract important terms (and/or phrases) from this
story_text and store top N of them in a separate field (which will be
much smaller/shorter).  Then facet on that.  Or just retrieve it and
manually parse and count in the client if that turns out to be faster.
I did this in the previous decade before Solr was available and it
worked well.  I limited my counting to top N (200?) hits.
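
(Roughly like this, as a sketch only -- "top_terms" is a placeholder
field name, and the stopword list and tokenization below are deliberately
crude; a real pipeline would reuse the same analysis you apply to
story_text:)

# Sketch of the index-time approach: keep only the top N non-stopword
# terms per document in a small multivalued field and facet on that
# instead of the full text.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def top_terms(text, n=200):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(n)]

def to_solr_doc(story_id, story_text):
    return {
        "id": story_id,
        "story_text": story_text,
        "top_terms": top_terms(story_text),  # facet on this small field
    }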

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, May 22, 2013 at 10:54 PM, David Larochelle
<dl...@cyber.law.harvard.edu> wrote:
> The goal of the system is to obtain data that can be used to generate word
> clouds so that users can quickly get a sense of the aggregate contents of
> all documents matching a particular query. For example, a user might want
> to see a word cloud of all documents discussing 'Iraq' in a particular new
> papers.
>
> Faceting on story_text gives counts of individual words rather than entire
> text strings. I think this is because of the tokenization that happens
> automatically as part of the text_general type. I'm happy to look at
> alternatives to faceting but I wasn't able to find one that
> provided aggregate word counts for just the documents matching a particular
> query rather than an individual documents  or the entire index.
>
> --
>
> David
>
>
> On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
> brendan.grainger@gmail.com> wrote:
>
>> Hi David,
>>
>> Out of interest, what are you trying to accomplish by faceting over the
>> story_text field? Is it generally the case that the story_text field will
>> contain values that are repeated or categorize your documents somehow?
>>  From your description: "story_text is used to store free form text
>> obtained by crawling new papers and blogs", it doesn't seem that way, so
>> I'm not sure faceting is what you want in this situation.
>>
>> Cheers,
>> Brendan
>>
>>
>> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
>> dlarochelle@cyber.law.harvard.edu> wrote:
>>
>> > I'm trying to quickly obtain cumulative word frequency counts over all
>> > documents matching a particular query.
>> >
>> > I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is 2.5
>> GB
>> > and has around ~350,000 documents.
>> >
>> > My schema includes the following fields:
>> >
>> > <field name="id" type="string" indexed="true" stored="true"
>> required="true"
>> > multiValued="false" />
>> > <field name="media_id" type="int" indexed="true" stored="true"
>> > required="true" multiValued="false" />
>> > <field name="story_text"  type="text_general" indexed="true"
>> stored="true"
>> > termVectors="true" termPositions="true" termOffsets="true" />
>> >
>> >
>> > story_text is used to store free form text obtained by crawling new
>> papers
>> > and blogs.
>> >
>> > Running faceted searches with the fc or fcs methods fails with the error
>> > "Too many values for UnInvertedField faceting on field story_text"
>> >
>> >
>> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>> >
>> > Running faceted search with the 'enum' method succeeds but takes a very
>> > long time.
>> >
>> >
>> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>> > <
>> >
>> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>> > >
>> >
>> > The frustrating thing is even if the query only returns a few hundred
>> > documents, it still takes 10 minutes or longer to get the cumulative word
>> > count results.
>> >
>> > Eventually we're hoping to build a system that will return results in a
>> few
>> > seconds and scale to hundreds of millions of documents.
>> > Is there anyway to get this level of performance out of Solr/Lucene?
>> >
>> > Thanks,
>> >
>> > David
>> >
>>
>>
>>
>> --
>> Brendan Grainger
>> www.kuripai.com
>>

Re: Fast faceting over large number of distinct terms

Posted by David Larochelle <dl...@cyber.law.harvard.edu>.
The goal of the system is to obtain data that can be used to generate word
clouds so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in particular
newspapers.

Faceting on story_text gives counts of individual words rather than entire
text strings. I think this is because of the tokenization that happens
automatically as part of the text_general type. I'm happy to look at
alternatives to faceting, but I wasn't able to find one that provided
aggregate word counts for just the documents matching a particular query
rather than for individual documents or the entire index.

--

David


On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
brendan.grainger@gmail.com> wrote:

> Hi David,
>
> Out of interest, what are you trying to accomplish by faceting over the
> story_text field? Is it generally the case that the story_text field will
> contain values that are repeated or categorize your documents somehow?
>  From your description: "story_text is used to store free form text
> obtained by crawling new papers and blogs", it doesn't seem that way, so
> I'm not sure faceting is what you want in this situation.
>
> Cheers,
> Brendan
>
>
> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
> dlarochelle@cyber.law.harvard.edu> wrote:
>
> > I'm trying to quickly obtain cumulative word frequency counts over all
> > documents matching a particular query.
> >
> > I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is 2.5
> GB
> > and has around ~350,000 documents.
> >
> > My schema includes the following fields:
> >
> > <field name="id" type="string" indexed="true" stored="true"
> required="true"
> > multiValued="false" />
> > <field name="media_id" type="int" indexed="true" stored="true"
> > required="true" multiValued="false" />
> > <field name="story_text"  type="text_general" indexed="true"
> stored="true"
> > termVectors="true" termPositions="true" termOffsets="true" />
> >
> >
> > story_text is used to store free form text obtained by crawling new
> papers
> > and blogs.
> >
> > Running faceted searches with the fc or fcs methods fails with the error
> > "Too many values for UnInvertedField faceting on field story_text"
> >
> >
> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >
> > Running faceted search with the 'enum' method succeeds but takes a very
> > long time.
> >
> >
> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> > <
> >
> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> > >
> >
> > The frustrating thing is even if the query only returns a few hundred
> > documents, it still takes 10 minutes or longer to get the cumulative word
> > count results.
> >
> > Eventually we're hoping to build a system that will return results in a
> few
> > seconds and scale to hundreds of millions of documents.
> > Is there anyway to get this level of performance out of Solr/Lucene?
> >
> > Thanks,
> >
> > David
> >
>
>
>
> --
> Brendan Grainger
> www.kuripai.com
>

Re: Fast faceting over large number of distinct terms

Posted by Brendan Grainger <br...@gmail.com>.
Hi David,

Out of interest, what are you trying to accomplish by faceting over the
story_text field? Is it generally the case that the story_text field will
contain values that are repeated or categorize your documents somehow?
From your description: "story_text is used to store free-form text
obtained by crawling newspapers and blogs", it doesn't seem that way, so
I'm not sure faceting is what you want in this situation.

Cheers,
Brendan


On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
dlarochelle@cyber.law.harvard.edu> wrote:

> I'm trying to quickly obtain cumulative word frequency counts over all
> documents matching a particular query.
>
> I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is 2.5 GB
> and has around ~350,000 documents.
>
> My schema includes the following fields:
>
> <field name="id" type="string" indexed="true" stored="true" required="true"
> multiValued="false" />
> <field name="media_id" type="int" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="story_text"  type="text_general" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true" />
>
>
> story_text is used to store free form text obtained by crawling new papers
> and blogs.
>
> Running faceted searches with the fc or fcs methods fails with the error
> "Too many values for UnInvertedField faceting on field story_text"
>
> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>
> Running faceted search with the 'enum' method succeeds but takes a very
> long time.
>
> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> <
> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >
>
> The frustrating thing is even if the query only returns a few hundred
> documents, it still takes 10 minutes or longer to get the cumulative word
> count results.
>
> Eventually we're hoping to build a system that will return results in a few
> seconds and scale to hundreds of millions of documents.
> Is there anyway to get this level of performance out of Solr/Lucene?
>
> Thanks,
>
> David
>



-- 
Brendan Grainger
www.kuripai.com