Posted to solr-user@lucene.apache.org by solr user <mv...@gmail.com> on 2012/01/20 22:06:33 UTC

Getting a word count frequency out of a page field

SOLR reports the term occurrence for terms over all the documents. I am
having trouble making a query that returns the term occurrence in a
specific page field called documentPageId.

I don't know how to issue a proper SOLR query that returns a word count for
a paragraph of text such as the term "amplifier" for a field. For some
reason it only returns.

Everything I've tried returns a count of only 1 for the term, even though
the term appears in the paragraph more than once.

I've tried faceting on the field "contents":

http://localhost:8983/solr/select?indent=on&q=*:*&wt=standard&facet=on&facet.field=documentPageId&facet.query=amplifier&facet.sort=lex&facet.missing=on&facet.method=count

<lst name="facet_counts">
<lst name="facet_queries">
<int name="amplifier">21</int>
</lst>
<lst name="facet_fields">
<lst name="documentPageId">
<int name="49667.1">1</int>
<int name="49667.10">1</int>
<int name="49667.11">1</int>
<int name="49667.12">1</int>
<int name="49667.13">1</int>
<int name="49667.14">1</int>
<int name="49667.15">1</int>
<int name="49667.16">1</int>
<int name="49667.17">1</int>
<int name="49667.18">1</int>
<int name="49667.19">1</int>
<int name="49667.2">1</int>
<int name="49667.20">1</int>
<int name="49667.21">1</int>
<int name="49667.3">1</int>
<int name="49667.4">1</int>
<int name="49667.5">1</int>
<int name="49667.6">1</int>
<int name="49667.7">1</int>
<int name="49667.8">1</int>
<int name="49667.9">1</int>
<int name="49670.1">1</int>
<int name="49670.2">1</int>
<int name="49670.3">1</int>
<int name="49670.4">1</int>
<int name="49677.1">1</int>
<int name="49677.2">1</int>
<int name="49677.3">1</int>
<int>0</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>


In schema.xml:
 <field name="contents" type="bucketFirstLetter" stored="true"
indexed="true" />
 <field name="documentPageId" type="string" indexed="true" stored="true"
multiValued="false"/>

In solrconfig.xml:

       <str name="facet.field">filewrapper</str>
       <str name="facet.field">caseNumber</str>
       <str name="facet.field">pageNumber</str>
       <str name="facet.field">documentId</str>
       <str name="facet.field">contents</str>
       <str name="facet.query">documentId</str>
       <str name="facet.query">caseNumber</str>
       <str name="facet.query">pageNumber</str>
       <str name="facet.field">documentPageId</str>
       <str name="facet.query">contents</str>

Thanks in advance,

Re: Getting a word count frequency out of a page field

Posted by solr user <mv...@gmail.com>.
Thanks for the article.

I am indexing each page of a document as if it were a document.

I think the answer is to configure SOLR for use of the TermVector Component:
 http://wiki.apache.org/solr/TermVectorComponent

I have not tried it yet, but someone on a StackExchange forum suggested I
try it.
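
A rough sketch of what such a request might look like, written in Python. It
rests on assumptions that are not confirmed anywhere in this thread: that the
"contents" field is declared with termVectors="true" in schema.xml, that a
request handler with the TermVectorComponent is registered at the path /tvrh,
and that a page is selected by its documentPageId. The tv, tv.tf and tv.fl
parameters are the switches described on the wiki page; the helper name and
the base URL are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL: the host and port come from the thread, the rest is a guess.
SOLR_BASE = "http://localhost:8983/solr"


def term_vector_url(page_id, field="contents", handler="/tvrh"):
    """Build a URL asking for per-term frequencies of one page document.

    The handler path is an assumption; older configurations may instead
    select the handler with a qt parameter on /select.
    """
    params = {
        "q": 'documentPageId:"%s"' % page_id,  # restrict to a single page
        "fl": "documentPageId",                # keep the normal result small
        "tv": "true",                          # switch the component on
        "tv.tf": "true",                       # return term frequency per term
        "tv.fl": field,                        # field whose vectors we want
        "indent": "on",
    }
    return SOLR_BASE + handler + "?" + urlencode(params)


url = term_vector_url("49667.1")
print(url)
```

If the assumptions hold, the response should list each term of "contents" for
that page together with its frequency, which is the number I am after.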

-Melanie

On Sun, Jan 22, 2012 at 8:56 PM, Erick Erickson <er...@gmail.com> wrote:

> Here's Hoss' XY problem writeup:
> http://people.apache.org/~hossman/#xyproblem
> but this doesn't appear to be that.
>
> There's no way out of the box that I know of to do what you want. It starts
> with the fact that Solr has no clue what a page is in the first place. Or
> a paragraph. Or a sentence. So you're really on your own here....
> Solr only knows about *documents*. If each document is a page,
> you can do some stuff with term frequencies etc. But for a larger
> document you'll be getting into some pretty low-level analysis
> of the data to accomplish this.
>
> Sorry I can't be more help.
> Erick

Re: Getting a word count frequency out of a page field

Posted by solr user <mv...@gmail.com>.
See comments inline below.

On Sun, Jan 22, 2012 at 8:27 PM, Erick Erickson <er...@gmail.com> wrote:

> Faceting won't work at all. Its function is to return the count
> of the *documents* that a value occurs in, so that's no good
> for your use case.
>
> "I don't know how to issue a proper SOLR query that returns a word count
> for a paragraph of text such as the term "amplifier" for a field. For some
> reason it only returns."
>
> This is really unclear. Are you asking for the word counts of a paragraph
> that contains "amplifier"? The number of times "amplifier" appears in
> a paragraph? In a document?
>

I'm looking for the number of times a word or term appears in a paragraph
that I'm indexing in the field "contents". The field is both stored and
indexed, and its text contains multiple occurrences of the term. However,
when I query for that term, the result reports that it appears only once
in "contents".


>
> And why do you want this information anyway? It might be an XY problem.
>

I want to be able to search for word frequency on each page of a document
that has many pages, so I can report to the user that, for example, the term
occurred 10 times on page 1. The user can then click on the result and go
right to the page where the term appears most frequently.
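
To make the goal concrete, here is a small Python sketch of the reporting
step only. It assumes the per-page counts have already been obtained somehow;
the counts below are invented sample data, and the function name is
hypothetical.

```python
def pages_by_frequency(counts):
    """Return (page_id, count) pairs, highest count first.

    Ties are broken by document number and then page number, both compared
    as integers, so "49667.2" sorts before "49667.10".
    """
    def sort_key(item):
        page_id, count = item
        doc, page = page_id.split(".")
        return (-count, int(doc), int(page))

    return sorted(counts.items(), key=sort_key)


# Invented sample data: occurrences of one term per documentPageId.
sample_counts = {"49667.1": 10, "49667.2": 3, "49667.10": 3, "49670.1": 7}

ranked = pages_by_frequency(sample_counts)
best_page, best_count = ranked[0]
print("Most occurrences: page %s (%d times)" % (best_page, best_count))
```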

What do you mean by an XY problem?



>
> Best
> Erick

Re: Getting a word count frequency out of a page field

Posted by Erick Erickson <er...@gmail.com>.
Faceting won't work at all. Its function is to return the count
of the *documents* that a value occurs in, so that's no good
for your use case.
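
The distinction can be illustrated outside Solr with a few lines of Python.
This is only a toy model with made-up page texts and a naive whitespace
tokenizer, not a description of how Solr analyzes text. A facet-style count
answers "how many documents contain the term", while a term frequency answers
"how many times does the term occur in this one document".

```python
from collections import Counter

# Made-up pages, each treated as its own document and keyed by documentPageId.
pages = {
    "49667.1": "the amplifier drives the second amplifier stage",
    "49667.2": "each amplifier has a gain control",
    "49667.3": "the power supply is described here",
}


def tokenize(text):
    # Naive stand-in for a real analyzer: lowercase and split on whitespace.
    return text.lower().split()


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for text in pages.values() if term in tokenize(text))


def term_frequency(term, page_id):
    """Number of occurrences of the term within a single document."""
    return Counter(tokenize(pages[page_id]))[term]


print(document_frequency("amplifier"))
print(term_frequency("amplifier", "49667.1"))
```

In this model each page contributes at most 1 to the document count no matter
how often the term is repeated on it, which matches the 1s in the facet
output.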

"I don't know how to issue a proper SOLR query that returns a word count for
a paragraph of text such as the term "amplifier" for a field. For some
reason it only returns."

This is really unclear. Are you asking for the word counts of a paragraph
that contains "amplifier"? The number of times "amplifier" appears in
a paragraph? In a document?

And why do you want this information anyway? It might be an XY problem.

Best
Erick
