Posted to solr-user@lucene.apache.org by darul <da...@gmail.com> on 2011/12/14 09:26:11 UTC

Copy in multivalued field and faceting

Hello,

The field for this scenario is "Title", and it contains several words.

For a specific query, I would like to get the top ten words by frequency in a
specific field.

My idea was the following:

- Title in my schema is stored/indexed in a specific field.
- A copyField copies the Title field content into a multivalued field. If my
multivalued field uses a tokenizer that splits words, does each word end up
as a separate value of the multivalued field?
- If so, by faceting on this multivalued field I will get the top ten words,
correct?

Example:

1) Title : this is my title
2) CopyField Title to specific multivalue field F1
3) F1 contains : {this, is, my, title}

Sorry for my English.

Thanks,

Jul

--
View this message in context: http://lucene.472066.n3.nabble.com/Copy-in-multivalued-field-and-faceting-tp3584819p3584819.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Copy in multivalued field and faceting

Posted by darul <da...@gmail.com>.

The first case you mentioned is the one I am looking for. I do not want the
top terms across the whole index, but the top terms for a specific query's
result set.

Faceting on my field appears to be the only way to get the top terms for the
documents that match the query.

Thanks for pointing out the LukeRequestHandler and TermsComponent features.
I knew about them too, but they do not meet my needs.

Re: Copy in multivalued field and faceting

Posted by Tomás Fernández Löbbe <to...@gmail.com>.

Hi Darul, it actually depends on whether you want the top terms in the
documents that hit a query (in which case you'll need something like the
faceting approach you mention) or the top terms for the field in general,
regardless of any specific query. In the latter case the easiest way to go
is the TermsComponent. It will be much more efficient, but the use case is
different. See http://wiki.apache.org/solr/TermsComponent
You'll see an example of its use in the default solrconfig.xml, under a
request handler called "/terms".
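
A minimal sketch of what such a request might look like, built with Python's standard library. The host, port and field name "title" are assumptions for illustration, and the "/terms" handler is assumed to be configured as in the default solrconfig.xml. The parameters terms, terms.fl, terms.limit and terms.sort are the ones described on the TermsComponent wiki page.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr"


def terms_url(field, limit=10, base=SOLR_BASE):
    """Build a /terms request for the most frequent terms in `field`.

    The counts are index-wide: no query restricts the documents considered.
    """
    params = [
        ("terms", "true"),
        ("terms.fl", field),          # field to read terms from
        ("terms.limit", str(limit)),  # how many terms to return
        ("terms.sort", "count"),      # most frequent first
        ("wt", "json"),
    ]
    return base + "/terms?" + urlencode(params)


print(terms_url("title"))
```

Note that nothing in this request carries a query, which is why it only fits the "top terms for the field in general" case.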

Re: Copy in multivalued field and faceting

Posted by darul <da...@gmail.com>.

Thank you all for the advice. You are obviously right that there is no need
for any copyField instructions to get what we expect.

I will do some tests using faceting or the LukeRequestHandler, which seem
much more useful in my case.

Re: Copy in multivalued field and faceting

Posted by Ahmet Arslan <io...@yahoo.com>.

> I read the documentation for "facet.sort=count", which seems to
> return the facets ordered by document hit count.
>
> So, suppose one doc has the title "value1 value2 value3" and
> another doc has the title "value2 value4 value5", and we use
> WhitespaceTokenizer (whether the field is single-valued or
> multivalued). Do we get the facet results as:
> "value2" - 2 docs
> "value1" - 1 doc
> "value3" - 1 doc
> "value4" - 1 doc
> "value5" - 1 doc
>
> Is this a way to get the top words? Does it carry a high
> performance cost?

Consider using http://wiki.apache.org/solr/LukeRequestHandler for top terms. Faceting is meant more for 'drilling down' into the search result set.

Re: Copy in multivalued field and faceting

Posted by yunfei wu <yu...@gmail.com>.

Hi, Erick,

I am interested in this topic too, so I would like to ask a further question
based on Jul's topic.

I read the documentation for "facet.sort=count", which seems to return the
facets ordered by document hit count.

So, suppose one doc has the title "value1 value2 value3" and another doc has
the title "value2 value4 value5", and we use WhitespaceTokenizer (whether the
field is single-valued or multivalued). Do we get the facet results as:
"value2" - 2 docs
"value1" - 1 doc
"value3" - 1 doc
"value4" - 1 doc
"value5" - 1 doc

Is this a way to get the top words? Does it carry a high performance cost?

Thanks,
Yunfei
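
The counting described in the question can be modelled in a few lines of Python. This is only a sketch of the idea, not Solr's implementation: it assumes whitespace tokenization, counts each term once per matching document, and breaks ties alphabetically as a simplification. The function name facet_counts is made up for this illustration.

```python
from collections import Counter


def facet_counts(docs, limit=10):
    """Return (term, doc_count) pairs, highest count first.

    Models facet.sort=count with facet.limit=limit on a whitespace-tokenized
    field. Ties are broken alphabetically, which is a simplification.
    """
    counts = Counter()
    for text in docs:
        # set() so a term repeated inside one document is counted once:
        # facet counts are numbers of documents, not numbers of occurrences.
        counts.update(set(text.split()))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# The two titles from the question, taken as the documents matching a query.
matching_docs = ["value1 value2 value3", "value2 value4 value5"]
for term, n in facet_counts(matching_docs):
    print(term, n)
```

Only "value2" appears in both documents, so it ranks first with a count of 2, and the other four terms each get a count of 1.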



Re: Copy in multivalued field and faceting

Posted by Erick Erickson <er...@gmail.com>.

I don't quite understand what you're trying to do. MultiValued is
a bit misleading. All it means is that you can add the same
field multiple times to a document, i.e. (XML example)
<doc>
  <field name="field">value1 value2 value3</field>
  <field name="field">value4 value5 value6</field>
</doc>

will succeed if "field" is multiValued and fail if not.

This will work if "field" is NOT multiValued:
<doc>
  <field name="field">value1 value2 value3 value4 value5 value6</field>
</doc>

and, assuming WhitespaceTokenizer, the field "field" will contain
the exact same tokens. The only difference *might* be the
offsets, but don't worry about that quite yet, all it would really
affect is phrase queries.
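
To make the phrase-query remark concrete, here is a rough Python model of token positions. It is a simplification under stated assumptions: whitespace tokenization, and a gap of 100 positions between the values of a multiValued field, in the spirit of the positionIncrementGap attribute that Solr field types can declare. The function name and the exact arithmetic are illustrative only.

```python
def positions(values, gap=0):
    """Assign a position to each whitespace token across a field's values.

    `gap` stands in for positionIncrementGap: extra positions inserted
    between consecutive values of a multiValued field.
    """
    out = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # jump ahead before the next value starts
        for token in value.split():
            pos += 1
            out.append((token, pos))
    return out


single = positions(["value1 value2 value3 value4 value5 value6"])
multi = positions(["value1 value2 value3", "value4 value5 value6"], gap=100)

# The tokens are identical in both cases.
print([t for t, _ in single] == [t for t, _ in multi])

# Distance between value3 and value4: adjacent in the single value,
# far apart when they belong to different values.
print(dict(single)["value4"] - dict(single)["value3"])
print(dict(multi)["value4"] - dict(multi)["value3"])
```

In this model a phrase query for "value3 value4" would match the single-valued document but not the multiValued one, while term-level operations such as faceting see the same tokens either way.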

With that as a preface, I don't see why copyField has anything
to do with your problem, you'd get the same results faceting
on the title field, assuming identical analyzer chains.

Faceting on a text field is iffy, it can be quite expensive. What you'd
get in the end, though, is a list of the top words in your corpus for
that field counted from the documents that satisfied the query. Which
sounds like what you're after.

Best
Erick
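
A hedged sketch of the request this amounts to, using only Python's standard library. The host, port, "/select" handler path, field name "title" and the example query are assumptions for illustration. The facet, facet.field, facet.limit, facet.sort and facet.mincount parameters are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr"


def top_terms_facet_url(query, field, limit=10, base=SOLR_BASE):
    """Build a select request that facets on `field` for documents matching `query`."""
    params = [
        ("q", query),
        ("rows", "0"),                # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),       # facet directly on the tokenized field
        ("facet.limit", str(limit)),  # top N terms
        ("facet.sort", "count"),      # highest document count first
        ("facet.mincount", "1"),      # skip terms absent from the result set
        ("wt", "json"),
    ]
    return base + "/select?" + urlencode(params)


print(top_terms_facet_url("title:solr", "title"))
```

No copyField appears anywhere in this request: the facet runs on the title field itself, and the cost caveat about faceting on text fields still applies.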

Re: Copy in multivalued field and faceting

Posted by yunfei wu <yu...@gmail.com>.

That sounds workable: choose the tokenizer carefully, then use the
facet.sort and facet.limit parameters when faceting.

Let's see whether any experts comment on this one.

Yunfei

