Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2010/10/27 03:43:36 UTC

Multiple Word Facets

All,
I am new to Solr faceting and stuck on how to get multiple-word
facets returned from a standard Solr query. See below for what is
currently being returned.

<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="title">
<int name="Federal">89</int>
<int name="EFLHD">87</int>
<int name="Eastern">87</int>
<int name="Lands">87</int>
<int name="Highways">84</int>
<int name="FHWA">60</int>
<int name="Transportation">32</int>
<int name="GIS">22</int>
<int name="Planning">19</int>
<int name="Asset">15</int>
<int name="Environment">15</int>
<int name="Management">14</int>
<int name="Realty">12</int>
<int name="Highway">11</int>
<int name="HEP">10</int>
<int name="Program">9</int>
<int name="HEPGIS">7</int>
<int name="Resources">7</int>
<int name="Roads">7</int>
<int name="EEI">6</int>
<int name="Environmental">6</int>
<int name="Right">6</int>
<int name="Way">6</int>
...etc...

Many of the terms in there belong to 2- or 3-word phrases. For
example, "Eastern Federal Lands Highway Division" gets broken down
into its individual words. I've seen quite a few websites that do
what I am trying to do here, so any suggestions at this point would
be great. See my schema below (copied from the example schema).

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Similar for type="query". Please advise on how to group or cluster
document terms so that they can be used as facets.

Many thanks in advance,
Adam Estrada

Re: Multiple Word Facets

Posted by Pradeep Singh <pk...@gmail.com>.

Use this field type -

    <fieldType name="facetField" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>


Re: Multiple Word Facets

Posted by Ken Krugler <kk...@transpac.com>.

On Oct 27, 2010, at 6:29am, Adam Estrada wrote:

> Ahhh...I see! I am doing my testing crawling a couple websites using
> Nutch and in doing so I am assigning my facets to the title field
> which is type=text. Are you saying that I will need to manually
> generate the content for my facet field? I can see the reason and need
> for doing it that way but I really need for my faceting to happen
> dynamically based on the content in the field which in this case is
> the title of a URL.

You would use copyField to copy the contents of the title field into a
new field that uses the string type, and facet on that new field.
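
A minimal sketch of that setup, for illustration only. The field name
"title_facet" is hypothetical, and the sketch assumes the "string" field
type (solr.StrField) from the example schema is still defined and that
"title" is the single-valued field being crawled:

```xml
<!-- Hypothetical facet field; "title_facet" is an illustrative name. -->
<!-- The "string" type is assumed to be solr.StrField, which is not tokenized. -->
<field name="title_facet" type="string" indexed="true" stored="false"/>

<!-- Copy the raw title value into the untokenized field at index time. -->
<copyField source="title" dest="title_facet"/>
```

Searches would still run against "title", while facet requests would name
"title_facet". Documents indexed before the schema change would need to be
reindexed before the new field has any terms to facet on.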

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Multiple Word Facets

Posted by Adam Estrada <es...@gmail.com>.

Ahhh...I see! I am testing by crawling a couple of websites with
Nutch, and I am faceting on the title field, which is type=text. Are
you saying that I will need to manually generate the content for my
facet field? I can see the reason for doing it that way, but I really
need my faceting to happen dynamically based on the content of the
field, which in this case is the title of a URL.

Thanks again for all the tips on getting this working for me.

Adam


Re: Multiple Word Facets

Posted by Jayendra Patil <ja...@gmail.com>.

The shingle filter combines the words in a sentence into terms of 2 or 3
words.

For the faceting field you should use:
<field name="facet_field" type="string" indexed="true" stored="true"
multiValued="true"/>

The type of the field should be string so that it is not tokenised at all.


Re: Multiple Word Facets

Posted by Adam Estrada <es...@gmail.com>.

Thanks guys, the solr.ShingleFilterFactory did work to get me
multiple-word facets, but now I am seeing some redundancy in the facet
counts. See below...

Highway (62)
Highway System (59)
National (59)
National Highway (59)
National Highway System (59)
System (59)

See what's going on here? How can I make my multi-token facets smarter
so that the tokens aren't duplicated?

Thanks in advance,
Adam


Re: Multiple Word Facets

Posted by Ahmet Arslan <io...@yahoo.com>.

Facets are generated from indexed terms.

Depending on your need/use-case: 

You can use an additional, separate String field (which is not tokenized) for facets and populate it via copyField. Search on the tokenized field and facet on the non-tokenized field.

Or

You can add solr.ShingleFilterFactory to your index analyzer to form multiple word terms.
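
A rough sketch of the shingle option, not a tested configuration. The
field type name "text_shingle" is hypothetical; maxShingleSize and
outputUnigrams are attributes of solr.ShingleFilterFactory. A single
untyped analyzer is used here on the assumption that the same chain is
acceptable at both index and query time:

```xml
<!-- Hypothetical field type name; illustrative only. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- maxShingleSize="3" emits 2- and 3-word terms.
         outputUnigrams="false" drops the single-word terms. -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Every shingle is indexed as its own term, so a title containing "National
Highway System" would still contribute to "National Highway" and "Highway
System" as well as the full phrase. If only whole titles are wanted as
facet values, the untokenized String field is likely the better fit.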
