Posted to solr-user@lucene.apache.org by Andre Basse <AB...@theage.com.au> on 2006/09/14 03:19:04 UTC

Faceted Searching problems

Hi all,
 
I just installed the nightly build to try the Faceted Searching. After some testing I discovered that some characters are missing from the facet values in the result XML, and that field values containing "/" are sometimes split into two entries.
 
Example:
<int name="franc">1</int> should be France
<int name="culturefestiv">1</int> should be Culture/Festivals

Please find details below.
 
Original XML
=========
 
<str name="section">Metro</str>
 
<arr name="classification">
<str>Culture/Film</str>
<str>Culture/Festivals</str>
</arr>

<arr name="geoloc">
<str>France</str>
<str>Sydney</str>
</arr>
 
 
 
SOLR response for the query 
=====================
(http://192.168.157.128:8983/solr/select/?q=Bellucci&rows=0&facet=true&facet.limit=5&facet.field=section&facet.field=geoloc&facet.field=classification)
 
<response>
  <responseHeader>
    <status>0</status>
    <QTime>518</QTime>
  </responseHeader>
  <result numFound="2" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="section">
        <int name="metro">2</int>
        <int name="busi">0</int>
        <int name="career">0</int>
        <int name="comput">0</int>
        <int name="domain">0</int>
      </lst>
      <lst name="geoloc">
        <int name="franc">1</int>
        <int name="sydney">1</int>
        <int name="act">0</int>
        <int name="adelaid">0</int>
        <int name="afghanistan">0</int>
      </lst>
      <lst name="classification">
        <int name="cultur">1</int>
        <int name="culturefestiv">1</int>
        <int name="culturefilm">1</int>
        <int name="festiv">1</int>
        <int name="film">1</int>
      </lst>
    </lst>
  </lst>
</response>
 
 
Any help is much appreciated!
 
 
Thanks,
 
Andre
 
 
 


*********************************************************************************
The information contained in this e-mail message and any accompanying files is or may be confidential.  If you are not the intended recipient, any use, dissemination, reliance, forwarding, printing or copying of this e-mail or any attached files is unauthorised. This e-mail is subject to copyright. No part of it should be reproduced, adapted or communicated without the written consent of the copyright owner. If you have received this e-mail in error, please advise the sender immediately by return e-mail, or telephone and delete all copies. Fairfax does not guarantee the accuracy or completeness of any information contained in this e-mail or attached files. Internet communications are not secure, therefore Fairfax does not accept legal responsibility for the contents of this message or attached files.
*********************************************************************************


Re: Faceted Searching problems

Posted by Yonik Seeley <yo...@apache.org>.
On 9/13/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> You need to use an untokenized field for facets.

At least 3 answers in 5 minutes... we should try synchronized swimming ;-)

-Yonik

Re: Faceted Searching problems

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
You need to use an untokenized field for facets.  I can see we're
going to get this question frequently now - it was mentioned earlier
today in fact.  You can use a <copyField> to copy the value into a
second, untokenized field, so that you have one field for searching
and one for facets.

You are obviously using a stemming analyzer, and that is why France  
became franc, etc - just to explain why you are seeing those terms  
listed.

	Erik




Re: Faceted Searching problems

Posted by Yonik Seeley <yo...@apache.org>.
On 9/13/06, Andre Basse <AB...@theage.com.au> wrote:
> Example:
> <int name="franc">1</int> should be France
> <int name="culturefestiv">1</int> should be Culture/Festivals

Hi Andre,

Field faceting works over the indexed terms... so you get back what
was indexed (word splitting, lowercasing, stemming, etc...  the
process is not generally reversible).

Perhaps your "classification" field should be of type "string", which is
indexed but not analyzed at all.  If you need some analysis (for example,
if you also want a query of "Festival" to match against
"Culture/Festivals"), then you should index the field again as a
non-tokenized (non-analyzed) "string" type.  This can easily be done
with an extra field definition and a copyField statement in
schema.xml.

-Yonik
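
A minimal sketch of the schema.xml change described above, assuming the
example schema's "text" (analyzed) and "string" (untokenized) field types
are available. The *_facet field names are hypothetical; only
"classification" and "geoloc" come from the original post.

```xml
<!-- schema.xml fragment (sketch only). "text" and "string" are assumed to be
     the analyzed and untokenized field types from the example schema. -->
<fields>
  <!-- Analyzed fields, used for searching, so that a query for "Festival"
       can still match "Culture/Festivals". -->
  <field name="classification" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="geoloc" type="text" indexed="true" stored="true" multiValued="true"/>

  <!-- Hypothetical untokenized copies, used only for faceting. -->
  <field name="classification_facet" type="string" indexed="true" stored="false" multiValued="true"/>
  <field name="geoloc_facet" type="string" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Copy the original value, before any analysis, into the facet fields. -->
<copyField source="classification" dest="classification_facet"/>
<copyField source="geoloc" dest="geoloc_facet"/>
```

With something like this in place, the request would name the untokenized
fields (facet.field=classification_facet, facet.field=geoloc_facet), and
existing documents would need to be reindexed before the new fields are
populated.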


Re: Faceted Searching problems

Posted by Yonik Seeley <yo...@apache.org>.
On 9/13/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> Would it ever make sense to generate facets on a tokenized field?
> Maybe the facet implementation could throw an error if the field name
> specified is tokenized?

I think it probably can make sense...
 - finding top terms in a full-text field that match a query could be useful
 - the analysis could just be for normalization, e.g. trimming
whitespace
 - it allows more flexibility on how to represent tags... one may
already have tags in a whitespace delimited field rather than separate
values in a multi-valued field.

-Yonik
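
As an illustration of the last two points, here is a sketch of a field type
whose analysis only splits on whitespace and lowercases, so that each tag in
a whitespace-delimited value becomes its own facet term. The names "tags_ws"
and "tags" are hypothetical; the tokenizer and filter factories are the
standard Solr ones.

```xml
<!-- schema.xml fragment (sketch only). Newer schemas spell the element
     "fieldType"; "tags_ws" and "tags" are made-up names. -->
<types>
  <fieldtype name="tags_ws" class="solr.TextField">
    <analyzer>
      <!-- Split on whitespace only, so "Culture/Film" stays one term. -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Normalize case; no stemming, so "France" becomes "france". -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>
</types>

<fields>
  <field name="tags" type="tags_ws" indexed="true" stored="true"/>
</fields>
```

Faceting on such a field still returns the indexed (lowercased) terms rather
than the original text, which is acceptable only when that normalization is
what you want to display.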

Re: Faceted Searching problems

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 13, 2006, at 9:37 PM, Chris Hostetter wrote:
> http://www.nabble.com/Error-in-faceted-browsing-tf2267819.html
>
> ...i'll try to update the docs for facet.field to make this more  
> obvious.

Would it ever make sense to generate facets on a tokenized field?   
Maybe the facet implementation could throw an error if the field name  
specified is tokenized?

	Erik


Re: Faceted Searching problems

Posted by Chris Hostetter <ho...@fucit.org>.
: I just installed the nightly build to try the Faceted Searching . After
: some testing I discovered that some characters are missing in the result
: XML and that fields with "/" chars are sometimes split into two entries.

I believe what you are encountering is an issue of tokenization (or
analysis) ... you didn't post your schema.xml, but i'm guessing these two
fields have a datatype that is analyzed, right?  Take a look at the
followup posts in this recent thread...

http://www.nabble.com/Error-in-faceted-browsing-tf2267819.html

...i'll try to update the docs for facet.field to make this more obvious.





-Hoss