Posted to solr-user@lucene.apache.org by Christos Constantinou <ch...@simpleweb.co.uk> on 2010/08/19 15:06:50 UTC
Faceting by fields that contain special characters
Hi all,
I am running a faceted search on a Solr field that contains URLs, solely to locate duplicate URLs in my documents.
However, the Solr response I get looks like this:
public 'com' => int 492198
public 'flickr' => int 492198
public 'http' => int 492198
public 'www' => int 253881
public 'photo' => int 253843
public 'n' => int 253318
public 'httpwwwflickrcomphoto' => int 253316
public 'farm' => int 238317
public 'httpfarm' => int 238317
public 'jpg' => int 238317
public 'static' => int 238317
public 'staticflickrcom' => int 238317
public '5' => int 237939
public '00' => int 61009
public 'b' => int 59463
public 'c' => int 59094
public 'f' => int 59004
public 'd' => int 58995
public 'e' => int 58818
public 'a' => int 58327
public '08' => int 33797
public '06' => int 33341
public '04' => int 29902
public '02' => int 29224
public '2' => int 26671
public '4' => int 26613
public '6' => int 26606
public '03' => int 26506
public '1' => int 26389
public '8' => int 26384
I expected each facet key to be an entire URL, but each key is only a fragment of one. Is this because characters such as :// in http:// cannot be used in variable names? If so, is there a workaround, or an alternative way to detect duplicates?
Thanks
Christos
RE: Faceting by fields that contain special characters
Posted by Markus Jelsma <ma...@buyways.nl>.
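This is a very common issue: you need to facet on a non-analyzed field. Faceting on an analyzed (tokenized) field counts the individual indexed terms rather than whole field values, which is why the facet keys are URL fragments. It has nothing to do with which characters are allowed in variable names.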
http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-td1023699.html#a1222961
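
To illustrate the difference, here is a minimal Python sketch that only simulates the behaviour; it is not Solr code. The helper names (analyzed_terms, facet_counts) and the sample URLs are made up for this example, and the tokenizer is a rough stand-in for whatever analyzer the field actually uses. In Solr itself the usual approach is to index the URL into a field of a non-tokenized type such as string (solr.StrField), for example via copyField, and facet on that field.

```python
import re
from collections import Counter


def analyzed_terms(value):
    """Rough stand-in for a tokenizing analyzer (an assumption, not Solr's
    real analysis chain): lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def facet_counts(values, analyzed):
    """Count documents per facet key, the way field faceting would.

    analyzed=True  -> keys are the individual terms of each value
    analyzed=False -> keys are the complete, untouched values
    """
    counts = Counter()
    for value in values:
        if analyzed:
            # A document is counted once per distinct term it contains.
            counts.update(set(analyzed_terms(value)))
        else:
            counts[value] += 1
    return counts


# Hypothetical sample data: one URL per document, one of them duplicated.
urls = [
    "http://www.flickr.com/photo/1",
    "http://www.flickr.com/photo/1",
    "http://www.flickr.com/photo/2",
]

tokenized = facet_counts(urls, analyzed=True)   # keys like 'http', 'flickr'
exact = facet_counts(urls, analyzed=False)      # keys are whole URLs

# With whole URLs as keys, any count above 1 marks a duplicate.
duplicates = {url: n for url, n in exact.items() if n > 1}

if __name__ == "__main__":
    print(tokenized.most_common(3))
    print(duplicates)
```

Against a real index, the same duplicate check can be expressed by faceting on the non-analyzed field with facet.mincount=2, so that only values occurring in more than one document are returned.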