Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2010/07/05 01:32:42 UTC

Re: Bizarre Terms revisited

: Subject: Bizarre Terms revisited

To clarify for folks: this is a follow-up to this previous thread...

http://search.lucidimagination.com/search/document/c0aabe47aad1ca3c/bizarre_tfv_output#de3abb42754407d6

: Using MLT, I get terms that appear to be long concatenations of words
: that are space delimited in the original text.
: I can't think of any reason for these sentence-like terms to exist  (see
: below).

You never answered the questions I asked in the previous thread...

http://search.lucidimagination.com/search/document/c0aabe47aad1ca3c/bizarre_tfv_output#ed49cebdd92db674
>> Did you try pasting that text into the analysis page to see exactly what 
>> your "text_t" field does with it at analysis time like I suggested?
>> 
>> My best hunch is that the "spaces" are not your typical basic "space" 
>> character (hex 20) and maybe the tokenizer you are using doesn't 
>> tokenize on them, but then perhaps something like word delimiter treats 
>> them as non-word characters and chews them up.
	...
>> (Tip: if you use the JSON response writer (wt=json) when looking at the 
>> stored field value, it will help you see exactly what characters were in 
>> the original values by showing you the unicode escapes)
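
(In case it helps: on a stock 1.4.1 example install the analysis page is 
at the URL below -- the exact host/port depend on how you run the example 
app.  Paste the problem text into the index-side field value box for your 
field or field type and it will show you the tokens each step of the 
analysis chain produces.)

http://localhost:8983/solr/admin/analysis.jsp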

FWIW: I cut/pasted the text you provided...

: Original text (partially snipped) as it appears in the stored index.
: 
: "Ontreweb Product Features 
: 
:      
: 
: Unlimited mutliword and phrase matching Multiple inheritance of concepts Pluggable vocabularies, ontologies Multilingual 
: lexicons: french, english, etc. Search in one language, find results in another 200,000+ words and phrases, 35,000 mapped 
: concepts.
: 
: 1. 2. 3. 4."

...into the example/exampledocs/solr.xml file in Solr 1.4.1 using the 
field name "attr_darren" (which uses the "textgen" field type with an 
analysis chain matching what you included in your last mail).  When I 
indexed that doc (and nothing else) and looked at the list of terms 
indexed in that field using the LukeRequestHandler, I got the output below.

In short, I can't reproduce what you are describing at all ... and my best 
guess at what you are seeing (barring the possibility that this is old data 
from when the field type was something else) is that what you think is 
whitespace isn't actually a space character that the WhitespaceTokenizer 
recognizes.
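
(If you want to repeat the experiment: I added the text to one of the docs 
in example/exampledocs/solr.xml under the new field name and then indexed 
it with the stock post tool, roughly like this -- adjust the path and field 
name to taste...

  cd example/exampledocs
  java -jar post.jar solr.xml

...and then hit the LukeRequestHandler URL shown further down.)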

Look at the JSON output from your actual stored value, and verify that what 
looks like whitespace is not some funky UTF-8 character.
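
(Something like the request below should show it -- swap in your own 
uniqueKey value and field name, those are just placeholders.  A 
non-breaking space, for instance, would come back in the JSON as the 
escape \u00a0 instead of a plain space.)

http://localhost:8983/solr/select?q=id:YOUR_DOC_ID&fl=text_t&wt=json&indent=on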


http://localhost:8983/solr/admin/luke?fl=attr_darren&numTerms=1000
  
  ...
  <lst name="topTerms">
	<int name="1">1</int>
	<int name="2">1</int>
	<int name="200">1</int>

	<int name="200000">1</int>
	<int name="3">1</int>
	<int name="35">1</int>
	<int name="35000">1</int>
	<int name="4">1</int>
	<int name="another">1</int>

	<int name="concepts">1</int>
	<int name="english">1</int>
	<int name="etc">1</int>
	<int name="features">1</int>
	<int name="find">1</int>
	<int name="french">1</int>

	<int name="inheritance">1</int>
	<int name="language">1</int>
	<int name="lexicons">1</int>
	<int name="mapped">1</int>
	<int name="matching">1</int>
	<int name="multilingual">1</int>

	<int name="multiple">1</int>
	<int name="mutliword">1</int>
	<int name="one">1</int>
	<int name="ontologies">1</int>
	<int name="ontreweb">1</int>
	<int name="phrase">1</int>

	<int name="phrases">1</int>
	<int name="pluggable">1</int>
	<int name="product">1</int>
	<int name="results">1</int>
	<int name="search">1</int>
	<int name="unlimited">1</int>

	<int name="vocabularies">1</int>
	<int name="words">1</int>
	<int name="000">1</int>
  </lst>


-Hoss