Posted to solr-user@lucene.apache.org by Stephen Weiss <sw...@stylesight.com> on 2008/10/28 06:32:06 UTC

Question about textTight

Hi,

So I've been using the textTight field to hold filenames, and I've run
into a weird problem.  Basically, people want to search by part of a
filename (say, the filename is stm0810m_ws_001ftws and they want to
find everything starting with stm0810m_, i.e. stm0810m_*).  I'm hoping
someone might have done this before (I bet someone has).

Lots of things work - you can search for stm0810m_ws_001ftws and get a
result, or (stm 0810 m*), or various other combinations.  What does
not work is searching for (stm0810m_*) or (stm 0810 m_*) or anything
like that - a problem, because often they don't want things with ma_
or mx_, but just m_.  It's almost like underscores just break
everything, and escaping them does nothing.

Here's the field definition (it should be what came with my solr):

     <fieldType name="textTight" class="solr.TextField"
                positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory"
                 synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt"/>
         <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="0" generateNumberParts="0" catenateWords="1"
                 catenateNumbers="1" catenateAll="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPorterFilterFactory"
                 protected="protwords.txt"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>

and usage:

    <field name="name" type="textTight"
           indexed="true" stored="true" omitNorms="true"
           />


Now, I thought textTight would be good because it's the one best
suited for SKUs, but I guess I'm wrong.  What should I be using for
this?  Would changing any of these "generateWordParts" or
"catenateAll" options help?  I can't seem to find any documentation, so
I'm really not sure what they would do, but reindexing this whole thing
will take quite some time, so I'd rather know what will actually work
before I just start changing things.

Thanks so much for any insight!

--
Steve

Re: Question about textTight

Posted by Stephen Weiss <sw...@stylesight.com>.

OK, thanks everyone.  Since this is the only thing this field is used  
for, I think we'll just reindex without the filters and go from  
there...  Now if only I could just reindex that field!  Oh well.
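
A minimal sketch of what "without the filters" could mean here, assuming
it is simply the textTight definition from the first mail with every
<filter> element removed (this is a guess at the intent, not a confirmed
config).  Each whitespace-separated filename is then indexed verbatim,
underscores included, so a prefix query such as name:stm0810m_* has an
indexed token to match against.

    <fieldType name="textTight" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- no filters: tokens are indexed exactly as they appear -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

The trade-off is that, with no WordDelimiterFilter or LowerCaseFilter,
searches on parts of a filename such as (stm 0810 m*) would no longer
match, and case would have to be exact.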

--
Steve

On Oct 28, 2008, at 3:32 PM, Yonik Seeley wrote:

> I'm wrong: I saw the punctuation being left in for "m_*" and thought
> that the WordDelimiterFilter wasn't working.
>
> So as Todd pointed out, underscores are dropped during indexing and
> searching.  The limitation you are running into is that things like
> prefix and wildcard queries are not analyzed (so the _ won't be
> dropped).  You could set up another field for use with wildcard
> queries, or you could create separate query and index analyzers for
> textTight and set the index analyzer to use a WordDelimiterFilter that
> also indexes the original token.
>
> -Yonik

RE: Changing field datatype

Posted by "Nguyen, Joe" <jn...@automotive.com>.

Thanks for your quick reply.

What would be a reasonable way to handle this without affecting the end
users?

Create a new dynamic core with the new schema, load the documents into the
new core, then swap the cores?  For a while, two mostly identical cores
would co-exist on the Solr server; would that impact query time?

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com] 
Sent: Tuesday, October 28, 2008 1:33 Joe
To: solr-user@lucene.apache.org
Subject: Re: Changing field datatype

On Wed, Oct 29, 2008 at 1:55 AM, Nguyen, Joe <jn...@automotive.com>
wrote:

>
> 1.  If I modify datatype of a field 'foo' from string to a sint and
> restart the server, what would happen to the existing documents? And
> documents added with the new schema?  At query time (sort=foo desc),
> should I expect the documents sorted properly?

> Do I need to re-index all documents?

The fields can't be converted automatically. Therefore, a sort on foo will
still be a lexical sort instead of a numerical sort. You'll have to re-index
to have "foo desc" give a numerically non-ascending sort order.

> 2. If I add two additional fields, do I need to re-index again?

The old documents won't have any values for those fields of course but new
documents will. It is best to re-index to avoid any inconsistencies.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Changing field datatype

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Oct 29, 2008 at 1:55 AM, Nguyen, Joe <jn...@automotive.com> wrote:

>
> 1.  If I modify datatype of a field 'foo' from string to a sint and
> restart the server, what would happen to the existing documents? And
> documents added with the new schema?  At query time (sort=foo desc),
> should I expect the documents sorted properly?

> Do I need to re-index all documents?

The fields can't be converted automatically. Therefore, a sort on foo will
still be a lexical sort instead of a numerical sort. You'll have to re-index
to have "foo desc" give a numerically non-ascending sort order.

> 2. If I add two additional fields, do I need to re-index again?

The old documents won't have any values for those fields of course but new
documents will. It is best to re-index to avoid any inconsistencies.
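
To make the first answer concrete, here is a sketch of the schema.xml
change being discussed.  The sint definition follows the one in the
example schema shipped with Solr 1.x; the indexed/stored attributes on
'foo' are assumptions made for illustration.

    <!-- before: <field name="foo" type="string" indexed="true" stored="true"/> -->

    <fieldType name="sint" class="solr.SortableIntField"
               sortMissingLast="true" omitNorms="true"/>
    <field name="foo" type="sint" indexed="true" stored="true"/>

Editing the schema only changes how new documents are indexed.  The terms
already in the index are still the old string values, which is why the
lexical order (where "100" sorts before "9") remains until everything is
re-indexed.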

-- 
Regards,
Shalin Shekhar Mangar.

Changing field datatype

Posted by "Nguyen, Joe" <jn...@automotive.com>.

I have a Solr core with 2 million lengthy documents.

1.  If I change the datatype of a field 'foo' from string to sint and
restart the server, what would happen to the existing documents? And
to documents added with the new schema?  At query time (sort=foo desc),
should I expect the documents to be sorted properly?

Do I need to re-index all documents?

2. If I add two additional fields, do I need to re-index again?

Thanks.

Re: Question about textTight

Posted by Yonik Seeley <yo...@apache.org>.

I'm wrong: I saw the punctuation being left in for "m_*" and thought
that the WordDelimiterFilter wasn't working.

So as Todd pointed out, underscores are dropped during indexing and
searching.  The limitation you are running into is that things like
prefix and wildcard queries are not analyzed (so the _ won't be
dropped).  You could set up another field for use with wildcard
queries, or you could create separate query and index analyzers for
textTight and set the index analyzer to use a WordDelimiterFilter that
also indexes the original token.
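
A rough sketch of both options follows.  The names name_exact and
textTightOrig are made up for illustration, "string" is assumed to be the
solr.StrField type from the example schema, and preserveOriginal assumes
a Solr release whose WordDelimiterFilterFactory supports that attribute.
The synonym, stopword and stemming filters are left out for brevity.

    <!-- Option 1: an untokenized copy of "name" used only for
         wildcard queries, e.g. name_exact:stm0810m_* -->
    <field name="name_exact" type="string" indexed="true" stored="false"/>
    <copyField source="name" dest="name_exact"/>

    <!-- Option 2: the index-time WordDelimiterFilter also keeps the
         original token, so name:stm0810m_* has something to match -->
    <fieldType name="textTightOrig" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0"
                catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Either way a reindex is needed, and because the wildcard term itself is
not analyzed, it has to match the indexed form of the token (lowercase
in option 2, exact case in option 1).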

-Yonik

On Tue, Oct 28, 2008 at 2:31 PM, Stephen Weiss <sw...@stylesight.com> wrote:
> That's strange then.  The schema hasn't changed in well over a month, solr's
> been restarted several times since then to reload synonyms and the whole
> thing was reindexed just this past week to add in new chinese translations
> (the fields were already there but left blank).
>
> I attached the full schema if that helps.
> --
> Steve

Re: Question about textTight

Posted by Stephen Weiss <sw...@stylesight.com>.

That's strange then.  The schema hasn't changed in well over a month,
Solr's been restarted several times since then to reload synonyms, and
the whole thing was reindexed just this past week to add in new
Chinese translations (the fields were already there but left blank).

Re: Question about textTight

Posted by Yonik Seeley <yo...@apache.org>.

These query parsing results don't match the config you've posted.
Double-check the type of the "name" field and that you have restarted
Solr since changing schema.xml.

-Yonik

On Tue, Oct 28, 2008 at 11:25 AM, Stephen Weiss <sw...@stylesight.com> wrote:
> Thanks for the reply.  I've been looking at the debug page... and I really
> don't see any clues there (maybe I don't know how to read it).
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> <lst name="params">
>  <str name="wt">standard</str>
>  <str name="rows">10</str>
>
>  <str name="start">0</str>
>  <str name="explainOther"/>
>  <str name="hl.fl"/>
>  <str name="indent">on</str>
>  <str name="q">name:(stm 0810 m_*)</str>
>  <str name="fl">*,score</str>
>  <str name="qt">standard</str>
>
>  <str name="debugQuery">on</str>
>  <str name="version">2.2</str>
> </lst>
> </lst>
> <result name="response" numFound="0" start="0" maxScore="0.0"/>
> <lst name="debug">
> <str name="rawquerystring">name:(stm 0810 m_*)</str>
> <str name="querystring">name:(stm 0810 m_*)</str>
>
> <str name="parsedquery">+name:stm +name:0810 +name:m_*</str>
> <str name="parsedquery_toString">+name:stm +name:0810 +name:m_*</str>
> <lst name="explain"/>
> </lst>
> </response>
>
> I mean, as far as I can tell, that seems right.  I think I'm missing
> something here.
>
> The wiki page is awesome though, thank you.  The catenateAll option does
> seem to do what I think it did... but should I perhaps just remove any kind
> of filter or analyzer on this field?  It's really not a big deal if someone
> has to get the dashes and underscores exactly right - it's a worse problem
> if they do get them right, but it still doesn't work (usually they copy and
> paste these from an e-mail or something).  Just in general, it's never
> really critical for someone to search by parts of the filename - except for
> searching with wildcard (that is, stm0810m_* and the like), and it would be
> a lot easier if they didn't have to put spaces where letters change to
> numbers & vice versa.
>
> Thanks again for your input.
>
> --
> Steve

Re: Question about textTight

Posted by Stephen Weiss <sw...@stylesight.com>.

Thanks for the reply.  I've been looking at the debug page... and I  
really don't see any clues there (maybe I don't know how to read it).

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
  <str name="wt">standard</str>
  <str name="rows">10</str>

  <str name="start">0</str>
  <str name="explainOther"/>
  <str name="hl.fl"/>
  <str name="indent">on</str>
  <str name="q">name:(stm 0810 m_*)</str>
  <str name="fl">*,score</str>
  <str name="qt">standard</str>

  <str name="debugQuery">on</str>
  <str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="debug">
<str name="rawquerystring">name:(stm 0810 m_*)</str>
<str name="querystring">name:(stm 0810 m_*)</str>

<str name="parsedquery">+name:stm +name:0810 +name:m_*</str>
<str name="parsedquery_toString">+name:stm +name:0810 +name:m_*</str>
<lst name="explain"/>
</lst>
</response>

I mean, as far as I can tell, that seems right.  I think I'm missing  
something here.

The wiki page is awesome though, thank you.  The catenateAll option  
does seem to do what I think it did... but should I perhaps just  
remove any kind of filter or analyzer on this field?  It's really not  
a big deal if someone has to get the dashes and underscores exactly  
right - it's a worse problem if they do get them right, but it still  
doesn't work (usually they copy and paste these from an e-mail or  
something).  Just in general, it's never really critical for someone  
to search by parts of the filename - except for searching with  
wildcard (that is, stm0810m_* and the like), and it would be a lot  
easier if they didn't have to put spaces where letters change to  
numbers & vice versa.

Thanks again for your input.

--
Steve

On Oct 28, 2008, at 10:49 AM, Feak, Todd wrote:

> You may want to take a very close look at what the WordDelimiterFilter
> is doing. I believe the underscore is dropped entirely during indexing
> AND searching as it's not alphanumeric.
>
> Wiki doco here
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=(tokenizer)#head-1c9b83870ca7890cd73b193cefed83c283339089
>
> The admin analysis page and query debug will help a lot to see what's
> going on.
>
> -Todd

RE: Question about textTight

Posted by "Feak, Todd" <To...@smss.sony.com>.

You may want to take a very close look at what the WordDelimiterFilter
is doing. I believe the underscore is dropped entirely during indexing
AND searching as it's not alphanumeric.

Wiki doco here
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=(tokenizer)#head-1c9b83870ca7890cd73b193cefed83c283339089

The admin analysis page and query debug will help a lot to see what's
going on.

-Todd
