Posted to solr-user@lucene.apache.org by PeterKerk <ve...@hotmail.com> on 2012/02/28 22:05:10 UTC

Need tokenization that finds part of stringvalue

I have the following in my schema.xml

<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text" indexed="true" stored="true"/>


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


I want to search on field "title".
Now my field title holds the value "great smartphone".
If I search on "smartphone" the item is found. But I also want the item to
be found on "great" or "phone", and that doesn't work.
I have been playing around with the tokenizer test function, but have failed
to find the definition for the "text" fieldtype I need.
Help? :)

--
View this message in context: http://lucene.472066.n3.nabble.com/Need-tokenization-that-finds-part-of-stringvalue-tp3785366p3785366.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Need tokenization that finds part of stringvalue

Posted by Erick Erickson <er...@gmail.com>.

One frequent method of doing leading and trailing wildcards
is to use ngrams (as distinct from edgengrams). That in
combination with phrase queries might work well in this case.

You might also be surprised at how little space bigrams take;
give it a test and see <G>.

Best
Erick

On Thu, Mar 1, 2012 at 4:57 PM, PeterKerk <ve...@hotmail.com> wrote:
> @iorixxx: Where can I find that example schema.xml?
>
> I downloaded the latest version here:
> ftp://apache.mirror.easycolocate.nl//lucene/solr/3.5.0
> And checked \example\example-DIH\solr\db\conf\schema.xml
> But no text_rev type is defined in there.
>
> And when I find it, can I just make the title field which currently is of
> "text" type then of "text_rev" type?
>
> Thanks!
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Need-tokenization-that-finds-part-of-stringvalue-tp3785366p3791863.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

edismax did the trick! Thanks!


Re: Need tokenization that finds part of stringvalue

Posted by Ahmet Arslan <io...@yahoo.com>.

> With this searchquery I dont get any results:
> http://localhost:8983/solr/zz/select/?indent=on&facet=true&q=*smart*&defType=dismax&qf=title_search^20.0&start=0&rows=30&fl=id,title&facet.mincount=1
> 
> What more can I do?
> Thanks!

The dismax query parser does not support wildcard queries. defType=edismax would work. defType=lucene&df=title_search&q=*smart* should work too.
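
The same parser choice can also be set as a handler default in solrconfig.xml instead of on every URL. The following is only a minimal sketch: the handler name /title_search is hypothetical, and the qf and fl values are simply copied from the query above.

```xml
<!-- Hypothetical handler name; adjust to your own solrconfig.xml -->
<requestHandler name="/title_search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax accepts wildcard terms such as *smart*; plain dismax does not -->
    <str name="defType">edismax</str>
    <!-- values taken from the query URL in this thread -->
    <str name="qf">title_search^20.0</str>
    <str name="fl">id,title</str>
  </lst>
</requestHandler>
```

With defaults like these, a request would only need to pass q=*smart* to that handler. Parameters given on the URL still override the defaults.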

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

@iorixxx: Sorry it took so long, had some difficulties upgrading to 3.5.0

It still doesn't work. Here's what I have now:

I copied text_general_rev from
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/schema.xml
to my schema.xml:
    <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

To be complete: this is the definition of the title fieldtype:
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text_general_rev" indexed="true" stored="true"/>
<copyField source="title" dest="title_search"/>


title field value="Smartphone"

With this search query I don't get any results:
http://localhost:8983/solr/zz/select/?indent=on&facet=true&q=*smart*&defType=dismax&qf=title_search^20.0&start=0&rows=30&fl=id,title&facet.mincount=1

What more can I do?
Thanks!



Re: Need tokenization that finds part of stringvalue

Posted by Ahmet Arslan <io...@yahoo.com>.

> @iorixxx
> I tried making my title_search of type text_rev and tried
> adding the
> ReversedWildcardFilterFactory to my existing "text" type,
> but in both cases
> no luck.

I was able to perform *query* types of searches with solr 3.5 distro.
Here is what I did:

Download apache-solr-3.5.0
Edit schema.xml
make the text_rev field stored="true"
add <copyField source="features" dest="text_rev"/>
java -jar start.jar
java -jar post.jar 

http://localhost:8983/solr/select/?q=text_rev:*me*&version=2.2&start=0&rows=10&indent=on&fl=text_rev&hl=true&hl.highlightMultiTerm=true&hl.usePhraseHighlighter=true&hl.fl=text_rev

returns 7 docs, with the following snippets:

<em>SmartMedia</em>, <em>megapixel</em>, <em>document</em>, <em>time</em> etc.

Keep in mind that changes of this kind in schema.xml require re-indexing and restarting the Solr server.

Also you need to be aware of http://wiki.apache.org/solr/MultitermQueryAnalysis


> @Erick Erickson
> "On frequent method of doing leading and trailing wildcards
> is to use ngrams
> (as distinct from edgengrams). That in combination with
> phrase queries might
> work well in this case. "

Erick's suggestion will work faster in terms of QTime (response time).
To get the idea, try a "text_ngrm" field type in analysis.jsp, which will display the generated tokens.

http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramFilterFactory.html
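
As a rough illustration of the ngram idea, here is a minimal sketch of a field type that indexes every substring of each word. The type name text_ngram_sketch and the gram sizes are assumptions made for this example. Note that this is a simpler variant (index-time grams matched by ordinary term queries), not the bigram plus phrase-query setup Erick describes.

```xml
<!-- Hypothetical type name; minGramSize/maxGramSize are illustrative values -->
<fieldType name="text_ngram_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits all substrings of 2 to 15 characters of each token,
         e.g. "smartphone" yields "sm", "ma", ..., "martph", ..., "smartphone" -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no ngram filter here: the typed text is matched as one term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_search" type="text_ngram_sketch" indexed="true" stored="true"/>
<copyField source="title" dest="title_search"/>
```

With a type like this, a plain query such as q=title_search:martph should match a title containing "smartphone" without any wildcard. The trade-off is a larger index, and query terms shorter than minGramSize or longer than maxGramSize will not match.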

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

@iorixxx
I tried making my title_search of type text_rev and tried adding the
ReversedWildcardFilterFactory to my existing "text" type, but in both cases
no luck.

@Erick Erickson
"On frequent method of doing leading and trailing wildcards is to use ngrams
(as distinct from edgengrams). That in combination with phrase queries might
work well in this case. "

Do you perhaps have an example of that?


Re: Need tokenization that finds part of stringvalue

Posted by Ahmet Arslan <io...@yahoo.com>.

> @iorixxx: Where can I find that
> example schema.xml?

Please find text_general_rev at 
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/schema.xml


> And when I find it, can I just make the title field which
> currently is of
> "text" type then of "text_rev" type?

Yes. You can also just add solr.ReversedWildcardFilterFactory to your existing index analyzer.

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

@iorixxx: Where can I find that example schema.xml?

I downloaded the latest version here:
ftp://apache.mirror.easycolocate.nl//lucene/solr/3.5.0
And checked \example\example-DIH\solr\db\conf\schema.xml
But no text_rev type is defined in there.

And when I find it, can I just make the title field which currently is of
"text" type then of "text_rev" type?

Thanks!


Re: Need tokenization that finds part of stringvalue

Posted by Ahmet Arslan <io...@yahoo.com>.

--- On Thu, 3/1/12, PeterKerk <ve...@hotmail.com> wrote:

> From: PeterKerk <ve...@hotmail.com>
> Subject: Re: Need tokenization that finds part of stringvalue
> To: solr-user@lucene.apache.org
> Date: Thursday, March 1, 2012, 6:59 PM
> @iorixxx: yes, that is what I need.
> But also when its IN the text, not
> necessarily at the beginning.
> 
> So using the * character like: 
> q=smart* 
> the product is found, but when I do this: 
> q=*mart* 
> it isnt...why is that?

In the example schema.xml there is a field type named text_rev that makes use of ReversedWildcardFilterFactory. It is designed to enable the leading star operator, e.g. q=*mart

I haven't used it myself, but maybe you can use both leading and trailing wildcards (at the same time) with this type:
q=*mart*&df=title_search&defType=lucene

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

@iorixxx: yes, that is what I need. But also when it's IN the text, not
necessarily at the beginning.

So using the * character like:
q=smart*
the product is found, but when I do this:
q=*mart*
it isn't... why is that?


Re: Need tokenization that finds part of stringvalue

Posted by Ahmet Arslan <io...@yahoo.com>.

> if title holds "smartphone" I want it to be found when
> someone types
> "martph" or "smar" or "smart".

Peter, so you want a beginsWith/startsWith type of search? You can use a wildcard search (with the star operator) for this, e.g. &q=smar*

Alternatively, if your index size is not huge, you can use EdgeNGramFilterFactory at index time along with normal queries, e.g. &q=smar

http://lucene.apache.org/solr/api/org/apache/solr/analysis/EdgeNGramFilterFactory.html
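
A minimal sketch of that index-time EdgeNGram setup could look like the following. The type name text_prefix_sketch and the gram sizes are assumptions chosen for illustration only.

```xml
<!-- Hypothetical type name; gram sizes are illustrative values -->
<fieldType name="text_prefix_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits leading substrings only: "smartphone" yields "sm", "sma", "smar", "smart", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query text is not grammed, so q=smar is looked up as a single term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because only prefixes are indexed, this covers "smar" and "smart" but not an inner substring such as "martph".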

Re: Need tokenization that finds part of stringvalue

Posted by PeterKerk <ve...@hotmail.com>.

I think I didn't explain myself clearly: I need to be able to find substrings.
So, it's not that I'd expect Solr to find synonyms, but rather that an item is
found if a piece of its text contains the searched text, for example:

if title holds "smartphone" I want it to be found when someone types
"martph" or "smar" or "smart".

I think that is different from what you initially understood from my
explanation, or...?


RE: Need tokenization that finds part of stringvalue

Posted by "Dyer, James" <Ja...@ingrambook.com>.

Speaking of which, there is a spellchecker in jira that will detect word-break errors like this.  See "WordBreakSpellChecker" at https://issues.apache.org/jira/browse/LUCENE-3523 .

To use it with Solr, you'd also need to apply SOLR-2993 (https://issues.apache.org/jira/browse/SOLR-2993).  This Solr piece will take the results of your "normal" spellchecker and integrate them with the results from the WordBreakSpellChecker.  

These patches are for Trunk/4.x, and you'd have to apply them as described here:  http://wiki.apache.org/solr/HowToContribute#Review.2BAC8-Improve_Existing_Patches

I would appreciate it if you tried these out and provided feedback on the JIRA issues as to how it works for you and also how it can be improved.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Thursday, March 01, 2012 9:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Need tokenization that finds part of stringvalue

I once used a spell checker to break up compound words. It was slow, but worked pretty well.

wunder

Re: Need tokenization that finds part of stringvalue

Posted by Walter Underwood <wu...@wunderwood.org>.

I once used a spell checker to break up compound words. It was slow, but worked pretty well.

wunder

On Mar 1, 2012, at 5:53 AM, Erick Erickson wrote:

> Right, there's nothing in Solr that I know of that'll help here. How would
> a tokenizer understand that "smartphone" should be "smart" "phone"?
> There's no general solution for this issue.
> 
> You can do domain-specific solutions with synonyms for instance, or
> some other word list that contains terms you're interested in, entries
> like smartphone => smart phone
> but that has the obvious drawback of requiring that you know all the
> terms that might be smashed together.
> 
> You *might* be able to do something with shingles, but I'm a little unclear
> on how.
> 
> Best
> Erick

--
Walter Underwood
wunder@wunderwood.org

Re: Need tokenization that finds part of stringvalue

Posted by Erick Erickson <er...@gmail.com>.

Right, there's nothing in Solr that I know of that'll help here. How would
a tokenizer understand that "smartphone" should be "smart" "phone"?
There's no general solution for this issue.

You can do domain-specific solutions with synonyms for instance, or
some other word list that contains terms you're interested in, entries
like smartphone => smart phone
but that has the obvious drawback of requiring that you know all the
terms that might be smashed together.

You *might* be able to do something with shingles, but I'm a little unclear
on how.

Best
Erick
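
To make the domain-specific word list idea concrete, here is a minimal sketch of an index-time synonym setup. The file name compounds.txt, the type name text_compound_sketch, and the entry shown in the comment are assumptions for illustration. The entry keeps the original word next to its parts, which is a small variation on the "smartphone => smart phone" line above.

```xml
<!-- Hypothetical type and file names -->
<fieldType name="text_compound_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- compounds.txt would hold one line per known compound, for example:
           smartphone => smartphone, smart, phone
         so that the parts are indexed alongside the original term -->
    <filter class="solr.SynonymFilterFactory" synonyms="compounds.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As noted above, this only helps for compounds that are listed in the file, and the index has to be rebuilt whenever the list changes.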
