Posted to solr-user@lucene.apache.org by vitaly bulgakov <bu...@yahoo.com> on 2015/10/15 15:47:44 UTC

Tokenize ShingleFilterFactory results and apply filters to tokens

I want to rephrase a question I asked in another post.
As far as I understand, ShingleFilterFactory emits each shingle as a single
string.
But I want to apply further filters (such as EdgeNGram) to each token within a
shingle.

For example, from "Home Improvement Service" I get two shingles:
"Home Improvement" and "Improvement Service".

I want to apply EdgeNGram to each word so that I can get an exact match on:
"Hom Improvem" and "Improvemen Servi" as new phrases.

Any help or ideas are welcome and appreciated.



--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenize-ShingleFilterFactory-results-and-apply-filters-to-tokens-tp4234574.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenize ShingleFilterFactory results and apply filters to tokens

Posted by Steve Rowe <sa...@gmail.com>.
Hi Vitaly,

I don’t know of any combination of built-in Lucene/Solr analysis components that would do what you want, but there used to be a filter called ShingleMatrixFilter that (if I understand both that filter and your requirement correctly) would do it when placed after an EdgeNGramFilter: <https://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/analysis/shingle/ShingleMatrixFilter.html>

It was deprecated in v3.1 and removed in v4.0 (see <https://issues.apache.org/jira/browse/LUCENE-2920>) because it wasn’t being maintained by the original creator and nobody else understood it :).  Uwe Schindler put up a patch that rewrote it and fixed some problems on <https://issues.apache.org/jira/browse/LUCENE-1391>, but that was never finished/committed.

What you want could create a huge number of terms, depending on the number of documents, the number of terms in the field, and the term length.  What do you want to use these terms for?
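
To give a rough sense of scale, here is a minimal Python sketch (not Lucene code) that counts how many per-token prefix combinations two-word shingles would produce. The function names and the minimum prefix length of 3 are assumptions chosen only for illustration.

```python
def prefix_count(word, min_gram=3):
    """Number of edge n-grams (prefixes) of `word` with at least min_gram chars."""
    return max(len(word) - min_gram + 1, 0)


def combined_term_count(tokens, size=2, min_gram=3):
    """Count the phrase terms produced when every word of every shingle
    is replaced by each of its prefixes (hypothetical helper)."""
    total = 0
    for i in range(len(tokens) - size + 1):
        combos = 1
        for word in tokens[i:i + size]:
            combos *= prefix_count(word, min_gram)
        total += combos
    return total


if __name__ == "__main__":
    print(combined_term_count(["Home", "Improvement", "Service"]))
```

For the three-word example in this thread that already gives 63 phrase terms from just two shingles, and the count grows with the product of the word lengths in each shingle.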

Steve



Re: Tokenize ShingleFilterFactory results and apply filters to tokens

Posted by vitaly bulgakov <bu...@yahoo.com>.
> why don't you put EdgeNGramFilter just after ShingleFilter?

Because it would build edge n-grams over each shingle as a single string.
For the shingle "Home Improvement" it would produce: ... Hom, Home, "Home ", Home I,
Home Im, Home Imp ...

But I need:
... Hom Imp, Hom Impr ...
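
To make the difference concrete, here is a small Python sketch of the two behaviours. It is plain Python, not Lucene or Solr code; the function names and the minimum prefix length of 3 are made up for illustration.

```python
from itertools import product


def edge_ngrams(text, min_gram=3):
    """All prefixes of `text` that are at least min_gram characters long."""
    return [text[:n] for n in range(min_gram, len(text) + 1)]


def shingles(tokens, size=2):
    """Word n-grams, each kept as a list of its words."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def string_level(tokens, size=2, min_gram=3):
    """Edge n-grams of each shingle treated as one string
    (what a shingle filter followed by an edge n-gram filter yields)."""
    out = []
    for shingle in shingles(tokens, size):
        out.extend(edge_ngrams(" ".join(shingle), min_gram))
    return out


def token_level(tokens, size=2, min_gram=3):
    """Every combination of per-word prefixes inside each shingle
    (the behaviour asked for in this thread)."""
    out = []
    for shingle in shingles(tokens, size):
        per_word = [edge_ngrams(word, min_gram) for word in shingle]
        out.extend(" ".join(combo) for combo in product(*per_word))
    return out


if __name__ == "__main__":
    words = ["Home", "Improvement", "Service"]
    print(string_level(words)[:6])
    print(token_level(words)[:6])
```

With these assumptions, "Home I" appears only in the string-level output, while "Hom Imp", "Hom Improvem" and "Improvemen Servi" appear only in the token-level output.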




Re: Tokenize ShingleFilterFactory results and apply filters to tokens

Posted by Koji Sekiguchi <ko...@rondhuit.com>.
Hi Vitaly,

I'm not sure I understand you correctly, but why don't you put EdgeNGramFilter just after
ShingleFilter? That is:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"/>
<filter class="solr.EdgeNGramFilterFactory"/>

Koji



Re: Tokenize ShingleFilterFactory results and apply filters to tokens

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
This sounds like an attempt to build auto-complete using n-grams
in text. In that case, Ted Sullivan's writing might be relevant:
http://lucidworks.com/blog/author/tedsullivan/

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

