Posted to java-user@lucene.apache.org by Ilya Zavorin <iz...@caci.com> on 2012/08/24 21:48:45 UTC

Efficient string lookup using Lucene

Hi Everyone,

I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files.

What I need is to implement an efficient EXACT string lookup. That is, I need to be able to find ANY Unicode string exactly as it appears. I do not care about language-specific modifications of the string. That is, if I search for a string "run", I do not need to find "ran" but I do want to find it in all of these strings below:

Fox is running fast
!%#^&$run!$!%@&$#
run,run

Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query parser to find these? Again, my queries might be more or less random Unicode sequences, and I need to find all their occurrences in the text.

Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods.
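
For comparison, the naive baseline being replaced here is essentially a linear scan with String.indexOf over every document. A minimal sketch (class and variable names are purely illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NaiveSubstringSearch {
    /** Returns the ids of all documents whose text contains the query verbatim. */
    public static List<String> findExact(Map<String, String> docsById, String query) {
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            // Exact match at the UTF-16 level; no analysis, no normalization.
            if (e.getValue().indexOf(query) >= 0) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

Every query touches every character of every document, which is what an index is supposed to avoid.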

Thanks!

Ilya Zavorin

Re: Efficient string lookup using Lucene

Posted by Dawid Weiss <da...@gmail.com>.
> The WhitespaceAnalyzer breaks up text at spaces, tabs and newlines.
> After that, you can use wildcards. This will use very little space. I
> believe leading and trailing wildcards are supported now, right?

If leading wildcards take too much time (I don't know, really), then one
could also index the reversed tokens as synonyms and use trailing
wildcards alone, for everything :)
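
For illustration, a minimal query-side sketch of that idea, assuming a second field (called "content_rev" here, an invented name) whose tokens were reversed at index time, e.g. with Lucene's ReverseStringFilter; a leading-wildcard lookup then turns into a trailing-wildcard query on the reversed field:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class ReversedWildcardSketch {
    /** Rewrites a "tokens ending with <suffix>" lookup (normally "*suffix")
     *  into a trailing-wildcard query against the reversed-token field. */
    public static Query endsWith(String suffix) {
        String reversed = new StringBuilder(suffix).reverse().toString();
        return new WildcardQuery(new Term("content_rev", reversed + "*"));
    }
}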

D.



Re: Efficient string lookup using Lucene

Posted by Lance Norskog <go...@gmail.com>.
The WhitespaceAnalyzer breaks up text at spaces, tabs and newlines.
After that, you can use wildcards. This will use very little space. I
believe leading and trailing wildcards are supported now, right?
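
For concreteness, a minimal query-side sketch of that approach, assuming the text was indexed into a field named "content" (an illustrative name) with WhitespaceAnalyzer:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;

public class WildcardLookupSketch {
    /** Finds documents whose "content" field contains a token
     *  with the needle anywhere inside it. */
    public static TopDocs findContaining(IndexSearcher searcher, String needle) throws IOException {
        // WildcardQuery itself accepts a leading wildcard; the classic QueryParser
        // only allows one after setAllowLeadingWildcard(true).
        WildcardQuery query = new WildcardQuery(new Term("content", "*" + needle + "*"));
        return searcher.search(query, 100);
    }
}

Since whitespace tokenization leaves "!%#^&$run!$!%@&$#" and "run,run" as single tokens, the double wildcard matches both of the original examples; the caveat is that leading wildcards force a scan of the term dictionary, which can get slow as the index grows.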

On Sun, Aug 26, 2012 at 11:29 AM, Ilya Zavorin <iz...@caci.com> wrote:
> The user uploads a set of text files, either all of them at once or one at a time, and then they will be searched locally on the phone against a set of "hotlist" words. This assumes no connection to any sort of server so everything must be done locally.
>
> I already have Lucene integrated so I might want to try the n-gram approach. But I just want to double-check first that it will work with any Unicode string, be it an English word, a foreign word, a sequence of digits or any random sequence of Unicode characters. In other words, this is not in any way language-dependent/-specific.
>
> Thanks,
>
> Ilya
>
> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> Sent: Sunday, August 26, 2012 3:55 AM
> To: java-user@lucene.apache.org
> Subject: Re: Efficient string lookup using Lucene
>
>> Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene?
>
> You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even obfuscated).
>
>> By the way, I need this to run on an Android phone so size of memory might be an issue...
>
> How large is your input? Do you need to index on the android or just read the index on it? These are all factors to take into account. I mentioned suffix trees and suffix arrays because these two are "canonical" data structures to perform any substring lookups in constant time (in fact, the lookup takes the number of elements of the matched input string, building the suffix tree/ array is O(n), at least in theory).
>
> If you already have Lucene integrated in your pipeline then that n-gram approach will also work. If you know your minimum match substring length to be p then index p-sized shingles. For strings longer than p you can create a query which will search for all n-gram occurrences and take into account positional information to remove false matches.
>
> Dawid
>
>



-- 
Lance Norskog
goksron@gmail.com



RE: Efficient string lookup using Lucene

Posted by Ilya Zavorin <iz...@caci.com>.
The user uploads a set of text files, either all of them at once or one at a time, and then they will be searched locally on the phone against a set of "hotlist" words. This assumes no connection to any sort of server so everything must be done locally.

I already have Lucene integrated so I might want to try the n-gram approach. But I just want to double-check first that it will work with any Unicode string, be it an English word, a foreign word, a sequence of digits or any random sequence of Unicode characters. In other words, this is not in any way language-dependent/-specific.

Thanks,

Ilya

-----Original Message-----
From: Dawid Weiss [mailto:dawid.weiss@gmail.com] 
Sent: Sunday, August 26, 2012 3:55 AM
To: java-user@lucene.apache.org
Subject: Re: Efficient string lookup using Lucene

> Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene?

You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even obfuscated).

> By the way, I need this to run on an Android phone so size of memory might be an issue...

How large is your input? Do you need to index on the android or just read the index on it? These are all factors to take into account. I mentioned suffix trees and suffix arrays because these two are "canonical" data structures to perform any substring lookups in constant time (in fact, the lookup takes the number of elements of the matched input string, building the suffix tree/ array is O(n), at least in theory).

If you already have Lucene integrated in your pipeline then that n-gram approach will also work. If you know your minimum match substring length to be p then index p-sized shingles. For strings longer than p you can create a query which will search for all n-gram occurrences and take into account positional information to remove false matches.

Dawid



Re: Efficient string lookup using Lucene

Posted by Dawid Weiss <da...@gmail.com>.
> Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene?

You'd have to implement it separately but it'd be much, much smaller
than Lucene itself (even obfuscated).

> By the way, I need this to run on an Android phone so size of memory might be an issue...

How large is your input? Do you need to index on the Android device or
just read the index on it? These are all factors to take into account.
I mentioned suffix trees and suffix arrays because these two are the
"canonical" data structures for arbitrary substring lookups: a lookup
takes time proportional to the length of the matched input string, and
building the suffix tree/array is O(n), at least in theory.

If you already have Lucene integrated in your pipeline then that
n-gram approach will also work. If you know your minimum match
substring length to be p then index p-sized shingles. For strings
longer than p you can create a query which will search for all n-gram
occurrences and take into account positional information to remove
false matches.
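
A minimal query-side sketch of that positional trick, assuming the text was indexed as overlapping character n-grams of a fixed size P into a field named "content" (the name and P = 3 are illustrative), one position per gram. It uses the old mutable PhraseQuery API; recent Lucene versions build phrase queries through PhraseQuery.Builder instead:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class NGramSubstringQuery {
    private static final int P = 3;  // assumed gram size == minimum match length

    /** Requires every consecutive P-gram of the pattern to appear at consecutive
     *  positions, which filters out documents that merely contain the grams
     *  scattered around (the false matches mentioned above). */
    public static PhraseQuery substringQuery(String pattern) {
        PhraseQuery query = new PhraseQuery();  // slop 0 by default
        for (int i = 0; i + P <= pattern.length(); i++) {
            query.add(new Term("content", pattern.substring(i, i + P)), i);
        }
        return query;
    }
}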

Dawid



Re: Efficient string lookup using Lucene

Posted by Noopur Julka <no...@gmail.com>.
I haven't yet found an answer to my original question, which was
how to get search working with Japanese characters.

Regards,
Noopur Julka



On Sun, Aug 26, 2012 at 9:17 AM, Devon H. O'Dell <de...@gmail.com> wrote:

> Seems worth mentioning in partial response to this thread's topics that
> (almost) regardless of index strategy, lucene performance hinges on number
> of matched documents per query, not total docs in index. There are other
> mitigating factors (disk type, ram size, etc), but worst case performance
> analysis can generally be modeled in terms of matched documents as opposed
> to index size.
>
> Apologies for any spelling / grammatical errors; this is sent from my
> phone.
>
> --dho
>  On Aug 25, 2012 11:02 PM, "Noopur Julka" <no...@gmail.com> wrote:
>
> > Index being very large can be ruled out as Luke returned few results and
> > the app is capable of returning approx 200 results.
> >
> > Regards,
> > Noopur Julka
> >
> >
> >
> > On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin <iz...@caci.com> wrote:
> >
> > > Does Lucene support this type of structure, or do I need to somehow
> > > implement it outside Lucene?
> > >
> > > By the way, I need this to run on an Android phone so size of memory
> > might
> > > be an issue...
> > >
> > > Thanks,
> > >
> > >
> > > Ilya Zavorin
> > >
> > >
> > > -----Original Message-----
> > > From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> > > Sent: Friday, August 24, 2012 4:50 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Efficient string lookup using Lucene
> > >
> > > What you need is a suffix tree or a suffix array. Both data structures
> > > will allow you to perform constant-time searches for existence/
> > occurrence
> > > of any input pattern. Depending on how much text you have on the input
> it
> > > may either be a simple task -- see here:
> > >
> > > http://labs.carrotsearch.com/jsuffixarrays.html
> > >
> > > or a complicated task if your input size is larger (larger than
> memory).
> > > Google search for suffix trees/ suffix arrays though, it's the data
> > > structure to use here.
> > >
> > > Dawid
> > >
> > > On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <iz...@caci.com>
> wrote:
> > > > Hi Everyone,
> > > >
> > > > I have the following task. I have a set of documents in multiple
> > > languages. I don't know what these languages are. Any given doc may
> > contain
> > > text in several languages mixed up. So to me these are just a bunch of
> > > Unicode text files.
> > > >
> > > > What I need is to implement an efficient EXACT string lookup. That
> is,
> > I
> > > need to be able to find ANY Unicode string exactly as it appears. I do
> > not
> > > care about language-specific modifications of the string. That is, if I
> > > search for a string "run", I do not need to find "ran" but I do want to
> > > find it in all of these strings below:
> > > >
> > > > Fox is running fast
> > > > !%#^&$run!$!%@&$#
> > > > run,run
> > > >
> > > > Is there a way of using StandardAnalyzer or any other analyzer and
> the
> > > corresponding query parser to find these? Again, my queries might be
> more
> > > or less random Unicode sequences and I need to find all their
> occurrences
> > > in the text.
> > > >
> > > > Essentially, what I am trying to do is implement substring matching
> > more
> > > efficiently than using Java's standard substring matching methods.
> > > >
> > > > Thanks!
> > > >
> > > > Ilya Zavorin
> > >
> > >
> > >
> >
>

Re: Efficient string lookup using Lucene

Posted by "Devon H. O'Dell" <de...@gmail.com>.
Seems worth mentioning, in partial response to this thread's topics, that
(almost) regardless of index strategy, Lucene performance hinges on the
number of matched documents per query, not the total number of docs in the
index. There are other mitigating factors (disk type, RAM size, etc.), but
worst-case performance analysis can generally be modeled in terms of
matched documents rather than index size.

Apologies for any spelling / grammatical errors; this is sent from my phone.

--dho
 On Aug 25, 2012 11:02 PM, "Noopur Julka" <no...@gmail.com> wrote:

> Index being very large can be ruled out as Luke returned few results and
> the app is capable of returning approx 200 results.
>
> Regards,
> Noopur Julka
>
>
>
> On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin <iz...@caci.com> wrote:
>
> > Does Lucene support this type of structure, or do I need to somehow
> > implement it outside Lucene?
> >
> > By the way, I need this to run on an Android phone so size of memory
> might
> > be an issue...
> >
> > Thanks,
> >
> >
> > Ilya Zavorin
> >
> >
> > -----Original Message-----
> > From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> > Sent: Friday, August 24, 2012 4:50 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Efficient string lookup using Lucene
> >
> > What you need is a suffix tree or a suffix array. Both data structures
> > will allow you to perform constant-time searches for existence/
> occurrence
> > of any input pattern. Depending on how much text you have on the input it
> > may either be a simple task -- see here:
> >
> > http://labs.carrotsearch.com/jsuffixarrays.html
> >
> > or a complicated task if your input size is larger (larger than memory).
> > Google search for suffix trees/ suffix arrays though, it's the data
> > structure to use here.
> >
> > Dawid
> >
> > On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <iz...@caci.com> wrote:
> > > Hi Everyone,
> > >
> > > I have the following task. I have a set of documents in multiple
> > languages. I don't know what these languages are. Any given doc may
> contain
> > text in several languages mixed up. So to me these are just a bunch of
> > Unicode text files.
> > >
> > > What I need is to implement an efficient EXACT string lookup. That is,
> I
> > need to be able to find ANY Unicode string exactly as it appears. I do
> not
> > care about language-specific modifications of the string. That is, if I
> > search for a string "run", I do not need to find "ran" but I do want to
> > find it in all of these strings below:
> > >
> > > Fox is running fast
> > > !%#^&$run!$!%@&$#
> > > run,run
> > >
> > > Is there a way of using StandardAnalyzer or any other analyzer and the
> > corresponding query parser to find these? Again, my queries might be more
> > or less random Unicode sequences and I need to find all their occurrences
> > in the text.
> > >
> > > Essentially, what I am trying to do is implement substring matching
> more
> > efficiently than using Java's standard substring matching methods.
> > >
> > > Thanks!
> > >
> > > Ilya Zavorin
> >
> >
> >
>

Re: Efficient string lookup using Lucene

Posted by Noopur Julka <no...@gmail.com>.
The index being very large can be ruled out, as Luke returned only a few
results and the app is capable of returning approximately 200 results.

Regards,
Noopur Julka



On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin <iz...@caci.com> wrote:

> Does Lucene support this type of structure, or do I need to somehow
> implement it outside Lucene?
>
> By the way, I need this to run on an Android phone so size of memory might
> be an issue...
>
> Thanks,
>
>
> Ilya Zavorin
>
>
> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> Sent: Friday, August 24, 2012 4:50 PM
> To: java-user@lucene.apache.org
> Subject: Re: Efficient string lookup using Lucene
>
> What you need is a suffix tree or a suffix array. Both data structures
> will allow you to perform constant-time searches for existence/ occurrence
> of any input pattern. Depending on how much text you have on the input it
> may either be a simple task -- see here:
>
> http://labs.carrotsearch.com/jsuffixarrays.html
>
> or a complicated task if your input size is larger (larger than memory).
> Google search for suffix trees/ suffix arrays though, it's the data
> structure to use here.
>
> Dawid
>
> On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <iz...@caci.com> wrote:
> > Hi Everyone,
> >
> > I have the following task. I have a set of documents in multiple
> languages. I don't know what these languages are. Any given doc may contain
> text in several languages mixed up. So to me these are just a bunch of
> Unicode text files.
> >
> > What I need is to implement an efficient EXACT string lookup. That is, I
> need to be able to find ANY Unicode string exactly as it appears. I do not
> care about language-specific modifications of the string. That is, if I
> search for a string "run", I do not need to find "ran" but I do want to
> find it in all of these strings below:
> >
> > Fox is running fast
> > !%#^&$run!$!%@&$#
> > run,run
> >
> > Is there a way of using StandardAnalyzer or any other analyzer and the
> corresponding query parser to find these? Again, my queries might be more
> or less random Unicode sequences and I need to find all their occurrences
> in the text.
> >
> > Essentially, what I am trying to do is implement substring matching more
> efficiently than using Java's standard substring matching methods.
> >
> > Thanks!
> >
> > Ilya Zavorin
>
>
>

RE: Efficient string lookup using Lucene

Posted by Ilya Zavorin <iz...@caci.com>.
Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene?

By the way, I need this to run on an Android phone so size of memory might be an issue...

Thanks,


Ilya Zavorin


-----Original Message-----
From: Dawid Weiss [mailto:dawid.weiss@gmail.com] 
Sent: Friday, August 24, 2012 4:50 PM
To: java-user@lucene.apache.org
Subject: Re: Efficient string lookup using Lucene

What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/ occurrence of any input pattern. Depending on how much text you have on the input it may either be a simple task -- see here:

http://labs.carrotsearch.com/jsuffixarrays.html

or a complicated task if your input size is larger (larger than memory). Google search for suffix trees/ suffix arrays though, it's the data structure to use here.

Dawid

On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <iz...@caci.com> wrote:
> Hi Everyone,
>
> I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files.
>
> What I need is to implement an efficient EXACT string lookup. That is, I need to be able to find ANY Unicode string exactly as it appears. I do not care about language-specific modifications of the string. That is, if I search for a string "run", I do not need to find "ran" but I do want to find it in all of these strings below:
>
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run
>
> > Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query parser to find these? Again, my queries might be more or less random Unicode sequences and I need to find all their occurrences in the text.
>
> > Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods.
>
> Thanks!
>
> Ilya Zavorin



Re: Efficient string lookup using Lucene

Posted by Dawid Weiss <da...@gmail.com>.
What you need is a suffix tree or a suffix array. Both data structures
will allow you to search for the existence/occurrence of any input
pattern in time proportional to the pattern's length, independent of
the text size. Depending on how much text you have on the input, it
may either be a simple task -- see here:

http://labs.carrotsearch.com/jsuffixarrays.html

or a complicated task if your input is larger than memory. Do a Google
search for suffix trees/suffix arrays though; they're the data
structures to use here.
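
For small inputs, a self-contained sketch of the suffix-array idea (naive sort-based construction, nowhere near the linear-time algorithms or the jsuffixarrays library linked above, but it shows the shape of the lookup):

import java.util.Arrays;
import java.util.Comparator;

public class SimpleSuffixArray {
    private final String text;
    private final Integer[] sa;  // suffix start offsets, sorted lexicographically

    public SimpleSuffixArray(String input) {
        this.text = input;
        this.sa = new Integer[input.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: sort offsets by the suffixes they start.
        Arrays.sort(sa, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return text.substring(a).compareTo(text.substring(b));
            }
        });
    }

    /** True if the pattern occurs anywhere in the text (binary search over suffixes). */
    public boolean contains(String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(pattern)) return true;
            if (suffix.compareTo(pattern) < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}

Construction here is roughly O(n^2 log n) and copies suffixes while comparing, so it is only a toy; the linear-time constructions are what make the approach practical for large texts.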

Dawid

On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin <iz...@caci.com> wrote:
> Hi Everyone,
>
> I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files.
>
> What I need is to implement an efficient EXACT string lookup. That is, I need to be able to find ANY Unicode string exactly as it appears. I do not care about language-specific modifications of the string. That is, if I search for a string "run", I do not need to find "ran" but I do want to find it in all of these strings below:
>
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run
>
> Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query parser to find these? Again, my queries might be more or less random Unicode sequences and I need to find all their occurrences in the text.
>
> Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods.
>
> Thanks!
>
> Ilya Zavorin



Re: Efficient string lookup using Lucene

Posted by Jack Krupansky <ja...@basetechnology.com>.
I can't speak for any non-Latin languages, but how about simply using the
StandardAnalyzer plus the EdgeNGramFilter at indexing time (but not at
query time)? The latter would allow a query of "run" to match "running".
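
A rough sketch of that index-time chain against the Lucene 3.6-era API (the filter class is EdgeNGramTokenFilter; later versions moved Analyzer to createComponents() and changed the filter's constructor, and the gram limits below are an illustrative choice):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/** Index-time only: "running" is expanded into "r", "ru", "run", ..., "running",
 *  so a plain (un-expanded) query for "run" matches. Use an analyzer without
 *  the edge n-gram step at query time. */
public class EdgeNGramIndexAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36, reader);
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        return new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
    }
}

Note this only finds the query at the start of a token ("run" matches "running" but not "rerun"); the plain NGramFilter variant discussed elsewhere in the thread covers arbitrary offsets at the cost of many more tokens.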

-- Jack Krupansky

-----Original Message----- 
From: Ilya Zavorin
Sent: Friday, August 24, 2012 3:48 PM
To: java-user@lucene.apache.org
Subject: Efficient string lookup using Lucene

Hi Everyone,

I have the following task. I have a set of documents in multiple languages. 
I don't know what these languages are. Any given doc may contain text in 
several languages mixed up. So to me these are just a bunch of Unicode text 
files.

What I need is to implement an efficient EXACT string lookup. That is, I 
need to be able to find ANY Unicode string exactly as it appears. I do not 
care about language-specific modifications of the string. That is, if I 
search for a string "run", I do not need to find "ran" but I do want to find 
it in all of these strings below:

Fox is running fast
!%#^&$run!$!%@&$#
run,run

Is there a way of using StandardAnalyzer or any other analyzer and the 
corresponding query parser to find these? Again, my queries might be more or 
less random Unicode sequences and I need to find all their occurrences in
the text.

Essentially, what I am trying to do is implement substring matching more 
efficiently than using Java's standard substring matching methods.

Thanks!

Ilya Zavorin 




Re: Efficient string lookup using Lucene

Posted by Noopur Julka <no...@gmail.com>.
Hi,

I have a similar issue.
I need Lucene search to work with kanji characters (Japanese).

The Hits object (or TopDocs) returns length = 0 for the results, but the
same search works well for English. I know my index contains matches, as
Luke (the Lucene index toolbox) renders them.

I tried the lace analyser - it did not work.
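
For what it's worth, a common cause of "Luke shows the terms but the search returns 0 hits" is an index/query analyzer mismatch: CJK text has to be tokenized the same way on both sides. A minimal sketch using Lucene's CJKAnalyzer (3.x-era API; the field name and version constant are illustrative), turning the query string into a positional query with the same analyzer that built the index:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.util.Version;

public class CjkQuerySketch {
    public static PhraseQuery kanjiQuery(String field, String queryText) throws IOException {
        Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_36);  // same analyzer as at index time
        PhraseQuery phrase = new PhraseQuery();
        TokenStream ts = analyzer.tokenStream(field, new StringReader(queryText));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        int pos = 0;
        while (ts.incrementToken()) {
            // CJKAnalyzer emits overlapping bigrams; consecutive positions keep them adjacent.
            phrase.add(new Term(field, term.toString()), pos++);
        }
        ts.end();
        ts.close();
        return phrase;
    }
}

If QueryParser is used instead, make sure it is constructed with the same analyzer that was used for indexing.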

Regards,
Noopur Julka



On Sat, Aug 25, 2012 at 2:28 AM, Ahmet Arslan <io...@yahoo.com> wrote:

>
> > search for a string "run", I do not need to find "ran" but I
> > do want to find it in all of these strings below:
> >
> > Fox is running fast
> > !%#^&$run!$!%@&$#
> > run,run
>
>
> With NGramFilter you can do that. But it creates a lot of tokens. For
> example "Fox is running fast" becomes
>
> F, o, x, Fo, ox, Fox, i, s, is, r, u, n, n, i, n, g, ru, un, nn, ni, in, ng,
> *run*, unn, nni, nin, ing, runn, unni, nnin, ning, runni, unnin, nning,
> runnin, unning, running, f, a, s, t, fa, as, st, fas, ast, fast
>
>
>
>
>

RE: Efficient string lookup using Lucene

Posted by Ilya Zavorin <iz...@caci.com>.
Does it mean that the resulting index will be very large?

Thanks,

Ilya

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com] 
Sent: Friday, August 24, 2012 4:59 PM
To: java-user@lucene.apache.org
Subject: Re: Efficient string lookup using Lucene

> search for a string "run", I do not need to find "ran" but I do want 
> to find it in all of these strings below:
> 
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run


With NGramFilter you can do that. But it creates a lot of tokens. For example "Fox is running fast" becomes 

F, o, x, Fo, ox, Fox, i, s, is, r, u, n, n, i, n, g, ru, un, nn, ni, in, ng,
*run*, unn, nni, nin, ing, runn, unni, nnin, ning, runni, unnin, nning,
runnin, unning, running, f, a, s, t, fa, as, st, fas, ast, fast




Re: Efficient string lookup using Lucene

Posted by Ahmet Arslan <io...@yahoo.com>.
> search for a string "run", I do not need to find "ran" but I
> do want to find it in all of these strings below:
> 
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run


With NGramFilter you can do that. But it creates a lot of tokens. For example "Fox is running fast" becomes 

F, o, x, Fo, ox, Fox, i, s, is, r, u, n, n, i, n, g, ru, un, nn, ni, in, ng,
*run*, unn, nni, nin, ing, runn, unni, nnin, ning, runni, unnin, nning,
runnin, unning, running, f, a, s, t, fa, as, st, fas, ast, fast
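
One way to keep that token blow-up (and the index size asked about above) under control is to fix the gram size instead of emitting every length from 1 up. A sketch against the 3.6-era analysis API (constructor signatures changed in later versions; the gram size of 3 is an illustrative choice):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.util.Version;

/** Index-time analyzer: whitespace tokens expanded into character 3-grams only. */
public class NGramIndexAnalyzer extends Analyzer {
    private static final int GRAM = 3;

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        return new NGramTokenFilter(stream, GRAM, GRAM);
    }
}

The trade-off: tokens shorter than the gram size produce no grams at all with this chain, and query strings longer than the gram size need the positional-query trick described elsewhere in the thread.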

