Posted to java-user@lucene.apache.org by jchang <jc...@gmail.com> on 2010/02/01 08:25:26 UTC
Can't get tokenization/stop works working
I want to be able to store a doc with a field with this as a substring:
www.fubar.com
And then I want this document to get returned when I query on
fubar or
fubar.com
I assume what I should do is make www and com stop words, and make sure the
field is tokenized, so it will break it up along the '.'
I thought I should take a list of English stop words, add in 'www' and 'com',
and then make sure the field is tokenized, which I did by using this
constructor:
new Field("name", "value", Field.Store.YES, Field.Index.ANALYZED).
I saw that Field.Index.ANALYZED meant it would be tokenized.
It is not working. Searching on fubar or fubar.com does not return it.
Thanks for any help.
--
View this message in context: http://old.nabble.com/Can%27t-get-tokenization-stop-works-working-tp27400546p27400546.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Can't get tokenization/stop works working
Posted by Digy <di...@gmail.com>.
Seeing "www.fubar.com" in the index means that your analyzer returns it as a
single token. To strip out "www" and "com", you have to use an analyzer that
returns the tokens "www", "fubar" and "com".
Try a different analyzer (or write your own, as below).
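// A C# (Lucene.Net) example
using System.IO;
using Lucene.Net.Analysis;

// Analyzer that splits text at every character that is not a letter or digit,
// then lowercases the resulting tokens.
public class LetterOrDigitAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream t = new LetterOrDigitTokenizer(reader);
        t = new LowerCaseFilter(t);
        return t;
    }
}

// Tokenizer that treats each run of letters or digits as one token,
// so "www.fubar.com" is split at each '.'.
public class LetterOrDigitTokenizer : CharTokenizer
{
    public LetterOrDigitTokenizer(TextReader input) : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c);
    }
}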
DIGY
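For readers on the Java side, the same idea can be sketched without any Lucene classes at all. The snippet below is only an illustration of the token stream such an analyzer would need to produce: it splits on anything that is not a letter or digit, lowercases, and then drops stop words. The class name TokenizeSketch, the method analyze and the two-word stop list are hypothetical names made up for this example; they are not part of Lucene or Lucene.Net.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; no Lucene API is used here.
final class TokenizeSketch {
    // Hypothetical stop list: just the two words discussed in this thread.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("www", "com"));

    // Split on every non letter-or-digit character, lowercase each token,
    // and drop tokens that appear in the stop list.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Run one step past the end so the final token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (inToken) {
                current.append(Character.toLowerCase(text.charAt(i)));
            } else if (current.length() > 0) {
                String token = current.toString();
                if (!STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("www.fubar.com")); // prints [fubar]
    }
}
```

If the tokens Luke shows for the field look like this output, a search for fubar has something to match. If Luke still shows www.fubar.com as one term, the tokenizer in the analyzer chain is not splitting on '.'.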
-----Original Message-----
From: jchang [mailto:jchangkihatest@gmail.com]
Sent: Tuesday, February 02, 2010 11:16 PM
To: java-user@lucene.apache.org
Subject: Re: Can't get tokenization/stop works working
I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.
Looking through Luke, I see that www.fubar.com was indexed, not fubar. So,
clearly, I'm not stripping out the stop words www and com. Any ideas?
Re: Can't get tokenization/stop works working
Posted by jchang <jc...@gmail.com>.
I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.
Looking through Luke, I see that www.fubar.com was indexed, not fubar. So,
clearly, I'm not stripping out the stop words www and com. Any ideas?
Re: Can't get tokenization/stop works working
Posted by Ian Lea <ia...@gmail.com>.
If you make com a stop word then you won't be able to search for it,
but a search for fubar should have worked. Are you sure your analyzer
is doing what you want? You don't tell us what analyzer you are
using.
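To make the stop word point concrete, here is a rough plain-Java sketch (no Lucene calls) that runs the same analysis over both the field value and the query, which is roughly what happens when one analyzer is used at index time and at search time. The names StopWordSearchSketch, analyze and matches are hypothetical, and treating a query as "every query term must be present" is a simplifying assumption, not how any particular query parser is guaranteed to behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; no Lucene API is used here.
final class StopWordSearchSketch {
    // Hypothetical stop list: the two words from the original question.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("www", "com"));

    // Lowercase, split on anything that is not a letter or digit,
    // and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Assumption: a query matches when it still has at least one term after
    // analysis and every remaining term occurs in the analyzed field.
    static boolean matches(String fieldValue, String query) {
        Set<String> indexed = new HashSet<>(analyze(fieldValue));
        List<String> queryTerms = analyze(query);
        return !queryTerms.isEmpty() && indexed.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String field = "see www.fubar.com for details";
        System.out.println(matches(field, "fubar"));     // prints true
        System.out.println(matches(field, "fubar.com")); // prints true
        System.out.println(matches(field, "com"));       // prints false
    }
}
```

The last case is the trade-off mentioned above: once com is a stop word it is removed from both the index and the query, so a search for com alone has nothing left to match.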
Tips:
use Luke to see what has been indexed
read the FAQ entry
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F
--
Ian.
On Mon, Feb 1, 2010 at 7:25 AM, jchang <jc...@gmail.com> wrote:
>
> I want to be able to store a doc with a field with this as a substring:
> www.fubar.com
> And then I want this document to get returned when I query on
> fubar or
> fubar.com
>
> I assume what I should do is make www and com stop words, and make sure the
> field is tokenized, so it will break it up along the '.'
>
> I thought I should take a list of English stop words, add in 'www' and 'com',
> and then make sure the field is tokenized, which I did by using this
> constructor:
> new Field("name", "value", Field.Store.YES, Field.Index.ANALYZED).
> I saw that Field.Index.ANALYZED meant it would be tokenized.
>
> It is not working. Searching on fubar or fubar.com does not return it.
> Thanks for any help.