Posted to java-user@lucene.apache.org by jchang <jc...@gmail.com> on 2010/02/01 08:25:26 UTC

Can't get tokenization/stop words working

I want to be able to store a doc with a field with this as a substring:
  www.fubar.com
And then I want this document to get returned when I query on
  fubar or
  fubar.com

I assume what I should do is make www and com stop words, and make sure the
field is tokenized, so it will break it up along the '.'

I thought I should take a list of English stop words, add in 'www' and 'com',
and then make sure the field is tokenized, which I did by using this
constructor:
new Field("name", "value", Field.Store.YES, Field.Index.ANALYZED).
I saw that Field.Index.ANALYZED meant it would be tokenized.

It is not working.  Searching on fubar or fubar.com does not return it. 
Thanks for any help.
-- 
View this message in context: http://old.nabble.com/Can%27t-get-tokenization-stop-works-working-tp27400546p27400546.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Can't get tokenization/stop words working

Posted by Digy <di...@gmail.com>.

Seeing "www.fubar.com" in the index means that your analyzer returns it as a
single token. To strip out "www" and "com", you have to use an analyzer that
returns "www", "fubar" and "com" as separate tokens.

Try a different analyzer (or write your own, as below).

 

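    // a C# example
    using System.IO;
    using Lucene.Net.Analysis;

    // Analyzer that splits text on any character that is not a letter or
    // digit, then lowercases each token.
    public class LetterOrDigitAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new LetterOrDigitTokenizer(reader);
            t = new LowerCaseFilter(t);
            return t;
        }
    }

    // Tokenizer that treats each run of letters and digits as one token, so
    // "www.fubar.com" becomes "www", "fubar" and "com".
    public class LetterOrDigitTokenizer : CharTokenizer
    {
        public LetterOrDigitTokenizer(TextReader input) : base(input)
        {
        }

        protected override bool IsTokenChar(char c)
        {
            return char.IsLetterOrDigit(c);
        }
    }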
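
For readers on the Java side of the list, the splitting rule used by the C# tokenizer above can be sketched in plain Java. This is only an illustration of the rule (keep runs of letters and digits, lowercase them), not a Lucene analyzer; the class name LetterOrDigitSplit and the method tokenize are hypothetical and use nothing from the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer above: no Lucene classes involved.
class LetterOrDigitSplit {

    // Returns the lowercase runs of letters/digits found in the text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Still inside a token: collect the lowercased character.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Hit a separator such as '.' or ' ': close the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [visit, www, fubar, com, today]
        System.out.println(tokenize("Visit www.fubar.com today"));
    }
}
```

With a rule like this, "www.fubar.com" is no longer one token, so a stop-word step placed after it has separate "www" and "com" tokens to remove.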

 

 

DIGY

 

-----Original Message-----
From: jchang [mailto:jchangkihatest@gmail.com] 
Sent: Tuesday, February 02, 2010 11:16 PM
To: java-user@lucene.apache.org
Subject: Re: Can't get tokenization/stop words working

I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.

Looking through Luke, I see that www.fubar.com was indexed, not fubar.  So,
clearly, I'm not stripping out the stop words of www and com.  Any ideas?

Re: Can't get tokenization/stop words working

Posted by jchang <jc...@gmail.com>.

I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.

Looking through Luke, I see that www.fubar.com was indexed, not fubar.  So,
clearly, I'm not stripping out the stop words of www and com.  Any ideas?



Re: Can't get tokenization/stop words working

Posted by Ian Lea <ia...@gmail.com>.

If you make com a stop word then you won't be able to search for it,
but a search for fubar should have worked.  Are you sure your analyzer
is doing what you want?  You don't tell us what analyzer you are
using.
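
The point about stop words can be illustrated with a small plain-Java sketch. It assumes the field text and the query go through the same analysis (split on non-alphanumeric characters, lowercase, drop stop words); that is an assumption made for the example, not something the thread confirms about the poster's setup. The names StopWordSketch, analyze and matches are hypothetical, the stop list holds only the two words from the question, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: plain Java, not the Lucene API.
class StopWordSketch {

    // Assumed stop list: just the two words the original poster wants dropped.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("www", "com"));

    // Split on anything that is not a letter or number, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // A query matches when it still has terms after analysis and the
    // analyzed field text contains all of them.
    static boolean matches(String fieldText, String query) {
        List<String> queryTerms = analyze(query);
        return !queryTerms.isEmpty() && analyze(fieldText).containsAll(queryTerms);
    }

    public static void main(String[] args) {
        System.out.println(matches("www.fubar.com", "fubar"));     // prints true
        System.out.println(matches("www.fubar.com", "fubar.com")); // prints true
        System.out.println(matches("www.fubar.com", "com"));       // prints false
    }
}
```

Under these assumptions a query for "com" alone is left with no terms after analysis, which is the trade-off described above, while "fubar" and "fubar.com" both reduce to the single term "fubar".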

Tips:
  use Luke to see what has been indexed
  read the FAQ entry
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F


--
Ian.

On Mon, Feb 1, 2010 at 7:25 AM, jchang <jc...@gmail.com> wrote:
>
> I want to be able to store a doc with a field with this as a substring:
>  www.fubar.com
> And then I want this document to get returned when I query on
>  fubar or
>  fubar.com
>
> I assume what I should do is make www and com stop words, and make sure the
> field is tokenized, so it will break it up along the '.'
>
> I thought I should take a list of English stop words, add in 'www' and 'com',
> and then make sure the field is tokenized, which I did by using this
> constructor:
> new Field("name", "value", Field.Store.YES, Field.Index.ANALYZED).
> I saw that Field.Index.ANALYZED meant it would be tokenized.
>
> It is not working.  Searching on fubar or fubar.com does not return it.
> Thanks for any help.