Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2009/08/06 16:03:30 UTC

StandardFilter not handling dots as expected ?

Hi, I want the query "R.E.S" to match "R.E.S."

I use StandardFilter in my analyzer below and the description says:

    'Splits words at punctuation characters, removing punctuation. 
However, a dot that's not followed by whitespace is considered part of a 
token. '

so I thought that R.E.S. would become searchable as R.E.S and the 
search would work, but it doesn't, whereas searching for "R.E.S." does 
return a hit.

thanks Paul

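import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StandardUnaccentAnalyzer extends Analyzer {

    // Tokenize, then apply StandardFilter and LowerCaseFilter.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        TokenStream result = new StandardFilter(tokenStream);
        result = new LowerCaseFilter(result);
        return result;
    }

    // Holds the tokenizer and filter chain that is reused between calls.
    private static final class SavedStreams {
        StandardTokenizer tokenStream;
        TokenStream filteredTokenStream;
    }

    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            // First call: build the same chain as tokenStream() and save it.
            streams = new SavedStreams();
            setPreviousTokenStream(streams);
            streams.tokenStream = new StandardTokenizer(reader);
            streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
            streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
        } else {
            // Later calls: point the saved tokenizer at the new input.
            streams.tokenStream.reset(reader);
        }
        return streams.filteredTokenStream;
    }

}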

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: StandardFilter not handling dots as expected ?

Posted by Paul Taylor <pa...@fastmail.fm>.

Ian Lea wrote:
> See https://issues.apache.org/jira/browse/LUCENE-1068 which appears to
> be talking about the same sort of thing, and
> StandardAnalyzer.setReplaceInvalidAcronym(b).
>
> Quite how you deal with this in your own analyzer is left as an exercise ...
>   

Yes, I think you are right, though I don't understand it fully.


        TokenStream ts = analyzer.tokenStream("content", new StringReader("R.E.S."));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println("R.E.S. parsed to :" + t);
        }

        ts = analyzer.tokenStream("content", new StringReader("R.E.S"));
        while ((t = ts.next()) != null) {
            System.out.println("R.E.S parsed to :" + t);
        }

this code outputs

R.E.S. parsed to :(res,0,6,type=<ACRONYM>)
R.E.S parsed to :(r.e.s,0,5,type=<HOST>)

So from my perspective I cannot see why it thinks R.E.S is a HOST; it 
should be an acronym. Also, for the one that is an acronym, I thought it 
would end up as r.e.s, not res.
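
The two outputs above are consistent with an ACRONYM rule that requires a trailing dot, combined with StandardFilter removing the dots from tokens typed <ACRONYM>. What follows is only a minimal sketch of that behaviour using plain java.util.regex, not Lucene itself. The class name AcronymSketch, both patterns, and the fallback <ALPHANUM> type are illustrative assumptions and are not the tokenizer's actual grammar.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class AcronymSketch {
    // Assumed approximation of an acronym rule: letter-dot pairs, trailing dot required.
    public static final Pattern ACRONYM = Pattern.compile("[A-Za-z]\\.([A-Za-z]\\.)+");
    // Assumed approximation of a host-like rule: alphanumeric runs joined by single dots.
    public static final Pattern HOST = Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    // Classify a whole token; the type names mirror the ones printed in the thread.
    public static String type(String text) {
        if (ACRONYM.matcher(text).matches()) return "<ACRONYM>";
        if (HOST.matcher(text).matches()) return "<HOST>";
        return "<ALPHANUM>";
    }

    // Mimic the filter chain: dots are dropped only for acronyms, then lowercase.
    public static String term(String text) {
        String t = "<ACRONYM>".equals(type(text)) ? text.replace(".", "") : text;
        return t.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"R.E.S.", "R.E.S"}) {
            System.out.println(s + " -> " + term(s) + " " + type(s));
        }
    }
}
```

Under these assumptions "R.E.S." becomes the term res while "R.E.S" stays r.e.s, so a document indexed from the first form cannot be found by a query analyzed from the second.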


Re: StandardFilter not handling dots as expected ?

Posted by Ian Lea <ia...@gmail.com>.

See https://issues.apache.org/jira/browse/LUCENE-1068 which appears to
be talking about the same sort of thing, and
StandardAnalyzer.setReplaceInvalidAcronym(b).

Quite how you deal with this in your own analyzer is left as an exercise ...


--
Ian.
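
One possible way to approach that exercise is to normalise dotted initialisms to the same term at both index time and query time. The sketch below is plain Java with no Lucene dependency; the class DottedAcronymNormalizer, its pattern, and the normalize method are hypothetical names chosen for illustration. In a real analyzer this logic would presumably sit in a custom TokenFilter after the tokenizer, which is not shown here.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class DottedAcronymNormalizer {
    // Single letters separated by dots, with an optional trailing dot.
    private static final Pattern DOTTED = Pattern.compile("[A-Za-z](\\.[A-Za-z])+\\.?");

    // Strip the dots from dotted initialisms, leave other tokens alone, then lowercase.
    public static String normalize(String token) {
        String out = DOTTED.matcher(token).matches() ? token.replace(".", "") : token;
        return out.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("R.E.S."));
        System.out.println(normalize("R.E.S"));
        System.out.println(normalize("lucene.apache.org"));
    }
}
```

With this normalisation both "R.E.S." and "R.E.S" collapse to res, while a multi-letter host name such as lucene.apache.org is left intact. The trade-off is that any token made only of single letters and dots is treated as an acronym, which may not suit every data set.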




Re: StandardFilter not handling dots as expected ?

Posted by Paul Taylor <pa...@fastmail.fm>.

Paul Taylor wrote:
> Shai Erera wrote:
>> No actually I think that's how the ACRONYM rule is defined:
>> R.E.S. is detected as ACRONYM, and therefore is converted to RES.
>> R.E.S is not detected as ACRONYM and therefore remains as R.E.S
>> Hence the mismatch.
> Hi looking at your suggestion of 
> https://issues.apache.org/jira/browse/LUCENE-1068
>
> ACRONYM = {LETTER} "." ({LETTER} ".")+
>
> it demands a trailing dot, perhaps I need this instead
>
> ACRONYM = {LETTER} ("." {LETTER} )+
>
> Paul
OK, I raised an RFE at https://issues.apache.org/jira/browse/LUCENE-1787 
What do you think?


Paul


Re: StandardFilter not handling dots as expected ?

Posted by Paul Taylor <pa...@fastmail.fm>.

Shai Erera wrote:
> No actually I think that's how the ACRONYM rule is defined:
> R.E.S. is detected as ACRONYM, and therefore is converted to RES.
> R.E.S is not detected as ACRONYM and therefore remains as R.E.S
> Hence the mismatch.
Hi, looking at your suggestion of 
https://issues.apache.org/jira/browse/LUCENE-1068, the rule is:

ACRONYM = {LETTER} "." ({LETTER} ".")+

It demands a trailing dot; perhaps I need this instead:

ACRONYM = {LETTER} ("." {LETTER} )+

Paul
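
The difference between the two rules can be checked with ordinary regular expressions. The sketch below translates each rule into java.util.regex under the assumption that {LETTER} can be approximated by \p{L}; the class name AcronymRuleComparison and its constants are illustrative only, and it tests whole-string matches rather than reproducing the tokenizer's scanner.

```java
import java.util.regex.Pattern;

public class AcronymRuleComparison {
    // ACRONYM = {LETTER} "." ({LETTER} ".")+   (rule that demands a trailing dot)
    public static final Pattern CURRENT_RULE = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");
    // ACRONYM = {LETTER} ("." {LETTER})+       (alternative without a trailing dot)
    public static final Pattern PROPOSED_RULE = Pattern.compile("\\p{L}(\\.\\p{L})+");

    // True only if the rule matches the entire input string.
    public static boolean matches(Pattern rule, String text) {
        return rule.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"R.E.S.", "R.E.S"}) {
            System.out.println(s + " current=" + matches(CURRENT_RULE, s)
                    + " proposed=" + matches(PROPOSED_RULE, s));
        }
    }
}
```

As whole-string matches, the first rule accepts only "R.E.S." and the second accepts only "R.E.S". A real scanner picks the longest match among several competing rules, so this comparison shows what each rule can cover but does not by itself predict the tokens the tokenizer would emit.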



Re: StandardFilter not handling dots as expected ?

Posted by Paul Taylor <pa...@fastmail.fm>.

Shai Erera wrote:
> I see you index R.E.S. and search for R.E.S (note the dot that's 
> missing in the query at the end). Can you try to query w/ the dot?
Yes, if you search with the dot it works (I mentioned this in the first 
email), so it appears that when the field is being indexed it's not 
removing the last dot.

Re: StandardFilter not handling dots as expected ?

Posted by Shai Erera <se...@gmail.com>.

I see you index R.E.S. and search for R.E.S (note the dot that's missing in
the query at the end). Can you try to query w/ the dot?


Re: StandardFilter not handling dots as expected ?

Posted by Paul Taylor <pa...@fastmail.fm>.

Erick Erickson wrote:
> I don't see anything obvious in the code.
>
> Are you using the same analyzer at query time as at index time?
Yes, I do. I have now created a test case that fails:


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import junit.framework.TestCase;

public class RESTest extends TestCase {
    public void testMatchAcronymns() throws Exception {
        Analyzer analyzer = new StandardUnaccentAnalyzer();
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        doc.add(new Field("name", "R.E.S.", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Query q = new QueryParser("name", analyzer).parse("R.E.S");
        System.out.println(q.toString());
        Hits hits = searcher.search(q);
        assertEquals(1, hits.length());
    }
}
>
> I'd also get a copy of Luke and examine your index to see what
> is actually getting put in it, and query.toString might help.
>
q.toString() returns
name:r.e.s

Paul


Re: StandardFilter not handling dots as expected ?

Posted by Erick Erickson <er...@gmail.com>.

I don't see anything obvious in the code.
Are you using the same analyzer at query time as at index time?
I'd also get a copy of Luke and examine your index to see what
is actually getting put in it, and query.toString might help.

Best
Erick
