Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2009/08/06 16:03:30 UTC
StandardFilter not handling dots as expected?
Hi, I want the query "R.E.S" to match the indexed value "R.E.S."
I use StandardFilter in my analyzer below, and its description says:
'Splits words at punctuation characters, removing punctuation.
However, a dot that's not followed by whitespace is considered part of a
token.'
So I thought that R.E.S. would become searchable as R.E.S and the
search would work, but it doesn't, whereas searching for "R.E.S." (with
the trailing dot) does return a hit.
Thanks, Paul
public class StandardUnaccentAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        TokenStream result = new StandardFilter(tokenStream);
        result = new LowerCaseFilter(result);
        return result;
    }

    private static final class SavedStreams {
        StandardTokenizer tokenStream;
        TokenStream filteredTokenStream;
    }

    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            streams = new SavedStreams();
            setPreviousTokenStream(streams);
            streams.tokenStream = new StandardTokenizer(reader);
            streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
            streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
        } else {
            streams.tokenStream.reset(reader);
        }
        return streams.filteredTokenStream;
    }
}
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: StandardFilter not handling dots as expected?
Posted by Paul Taylor <pa...@fastmail.fm>.
Ian Lea wrote:
> See https://issues.apache.org/jira/browse/LUCENE-1068 which appears to
> be talking about the same sort of thing, and
> StandardAnalyzer.setReplaceInvalidAcronym(b).
>
> Quite how you deal with this in your own analyzer is left as an exercise ...
>
Yes, I think you are right, though I don't understand it fully.
TokenStream ts = analyzer.tokenStream("content", new StringReader("R.E.S."));
Token t;
while ((t = ts.next()) != null) {
    System.out.println("R.E.S. parsed to: " + t);
}
ts = analyzer.tokenStream("content", new StringReader("R.E.S"));
while ((t = ts.next()) != null) {
    System.out.println("R.E.S parsed to: " + t);
}
This code outputs:
R.E.S. parsed to :(res,0,6,type=<ACRONYM>)
R.E.S parsed to :(r.e.s,0,5,type=<HOST>)
So from my perspective I cannot see why it thinks R.E.S is a HOST; it
should be an acronym. And for the one that is detected as an acronym, I
thought it would end up as r.e.s, not res.
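As a footnote for anyone hitting the same thing: the split can be reproduced without Lucene. The sketch below approximates StandardTokenizer's ACRONYM and HOST rules with java.util.regex; these patterns are my simplification, not the actual JFlex grammar:

```java
import java.util.regex.Pattern;

public class TokenTypeDemo {
    // Rough approximation of the ACRONYM rule:
    // {LETTER} "." ({LETTER} ".")+  -- note the required trailing dot.
    static final Pattern ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");
    // Rough approximation of the HOST rule: dot-separated alphanumeric runs,
    // with no trailing dot.
    static final Pattern HOST = Pattern.compile("\\p{Alnum}+(\\.\\p{Alnum}+)+");

    static String classify(String token) {
        if (ACRONYM.matcher(token).matches()) return "ACRONYM";
        if (HOST.matcher(token).matches()) return "HOST";
        return "OTHER";
    }

    public static void main(String[] args) {
        System.out.println(classify("R.E.S.")); // ACRONYM
        System.out.println(classify("R.E.S"));  // HOST
    }
}
```

Because the ACRONYM rule only fires with the trailing dot, "R.E.S." is classified as an acronym (and StandardFilter then strips its dots, giving res), while "R.E.S" falls through to HOST and keeps its dots as r.e.s.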
Re: StandardFilter not handling dots as expected?
Posted by Ian Lea <ia...@gmail.com>.
See https://issues.apache.org/jira/browse/LUCENE-1068 which appears to
be talking about the same sort of thing, and
StandardAnalyzer.setReplaceInvalidAcronym(b).
Quite how you deal with this in your own analyzer is left as an exercise ...
--
Ian.
On Thu, Aug 6, 2009 at 3:45 PM, Paul Taylor<pa...@fastmail.fm> wrote:
> Erick Erickson wrote:
>>
>> I don't see anything obvious in the code.
>>
>> Are you using the same analyzer at query time as at index time?
>
> Yes, I do. I have created a testcase now that fails:
>
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.queryParser.QueryParser;
> import junit.framework.TestCase;
>
> public class RESTest extends TestCase {
>     public void testMatchAcronymns() throws Exception {
>         Analyzer analyzer = new StandardUnaccentAnalyzer();
>         RAMDirectory dir = new RAMDirectory();
>         IndexWriter writer = new IndexWriter(dir, analyzer, true,
>                 IndexWriter.MaxFieldLength.LIMITED);
>         Document doc = new Document();
>         doc.add(new Field("name", "R.E.S.", Field.Store.YES,
>                 Field.Index.ANALYZED));
>         writer.addDocument(doc);
>         writer.close();
>
>         IndexSearcher searcher = new IndexSearcher(dir);
>         Query q = new QueryParser("name", analyzer).parse("R.E.S");
>         System.out.println(q.toString());
>         Hits hits = searcher.search(q);
>         assertEquals(1, hits.length());
>     }
> }
>>
>> I'd also get a copy of Luke and examine your index to see what
>> is actually getting put in it, and query.toString might help.
>>
> Query to string returns
> name:r.e.s
>
> Paul
>
Re: StandardFilter not handling dots as expected?
Posted by Paul Taylor <pa...@fastmail.fm>.
Paul Taylor wrote:
> Shai Erera wrote:
>> No actually I think that's how the ACRONYM rule is defined:
>> R.E.S. is detected as ACRONYM, and therefore is converted to RES.
>> R.E.S is not detected as ACRONYM and therefore remains as R.E.S
>> Hence the mismatch.
> Hi, looking at your suggestion of
> https://issues.apache.org/jira/browse/LUCENE-1068
>
> ACRONYM = {LETTER} "." ({LETTER} ".")+
>
> it demands a trailing dot; perhaps I need this instead:
>
> ACRONYM = {LETTER} ("." {LETTER} )+
>
> Paul
OK, I raised an RFE at https://issues.apache.org/jira/browse/LUCENE-1787.
What do you think?
Paul
Re: StandardFilter not handling dots as expected?
Posted by Paul Taylor <pa...@fastmail.fm>.
Shai Erera wrote:
> No actually I think that's how the ACRONYM rule is defined:
> R.E.S. is detected as ACRONYM, and therefore is converted to RES.
> R.E.S is not detected as ACRONYM and therefore remains as R.E.S
> Hence the mismatch.
Hi, looking at your suggestion of
https://issues.apache.org/jira/browse/LUCENE-1068
ACRONYM = {LETTER} "." ({LETTER} ".")+
it demands a trailing dot; perhaps I need this instead:
ACRONYM = {LETTER} ("." {LETTER} )+
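For comparison, here is a sketch of the two rules as java.util.regex patterns ({LETTER} approximated by \p{L}); this only illustrates the patterns, it is not the generated tokenizer:

```java
import java.util.regex.Pattern;

public class AcronymRuleCompare {
    // Current rule: ACRONYM = {LETTER} "." ({LETTER} ".")+  (trailing dot required)
    static final Pattern CURRENT = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");
    // Proposed rule: ACRONYM = {LETTER} ("." {LETTER})+     (no trailing dot)
    static final Pattern PROPOSED = Pattern.compile("\\p{L}(\\.\\p{L})+");

    public static void main(String[] args) {
        System.out.println(CURRENT.matcher("R.E.S.").matches());  // true
        System.out.println(CURRENT.matcher("R.E.S").matches());   // false
        System.out.println(PROPOSED.matcher("R.E.S").matches());  // true
        System.out.println(PROPOSED.matcher("R.E.S.").matches()); // false
    }
}
```

Note that the proposed rule on its own no longer matches the trailing-dot form, so the grammar would probably need the trailing dot to be optional, e.g. {LETTER} ("." {LETTER})+ "."?, to cover both spellings.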
Paul
Re: StandardFilter not handling dots as expected?
Posted by Paul Taylor <pa...@fastmail.fm>.
Shai Erera wrote:
> I see you index R.E.S. and search for R.E.S (note the dot that's
> missing in the query at the end). Can you try to query w/ the dot?
Yes, if you search with the dot it works (I mentioned this in the first
email), so it appears that when the field is being indexed it's not
removing the last dot.
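One possible workaround until the grammar changes (a sketch of the idea only; AcronymNormalizer is a hypothetical helper, not a Lucene class) is to collapse dotted acronyms to a single form before the text reaches both the indexer and the query parser:

```java
import java.util.regex.Pattern;

public class AcronymNormalizer {
    // Single letters separated by dots, with an optional trailing dot:
    // matches both "R.E.S." and "R.E.S", but not "example.com".
    private static final Pattern ACRONYM_LIKE =
            Pattern.compile("\\p{L}(\\.\\p{L})+\\.?");

    // Collapse dotted acronyms to a bare lowercase form; leave other text alone.
    public static String normalize(String token) {
        if (ACRONYM_LIKE.matcher(token).matches()) {
            return token.replace(".", "").toLowerCase();
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(normalize("R.E.S."));      // res
        System.out.println(normalize("R.E.S"));       // res
        System.out.println(normalize("example.com")); // example.com (unchanged)
    }
}
```

Applied consistently at index and query time, "R.E.S." and "R.E.S" both end up as the term res, so they match regardless of how the tokenizer classifies them.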
>
> On Thu, Aug 6, 2009 at 5:45 PM, Paul Taylor <paul_t100@fastmail.fm> wrote:
>
> Erick Erickson wrote:
>
> I don't see anything obvious in the code.
>
> Are you using the same analyzer at query time as at index time?
>
> Yes, I do. I have created a testcase now that fails:
>
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.queryParser.QueryParser;
> import junit.framework.TestCase;
>
> public class RESTest extends TestCase {
>     public void testMatchAcronymns() throws Exception {
>         Analyzer analyzer = new StandardUnaccentAnalyzer();
>         RAMDirectory dir = new RAMDirectory();
>         IndexWriter writer = new IndexWriter(dir, analyzer, true,
>                 IndexWriter.MaxFieldLength.LIMITED);
>         Document doc = new Document();
>         doc.add(new Field("name", "R.E.S.", Field.Store.YES,
>                 Field.Index.ANALYZED));
>         writer.addDocument(doc);
>         writer.close();
>
>         IndexSearcher searcher = new IndexSearcher(dir);
>         Query q = new QueryParser("name", analyzer).parse("R.E.S");
>         System.out.println(q.toString());
>         Hits hits = searcher.search(q);
>         assertEquals(1, hits.length());
>     }
> }
>
>
> I'd also get a copy of Luke and examine your index to see what
> is actually getting put in it, and query.toString might help.
>
> Query to string returns
> name:r.e.s
>
> Paul
>
>
Re: StandardFilter not handling dots as expected?
Posted by Shai Erera <se...@gmail.com>.
I see you index R.E.S. and search for R.E.S (note the dot that's missing in
the query at the end). Can you try to query w/ the dot?
On Thu, Aug 6, 2009 at 5:45 PM, Paul Taylor <pa...@fastmail.fm> wrote:
> Erick Erickson wrote:
>
>> I don't see anything obvious in the code.
>>
>> Are you using the same analyzer at query time as at index time?
>>
> Yes, I do. I have created a testcase now that fails:
>
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.queryParser.QueryParser;
> import junit.framework.TestCase;
>
> public class RESTest extends TestCase {
>     public void testMatchAcronymns() throws Exception {
>         Analyzer analyzer = new StandardUnaccentAnalyzer();
>         RAMDirectory dir = new RAMDirectory();
>         IndexWriter writer = new IndexWriter(dir, analyzer, true,
>                 IndexWriter.MaxFieldLength.LIMITED);
>         Document doc = new Document();
>         doc.add(new Field("name", "R.E.S.", Field.Store.YES,
>                 Field.Index.ANALYZED));
>         writer.addDocument(doc);
>         writer.close();
>
>         IndexSearcher searcher = new IndexSearcher(dir);
>         Query q = new QueryParser("name", analyzer).parse("R.E.S");
>         System.out.println(q.toString());
>         Hits hits = searcher.search(q);
>         assertEquals(1, hits.length());
>     }
> }
>
>>
>> I'd also get a copy of Luke and examine your index to see what
>> is actually getting put in it, and query.toString might help.
>>
> Query to string returns
> name:r.e.s
>
> Paul
>
>
Re: StandardFilter not handling dots as expected?
Posted by Paul Taylor <pa...@fastmail.fm>.
Erick Erickson wrote:
> I don't see anything obvious in the code.
>
> Are you using the same analyzer at query time as at index time?
Yes, I do. I have created a testcase now that fails:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import junit.framework.TestCase;
public class RESTest extends TestCase {
    public void testMatchAcronymns() throws Exception {
        Analyzer analyzer = new StandardUnaccentAnalyzer();
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        doc.add(new Field("name", "R.E.S.", Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Query q = new QueryParser("name", analyzer).parse("R.E.S");
        System.out.println(q.toString());
        Hits hits = searcher.search(q);
        assertEquals(1, hits.length());
    }
}
>
> I'd also get a copy of Luke and examine your index to see what
> is actually getting put in it, and query.toString might help.
>
Query to string returns
name:r.e.s
Paul
Re: StandardFilter not handling dots as expected?
Posted by Erick Erickson <er...@gmail.com>.
I don't see anything obvious in the code.
Are you using the same analyzer at query time as at index time?
I'd also get a copy of Luke and examine your index to see what
is actually getting put in it, and query.toString might help.
Best
Erick
On Thu, Aug 6, 2009 at 10:03 AM, Paul Taylor <pa...@fastmail.fm> wrote:
>
> Hi, I want the query "R.E.S" to match the indexed value "R.E.S."
>
> I use StandardFilter in my analyzer below, and its description says:
>
> 'Splits words at punctuation characters, removing punctuation. However, a
> dot that's not followed by whitespace is considered part of a token.'
>
> So I thought that R.E.S. would become searchable as R.E.S and the search
> would work, but it doesn't, whereas searching for "R.E.S." (with the
> trailing dot) does return a hit.
>
> Thanks, Paul
>
> public class StandardUnaccentAnalyzer extends Analyzer {
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         StandardTokenizer tokenStream = new StandardTokenizer(reader);
>         TokenStream result = new StandardFilter(tokenStream);
>         result = new LowerCaseFilter(result);
>         return result;
>     }
>
>     private static final class SavedStreams {
>         StandardTokenizer tokenStream;
>         TokenStream filteredTokenStream;
>     }
>
>     public TokenStream reusableTokenStream(String fieldName, Reader reader)
>             throws IOException {
>         SavedStreams streams = (SavedStreams) getPreviousTokenStream();
>         if (streams == null) {
>             streams = new SavedStreams();
>             setPreviousTokenStream(streams);
>             streams.tokenStream = new StandardTokenizer(reader);
>             streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
>             streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
>         } else {
>             streams.tokenStream.reset(reader);
>         }
>         return streams.filteredTokenStream;
>     }
> }
>