Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2009/08/22 13:17:43 UTC

Query parser fails on Hangul/Korean

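import junit.framework.TestCase;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class Issue3341Test extends TestCase {

    public void testMatchHangul() throws Exception {
        // Index a single document whose "name" field holds the Hangul text.
        Analyzer analyzer = new StandardAnalyzer();
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        doc.add(new Field("name", "키드갱", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search for exactly the same text with the same analyzer.
        IndexSearcher searcher = new IndexSearcher(dir, true);
        Query q = new QueryParser("name", analyzer).parse("키드갱");
        System.out.println(q.toString());

        Hits hits = searcher.search(q);
        assertEquals(1, hits.length());
    }

}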

gives the following:

org.apache.lucene.queryParser.ParseException: Cannot parse '???': '*' or '?' not allowed as first character in WildcardQuery
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:181)
    at org.musicbrainz.search.analysis.Issue3341Test.testMatchHangul(Issue3341Test.java:32)

Why does the parser think it's a wildcard?
(I'm just using the standard analyzer, because the search could be 
performed in any language, but the user doesn't specify the language, so 
we don't know which analyzer to use. That's okay, I don't expect Lucene 
to do anything clever, but I would expect a match when the indexed text 
and the query are identical.)


thanks Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Query parser fails on Hangul/Korean

Posted by Simon Willnauer <si...@googlemail.com>.
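Paul, my first guess would be that your source file encoding is set to
something other than UTF-8. Those characters should be supported by
Lucene: none of them needs more than 16 bits, so I don't see why this
should be caused by Lucene. Note that the exception reports the query as
'???', so the Hangul was already replaced by question marks before the
parser saw it.

I'm pretty sure that's an encoding issue. Are you running on Windows?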

hope that helps,

simon
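
A minimal sketch of the effect described above, using only the JDK and no Lucene. The class name HangulEncodingDemo and the helper roundTrip are made up for illustration, and US-ASCII merely stands in for whatever non-UTF-8 charset might be involved; this shows one way Hangul can degrade to '???', not necessarily the exact path taken in Paul's build. Once the literal has become '???', the parser sees a leading '?', which matches the "not allowed as first character in WildcardQuery" message in the exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HangulEncodingDemo {

    // The three Hangul syllables from the test, written as Unicode escapes so
    // this source file is pure ASCII and compiles the same under any encoding.
    public static final String HANGUL = "\uD0A4\uB4DC\uAC31";

    // Encode the text with the given charset and decode it again with the same
    // charset. String.getBytes(Charset) substitutes the charset's replacement
    // byte for any character the charset cannot represent.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // UTF-8 can represent Hangul, so the text survives unchanged.
        System.out.println(roundTrip(HANGUL, StandardCharsets.UTF_8).equals(HANGUL)); // prints "true"

        // US-ASCII cannot, so each syllable is replaced by a question mark.
        System.out.println(roundTrip(HANGUL, StandardCharsets.US_ASCII)); // prints "???"
    }
}
```

If this is what is happening, two things are worth trying: compile with an explicit "javac -encoding UTF-8" (and make sure the editor really saves the file as UTF-8), or write the literal with Unicode escapes as in HANGUL above so the source no longer depends on the file encoding at all.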

