Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/21 15:26:32 UTC

How to query/search Unicode docs in Lucene using Unicode text as the query?

Hi All,
I've indexed some non-English documents in Unicode (UTF-8). For both
indexing and searching I'm using SimpleAnalyzer. For English text,
single-word queries were working, so I thought of trying non-English
text. I typed the words (multiple words) into BabelMap (a Unicode
character map) to get the Unicode escape sequence for the text string,
and tried that as the query, but it didn't work. Earlier I used the
same method to query a Solr index, which uses Lucene on the back end.
For example, I tried this query:
\u0938\u0941\u0939\u093E\u0928\u093E\u0020\u0938\u092B\u093C\u0930
which is the Unicode-escaped form of some non-English text, but it
gives zero results in Lucene. I want to know what is going wrong. As I
understand it, at the end of the day Lucene stores my non-English text
as Unicode, so if I read the index it will contain these kinds of
characters on disk, right? So when I query with the same thing, it
should match. This used to work perfectly well with Solr, where I
indexed all documents as UTF-8 and the query was Unicode-escaped as
shown above. Can someone point out what is going wrong here?
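
For reference, here is roughly the whole setup as one self-contained
test. This is a sketch against the Lucene 2.x API that was current for
this thread; the field name "content" and the in-memory directory are
just for illustration. Note that the backslash-u escapes below are
decoded by the Java compiler, so the strings contain the real
Devanagari characters:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class UnicodeSearchTest {
        public static void main(String[] args) throws Exception {
            String text =
                "\u0938\u0941\u0939\u093E\u0928\u093E \u0938\u092B\u093C\u0930";
            RAMDirectory dir = new RAMDirectory();
            SimpleAnalyzer analyzer = new SimpleAnalyzer();

            // Index one document holding the Devanagari text.
            IndexWriter writer = new IndexWriter(dir, analyzer, true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("content", text, Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Search with the same characters, analyzed the same way.
            // If the query instead reaches Lucene as literal
            // backslash-u text (not decoded characters), it can never
            // match the indexed tokens.
            IndexSearcher searcher = new IndexSearcher(dir);
            Query query = new QueryParser("content", analyzer).parse(text);
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits: " + hits.totalHits);
            searcher.close();
        }
    }
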
Maybe I need to look at the analyzer Solr uses in its default
configuration (I used only the defaults, and I'm fairly sure the
default chain applies quite a few tokenizer/filter factories). Thanks
for all your time and help.

Thanks,
KK.

Re: How to query/search Unicode docs in Lucene using Unicode text as the query?

Posted by KK <di...@gmail.com>.
Thanks Muir.
I don't know what was going wrong, but I did a fresh build using
SimpleAnalyzer and it's working. I tried long sentences too, up to
10-12 words, and those are fine. So far so good...
I don't know anything about Unicode normalization yet. I'll read up on
it and see if I can make use of it.
That apart, I want a single indexer/searcher that can handle around
8-10 non-English (Indian) languages. So far I've tried up to 3
languages for indexing/searching with SimpleAnalyzer and it's working,
but I don't know whether it will hold up for the other Indian
languages as well.
Muir, if you have some pointers on doing Unicode normalization, please
let me know. If you think it might help, I'd definitely give it a try.
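
For reference, one way to wire normalization in. This is a sketch
against the same Lucene 2.x API, with made-up helper names; the key
point is that java.text.Normalizer (in the JDK since Java 6) is
applied identically on the indexing side and on the query side:

    import java.text.Normalizer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class NormalizedIndexing {
        static final WhitespaceAnalyzer ANALYZER = new WhitespaceAnalyzer();

        // One canonical form (NFC) for every language and script.
        static String nfc(String text) {
            return Normalizer.normalize(text, Normalizer.Form.NFC);
        }

        // Normalize document text before it is analyzed and indexed.
        static void addDoc(IndexWriter writer, String text) throws Exception {
            Document doc = new Document();
            doc.add(new Field("content", nfc(text), Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        // Normalize the user's query string exactly the same way.
        static Query buildQuery(String userQuery) throws Exception {
            return new QueryParser("content", ANALYZER).parse(nfc(userQuery));
        }
    }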

Thanks,
KK


Re: How to query/search Unicode docs in Lucene using Unicode text as the query?

Posted by Robert Muir <rc...@gmail.com>.
Hello, your example (Hindi) is probably suffering from a number of
search issues:

I don't recommend StandardAnalyzer for this example: it will break
words around the dependent vowels, the nukta dot, etc.
WhitespaceAnalyzer might be a good start.
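
A quick way to see the difference is to print the tokens each analyzer
emits for the text. This is a sketch against the pre-2.9 TokenStream
API that was current for this thread:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDump {
        // Print each token so you can see where the analyzer splits.
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream = analyzer.tokenStream("content",
                    new StringReader(text));
            Token token = new Token();
            while ((token = stream.next(token)) != null) {
                System.out.println("[" + token.term() + "]");
            }
        }

        public static void main(String[] args) throws Exception {
            String text =
                "\u0938\u0941\u0939\u093E\u0928\u093E \u0938\u092B\u093C\u0930";
            dump(new WhitespaceAnalyzer(), text); // whole space-separated words
            dump(new StandardAnalyzer(), text);   // may split at marks like the nukta
        }
    }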

Also, is it possible to apply Unicode normalization to your text
before indexing it? Normalization will standardize things in Indian
languages.

In your example, the pha + nukta dot you queried on is the normalized
form, but I wonder whether in your text it is encoded as fa (U+095E).
If you apply normalization form NFC, it will standardize to pha +
nukta dot.
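
This is easy to check with java.text.Normalizer from the JDK. U+095E
is a Unicode composition exclusion, so NFC expands it to the pha +
nukta pair rather than the other way around:

    import java.text.Normalizer;

    public class NfcCheck {
        public static void main(String[] args) {
            String fa = "\u095E"; // DEVANAGARI LETTER FA, precomposed
            String nfc = Normalizer.normalize(fa, Normalizer.Form.NFC);
            // NFC decomposes the exclusion to pha (U+092B) + nukta
            // (U+093C), matching the decomposed form used in the query.
            System.out.println(nfc.equals("\u092B\u093C")); // prints: true
        }
    }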


-- 
Robert Muir
rcmuir@gmail.com