Posted to java-user@lucene.apache.org by Andrey Grishin <gr...@softline.kiev.ua> on 2002/11/21 14:38:25 UTC

problems with search on Russian content

Hi All, 
I have a problem searching Russian content using Lucene 1.2.

I indexed the content using the Cp1251 charset:
------------
text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD,text));

------------
and I am searching using the same charset

String txt = "Анд";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

or 

Analyzer analyzer = new StandardAnalyzer();
String txt = "Андрей";
txt = new String(txt.getBytes("Cp1251"));
Query query = QueryParser.parse(txt, PortalHTMLDocument.CONTENT_FIELD, analyzer);

hits = searcher.search(query);


and Lucene can't find anything.
I also checked for a DecodeInterceptor in my server.xml; there isn't one.

I tried UTF-8 and UTF-16 and got the same result.

Also, if I list the index's contents by iterating over an IndexReader, I can see that my Russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else can be done here to fix this problem?

I would appreciate any help.
Thanks, Andrey.

P.S.
I am using Lucene 1.2, Tomcat 4.1.12, JDK 1.4.1 on Win2000 AS
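
A minimal sketch of what the re-encoding step in the snippets above does, assuming nothing beyond a modern JDK (7 or later, for StandardCharsets): new String(text.getBytes("Cp1251")) encodes the text as Cp1251 bytes and then decodes those bytes with the platform default charset, so the string survives unchanged only when the default happens to be Cp1251. The class name, the roundTrip helper, and the use of ISO-8859-1 as a stand-in default are illustrative and not part of the original post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // "windows-1251" is the canonical java.nio name for Cp1251.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Mimics new String(text.getBytes("Cp1251")) on a JVM whose default
    // charset is assumedDefault (hypothetical helper, for illustration only).
    static String roundTrip(String text, Charset assumedDefault) {
        byte[] bytes = text.getBytes(CP1251);
        return new String(bytes, assumedDefault);
    }

    public static void main(String[] args) {
        // The prefix from the post, written with Unicode escapes so the
        // source file's own encoding cannot affect the result.
        String original = "\u0410\u043d\u0434";

        // Default charset equal to Cp1251: the round trip changes nothing.
        System.out.println(roundTrip(original, CP1251).equals(original)); // true

        // Any other single-byte default (ISO-8859-1 here) yields different
        // characters, so the text no longer contains Cyrillic letters.
        System.out.println(
            roundTrip(original, StandardCharsets.ISO_8859_1).equals(original)); // false
    }
}
```

Java strings are already Unicode, so if the text was decoded correctly when it was read in, this step can usually be dropped. That is an observation about the JDK, not a confirmed fix for the problem described above.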

System Requirements and Installation for Ant1.4

Posted by Tian LUO <lu...@yahoo.co.uk>.
   Lucene runs with JDK 1.1 and later.
   If you're working with the Lucene source, you'll need to use Ant 1.4 or greater. Ant is implemented in Java and uses XML-based configuration files. You can get it at: http://jakarta.apache.org/ant/index.html
   Specifically, you can get the binary distributions at: http://jakarta.apache.org/builds/jakarta-ant/release/
   You'll need to download both the Ant binary distribution and the "optional" jar file.
   Install these according to the instructions at: http://jakarta.apache.org/ant/manual/index.html

Steps:

1) Download Lucene from Apache: http://jakarta.apache.org/builds/jakarta-lucene/release/

2) Change to the top-level directory of your Lucene installation. This directory contains the default.properties and build.xml files. You need to run ant from this location so it knows where to find them.

3) Add ant to your PATH and set ANT_HOME to the location of your Ant installation.

4) Run ant by typing 'ant' at the command prompt.

In addition to these steps, what else should I configure to make Lucene work properly with Ant 1.4? Thank you very much!

:)





Re: problems with search on Russian content

Posted by Andrey Grishin <gr...@softline.kiev.ua>.
I got the nightly build from CVS.

When I try to use IndexWriter this way:
writer = new IndexWriter(indexDirectory, new RussianAnalyzer("Cp1251".toCharArray()), true);
I get the following exception:
---------------------------------------------------------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException: 7
	at org.apache.lucene.analysis.ru.RussianAnalyzer.makeStopWords(RussianAnalyzer.java:521)
	at org.apache.lucene.analysis.ru.RussianAnalyzer.<init>(RussianAnalyzer.java:473)
---------------------------------------------------------------------------------------------------


When I try to use it this way:
writer = new IndexWriter(indexDirectory, new RussianAnalyzer("Cp1251".toCharArray(), new String[] {}), true);
I get the following exception:
---------------------------------------------------------------------------------------------------
2002-11-25 15:09:09,044 [ua.kiev.softline.services.searcher.index.PublishingIndexerImpl] INFO   - --Throwable in addArticle(): java.lang.ArrayIndexOutOfBoundsException: 8
java.lang.ArrayIndexOutOfBoundsException: 8
        at org.apache.lucene.analysis.ru.RussianStemmer.isVowel(RussianStemmer.java:991)
        at org.apache.lucene.analysis.ru.RussianStemmer.markPositions(RussianStemmer.java:909)
        at org.apache.lucene.analysis.ru.RussianStemmer.stem(RussianStemmer.java:1551)
        at org.apache.lucene.analysis.ru.RussianStemFilter.next(RussianStemFilter.java:189)
        at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
        at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:209)
        at ua.kiev.softline.services.searcher.index.PublishingIndexerImpl.addArticle(PublishingIndexerImpl.java:130)
---------------------------------------------------------------------------------------------------

When I commented out line 575 in RussianAnalyzer.java:
result = new RussianStemFilter(result, charset);
everything works fine - I can search (and find :)) Russian words...

Am I doing something wrong?

Regards, Andrey
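
One possible reading of the two traces above, offered as a sketch rather than a diagnosis: "Cp1251".toCharArray() is just the six characters of the charset's name, so any code that treats that array as a lookup table and reads position 7 or 8 will throw ArrayIndexOutOfBoundsException. Whether RussianAnalyzer uses its char[] argument that way is an assumption; the lookup and failsAt helpers below are hypothetical and do not reproduce Lucene's code.

```java
public class CharsetNameArray {
    // Hypothetical stand-in for code that reads a character table by position.
    static char lookup(char[] table, int index) {
        return table[index];
    }

    // Returns true if reading the given position throws.
    static boolean failsAt(char[] table, int index) {
        try {
            lookup(table, index);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The argument passed to RussianAnalyzer in the post: the letters
        // of the name "Cp1251", not a table of Russian characters.
        char[] passed = "Cp1251".toCharArray();

        System.out.println(passed.length);      // 6
        System.out.println(failsAt(passed, 5)); // false: last valid position
        System.out.println(failsAt(passed, 7)); // true: index from the first trace
        System.out.println(failsAt(passed, 8)); // true: index from the second trace
    }
}
```

If the analysis.ru package in the nightly build ships predefined charset tables (a RussianCharsets class with a CP1251 constant is what I would look for, though that name is an assumption here), passing one of those instead of the name's characters would be the thing to try.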



----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, November 21, 2002 5:35 PM
Subject: Re: problems with search on Russian content


> Look at CHANGES.txt document in CVS - there is some new stuff in
> org.apache.lucene.analysis.ru package that you will want to use.
> Get the Lucene from the nightly build...
>
> Otis


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: problems with search on Russian content

Posted by Karl Øie <ka...@gan.no>.
Sorry, my bad! Didn't read this informative post :-)

Regards, karl øie


On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote:

> Look at CHANGES.txt document in CVS - there is some new stuff in
> org.apache.lucene.analysis.ru package that you will want to use.
> Get the Lucene from the nightly build...
>
> Otis


Re: problems with search on Russian content

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Look at CHANGES.txt document in CVS - there is some new stuff in
org.apache.lucene.analysis.ru package that you will want to use.
Get the Lucene from the nightly build...

Otis
