Posted to java-user@lucene.apache.org by Liaqat Ali <li...@gmail.com> on 2007/12/26 20:36:38 UTC

StopWords problem

Hi, Doron Cohen

Thanks for your reply, but I am facing a small problem here. Since I am
using Notepad for coding, in which format should the file be saved?


public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" ,"کا" 
,"کو" ,"ہے" };

Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


If I save it in ANSI format it loses the contents. I tried Unicode, but
it does not work, and I also tried UTF-8, but that generates two errors
about illegal characters. What should the solution be? Kindly guide me
on this.

Thanks ..



Re: StopWords problem

Posted by Doron Cohen <cd...@gmail.com>.
Try printing all these after you close the writer:
- ((FSDirectory) dir).getFile().getAbsolutePath()
- dir.list().length  (n)
- dir.list()[0], ..., dir.list()[n-1]

This should at least help you verify that an index was created and where.

Regards,
Doron
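
A minimal sketch of that check, assuming the Lucene 2.x API used
elsewhere in this thread (the index path is hypothetical):

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class VerifyIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.getDirectory("D:\\UIR\\index");
            // Absolute path of the directory backing the index.
            System.out.println(((FSDirectory) dir).getFile().getAbsolutePath());
            // Number of files in the index, then each file name.
            String[] files = dir.list();
            System.out.println(files.length + " files");
            for (int i = 0; i < files.length; i++) {
                System.out.println(files[i]);
            }
            dir.close();
        }
    }

If the directory turns out to be empty, the writer was probably opened
against a different path than the one being inspected.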

On Dec 27, 2007 12:26 PM, Liaqat Ali <li...@gmail.com> wrote:

> Doron Cohen wrote:
> > On Dec 27, 2007 11:49 AM, Liaqat Ali <li...@gmail.com> wrote:
> >
> >
> >> I got your point. The given program does not produce any error during
> >> compilation, and it runs well. But it does not create any index.
> >> When StandardAnalyzer() is called without the stop-word list it
> >> works well, but when the list of stop words is passed as an argument,
> >> it does not.
> >>
> >
> >
> > Hi Liaqat, I am confused, are you saying that the program creates no
> > index when stopwords are used?
> >
> > All this time I thought the problem you get is that stopwords are
> indexed
> > as if they were regular words, but now you say no index is created..
> >
> > Is there any exception thrown?
> > Do you see that there is no index to be found on the file system?
> > Or do you mean after closing the IndexWriter and opening an
> > IndexReader or IndexSearcher its numDocs is 0?
> > Or perhaps the index contains documents but your query search
> > finds nothing?
> >
> > Again, a stand-alone Java program that demonstrates the problem
> > would be best, and would save time for you and others. Lucene's JUnit
> > tests are good examples of short programs that demonstrate
> > a problem and fail until the problem is fixed.
> >
> > Regards, Doron
> >
> >
> Thanks a lot, Doron. I was confused too; the index I saw earlier was
> created by a previous program. But this time, when I run this program
> (with stop words as an argument), it does not create an index in the
> given directory; there is nothing inside the specified directory.
> There is no error during compilation or execution. And I am
> using LUKE for retrieval. So kindly suggest some guidelines.
>
> Regards,
> Liaqat
>

Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Doron Cohen wrote:
> On Dec 27, 2007 11:49 AM, Liaqat Ali <li...@gmail.com> wrote:
>
>   
>> I got your point. The given program does not produce any error during
>> compilation, and it runs well. But it does not create any index.
>> When StandardAnalyzer() is called without the stop-word list it
>> works well, but when the list of stop words is passed as an argument,
>> it does not.
>>     
>
>
> Hi Liaqat, I am confused, are you saying that the program creates no
> index when stopwords are used?
>
> All this time I thought the problem you get is that stopwords are indexed
> as if they were regular words, but now you say no index is created..
>
> Is there any exception thrown?
> Do you see that there is no index to be found on the file system?
> Or do you mean after closing the IndexWriter and opening an
> IndexReader or IndexSearcher its numDocs is 0?
> Or perhaps the index contains documents but your query search
> finds nothing?
>
> Again, a stand-alone Java program that demonstrates the problem
> would be best, and would save time for you and others. Lucene's JUnit
> tests are good examples of short programs that demonstrate
> a problem and fail until the problem is fixed.
>
> Regards, Doron
>
>   
Thanks a lot, Doron. I was confused too; the index I saw earlier was
created by a previous program. But this time, when I run this program
(with stop words as an argument), it does not create an index in the
given directory; there is nothing inside the specified directory.
There is no error during compilation or execution. And I am
using LUKE for retrieval. So kindly suggest some guidelines.

Regards,
Liaqat



Re: StopWords problem

Posted by Doron Cohen <cd...@gmail.com>.
On Dec 27, 2007 11:49 AM, Liaqat Ali <li...@gmail.com> wrote:

> I got your point. The given program does not produce any error during
> compilation, and it runs well. But it does not create any index.
> When StandardAnalyzer() is called without the stop-word list it
> works well, but when the list of stop words is passed as an argument,
> it does not.


Hi Liaqat, I am confused, are you saying that the program creates no
index when stopwords are used?

All this time I thought the problem you get is that stopwords are indexed
as if they were regular words, but now you say no index is created..

Is there any exception thrown?
Do you see that there is no index to be found on the file system?
Or do you mean after closing the IndexWriter and opening an
IndexReader or IndexSearcher its numDocs is 0?
Or perhaps the index contains documents but your query search
finds nothing?

Again, a stand-alone Java program that demonstrates the problem
would be best, and would save time for you and others. Lucene's JUnit
tests are good examples of short programs that demonstrate
a problem and fail until the problem is fixed.

Regards, Doron

Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Doron Cohen wrote:
> This is not a self-contained program: it is incomplete, and it depends
> on files on *your* disk...
>
> Still, can you show why you're saying it indexes stop words?
> Can you print a few samples of IndexReader.terms().term() here?
>
> BR, Doron
>
> On Dec 27, 2007 10:22 AM, Liaqat Ali <li...@gmail.com> wrote:
>
>   
>> The whole program is given below. But it does not eliminate stop words
>> from the index.
>>
>>                Document document  = new Document();
>>            document.add(new Field("contents",sb.toString(),
>> Field.Store.NO, Field.Index.TOKENIZED));
>>
>>
>>
>>
>>     
>
> Hi, Doron
>   
I got your point. The given program does not produce any error during
compilation, and it runs well. But it does not create any index.
When StandardAnalyzer() is called without the stop-word list it
works well, but when the list of stop words is passed as an argument,
it does not.



import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;


public class urduIndexer1 {

    Reader file;
    BufferedReader buff;
    String line;
    IndexWriter writer;
    String indexDir;
    Directory dir;

    public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
            ,"کے" ,"نے" ,"پر" ,"اور" ,"سے" ,"میں" ,"بھی"
            ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };

    public void index() throws IOException, UnsupportedEncodingException {

        indexDir = "D:\\UIR\\in";
        Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
        boolean createFlag = true;

        dir = FSDirectory.getDirectory(indexDir);
        writer = new IndexWriter(dir, analyzer, createFlag);

        for (int i = 1; i <= 201; i++) {

            // Read each document with an explicit UTF-8 decoder.
            file = new InputStreamReader(new FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");
            buff = new BufferedReader(file);

            StringBuffer sb = new StringBuffer();
            while ((line = buff.readLine()) != null) {
                sb.append(line);
            }

            Document document = new Document();
            document.add(new Field("contents", sb.toString(), Field.Store.NO, Field.Index.TOKENIZED));

            //document.add(new Field("Document", "Doc" + i, Field.Store.YES, Field.Index.TOKENIZED));

            writer.addDocument(document);

            buff.close();
        }

        writer.optimize();
        writer.close();
    }

    public static void main(String arg[]) throws Exception {
        urduIndexer1 indx = new urduIndexer1();
        indx.index();
    }
}




Re: StopWords problem

Posted by Doron Cohen <cd...@gmail.com>.
This is not a self-contained program: it is incomplete, and it depends
on files on *your* disk...

Still, can you show why you're saying it indexes stop words?
Can you print a few samples of IndexReader.terms().term() here?

BR, Doron

On Dec 27, 2007 10:22 AM, Liaqat Ali <li...@gmail.com> wrote:

>
> The whole program is given below. But it does not eliminate stop words
> from the index.
>
>                Document document  = new Document();
>            document.add(new Field("contents",sb.toString(),
> Field.Store.NO, Field.Index.TOKENIZED));
>
>
>
>

Re: StopWords problem

Posted by "N. Hira" <nh...@cognocys.com>.
Hi Liaqat,

Are you sure that the Urdu characters are being correctly interpreted  
by the JVM even during the file I/O operation?

I would expect Unicode characters to be encoded as multi-byte
sequences, and so the string-matching operations would fail (if the
literals' encoding is different from the file encoding).

Can you try out a simple indexOf() to confirm that this is not going on?
E.g., for a document where you know a stop word occurs, print out the  
value of:
	line.indexOf(URDU_STOP_WORDS[1])
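
A minimal sketch of this check, with the stop word written as Unicode
escapes (U+06A9 U+06D2, i.e. "کے") so the test itself cannot be broken
by source-file encoding; the document path is hypothetical:

    import java.io.*;

    public class EncodingCheck {
        public static void main(String[] args) throws IOException {
            String stopWord = "\u06A9\u06D2";
            BufferedReader buff = new BufferedReader(new InputStreamReader(
                    new FileInputStream("corpus\\doc1.txt"), "UTF-8"));
            String line;
            while ((line = buff.readLine()) != null) {
                // -1 on every line means the literal never matches the file's text.
                System.out.println(line.indexOf(stopWord));
            }
            buff.close();
        }
    }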

Regards,

-h
----------------------------------------------------------------------
Hira, N.R.
Solutions Architect
Cognocys, Inc.

On 27-Dec-2007, at 2:22 AM, Liaqat Ali wrote:

> Doron Cohen wrote:
>> Hi Liaqat,
>>
>> This part of the code seems correct and should work, so the problem
>> must be elsewhere.
>>
>> Can you post a short program that demonstrates the problem?
>>
>> You can start with something like this:
>>       Document doc = new Document();
>>       doc.add(new Field("text",URDU_STOP_WORDS[0] +
>>                   " regular text",Store.YES, Index.TOKENIZED));
>>       indexWriter.addDocument(doc);
>>
>> Now URDU_STOP_WORDS[0] should not appear within the index terms.
>> You can easily verify this by iterating IndexReader.terms();
>>
>> Regards, Doron
>>
>> On Dec 27, 2007 9:36 AM, Liaqat Ali <li...@gmail.com> wrote:
>>
>>
>>> Hi, Grant
>>>
>>> I think I did not make myself clear. I am trying to pass a list of Urdu
>>> stop words as an argument to the StandardAnalyzer, but it does not
>>> work for me.
>>>
>>> public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
>>> ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
>>> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
>>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>>
>>>
>>> Kindly give some guidelines.
>>>
>>> Regards,
>>> Liaqat
>>>
>>>
>>>
>>>
>
> The whole program is given below. But it does not eliminate stop  
> words from the index.
>
>
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
>
> import java.io.*;
>
>
> public class urduIndexer1  {
>
>
>    Reader file;
>    BufferedReader buff;
>    String line;
>    IndexWriter writer;
>    String indexDir;
>    Directory dir;
>
>    public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
> ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
>
>
>    public void index() throws IOException,
>     UnsupportedEncodingException {
>
>              indexDir = "D:\\UIR\\index";
>        Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>            boolean createFlag = true;
>
>        dir = FSDirectory.getDirectory(indexDir);
>        writer = new IndexWriter(dir, analyzer, createFlag);
>
>        for (int i=1;i<=201;i++)  {
>
>
>
>            file = new InputStreamReader(new FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");
>
>            StringBuffer sb = new StringBuffer();
>
>            buff = new BufferedReader(file);
>
>            //line = buff.readLine();
>
>
>            while( (line = buff.readLine()) != null) {
>                        sb.append(line);
>                }
>
>
>
>            boolean eof = false;
>
>                Document document  = new Document();
>            document.add(new Field("contents",sb.toString(),  
> Field.Store.NO, Field.Index.TOKENIZED));
>
>



Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Doron Cohen wrote:
> Hi Liaqat,
>
> This part of the code seems correct and should work, so the problem
> must be elsewhere.
>
> Can you post a short program that demonstrates the problem?
>
> You can start with something like this:
>       Document doc = new Document();
>       doc.add(new Field("text",URDU_STOP_WORDS[0] +
>                   " regular text",Store.YES, Index.TOKENIZED));
>       indexWriter.addDocument(doc);
>
> Now URDU_STOP_WORDS[0] should not appear within the index terms.
> You can easily verify this by iterating IndexReader.terms();
>
> Regards, Doron
>
> On Dec 27, 2007 9:36 AM, Liaqat Ali <li...@gmail.com> wrote:
>
>   
>> Hi, Grant
>>
>> I think I did not make myself clear. I am trying to pass a list of Urdu
>> stop words as an argument to the StandardAnalyzer, but it does not
>> work for me.
>>
>> public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
>> ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
>> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>
>>
>> Kindly give some guidelines.
>>
>> Regards,
>> Liaqat
>>
>>
>>
>>     

The whole program is given below. But it does not eliminate stop words 
from the index.


import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;


public class urduIndexer1  {

   

    Reader file;
    BufferedReader buff;
    String line;
    IndexWriter writer;
    String indexDir;
    Directory dir;

    public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" 
,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک"  };


    public void index() throws IOException,
     UnsupportedEncodingException {

       
        indexDir = "D:\\UIR\\index";
        Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
            boolean createFlag = true;

        dir = FSDirectory.getDirectory(indexDir);
        writer = new IndexWriter(dir, analyzer, createFlag);

        for (int i=1;i<=201;i++)  {



            file = new InputStreamReader(new 
FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");

            StringBuffer sb = new StringBuffer();

            buff = new BufferedReader(file);

            //line = buff.readLine();


            while( (line = buff.readLine()) != null) {
                        sb.append(line);
                }



            boolean eof = false;

                Document document  = new Document();
            document.add(new Field("contents",sb.toString(), 
Field.Store.NO, Field.Index.TOKENIZED));
           



Re: StopWords problem

Posted by Doron Cohen <cd...@gmail.com>.
Hi Liaqat,

This part of the code seems correct and should work, so the problem
must be elsewhere.

Can you post a short program that demonstrates the problem?

You can start with something like this:
      Document doc = new Document();
      doc.add(new Field("text",URDU_STOP_WORDS[0] +
                  " regular text",Store.YES, Index.TOKENIZED));
      indexWriter.addDocument(doc);

Now URDU_STOP_WORDS[0] should not appear within the index terms.
You can easily verify this by iterating IndexReader.terms();
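
A minimal, self-contained sketch of this test, assuming the Lucene 2.x
API used in this thread, with a RAMDirectory standing in for the disk
index and the stop word written as a Unicode escape (U+06A9 U+06D2,
i.e. "کے"):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class StopWordTest {
        public static void main(String[] args) throws Exception {
            String[] stopWords = { "\u06A9\u06D2" };
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(stopWords), true);
            Document doc = new Document();
            doc.add(new Field("text", stopWords[0] + " regular text",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
            // If the stop list is in effect, only "regular" and "text" print here.
            IndexReader reader = IndexReader.open(dir);
            TermEnum terms = reader.terms();
            while (terms.next()) {
                System.out.println(terms.term().text());
            }
            terms.close();
            reader.close();
        }
    }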

Regards, Doron

On Dec 27, 2007 9:36 AM, Liaqat Ali <li...@gmail.com> wrote:

> Hi, Grant
>
> I think I did not make myself clear. I am trying to pass a list of Urdu
> stop words as an argument to the StandardAnalyzer, but it does not
> work for me.
>
> public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
> ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>
>
> Kindly give some guidelines.
>
> Regards,
> Liaqat
>
>
>

Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Grant Ingersoll wrote:
>
> On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote:
>>>
>>>
>>>
>> No, at this level I am not using any stemming technique. I am just 
>> trying to eliminate stop words.
>
> Can you share your analyzer code?
>
> -Grant
>
>
>
Hi, Grant

I think I did not make myself clear. I am trying to pass a list of Urdu
stop words as an argument to the StandardAnalyzer, but it does not
work for me.

public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے" 
,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Kindly give some guidelines.

Regards,
Liaqat



Re: StopWords problem

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote:
>>
>>
>>
> No, at this level I am not using any stemming technique. I am just  
> trying to eliminate stop words.

Can you share your analyzer code?

-Grant



Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Grant Ingersoll wrote:
> Are you altering (stemming) the token before it gets to the StopFilter?
>
> On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote:
>
>> Doron Cohen wrote:
>>> On Dec 26, 2007 10:33 PM, Liaqat Ali <li...@gmail.com> wrote:
>>>
>>>
>>>> Using javac -encoding UTF-8 still raises the following error.
>>>>
>>>> urduIndexer.java : illegal character: \65279
>>>> ?
>>>> ^
>>>> 1 error
>>>>
>>>> What am I doing wrong?
>>>>
>>>>
>>>
>>> If you have the stop-words in a file, say one word per line,
>>> they can be read like this:
>>>
>>>    BufferedReader r = new BufferedReader(new InputStreamReader(new
>>> FileInputStream("Urdu.txt"),"UTF8"));
>>>    String word = r.readLine();    // loop this line, you get the 
>>> picture
>>>
>>> (Make sure to specify encoding "UTF8" when saving the file from 
>>> notepad).
>>>
>>> Regards,
>>> Doron
>>>
>>>
>> Hi, Doron
>>
>> The compilation problem is solved, but there is no change in the index.
>>
>>
>> public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" 
>> ,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
>> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" 
>> ,"تک" };
>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>
>>
>> Again, these words appear in the index with high ranks.
>>
>> Regards,
>> Liaqat
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
No, at this level I am not using any stemming technique. I am just 
trying to eliminate stop words.



Re: StopWords problem

Posted by Grant Ingersoll <gs...@apache.org>.
Are you altering (stemming) the token before it gets to the StopFilter?

On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote:

> Doron Cohen wrote:
>> On Dec 26, 2007 10:33 PM, Liaqat Ali <li...@gmail.com> wrote:
>>
>>
>>> Using javac -encoding UTF-8 still raises the following error.
>>>
>>> urduIndexer.java : illegal character: \65279
>>> ?
>>> ^
>>> 1 error
>>>
>>> What am I doing wrong?
>>>
>>>
>>
>> If you have the stop-words in a file, say one word per line,
>> they can be read like this:
>>
>>    BufferedReader r = new BufferedReader(new InputStreamReader(new
>> FileInputStream("Urdu.txt"),"UTF8"));
>>    String word = r.readLine();    // loop this line, you get the  
>> picture
>>
>> (Make sure to specify encoding "UTF8" when saving the file from  
>> notepad).
>>
>> Regards,
>> Doron
>>
>>
> Hi, Doron
>
> The compilation problem is solved, but there is no change in the  
> index.
>
>
> public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے"
> ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
> ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>
>
> Again, these words appear in the index with high ranks.
>
> Regards,
> Liaqat
>
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
Doron Cohen wrote:
> On Dec 26, 2007 10:33 PM, Liaqat Ali <li...@gmail.com> wrote:
>
>   
>> Using javac -encoding UTF-8 still raises the following error.
>>
>> urduIndexer.java : illegal character: \65279
>> ?
>> ^
>> 1 error
>>
>> What am I doing wrong?
>>
>>     
>
> If you have the stop-words in a file, say one word per line,
> they can be read like this:
>
>     BufferedReader r = new BufferedReader(new InputStreamReader(new
> FileInputStream("Urdu.txt"),"UTF8"));
>     String word = r.readLine();    // loop this line, you get the picture
>
> (Make sure to specify encoding "UTF8" when saving the file from notepad).
>
> Regards,
> Doron
>
>   
Hi, Doron

The compilation problem is solved, but there is no change in the index.


public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے" 
,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Again, these words appear in the index with high ranks.

Regards,
Liaqat



Re: StopWords problem

Posted by Doron Cohen <cd...@gmail.com>.
On Dec 26, 2007 10:33 PM, Liaqat Ali <li...@gmail.com> wrote:

> Using javac -encoding UTF-8 still raises the following error.
>
> urduIndexer.java : illegal character: \65279
> ?
> ^
> 1 error
>
> What am I doing wrong?
>

If you have the stop-words in a file, say one word per line,
they can be read like this:

    BufferedReader r = new BufferedReader(new InputStreamReader(new
FileInputStream("Urdu.txt"),"UTF8"));
    String word = r.readLine();    // loop this line, you get the picture

(Make sure to specify encoding "UTF8" when saving the file from notepad).

Regards,
Doron
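
A minimal sketch completing that loop, collecting one word per line
into the String[] that the StandardAnalyzer constructor expects (and
stripping a leading BOM in case Notepad added one to the file):

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;

    public class LoadStopWords {
        public static String[] load() throws IOException {
            BufferedReader r = new BufferedReader(new InputStreamReader(
                    new FileInputStream("Urdu.txt"), "UTF8"));
            List<String> words = new ArrayList<String>();
            String word;
            while ((word = r.readLine()) != null) {
                // Notepad may prepend a BOM (\uFEFF) to the first line; drop it.
                if (word.length() > 0 && word.charAt(0) == '\uFEFF') {
                    word = word.substring(1);
                }
                word = word.trim();
                if (word.length() > 0) {
                    words.add(word);
                }
            }
            r.close();
            return words.toArray(new String[words.size()]);
        }
    }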

Re: StopWords problem

Posted by 李晓峰 <en...@gmail.com>.
Or you can save it as "Unicode" and compile with javac -encoding Unicode.

This way you can still use Notepad.

Liaqat Ali wrote:
> 李晓峰 wrote:
>> "javac" has an option "-encoding", which tells the compiler the 
>> encoding the input source file is using, this will probably solve the 
>> problem.
>> or you can try the unicode escape: \uxxxx, then you can save it in 
>> ANSI, had for human to read though.
>> or use an IDE, eclipse is a good choice, you can set the source file 
>> encoding, and it will take care of the compiler for you.
>>
>> regards.
>>> Hi, Doron Cohen
>>>
>>> Thanks for your reply, but I am facing a small problem here. Since
>>> I am using Notepad for coding, in which format should the file
>>> be saved?
>>>
>>>
>>> public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" 
>>> ,"کا" ,"کو" ,"ہے" };
>>>
>>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>>
>>>
>>> If I save it in ANSI format it loses the contents. I tried
>>> Unicode, but it does not work, and I also tried UTF-8, but that
>>> generates two errors about illegal characters. What should
>>> the solution be? Kindly guide me on this.
>>>
>>> Thanks ..
>>>
>>>
>>
>>
>>
>>
> Hi,
> Thanks a lot for your suggestion.
> Using javac -encoding UTF-8 still raises the following error.
>
> urduIndexer.java : illegal character: \65279
> ?
> ^
> 1 error
>
> What am I doing wrong?
>
>




Re: StopWords problem

Posted by 李晓峰 <en...@gmail.com>.
It's Notepad. It adds a byte-order mark (BOM, in this case 65279, or
0xFEFF) at the front of your file, which javac does not recognize, for
reasons not quite clear to me.
Here is the bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
It won't be fixed, so try to eliminate the BOM before compiling your code.
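
A minimal sketch of one way to strip the BOM: decode the file as UTF-8,
drop a leading U+FEFF if present, and write the rest back out (the
command-line arguments are an assumption, not part of this thread):

    import java.io.*;

    public class StripBom {
        public static void main(String[] args) throws IOException {
            // args[0]: file saved by Notepad; args[1]: cleaned copy for javac.
            Reader in = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
            Writer out = new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8");
            int first = in.read();
            if (first != 0xFEFF && first != -1) {
                out.write(first); // no BOM, so keep the first character
            }
            for (int c = in.read(); c != -1; c = in.read()) {
                out.write(c);
            }
            in.close();
            out.close();
        }
    }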

Liaqat Ali wrote:
> 李晓峰 wrote:
>> "javac" has an option "-encoding", which tells the compiler the 
>> encoding the input source file is using, this will probably solve the 
>> problem.
>> or you can try the unicode escape: \uxxxx, then you can save it in 
>> ANSI, had for human to read though.
>> or use an IDE, eclipse is a good choice, you can set the source file 
>> encoding, and it will take care of the compiler for you.
>>
>> regards.
>>> Hi, Doron Cohen
>>>
>>> Thanks for your reply, but I am facing a small problem here. Since
>>> I am using Notepad for coding, in which format should the file
>>> be saved?
>>>
>>>
>>> public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" 
>>> ,"کا" ,"کو" ,"ہے" };
>>>
>>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>>
>>>
>>> If I save it in ANSI format it loses the contents. I tried
>>> Unicode, but it does not work, and I also tried UTF-8, but that
>>> generates two errors about illegal characters. What should
>>> the solution be? Kindly guide me on this.
>>>
>>> Thanks ..
>>>
>>>
>>
>>
>>
>>
> Hi,
> Thanks a lot for your suggestion.
> Using javac -encoding UTF-8 still raises the following error.
>
> urduIndexer.java : illegal character: \65279
> ?
> ^
> 1 error
>
> What am I doing wrong?
>
>




Re: StopWords problem

Posted by Liaqat Ali <li...@gmail.com>.
李晓峰 wrote:
> "javac" has an option "-encoding", which tells the compiler the 
> encoding the input source file is using, this will probably solve the 
> problem.
> or you can try the unicode escape: \uxxxx, then you can save it in 
> ANSI, had for human to read though.
> or use an IDE, eclipse is a good choice, you can set the source file 
> encoding, and it will take care of the compiler for you.
>
> regards.
>> Hi, Doron Cohen
>>
>> Thanks for your reply, but I am facing a small problem here. Since
>> I am using Notepad for coding, in which format should the file
>> be saved?
>>
>>
>> public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" 
>> ,"کا" ,"کو" ,"ہے" };
>>
>> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>>
>>
>> If I save it in ANSI format it loses the contents. I tried
>> Unicode, but it does not work, and I also tried UTF-8, but that
>> generates two errors about illegal characters. What should
>> the solution be? Kindly guide me on this.
>>
>> Thanks ..
>>
>>
>
>
>
>
Hi,
Thanks a lot for your suggestion.
Using javac -encoding UTF-8 still raises the following error.

urduIndexer.java : illegal character: \65279
?
^
1 error

What am I doing wrong?



Re: StopWords problem

Posted by 李晓峰 <en...@gmail.com>.
"javac" has an option "-encoding", which tells the compiler the encoding 
the input source file is using, this will probably solve the problem.
or you can try the unicode escape: \uxxxx, then you can save it in ANSI, 
had for human to read though.
or use an IDE, eclipse is a good choice, you can set the source file 
encoding, and it will take care of the compiler for you.
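
For example, a minimal sketch of the escape approach for two of the
stop words ("کے" and "کی"); with every non-ASCII character escaped this
way, the source compiles with plain javac, no -encoding flag needed:

    public static final String[] URDU_STOP_WORDS = {
        "\u06A9\u06D2",   // U+06A9 U+06D2
        "\u06A9\u06CC",   // U+06A9 U+06CC
        // ... the remaining words, escaped the same way
    };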

regards.
> Hi, Doron Cohen
>
> Thanks for your reply, but I am facing a small problem here. Since
> I am using Notepad for coding, in which format should the file be
> saved?
>
>
> public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" 
> ,"کا" ,"کو" ,"ہے" };
>
> Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
>
>
> If I save it in ANSI format it loses the contents. I tried Unicode,
> but it does not work, and I also tried UTF-8, but that generates two
> errors about illegal characters. What should the solution be? Kindly
> guide me on this.
>
> Thanks ..
>
>

