You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Ahmet Aksoy <ah...@axtelsoft.com> on 2005/09/03 10:12:45 UTC

Indexing problems in a dictionary

Hi,
I'm using Lucene in an open source java project at 
http://belletmen.dev.java.net .
In the project there are several dictionaries with a simple structure. 
All items are composed of a "phrase", and a "definition". Both parts 
might contain a single word, or have lots of words.
Since both parts  might contain multiple  words,   I used the following:
    private Document buildDocument(SozlukBirimi birim){
        Document doc = new Document();
        doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word 
in Turkish
        doc.add(Field.Text("soz1", birim.getSoz()));//the same as 
keyword part
        doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means 
meaning in Turkish
        return doc;
    }
 As you can see, I used the first part both as a keyword field, and a 
text field. The reason is that the program will try to find phrases, or 
single words in the first part also.
At the first stages of the application, there were a single 
English-Turkish dictionary, and I had used an analyzer in which both 
English and Turkish stop words are included.
And, here my questions:
1- Do you think whether the above system is a good solution for a 
dictionary, or not?
2- I'm in hesitation now, about using stop words in a dictionary. What 
do you think?
3- I have a quite big timing problem. For a 107155 items of an 
English-English dictionary, it took 1436 seconds to complete the 
indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it 
normal? Or, am I in a completely wrong way?
I'm waiting for your suggestions.
Thanks a lot.
Ahmet Aksoy


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing problems in a dictionary

Posted by Ahmet Aksoy <ah...@axtelsoft.com>.

Hi Paul,

I decided to use a minimum number of stop words in my application. I 
hope, it will work better.

According to your suggestion, I made a few trials, and found my optimum 
values.

The following values look like best in my case:
mergeFactor = 100;
minMergeDocs = 500;
maxMergeDocs = 1000;

Now, indexing is performed 8 to 30 times faster than before.
Thanks a lot.
Ahmet Aksoy

Paul Elschot wrote:

>Ahmet,
>
>On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:
>  
>
>>Hi,
>>I'm using Lucene in an open source java project at 
>>http://belletmen.dev.java.net .
>>In the project there are several dictionaries with a simple structure. 
>>All items are composed of a "phrase", and a "definition". Both parts 
>>might contain a single word, or have lots of words.
>>Since both parts  might contain multiple  words,   I used the following:
>>    private Document buildDocument(SozlukBirimi birim){
>>        Document doc = new Document();
>>        doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word 
>>in Turkish
>>        doc.add(Field.Text("soz1", birim.getSoz()));//the same as 
>>keyword part
>>        doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means 
>>meaning in Turkish
>>        return doc;
>>    }
>> As you can see, I used the first part both as a keyword field, and a 
>>text field. The reason is that the program will try to find phrases, or 
>>single words in the first part also.
>>At the first stages of the application, there were a single 
>>English-Turkish dictionary, and I had used an analyzer in which both 
>>English and Turkish stop words are included.
>>And, here my questions:
>>1- Do you think whether the above system is a good solution for a 
>>dictionary, or not?
>>    
>>
>
>Keyword is used for searching exactly the same term, and this is normally
>reserved for things like a primary key. When the word (soz) has that role,
>it is ok. One way to determine a primary key is to consider how to
>delete a Document from the index.
>
>In case you anticipate searching on both the words and their meanings,
>you might also consider to index them concatenated together in another field.
>
>  
>
>>2- I'm in hesitation now, about using stop words in a dictionary. What 
>>do you think?
>>    
>>
>
>For the word field, they are needed, stopwords are words.
>For the meaning field, you need to determine whether stopwords are
>necessary in the index.
>It is possible to use a different analyzer per field.
>
>  
>
>>3- I have a quite big timing problem. For a 107155 items of an 
>>English-English dictionary, it took 1436 seconds to complete the 
>>indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it 
>>normal? Or, am I in a completely wrong way?
>>    
>>
>
>Since each entry is probably rather small, it is possible to keep many
>of them in RAM before writing them to disk during index building.
>You could try and increase IndexWriter.minMergeFactor first
>and consider the other merge factors of IndexWriter there later.
>
>Regards,
>Paul Elschot
>
>
>  
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing problems in a dictionary

Posted by Paul Elschot <pa...@xs4all.nl>.

Ahmet,

On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:
> Hi,
> I'm using Lucene in an open source java project at 
> http://belletmen.dev.java.net .
> In the project there are several dictionaries with a simple structure. 
> All items are composed of a "phrase", and a "definition". Both parts 
> might contain a single word, or have lots of words.
> Since both parts  might contain multiple  words,   I used the following:
>     private Document buildDocument(SozlukBirimi birim){
>         Document doc = new Document();
>         doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word 
> in Turkish
>         doc.add(Field.Text("soz1", birim.getSoz()));//the same as 
> keyword part
>         doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means 
> meaning in Turkish
>         return doc;
>     }
>  As you can see, I used the first part both as a keyword field, and a 
> text field. The reason is that the program will try to find phrases, or 
> single words in the first part also.
> At the first stages of the application, there were a single 
> English-Turkish dictionary, and I had used an analyzer in which both 
> English and Turkish stop words are included.
> And, here my questions:
> 1- Do you think whether the above system is a good solution for a 
> dictionary, or not?

Keyword is used for searching exactly the same term, and this is normally
reserved for things like a primary key. When the word (soz) has that role,
it is ok. One way to determine a primary key is to consider how to
delete a Document from the index.

In case you anticipate searching on both the words and their meanings,
you might also consider to index them concatenated together in another field.

> 2- I'm in hesitation now, about using stop words in a dictionary. What 
> do you think?

For the word field, they are needed, stopwords are words.
For the meaning field, you need to determine whether stopwords are
necessary in the index.
It is possible to use a different analyzer per field.

> 3- I have a quite big timing problem. For a 107155 items of an 
> English-English dictionary, it took 1436 seconds to complete the 
> indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it 
> normal? Or, am I in a completely wrong way?

Since each entry is probably rather small, it is possible to keep many
of them in RAM before writing them to disk during index building.
You could try and increase IndexWriter.minMergeFactor first
and consider the other merge factors of IndexWriter there later.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org