Posted to general@lucene.apache.org by Chris Collins <ch...@yahoo.com> on 2009/05/11 17:28:06 UTC

Re: what if my database data contains other language (like danish, german).

Is anyone aware of either of these two things:

1) A way to plug in an external source for DF; this would let you
circumvent the problem you mentioned below.  (Of course, you would
have to compute a DF set for each language you want meaningful
weights for.)
2) Any open-source segmenters, primarily for German, but also for CJK
as a long shot :-}

Thanks

C

On May 11, 2009, at 8:13 AM, Ted Dunning wrote:

> Yes.  Lucene can handle that.  You have to select which stemmer to use.
> You may have to improve the German and Danish stemmers a little bit.
>
> You may also have some issues with the fact that if Danish is 5% of
> your corpus, then words that occur in 100% of your Danish documents
> will tend to get weights that are too high, since they only occur in
> 5% of your documents overall.  Any term that occurs in more than 20%
> of a sub-corpus should generally be discarded from your query.  This
> can be difficult in multilingual situations.
>
> For a first pass, I would ignore this issue, however.
>
> On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla
> <uk...@mach.com> wrote:
>
>> What if my database data contains other languages (like Danish or
>> German)?
>>
>> Will Lucene handle that?
>>
>> If yes, how?
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)


Re: what if my database data contains other language (like danish, german).

Posted by Chris Collins <ch...@yahoo.com>.
Thanks Otis, I will take a look.

Best

C
On May 17, 2009, at 7:05 PM, Otis Gospodnetic wrote:

>
> Chris,
>
> I don't have the issue number here, but look in Lucene's JIRA and  
> search for... ah, here:
>
>  https://issues.apache.org/jira/browse/LUCENE-1166
>
>
> And for Chinese:
>
>  https://issues.apache.org/jira/browse/LUCENE-1629
>
> If you happen to be using Solr:
>
>  http://www.sematext.com/product-multilingual-analyzer.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: what if my database data contains other language (like danish, german).

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Chris,

I don't have the issue number here, but look in Lucene's JIRA and search for... ah, here:

  https://issues.apache.org/jira/browse/LUCENE-1166


And for Chinese:

  https://issues.apache.org/jira/browse/LUCENE-1629

If you happen to be using Solr:

  http://www.sematext.com/product-multilingual-analyzer.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: what if my database data contains other language (like danish, german).

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 11, 2009 at 8:28 AM, Chris Collins <ch...@yahoo.com> wrote:

> Is anyone aware of either of these two things:
>
> 1) A way to plug in an external source for DF; this would let you
> circumvent the problem you mentioned below.  (Of course, you would have to
> compute a DF set for each language you want meaningful weights for.)


See
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html#weight(org.apache.lucene.search.Searcher)

The typical idiom is to extend Searcher with a specialized subclass that
knows the term frequencies you want it to use.

This is what Katta does to propagate cluster-global term frequencies to
shard-specific searches.  Presumably Solr does likewise.
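
In code, that idiom might look roughly like this (an untested sketch against
the 2.4-era API: ExternalDfSearcher, the map-based DF table, and its
"field:term" key format are made-up for illustration, not anything Lucene or
Katta ships):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: answer docFreq() from an externally computed,
// per-language DF table instead of the local index statistics, so that
// Query.weight() builds its idf from the frequencies you supply.
public class ExternalDfSearcher extends IndexSearcher {

    private final Map<String, Integer> externalDf;  // e.g. "body:hund" -> 1234

    public ExternalDfSearcher(Directory dir, Map<String, Integer> externalDf)
            throws IOException {
        super(dir);
        this.externalDf = externalDf;
    }

    public int docFreq(Term term) throws IOException {
        Integer df = externalDf.get(term.field() + ":" + term.text());
        // fall back to the index's own statistics when the table has no entry
        return df != null ? df.intValue() : super.docFreq(term);
    }
}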


> 2) Any open-source segmenters, primarily for German, but also for CJK as a
> long shot :-}


Lucene has a rudimentary German stemmer which may be sufficient.  Real lemma
identification in German can be difficult because of the large number of
morphological variants and the amount of word compounding.  For text retrieval,
however, compounding is your friend, and very simple stemmers typically suffice.
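
If you just want to try the stock German support, something like this should
work (a minimal sketch using the contrib GermanAnalyzer with the 2.4-era
IndexWriter API; whether its stemming is good enough for your corpus is
something you would have to evaluate):

import org.apache.lucene.analysis.de.GermanAnalyzer;   // contrib/analyzers jar
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class GermanIndexingExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // GermanAnalyzer tokenizes, removes German stop words, and applies
        // Lucene's simple German stemmer
        IndexWriter writer = new IndexWriter(dir, new GermanAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("body",
                "Die Hunde liefen durch die Studentenwohnheime",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}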

For CJK, the approach that I favor lately is this one:

      http://technology.chtsai.org/mmseg/

Basically, it is a longest-dictionary-match method, with the addition that
the next token it picks is the one that starts the longest match over the
next three tokens.  This gets rid of the garden-path problems that greedy
algorithms without look-ahead have.  It depends a bit on the assumption that
long words in the dictionary have higher frequency than would be expected if
their possible components occurred independently.  This means that picking
the longer match in the dictionary is equivalent to doing a more subtle
statistical test.  (See here for more details on the stats involved in
bigram detection:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html).
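
A toy rendering of that look-ahead rule, just to make the idea concrete (this
is my own simplification, not the mmseg code; a real implementation needs a
proper dictionary and the additional mmseg tie-breaking rules):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// At each position, consider every chunk of up to three consecutive
// dictionary words and emit the first word of the chunk with the greatest
// total length; unknown characters fall back to single-character tokens.
public class GreedyChunkSegmenter {

    private final Set<String> dictionary;
    private final int maxWordLen;

    public GreedyChunkSegmenter(Set<String> dictionary, int maxWordLen) {
        this.dictionary = dictionary;
        this.maxWordLen = maxWordLen;
    }

    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            String best = bestFirstWord(text, pos);
            tokens.add(best);
            pos += best.length();
        }
        return tokens;
    }

    // first word of the longest three-word chunk starting at pos
    private String bestFirstWord(String text, int pos) {
        String bestFirst = text.substring(pos, pos + 1);
        int bestTotal = 1;
        for (String w1 : candidates(text, pos)) {
            for (String w2 : candidates(text, pos + w1.length())) {
                for (String w3 : candidates(text,
                        pos + w1.length() + w2.length())) {
                    int total = w1.length() + w2.length() + w3.length();
                    if (total > bestTotal) {
                        bestTotal = total;
                        bestFirst = w1;
                    }
                }
            }
        }
        return bestFirst;
    }

    // dictionary words (plus a single-character fallback) starting at pos;
    // past the end of the text, only the empty string is possible
    private List<String> candidates(String text, int pos) {
        List<String> words = new ArrayList<String>();
        if (pos >= text.length()) {
            words.add("");
            return words;
        }
        words.add(text.substring(pos, pos + 1));
        int limit = Math.min(maxWordLen, text.length() - pos);
        for (int len = 2; len <= limit; len++) {
            String w = text.substring(pos, pos + len);
            if (dictionary.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }
}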

-- 
Ted Dunning, CTO
DeepDyve