Posted to user@lucenenet.apache.org by George Aroush <ge...@aroush.net> on 2007/04/03 02:39:57 UTC

RE: Best use of language dep. analyzers?

Hi Torsten,

Are you referring to the analyzers in Snowball.Net?  I ported those analyzers
to C#; however, since I lack the language knowledge, and those analyzers
don't come with JUnit tests to port and run in C# land, I can't confirm whether
the port is valid.  This is the case for 1.9 as well as for 2.0, and I'm
afraid it will remain so unless someone with language knowledge
debugs them.

-- George Aroush

-----Original Message-----
From: Torsten Rendelmann [mailto:torsten.rendelmann@gmx.net] 
Sent: Saturday, March 31, 2007 11:52 AM
To: lucene-net-user@incubator.apache.org
Subject: Best use of language dep. analyzers?

Hi, I'm not very familiar with the direction of Lucene (Java) development in
the field of language-dependent analyzers. What will it be?
 
We use a slightly modified version of Lucene.Net 1.9 (which includes the
already published/converted language-dependent analyzers: the various folders
below "Analysis" named "BR", "CJK", "FR", "DE", etc.). As far as I understand,
they should be used to analyze language-specific documents/texts and get rid
of stop words, etc., so that only the "real" text is indexed. So currently we
detect the language of each document we index, map it to the "right"
analyzer, and add the document.
But the analyzers are not stable; we ran into various problems using them
(endless loops and an empty string in a stop word table, to name just two).
 
Will this be the same for Lucene.Net 2.x? Which "language" package will be
available?
Will it be part of the Apache project?
 
Thx,
Torsten Rendelmann
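
The workflow described in this message (detect the language, pick the
matching analyzer, guard against a bad stop word table) might be sketched
roughly as follows. This is a language-neutral illustration in Python, not
Lucene.Net code: the folder names "BR", "CJK", "FR" and "DE" come from the
message, while the language codes, the function names, and the "Standard"
fallback are assumptions made only for this example.

```python
# Hypothetical mapping from a detected language code to the analyzer
# folder names mentioned in the message ("BR", "CJK", "FR", "DE").
# The language codes on the left are assumptions for this sketch.
ANALYZER_BY_LANGUAGE = {
    "pt-br": "BR",
    "zh": "CJK",
    "ja": "CJK",
    "ko": "CJK",
    "fr": "FR",
    "de": "DE",
}

# Assumed fallback when no language-specific analyzer fits.
DEFAULT_ANALYZER = "Standard"


def pick_analyzer(language_code):
    """Return the analyzer key for a detected language, or the fallback."""
    if not language_code:
        return DEFAULT_ANALYZER
    key = language_code.strip().lower()
    return ANALYZER_BY_LANGUAGE.get(key, DEFAULT_ANALYZER)


def clean_stop_words(words):
    """Drop empty or whitespace-only entries and duplicates, keeping order."""
    cleaned = []
    seen = set()
    for word in words:
        token = word.strip()
        if token and token not in seen:
            seen.add(token)
            cleaned.append(token)
    return cleaned
```

Filtering the stop word table before handing it to an analyzer is one way to
rule out the empty-string entry mentioned above; it does not address the
endless loops, whose cause the thread does not identify.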

RE: Best use of language dep. analyzers?

Posted by George Aroush <ge...@aroush.net>.

Snowball is a per-language stemmer; this is why you will see classes such as
DutchStemmer.cs, FinnishStemmer.cs, German2Stemmer.cs, ItalianStemmer.cs,
etc.

-- George
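
One way to picture this: the surrounding analyzer code can stay generic while
a language name selects the stemmer class. Below is a minimal sketch in
Python (not the Snowball.Net API) of that naming convention, assuming the
class name is simply the language name followed by "Stemmer", as the file
names listed above suggest. The set of known names is limited to those four
examples, and the function name is hypothetical.

```python
# Stemmer names taken from the file names listed in the message.
KNOWN_STEMMERS = {"Dutch", "Finnish", "German2", "Italian"}


def stemmer_class_name(language_name):
    """Build the per-language stemmer class name, e.g. "Dutch" -> "DutchStemmer".

    Raises ValueError for names outside the known set, so a typo fails
    loudly instead of silently selecting no stemmer.
    """
    if language_name not in KNOWN_STEMMERS:
        raise ValueError("no stemmer known for language: %r" % (language_name,))
    return language_name + "Stemmer"
```

For example, `stemmer_class_name("German2")` yields the name matching
German2Stemmer.cs, while an unlisted language raises an error.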

RE: Best use of language dep. analyzers?

Posted by Torsten Rendelmann <to...@gmx.net>.

George,

Yes, Snowball was what I had in mind when I wrote my post.
My understanding was that it provides a general way to analyze
text, rather than one analyzer for each language.
Am I wrong?

If only I had enough spare time to take a look,
I would like to help with that (porting our current code, which uses
per-language analyzers, and tracking down issues).

Torsten