Posted to java-user@lucene.apache.org by Marc Guillemot <ma...@amath.net> on 2002/10/01 09:40:55 UTC

Help for german queries

Hi,

I've run some tests with Lucene for German indexing and search, but I
don't get the results I expected:

- Umlauts:
Searching for:
    - "Geschäft" -> x results
    - "Geschaeft" -> no results
Is there an option in the standard German classes to make the two searches
above equivalent?

- Compound words:
"betreuung" is not found in a doc containing "Kundenbetreuung"

Any suggestions?

Marc.


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Help for german queries

Posted by Andreas Mack <va...@mediales.net>.
On Wed, Oct 02, 2002 at 03:03:28PM +0200, Marc Guillemot wrote:
> Great, your stemmer does the job I expected for umlauts. Thanks.
> 
> Does anyone have an idea for compound words ("betreuung" is not found in a
> doc containing "Kundenbetreuung")?

I had the idea of using aspell's word lists to further "stem" the words,
but it doesn't look trivial. The words in the lists look like
"8etreuung", and I haven't had time to investigate this. I'd be interested
in the opinions of others on this.
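
To make the word-list idea concrete, here is a minimal sketch of
dictionary-based compound splitting. It is not part of Lucene; the class
CompoundSplitter, its expand() method and the tiny dictionary are
hypothetical names chosen for illustration, and a real word list would
first have to be converted into plain lowercase words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Lucene class: finds dictionary words inside a token.
class CompoundSplitter {
    private final Set<String> dictionary = new HashSet<>();
    private final int minPartLength;

    CompoundSplitter(Set<String> words, int minPartLength) {
        // Store the dictionary in lowercase so lookups are case-insensitive.
        for (String w : words) {
            dictionary.add(w.toLowerCase(Locale.ROOT));
        }
        this.minPartLength = minPartLength;
    }

    /** Returns the lowercased token followed by every dictionary word found inside it. */
    List<String> expand(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<>();
        terms.add(lower);
        // Try every substring of at least minPartLength characters.
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + minPartLength; end <= lower.length(); end++) {
                String part = lower.substring(start, end);
                if (!part.equals(lower) && dictionary.contains(part) && !terms.contains(part)) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("kunden", "betreuung"));
        CompoundSplitter splitter = new CompoundSplitter(words, 4);
        // prints [kundenbetreuung, kunden, betreuung]
        System.out.println(splitter.expand("Kundenbetreuung"));
    }
}
```

If the extra terms were indexed alongside the original token, a query for
"betreuung" could match a document containing "Kundenbetreuung". The
brute-force substring scan is only meant to show the principle; it will
also pick up unrelated short words unless the minimum part length is
chosen carefully.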


Greetings,
Andreas.

-- 
Andreas Mack <va...@mediales.de>
mediales. GmbH http://www.mediales.de
The unexamined life is not worth living.


Re: Help for german queries

Posted by Marc Guillemot <ma...@amath.net>.
Great, your stemmer does the job I expected for umlauts. Thanks.

Does anyone have an idea for compound words ("betreuung" is not found in a
doc containing "Kundenbetreuung")?

Marc.



Re: Help for german queries

Posted by Clemens Marschner <cm...@lanlab.de>.
Hm, sorry, I don't have the time right now, but I think it took me 10 minutes
to find the place where I had to make the changes.
I thought ä=ae would already be included.
I have included my GermanStemmer version in this post. Sorry, I can't do
CVS diffs at the moment.
The stemmer maps ä->a and ae->a and doesn't distinguish between uppercase
and lowercase. I'm not a linguist, so I can't say whether it overstems. I
commented out the "!uppercase" condition in the expression below:

     // "t" occurs only as suffix of verbs.
     else if ( buffer.charAt( buffer.length() - 1 ) == 't' /* && !uppercase */ ) {
         buffer.deleteCharAt( buffer.length() - 1 );
     }
     else {
         doMore = false;
     }
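
As a rough standalone illustration of the ä->a and ae->a mapping (this is
not the GermanStemmer code itself), the folding can be sketched as below.
The class UmlautFolding and the method fold() are hypothetical names, and
the extra handling of o and u umlauts is an assumption added by analogy.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: folds umlaut spellings to one form.
class UmlautFolding {
    /** Lowercases the term and maps both spellings of an umlaut to the base vowel. */
    static String fold(String term) {
        String s = term.toLowerCase(Locale.ROOT);
        // Digraph spellings: ae -> a (from the post); oe -> o and ue -> u are assumed by analogy.
        s = s.replace("ae", "a").replace("oe", "o").replace("ue", "u");
        // Umlaut characters, written as Unicode escapes: a-umlaut -> a, o-umlaut -> o, u-umlaut -> u.
        s = s.replace('\u00e4', 'a').replace('\u00f6', 'o').replace('\u00fc', 'u');
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fold("Gesch\u00e4ft")); // prints geschaft
        System.out.println(fold("Geschaeft"));      // prints geschaft
    }
}
```

Both spellings end up as the same term as long as the same folding is
applied at index time and at query time. The plain string replacement
also rewrites words where "ae", "oe" or "ue" is not an umlaut spelling
(for example "Mauer"), which is the kind of overstemming I can't judge.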

Hope that helps

Clemens


Re: Help for german queries

Posted by Marc Guillemot <ma...@amath.net>.
The problem is not the case of the first letter, only the equivalence
between "ä" and "ae", for instance.

In my tests, searching for:
- Geschäft -> 13 results
- geschäft -> 0 results
- Geschaeft -> 0 results
- geschaeft -> 0 results

Marc.


Re: Help for german queries

Posted by Clemens Marschner <cm...@lanlab.de>.
There's a "feature" in the German stemmer (I would call it a bug) that
treats words ending with "t" differently depending on whether they start
with a capital or a non-capital letter. Are you sure you didn't type
"geschäft" and "Geschaeft"? Those are supposedly stemmed differently.
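
To show the effect in isolation, here is a simplified sketch of such a
case-dependent rule. It is not the actual GermanStemmer code; the class
TrailingTRule and its methods are hypothetical, and only the trailing "t"
step is modelled.

```java
import java.util.Locale;

// Hypothetical, simplified model of a case-dependent suffix rule.
class TrailingTRule {
    /** Strips a trailing 't' only when the word starts with a lowercase letter. */
    static String stemCaseSensitive(String word) {
        boolean uppercase = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
        return strip(word, !uppercase);
    }

    /** Strips a trailing 't' regardless of the case of the first letter. */
    static String stemAlways(String word) {
        return strip(word, true);
    }

    private static String strip(String word, boolean allowStrip) {
        // Lowercase first so the two results can be compared directly.
        StringBuilder buffer = new StringBuilder(word.toLowerCase(Locale.ROOT));
        if (allowStrip && buffer.length() > 0 && buffer.charAt(buffer.length() - 1) == 't') {
            buffer.deleteCharAt(buffer.length() - 1);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(stemCaseSensitive("Geschaeft")); // prints geschaeft
        System.out.println(stemCaseSensitive("geschaeft")); // prints geschaef
        System.out.println(stemAlways("Geschaeft"));        // prints geschaef
    }
}
```

With the case-dependent variant, the capitalized and lowercase spellings
of the same word yield different terms, so a query in one spelling would
miss documents indexed under the other.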

--Clemens
