Posted to java-user@lucene.apache.org by Mariella Di Giacomo <ma...@lanl.gov> on 2005/01/05 19:51:58 UTC

Question about Analyzer and words spelled in different languages

Hi ALL,


We are trying to index scientific articles written in English, but whose 
authors' names can be spelled in any language (depending on the author's nationality).

E.g.
Schäffer


In the XML document that we provide to Lucene the author name is written in 
the following way (using HTML ENTITIES)

Sch&amp;auml;ffer

So in practice that is the name that would be given to a Lucene analyzer/filter

Is there any already-written analyzer that would take that name 
(Sch&amp;auml;ffer or any other name that has entities) so that the
Lucene index could be searched (once the field has been indexed) for the real 
version of the name, which is

Schäffer

and for the English spelling of the name, which is

Schaffer

Thanks a lot in advance for your help,


Mariella

Re: Question about Analyzer and words spelled in different languages

Posted by Chris Hostetter <ho...@fucit.org>.

: Is there any already written analyzer that would take that name
: (Sch&amp;auml;ffer or any other name that has entities) so that
: Lucene index could be searched (once the field has been indexed) for the real
: version of the name, which is
:
: Schäffer
:
: and for the English spelling of the name, which is
:
: Schaffer

I don't know about the un-XML-escaping part of things (there are lots
of XML escaping libraries out there; I'm sure one of them has an unescape),
but there was a recent discussion about Unicode characters that look
similar and writing an analyzer that could know about them.  The last
message in the thread was from me, pointing out that it should be easy to
build the mapping table once, and then write a quick and dirty Analyzer
filter to use it ... but no one seemed to have any code handy that
already did that...

http://mail-archives.apache.org/eyebrowse/BrowseList?listName=lucene-user@jakarta.apache.org&by=thread&from=962022
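
A minimal sketch of the two steps mentioned above (unescaping entities, then
mapping look-alike characters to a plain form), in plain Java with no Lucene
dependency. The class name NameNormalizer, the tiny entity table, and the use
of java.text.Normalizer instead of a hand-built mapping table are all
assumptions made for illustration; they are not code from the thread or from
Lucene.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class NameNormalizer {
    // Illustrative subset only; a real table would cover all HTML named entities.
    private static final Map<String, String> ENTITIES = Map.of(
        "auml", "\u00e4", "ouml", "\u00f6", "uuml", "\u00fc",
        "eacute", "\u00e9", "amp", "&");

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replace known named entities; unknown ones are left untouched.
    static String unescape(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String repl = ENTITIES.get(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(repl != null ? repl : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decompose accented letters (NFD) and drop the combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this sketch, unescape("Sch&auml;ffer") gives "Schäffer" and
fold("Schäffer") gives "Schaffer". A doubly escaped input such as
"Sch&amp;auml;ffer" needs two unescape passes. Letters that have no canonical
decomposition (for example "ß" or "ø") pass through fold unchanged, so those
would still need entries in an explicit mapping table.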


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Question about Analyzer and words spelled in different languages

Posted by David Spencer <da...@tropo.com>.

markharw00d wrote:

>  >>Writing this kind of an analyzer can be a bit of a hassle and the 
> position increment of 0 might affect highlighting code
> 
> The highlighter in the sandbox was refactored to support these kinds of 
> analyzers some time ago, so it shouldn't be a problem.
Sorry - I should have noted that as I think I was the one that requested 
this change! I meant to say that position increments are tricky and an 
increment of 0 could affect all kinds of other code that doesn't know to 
look for such things.

>  The JUnit test 
> that comes with the highlighter includes a SynonymAnalyzer class which 
> demonstrates this.
> 
> Cheers
> Mark



Re: Question about Analyzer and words spelled in different languages

Posted by markharw00d <ma...@yahoo.co.uk>.

 >>Writing this kind of an analyzer can be a bit of a hassle and the 
position increment of 0 might affect highlighting code

The highlighter in the sandbox was refactored to support these kinds of 
analyzers some time ago, so it shouldn't be a problem.  The JUnit test 
that comes with the highlighter includes a SynonymAnalyzer class which 
demonstrates this.

Cheers
Mark


Re: Question about Analyzer and words spelled in different languages

Posted by David Spencer <da...@tropo.com>.

Mariella Di Giacomo wrote:

> Hi ALL,
> 
> 
> We are trying to index scientific articles written in English, but whose 
> authors' names can be spelled in any language (depending on the author's 
> nationality).
> 
> E.g.
> Schäffer
> 
> 
> In the XML document that we provide to Lucene the author name is written 
> in the following way (using HTML ENTITIES)
> 
> Sch&amp;auml;ffer
> 
> So in practice that is the name that would be given to a Lucene 
> analyzer/filter
> 
> Is there any already-written analyzer that would take that name 
> (Sch&amp;auml;ffer or any other name that has entities) so that the
> Lucene index could be searched (once the field has been indexed) for the 
> real version of the name, which is
> 
> Schäffer
> 
> and for the English spelling of the name, which is
> 
> Schaffer
> 
> Thanks a lot in advance for your help,

If I understand the question then I think there are 2 ways of doing it.

[1] Write a custom analyzer that uses Token.setPositionIncrement(0) to 
put alternate spellings at the same place in the token stream. This way 
phrase matches work right (so the query "Jonathan Schaffer" and 
"Jonathan Schäffer" will match the same phrase in the doc).

[2] Do not use a special analyzer - instead do query expansion, so if 
they search for "Schaffer" then the generated query is (Schaffer Schäffer).

I've used both techniques before - I use #1 w/ a "JavadocAnalyzer" on 
searchmorph.com so that if you search for "hash" you'll see matches for 
"HashMap", as "HashMap" is tokenized into 3 tokens at the same location 
('hash', 'map', 'hashmap').  Writing this kind of an analyzer can be a 
bit of a hassle and the position increment of 0 might affect 
highlighting code or other (say, summarizing) code that uses the Analyzer.
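
To make technique #1 concrete, here is a rough sketch of what "alternate
spellings at the same place in the token stream" means. It deliberately does
not use Lucene: the Tok class is only a stand-in for Lucene's Token, and
StackedTokens, analyze, and fold are hypothetical names. In a real filter the
extra token would get Token.setPositionIncrement(0) as described in [1].

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a token stream; Tok stands in for Lucene's Token.
class StackedTokens {
    static final class Tok {
        final String text;
        // 1 = next position, 0 = same position as the previous token
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // Strip combining marks after canonical decomposition (assumed folding rule).
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    // Whitespace tokenizer that stacks the folded spelling on the original.
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.add(new Tok(word, 1));           // original spelling advances the position
            String folded = fold(word);
            if (!folded.equals(word)) {
                out.add(new Tok(folded, 0));     // alternate spelling shares that position
            }
        }
        return out;
    }
}
```

For the input "Jonathan Schäffer" this produces three tokens: Jonathan(+1),
Schäffer(+1), Schaffer(+0). Because the last two share a position, a phrase
query for either spelling lines up with "Jonathan" in the same way.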

For an example of #2 see my WordNet/Synonym query expansion example in 
the Lucene sandbox. You prebuild an index of synonyms (or in your case 
maybe just rules are fine). Then you need query expansion code that 
takes "Schaffer" and expands it to something like "Schaffer 
Schäffer^0.9" (if you want to assume the user probably spells the name 
right). Simple enough to code; the only hassle then is if you want to use 
the standard QueryParser...
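
A minimal sketch of the expansion step for #2, assuming the variants come from
a small in-memory table rather than a prebuilt synonym index. QueryExpander,
VARIANTS, and expand are hypothetical names, and the output is just a query
string in the "term^boost" form shown above, not a Lucene Query object.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or the sandbox example.
class QueryExpander {
    // Assumed lookup table from the typed spelling to its variants.
    private static final Map<String, List<String>> VARIANTS = Map.of(
        "Schaffer", List.of("Sch\u00e4ffer"));

    // Append each known variant with a boost; unknown terms pass through as-is.
    static String expand(String term, float variantBoost) {
        StringBuilder q = new StringBuilder(term);
        for (String variant : VARIANTS.getOrDefault(term, List.of())) {
            q.append(' ').append(variant).append('^').append(variantBoost);
        }
        return q.toString();
    }
}
```

Here expand("Schaffer", 0.9f) returns "Schaffer Schäffer^0.9", while a term
with no entry in the table is returned unchanged.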

thx,
  Dave

> 
> 
> Mariella
> 
