Posted to solr-user@lucene.apache.org by Jarek Zgoda <ja...@redefine.pl> on 2008/10/16 09:07:30 UTC

Advice on analysis/filtering?

Hello, group.

I'm trying to create a search facility for documents in "broken" Polish
(by "broken" I mean "not compliant with the language rules"), searchable
by terms in "broken" Polish, but broken in different ways than the
documents. See this example:

document text: "włatcy móch" (in proper Polish this would be "władcy  
much")
example terms that should match: "włatcy much", "wlatcy moch", "wladcy  
much"

This double brokenness rules out all the Polish stemmers currently
available for Lucene, so now I am at square one. The search results do
not have to be 100% accurate - some missing results are acceptable,
but "false positives" are not. Is it at all possible using the machinery
provided by Solr (I do not own a PhD in linguistics), or should I ask
the business to lower their expectations?

-- 
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zgoda@redefine.pl


Re: Advice on analysis/filtering?

Posted by Erick Erickson <er...@gmail.com>.
You're welcome. I should have pointed out that I was responding
mostly to the "false hits are not acceptable" portion, which I don't
think is achievable....

Best
Erick

2008/10/16 Jarek Zgoda <ja...@redefine.pl>

> On 2008-10-16, at 15:54, Erick Erickson wrote:
>
>> Well, let me see. Your customers are telling you, in essence,
>> "for any random input, you cannot return false positives". Which
>> is nonsense, so I'd say you need to negotiate with your
>> customers. I flat guarantee that, for any algorithm you try,
>> you can write a counter-example in, oh, 15 seconds or so <G>.
>>
>
> They came to these expectations after seeing Solr's own spellcheck at work - if
> it can suggest correct versions, it should be able to sanitize broken words in
> documents and search them using sanitized input. To me, this seemed a
> reasonable request (provided, of course, that it can be achieved by reasonably
> abusing Solr's spellcheck component).
>
>> FuzzySearch tries to do some of this work for you, and that may be
>> acceptable, as this is a common issue. But it'll never be
>> perfect.
>>
>> You might get some joy from ngrams, but I haven't
>> worked with them myself, just seen them recommended by people
>> whose opinions I respect...
>>
>
> Thank you for these suggestions.
>
> --
> We read Knuth so you don't have to. - Tim Peters
>
> Jarek Zgoda, R&D, Redefine
> jarek.zgoda@redefine.pl
>
>

Re: Advice on analysis/filtering?

Posted by Norberto Meijome <nu...@gmail.com>.
On Thu, 16 Oct 2008 16:09:17 +0200
Jarek Zgoda <ja...@redefine.pl> wrote:

> They came to these expectations after seeing Solr's own spellcheck at work -
> if it can suggest correct versions, it should be able to sanitize
> broken words in documents and search them using sanitized input. To
> me, this seemed a reasonable request (provided, of course, that it can
> be achieved by reasonably abusing Solr's spellcheck component).

Don't forget that the Solr spellchecker finds its suggestions based on your
corpus, so if your index doesn't contain a correctly spelt version of wordA,
you won't receive wordA back as a "spellchecked" version of that word. I think
that's how it works by default (which is all I've needed so far).
I *think* there is also a way to use an external spellchecker (component or
word list), so you could keep your full list of Polish words in a file, I guess....
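
If you go down that path, here's a minimal sketch for solrconfig.xml,
assuming the Solr 1.3 SpellCheckComponent (spellings.txt is a
hypothetical file with one correct Polish word per line; the names and
paths are mine, adjust to taste):

   <!-- file-based spellchecker: suggestions come from the word list,
        not from your corpus -->
   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">file</str>
       <str name="classname">solr.FileBasedSpellChecker</str>
       <str name="sourceLocation">spellings.txt</str>
       <str name="characterEncoding">UTF-8</str>
       <!-- directory where the spellcheck index gets built -->
       <str name="spellcheckIndexDir">./spellcheckerFile</str>
     </lst>
   </searchComponent>

You still have to attach the component to a request handler and build
the dictionary once with spellcheck.build=true.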

I agree that playing with analysis.jsp is the best approach to these
problems (tick all the boxes and watch how the changes to your terms take place).

good luck - let us know what you come up with :)

B
_________________________
{Beto|Norberto|Numard} Meijome

"You can discover what your enemy fears most by observing the means he uses to
frighten you." Eric Hoffer (1902 - 1983)

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

Re: Advice on analysis/filtering?

Posted by Jarek Zgoda <ja...@redefine.pl>.
On 2008-10-16, at 15:54, Erick Erickson wrote:

> Well, let me see. Your customers are telling you, in essence,
> "for any random input, you cannot return false positives". Which
> is nonsense, so I'd say you need to negotiate with your
> customers. I flat guarantee that, for any algorithm you try,
> you can write a counter-example in, oh, 15 seconds or so <G>.

They came to these expectations after seeing Solr's own spellcheck at
work - if it can suggest correct versions, it should be able to sanitize
broken words in documents and search them using sanitized input. To me,
this seemed a reasonable request (provided, of course, that it can be
achieved by reasonably abusing Solr's spellcheck component).

> FuzzySearch tries to do some of this work for you, and that may be
> acceptable, as this is a common issue. But it'll never be
> perfect.
>
> You might get some joy from ngrams, but I haven't
> worked with them myself, just seen them recommended by people
> whose opinions I respect...

Thank you for these suggestions.


-- 
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zgoda@redefine.pl


Re: Advice on analysis/filtering?

Posted by Erick Erickson <er...@gmail.com>.
Well, let me see. Your customers are telling you, in essence,
"for any random input, you cannot return false positives". Which
is nonsense, so I'd say you need to negotiate with your
customers. I flat guarantee that, for any algorithm you try,
you can write a counter-example in, oh, 15 seconds or so <G>.

I think the best you can hope for is "reasonable results", but
getting your customers to agree on what is "reasonable" is...er...
often a challenge. Frequently, when confronted with "close but
not perfect", customers aren't as unforgiving as their first
position would indicate, since the inconvenience of not-quite-perfect
results is often much less than people expect at the outset.

FuzzySearch tries to do some of this work for you, and that may be
acceptable, as this is a common issue. But it'll never be
perfect.
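
In case it helps, fuzzy matching is plain query-parser syntax: a
trailing tilde, with an optional minimum similarity between 0 and 1
(the field name "text" below is just an example, substitute your own):

   q=text:wlatcy~0.6 text:moch~0.6

Lowering the 0.6 matches more (and wronger) terms, which is exactly
the false-positive trade-off I mentioned.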

You might get some joy from ngrams, but I haven't
worked with them myself, just seen them recommended by people
whose opinions I respect...
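
For the record, an ngram field type in schema.xml might look like the
sketch below. I'm assuming your Solr build ships NGramFilterFactory,
and the gram sizes are guesses to experiment with, not recommendations:

   <fieldType name="text_ngram" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- index overlapping 2- and 3-character grams so near-misses
            still share most of their terms -->
       <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"/>
     </analyzer>
   </fieldType>

The same analyzer runs at index and query time here, so "wlatcy" and
"wladcy" overlap on most of their grams and both can match.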

Best
Erick



Re: Advice on analysis/filtering?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jarek Zgoda wrote:
> On 2008-10-16, at 16:21, Grant Ingersoll wrote:
> 
>>> I'm trying to create a search facility for documents in "broken" Polish
>>> (by "broken" I mean "not compliant with the language rules"),
>>
>> Can you explain what you mean here a bit more?  I don't know Polish, 

Hi guys,

I do speak Polish :) maybe I can help here a bit.


> Some documents (around 15% of the whole pile) contain texts entered by
> primary-school children, which implies many syntactic and orthographic
> errors.

>>> document text: "włatcy móch" (in proper Polish this would be "władcy 
>>> much")
>>> example terms that should match: "włatcy much", "wlatcy moch", 
>>> "wladcy much"

These examples can be classified as "sounds like", and typically 
soundexing algorithms are used to address this problem, in order to 
generate initial suggestions. After that you can use other heuristic 
rules to select the most probable correct forms.

AFAIK, there are no (public) soundex implementations for Polish, in 
particular in Java, although there was some research work done on the 
construction of a specifically Polish soundex. You could also use the 
Daitch-Mokotoff soundex, which comes close enough.
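
If you want to experiment anyway, Solr's PhoneticFilterFactory exposes
the stock commons-codec encoders; as far as I know none of them is
Daitch-Mokotoff, and they are all English-oriented, so treat this only
as a sketch of the mechanism, not a solution for Polish:

   <fieldType name="text_phonetic" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- inject="true" indexes the phonetic code alongside the
            original token instead of replacing it -->
       <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
     </analyzer>
   </fieldType>

A Daitch-Mokotoff (or Polish-specific) encoder would mean writing your
own TokenFilter around your own codec implementation.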


> Taking the word "włatcy" from my example, I'd like to find documents
> containing the words

> "wlatcy" (latin-2 accentuations stripped from original), 

This step is trivial.

> "władcy" (proper form of this noun) and "wladcy" (latin-2 
> accents stripped from proper form).

And this one is not. It requires using something like soundexing in 
order to look up possible similar terms. However ... in this process you 
inevitably collect false positives, and you don't have any way in the 
input text to determine that they should be rejected. You can only make 
this decision based on some external knowledge of Polish, such as:

* a morpho-syntactic analyzer that will determine which combinations of 
suggestions are more correct and more probable,

* a language model that for any given soundexed phrase can generate the 
most probable original phrases.

Also, knowing the context in which a query is asked may help, but 
usually you don't have this information (queries are short).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Advice on analysis/filtering?

Posted by Jarek Zgoda <ja...@redefine.pl>.
On 2008-10-16, at 16:21, Grant Ingersoll wrote:

>> I'm trying to create a search facility for documents in "broken" Polish
>> (by "broken" I mean "not compliant with the language rules"),
>
> Can you explain what you mean here a bit more?  I don't know Polish,
> but most spoken languages can't be pinned down to a specific set of
> rules.  In other words, the exception is the rule.  Or are you
> saying the documents are more dialog-based, i.e. more informal, as
> in two people having a conversation?

Some documents (around 15% of the whole pile) contain texts entered by
primary-school children, which implies many syntactic and orthographic
errors. The text is indexed "as is" and Solr is able to find exact
occurrences, but I'd like to also find documents that contain other
variations of the errors, as well as the proper forms. And oh, the
system will be used by children of the same age, who tend to make
similar errors when entering search terms.

>> searchable by terms in "broken" Polish, but broken in different ways
>> than the documents. See this example:
>>
>> document text: "włatcy móch" (in proper Polish this would be  
>> "władcy much")
>> example terms that should match: "włatcy much", "wlatcy moch",  
>> "wladcy much"
>>
>> This double brokenness rules out all the Polish stemmers currently
>> available for Lucene, so now I am at square one. The search results do
>> not have to be 100% accurate - some missing results are acceptable,
>>
>> but "false positives" are not.
>
>
> There's no such thing in any language.  In your example above, what  
> is matching that shouldn't?  Is this happening across a lot of  
> documents, or just a few?

Yeah, I know that. By "not acceptable" I mean "not acceptable above
some level". Sorry for the confusion.

Taking word "włatcy" from my example, I'd like to find documents  
containing words "wlatcy" (latin-2 accentuations stripped from  
original), "władcy" (proper form of this noun) and "wladcy" (latin-2  
accents stripped from proper form). The issue #1 (stripping  
accentuations from original) seems to be resolvable outside solr - I  
can index texts with accentuations stripped already. The issue #2  
(finding proper form for word) is the most interesting for me. Issue  
#3 depends on #1 and #2.
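
For issue #1, in case doing it inside Solr turns out to be more
convenient than preprocessing, here is a sketch I'm considering
(assuming PatternReplaceFilterFactory is available in my version;
ISOLatin1AccentFilter won't do, as Polish diacritics sit outside
Latin-1):

   <fieldType name="text_pl_folded" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- fold the nine lowercase Polish diacritics to plain ASCII -->
       <filter class="solr.PatternReplaceFilterFactory" pattern="ą" replacement="a" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ć" replacement="c" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ę" replacement="e" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ł" replacement="l" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ń" replacement="n" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ó" replacement="o" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ś" replacement="s" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ź" replacement="z" replace="all"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="ż" replacement="z" replace="all"/>
     </analyzer>
   </fieldType>

(LowerCaseFilter runs first, so the uppercase variants are covered too.)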

>> Is it at all possible using the machinery provided by Solr (I do not
>> own a PhD in linguistics), or should I ask the business to lower
>> their expectations?
>
> Well, I think there are a couple of approaches:
> 1. You can write your own filter/stemmer/analyzer that you think
> fixes these issues.
> 2. You can protect the "broken" words and not have them filtered, or  
> filter them differently.
> 3. You can lower expectations.
>
>
> One thing to try out is Solr's analysis tool in the admin, and see  
> if you can get a better handle on what is going wrong.

I'll see how far I could go with spellchecker and fuzzy searches.
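
For the record, the first experiment will probably look something like
this (the handler and field names are whatever I end up wiring the
component into; URL-escaping omitted for readability):

   http://localhost:8983/solr/select?q=wlatcy moch&spellcheck=true&spellcheck.count=5&spellcheck.collate=true

with a one-off &spellcheck.build=true beforehand to build the
spellcheck dictionary.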

-- 
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zgoda@redefine.pl


Re: Advice on analysis/filtering?

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 16, 2008, at 3:07 AM, Jarek Zgoda wrote:

> Hello, group.
>
> I'm trying to create a search facility for documents in "broken" Polish
> (by "broken" I mean "not compliant with the language rules"),

Can you explain what you mean here a bit more?  I don't know Polish,
but most spoken languages can't be pinned down to a specific set of
rules.  In other words, the exception is the rule.  Or are you saying
the documents are more dialog-based, i.e. more informal, as in two
people having a conversation?

> searchable by terms in "broken" Polish, but broken in different ways
> than the documents. See this example:
>
> document text: "włatcy móch" (in proper Polish this would be  
> "władcy much")
> example terms that should match: "włatcy much", "wlatcy moch",  
> "wladcy much"
>
> This double brokenness rules out all the Polish stemmers currently
> available for Lucene, so now I am at square one. The search results do
> not have to be 100% accurate - some missing results are acceptable,
>
> but "false positives" are not.


There's no such thing in any language.  In your example above, what is  
matching that shouldn't?  Is this happening across a lot of documents,  
or just a few?


> Is it at all possible using the machinery provided by Solr (I do not own
> a PhD in linguistics), or should I ask the business to lower their
> expectations?

Well, I think there are a couple of approaches:
1. You can write your own filter/stemmer/analyzer that you think fixes
these issues.
2. You can protect the "broken" words and not have them filtered, or
filter them differently (see the sketch after this list).
3. You can lower expectations.
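
For #2, here's a minimal sketch of the mechanism via the "protected"
attribute. The stemmer below is obviously the wrong one for Polish and
only illustrates the idea; protwords.txt would hold the "broken" forms
you want passed through untouched:

   <fieldType name="text_protected" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- tokens listed in protwords.txt bypass the stemmer entirely -->
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     </analyzer>
   </fieldType>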

One thing to try is Solr's analysis tool in the admin UI, to see if
you can get a better handle on what is going wrong.


--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ