Posted to java-user@lucene.apache.org by Arnold Leung <al...@shaw.ca> on 2007/05/17 07:38:41 UTC

snowball (english) and filenames

Does Snowball (English) support "filenames?"

For example, Authenticate.dll does not return a hit when the keyword
"authenticate" (without ".dll") is used.

("authenticate*" or "authenticate.dll" works, though.)

Is there any way to get around this? And why does the Snowball demo
(http://snowball.tartarus.org/demo.php) seem to handle it?

ex.
I entered the following in the textbox:

authenticate.dll
authenticate
authentication

and I got back:

authenticate -> authent
dll -> dll
authenticate -> authent
authentication -> authent

Thanks in advance.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: snowball (english) and filenames

Posted by Doron Cohen <DO...@il.ibm.com>.
> In Lucene, a.b.c.d.e.f.g.h is not broken apart the way the Snowball
> demo indicates it should be.

I am not sure about the "should" here - the way I see it, this
is just how the demo works: Snowball stemmers operate on words,
so the demo first breaks the input text into words and only
then applies stemming.
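
To make the two stages concrete, here is a minimal plain-Java sketch.
It is not the demo's code: the splitting rule (periods and whitespace)
is an assumption based on the behaviour described in this thread, the
class and method names are made up for illustration, and the stemmer
is passed in as a function so an identity stand-in can be used instead
of a real Snowball stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of "split into words first, then stem each word".
final class StemPipelineSketch {

    // Stage 1: break the text into words (assumed rule: split on '.' and whitespace).
    // Stage 2: hand each word to the stemmer on its own.
    static List<String> tokenizeThenStem(String text, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String word : text.split("[.\\s]+")) {
            if (!word.isEmpty()) {          // skip the empty piece a leading separator produces
                out.add(stemmer.apply(word));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Identity stand-in; a real stemmer would map "authenticate" to "authent".
        UnaryOperator<String> noStemming = word -> word;
        System.out.println(tokenizeThenStem("a.b.c.d.e.f.g.h", noStemming));
        System.out.println(tokenizeThenStem("authenticate.dll", noStemming));
    }
}
```

The point is only that the stemmer never sees "authenticate.dll" as a
single word when the splitting step has already separated it; whether
that happens depends on the splitting step, not on the stemmer.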

> For my lucene testing, I indexed one text file with one
> "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.
> It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the
> string based on the periods).

In Lucene, the way text is "broken" into words is up to the
application and depends on the analyzer being used.
WhitespaceAnalyzer would break on white space. StandardAnalyzer
would do more sophisticated work. Analyzers are extendable,
so you could modify their behavior. The wiki page
"AnalysisParalysis" has some relevant info.

Using Lucene's SimpleAnalyzer btw would break "a.b.c" into
"a b c" which seems to be what you are looking for?
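
To illustrate the difference, here is a small plain-Java sketch that
imitates the two splitting strategies. It deliberately does not use
the Lucene API, and the class and method names are hypothetical: one
method splits on whitespace only, the other ends a token at every
non-letter and lowercases it, which is roughly what a letter-based
analyzer such as SimpleAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not Lucene's analyzer code.
final class TokenizeSketch {

    // Whitespace-only splitting: "Logon.dll" stays a single token.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Letter-based splitting: every non-letter ends the current token,
    // and letters are lowercased as they are collected.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetter(ch)) {
                current.append(Character.toLowerCase(ch));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());   // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a.b.c.d.e.f.g.h", "Logon.dll", "some-msp.msp"}) {
            System.out.println(input
                    + " | whitespace: " + splitOnWhitespace(input)
                    + " | non-letters: " + splitOnNonLetters(input));
        }
    }
}
```

With letter-based splitting, "Logon.dll" becomes the two tokens
"logon" and "dll", so a search for "logon" has something to match.
Note that digits are dropped as well under this rule, which may or
may not be acceptable for filenames.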

HTH,
Doron




Re: snowball (english) and filenames

Posted by Arnold Leung <al...@shaw.ca>.
On 16-May-07, at 11:00 PM, Doron Cohen wrote:

> If you enter a.b.c.d.e.f.g.h to that demo you'll see that
> the demo simply breaks the input text on '.' - that has
> nothing to do with filenames.

That is not what I am seeing from my testing:

In Lucene, a.b.c.d.e.f.g.h is not broken apart the way the Snowball
demo indicates it should be.

At http://snowball.tartarus.org/demo.php

"a.b.c.d.e.f.g.h" shows:

a -> a
b -> b
c -> c
d -> d
e -> e
f -> f
g -> g
h -> h

For my lucene testing, I indexed one text file with one   
"a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.   
It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the  
string based on the periods).


As a real-world example, Logon.dll is being converted to "Logon.dl"
rather than to "Logon" and "dll", as the Snowball demo indicates.

Also:

Demo:
some-msp.msp

somemsp -> somemsp
msp -> msp

Lucene:
some-msp.msp

some
msp.msp




Re: snowball (english) and filenames

Posted by Doron Cohen <DO...@il.ibm.com>.
If you enter a.b.c.d.e.f.g.h to that demo you'll see that
the demo simply breaks the input text on '.' - that has
nothing to do with filenames.

Arnold Leung <al...@shaw.ca> wrote on 16/05/2007 22:38:41:

> Does Snowball (English) support "filenames?"
>
> For example, Authenticate.dll does not return a hit when the keyword
> "authenticate" (without ".dll") is used.
>
> ("authenticate*" or "authenticate.dll" works, though.)
>
> Is there any way to get around this? And why does the Snowball demo
> (http://snowball.tartarus.org/demo.php) seem to handle it?
>
> ex.
> I entered the following in the textbox:
>
> authenticate.dll
> authenticate
> authentication
>
> and I got back:
>
> authenticate -> authent
> dll -> dll
> authenticate -> authent
> authentication -> authent
>
> Thanks in advance.

