You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Owen Densmore <ow...@backspaces.net> on 2005/01/20 22:50:10 UTC

Newbie: Human Readable Stemming, Lucene Architecture, etc!

Hi .. I'm new to the list so forgive a dumb question or two as I get 
started.

We're in the midst of converting a small collection (1200-1500 
currently) of scientific literature to be easily searchable/navigable.  
We'll likely provide both a text query interface as well as a graphical 
way to search and discover.

Our initial approach will be vector based, looking at Latent Semantic 
Indexing (LSI) as a potential tool, although if that's not needed, 
we'll stop at reasonably simple stemming with a weighted document term 
matrix (DTM).  (Bear in mind I couldn't even pronounce most of these 
concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture.  I 
should at the very least be able to use the analyzer and stemmer to 
create a good starting point in the project.  I'd also like to leave a 
nice architecture behind in case we or others end up experimenting 
with, or extending, the system.

So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
apparently produces non-word stems .. i.e. not really human readable.  
(Example: generate, generates, generated, generating  -> generat) 
Although in typical queries this is not important because the result of 
the search is a document list, it *would* be important if we use the 
stems within a graphical navigation interface.
     So the question is: Is there a way to have the stemmer produce 
english
     base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such 
as DTM/LSI and graphical clustering and navigation.  Naturally we'll 
provide code for these parts that are not in Lucene.
     But the question arises: is this kinda dumb?!  Has anyone stretched 
Lucene's
     design center with positive results?  Are we barking up the wrong 
tree?

3 - A nit on hyphenation: Our collection is scientific so has many 
hyphenated words.  I'm wondering about your experiences with 
hyphenation.  In our collection, things like self-organization, 
power-law, space-time, small-world, agent-based, etc. occur often, for 
example.
     So the question is: Do folks break up hyphenated words?  If not, do 
you stem the
     parts and glue them back together?  Do you apply stoplists to the 
parts?

Thanks for any help and pointers you can fling along,

Owen    http://backspaces.net/    http://redfish.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Posted by jian chen <ch...@gmail.com>.
Hi,

One thing to point out. I think Lucene is not using LSI as the
underlying retrieval model. It uses vector space model and also
proximity based retrieval.

Personally, I don't know much about LSI and I don't think the fancy
stuff like LSI is workable in industry. I believe we are far away from
the era of artificial intelligence and using any elusive way to do
information retrieval.

Cheers,

Jian


On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore <ow...@backspaces.net> wrote:
> Hi .. I'm new to the list so forgive a dumb question or two as I get
> started.
> 
> We're in the midst of converting a small collection (1200-1500
> currently) of scientific literature to be easily searchable/navigable.
> We'll likely provide both a text query interface as well as a graphical
> way to search and discover.
> 
> Our initial approach will be vector based, looking at Latent Semantic
> Indexing (LSI) as a potential tool, although if that's not needed,
> we'll stop at reasonably simple stemming with a weighted document term
> matrix (DTM).  (Bear in mind I couldn't even pronounce most of these
> concepts last week, so go easy if I'm incoherent!)
> 
> It looks to me that Lucene has a quite well factored architecture.  I
> should at the very least be able to use the analyzer and stemmer to
> create a good starting point in the project.  I'd also like to leave a
> nice architecture behind in case we or others end up experimenting
> with, or extending, the system.
> 
> So a couple of questions:
> 
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
> apparently produces non-word stems .. i.e. not really human readable.
> (Example: generate, generates, generated, generating  -> generat)
> Although in typical queries this is not important because the result of
> the search is a document list, it *would* be important if we use the
> stems within a graphical navigation interface.
>      So the question is: Is there a way to have the stemmer produce
> english
>      base forms of the words being stemmed?
> 
> 2 - We're probably using Lucene in ways it was not designed for, such
> as DTM/LSI and graphical clustering and navigation.  Naturally we'll
> provide code for these parts that are not in Lucene.
>      But the question arises: is this kinda dumb?!  Has anyone stretched
> Lucene's
>      design center with positive results?  Are we barking up the wrong
> tree?
> 
> 3 - A nit on hyphenation: Our collection is scientific so has many
> hyphenated words.  I'm wondering about your experiences with
> hyphenation.  In our collection, things like self-organization,
> power-law, space-time, small-world, agent-based, etc. occur often, for
> example.
>      So the question is: Do folks break up hyphenated words?  If not, do
> you stem the
>      parts and glue them back together?  Do you apply stoplists to the
> parts?
> 
> Thanks for any help and pointers you can fling along,
> 
> Owen    http://backspaces.net/    http://redfish.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Posted by Andrzej Bialecki <ab...@getopt.org>.
Morus Walter wrote:
> Owen Densmore writes:
> 
> 
>>1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
>>apparently produces non-word stems .. i.e. not really human readable.  
>>(Example: generate, generates, generated, generating  -> generat) 
>>Although in typical queries this is not important because the result of 
>>the search is a document list, it *would* be important if we use the 
>>stems within a graphical navigation interface.
>>     So the question is: Is there a way to have the stemmer produce 
>>english
>>     base forms of the words being stemmed?
>>
> 
> rule based stemmers such as porter/snowball cannot do that.
> But there are (commercial) dictionary based tools that can. E.g. the
> canoo lemmatizer.
> You might also have a look at egothors stemmer, that are word list based.

Egothor stemmers are algorithmic, they only use word lists for training. 
Stems produced by them are usually closer to lemmas than in e.g. 
Porter's stemmer, but there is a significant amount of stems like in the 
example above.




-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Posted by Morus Walter <mo...@tanto.de>.
Owen Densmore writes:

> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
> apparently produces non-word stems .. i.e. not really human readable.  
> (Example: generate, generates, generated, generating  -> generat) 
> Although in typical queries this is not important because the result of 
> the search is a document list, it *would* be important if we use the 
> stems within a graphical navigation interface.
>      So the question is: Is there a way to have the stemmer produce 
> english
>      base forms of the words being stemmed?
> 
rule based stemmers such as porter/snowball cannot do that.
But there are (commercial) dictionary based tools that can. E.g. the
canoo lemmatizer.
You might also have a look at egothors stemmer, that are word list based.

HTH
	Morus



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org