Posted to java-user@lucene.apache.org by Kasun Perera <ka...@opensource.lk> on 2012/04/28 05:02:56 UTC

Indexing with Semantics

I'm using Lucene's term frequency vectors to calculate cosine similarity between
documents. Say my documents contain these 3 terms: "owe", "owed", "owing". Lucene
treats these as 3 separate terms, but all 3 of them mean the same thing, "owe". Is
there any functionality in Lucene that can be used to index by semantics, so that
it indexes "owe", "owed", "owing" as one word "owe" with term frequency = 3?

If not, I'd welcome any suggestions for achieving this.

-- 
Regards

Kasun Perera

Re: Indexing with Semantics

Posted by Li Li <fa...@gmail.com>.
Use a stemmer.
"Semantics" is a "large" word; be careful how you use it.
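
To illustrate what a stemmer does to the three terms from the question, here is a minimal sketch in plain Java. It does not use Lucene's API: `naiveStem` is a hypothetical, deliberately crude suffix stripper invented for this example, and `termFrequencies` is likewise an assumed helper name. In Lucene itself, stemming is normally applied in the analysis chain (for example, a Porter stemming token filter), so the stemmed form is what ends up in the term vector. Note that a stem does not have to be a dictionary word.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class StemmedTermFrequency {

    // Hypothetical toy stemmer: strips one of a few English suffixes,
    // provided at least two characters remain. A real stemmer (e.g. Porter)
    // has many more rules; this only covers the example in the thread.
    static String naiveStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "e"}) {
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 2) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Counts how often each stem occurs in the token list.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(naiveStem(token), 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies(Arrays.asList("owe", "owed", "owing")));
    }
}
```

With this toy rule set, "owe", "owed" and "owing" all collapse to the single stem "ow" with a frequency of 3, which is the behaviour asked for, except that the shared key is a stem rather than the word "owe".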


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing with Semantics

Posted by Yuval Kesten <yk...@yahoo-inc.com>.
Hi,
What you are looking for is lemmatization: http://en.wikipedia.org/wiki/Lemmatisation
I don't think Lucene has a built-in lemmatizer, but you can use GATE, which is an open source project:
http://gate.ac.uk
http://gate.ac.uk/gate/doc/plugins.html

Enjoy!
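
As a rough sketch of how lemmatization would feed into the cosine similarity from the original question, the plain Java below maps each token to a dictionary form before counting. It does not use GATE's or Lucene's API. The `LEMMAS` table is a hypothetical, hand-written stand-in for a real lemmatizer's dictionary, and the class and method names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LemmaCosine {

    // Hypothetical lemma table; a real lemmatizer relies on a full
    // dictionary and usually on part-of-speech information as well.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("owed", "owe");
        LEMMAS.put("owing", "owe");
        LEMMAS.put("paid", "pay");
    }

    // Counts term frequencies after replacing each token by its lemma.
    // Tokens missing from the table are kept unchanged.
    static Map<String, Integer> lemmaFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(LEMMAS.getOrDefault(token, token), 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity of two term-frequency vectors stored as maps.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = lemmaFrequencies(Arrays.asList("owe", "owed", "owing"));
        Map<String, Integer> d2 = lemmaFrequencies(Arrays.asList("owe"));
        System.out.println(cosine(d1, d2));
    }
}
```

Unlike a stemmer, the shared key here is the real word "owe", so the first document is counted as "owe" with frequency 3. The trade-off is that every inflected form has to be known to the dictionary.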


