Posted to solr-user@lucene.apache.org by "Kishore AVK. Veleti" <Ki...@coreobjects.com> on 2007/11/18 18:09:16 UTC

Finding all possible synonyms for a word

Hi All,

I am new to Lucene / SOLR and am developing a POC as part of my research. Please see my requirement and problem statement below. I need help on how to index the data so that my POC has good search functionality.

------------------------------------------------------------------
Requirement:
------------------------------------------------------------------

Assume my web application is an online book store that sells all categories of books, such as Computers, Social Studies, Physical Sciences, etc. Each of these categories has sub-categories. For example, Computers has sub-categories such as Software Engineering, Java, SQL Server, etc.

I have a database table called Categories and it contains both Parent Category descriptions and also Child Category descriptions.

Data structure of Category table is:

Category_ID_Primay_Key  integer
Parent_Category_ID  integer
Category_Name varchar(100)
Category_Description varchar(1000)


------------------------------------------------------------------
My Search UI:
------------------------------------------------------------------

My search page is very simple: a text field with a "Search" button.

------------------------------------------------------------------
User Action:
------------------------------------------------------------------

The user enters the search text below in the text field above and clicks the "Search" button.

"Books on Data Center"

------------------------------------------------------------------
What is my expected behavior:
------------------------------------------------------------------

Since the phrase "Data Center" is most relevant to computers, I should show books related to computers.

------------------------------------------------------------------
My Problem statement and Question to you all:
------------------------------------------------------------------

To get better search in my web application, what kind of strategy should I follow, and how should I index the data accordingly in SOLR/Lucene?

My Lucene index may or may not contain the phrase "data center". Even so, I should be able to return results for "data center".

One thought I have is as follows:

Modify the Category table by adding one more column to it:

Category_ID_Primay_Key  integer
Parent_Category_ID  integer
Category_Name varchar(100)
Category_Description varchar(1000)
Category_Description_Keywords varchar(8000)

Now take each word in "Category_Description", find its synonyms, and store them in the Category_Description_Keywords column. After that, index the Category table records in SOLR/Lucene.
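
The keyword-expansion step described above could look roughly like the following. This is only a minimal sketch: the class name KeywordExpander, the expand() method and the sample synonym map are made up for illustration, and the synonym lookup is assumed to already be available as an in-memory map, whatever synonym source ends up being used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for filling the Category_Description_Keywords column. */
final class KeywordExpander {

    /**
     * Returns every distinct word of the description, each followed by any
     * synonyms the lookup map has for it, joined with single spaces.
     */
    static String expand(String description, Map<String, List<String>> synonyms) {
        Set<String> keywords = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        for (String word : description.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            keywords.add(word);
            keywords.addAll(synonyms.getOrDefault(word, Collections.<String>emptyList()));
        }
        return String.join(" ", keywords);
    }

    public static void main(String[] args) {
        // Sample data only; a real synonym source would supply this map.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("computers", Arrays.asList("computing", "servers"));
        // prints "books about computers computing servers"
        System.out.println(expand("Books about computers", synonyms));
    }
}
```

Note that this splits the description into single words, so a multi-word phrase such as "data center" would need its own handling.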

Below are my questions to you all:

Question 1:
I need your feedback on the above approach, or on any other approach that would help me improve my search so that it returns the most relevant results to the user.

Question 2:
Can you suggest the best Java-based open source or commercial synonym engines? I want a synonym engine that gives me all possible synonyms of a word.



Thanks in Advance,
Kishore Veleti A.V.K.

Re: Finding all possible synonyms for a word

Posted by Walter Underwood <wu...@netflix.com>.
How many synonym sets do you have? I'm using about 600 sets with
no problem.  --wunder

On 11/19/07 8:23 PM, "climbingrose" <cl...@gmail.com> wrote:

> Correction for last message: you need to modify or extend
> SynonymFilterFactory instead of SynonymFilter. SynonymFilterFactory is
> responsible for initialising SynonymFilter and populating the list of
> synonyms. Have a look at the source code. I think it's pretty easy to
> understand. What you probably need to do is to add more parameters
> such as database host, username, password and the actual database in
> init() method.


Re: Finding all possible synonyms for a word

Posted by climbingrose <cl...@gmail.com>.
Correction for last message: you need to modify or extend
SynonymFilterFactory instead of SynonymFilter. SynonymFilterFactory is
responsible for initialising SynonymFilter and populating the list of
synonyms. Have a look at the source code. I think it's pretty easy to
understand. What you probably need to do is to add more parameters
such as database host, username, password and the actual database in
the init() method.
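
As a rough illustration of the idea above, the sketch below reads database settings from a map of string arguments and loads word/synonym pairs over JDBC. It deliberately does not extend any Solr class: DbSynonymSource, the parameter names (jdbcUrl, user, password) and the table synonyms(word, synonym) are all assumptions made up for this example, and wiring the result into a customised SynonymFilterFactory is left out.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Solr or Lucene. */
final class DbSynonymSource {
    final String jdbcUrl;
    final String user;
    final String password;

    private DbSynonymSource(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** Reads connection settings from factory-style arguments (assumed parameter names). */
    static DbSynonymSource fromArgs(Map<String, String> args) {
        return new DbSynonymSource(
                require(args, "jdbcUrl"), require(args, "user"), require(args, "password"));
    }

    private static String require(Map<String, String> args, String key) {
        String value = args.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter: " + key);
        }
        return value;
    }

    /** Loads word -> synonyms from an assumed table synonyms(word, synonym). */
    Map<String, List<String>> load() throws SQLException {
        Map<String, List<String>> result = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT word, synonym FROM synonyms")) {
            while (rs.next()) {
                result.computeIfAbsent(rs.getString("word"), k -> new ArrayList<>())
                      .add(rs.getString("synonym"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("jdbcUrl", "jdbc:example://localhost/books"); // placeholder URL
        settings.put("user", "solr");
        settings.put("password", "secret");
        // Only parses the settings; load() needs a real database and JDBC driver.
        System.out.println(fromArgs(settings).jdbcUrl);
    }
}
```

Everything load() returns is held in memory, so the table size matters just as it does for a synonyms file.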

On Nov 20, 2007 3:18 PM, climbingrose <cl...@gmail.com> wrote:
> One approach is to extend SynonymFilter so that it reads synonyms from
> a database instead of a file. SynonymFilter is just a Java class so you
> can do whatever you want with it :D. From what I remember, the filter
> initialises a list of all input synonyms and stores them in memory.
> Therefore, you need to make sure that all the synonyms can fit into
> memory at runtime.



-- 
Regards,

Cuong Hoang

Re: Finding all possible synonyms for a word

Posted by climbingrose <cl...@gmail.com>.
One approach is to extend SynonymFilter so that it reads synonyms from
a database instead of a file. SynonymFilter is just a Java class so you
can do whatever you want with it :D. From what I remember, the filter
initialises a list of all input synonyms and stores them in memory.
Therefore, you need to make sure that all the synonyms can fit into
memory at runtime.
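
To make the "everything lives in memory" point concrete, here is a simplified sketch that parses synonym rules into a map. It is not Solr's own parser; InMemorySynonyms and parse() are names invented for this example. It assumes the two rule styles used in Solr synonym files: comma-separated equivalent terms, and explicit "a => b" mappings, with "#" starting a comment line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for an in-memory synonym table. */
final class InMemorySynonyms {

    /** Parses rule lines into term -> replacement terms, all held in memory. */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> table = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] sides = line.split("=>", 2);
            List<String> inputs = terms(sides[0]);
            // "a => b": inputs map to the right-hand side only.
            // "a, b, c": every term maps to the whole group.
            List<String> outputs = sides.length == 2 ? terms(sides[1]) : inputs;
            for (String input : inputs) {
                table.computeIfAbsent(input, k -> new LinkedHashSet<>()).addAll(outputs);
            }
        }
        return table;
    }

    /** Splits on commas only, so multi-word terms such as "data center" stay intact. */
    private static List<String> terms(String csv) {
        List<String> out = new ArrayList<>();
        for (String part : csv.split(",")) {
            String term = part.trim().toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> table = parse(Arrays.asList(
                "# sample rules",
                "data center, datacenter, server farm",
                "dc => data center"));
        System.out.println(table.size() + " terms held in memory");
        System.out.println("dc -> " + table.get("dc"));
    }
}
```

The map grows with the number of distinct terms, which is the quantity to watch when judging whether the synonyms fit into memory.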

On Nov 20, 2007 1:54 AM, Kishore AVK. Veleti <Ki...@coreobjects.com> wrote:
> Hi Eswar,
>
> Thanks for the update.
>
> I have gone through the link you provided below, and what I understood from it is that we need to have all possible synonyms in a text file. This file needs to be given as input for "SynonymFilterFactory" to work. If my understanding is right, then this approach may not suit my requirement. The reason is that I would need to find synonyms for all the keywords in the category descriptions and store them in that input file. The file may be too big.
>
> Let me know if my understanding is wrong.
>
>
> Thanks,
> Kishore Veleti A.V.K.



-- 
Regards,

Cuong Hoang

RE: Finding all possible synonyms for a word

Posted by "Kishore AVK. Veleti" <Ki...@coreobjects.com>.
Hi Eswar,

Thanks for the update.

I have gone through the link you provided below, and what I understood from it is that we need to have all possible synonyms in a text file. This file needs to be given as input for "SynonymFilterFactory" to work. If my understanding is right, then this approach may not suit my requirement. The reason is that I would need to find synonyms for all the keywords in the category descriptions and store them in that input file. The file may be too big.

Let me know if my understanding is wrong.


Thanks,
Kishore Veleti A.V.K.



-----Original Message-----
From: Eswar K [mailto:kja.eswar@gmail.com]
Sent: Monday, November 19, 2007 11:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Finding all possible synonyms for a word

Kishore,

Solr has a SynonymFilterFactory which might be of use to you (
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)


Regards,
Eswar


Re: Finding all possible synonyms for a word

Posted by Eswar K <kj...@gmail.com>.
Kishore,

Solr has a SynonymFilterFactory which might be of use to you (
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)


Regards,
Eswar
