
Posted to java-user@lucene.apache.org by Uncle <un...@gmail.com> on 2012/04/27 13:35:39 UTC

Reverse keyword search?

Hello,

I am relatively new to Lucene, so this might be a noob question; if so, please redirect me. I'd like some guidance on how to use Lucene to address a problem.

I have a set of a few hundred (and growing) user-defined keywords such as "spain" and "volkswagen", each associated with one of about 20 categories, such as "world" and "automotive". My challenge is to take the summary (title, description, caption, meta-tags, keywords, but not the entire content) of a news article, such as what you might find on cnn.com, and look for those keywords in it to identify the article's category. The article's summary is often "dirty" with special characters, commas, hash tags, etc., and so needs to be tokenized. I would also like to use Lucene's natural language processing to match "spanish" to "spain", for example.

This appears to be somewhat the reverse of the typical Lucene use case -- rather than having a set of, say, 1000 articles which are indexed, then issuing a query using a few keywords to search on those articles, I have a set of, say, 1000 keywords and a single article, and I want to determine which keyword best fits the article's summary. How can I best use Lucene to handle this?

I have considered:

1) Creating a Lucene index of the keywords and topics, then tokenizing the summaries using Lucene's tokenizers, then issuing queries with the tokens to find the best match.
2) Indexing the article summary, then iterating over all of the keywords, issuing a query for each of them, then keeping the best match.
3) Learning how Lucene does the individual keyword-to-keyword matching and writing some custom solution.

I'd appreciate it if someone could point me in the right direction.

Randy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Similarity coefficient for more exact matching

Posted by Ian Lea <ia...@gmail.com>.

Similarity.setDefault(new MySimilarity()) is certainly better than the
2 calls I recommended.  Thanks.

I find it hard to see why one might not want to do this in normal
usage but have a vague recollection of someone once outlining some
obscure scenarios where different similarities at index and search
time made sense.


--
Ian.


On Fri, May 4, 2012 at 5:32 PM, Paul Hill <pa...@metajure.com> wrote:
>> [use] IndexWriterConfig.setSimilarity() and
>> IndexSearcher.setSimilarity(), unless you are clever or like being confused.
>>
>> SweetSpotSimilarity might also be worth a look.
>>
>> --
>> Ian.
>
> Being even less clever,  I just make sure I set:
>
> Similarity.setDefault(new MySimilarity())
>
> when crawling and searching, so everything uses the same similarity strategies.
>
> Checking the 3.4 code, IndexWriterConfig and IndexSearcher both default to Similarity.getDefault().
>
> Any thoughts on scenarios where you'd not push a custom similarity into the default position?
>
> -Paul


RE: Similarity coefficient for more exact matching

Posted by Paul Hill <pa...@metajure.com>.

> [use] IndexWriterConfig.setSimilarity() and
> IndexSearcher.setSimilarity(), unless you are clever or like being confused.
> 
> SweetSpotSimilarity might also be worth a look.
> 
> --
> Ian.

Being even less clever,  I just make sure I set:

Similarity.setDefault(new MySimilarity())  

when crawling and searching, so everything uses the same similarity strategies.

Checking the 3.4 code, IndexWriterConfig and IndexSearcher both default to Similarity.getDefault().

Any thoughts on scenarios where you'd not push a custom similarity into the default position?

-Paul



Re: Similarity coefficient for more exact matching

Posted by Ian Lea <ia...@gmail.com>.

You can override org.apache.lucene.search.Similarity/DefaultSimilarity
to tweak quite a lot of stuff.

computeNorm() may be the method you are interested in.  Called at
indexing time so be sure to use the same implementation at index and
query time, using IndexWriterConfig.setSimilarity() and
IndexSearcher.setSimilarity(), unless you are clever or like being
confused.

SweetSpotSimilarity might also be worth a look.
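
To make the effect of the norm concrete, here is a minimal, Lucene-free sketch of the arithmetic. It assumes the length norm that DefaultSimilarity uses in the 3.x line, boost / sqrt(number of terms), and that the analyzer yields one token for "Hello" and six for the longer value. flatNorm and its plateau parameter are hypothetical names made up for illustration; they show one way an overridden computeNorm() could stop penalizing moderately longer fields, and are not Lucene API.

```java
final class LengthNormSketch {

    // Length normalization in the style of Lucene 3.x DefaultSimilarity:
    // boost / sqrt(number of terms in the field).
    static double defaultNorm(int numTerms, double boost) {
        return boost / Math.sqrt(numTerms);
    }

    // Hypothetical alternative: every field with up to `plateau` terms gets
    // the same norm, so short fields are not favored over slightly longer ones.
    static double flatNorm(int numTerms, double boost, int plateau) {
        int effectiveLength = Math.max(numTerms, plateau);
        return boost / Math.sqrt(effectiveLength);
    }

    public static void main(String[] args) {
        int shortDoc = 1; // assumed token count for "Hello"
        int longDoc = 6;  // assumed token count for "Hello world , one, two, three, four"

        System.out.println("default norm, short doc: " + defaultNorm(shortDoc, 1.0));
        System.out.println("default norm, long doc:  " + defaultNorm(longDoc, 1.0));
        System.out.println("flat norm, short doc:    " + flatNorm(shortDoc, 1.0, 6));
        System.out.println("flat norm, long doc:     " + flatNorm(longDoc, 1.0, 6));
    }
}
```

With the default formula the one-term document gets the larger norm, which is the effect described in the question; with the flat variant both documents get the same norm, leaving the other scoring factors to decide. Real Lucene also encodes the norm into a single byte at index time, so stored values are coarser than these doubles.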

--
Ian.

On Fri, Apr 27, 2012 at 1:18 PM, Maxim Terletsky <sx...@yahoo.com> wrote:
> Hi guys,
> I have a field, Analyzed, Store.NO.
> Suppose one Document with value inside this field "Hello".
> Another one "Hello world , one, two, three, four".
> Since the field is Analyzed (with norms), the extra terms "one, two, three, four" will definitely affect the resulting score when we search with the query "Hello world". Does anyone know whether I can control some coefficient that sets the weight of exact matching vs. the number of words (the norm factor)?
> Thanks,
>
>
> Maxim


Similarity coefficient for more exact matching

Posted by Maxim Terletsky <sx...@yahoo.com>.

Hi guys,
I have a field, Analyzed, Store.NO.
Suppose one Document with value inside this field "Hello".
Another one "Hello world , one, two, three, four".
Since the field is Analyzed (with norms), the extra terms "one, two, three, four" will definitely affect the resulting score when we search with the query "Hello world". Does anyone know whether I can control some coefficient that sets the weight of exact matching vs. the number of words (the norm factor)?
Thanks,
 

Maxim

Re: Reverse keyword search?

Posted by Ahmet Arslan <io...@yahoo.com>.

> This appears to be somewhat the reverse of the typical
> Lucene use case -- rather than having a set of say 1000
> articles which are indexed, then issuing a query using a few
> keywords to search on those articles, I have a set of say
> 1000 keywords, and a single article, and I want to determine
> which keyword best fits the article's summary.  How to
> best use Lucene to handle this?

I have not used it myself, but MemoryIndex seems to be what you are after.

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html
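
The flow this points at is: index the single summary in memory, then run every keyword against it and keep the best-scoring category. As a rough, dependency-free sketch of that flow, the code below uses plain Java collections as a stand-in for MemoryIndex and a simple hit count as a stand-in for Lucene scoring. The class and method names (KeywordCategorySketch, tokenize, bestCategory) are made up for illustration, and keywords are assumed to be single lowercase tokens.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordCategorySketch {

    // Lowercase the text and split on anything that is not a letter or digit,
    // which drops commas, hash signs and similar "dirty" characters.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts, per category, how many of its keywords occur in the summary and
    // returns the category with the most hits, or null if nothing matched.
    // Ties are broken arbitrarily (HashMap iteration order).
    static String bestCategory(String summary, Map<String, String> keywordToCategory) {
        Set<String> tokens = tokenize(summary);
        Map<String, Integer> hits = new HashMap<>();
        for (Map.Entry<String, String> e : keywordToCategory.entrySet()) {
            if (tokens.contains(e.getKey())) {
                hits.merge(e.getValue(), 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> keywords = new HashMap<>();
        keywords.put("spain", "world");
        keywords.put("volkswagen", "automotive");
        keywords.put("golf", "automotive");

        String summary = "Volkswagen unveils new #Golf, in Spain; Volkswagen sales up";
        System.out.println(bestCategory(summary, keywords));
    }
}
```

This sketch does no stemming or synonym handling, so "spanish" would not match "spain"; in a Lucene-based version that would be the job of the analyzer chain, and relevance scores would replace the raw hit count.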
