Posted to java-user@lucene.apache.org by Ian Holsman <li...@holsman.net> on 2008/11/23 04:57:28 UTC

[ot] a reverse lucene

Hi. apologies for the off-topic question.

I was wondering if anyone knew of an open source solution (or a pointer 
to the algorithms)
that does the reverse of Lucene.
By that I mean storing a whole lot of queries and running them against a 
document to see which queries match it (with a score, etc.).

The use case I have in mind is a news article arriving, with several people 
having written queries to get alerted if it matches a certain condition.


Regards
Ian

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: [ot] a reverse lucene

Posted by David Sheldon <da...@earth.li>.

On Sun, Nov 23, 2008 at 02:57:28PM +1100, Ian Holsman wrote:
> I can see the case for this would be a news-article and several people 
> writing queries to get alerted if it matched a certain condition.

I haven't tried this, but if you have lots of queries and few documents
then consider using lucene, but reconsidering how you design your
documents.

Turn the "queries" into documents in the index, and turn the "document"
into a query.

For something like Google Alerts, you can have a document which is
   match: keyword

Then the "document" becomes a boolean query over the words in it:
   match:foo OR match:bar

Obviously good choices of analysers and simplification of the queries
that you allow will make this better.
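
A minimal sketch of this inversion, in plain Python rather than Lucene,
assuming single-keyword alerts and a naive whitespace tokenizer. The names
AlertIndex, add_alert and match are made up for illustration and are not
part of any library:

```python
from collections import defaultdict


class AlertIndex:
    """Toy inverted index: keyword -> ids of the stored alerts that want it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_alert(self, alert_id, keyword):
        # Each stored "query" becomes a tiny document with one field, match:keyword.
        self._postings[keyword.lower()].add(alert_id)

    def match(self, text):
        # The incoming document becomes an OR over its distinct words:
        # match:foo OR match:bar OR ...
        hits = set()
        for word in set(text.lower().split()):
            hits |= self._postings.get(word, set())
        return hits


index = AlertIndex()
index.add_alert("alice", "lucene")
index.add_alert("bob", "solr")
index.add_alert("carol", "hadoop")

print(sorted(index.match("Lucene and Solr release new versions")))
```

The cost of a lookup here depends on the number of distinct words in the
document, not on the number of stored alerts, which is the point of turning
the problem around.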

If you have fewer than 10k stored queries then running all the
queries against a document in memory will probably be faster. That
depends on your incoming document rate, though you can batch documents
up and run the queries every 15 minutes or so if you don't mind the lag
and you're getting lots of incoming documents.

Just an idea.

David
-- 
About the use of language: it is impossible to sharpen a pencil with a blunt
ax.  It is equally vain to try to do it with ten blunt axes instead.
		-- Edsger Dijkstra

Re: [ot] a reverse lucene

Posted by Grant Ingersoll <gs...@apache.org>.

The "formal" name for this stuff is "document filtering" or just
"filtering".  A good place to start is TREC, which had a
filtering task for a number of years: http://trec.nist.gov/tracks.html

At any rate, one approach is to store your queries as Lucene
documents, albeit short ones.  Then, as others have said, you index
new, incoming docs into a MemoryIndex.  From that, you can extract
the key terms, which can then be used to come up with a Query to be run
against your "query" index.  The MoreLikeThis functionality should
help in determining the important terms.  Then you need to decide how
to handle the results.  You probably don't want to route
the document to each and every query that matches.
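
To illustrate the "extract the key terms" step, here is a rough sketch in
plain Python. It ranks a document's words by a simple tf * idf weight
against background document frequencies. This is only an approximation of
the idea; it is not how MoreLikeThis is implemented, and the function name
key_terms, the background counts and the top_n cutoff are all assumptions
made for the example:

```python
import math
from collections import Counter


def key_terms(text, doc_freq, num_docs, top_n=3):
    """Return the top_n terms of text ranked by a simple tf * idf weight."""
    counts = Counter(text.lower().split())
    scored = []
    for term, tf in counts.items():
        df = doc_freq.get(term, 0)
        # Smoothed idf: rare background terms get a larger weight.
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    # Highest weight first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]


# Hypothetical background statistics: documents containing each term, out of 1000.
background = {"the": 990, "a": 980, "in": 950,
              "lucene": 12, "index": 40, "release": 60}

article = "the lucene index in a lucene release"
terms = key_terms(article, background, num_docs=1000)
print(terms)
```

The selected terms would then be OR-ed together into the query that is run
against the index of stored queries.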

-Grant


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: [ot] a reverse lucene

Posted by Cool The Breezer <te...@yahoo.com>.

Maybe an RSS feed is a solution. Just provide an RSS feed of the search results for each query, and people subscribing to these feeds would get notifications at regular intervals. They would need to install RSS clients, which can run the queries at regular intervals.



Re: [ot] a reverse lucene

Posted by Ian Holsman <li...@holsman.net>.

Anshum wrote:
> Hi Ian,
> I guess that could be achieved if you write code to read the queries and
> query for each document (using lucene).
> Assuming that I got the question right! :)
>
>   

Yes, that is one way, but probably not the most efficient one.

Think of something like http://www.google.com/alerts, but instead of 
running once a day, it would run each time it sees a document 
(as-it-happens mode),
and you would have a couple of million queries to run through.

regards
Ian

Re: [ot] a reverse lucene

Posted by Anshum <an...@gmail.com>.

Hi Ian,
I guess that could be achieved if you write code to read the queries and
query for each document (using lucene).
Assuming that I got the question right! :)

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


Re: [ot] a reverse lucene

Posted by markharw00d <ma...@yahoo.co.uk>.

If you index the queries, consider also that they can potentially be 
indexed in an optimised form.

For example, take a phrase query for "Alonso Smith". You need only index 
one of these terms, because an incoming document must contain both terms to be 
considered a match. If you chose to index this query on the rare term 
"Alonso" you would get far fewer requests to run this query than if you 
chose to index the comparatively more common "Smith". Basically, any 
query with mandatory terms can be "index optimised" to record only the 
rarest mandatory term (rarity typically being measured by a 
look-up on some background index).
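
A small sketch of that optimisation in plain Python (not Lucene code),
assuming every stored query is a list of mandatory terms and that
background document frequencies are available. The function names and the
frequency numbers are invented for the example. Note that it only checks
that all terms are present, so a real phrase query would still need a
positional check on the candidates:

```python
from collections import defaultdict


def build_query_index(queries, doc_freq):
    """Index each all-terms-required query under its rarest term only."""
    index = defaultdict(list)
    for query_id, terms in queries.items():
        # Unknown terms get frequency 0, i.e. they count as rarest.
        rarest = min(terms, key=lambda t: (doc_freq.get(t, 0), t))
        index[rarest].append(query_id)
    return index


def matching_queries(text, queries, index):
    """Look up candidates by the document's words, then verify every term."""
    words = set(text.lower().split())
    matches = []
    for word in words:
        for query_id in index.get(word, []):
            if set(queries[query_id]) <= words:
                matches.append(query_id)
    return sorted(matches)


queries = {"q1": ["alonso", "smith"], "q2": ["smith", "london"]}
doc_freq = {"alonso": 15, "smith": 9000, "london": 4000}  # hypothetical counts

index = build_query_index(queries, doc_freq)
print(dict(index))
print(matching_queries("alonso smith wins race", queries, index))
```

A document that mentions only "smith" never triggers either query here,
because neither of them is indexed under that term.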

Cheers,
Mark


Re: [ot] a reverse lucene

Posted by Ian Holsman <li...@holsman.net>.

Thanks for all the suggestions guys..
This is great!



Re: [ot] a reverse lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.

Ian Holsman wrote:
> Hi. apologies for the off-topic question.
> 
> I was wondering if anyone knew of an open source solution (or a pointer 
> to the algorithms)
> that does the reverse of lucene.
> By that I mean store a whole lot of queries, and run them against a 
> document to see which queries match it. (with a score etc)
> 
> I can see the case for this would be a news-article and several people 
> writing queries to get alerted if it matched a certain condition.


http://www.seas.upenn.edu/~svilen/publications/subscribe.pdf



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [ot] a reverse lucene

Posted by Ian Holsman <li...@holsman.net>.

Thanks Erik.
I'll start looking at that.

regards
Ian

Re: [ot] a reverse lucene

Posted by jm <jm...@gmail.com>.

I am using MemoryIndex in a similar scenario. I don't have as many
queries, though (fewer than 100), but several 'articles' arrive per
second.

Works nicely.


Re: [ot] a reverse lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Nov 22, 2008, at 10:57 PM, Ian Holsman wrote:
> Hi. apologies for the off-topic question.

Not off-topic at all!

> I was wondering if anyone knew of an open source solution (or a  
> pointer to the algorithms)
> that does the reverse of lucene.
> By that I mean store a whole lot of queries, and run them against a  
> document to see which queries match it. (with a score etc)
>
> I can see the case for this would be a news-article and several  
> people writing queries to get alerted if it matched a certain  
> condition.

This use-case was the reason MemoryIndex was created.  It's a fast
single-document index: incoming documents can be sent to it in
parallel with the main index, and you can slam a bunch of queries at it.
There's also InstantiatedIndex to compare to, as it can handle
multiple documents.
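
To make the shape of that approach concrete, here is a sketch in plain
Python that does not use Lucene or the MemoryIndex API at all. It builds a
throwaway single-document "index" (just a set of words) and evaluates
every stored query against it. The query format, a pair of required terms
and any-of terms, and all names are assumptions made for the example:

```python
def index_document(text):
    """Build a throwaway single-document 'index': here just the set of its words."""
    return set(text.lower().split())


def run_queries(doc_terms, queries):
    """Evaluate every stored query against the one document; return ids that match."""
    matched = []
    for query_id, (required, optional) in queries.items():
        has_required = all(term in doc_terms for term in required)
        # An empty any-of list places no extra constraint on the document.
        has_optional = not optional or any(term in doc_terms for term in optional)
        if has_required and has_optional:
            matched.append(query_id)
    return matched


# Hypothetical stored queries: id -> (required terms, any-of terms).
stored = {
    "lucene-news": (["lucene"], []),
    "search-releases": (["release"], ["lucene", "solr"]),
    "hadoop-jobs": (["hadoop", "jobs"], []),
}

doc = index_document("Apache Lucene announces a new release")
print(run_queries(doc, stored))
```

The work per document grows linearly with the number of stored queries, so
this style is likely to suit a modest number of queries better than
millions of them.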

	Erik
