Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/10/22 11:11:25 UTC

Bet you didn't know Lucene can...

Hi All,

I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396).  It's based on my observation that, over the years, a number of us in the community have done some pretty cool things using Lucene that don't fit under the core premise of full-text search.  I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your stories of ways you've (ab)used Lucene and Solr, both to extend the conversation beyond the conference and to see if I can inject more ideas beyond the ones I have.  I don't need deep technical details, just the high-level use case and the basic insight that led you to believe Lucene could solve the problem.

Thanks in advance,
Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com



Re: Bet you didn't know Lucene can...

Posted by Erik Hatcher <er...@gmail.com>.
At the group where I worked at UVa once upon a time, a coworker built Juxta, this way cool tool to diff multiple versions of a document visually with heat maps and "difference"-o-meters, and it leverages Lucene analyzers to extract words and positions and such.

You can find it here: http://www.juxtasoftware.org/

	Erik





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Bet you didn't know Lucene can...

Posted by Dawid Weiss <da...@gmail.com>.
Yes, sure, it is interesting. GitHub would probably be a good spot?

Dawid

On Wed, Oct 26, 2011 at 7:02 PM, mark harwood <ma...@yahoo.co.uk> wrote:
>>>  > Avg lookup time slightly less than a HashSet? Interesting.
>
> Scratch that. A new dataset and revised code shows HashSets out in front (but still not a realistic option for very large sets) : http://goo.gl/Lb4J1
>
> In this benchmark I removed the code common to all previous tests, which first retrieved a random key from a test-query Lucene index and then looked it up in the target Set (a choice of database, HashSet or a different Lucene index).
>
> I assumed that, being common to all tests, this initial Lucene-based fetch would not bias the results, but it did. Now the tests first load a random sample of 100k keys from a flat file and *then* start the timer on the look-ups.
> I'm also using public domain Wikipedia data so can release the code and data somewhere if that's of interest.
>
> Cheers
> Mark



Re: Bet you didn't know Lucene can...

Posted by mark harwood <ma...@yahoo.co.uk>.
>>  > Avg lookup time slightly less than a HashSet? Interesting.

Scratch that. A new dataset and revised code shows HashSets out in front (but still not a realistic option for very large sets) : http://goo.gl/Lb4J1

In this benchmark I removed the code common to all previous tests, which first retrieved a random key from a test-query Lucene index and then looked it up in the target Set (a choice of database, HashSet or a different Lucene index).

I assumed that, being common to all tests, this initial Lucene-based fetch would not bias the results, but it did. Now the tests first load a random sample of 100k keys from a flat file and *then* start the timer on the look-ups.
I'm also using public domain Wikipedia data so can release the code and data somewhere if that's of interest.
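
The timing discipline described above (pick the probe keys first, then time only the look-ups) might be sketched as follows. This is a minimal illustration, not the benchmark code from this thread: the class name, the synthetic "key-N" keys and the key counts are all assumptions, and only the in-memory HashSet case is shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: keeps probe selection outside the timed region.
final class SetLookupBenchmark {

    /** Builds synthetic keys "key-0" .. "key-(n-1)" as stand-ins for real data. */
    static Set<String> buildKeys(int n) {
        Set<String> keys = new HashSet<>(n * 2);
        for (int i = 0; i < n; i++) {
            keys.add("key-" + i);
        }
        return keys;
    }

    /** Pre-selects the random probe keys so sampling cost is never timed. */
    static List<String> sampleProbes(int keyCount, int probes, long seed) {
        Random rnd = new Random(seed);
        List<String> sample = new ArrayList<>(probes);
        for (int i = 0; i < probes; i++) {
            sample.add("key-" + rnd.nextInt(keyCount));
        }
        return sample;
    }

    /** The only part that should sit between the two timer reads. */
    static int countHits(Set<String> set, List<String> probes) {
        int hits = 0;
        for (String p : probes) {
            if (set.contains(p)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keys = buildKeys(100_000);
        List<String> probes = sampleProbes(100_000, 10_000, 42L);
        long start = System.nanoTime();          // timer starts after sampling
        int hits = countHits(keys, probes);
        long elapsed = System.nanoTime() - start;
        System.out.printf("hits=%d avg=%.1f ns/lookup%n",
                hits, elapsed / (double) probes.size());
    }
}
```

Swapping the Set for a database or index-backed look-up would leave sampleProbes untouched, which is the point: whatever selects the keys cannot bias the comparison.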

Cheers
Mark



----- Original Message -----
From: Dawid Weiss <da...@gmail.com>
To: java-user@lucene.apache.org
Cc: 
Sent: Tuesday, 25 October 2011, 23:17
Subject: Re: Bet you didn't know Lucene can...

> Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.

I don't say the benchmark was wrong or anything, but this is
surprising. I mean, the default HashSet impl. is a bucketed
linked-list implementation. It made me wonder how the data was
distributed. Even with OS file caching the in-memory data structure
shouldn't fall short, at least intuitively.

> I can make the code available but the data wouldn't be possible.
> The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with.

If you find a spare cycle, it'd be great, thanks!

Dawid



Re: Bet you didn't know Lucene can...

Posted by Dawid Weiss <da...@gmail.com>.
> Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.

I don't say the benchmark was wrong or anything, but this is
surprising. I mean, the default HashSet impl. is a bucketed
linked-list implementation. It made me wonder how the data was
distributed. Even with OS file caching the in-memory data structure
shouldn't fall short, at least intuitively.

> I can make the code available but the data wouldn't be possible.
> The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with.

If you find a spare cycle, it'd be great, thanks!

Dawid



Re: Bet you didn't know Lucene can...

Posted by Mark Harwood <ma...@yahoo.co.uk>.
> Avg lookup time slightly less than a HashSet? Interesting.

Yep, HashSet comparison was a surprise to me too. I threw it in as a datapoint for what I thought would be the fastest option on the example dataset but clearly not a long-term answer to my problem as it costs so much in RAM. 
Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.

> Is the code
> to these benchmarks available somewhere?


I can make the code available but the data wouldn't be possible.
The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with. 

Cheers
Mark

On 25 Oct 2011, at 22:47, Dawid Weiss wrote:

> Avg lookup time slightly less than a HashSet? Interesting. Is the code
> to these benchmarks available somewhere?
> 
> Dawid




Re: Bet you didn't know Lucene can...

Posted by Dawid Weiss <da...@gmail.com>.
Avg lookup time slightly less than a HashSet? Interesting. Is the code
to these benchmarks available somewhere?

Dawid

On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Oct 25, 2011, at 11:26 AM, mark harwood wrote:
>
>>>> using Lucene that don't fit under the core premise of full text search
>>
>>  I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability:
>>
>> I needed a fast, scalable and persistent "Set" implementation to maintain a large cold-list (millions of string-based keys).
>> I benchmarked various implementations using a set of ~6 million keys with 10,000 random key lookups.
>> When it comes to RAM use, retrieval times and start-up costs Lucene stands up very well against equivalent embedded databases for this task:
>>
>> * Benchmarks for times to initially open the set when stored on disk:  http://goo.gl/dJL3g
>> * Benchmarks for Avg key lookup time once opened: http://goo.gl/SG79N
>> * Stats for RAM use after 10,000 lookups: http://goo.gl/MyJDn
>
> Those charts are beautiful.  I have Lucene/Solr down as an excellent key-value store (I've seen this done many times) and these charts further cement it.



Re: Bet you didn't know Lucene can...

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 25, 2011, at 11:26 AM, mark harwood wrote:

>>> using Lucene that don't fit under the core premise of full text search
> 
>  I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability:
> 
> I needed a fast, scalable and persistent "Set" implementation to maintain a large cold-list (millions of string-based keys).
> I benchmarked various implementations using a set of ~6 million keys with 10,000 random key lookups.
> When it comes to RAM use, retrieval times and start-up costs Lucene stands up very well against equivalent embedded databases for this task:
> 
> * Benchmarks for times to initially open the set when stored on disk:  http://goo.gl/dJL3g
> * Benchmarks for Avg key lookup time once opened: http://goo.gl/SG79N
> * Stats for RAM use after 10,000 lookups: http://goo.gl/MyJDn

Those charts are beautiful.  I have Lucene/Solr down as an excellent key-value store (I've seen this done many times) and these charts further cement it.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: Bet you didn't know Lucene can...

Posted by mark harwood <ma...@yahoo.co.uk>.
>>using Lucene that don't fit under the core premise of full text search

 I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability:

I needed a fast, scalable and persistent "Set" implementation to maintain a large cold-list (millions of string-based keys).
I benchmarked various implementations using a set of ~6 million keys with 10,000 random key lookups.
When it comes to RAM use, retrieval times and start-up costs Lucene stands up very well against equivalent embedded databases for this task:

* Benchmarks for times to initially open the set when stored on disk:  http://goo.gl/dJL3g
* Benchmarks for Avg key lookup time once opened: http://goo.gl/SG79N
* Stats for RAM use after 10,000 lookups: http://goo.gl/MyJDn

I don't doubt all of these implementations could be tweaked (e.g. optimizing the Lucene index, various DB-specific settings) but I tried to use sensible defaults to make the tests fair e.g. use of prepared statements, indexes, minimal data retrieved.
Speeds varied with each run of the random lookup test due to OS-level caching effects so the best times were recorded in each case.
The HashSet tests are loaded entirely from file (hence the long start-up time) and are not a scalable solution because of RAM costs.
MySQL requires an inter-process call, as it was not embedded, but even using a remote Lucene call I get significantly better performance (avg 0.5 ms lookup vs. 10 ms for MySQL).
 

Cheers
Mark





Re: Bet you didn't know Lucene can...

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 31/10/2011 21:42, Petite Abeille wrote:
>
> On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:
>
>> similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exact same) hashes, which in this case meant to find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.
>>
>> The solution is described in SOLR-1918 - Bit-wise scoring field type.
>
> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
>
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different 
application-specific hash.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Bet you didn't know Lucene can...

Posted by Petite Abeille <pe...@me.com>.
On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exact same) hashes, which in this case meant to find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.
> 
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/
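
For readers unfamiliar with the idea, a simhash-style fingerprint along the lines of the paper above can be sketched roughly like this. It is a simplified illustration, not code from either link: whitespace tokens as features, equal weights, and 64-bit FNV-1a as the per-token hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of a 64-bit simhash-style fingerprint.
final class SimHashSketch {

    /** 64-bit FNV-1a over the UTF-8 bytes of a token (chosen only for simplicity). */
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;            // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                 // FNV prime
        }
        return h;
    }

    /** Each token votes +1/-1 on every bit; the sign of each tally sets the bit. */
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }
}
```

Because each bit is decided by a majority vote over all tokens, changing a few tokens tends to flip only a few bits, which is what makes the fingerprints comparable by bit distance.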




Re: Bet you didn't know Lucene can...

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 22/10/2011 11:11, Grant Ingersoll wrote:
> Hi All,
>
> I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396).  It's based on my observation, that over the years, a number of us in the community have done some pretty cool things using Lucene that don't fit under the core premise of full text search.  I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your stories of ways you've (ab)used Lucene and Solr to see if we couldn't extend the conversation to a bit more than the conference and also see if I can't inject more ideas beyond the ones I have.  I don't need deep technical details, but just high level use case and the basic insight that led you to believe Lucene could solve the problem.

Better late than never ... :) I briefly mentioned this use case to you 
at Eurocon, but here it is for the record.

I used Lucene in a duplicate-detection scenario where instead of 
documents individual sentences would be indexed (with a fuzz). A 
similarity-preserving hash function was calculated on each sentence, and 
the hash was added as a field. The property of the hash was that similar 
documents (sentences) would produce a similar hash, with only some 
bit-level perturbation. The challenge was to find a ranked list of 
possible duplicates with similar (not exactly the same) hashes, which in 
this case meant finding a ranked list of documents that have the 
smallest bit-level distance in their hashes from the query hash.

The solution is described in SOLR-1918 - Bit-wise scoring field type.
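
To make "smallest bit-level distance" concrete, here is a minimal sketch of ranking by Hamming distance over 64-bit hashes. It is a brute-force scan written only for illustration and is not the SOLR-1918 implementation; the class name, method names and the maxDistance cut-off are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank candidate hashes by bit distance from a query hash.
final class HammingRank {

    /** Number of bit positions in which the two hashes differ. */
    static int bitDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Returns the indexes of docHashes whose distance from queryHash is at most
     * maxDistance, closest first. Ties keep their original index order.
     */
    static List<Integer> rank(long[] docHashes, long queryHash, int maxDistance) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docHashes.length; i++) {
            if (bitDistance(docHashes[i], queryHash) <= maxDistance) {
                ids.add(i);
            }
        }
        ids.sort(Comparator.comparingInt(
                (Integer i) -> bitDistance(docHashes[i], queryHash)));
        return ids;
    }
}
```

For example, with hashes {0b1111, 0b1110, 0b0000, 0b1011}, query 0b1111 and maxDistance 2, the distances are 0, 1, 4 and 1, so the third entry is dropped and the rest come back in index order 0, 1, 3.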

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Bet you didn't know Lucene can...

Posted by Wouter Heijke <wh...@xs4all.nl>.
Hi Grant,

Here are two cases from work I've done that I can think of:

-use Lucene to match products in a database with eBay auctions; the title
of the auction is used as the query to Lucene.

-use a servlet filter and Lucene to map well-formed URLs into a website
to its individual (product) pages. A deeper URL results in a Lucene
BooleanQuery with more clauses.
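
One way the second case could look, sketched without Lucene on the classpath: each path segment becomes one required clause, written in Lucene's classic query-parser syntax where a leading + marks a required term. The field names level0, level1, ... and the class itself are invented for this illustration; the message does not say how the real mapping was done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: one required clause per URL path segment.
final class UrlToClauses {

    /** "/shoes/running" -> ["+level0:shoes", "+level1:running"]. */
    static List<String> clausesFor(String path) {
        List<String> clauses = new ArrayList<>();
        int depth = 0;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;                        // skip leading or doubled slashes
            }
            clauses.add("+level" + depth + ":" + segment.toLowerCase(Locale.ROOT));
            depth++;
        }
        return clauses;
    }
}
```

A servlet filter would then turn each clause into a term query and add it to a BooleanQuery as a required clause, so a deeper URL narrows the match further.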

Hope this is enough (ab)use...

Wouter







Re: Bet you didn't know Lucene can...

Posted by Dawid Weiss <da...@gmail.com>.
Hi Grant,

In Carrot2 (and Carrot Search's commercial products) we're not using
Lucene as an indexing/ search service directly, but we are re-using a
lot of internal infrastructure (like analyzers, ported snowball
stemmers and other segmentation stuff). We also plan on using the new
language identifiers, automata, tests framework...

I guess this shows that Lucene is a lot _more_ than just a document
retrieval library. There are nuggets in the codebase that one can
utilize on their own, without the rest of Lucene.

If you need details, let me know privately; I'll scan the sources and
provide concrete examples.

Dawid

On Sun, Oct 23, 2011 at 2:33 AM, Shashi Kant <sk...@sloan.mit.edu> wrote:
> Using Lucene as a recommendation engine.



Re: Bet you didn't know Lucene can...

Posted by Shashi Kant <sk...@sloan.mit.edu>.
Using Lucene as a recommendation engine.

On Sat, Oct 22, 2011 at 6:33 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote:
>
>> Hi Grant,
>>
>> Not sure if this qualifies as a "bet you didn't know", but one could use
>> Lucene term vectors to construct document vectors for similarity,
>> clustering and classification tasks. I found this out recently (although
>> I am probably not the first one), and I think this could be quite
>> useful.
>
> Yep, had these on my list!


Re: Bet you didn't know Lucene can...

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote:

> Hi Grant,
> 
> Not sure if this qualifies as a "bet you didn't know", but one could use
> Lucene term vectors to construct document vectors for similarity,
> clustering and classification tasks. I found this out recently (although
> I am probably not the first one), and I think this could be quite
> useful.

Yep, had these on my list!



Re: Bet you didn't know Lucene can...

Posted by Sujit Pal <su...@comcast.net>.
Hi Grant,

Not sure if this qualifies as a "bet you didn't know", but one could use
Lucene term vectors to construct document vectors for similarity,
clustering and classification tasks. I found this out recently (although
I am probably not the first one), and I think this could be quite
useful.
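
A minimal sketch of the idea, not taken from the original message: each
document becomes a sparse term-frequency vector, and two documents are
compared by cosine similarity. In Lucene the per-document counts would
come from a field's stored term vector; to keep the example free of
dependencies, plain whitespace tokenization stands in for that step. The
class and method names (TermVectorSimilarity, termFrequencies, cosine)
are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TermVectorSimilarity {

    // Build a sparse term -> frequency map for one document.
    // Assumption: whitespace splitting replaces a real Lucene analyzer
    // and stored term vector, purely for illustration.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product over shared terms divided by the
    // product of the vector lengths; 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("lucene term vectors for clustering");
        Map<String, Integer> d2 = termFrequencies("term vectors for classification");
        System.out.println(cosine(d1, d2));
    }
}
```

The same vectors could feed a clustering or classification step; raw
counts are used here, and a weighting such as TF-IDF would be a natural
refinement.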

-sujit





Re: Bet you didn't know Lucene can...

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Grant,

For years, the ActiveMath learning environment has been using Lucene as its storage engine.
At the time (~2004), it was by far the best storage engine available in a pure-Java world.
It is still perfect in terms of performance.
We had an issue with some versions (~1.x-2.0) in which stored fields were not lazily loaded, so we still do not store the big fragments there. However, for small fragments it's very efficient (~5000 queries a second).

The objects stored are fragments of XML documents (the format is called OMDoc; they're mostly hand-written).

Tell me if you need more details; I am sure the pure storage option is something very common.

paul



