Posted to dev@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2012/06/19 18:42:11 UTC

Continuous stream indexing and time-based segment management

There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed.

In these scenarios I imagine the following facilities would be useful:

a) A MergePolicy that organizes content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
b) The ability to drop entire segments, e.g. the day-level segment from exactly a week ago
c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics

I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure whether there is currently a way to simply drop entire segments.
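
As a rough illustration of a), the roll-up decision could be modelled along the lines below. This is only a sketch in plain Java and does not use Lucene's MergePolicy API: TimeSegment and TimeTierPlanner are hypothetical names, segment times are assumed to be whole minutes since some epoch, and the tier widths are simply the 5 min -> 10 min -> 1 hour -> 1 day example from above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a segment covering [startMin, startMin + spanMin). Not a Lucene class. */
final class TimeSegment {
    final long startMin;
    final long spanMin;

    TimeSegment(long startMin, long spanMin) {
        this.startMin = startMin;
        this.spanMin = spanMin;
    }

    long endMin() {
        return startMin + spanMin;
    }
}

/** Hypothetical planner: picks segments that can be rolled up into the next tier. */
final class TimeTierPlanner {
    // Tier widths in minutes, taken from the example: 5 min -> 10 min -> 1 hour -> 1 day.
    static final long[] TIERS = {5, 10, 60, 1440};

    /**
     * Returns the segments of one tier that fall inside a single, already closed
     * window of the next tier up, or an empty list if nothing is ready yet.
     * A lone segment in a closed window is returned on its own (it is promoted).
     */
    static List<TimeSegment> nextMerge(List<TimeSegment> segments, long nowMin) {
        for (int t = 0; t + 1 < TIERS.length; t++) {
            long child = TIERS[t];
            long parent = TIERS[t + 1];
            for (TimeSegment s : segments) {
                if (s.spanMin != child) {
                    continue;
                }
                // Align the segment to the parent-tier window that contains it.
                long windowStart = Math.floorDiv(s.startMin, parent) * parent;
                long windowEnd = windowStart + parent;
                if (windowEnd > nowMin) {
                    continue; // window may still receive documents
                }
                List<TimeSegment> group = new ArrayList<>();
                for (TimeSegment c : segments) {
                    if (c.spanMin == child && c.startMin >= windowStart && c.endMin() <= windowEnd) {
                        group.add(c);
                    }
                }
                return group;
            }
        }
        return new ArrayList<>();
    }
}
```

The caller would replace the returned group with one segment of the parent width. Translating such a group into an actual merge request depends on the MergePolicy contract of the Lucene version in use.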

Anyone else had thoughts in this area?

Cheers
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Continuous stream indexing and time-based segment management

Posted by mark harwood <ma...@yahoo.co.uk>.
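> You can do that by subclassing IW and calling some package-private APIs /
> members. We can certainly make that easier, but I personally don't want
> to open this up as a public API. I can certainly imagine having a
> protected API that allows dropping an entire segment.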


To date I have used separate physical indexes, combining them with a MultiReader and then dropping the outdated indexes.
At least this has the benefit that a custom MergePolicy is not required to keep content from the different dates segregated.
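
The bookkeeping behind the separate-index approach could look roughly like this. It is a plain-Java sketch under stated assumptions, not Lucene code: DailyIndexRegistry and the directory names are hypothetical, one physical index per calendar day is assumed, and in practice each live directory would be opened as a sub-reader and combined with a MultiReader, while dropped directories would be deleted from disk.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical bookkeeping for one physical index per day; names are illustrative only. */
final class DailyIndexRegistry {
    // Sorted by day so that expired entries form a prefix of the map.
    private final TreeMap<LocalDate, String> dirsByDay = new TreeMap<>();

    /** Registers the directory name holding the index for the given day. */
    void register(LocalDate day, String dirName) {
        dirsByDay.put(day, dirName);
    }

    /** Removes and returns the directories for days strictly before (today - keepDays). */
    List<String> expire(LocalDate today, int keepDays) {
        LocalDate cutoff = today.minusDays(keepDays);
        List<String> dropped = new ArrayList<>(dirsByDay.headMap(cutoff, false).values());
        dirsByDay.headMap(cutoff, false).clear(); // the view is backed by the map
        return dropped;
    }

    /** Directories still inside the window, oldest first; each would back one sub-reader. */
    List<String> live() {
        return new ArrayList<>(dirsByDay.values());
    }
}
```

Because each day lives in its own index, no merge ever mixes dates, which is the benefit mentioned above.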

Where I saw potential was in looking at how stream processing technologies such as S4 or Esper count things in time windows.
It struck me that careful organisation of Lucene segments along time units could provide an efficient means of accessing and comparing counts of many things over time.
It looked like the "Hello World" example in S4 for counting top Twitter topics instantiated a Java object per unique topic String, which was then responsible for maintaining the counts. That seems a fairly inefficient way of modelling things.
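
To make the comparison concrete, here is a minimal sketch of counting per time window and comparing two windows. It is plain Java with hypothetical names (WindowCounts, count, trending) and does not read Lucene term statistics; the add-one smoothing and the ratio threshold are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one count table per time window instead of one object per topic. */
final class WindowCounts {
    /** Counts term occurrences for a single time window. */
    static Map<String, Integer> count(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns, in sorted order, the terms whose smoothed ratio
     * (current + 1) / (previous + 1) is at least minRatio.
     */
    static List<String> trending(Map<String, Integer> previous,
                                 Map<String, Integer> current,
                                 double minRatio) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            int before = previous.getOrDefault(e.getKey(), 0);
            double ratio = (e.getValue() + 1.0) / (before + 1.0);
            if (ratio >= minRatio) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

With time-aligned segments, each window's table would come from the term statistics of the segments covering that window rather than from re-counting raw input.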

>>If you are willing/able to close the IndexWriter, it's easy to drop segments by reading the SegmentInfos, editing, and writing back.

My assumption was that this is ultimately what it comes down to. I just wonder whether this is likely to be a common requirement that deserves a supported API.





Re: Continuous stream indexing and time-based segment management

Posted by Simon Willnauer <si...@googlemail.com>.
On Tue, Jun 19, 2012 at 9:44 PM, Simon Willnauer
<si...@googlemail.com> wrote:
> On Tue, Jun 19, 2012 at 6:42 PM, mark harwood <ma...@yahoo.co.uk> wrote:
>> There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed.
>>
>> In these scenarios I imagine the following facilities would be useful:
>>
>> a) A MergePolicy that organizes content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
>> b) The ability to drop entire segments, e.g. the day-level segment from exactly a week ago
>
> You can do that by subclassing IW and calling some package-private APIs /
> members. We can certainly make that easier, but I personally don't want
> to open this up as a public API. I can certainly imagine having a
> protected API that allows dropping an entire segment.
>
> simon
>
>> c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics
>>
>> I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure whether there is currently a way to simply drop entire segments.
>>
>> Anyone else had thoughts in this area?

I had some ideas to add statistics to DocValues that get created
during index time. You can already do that and expose it via
Attributes. Maybe we can add some API to DocValues that you can hook into,
so that you don't need to write your own DV impl.
>>
>> Cheers
>> Mark


Re: Continuous stream indexing and time-based segment management

Posted by Michael McCandless <lu...@mikemccandless.com>.
If you are willing/able to close the IndexWriter, it's easy to drop
segments by reading the SegmentInfos, editing, and writing back.

Mike McCandless

http://blog.mikemccandless.com



Re: Continuous stream indexing and time-based segment management

Posted by Simon Willnauer <si...@googlemail.com>.
On Tue, Jun 19, 2012 at 6:42 PM, mark harwood <ma...@yahoo.co.uk> wrote:
> There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed.
>
> In these scenarios I imagine the following facilities would be useful:
>
> a) A MergePolicy that organizes content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
> b) The ability to drop entire segments, e.g. the day-level segment from exactly a week ago

You can do that by subclassing IW and calling some package-private APIs /
members. We can certainly make that easier, but I personally don't want
to open this up as a public API. I can certainly imagine having a
protected API that allows dropping an entire segment.

simon

> c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics
>
> I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure whether there is currently a way to simply drop entire segments.
>
> Anyone else had thoughts in this area?
>
> Cheers
> Mark