Posted to dev@samza.apache.org by Benjamin Edwards <ed...@gmail.com> on 2015/02/15 19:41:00 UTC

Truncating rocks db

Hi,

I am trialling Samza for some windowed stream processing. Typically I want
to aggregate a bunch of state over some window of messages, process the
data, then drop the current state. The only way that I can see to do that
at the moment is to delete every key. This seems expensive. Is there no way
to just say I don't care about the old data, gimme a new store?

Ben

Re: Truncating rocks db

Posted by Chris Riccomini <cr...@apache.org>.
Hey Ben,

Cool, please email if anything else comes up. Re: fresh-store, I think it
should be possible to add a .clear() to the KV interface. This would result
in creating a new DB and deleting the old one. Like the RocksDB TTL, it
wouldn't result in any deletes being sent to the changelog, though. If this
sounds useful, definitely open a JIRA for it.
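As an illustrative sketch of the semantics being proposed (this is not Samza code; the class, its fields, and the in-memory "changelog" are hypothetical Python stand-ins), a .clear() that swaps in a fresh DB would look roughly like this:

```python
# Hypothetical stand-in for a changelog-backed KV store; not Samza's
# actual implementation. It shows why clear() is cheap locally but
# leaves the changelog untouched, as described above.

class ChangelogBackedStore:
    def __init__(self):
        self.db = {}          # stands in for the local RocksDB instance
        self.changelog = []   # records what would be sent to Kafka

    def put(self, key, value):
        self.db[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.db.pop(key, None)
        self.changelog.append((key, None))  # tombstone for log compaction

    def clear(self):
        # Drop the whole DB and create a new one: no per-key deletes,
        # so nothing is appended to the changelog. A restore from the
        # changelog would therefore resurrect the "cleared" keys.
        self.db = {}
```

Clearing this way is effectively O(1) on the local store, but a job that restores from the changelog after a failure would see the pre-clear state; that is the trade-off being flagged here.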

Cheers,
Chris

On Tue, Feb 17, 2015 at 12:10 AM, Benjamin Edwards <ed...@gmail.com>
wrote:

> [...]

Re: Truncating rocks db

Posted by Benjamin Edwards <ed...@gmail.com>.
Having followed along with the other thread, I think my initial approach
was flawed. We use Cassandra in prod a ton (the classic Cassandra / Spark
combo) at my job, and we have been running into a few issues with
streaming / local state, etc. Hence my wanting to have a look at Samza.
Very long way round to say that we use TTLs for lots of things! Thanks for
the write-up about the interaction between the db and the changelog. Very
thorough. I might come back with a request about the fresh store feature,
but it definitely needs a bit more baking / experience with Samza.

Ben

On Tue Feb 17 2015 at 01:59:03 Chris Riccomini <cr...@apache.org>
wrote:

> [...]

Re: Truncating rocks db

Posted by Chris Riccomini <cr...@apache.org>.
Hey Ben,

The problem with TTL is that it's handled entirely internally in RocksDB.
There's no way for us to know when a key's been deleted. You can work
around this if you also switch your changelog Kafka topic to time-based
retention (TTL) instead of log compaction; then the two should roughly
match. For example, if you have a 1h TTL in RocksDB and a 1h retention on
your Kafka changelog topic, then the semantics are ROUGHLY equivalent. I
say ROUGHLY because the two are going to be GC'ing expired keys
independently of one another.
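For example, switching a changelog topic to 1h time-based retention might look like this (the topic name is hypothetical, since Samza derives changelog names from the job and store, and the exact CLI flags vary across Kafka versions):

```shell
# Hypothetical topic name; adjust for your job/store. These flags are
# for the 0.8.x-era tooling; newer Kafka versions use kafka-configs.sh.
bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic my-job-my-store-changelog \
  --config cleanup.policy=delete \
  --config retention.ms=3600000   # 1h, to roughly match a 1h RocksDB TTL
```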

Also, during a restart, the TTLs in the RocksDB store will be fully reset.
For example, if a key is at minute 59 of its TTL when you restart the job,
the Kafka topic will restore it when the job starts, and its TTL will
reset back to 0 minutes in the RocksDB store (though, a minute later,
Kafka will drop it from the changelog). If you don't need EXACT TTL
guarantees, then this should be fine. If you do need exact guarantees,
then .all() is probably the way to go.

Cheers,
Chris

On Mon, Feb 16, 2015 at 1:39 PM, Benjamin Edwards <ed...@gmail.com>
wrote:

> [...]

Re: Truncating rocks db

Posted by Benjamin Edwards <ed...@gmail.com>.
Yes, I was using a changelog. You bring up a good point. I think I need to
think harder about what I am trying to do. Maybe deleting all the keys
isn't that bad, especially if I amortise it over the life of the next
period.
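The amortisation idea could be sketched roughly like this (illustrative Python, not Samza API; the class, the batch size, and the dict standing in for the store are all assumptions):

```python
# Sketch of amortised cleanup: instead of deleting the whole previous
# window's state at the boundary, delete a small bounded batch of its
# keys on every message of the next window. A dict stands in for the
# RocksDB-backed store.

class AmortisedWindowStore:
    def __init__(self, batch_size=2):
        self.store = {}          # stands in for the KV store
        self.stale = set()       # previous window's keys, pending deletion
        self.batch_size = batch_size

    def roll_window(self):
        # Window boundary: every key currently in the store becomes stale.
        self.stale = set(self.store)

    def on_message(self, key, value):
        self.store[key] = value
        self.stale.discard(key)  # rewritten keys belong to the new window
        # Amortised cleanup: delete at most batch_size stale keys per
        # message, spreading the old window's teardown over the new one.
        for _ in range(min(self.batch_size, len(self.stale))):
            k = self.stale.pop()
            del self.store[k]
```

Each delete would still hit the changelog as a tombstone, but the cost per message stays bounded instead of spiking at the window boundary.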

It seems like waiting for TTLs is probably the right thing to do ultimately.

Thanks for the timely response!

Ben

On Sun Feb 15 2015 at 23:43:27 Chris Riccomini <cr...@apache.org>
wrote:

> [...]

Re: Truncating rocks db

Posted by Chris Riccomini <cr...@apache.org>.
Hey Benjamin,

You're right. Currently you have to call .all(), and delete everything.
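That approach, sketched in Python against a minimal stand-in for the store interface (Samza's actual KeyValueStore is Java; the class here is hypothetical, with method names mirroring the real interface):

```python
# Minimal stand-in for Samza's KeyValueStore, to illustrate "call
# .all() and delete everything". The real interface is Java; this
# in-memory class is illustrative only.

class InMemoryStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def all(self):
        # Return a snapshot, the way a RocksDB iterator keeps a
        # consistent view even while keys are deleted underneath it.
        return list(self._data.items())

    def delete(self, key):
        self._data.pop(key, None)


def truncate(store):
    # O(n) in the number of keys; with a changelog attached, each
    # delete would also have to emit a tombstone to Kafka, which is
    # where the expense comes from.
    for key, _ in store.all():
        store.delete(key)
```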

RocksDB just committed TTL support for their Java library. This feature
allows data to be expired automatically. Once RocksDB releases their TTL
patch (I believe in a few weeks, according to Igor), we'll update Samza
0.9.0. Our tracking issue is here:

  https://issues.apache.org/jira/browse/SAMZA-537

> Is there no way to just say I don't care about the old data, gimme a new
store?

We don't have this feature right now, but we could add it. This feature is
a bit more complicated when a changelog is attached, since we will have to
execute deletes for every key (we still need to call .all()). Are you
running with a changelog?

Cheers,
Chris

On Sun, Feb 15, 2015 at 10:41 AM, Benjamin Edwards <ed...@gmail.com>
wrote:

> [...]