Posted to dev@apex.apache.org by Chandni Singh <ch...@datatorrent.com> on 2015/12/09 04:18:00 UTC

Re: Fault-tolerant cache backed by a store

Hi,

Added more details about writing and reading to this doc:
https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#



On Sat, Nov 28, 2015 at 12:24 PM, Thomas Weise <th...@datatorrent.com>
wrote:

> Yes. The small file problem can be mitigated by selecting an appropriate
> time span for the bucket. On read, you would check whether the value is
> already expired or not.
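>
> (For illustration only, not the actual implementation: a minimal sketch of
> the expiry check on read, assuming millisecond timestamps and a configured
> expiry period; the class and field names are made up.)
>
>     // Illustrative sketch: a value read from a time-spanned bucket is
>     // returned only if it has not yet passed the configured expiry period.
>     public class ExpiryCheck
>     {
>       private final long expiryMillis;
>
>       public ExpiryCheck(long expiryMillis)
>       {
>         this.expiryMillis = expiryMillis;
>       }
>
>       public boolean isExpired(long entryTimeMillis, long nowMillis)
>       {
>         return entryTimeMillis < nowMillis - expiryMillis;
>       }
>     }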
>
> Thomas
>
> On Sat, Nov 28, 2015 at 11:53 AM, Chandni Singh <ch...@datatorrent.com>
> wrote:
>
> > Let's say a bucket is further divided into time-buckets and each file is a
> > time-bucket.
> > Purging is easier here: we just delete a time-bucket file.
> >
> > Cons of this approach:
> > 1. There can be multiple small files.
> > 2. A tuple with key K1 and time t1 may belong to time bucket TB1, but later
> > a tuple with key K1 and time t2 may belong to another bucket TB2.
> > This makes querying for K1 difficult: we need to scan all the time
> > buckets. Since keys are sorted, this scan can be made more efficient.
> > 3. Multiple entries for a key (NOT in the same time bucket).
> >
> > I think this is what we discussed. While explaining this approach I realized
> > that there can be a large number of small files, which cannot be ignored.
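> >
> > (For illustration only, not the actual implementation: a minimal sketch of
> > how a tuple could be mapped to a time-bucket file and how purge could drop
> > whole time-bucket files; the names, the millisecond time unit, and the
> > file-per-bucket naming are assumptions.)
> >
> >     import java.io.File;
> >
> >     // Illustrative sketch: each time bucket maps to one file; purge deletes
> >     // every time-bucket file older than the expiry period.
> >     public class TimeBucketSketch
> >     {
> >       private final long bucketSpanMillis;
> >
> >       public TimeBucketSketch(long bucketSpanMillis)
> >       {
> >         this.bucketSpanMillis = bucketSpanMillis;
> >       }
> >
> >       public long timeBucketOf(long tupleTimeMillis)
> >       {
> >         return tupleTimeMillis / bucketSpanMillis;
> >       }
> >
> >       public void purge(File bucketDir, long nowMillis, long expiryMillis)
> >       {
> >         long oldestLiveBucket = (nowMillis - expiryMillis) / bucketSpanMillis;
> >         File[] files = bucketDir.listFiles();
> >         if (files == null) {
> >           return;
> >         }
> >         for (File f : files) {
> >           // assumes each time-bucket file is named by its time-bucket id
> >           long timeBucket = Long.parseLong(f.getName());
> >           if (timeBucket < oldestLiveBucket) {
> >             f.delete();
> >           }
> >         }
> >       }
> >     }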
> >
> > Chandni
> >
> > On Sat, Nov 28, 2015 at 10:29 AM, Thomas Weise <th...@datatorrent.com>
> > wrote:
> >
> > > Chandni,
> > >
> > > The approach that we discussed was to organize buckets (not keys) by
> > > time, which allows you to discard the entire file on purge.
> > >
> > > In addition, when you use a block-indexed file format, there is no need
> > > to load the entire file to retrieve a single key.
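> > >
> > > (For illustration only: a minimal sketch of the block-index idea, assuming
> > > the file keeps keys sorted and records the first key and offset of every
> > > block; the names are made up.)
> > >
> > >     import java.util.Map;
> > >     import java.util.TreeMap;
> > >
> > >     // Illustrative sketch: an in-memory index maps the first key of each
> > >     // block to the block's offset in the file, so a lookup reads a single
> > >     // block instead of the whole file.
> > >     public class BlockIndexSketch
> > >     {
> > >       private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();
> > >
> > >       public void addBlock(String firstKeyInBlock, long offset)
> > >       {
> > >         firstKeyToOffset.put(firstKeyInBlock, offset);
> > >       }
> > >
> > >       // Offset of the only block that may contain the key, or -1 if none.
> > >       public long blockOffsetFor(String key)
> > >       {
> > >         Map.Entry<String, Long> e = firstKeyToOffset.floorEntry(key);
> > >         return e == null ? -1 : e.getValue();
> > >       }
> > >     }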
> > >
> > > Thomas
> > >
> > >
> > >
> > > On Sat, Nov 28, 2015 at 9:42 AM, Chandni Singh <chandni@datatorrent.com>
> > > wrote:
> > >
> > > > Another approach is to treat an entry in a bucket data file as:
> > > > <time><key><value>
> > > >
> > > > Time can be extracted from the tuple (or windowId can be used).
> > > > With this approach purging can be simple. For each bucket data file we
> > > > check the last entry (since data is sorted in the bucket data file) and
> > > > delete the file if that entry is expired.
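> > > >
> > > > (For illustration only: a minimal sketch of appending a <time><key><value>
> > > > entry to a bucket data file, assuming keys and values are byte arrays;
> > > > this is not the actual file format.)
> > > >
> > > >     import java.io.DataOutputStream;
> > > >     import java.io.IOException;
> > > >
> > > >     // Illustrative sketch: entries are appended in arrival order as
> > > >     // <time><key length><key bytes><value length><value bytes>.
> > > >     public class EntryWriterSketch
> > > >     {
> > > >       public void append(DataOutputStream out, long time, byte[] key,
> > > >           byte[] value) throws IOException
> > > >       {
> > > >         out.writeLong(time);
> > > >         out.writeInt(key.length);
> > > >         out.write(key);
> > > >         out.writeInt(value.length);
> > > >         out.write(value);
> > > >       }
> > > >     }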
> > > >
> > > > Writing to a bucket data file can be simple. We will not update the
> > > > value of a key, but always add a new entry for the key when its value
> > > > changes.
> > > > Con: multiple entries for a key.
> > > > If the tuples are not out of order, then we may never have to re-write
> > > > a bucket data file that is complete.
> > > >
> > > > Reading is a problem here. The whole bucket needs to be de-serialized
> > > > to find a key, since data is no longer sorted by key on disk. If the
> > > > query for a key specifies a time range, then that read can be optimized.
> > > >
> > > >
> > > > With Tim's approach, purging can be triggered asynchronously at regular
> > > > intervals, and it may even delete a data file which hasn't been updated
> > > > for some time and whose latest entry is expired.
> > > > Even though the writes may not be that complicated with this approach,
> > > > updating values when the length of a value changes (for example, in a
> > > > join operation a value is an appending list) may result in many small
> > > > stray files.
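> > > >
> > > > (For illustration only: a minimal sketch of such an asynchronous purge
> > > > pass, assuming the time of the newest entry in each bucket data file is
> > > > tracked; the names are made up.)
> > > >
> > > >     import java.io.File;
> > > >     import java.util.Map;
> > > >
> > > >     // Illustrative sketch: delete a bucket data file once even its
> > > >     // newest entry has passed the expiry period.
> > > >     public class PurgeSketch
> > > >     {
> > > >       public void purge(Map<File, Long> lastEntryTimeByFile, long nowMillis,
> > > >           long expiryMillis)
> > > >       {
> > > >         for (Map.Entry<File, Long> e : lastEntryTimeByFile.entrySet()) {
> > > >           if (e.getValue() < nowMillis - expiryMillis) {
> > > >             e.getKey().delete();
> > > >           }
> > > >         }
> > > >       }
> > > >     }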
> > > >
> > > > Chandni
> > > >
> > >
> >
>