
Posted to commits@samza.apache.org by "Wei Song (JIRA)" <ji...@apache.org> on 2017/05/10 17:44:04 UTC

[jira] [Created] (SAMZA-1278) Add Adjunct Data Store for Unbounded DataSets

Wei Song created SAMZA-1278:
-------------------------------

             Summary: Add Adjunct Data Store for Unbounded DataSets
                 Key: SAMZA-1278
                 URL: https://issues.apache.org/jira/browse/SAMZA-1278
             Project: Samza
          Issue Type: Improvement
          Components: kv-store
    Affects Versions: 0.12.0
            Reporter: Wei Song


Samza today supports RocksDB and MemDB as local data stores, which let users cache data for later use during stream processing. However, populating a data store is the end user's responsibility. This requires additional effort from end users to develop and maintain data stores, and to handle corner cases such as reloading after consumers fall off.

We want to have an adjunct data (AD) store that is a read-only cache. It automatically stores streaming data for later use. Adjunct data can be accessed in the same way as a key-value store in Samza. Data can be either partitioned or unpartitioned. If the dataset is small enough to fit in a single RocksDB instance, the same copy would be populated in every container via a broadcast stream; if it is too large to fit in one database instance, it would be partitioned across the containers of a Samza job.
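
To make the intended behavior concrete, here is a minimal sketch in plain Java. It is not part of any existing Samza API: the names ReadOnlyStore, AdjunctDataStore, onMessage and ownerOf are hypothetical, an in-memory map stands in for RocksDB, and hash-based ownership is only an assumption (in a real job the stream's own partitioning would presumably decide which container sees which key).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical read-only view handed to user code; not an existing Samza interface.
interface ReadOnlyStore<K, V> {
    V get(K key);
}

// Hypothetical adjunct data store for a single container.
class AdjunctDataStore<K, V> implements ReadOnlyStore<K, V> {
    // Stand-in for a local RocksDB instance.
    private final Map<K, V> data = new HashMap<>();
    private final boolean broadcast;
    private final int containerId;
    private final int containerCount;

    AdjunctDataStore(boolean broadcast, int containerId, int containerCount) {
        this.broadcast = broadcast;
        this.containerId = containerId;
        this.containerCount = containerCount;
    }

    // Assumption: keys are assigned to containers by hash when partitioned.
    int ownerOf(K key) {
        return Math.floorMod(key.hashCode(), containerCount);
    }

    // Called by the framework (not by user code) for each adjunct stream message.
    // Broadcast mode keeps every record; partitioned mode keeps only owned keys.
    void onMessage(K key, V value) {
        if (broadcast || ownerOf(key) == containerId) {
            data.put(key, value);
        }
    }

    // The only operation exposed to user code: a key-value lookup.
    @Override
    public V get(K key) {
        return data.get(key);
    }
}
```

User code would hold only the ReadOnlyStore reference, so it can look values up but cannot write to the cache; population stays entirely on the framework side.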

A dataset delivered in a stream can be either bounded or unbounded. An example of an unbounded dataset is a database change stream, and an example of a bounded dataset is the content of a file. When Samza is running in 24x7 mode, the stream for a bounded dataset may deliver multiple versions.
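
As an illustration of the unbounded case, the sketch below applies a database change stream to a local cache. The ChangeEvent and ChangeStreamCache classes are hypothetical, and treating a null value as a delete is an assumption made for this example; the issue does not specify an event format.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical change event: one row change identified by its key.
class ChangeEvent {
    final String key;
    final String value; // Assumption: null means the row was deleted.

    ChangeEvent(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

// Hypothetical cache that mirrors the latest state of the source table.
class ChangeStreamCache {
    private final Map<String, String> rows = new HashMap<>();

    // Apply events in arrival order; the stream never ends, so the cache
    // always reflects the most recent event seen for each key.
    void apply(ChangeEvent event) {
        if (event.value == null) {
            rows.remove(event.key);
        } else {
            rows.put(event.key, event.value);
        }
    }

    String get(String key) {
        return rows.get(key);
    }

    int size() {
        return rows.size();
    }
}
```

Because the stream is unbounded, there is no "end of dataset" marker to wait for; each lookup simply sees whatever state the events applied so far have produced.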

This proposal focuses on unbounded datasets.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)