Posted to dev@reef.apache.org by "Markus Weimer (JIRA)" <ji...@apache.org> on 2015/08/11 02:22:45 UTC

[jira] [Created] (REEF-580) Add a Block Management Service to REEF

Markus Weimer created REEF-580:
----------------------------------

             Summary: Add a Block Management Service to REEF
                 Key: REEF-580
                 URL: https://issues.apache.org/jira/browse/REEF-580
             Project: REEF
          Issue Type: Improvement
            Reporter: Markus Weimer


We propose the addition of a data Block Management service to REEF. The Block Manager manages the transient data of a Big Data application. The Block Manager assumes that transient data can be managed in the following hierarchy:

  * *Data Set:* A data set consists of a set of (physical) partitions. For instance, a folder on HDFS could be considered a data set, while its files constitute the partitions.
  * *Partition:* A physical partition of a data set. In the example above, it would be a file. Partitions consist of Blocks.
  * *Block:* The atomic unit of data management. Each Block belongs to exactly one Partition. Blocks are immutable. Blocks can be stored in Evaluator memory, on local disk, or in stable, distributed storage. Blocks can have replicas across these storage tiers. Blocks contain data of arbitrary format. From the perspective of this Block Management service, they are large, fixed-size byte arrays.

The purpose of the Block Manager is to manage the metadata and movement of data sets organized in such a way. To facilitate that, each Block, Partition and DataSet has a unique ID.
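
As a rough illustration of the hierarchy above, the following Java sketch models a Data Set, its Partitions, and immutable Blocks, each carrying an ID. All class, field, and method names here are hypothetical; the proposal does not define an API, and using strings as IDs is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the hierarchy; none of these names come from REEF.
final class Block {
  private final String blockId;
  private final String partitionId; // each Block belongs to exactly one Partition
  private final byte[] data;

  Block(String blockId, String partitionId, byte[] data) {
    this.blockId = blockId;
    this.partitionId = partitionId;
    // Defensive copy, so later changes to the caller's array cannot alter the Block.
    this.data = Arrays.copyOf(data, data.length);
  }

  String getBlockId() { return blockId; }
  String getPartitionId() { return partitionId; }
  int size() { return data.length; }

  // Returns a copy, so callers cannot mutate the Block's contents.
  byte[] copyData() { return Arrays.copyOf(data, data.length); }
}

final class Partition {
  private final String partitionId;
  private final String dataSetId;
  private final List<Block> blocks = new ArrayList<>();

  Partition(String partitionId, String dataSetId) {
    this.partitionId = partitionId;
    this.dataSetId = dataSetId;
  }

  String getPartitionId() { return partitionId; }
  String getDataSetId() { return dataSetId; }

  // Rejects Blocks that were created for a different Partition.
  void addBlock(Block block) {
    if (!partitionId.equals(block.getPartitionId())) {
      throw new IllegalArgumentException("Block belongs to another partition");
    }
    blocks.add(block);
  }

  List<Block> getBlocks() { return Collections.unmodifiableList(blocks); }
}

final class DataSet {
  private final String dataSetId;
  private final List<Partition> partitions = new ArrayList<>();

  DataSet(String dataSetId) { this.dataSetId = dataSetId; }

  String getDataSetId() { return dataSetId; }

  // Rejects Partitions that were created for a different Data Set.
  void addPartition(Partition partition) {
    if (!dataSetId.equals(partition.getDataSetId())) {
      throw new IllegalArgumentException("Partition belongs to another data set");
    }
    partitions.add(partition);
  }

  List<Partition> getPartitions() { return Collections.unmodifiableList(partitions); }
}
```

In the HDFS example, a `DataSet` would correspond to the folder and each `Partition` to one of its files.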

On the *Task side*, the Block Manager facilitates the retrieval of and access to any Block or Partition by their ID. Specific access methods are yet to be designed (e.g. whether or not there is an order to the blocks). Also, new Blocks can be created on the Task side for a given Partition. Special consideration shall be given to the memory allocation efficiency of this operation.
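
Since the access methods are not yet designed, the following is only one possible shape for the Task side, assuming lookup by string ID and Block IDs generated with `UUID`. The names `TaskBlock` and `TaskSideBlockStore` are hypothetical, and the sketch keeps everything in local memory; it does not address the memory allocation efficiency concern or any communication with the Driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical immutable Block as seen by a Task.
final class TaskBlock {
  private final String blockId;
  private final String partitionId;
  private final byte[] data;

  TaskBlock(String blockId, String partitionId, byte[] data) {
    this.blockId = blockId;
    this.partitionId = partitionId;
    this.data = Arrays.copyOf(data, data.length);
  }

  String getBlockId() { return blockId; }
  String getPartitionId() { return partitionId; }
  byte[] copyData() { return Arrays.copyOf(data, data.length); }
}

// Hypothetical Task-side store: retrieval by ID and creation of new Blocks.
final class TaskSideBlockStore {
  private final Map<String, TaskBlock> blocksById = new ConcurrentHashMap<>();
  private final Map<String, List<String>> blockIdsByPartition = new ConcurrentHashMap<>();

  // Creates a new Block for the given Partition and assigns it a fresh unique ID.
  TaskBlock createBlock(String partitionId, byte[] data) {
    String blockId = UUID.randomUUID().toString();
    TaskBlock block = new TaskBlock(blockId, partitionId, data);
    blocksById.put(blockId, block);
    blockIdsByPartition
        .computeIfAbsent(partitionId, k -> new CopyOnWriteArrayList<>())
        .add(blockId);
    return block;
  }

  // Looks up a Block locally; a real service would presumably ask the Driver on a miss.
  Optional<TaskBlock> getBlock(String blockId) {
    return Optional.ofNullable(blocksById.get(blockId));
  }

  // Returns the locally known Blocks of a Partition in creation order
  // (whether Blocks are ordered at all is an open question in the proposal).
  List<TaskBlock> getPartition(String partitionId) {
    List<TaskBlock> result = new ArrayList<>();
    for (String id : blockIdsByPartition.getOrDefault(partitionId, new ArrayList<>())) {
      result.add(blocksById.get(id));
    }
    return result;
  }
}
```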

On the *Driver side*, the Block Manager keeps track of the metadata of all Blocks. It provides a network protocol used by the Task side components to retrieve and update metadata records. Metadata can be kept in memory or, in a later version, in stable storage such as a SQL database.
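
A minimal sketch of the in-memory variant of the Driver-side metadata store might look as follows. It assumes one metadata record per Block that tracks which storage tiers hold a replica. The names `StorageTier`, `BlockMetadata`, and `InMemoryBlockMetadataStore` are hypothetical, and the network protocol in front of the store is left out.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// The three storage locations named in the proposal; the enum itself is hypothetical.
enum StorageTier { EVALUATOR_MEMORY, LOCAL_DISK, STABLE_STORAGE }

// Immutable metadata record for one Block.
final class BlockMetadata {
  private final String blockId;
  private final String partitionId;
  private final Set<StorageTier> replicaTiers;

  BlockMetadata(String blockId, String partitionId, Set<StorageTier> replicaTiers) {
    this.blockId = blockId;
    this.partitionId = partitionId;
    EnumSet<StorageTier> copy = EnumSet.noneOf(StorageTier.class);
    copy.addAll(replicaTiers);
    this.replicaTiers = Collections.unmodifiableSet(copy);
  }

  String getBlockId() { return blockId; }
  String getPartitionId() { return partitionId; }
  Set<StorageTier> getReplicaTiers() { return replicaTiers; }

  // Returns a new record that additionally lists the given tier.
  BlockMetadata withReplica(StorageTier tier) {
    EnumSet<StorageTier> updated = EnumSet.noneOf(StorageTier.class);
    updated.addAll(replicaTiers);
    updated.add(tier);
    return new BlockMetadata(blockId, partitionId, updated);
  }
}

// Keeps all records in memory; a later version could back this with stable storage.
final class InMemoryBlockMetadataStore {
  private final ConcurrentHashMap<String, BlockMetadata> records = new ConcurrentHashMap<>();

  // Registers a new record; returns false if the Block ID is already known.
  boolean register(BlockMetadata metadata) {
    return records.putIfAbsent(metadata.getBlockId(), metadata) == null;
  }

  Optional<BlockMetadata> lookup(String blockId) {
    return Optional.ofNullable(records.get(blockId));
  }

  // Atomically records an additional replica; returns false for unknown Block IDs.
  boolean addReplica(String blockId, StorageTier tier) {
    return records.computeIfPresent(blockId, (id, old) -> old.withReplica(tier)) != null;
  }
}
```

Because Blocks are immutable, only the replica locations in a record change over time, which is why the sketch replaces whole records instead of mutating them in place.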

The Block Management service shall be built in a language- and platform-agnostic manner. At the very least, the Driver-side network protocol needs to be accessible to both JVM and CLR implementations of the Task side. REST could be an appropriate approach.


