Posted to dev@iotdb.apache.org by "Lei Rui (JIRA)" <ji...@apache.org> on 2019/05/16 16:21:00 UTC

[jira] [Created] (IOTDB-92) The data locality principle used by Spark loses ground in the face of TsFile.

Lei Rui created IOTDB-92:
----------------------------

             Summary: The data locality principle used by Spark loses ground in the face of TsFile.
                 Key: IOTDB-92
                 URL: https://issues.apache.org/jira/browse/IOTDB-92
             Project: Apache IoTDB
          Issue Type: Improvement
            Reporter: Lei Rui


    In the development of TsFile-Spark-Connector, we discovered that the data locality principle used by Spark loses ground in the face of TsFile. We believe the problem is rooted in the storage structure design of TsFile. Our latest implementation of TsFile-Spark-Connector finds a way to guarantee correct query results despite this constraint. The resolution of the data locality problem is left for future work. Below are the details.
h1. 1. Spark Partition

    In Apache Spark, data is stored in the form of RDDs and divided into partitions across various nodes. A partition is a logical chunk of a large distributed data set that helps parallelize distributed data processing. Spark applies the data locality principle to minimize the network traffic caused by sending data between executors.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html

https://techvidvan.com/tutorials/spark-partition/
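
    To make the partition/locality relationship concrete, here is a minimal sketch in Scala (the path and object name are illustrative only, not connector code). It shows the two pieces of information the Spark scheduler works with: how many partitions an RDD has, and which hosts each partition prefers, which is exactly what the data locality principle relies on.

{code:scala}
// A minimal sketch, not connector code: inspect partitions and their preferred
// locations, which the Spark scheduler uses to place tasks close to the data.
import org.apache.spark.sql.SparkSession

object PartitionLocalityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-locality-demo").getOrCreate()
    val sc = spark.sparkContext

    // A file on HDFS is typically split into one partition per block (illustrative path).
    val rdd = sc.textFile("hdfs:///tmp/example.txt")
    println(s"number of partitions = ${rdd.getNumPartitions}")

    // preferredLocations() lists the hosts holding each partition's data;
    // the scheduler tries to run the corresponding task on one of them.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} prefers ${rdd.preferredLocations(p).mkString(", ")}")
    }

    spark.stop()
  }
}
{code}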
h1. 2. TsFile Structure

    TsFile is a columnar storage file format designed for time series data, which supports efficient compression and query. Data in TsFile are organized in a device-measurement hierarchy. As the figure below shows, the storage unit of a device is a chunk group and that of a measurement is a chunk. Measurements of a device are grouped together.

    Under this architecture, different partitions of Spark logically contain different sets of chunk groups of the TsFile. Now consider the query process of this TsFile on Spark. Suppose we run the query “select * from root where d1.s6<2.5 and d2.s1>10”. Because the predicate spans two devices whose chunk groups are stored separately, a task scheduled for a single partition has to access the whole file's data to produce the right answer. However, this also means that it is nearly impossible to apply the data locality principle.
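
    To illustrate why such a query defeats per-partition processing, here is a simplified, hypothetical model of the layout described above (it is not the actual TsFile reader API; the case classes and helper names are illustrative). Because the chunk groups of d1 and d2 sit in different regions of the file, the conjunction d1.s6<2.5 and d2.s1>10 can only be evaluated by a task that sees both regions.

{code:scala}
// A simplified model of the device-major layout, for illustration only.
case class Chunk(measurement: String, points: Seq[(Long, Double)]) // (timestamp, value)
case class ChunkGroup(device: String, chunks: Seq[Chunk])
case class TsFileModel(chunkGroups: Seq[ChunkGroup])

// Timestamps at which device.measurement satisfies a predicate.
def matchingTimes(file: TsFileModel, device: String, measurement: String,
                  pred: Double => Boolean): Set[Long] =
  (for {
    cg     <- file.chunkGroups if cg.device == device
    c      <- cg.chunks if c.measurement == measurement
    (t, v) <- c.points if pred(v)
  } yield t).toSet

// "select * from root where d1.s6 < 2.5 and d2.s1 > 10": the answer needs the
// chunk groups of *both* devices, so a task that only sees the chunk groups of
// one Spark partition cannot decide which rows qualify.
def qualifyingTimes(file: TsFileModel): Set[Long] =
  matchingTimes(file, "d1", "s6", _ < 2.5) intersect matchingTimes(file, "d2", "s1", _ > 10)
{code}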
h1. 3. Problem

Now we can summarize two problems. 

The first problem is how to guarantee the correctness of the query answer integrated from all the partition tasks without changing the current storage structure of TsFile. To solve this problem, we propose a solution that converts the space partition constraint into a time partition constraint, while still requiring a single task to have access to the whole TsFile data. As shown in the figure below, the task of partition 1 is assigned the yellow-marked time partition constraint; the task of partition 2 is assigned the green-marked time partition constraint; the task of partition 3 is assigned an empty time partition constraint because the former two tasks already cover the whole time range of the query.
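
A minimal sketch of this idea follows, under the assumption that every task can read the whole TsFile; the names (TimeRange, timePartitionConstraints) are illustrative and not the actual TsFile-Spark-Connector code. The global time range is cut into disjoint intervals, each task keeps only the rows whose timestamps fall into its own interval, and the union of all task outputs is the complete answer without duplicates.

{code:scala}
// Illustrative sketch only: derive disjoint time partition constraints.
case class TimeRange(startInclusive: Long, endExclusive: Long) {
  def isEmpty: Boolean = startInclusive >= endExclusive
  def contains(t: Long): Boolean = t >= startInclusive && t < endExclusive
}

def timePartitionConstraints(minTime: Long, maxTime: Long, numPartitions: Int): Seq[TimeRange] = {
  val span = maxTime - minTime + 1
  (0 until numPartitions).map { i =>
    // With integer division some ranges can come out empty; an empty constraint
    // means that task emits nothing, analogous to partition 3 in the example above.
    TimeRange(minTime + span * i / numPartitions, minTime + span * (i + 1) / numPartitions)
  }
}

// Task i evaluates the full query over the whole file but emits only rows whose
// timestamp satisfies timePartitionConstraints(minTime, maxTime, n)(i).
{code}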

The second problem is more fundamental: how can we adjust TsFile so that some degree of data locality is possible when it is queried on Spark?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Re: [jira] [Created] (IOTDB-92) The data locality principle used by Spark loses ground in the face of TsFile.

Posted by Xiangdong Huang <sa...@gmail.com>.
Now Jenkins CI is not stable (from Build #96 to #104)... It is caused by
one of your previous PRs about Tsfile-Spark-Connector (#108 or #173).

I notice that PR #177 is for fixing that. You can attach it to this JIRA
issue.

Looking forward to fixing it ASAP.

-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院

