Posted to dev@doris.apache.org by GuoLei Yi <yi...@gmail.com> on 2022/03/29 06:17:24 UTC

Refactor Doris's IO Stack

Currently, Doris has several overlapping interfaces for file IO:

   - The query layer has FileReader and FileWriter, with implementations
   for HDFS, S3, Broker, and Local.
   - The storage layer has a BlockManager that abstracts blocks, with
   WriteableFileBlock and ReadableFileBlock.
   - For directory management, there is an Env interface that covers
   directory operations, with RemoteEnv and PosixEnv implementations.
   BlockManager also has some operations for linking files and deleting
   blocks. In addition, for S3 and HDFS, classes such as S3StorageBackend
   provide their own file and directory operations, including mkdir,
   copy, and rm.

Having so many ways to do the same thing causes the following problems:

   - It is messy. It is often unclear which interface to use, and many
   functions are duplicated under different abstractions and names.
   - Changing a feature or fixing a bug requires changes in multiple
   places. For example, if we want to add a local cache for S3 reads, it
   has to be added in every place that reads from S3.
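
To illustrate the local cache example: with a single read interface, a
cache could be written once as a wrapper around any reader. The sketch
below is only an illustration in Python, not Doris code. The read_at
method, InMemoryReader, and CachedReader are hypothetical names, and the
in-memory reader merely stands in for a remote backend such as S3.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class FileReader(ABC):
    """Hypothetical unified read interface; not the actual Doris class."""

    @abstractmethod
    def read_at(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes starting at `offset`."""


class InMemoryReader(FileReader):
    """Stand-in for a remote backend such as S3; counts backend reads."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.backend_reads = 0

    def read_at(self, offset: int, length: int) -> bytes:
        self.backend_reads += 1
        return self._data[offset:offset + length]


class CachedReader(FileReader):
    """Wraps any FileReader and caches results per (offset, length)."""

    def __init__(self, inner: FileReader) -> None:
        self._inner = inner
        self._cache: Dict[Tuple[int, int], bytes] = {}

    def read_at(self, offset: int, length: int) -> bytes:
        key = (offset, length)
        if key not in self._cache:
            # Only a cache miss reaches the wrapped backend.
            self._cache[key] = self._inner.read_at(offset, length)
        return self._cache[key]


remote = InMemoryReader(b"hello doris")
reader = CachedReader(remote)
first = reader.read_at(0, 5)
second = reader.read_at(0, 5)  # same range again: answered by the cache
```

Callers only see FileReader, so the cache does not have to be repeated
in each place that reads remote data.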

We need to unify the IO stack. Access to IO can be roughly divided into
the following three types:

   - Directory operations: create files, delete files, list files, etc.
   - File write operations
   - File read operations
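
As a rough sketch of how these three types could map onto three
abstract interfaces (written in Python only for brevity): FileReader and
FileWriter are names already used in Doris, while FileSystem and every
method name below are assumptions made for illustration, not a proposed
final API.

```python
from abc import ABC, abstractmethod
from typing import List


class FileSystem(ABC):
    """Directory-level operations (hypothetical name and methods)."""

    @abstractmethod
    def create_file(self, path: str) -> "FileWriter": ...

    @abstractmethod
    def open_file(self, path: str) -> "FileReader": ...

    @abstractmethod
    def delete_file(self, path: str) -> None: ...

    @abstractmethod
    def list_files(self, directory: str) -> List[str]: ...


class FileWriter(ABC):
    """File write operations (methods are illustrative)."""

    @abstractmethod
    def append(self, data: bytes) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class FileReader(ABC):
    """File read operations (methods are illustrative)."""

    @abstractmethod
    def read_at(self, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def size(self) -> int: ...
```

Each storage backend would then provide one concrete subclass of each
interface, and upper layers would depend only on the abstract types.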

We could then implement these APIs for the different storage backends:

   - Local file
   - S3 file
   - HDFS file
   - Broker

Once implemented, the APIs can be used in the storage layer (hot/cold
data separation, separation of storage and compute), the query layer
(querying S3 and HDFS), backup and recovery, and so on, avoiding
duplicated development and maintenance.
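
A minimal sketch of the reuse this would allow, assuming a simplified
interface: the same backup routine works against any backend. All names
here (write_file, read_file, list_files, backup) are hypothetical, and
MemoryFileSystem is only a stand-in for a remote backend such as S3 or
HDFS so that the example runs without external services.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, List


class FileSystem(ABC):
    """Hypothetical unified interface; names are illustrative only."""

    @abstractmethod
    def write_file(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read_file(self, path: str) -> bytes: ...

    @abstractmethod
    def list_files(self) -> List[str]: ...


class LocalFileSystem(FileSystem):
    """Stores files under a root directory on the local disk."""

    def __init__(self, root: str) -> None:
        self._root = root

    def write_file(self, path: str, data: bytes) -> None:
        with open(os.path.join(self._root, path), "wb") as f:
            f.write(data)

    def read_file(self, path: str) -> bytes:
        with open(os.path.join(self._root, path), "rb") as f:
            return f.read()

    def list_files(self) -> List[str]:
        return sorted(os.listdir(self._root))


class MemoryFileSystem(FileSystem):
    """In-memory stand-in for a remote backend such as S3 or HDFS."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write_file(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read_file(self, path: str) -> bytes:
        return self._files[path]

    def list_files(self) -> List[str]:
        return sorted(self._files)


def backup(src: FileSystem, dst: FileSystem) -> int:
    """Copy every file from src to dst; return the number of files copied."""
    names = src.list_files()
    for name in names:
        dst.write_file(name, src.read_file(name))
    return len(names)


with tempfile.TemporaryDirectory() as root:
    local = LocalFileSystem(root)
    local.write_file("segment_0.dat", b"rows")
    remote = MemoryFileSystem()
    copied = backup(local, remote)
```

The backup function never mentions a concrete backend, which is the
property that would let storage, query, and backup code share one
implementation.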

-- 
Guolei Yi
Tel:134-3991-0228
Email:yiguolei@gmail.com

Re:Re:Re: Refactor Doris's IO Stack

Posted by 陈明雨 <mo...@163.com>.
Add write privilege for plat1ko on the DSIP page [1].

[1] https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack

--
Best Regards
陈明雨 Mingyu Chen
Email: chenmingyu@apache.org

At 2022-03-31 14:41:35, "陈明雨" <mo...@163.com> wrote:
>Hi Guolei,
>I have created DSIP-006 for this proposal
>https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack

Re:Re: Refactor Doris's IO Stack

Posted by 陈明雨 <mo...@163.com>.
Hi Guolei,
I have created DSIP-006 for this proposal
https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack









At 2022-03-30 12:35:44, "王博" <wa...@gmail.com> wrote:
>+1
>Looking forward to Teacher Guolei's DSIP.

Re: Refactor Doris's IO Stack

Posted by 王博 <wa...@gmail.com>.
+1
Looking forward to Teacher Guolei's DSIP.



-- 
王博  Wang Bo

Re: Refactor Doris's IO Stack

Posted by GuoLei Yi <yi...@gmail.com>.
Thanks for your advice. I will follow your suggestion and replace the
existing usages step by step.

At 2022-03-29 22:55, "陈明雨" <mo...@163.com> wrote:

> Indeed, we need to refactor the IO layer to make it clearer and more
> extensible.
> The basic goal is that when a new kind of file system is introduced, we
> only need to implement a new derived class for it, without modifying any
> interface in the upper layers.
>
> BTW, changing the IO interface right now would affect a lot of places.
> So how about doing this in 2 steps:
>
> 1. Rewrite the IO stack in entirely new files and leave the current
> implementation alone, for easier reviewing.
> 2. Use the new IO stack to replace the current calls.


-- 
Wishing you a pleasant day

衣国垒 Guolei Yi
Tsinghua University
Tel:134-3991-0228
Email:yiguolei@gmail.com

Re:Refactor Doris's IO Stack

Posted by 陈明雨 <mo...@163.com>.
Indeed, we need to refactor the IO layer to make it clearer and more extensible.
The basic goal is that when a new kind of file system is introduced, we only need to implement a new derived class
for it, without modifying any interface in the upper layers.
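
One possible way to get this property, sketched in Python purely for
illustration: backends register themselves under a URI scheme, and upper
layers look them up through one function. The registry, the register
decorator, file_system_for, and the class names are all hypothetical and
are not taken from the Doris code base.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical base class that every backend derives from."""

    @abstractmethod
    def name(self) -> str: ...


# Maps a URI scheme to a factory for the matching backend.
_REGISTRY: Dict[str, Callable[[], FileSystem]] = {}


def register(scheme: str):
    """Class decorator that registers a backend under a URI scheme."""
    def decorator(cls):
        _REGISTRY[scheme] = cls
        return cls
    return decorator


def file_system_for(uri: str) -> FileSystem:
    """Pick a backend from the URI scheme; plain paths count as local."""
    scheme = urlparse(uri).scheme or "file"
    return _REGISTRY[scheme]()


@register("file")
class LocalFileSystem(FileSystem):
    def name(self) -> str:
        return "local"


@register("s3")
class S3FileSystem(FileSystem):
    def name(self) -> str:
        return "s3"


def describe(uri: str) -> str:
    """Upper-layer code: depends only on the FileSystem interface."""
    return file_system_for(uri).name()


# Introducing a new file system is one new derived class;
# describe() and file_system_for() stay unchanged.
@register("hdfs")
class HdfsFileSystem(FileSystem):
    def name(self) -> str:
        return "hdfs"
```

In this sketch, adding HDFS support touches neither describe nor
file_system_for, which is the kind of isolation described above.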


BTW, changing the IO interface right now would affect a lot of places.
So how about doing this in 2 steps:


1. Rewrite the IO stack in entirely new files and leave the current implementation alone, for easier reviewing.
2. Use the new IO stack to replace the current calls.







