Posted to dev@doris.apache.org by GuoLei Yi <yi...@gmail.com> on 2022/03/29 06:17:24 UTC
Refactor Doris's IO Stack
Currently, Doris has several overlapping interfaces for file IO:
- The query layer has FileReader and FileWriter, with implementations for
HDFS, S3, Broker, and Local.
- The storage layer has a BlockManager that abstracts blocks, with
WriteableFileBlock and ReadableFileBlock.
- For directory management, there is an Env interface that covers
directory operations, with RemoteEnv and PosixEnv implementations.
BlockManager also contains some operations for linking files and deleting
blocks. In addition, for S3 and HDFS, classes such as S3StorageBackend
provide file and directory operations such as mkdir, copy, and rm.
Having so many ways to do IO causes the following problems:
- It is messy. It is sometimes unclear which interface to use, and many
functions duplicate each other under different names.
- Adding a feature or fixing a bug requires changes in multiple places.
For example, if we want to read from S3 through a local cache, the cache
has to be added in every one of these places.
We need to unify the IO stack. IO access can be roughly divided into the
following three types:
- Directory operations: create files, delete files, list files, etc.
- File write operations
- File read operations
We could then implement these APIs for the different storage backends:
- Local file
- S3 file
- HDFS file
- Broker
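As a rough illustration of the three kinds of operations listed above, here
is a minimal sketch with one local-file backend. All class and method names
(FileSystem, read_at, append, and so on) are hypothetical and chosen only for
this example; they are not the actual Doris interfaces. The Doris backend is
written in C++, and Python is used here only to keep the sketch short.

```python
import os
from abc import ABC, abstractmethod
from typing import List


class FileReader(ABC):
    """File read operations (hypothetical interface)."""

    @abstractmethod
    def read_at(self, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def size(self) -> int: ...


class FileWriter(ABC):
    """File write operations (hypothetical interface)."""

    @abstractmethod
    def append(self, data: bytes) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class FileSystem(ABC):
    """Directory operations, plus factories for readers and writers."""

    @abstractmethod
    def create_file(self, path: str) -> FileWriter: ...

    @abstractmethod
    def open_file(self, path: str) -> FileReader: ...

    @abstractmethod
    def delete_file(self, path: str) -> None: ...

    @abstractmethod
    def list_files(self, directory: str) -> List[str]: ...


class LocalFileReader(FileReader):
    def __init__(self, path: str):
        self._path = path

    def read_at(self, offset: int, length: int) -> bytes:
        # Open per call to keep the sketch simple; a real reader would
        # likely keep the file handle open.
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def size(self) -> int:
        return os.path.getsize(self._path)


class LocalFileWriter(FileWriter):
    def __init__(self, path: str):
        self._file = open(path, "wb")

    def append(self, data: bytes) -> None:
        self._file.write(data)

    def close(self) -> None:
        self._file.close()


class LocalFileSystem(FileSystem):
    """The "Local file" backend; S3, HDFS, and Broker would be siblings."""

    def create_file(self, path: str) -> FileWriter:
        return LocalFileWriter(path)

    def open_file(self, path: str) -> FileReader:
        return LocalFileReader(path)

    def delete_file(self, path: str) -> None:
        os.remove(path)

    def list_files(self, directory: str) -> List[str]:
        return sorted(os.listdir(directory))


def round_trip(fs: FileSystem, directory: str) -> bytes:
    """Caller code that depends only on the abstract interfaces."""
    path = os.path.join(directory, "segment.dat")
    writer = fs.create_file(path)
    writer.append(b"hello ")
    writer.append(b"doris")
    writer.close()
    return fs.open_file(path).read_at(6, 5)
```

The point of the sketch is that round_trip never names a concrete backend, so
the same caller would work unchanged with an S3 or HDFS implementation.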
Once implemented, the unified stack can be used in the storage layer
(hot/cold separation, separation of storage and compute), the query layer
(querying S3, querying HDFS), backup and recovery, and so on, which avoids
duplicated development and maintenance.
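To make the "avoid repeated development" point concrete, here is a small
sketch of how a cache could be written once against a single reader
interface and then reused for any backend. Everything here is an assumption
made for illustration: FileReader, FakeRemoteReader, and CachedReader are
hypothetical names, the "remote" reader is just an in-memory stand-in for S3
or HDFS, and the cache lives in memory rather than on local disk.

```python
from abc import ABC, abstractmethod
from typing import Dict


class FileReader(ABC):
    """Hypothetical unified read interface."""

    @abstractmethod
    def read_at(self, offset: int, length: int) -> bytes: ...


class FakeRemoteReader(FileReader):
    """Stand-in for an S3 or HDFS reader; counts how often it is called."""

    def __init__(self, data: bytes):
        self._data = data
        self.remote_reads = 0

    def read_at(self, offset: int, length: int) -> bytes:
        self.remote_reads += 1
        return self._data[offset:offset + length]


class CachedReader(FileReader):
    """Wraps any FileReader and caches fixed-size blocks."""

    def __init__(self, inner: FileReader, block_size: int = 4):
        self._inner = inner
        self._block_size = block_size
        self._blocks: Dict[int, bytes] = {}

    def _block(self, index: int) -> bytes:
        # Fetch a block from the wrapped reader only on a cache miss.
        if index not in self._blocks:
            self._blocks[index] = self._inner.read_at(
                index * self._block_size, self._block_size)
        return self._blocks[index]

    def read_at(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        start = offset - first * self._block_size
        return data[start:start + length]
```

Because CachedReader only depends on FileReader, the same wrapper would apply
to every backend, instead of the cache being re-implemented in each place
that reads remote data.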
--
Guolei Yi
Tel:134-3991-0228
Email:yiguolei@gmail.com
Re:Re:Re: Refactor Doris's IO Stack
Posted by 陈明雨 <mo...@163.com>.
Add write privilege for plat1ko
[1] https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack
--
Best Regards
陈明雨 Mingyu Chen
Email: chenmingyu@apache.org
At 2022-03-31 14:41:35, "陈明雨" <mo...@163.com> wrote:
>Hi Guolei,
>I have created DSIP-006 for this proposal
>https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack
Re:Re: Refactor Doris's IO Stack
Posted by 陈明雨 <mo...@163.com>.
Hi Guolei,
I have created DSIP-006 for this proposal
https://cwiki.apache.org/confluence/display/DORIS/DSIP-006%3A+Refactor+IO+stack
--
Best Regards
陈明雨 Mingyu Chen
Email: chenmingyu@apache.org
At 2022-03-30 12:35:44, "王博" <wa...@gmail.com> wrote:
>+1
>Looking forward to Teacher Guolei's DSIP.
Re: Refactor Doris's IO Stack
Posted by 王博 <wa...@gmail.com>.
+1
Looking forward to Teacher Guolei's DSIP.
--
王博 Wang Bo
Re: Refactor Doris's IO Stack
Posted by GuoLei Yi <yi...@gmail.com>.
Thanks for your advice. I will follow your suggestion and replace the
existing usages step by step.
陈明雨 <mo...@163.com> wrote on Tue, Mar 29, 2022 at 22:55:
> Indeed, we need to refactor the IO layer to make it clearer and more
> extensible.
> The basic goal is that when a new kind of file system is introduced, we
> only need to implement a new derived class for it, without modifying any
> other interface in the upper layers.
>
> BTW, if we change the IO interface now, it will affect a lot of places.
> So how about doing this in 2 steps:
>
> 1. Write the new IO stack in entirely new files and leave the current
> implementation alone, for easier review.
> 2. Replace the current calls with the new IO stack.
--
Best wishes
衣国垒
Tsinghua University
Tel:134-3991-0228
Email:yiguolei@gmail.com
Re:Refactor Doris's IO Stack
Posted by 陈明雨 <mo...@163.com>.
Indeed, we need to refactor the IO layer to make it clearer and more
extensible.
The basic goal is that when a new kind of file system is introduced, we
only need to implement a new derived class for it, without modifying any
other interface in the upper layers.
BTW, if we change the IO interface now, it will affect a lot of places.
So how about doing this in 2 steps:
1. Write the new IO stack in entirely new files and leave the current
implementation alone, for easier review.
2. Replace the current calls with the new IO stack.
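One way to picture the "new derived class, no upper-layer changes" goal is
the sketch below. It is only an illustration under assumptions: the
FileSystem base class, the scheme-based registry, and the MemoryFileSystem
backend are all hypothetical names invented for this example, not Doris
code, and Python is used only for brevity.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical base class that every backend derives from."""

    @abstractmethod
    def read_all(self, path: str) -> bytes: ...


# Maps a URI scheme to a factory for the matching backend.
_REGISTRY: Dict[str, Callable[[], FileSystem]] = {}


def register_file_system(scheme: str, factory: Callable[[], FileSystem]) -> None:
    _REGISTRY[scheme] = factory


def file_system_for(uri: str) -> FileSystem:
    # Paths without a scheme are treated as local files in this sketch.
    scheme = urlparse(uri).scheme or "file"
    return _REGISTRY[scheme]()


def count_lines(uri: str) -> int:
    """Upper-layer code: written once, against the base class only."""
    fs = file_system_for(uri)
    return len(fs.read_all(uri).splitlines())


class MemoryFileSystem(FileSystem):
    """A backend added later; nothing above this line has to change."""

    files = {"mem://bucket/a.csv": b"1,2\n3,4\n"}

    def read_all(self, path: str) -> bytes:
        return self.files[path]


register_file_system("mem", MemoryFileSystem)
```

This also fits the two-step plan: the base class and new backends can live
in new files first, and existing call sites can be switched over to
something like count_lines one at a time.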
--
Best Regards
陈明雨 Mingyu Chen
Email: chenmingyu@apache.org