
Posted to dev@linkis.apache.org by GitBox <gi...@apache.org> on 2022/01/13 12:00:42 UTC

[GitHub] [incubator-linkis] Dlimeng opened a new issue #1303: [Feature] Linkis result set discussion

Dlimeng opened a new issue #1303:
URL: https://github.com/apache/incubator-linkis/issues/1303


   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/incubator-linkis/issues) and found no similar feature requirement.
   
   
   ### Problem Description
   
   Linkis currently stores result sets in its custom Dolphin format. This issue proposes storing them in Parquet (or ORC) instead.
   
   ### Description
   
   1. Linkis storage in Parquet
   2. Linkis storage in ORC
   
   ### Use case
   
   _No response_
   
   ### solutions
   
   1. Introduce Apache Parquet
   2. Introduce Apache ORC
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@linkis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@linkis.apache.org
For additional commands, e-mail: dev-help@linkis.apache.org

[GitHub] [incubator-linkis] Dlimeng commented on issue #1303: [Feature] Linkis result set discussion

Posted by GitBox <gi...@apache.org>.

Dlimeng commented on issue #1303:
URL: https://github.com/apache/incubator-linkis/issues/1303#issuecomment-1012900916

This page describes the process for proposing breaking changes to Linkis.
• Introduction
• Storage stores a variety of file systems
• Result Set - Parquet
• Parquet composition
• Parquet Design
• Parquet implementation
• Result Set - ORC
• ORC composition
• Compare
• Release

### Introduction
Linkis needs to store various types of data in files. For example, when Hive table data is stored in a file, metadata such as field types, column names, and comments should be saved along with it.

### Storage stores a variety of file systems
![image](https://user-images.githubusercontent.com/16789827/149473151-027bc250-6601-4580-8996-474068b45a6d.png)

### Result Set - Parquet
#### Parquet composition
Parquet is only a storage format: it is language- and platform-independent and is not bound to any data processing framework. The components that currently work with Parquet are listed below. Most commonly used query engines and computing frameworks have been adapted, and data produced by other serialization tools can easily be converted to Parquet.

• Query Engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
• Computing Framework: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
• Data Models: Avro, Thrift, Protocol Buffers, POJOs

The schema of each data model contains multiple fields, and each field can in turn contain nested fields. Each field has three attributes: repetition, data type, and field name. The repetition is one of three kinds: required (exactly 1 occurrence), repeated (0 or more occurrences), or optional (0 or 1 occurrence). The data type of a field is either a group (complex type) or a primitive (basic type).
Primitive data types: INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
![image](https://user-images.githubusercontent.com/16789827/149473203-00721554-dc37-4edc-899f-a1b401bd5208.png)
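The field model described above (repetition, data type, field name, with groups nesting other fields) can be sketched in a few lines of Python. This is only an illustration, not Linkis or Parquet library code: the `Field` class, `render_message` helper, and the `result_set` schema are hypothetical names. The output imitates Parquet's textual schema notation in simplified form (no logical type annotations, no length for FIXED_LEN_BYTE_ARRAY).

```python
from dataclasses import dataclass, field
from typing import List, Optional

REPETITIONS = {"required", "optional", "repeated"}
PRIMITIVES = {"INT64", "INT32", "BOOLEAN", "BINARY", "FLOAT",
              "DOUBLE", "INT96", "FIXED_LEN_BYTE_ARRAY"}


@dataclass
class Field:
    """Hypothetical model of one schema field: repetition, type, name."""
    repetition: str
    name: str
    primitive: Optional[str] = None                        # set for primitive fields
    children: List["Field"] = field(default_factory=list)  # set for groups

    def __post_init__(self):
        if self.repetition not in REPETITIONS:
            raise ValueError(f"unknown repetition: {self.repetition}")
        # A field is either a primitive or a group with children, never both.
        if (self.primitive is None) == (not self.children):
            raise ValueError("field must be a primitive or a non-empty group")
        if self.primitive is not None and self.primitive not in PRIMITIVES:
            raise ValueError(f"unknown primitive type: {self.primitive}")

    def render(self, indent: int = 1) -> str:
        pad = "  " * indent
        if self.primitive is not None:
            return f"{pad}{self.repetition} {self.primitive.lower()} {self.name};"
        lines = [f"{pad}{self.repetition} group {self.name} {{"]
        lines += [child.render(indent + 1) for child in self.children]
        lines.append(f"{pad}}}")
        return "\n".join(lines)


def render_message(name: str, fields: List[Field]) -> str:
    """Render a whole schema in simplified Parquet-style notation."""
    body = "\n".join(f.render() for f in fields)
    return f"message {name} {{\n{body}\n}}"


# Hypothetical result-set schema with one nested group.
schema_text = render_message("result_set", [
    Field("required", "name", primitive="BINARY"),
    Field("optional", "course", children=[
        Field("optional", "course", primitive="BINARY"),
        Field("optional", "score", primitive="INT32"),
    ]),
    Field("repeated", "work_locations", primitive="BINARY"),
])
print(schema_text)
```

Validating the repetition and type in `__post_init__` keeps an invalid schema from being built at all, which mirrors the fact that every Parquet field must carry all three attributes.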

#### Parquet Design
![image](https://user-images.githubusercontent.com/16789827/149473352-2b5729e6-fc0b-4fa6-9ef3-90742fa5bda4.png)

#### Parquet implementation
![image](https://user-images.githubusercontent.com/16789827/149473372-574f62ad-6d28-4b5b-8e96-18b428f2ba7e.png)

### Result Set - ORC
#### ORC composition
Unlike Parquet, ORC does not natively support nested data formats. Instead, it supports nesting through special handling of complex data types (struct, map, array), as in this Hive table definition:
CREATE TABLE `orcStructTable`(
`name` string,
`course` struct<course:string,score:int>,
`score` map<string,int>,
`work_locations` array<string>)
STORED AS ORC;
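One way to picture how the complex types in the table definition above are handled: ORC flattens the type tree into a list of columns, numbering them in pre-order with the top-level struct as column 0. The sketch below reproduces that numbering in plain Python. It is an illustration only, not ORC library code; the helper names (`prim`, `struct`, `map_of`, `array_of`, `flatten`) and the `_key`, `_value`, and `_elem` labels are chosen here for readability.

```python
def prim(kind):
    """A primitive type has no children."""
    return (kind, [])


def struct(**fields):
    """A struct has one child per named field, in declaration order."""
    return ("struct", list(fields.items()))


def map_of(key, value):
    """A map has exactly two children: its key type and its value type."""
    return ("map", [("_key", key), ("_value", value)])


def array_of(elem):
    """An array has exactly one child: its element type."""
    return ("array", [("_elem", elem)])


def flatten(node, path="root", out=None):
    """Walk the type tree in pre-order and assign consecutive column ids."""
    if out is None:
        out = []
    kind, children = node
    out.append((len(out), path, kind))  # id = number of columns seen so far
    for label, child in children:
        flatten(child, f"{path}.{label}", out)
    return out


# Type tree for the orcStructTable definition.
table = struct(
    name=prim("string"),
    course=struct(course=prim("string"), score=prim("int")),
    score=map_of(prim("string"), prim("int")),
    work_locations=array_of(prim("string")),
)

columns = flatten(table)
for column_id, path, kind in columns:
    print(column_id, path, kind)
```

The four declared fields become ten columns in total, because the struct, map, and array each contribute a parent column plus one column per child.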
Like Parquet files, ORC files are stored in binary form, so they cannot be read directly as text. ORC files are also self-describing: they contain extensive metadata, which is serialized with Protocol Buffers.

• ORC file: An ordinary binary file saved on the file system. An ORC file can contain multiple stripes, and each stripe contains multiple records. Within a stripe, records are stored column by column. A stripe corresponds to the concept of a row group in Parquet.
• File-level metadata: Includes the PostScript (file description information), the file meta information (including statistics for the entire file), information about all stripes, and the file schema.
• stripe: A group of rows forms a stripe. A file is read one stripe at a time. A stripe is generally about the size of an HDFS block and stores the index and data of each column.
• stripe metadata: Saves the position of the stripe, the statistics of each column in the stripe, and all stream types and positions.
• row group: The smallest unit of the index. A stripe contains multiple row groups, each consisting of 10,000 values by default.
• stream: A stream represents a valid piece of data in the file, either index or data. The index stream saves the position and statistics of each row group. The data stream holds the actual data, in several kinds determined by the column type and encoding method.
![image](https://user-images.githubusercontent.com/16789827/149473421-2d2581e8-6623-4ddb-a5fa-d13330cac2a3.png)
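The relationship between file, stripe, and row group described above can be made concrete with a little arithmetic. The sketch below finds which stripe and which row group hold a given row, assuming the default of 10,000 rows per row group. The function name `locate_row` and the per-stripe row counts are hypothetical; real stripes are sized in bytes, so their row counts vary from file to file.

```python
ROW_INDEX_STRIDE = 10_000  # default number of rows per row group


def locate_row(row_number, stripe_row_counts, stride=ROW_INDEX_STRIDE):
    """Return (stripe index, row group index within the stripe,
    offset within the row group) for a zero-based row number."""
    if row_number < 0:
        raise IndexError("row number must be non-negative")
    first_row = 0  # file-level number of the current stripe's first row
    for stripe_index, count in enumerate(stripe_row_counts):
        if row_number < first_row + count:
            offset_in_stripe = row_number - first_row
            return (stripe_index,
                    offset_in_stripe // stride,
                    offset_in_stripe % stride)
        first_row += count
    raise IndexError("row number is beyond the end of the file")


# Hypothetical file with three stripes holding these row counts.
stripes = [25_000, 25_000, 12_345]

position = locate_row(37_500, stripes)
print(position)  # (1, 1, 2500): second stripe, second row group, offset 2500
```

A reader that only needs this row can skip the first stripe entirely and, within the second stripe, use the index stream to jump to the second row group instead of scanning from the start.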

### Compare
> hive
• On wide tables, ORC data performs better than Parquet data.
• The ORC file format performs better in terms of storage space, data import speed, and query speed, and ORC can support ACID operations to a certain extent. It is currently the columnar storage format most advocated by the Hive community.

### Release
Expected release: 2022-03-31

>refer to

* Apache Parquet
* Apache ORC

[Linkis result set discussion](https://blog.csdn.net/qq_19968255/article/details/122471803)
