You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "deshanxiao (via GitHub)" <gi...@apache.org> on 2023/04/11 07:16:08 UTC

[GitHub] [orc] deshanxiao opened a new pull request, #1465: [Docs] Add stream order description in ORC spec.

deshanxiao opened a new pull request, #1465:
URL: https://github.com/apache/orc/pull/1465

   ### What changes were proposed in this pull request?
   This PR is aimed to add more description about stram order in ORC spec.
   
   ### Why are the changes needed?
   There are many users who are misled by the order of the document table, in fact the stream has no fixed order.
   
   ### How was this patch tested?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1542228369

   Hi @dongjoon-hyun @wgtmac 
   
   The PR is ready for review. Please take a look at it. Thank you~
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190568745


##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.

Review Comment:
   ```suggestion
   There is a general order for index and data streams:
   * Index streams are always placed together in the beginning of the stripe.
   * Data streams are placed together after index streams (if any).
   * Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant.
   
   There is no fixed order within each unencrypted or encryption variant in the index and data streams:
   * Different stream kinds of the same column can be placed in any order.
   * Streams from different columns can even be placed in any order.
   
   To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth.
   ```



##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.

Review Comment:
   Also I would prefer to move them to below line 907.



##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.

Review Comment:
   These statements look a little bit vague to me, I have changed them to provide fixed order and flexible order. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun closed pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
URL: https://github.com/apache/orc/pull/1465


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1550313672

   I also published to Apache ORC website.
   - https://orc.apache.org/specification/ORCv0/#column-encoding-section


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190801956


##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
+
+The following table shows the order of types that contain multiple streams:
+
+ Type      | The First Postion Stream Type   | The Second Postion Stream Type | Contents        |
+:----------|:--------------------------------|:-------------------------------|:----------------|
+| Binary    | DATA                            | LENGTH                         |                 |
+| String    | DATA                            | LENGTH                         | Direct encoding |
+| Decimal   | DATA                            | SECONDARY                      |                 |
+| Timestamp | DATA                            | SECONDARY                      |                 |

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1514380317

   cc @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190767601


##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.

Review Comment:
   It looks resonable and will add more description/remove the chart.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs][Draft] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1523257635

   Mark the PR as draft because I may add more details about the stream order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1550752621

   Thank you @dongjoon-hyun !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1166191433


##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,12 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+Note that the order of these streams is **not fixed**. In the example
+of the integer column mentioned above, the order of the PRESENT stream
+and the DATA stream cannot be determined in advance. Instead, we need
+to determine the type of stream based on the Stream Kind, rather than

Review Comment:
   Should we mention the following:
   - index and data streams cannot be interleaved.
   - the order of streams from different columns is not fixed as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1166401368


##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,12 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+Note that the order of these streams is **not fixed**. In the example
+of the integer column mentioned above, the order of the PRESENT stream
+and the DATA stream cannot be determined in advance. Instead, we need
+to determine the type of stream based on the Stream Kind, rather than

Review Comment:
   Thank you @wgtmac . Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190593058


##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.

Review Comment:
   Thank you @wgtmac . It has been updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190601006


##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.

Review Comment:
   Seems the doc has a line limit that we need to follow.



##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
+
+The following table shows the order of types that contain multiple streams:
+
+ Type      | The First Postion Stream Type   | The Second Postion Stream Type | Contents        |
+:----------|:--------------------------------|:-------------------------------|:----------------|
+| Binary    | DATA                            | LENGTH                         |                 |
+| String    | DATA                            | LENGTH                         | Direct encoding |
+| Decimal   | DATA                            | SECONDARY                      |                 |
+| Timestamp | DATA                            | SECONDARY                      |                 |

Review Comment:
   This table is duplicated to what `Column Encodings` has described. And here it doesn't include PRESENT stream. I'd prefer to remove this.



##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.

Review Comment:
   We should also mention that the order is the same as what `Column Encodings` section has described for each column type.



##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.

Review Comment:
   ```suggestion
   Note that for columns with multiple streams, the order of stream
   positions in the RowIndex is **fixed**, which may be different to
   the actual data stream placement.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org