You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "deshanxiao (via GitHub)" <gi...@apache.org> on 2023/04/11 07:16:08 UTC
[GitHub] [orc] deshanxiao opened a new pull request, #1465: [Docs] Add stream order description in ORC spec.
deshanxiao opened a new pull request, #1465:
URL: https://github.com/apache/orc/pull/1465
### What changes were proposed in this pull request?
This PR is aimed to add more description about stram order in ORC spec.
### Why are the changes needed?
There are many users who are misled by the order of the document table, in fact the stream has no fixed order.
### How was this patch tested?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1542228369
Hi @dongjoon-hyun @wgtmac
The PR is ready for review. Please take a look at it. Thank you~
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190568745
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.
Review Comment:
```suggestion
There is a general order for index and data streams:
* Index streams are always placed together in the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the index and data streams:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can even be placed in any order.
To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth.
```
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.
Review Comment:
Also I would prefer to move them to below line 907.
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.
Review Comment:
These statements look a little bit vague to me, I have changed them to provide fixed order and flexible order. WDYT?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] dongjoon-hyun closed pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
URL: https://github.com/apache/orc/pull/1465
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] dongjoon-hyun commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1550313672
I also published to Apache ORC website.
- https://orc.apache.org/specification/ORCv0/#column-encoding-section
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190801956
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
+
+The following table shows the order of types that contain multiple streams:
+
+ Type | The First Postion Stream Type | The Second Postion Stream Type | Contents |
+:----------|:--------------------------------|:-------------------------------|:----------------|
+| Binary | DATA | LENGTH | |
+| String | DATA | LENGTH | Direct encoding |
+| Decimal | DATA | SECONDARY | |
+| Timestamp | DATA | SECONDARY | |
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1514380317
cc @dongjoon-hyun
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190767601
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
Review Comment:
It looks resonable and will add more description/remove the chart.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on pull request #1465: ORC-1409: [Docs][Draft] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1523257635
Mark the PR as draft because I may add more details about the stream order.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #1465:
URL: https://github.com/apache/orc/pull/1465#issuecomment-1550752621
Thank you @dongjoon-hyun !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1166191433
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,12 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+Note that the order of these streams is **not fixed**. In the example
+of the integer column mentioned above, the order of the PRESENT stream
+and the DATA stream cannot be determined in advance. Instead, we need
+to determine the type of stream based on the Stream Kind, rather than
Review Comment:
Should we mention the following:
- index and data streams cannot be interleaved.
- the order of streams from different columns is not fixed as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1166401368
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,12 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+Note that the order of these streams is **not fixed**. In the example
+of the integer column mentioned above, the order of the PRESENT stream
+and the DATA stream cannot be determined in advance. Instead, we need
+to determine the type of stream based on the Stream Kind, rather than
Review Comment:
Thank you @wgtmac . Updated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] deshanxiao commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190593058
##########
site/specification/ORCv1.md:
##########
@@ -886,6 +886,17 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+There are a few points to note about the order of streams:
+
+* For a specific column type, the order of streams is **not fixed**.
+* Index and data streams cannot be **interleaved**.
+* The order of streams from different columns is **not fixed** as well.
Review Comment:
Thank you @wgtmac . It has been updated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on a diff in pull request #1465: ORC-1409: [Docs] Add stream order description in ORC spec.
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1465:
URL: https://github.com/apache/orc/pull/1465#discussion_r1190601006
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
Review Comment:
Seems the doc has a line limit that we need to follow.
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
+
+The following table shows the order of types that contain multiple streams:
+
+ Type | The First Postion Stream Type | The Second Postion Stream Type | Contents |
+:----------|:--------------------------------|:-------------------------------|:----------------|
+| Binary | DATA | LENGTH | |
+| String | DATA | LENGTH | Direct encoding |
+| Decimal | DATA | SECONDARY | |
+| Timestamp | DATA | SECONDARY | |
Review Comment:
This table is duplicated to what `Column Encodings` has described. And here it doesn't include PRESENT stream. I'd prefer to remove this.
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
Review Comment:
We should also mention that the order is the same as what `Column Encodings` section has described for each column type.
##########
site/specification/ORCv2.md:
##########
@@ -1257,6 +1277,19 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of these
+streams is **fixed**, which is different from data stream we mentioned above.
Review Comment:
```suggestion
Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**, which may be different to
the actual data stream placement.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org