You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "vuule (via GitHub)" <gi...@apache.org> on 2023/04/21 21:17:19 UTC

[GitHub] [orc] vuule opened a new issue, #1475: [C++] Order of row index streams does not match the order of streams in the file footer

vuule opened a new issue, #1475:
URL: https://github.com/apache/orc/issues/1475

   When writing a file with a string column and multiple row groups, the resulting file has incorrect row index streams. 
   The string column is encoded using direct encoding. The file footer contains the LENGTH (kind 2) stream before DATA (kind 1) stream. However, the row index seems to contain the index data for the DATA stream before the LENGTH stream. Switching out the order in which we read the row index streams fixes the issue and everything can be used correctly.
   
   Isolation:
   Only observing this behavior with string columns. Other types with multiple streams look correct in this regard.
   Behavior looks unrelated to string content in the column.
   No info on dictionary encoded string columns - writer seemlingly defaults to direct encoding.
   
   See attached repro file. The file contains a single string column, with `["*"] * 10001`
   [10001_strings.zip](https://github.com/apache/orc/files/11299279/10001_strings.zip)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1519092569

   Thanks for reporting the issue! @vuule
   
   The order of data streams are **NOT FIXED** meaning that:
   - In a direct-encoded string columns, `DATA stream` can be placed **BEFORE** or **AFTER** `LENGTH stream`. Same flexibility for `PRESENT stream`.
   - Even data streams of different columns can be interleaved.
   
   However, the order of positions in a index stream is **FIXED**. So for a direct-encoded string column, its `INDEX stream` always put positions in this order: `PRESENT stream` (if exists), `DATA stream` and `LENGTH stream`.
   
   I checked the [specs](https://orc.apache.org/specification/ORCv1/) and it does not state this clearly. It would be a good time to document this as well. @deshanxiao 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] vuule commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "vuule (via GitHub)" <gi...@apache.org>.
vuule commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1520560784

   Thank you for the clarification @wgtmac !
   What about other cases when a column has multiple streams? Is the order of index streams always the same as in the tables at https://orc.apache.org/specification/ORCv1/?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1521870720

   Yes, that would help a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1521038012

   Yes, the order is fixed. This is implemented in the `recordPosition` call as below.
   
   
   In the `TreeWriterBase.java`, positions of present stream are recorded first.
   
   https://github.com/apache/orc/blob/792c3f820d0b7a64b27c9dc4c390443386e6baf0/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L369-L377
   
   
   And then in the `StringBaseTreeWriter.java`, positions of data stream and length stream are recorded in order.
   
   https://github.com/apache/orc/blob/9dbf833868591314014958cc58cd57fb1e8e739c/java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java#L265-L270
   
   I followed the same order when I was implementing the C++ writer so they should be consistent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1519299003

   Thank you @wgtmac , I will add these to #1465.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1519299585

   > Thank you @wgtmac , I will add these to #1465.
   
   Thanks @deshanxiao ! Could you also verify that the java implementation matches this behavior?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1521825385

   > Yes, the order is fixed. This is implemented in the `recordPosition` call as below.
   > 
   > In the `TreeWriterBase.java`, positions of present stream are recorded first.
   > 
   > https://github.com/apache/orc/blob/792c3f820d0b7a64b27c9dc4c390443386e6baf0/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L369-L377
   > 
   > And then in the `StringBaseTreeWriter.java`, positions of data stream and length stream are recorded in order.
   > 
   > https://github.com/apache/orc/blob/9dbf833868591314014958cc58cd57fb1e8e739c/java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java#L265-L270
   > 
   > I followed the same order when I was implementing the C++ writer so they should be consistent.
   
   Thank you for sharing the Java code. I double check it and you are right @wgtmac .
   
   > In a direct-encoded string columns, DATA stream can be placed BEFORE or AFTER LENGTH stream. Same flexibility for PRESENT stream.
   
   In fact, different languages currently have different order implementations. The order of java depends on the method of compareTo to flush the stream to disk.
   
   > Even data streams of different columns can be interleaved.
   Do you mean that the streams will cross for different columns like:
   col1 streamtype1
   col2 streamtype1
   col1 streamtype2
   
    I notice that the streams in the same column will appear together, but the order of the streams in different column is uncertain even they are the same data type.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] deshanxiao commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1521828129

   BTW, Is it necessary for us to add a type list in IndexEntry to describe the type of the position? @wgtmac @dongjoon-hyun @guiyanakuang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] vuule commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "vuule (via GitHub)" <gi...@apache.org>.
vuule commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1518344716

   Logs that point to incorrect order: https://github.com/rapidsai/cudf/issues/11890#issuecomment-1487796573


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1475: [C++] Order of row index streams does not match the order of streams in the file footer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1475:
URL: https://github.com/apache/orc/issues/1475#issuecomment-1521038571

   > Thank you for the clarification @wgtmac ! What about other cases when a column has multiple streams? Is the order of index streams always the same as in the tables at https://orc.apache.org/specification/ORCv1/?
   
   IIRC, the order is same as the table of the spec doc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org