You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/25 07:37:40 UTC

[GitHub] [orc] dirtysalt opened a new pull request #670: Update ORCv1.md

dirtysalt opened a new pull request #670:
URL: https://github.com/apache/orc/pull/670


   1. There is no need to use `>>>`. We have already flip `combinedHash` if it's negative and `position`  sign bit will be always zero.
   2. To set bit we have to mask `position` with 0x3f to keep the least 6 bits.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #670: Update ORCv1.md

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601168939



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Thank you for making a PR, @dirtysalt . Could you add the URL to the corresponding code part into your PR description, please?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601941263



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Yes. I've updated. It's much more consistent. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r602062133



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position % 64));

Review comment:
       Oh, sorry, @dirtysalt . I missed that the original was written in Java style.
   - https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/util/BloomFilter.java#L275-L277
   ```java
   public void set(int index) {
     data[index >>> 6] |= (1L << index);
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601674413



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Thank you for update. Can we use `% 64` consistently instead of `& 0x3f`? At line 1297, we already use `%`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601674413



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Thank you for update. Can we use `% 64` consistently like the code instead of `& 0x3f`? At line 1297, we already use `%`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-807993594


   Thank you, @dirtysalt . Welcome to the Apache ORC community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt edited a comment on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dirtysalt edited a comment on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808692529


   @dongjoon-hyun  Oh. I never know that Java bit shift `<<` is sort of like a circular bit shift.  That's to say `1L << 129` equals `1L << (129 % 64)`. So that explains why `1L << position` works. 
   
   If I do that in C++, the compiler will give a warning, and the result is undefined.
   ```
   /// c++ code
   int main() {
       unsigned long long x = 0;
       x |= (1ULL << 129);
       printf("%llu\n", x);
       return 0;
   }
   
   /// shell execution
   mac :: ~/Public » g++ test.cpp
   test.cpp:14:16: warning: shift count >= width of type [-Wshift-count-overflow]
       x |= (1ULL << 129);
                  ^  ~~~
   1 warning generated.
   mac :: ~/Public » ./a.out
   4526632784
   mac :: ~/Public » ./a.out
   4345032528
   ```
   
   But I do suggest modifying documentation as `1L << (position % 64)` which is less confused to C++ guys. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808820400


   @dongjoon-hyun Understood.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601176072



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Thanks. Sure, I have updated the PR description and title.  Do you think it's clear enough? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808003500


   @dirtysalt . I reverted this because the original claim of this PR is wrong, `The code is correct, but the documentation is not.`. The document is consistent with the Java code. It's my bad to miss that. Sorry about the back and forth.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808692529


   Oh. I never know that Java bit shift `<<` is sort of like a circular bit shift.  That's to say `1L << 129` equals `1L << (129 % 64)`. So that explains why `1L << position` works. 
   
   If I do that in C++, the compiler will give a warning, and the result is undefined.
   ```
   /// c++ code
   int main() {
       unsigned long long x = 0;
       x |= (1ULL << 129);
       printf("%llu\n", x);
       return 0;
   }
   
   /// shell execution
   mac :: ~/Public » g++ test.cpp
   test.cpp:14:16: warning: shift count >= width of type [-Wshift-count-overflow]
       x |= (1ULL << 129);
                  ^  ~~~
   1 warning generated.
   mac :: ~/Public » ./a.out
   4526632784
   mac :: ~/Public » ./a.out
   4345032528
   ```
   
   But I do suggest modifying documentation as `1L << (position % 64)` which is less confused to C++ guys. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #670: Update ORCv1.md

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #670:
URL: https://github.com/apache/orc/pull/670#discussion_r601173060



##########
File path: site/specification/ORCv1.md
##########
@@ -1297,7 +1297,7 @@ in a bloom filter is as follows:
   * position = combinedHash % m
 6. Set the position in bit set. The LSB 6 bits identifies the long index
    within bitset and bit position within the long uses little endian order.
-  * bitset[position >>> 6] \|= (1L << position);
+  * bitset[position >> 6] \|= (1L << (position & 0x3f));

Review comment:
       Thanks.  Here is the corresponding code https://github.com/apache/orc/blob/master/c%2B%2B/src/BloomFilter.cc#L47. 
   
   The code is correct, but the documentation is not. The code is to use `% 64` which has the same effect as `& 0x3f`. 
   
   ```
     constexpr uint64_t BITS_OF_LONG = 64;
     constexpr uint8_t  SHIFT_6_BITS = 6;
     void BitSet::set(uint64_t index) {
       mData[index >> SHIFT_6_BITS] |= (1ULL << (index % BITS_OF_LONG));
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-807999069


   Let me revert this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun merged pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun merged pull request #670:
URL: https://github.com/apache/orc/pull/670


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808747471


   @dirtysalt . I understand your reasoning. The document has been there since Apache ORC 1.0 for 6 years. We had better keep it AS-IS to avoid any unnecessary further confusions because we didn't change the logic.
   - https://github.com/apache/orc/blame/branch-1.0/site/_docs/spec-index.md#L105


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #670: Update ORCv1.md: Fix documentation about bitset in bloom filter section

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #670:
URL: https://github.com/apache/orc/pull/670#issuecomment-808747471


   @dirtysalt . I understand your reasoning. However, the document has been there since Apache ORC 1.0 for 6 years. We had better keep it AS-IS to avoid any unnecessary further confusions because we didn't change the logic.
   - https://github.com/apache/orc/blame/branch-1.0/site/_docs/spec-index.md#L105


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org