You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/08/03 13:17:57 UTC

[GitHub] [orc] guiyanakuang opened a new pull request #810: Optimize userMetadata data structure to avoid traversal

guiyanakuang opened a new pull request #810:
URL: https://github.com/apache/orc/pull/810


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   ReaderImpl.java
   ```java
     @Override
     public ByteBuffer getMetadataValue(String key) {
       for(OrcProto.UserMetadataItem item: userMetadata) {
         if (item.hasName() && item.getName().equals(key)) {
           return item.getValue().asReadOnlyByteBuffer();
         }
       }
       throw new IllegalArgumentException("Can't find user metadata " + key);
     }
   
     @Override
     public boolean hasMetadataValue(String key) {
       for(OrcProto.UserMetadataItem item: userMetadata) {
         if (item.hasName() && item.getName().equals(key)) {
           return true;
         }
       }
       return false;
     }
   ```
   I think the data structure of userMetadata can be modified to map,  avoid traversal
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   There may be performance problems if there are more userMetadata keys
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Pass the CIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on a change in pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on a change in pull request #810:
URL: https://github.com/apache/orc/pull/810#discussion_r682230563



##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       > If you don't mind, can we indent with `4-space` instead of `8-space`?
   
   I will fix

##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       > The java stream apis are very slow. It would perform better if you used a loop. This is especially important, because now the set is build preemptively, even if the user doesn't need the metadata values.
   
   How about lazy initialization of the set when the getMetadataValue and hasMetadataValue methods need to be called




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893109802


   [ORC Specification v1](https://orc.apache.org/specification/ORCv1/)
   ![image](https://user-images.githubusercontent.com/4069905/128279409-76fa1e1c-beae-49b5-a440-aa4916c883f4.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892400814


   The metadataCache is only initialised when the metadata is needed for compatibility checking or merging. In fact the reader is quickly cleaned up and only survives in the merge context.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #810:
URL: https://github.com/apache/orc/pull/810#discussion_r682128050



##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       The java stream apis are very slow. It would perform better if you used a loop. This is especially important, because now the set is build preemptively, even if the user doesn't need the metadata values.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang closed pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang closed pull request #810:
URL: https://github.com/apache/orc/pull/810


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892806923






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893102487


   I agree with you, @guiyanakuang .
   
   Although you revised your PR, I still suspect that some reader might expect the duplicated keys together (not the first one or the last one). Let's revisit this when the performance issue become severe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892405840


   I think I misunderstood, the user is also allowed to actively call methods to get the metadata, so that there are two data structures. Perhaps an internal method could be added to speed up the file merge, separate from the user getting the metadata.
   Sorry, I've been reading and studying the orc source code for an upcoming project. Some of the suggestions may be superficial.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893102905


   Anyway, thank you for your contribution, @guiyanakuang ! 
   I'll promptly review your upcoming PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang edited a comment on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang edited a comment on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893109802


   [ORC Specification v1](https://orc.apache.org/specification/ORCv1/)
   I think that in future versions, it should be possible to not be compatible with duplicate keys
   ![image](https://user-images.githubusercontent.com/4069905/128279409-76fa1e1c-beae-49b5-a440-aa4916c883f4.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang closed pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang closed pull request #810:
URL: https://github.com/apache/orc/pull/810


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892806923


   How many users metadata do you have, @guiyanakuang ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893102487


   I agree with you, @guiyanakuang .
   
   Although you revised your PR, I still suspect that some readers might expect the duplicated keys together (not the first one or the last one). Let's revisit this when the performance issue become severe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on a change in pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on a change in pull request #810:
URL: https://github.com/apache/orc/pull/810#discussion_r682230563



##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       > If you don't mind, can we indent with `4-space` instead of `8-space`?
   I will fix




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893101538






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893102487


   I agree with you, @guiyanakuang .
   
   Although you revised your PR, I still suspect that some readers might expect the duplicated keys together (not the first one or the last one). Let's revisit this when the performance issue become severe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #810: ORC-904: Optimize userMetadata data structure to avoid traversal

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #810:
URL: https://github.com/apache/orc/pull/810#discussion_r682099791



##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       If you don't mind, can we indent with `4-space` instead of `8-space`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang edited a comment on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang edited a comment on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893109802


   [ORC Specification v1](https://orc.apache.org/specification/ORCv1/)
   I think that in future versions, it should be possible to not be compatible with duplicate keys
   ![image](https://user-images.githubusercontent.com/4069905/128279409-76fa1e1c-beae-49b5-a440-aa4916c883f4.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-893101538


   > How many users metadata do you have, @guiyanakuang ?
   
   Usually around 10.The largest have thousands of this order of magnitude. It may be abusing this feature, but so far there are no performance problems. This pr I just think from reading the source code that the time complexity can be optimised. I think I'll close this pr  and optimize it when I have performance issues in production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892378836


   > @guiyanakuang . Technically, this looks reasonable because Java `WriterImpl` ensures that there is no duplicated keys at least Java layer(writer and reader)
   > 
   > ```java
   > private final Map<String, ByteString> userMetadata = new TreeMap<>();
   > ```
   > 
   > However, it seems that `ORC Foot` spec itself has no assumption that the user metadata items have unique keys. I'm wondering if we already have some other places to put this unique key assumption. Otherwise, this PR might not be safe with another ORC writers.
   > 
   > ```proto
   > message Footer {
   >   optional uint64 headerLength = 1;
   >   optional uint64 contentLength = 2;
   >   repeated StripeInformation stripes = 3;
   >   repeated Type types = 4;
   >   repeated UserMetadataItem metadata = 5;
   > ```
   > 
   > cc @omalley , @pgaref , @wgtmac , @williamhyun
   
   @dongjoon-hyun Additional commits solve the compatibility problem
   
   ```java
       private void lazyInitCache() {
         if (metadataCache == null) {
           metadataCache = new TreeMap<>();
           for(OrcProto.UserMetadataItem item: innerUserMetadata) {
             metadataCache.putIfAbsent(item.getName(), item.getValue());
           }
         }
       }
   ```
   `metadataCache.putIfAbsent(item.getName(), item.getValue());` , the same key will only get the first value. Keeping the same logic as the loop traversal.
   `getMetadataKeys` is still the same as before and can get non-unique keys.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on a change in pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on a change in pull request #810:
URL: https://github.com/apache/orc/pull/810#discussion_r682234526



##########
File path: java/core/src/java/org/apache/orc/impl/ReaderImpl.java
##########
@@ -569,7 +561,13 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) throws IOException {
       this.rowIndexStride = tail.getFooter().getRowIndexStride();
       this.contentLength = tail.getFooter().getContentLength();
       this.numberOfRows = tail.getFooter().getNumberOfRows();
-      this.userMetadata = tail.getFooter().getMetadataList();
+      this.userMetadata = tail.getFooter().getMetadataList()
+              .stream().collect(Collectors.toMap(

Review comment:
       > Thank you for making a PR, @guiyanakuang . Do you think we can have a benchmark result for your claim?
   > 
   > > There may be performance problems if there are more userMetadata keys
   > 
   > cc @pgaref , @williamhyun
   
   Sorry for the late reply due to different time zones. I made a guess based on the time complexity of the algorithm and have not yet benchmarked it. getMetadataValue is used in compatibility checking and metadata merging, which would have an O(n^2) time complexity if used in a loop.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #810: ORC-904: Use Map for userMetadata in ReaderImpl

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #810:
URL: https://github.com/apache/orc/pull/810#issuecomment-892385379


   Thank you for updating, @guiyanakuang . Now, do we have both structure? It looks like a regression in memory perspective.
   
   ```
     public static class UserMetadata {
        private final List<OrcProto.UserMetadataItem> innerUserMetadata;
   
        private Map<String, ByteString> metadataCache;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org