You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/08 00:17:23 UTC

[GitHub] [druid] clintropolis opened a new pull request, #13051: improve nested column serializer

clintropolis opened a new pull request, #13051:
URL: https://github.com/apache/druid/pull/13051

   changes:
   * long and double value columns are now written directly, at the same time as writing out the 'intermediary' dictionaryid column with unsorted ids
   * remove reverse value lookup from GlobalDictionaryIdLookup since it is no longer needed
   
   ### Description
   This PR improves nested column serializer heap usage and write performance by modifying the serializer to write the long and double value columns for the nested fields on the "first" pass though the data, which is building the local dictionaries for the nested fields while writing out the 'raw' JSON-smile encoded column. This is a change from the current version, which writes out the value columns on the "second" pass, where it sorts the local dictionary for each field and writes out the new dictionaryid int column, builds bitmap indexes, and writes the column itself.
   
   Making this change eliminate the need of the 'reverse' value lookup for long and double values in `GlobalDictionaryIdLookup`, and so they have been removed. Besides eliminating this from memory, it also saves a value lookup per row per column to write the long and double value columns.
   
   I haven't measured this at all yet, but naively I assume it should be a fair bit more chill in practice to do things this way.
   
   <hr>
   
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
clintropolis commented on code in PR #13051:
URL: https://github.com/apache/druid/pull/13051#discussion_r965496164


##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSerializer.java:
##########
@@ -551,12 +552,12 @@ int lookupGlobalId(Long value)
     }
 
     @Override
-    void openColumnSerializer(String field, SegmentWriteOutMedium medium, int maxId) throws IOException
+    public void open(String field) throws IOException

Review Comment:
   ah the diff is making this look funny, i changed long and double columns to override `open` instead of `openColumnSerializer` (open happens before the first pass and opens the intermediary dictionaryid int column that has the unsorted local ids) to also open the long and double column serializers at this phase.
   
   `openColumnSerializer` happens in `writeTo` when writing out the compressed dictionary id int column with the sorted localIds.
   
   I think `openColumnSerializer` could just be moved into `writeTo` since nothing is overriding it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
clintropolis merged PR #13051:
URL: https://github.com/apache/druid/pull/13051


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on a diff in pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on code in PR #13051:
URL: https://github.com/apache/druid/pull/13051#discussion_r965460448


##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSerializer.java:
##########
@@ -398,17 +405,12 @@ void openColumnSerializer(String field, SegmentWriteOutMedium medium, int maxId)
       encodedValueSerializer.open();
     }
 
-    void serializeRow(int globalId, int localId) throws IOException
-    {
-      encodedValueSerializer.addValue(localId);
-    }
-
     long getSerializedColumnSize() throws IOException
     {
       return Integer.BYTES + Integer.BYTES + encodedValueSerializer.getSerializedSize();
     }
 
-    public void open() throws IOException
+    public void open(String field) throws IOException

Review Comment:
   This parameter appears to not be getting used?



##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSerializer.java:
##########
@@ -551,12 +552,12 @@ int lookupGlobalId(Long value)
     }
 
     @Override
-    void openColumnSerializer(String field, SegmentWriteOutMedium medium, int maxId) throws IOException
+    public void open(String field) throws IOException

Review Comment:
   Just from looking at the diff, I'm a bit lost as to why all of these parameters were able to be eliminated...  I think it's because I've lost track of which class I'm looking at.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
clintropolis commented on PR #13051:
URL: https://github.com/apache/druid/pull/13051#issuecomment-1240216986

   >Anyway, I'm pretty sure that this is good. I'd really like us to refactor out all of the private static classes though, would've made reviewing this a lot simpler if that had been done first.
   
   Oh, i guess part of the problem is that they _aren't_ static classes, they refer to stuff of `NestedDataColumnSerializer`, though they could be reworked to have these references passed in


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
clintropolis commented on PR #13051:
URL: https://github.com/apache/druid/pull/13051#issuecomment-1240214998

   > Anyway, I'm pretty sure that this is good. I'd really like us to refactor out all of the private static classes though, would've made reviewing this a lot simpler if that had been done first.
   
   yeah, fair, i'll split this file up in a follow-up PR since it won't really help the diff situation here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on PR #13051:
URL: https://github.com/apache/druid/pull/13051#issuecomment-1240161017

   Meta commentary: I've been trying to review this, the changes are quite surgical to a set of classes that are all private static classes laid out in one long file.  This is making it really hard to review as the diff lines lose the context of which class I'm looking at and I have to click on the arrows and stuff to try to keep track of which class I'm in.  Can we move all of the different writers into their own files?  E.g.
   
   ```
   private final class StringFieldColumnWriter extends GlobalDictionaryEncodedFieldColumnWriter<String>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13051: improve nested column serializer

Posted by GitBox <gi...@apache.org>.
clintropolis commented on code in PR #13051:
URL: https://github.com/apache/druid/pull/13051#discussion_r965495192


##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSerializer.java:
##########
@@ -398,17 +405,12 @@ void openColumnSerializer(String field, SegmentWriteOutMedium medium, int maxId)
       encodedValueSerializer.open();
     }
 
-    void serializeRow(int globalId, int localId) throws IOException
-    {
-      encodedValueSerializer.addValue(localId);
-    }
-
     long getSerializedColumnSize() throws IOException
     {
       return Integer.BYTES + Integer.BYTES + encodedValueSerializer.getSerializedSize();
     }
 
-    public void open() throws IOException
+    public void open(String field) throws IOException

Review Comment:
   this is getting used by the overrides for long and doubles, https://github.com/apache/druid/pull/13051/files#diff-79a5d0543cecca970d2986e5437ed90ab0964cefc89297a23698de2693e9f9b7R555



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org