You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/10/14 07:16:29 UTC
[GitHub] [druid] imply-cheddar commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

imply-cheddar commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r995391240


##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to swtich the merging algorithm from doing binary-search per element in the value
+  // set to doing a sorted merge algorithm between value set and dictionary. The ratio here represents the ratio b/w
+  // the number of elements in value set and the number of elements in the dictionary. The number has been derived
+  // using benchmark in https://github.com/apache/druid/pull/13133. If the ratio is higher than the threshold, we use
+  // sorted merge instead of binary-search based algorithm.
+  private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
+  private static final int SIZE_WORTH_CHECKING_MIN = 8;
+  private static final Comparator<ByteBuffer> COMPARATOR = ByteBufferUtils.unsignedComparator();
+
+  private final BitmapFactory bitmapFactory;
+  private final TDictionary dictionary;
+  private final Indexed<ImmutableBitmap> bitmaps;
+
+  public IndexedUtf8ValueSetIndex(
+      BitmapFactory bitmapFactory,
+      TDictionary dictionary,
+      Indexed<ImmutableBitmap> bitmaps
+  )
+  {
+    this.bitmapFactory = bitmapFactory;
+    this.dictionary = dictionary;
+    this.bitmaps = bitmaps;
+  }
+
+  @Override
+  public BitmapColumnIndex forValue(@Nullable String value)
+  {
+    return new SimpleBitmapColumnIndex()
+    {
+      @Override
+      public double estimateSelectivity(int totalRows)
+      {
+        return Math.min(1, (double) getBitmapForValue().size() / totalRows);
+      }
+
+      @Override
+      public <T> T computeBitmapResult(BitmapResultFactory<T> bitmapResultFactory)
+      {
+
+        return bitmapResultFactory.wrapDimensionValue(getBitmapForValue());
+      }
+
+      private ImmutableBitmap getBitmapForValue()
+      {
+        final ByteBuffer valueUtf8 = value == null ? null : ByteBuffer.wrap(StringUtils.toUtf8(value));

Review Comment:
   Do we ever call `.forValue(String)` and then throw away the result?  If not, it should be safe to move this to line 69 and then close over it instead.  That way we don't convert to utf8 bytes multiple times (right now, it will happen on every call to `estimateSelectivity` and `computeBitmapResult`)



##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to swtich the merging algorithm from doing binary-search per element in the value
+  // set to doing a sorted merge algorithm between value set and dictionary. The ratio here represents the ratio b/w
+  // the number of elements in value set and the number of elements in the dictionary. The number has been derived
+  // using benchmark in https://github.com/apache/druid/pull/13133. If the ratio is higher than the threshold, we use
+  // sorted merge instead of binary-search based algorithm.
+  private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
+  private static final int SIZE_WORTH_CHECKING_MIN = 8;
+  private static final Comparator<ByteBuffer> COMPARATOR = ByteBufferUtils.unsignedComparator();
+
+  private final BitmapFactory bitmapFactory;
+  private final TDictionary dictionary;
+  private final Indexed<ImmutableBitmap> bitmaps;
+
+  public IndexedUtf8ValueSetIndex(
+      BitmapFactory bitmapFactory,
+      TDictionary dictionary,
+      Indexed<ImmutableBitmap> bitmaps
+  )
+  {
+    this.bitmapFactory = bitmapFactory;
+    this.dictionary = dictionary;
+    this.bitmaps = bitmaps;
+  }
+
+  @Override
+  public BitmapColumnIndex forValue(@Nullable String value)
+  {
+    return new SimpleBitmapColumnIndex()
+    {
+      @Override
+      public double estimateSelectivity(int totalRows)
+      {
+        return Math.min(1, (double) getBitmapForValue().size() / totalRows);
+      }
+
+      @Override
+      public <T> T computeBitmapResult(BitmapResultFactory<T> bitmapResultFactory)
+      {
+
+        return bitmapResultFactory.wrapDimensionValue(getBitmapForValue());
+      }
+
+      private ImmutableBitmap getBitmapForValue()
+      {
+        final ByteBuffer valueUtf8 = value == null ? null : ByteBuffer.wrap(StringUtils.toUtf8(value));
+        final int idx = dictionary.indexOf(valueUtf8);
+        return getBitmap(idx);
+      }
+    };
+  }
+
+  @Override
+  public BitmapColumnIndex forSortedValues(SortedSet<String> values)
+  {
+    return getBitmapColumnIndexForSortedIterableUtf8(
+        Iterables.transform(
+            values,
+            input -> input != null ? ByteBuffer.wrap(StringUtils.toUtf8(input)) : null
+        ),
+        values.size()
+    );
+  }
+
+  @Override
+  public BitmapColumnIndex forSortedValuesUtf8(SortedSet<ByteBuffer> valuesUtf8)
+  {
+    final SortedSet<ByteBuffer> tailSet;
+
+    if (valuesUtf8.size() >= SIZE_WORTH_CHECKING_MIN) {
+      final ByteBuffer minValueInColumn = dictionary.get(0);
+      tailSet = valuesUtf8.tailSet(minValueInColumn);
+    } else {
+      tailSet = valuesUtf8;
+    }
+
+    return getBitmapColumnIndexForSortedIterableUtf8(tailSet, tailSet.size());
+  }
+
+  private ImmutableBitmap getBitmap(int idx)
+  {
+    if (idx < 0) {
+      return bitmapFactory.makeEmptyImmutableBitmap();
+    }
+
+    final ImmutableBitmap bitmap = bitmaps.get(idx);
+    return bitmap == null ? bitmapFactory.makeEmptyImmutableBitmap() : bitmap;
+  }
+
+  /**
+   * Helper used by {@link #forSortedValues} and {@link #forSortedValuesUtf8}.
+   */
+  private BitmapColumnIndex getBitmapColumnIndexForSortedIterableUtf8(Iterable<ByteBuffer> valuesUtf8, int size)
+  {
+    // for large number of in-filter values in comparison to the dictionary size, use the sorted merge algorithm.
+    if (size > SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
+      return new SimpleImmutableBitmapIterableIndex()
+      {
+        @Override
+        public Iterable<ImmutableBitmap> getBitmapIterable()
+        {
+          return () -> new Iterator<ImmutableBitmap>()
+          {
+            final PeekingIterator<ByteBuffer> valuesIterator = Iterators.peekingIterator(valuesUtf8.iterator());
+            final PeekingIterator<ByteBuffer> dictionaryIterator = Iterators.peekingIterator(dictionary.iterator());
+            int next = -1;
+            int idx = 0;
+
+            @Override
+            public boolean hasNext()
+            {
+              if (next < 0) {
+                findNext();
+              }
+              return next >= 0;
+            }
+
+            @Override
+            public ImmutableBitmap next()
+            {
+              if (next < 0) {
+                findNext();
+                if (next < 0) {
+                  throw new NoSuchElementException();
+                }
+              }
+              final int swap = next;
+              next = -1;
+              return getBitmap(swap);
+            }
+
+            private void findNext()
+            {
+              while (next < 0 && valuesIterator.hasNext() && dictionaryIterator.hasNext()) {
+                final ByteBuffer nextValue = valuesIterator.peek();
+                final ByteBuffer nextDictionaryKey = dictionaryIterator.peek();
+                final int comparison = COMPARATOR.compare(nextValue, nextDictionaryKey);
+                if (comparison == 0) {
+                  next = idx;
+                  valuesIterator.next();
+                  break;
+                } else if (comparison < 0) {
+                  valuesIterator.next();
+                } else {
+                  dictionaryIterator.next();
+                  idx++;
+                }
+              }
+            }
+          };
+        }
+      };
+    }
+
+    // if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search
+    // based lookup per value. The algorithm works well for smaller number of values.
+    return new SimpleImmutableBitmapIterableIndex()
+    {
+      @Override
+      public Iterable<ImmutableBitmap> getBitmapIterable()
+      {
+        return () -> new Iterator<ImmutableBitmap>()
+        {
+          final int dictionarySize = dictionary.size();
+          final Iterator<ByteBuffer> iterator = valuesUtf8.iterator();
+          int next = -1;
+
+          @Override
+          public boolean hasNext()
+          {
+            if (next < 0) {
+              findNext();
+            }
+            return next >= 0;
+          }
+
+          @Override
+          public ImmutableBitmap next()
+          {
+            if (next < 0) {
+              findNext();
+              if (next < 0) {
+                throw new NoSuchElementException();
+              }
+            }
+            final int swap = next;
+            next = -1;
+            return getBitmap(swap);
+          }
+
+          private void findNext()
+          {
+            while (next < 0 && iterator.hasNext()) {
+              ByteBuffer nextValue = iterator.next();
+              next = dictionary.indexOf(nextValue);
+
+              if (next == -dictionarySize - 1) {
+                // nextValue is past the end of the dictionary.
+                // Note: we can rely on indexOf returning (-(insertion point) - 1), even though Indexed doesn't
+                // guarantee it, because "dictionary" comes from GenericIndexed singleThreaded().

Review Comment:
   I question if this comment is still accurate.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexedWriter.java:
##########
@@ -0,0 +1,296 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.primitives.Ints;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.io.Channels;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+import org.apache.druid.segment.writeout.WriteOutBytes;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.WritableByteChannel;
+
+
+/**
+ * {@link DictionaryWriter} for a {@link FrontCodedIndexed}, written to a {@link SegmentWriteOutMedium}. Values MUST
+ * be added to this dictionary writer in sorted order, which is enforced.
+ *
+ * Front coding is a type of delta encoding for byte arrays, where values are grouped into buckets. The first value of
+ * the bucket is written entirely, and remaining values are stored as pairs of an integer which indicates how much
+ * of the first byte array of the bucket to use as a prefix, followed by the remaining value bytes after the prefix.
+ *
+ * This is valid to use for any values which can be compared byte by byte with unsigned comparison. Otherwise, this
+ * is not the collection for you.
+ */
+public class FrontCodedIndexedWriter implements DictionaryWriter<byte[]>
+{
+  private static final int MAX_LOG_BUFFER_SIZE = 26;
+  private final SegmentWriteOutMedium segmentWriteOutMedium;
+  private final int bucketSize;
+  private final ByteOrder byteOrder;
+  private final byte[][] bucketBuffer;
+  @Nullable
+  private byte[] prevObject = null;
+  @Nullable
+  private WriteOutBytes headerOut = null;
+  @Nullable
+  private WriteOutBytes valuesOut = null;
+  private int numWritten = 0;
+  private ByteBuffer scratch;
+  private int logScratchSize = 10;
+  private boolean isClosed = false;
+  private boolean hasNulls = false;
+
+
+  public FrontCodedIndexedWriter(
+      SegmentWriteOutMedium segmentWriteOutMedium,
+      ByteOrder byteOrder,
+      int bucketSize
+  )
+  {
+    this.segmentWriteOutMedium = segmentWriteOutMedium;
+    this.scratch = ByteBuffer.allocate(1 << logScratchSize).order(byteOrder);
+    this.bucketSize = bucketSize;
+    this.byteOrder = byteOrder;
+    this.bucketBuffer = new byte[bucketSize][];
+  }
+
+  @Override
+  public void open() throws IOException
+  {
+    headerOut = segmentWriteOutMedium.makeWriteOutBytes();
+    valuesOut = segmentWriteOutMedium.makeWriteOutBytes();
+  }
+
+  @Override
+  public void write(@Nullable byte[] value) throws IOException
+  {
+    if (prevObject != null && unsignedCompare(prevObject, value) >= 0) {
+      throw new ISE(
+          "Values must be sorted and unique. Element [%s] with value [%s] is before or equivalent to [%s]",
+          numWritten,
+          value == null ? null : StringUtils.fromUtf8(value),
+          StringUtils.fromUtf8(prevObject)
+      );
+    }
+
+    if (value == null) {
+      hasNulls = true;
+      return;
+    }
+
+    // if the bucket buffer is full, write the bucket
+    if (numWritten > 0 && (numWritten % bucketSize) == 0) {

Review Comment:
   I don't know that it really matters, but it looks like you are buffering a bucket and then writing the whole bucket out.  I don't think that's strictly necessary, is it?  That is, I think we can do a completely streaming write?  Or is there some state that you need to have which you can only get once the bucket is full?  I think even the offset written into the header could be figured out in a streaming fashion?
   
   I don't think it really matters, so I'm not actually asking you to change this.  Just wanting to validate my understanding.



##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedColumnPartSerde.java:
##########
@@ -305,6 +311,33 @@ public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnCo
 
         builder.setType(ValueType.STRING);
 
+        final int dictionaryStartPosition = buffer.position();
+        final byte dictionaryVersion = buffer.get();
+
+        if (dictionaryVersion == EncodedStringDictionaryWriter.VERSION) {
+          final byte encodingId = buffer.get();
+          if (encodingId == StringEncodingStrategy.FRONT_CODED_ID) {
+            readFrontCodedColumn(buffer, builder, rVersion, rFlags, hasMultipleValues);
+          } else {
+            throw new ISE("impossible, unknown encoding strategy id: %s", encodingId);

Review Comment:
   We should likely also support an ID for the GenericIndexed based approach so that in a future release we can protect from GenericIndexed getting a new version number by persisting those columns as just another encoding here.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexedWriter.java:
##########
@@ -0,0 +1,296 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.primitives.Ints;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.io.Channels;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+import org.apache.druid.segment.writeout.WriteOutBytes;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.WritableByteChannel;
+
+
+/**
+ * {@link DictionaryWriter} for a {@link FrontCodedIndexed}, written to a {@link SegmentWriteOutMedium}. Values MUST
+ * be added to this dictionary writer in sorted order, which is enforced.
+ *
+ * Front coding is a type of delta encoding for byte arrays, where values are grouped into buckets. The first value of
+ * the bucket is written entirely, and remaining values are stored as pairs of an integer which indicates how much
+ * of the first byte array of the bucket to use as a prefix, followed by the remaining value bytes after the prefix.
+ *
+ * This is valid to use for any values which can be compared byte by byte with unsigned comparison. Otherwise, this
+ * is not the collection for you.
+ */
+public class FrontCodedIndexedWriter implements DictionaryWriter<byte[]>
+{
+  private static final int MAX_LOG_BUFFER_SIZE = 26;
+  private final SegmentWriteOutMedium segmentWriteOutMedium;
+  private final int bucketSize;
+  private final ByteOrder byteOrder;
+  private final byte[][] bucketBuffer;
+  @Nullable
+  private byte[] prevObject = null;
+  @Nullable
+  private WriteOutBytes headerOut = null;
+  @Nullable
+  private WriteOutBytes valuesOut = null;
+  private int numWritten = 0;
+  private ByteBuffer scratch;
+  private int logScratchSize = 10;
+  private boolean isClosed = false;
+  private boolean hasNulls = false;
+
+
+  public FrontCodedIndexedWriter(
+      SegmentWriteOutMedium segmentWriteOutMedium,
+      ByteOrder byteOrder,
+      int bucketSize
+  )
+  {
+    this.segmentWriteOutMedium = segmentWriteOutMedium;
+    this.scratch = ByteBuffer.allocate(1 << logScratchSize).order(byteOrder);
+    this.bucketSize = bucketSize;
+    this.byteOrder = byteOrder;
+    this.bucketBuffer = new byte[bucketSize][];
+  }
+
+  @Override
+  public void open() throws IOException
+  {
+    headerOut = segmentWriteOutMedium.makeWriteOutBytes();
+    valuesOut = segmentWriteOutMedium.makeWriteOutBytes();
+  }
+
+  @Override
+  public void write(@Nullable byte[] value) throws IOException
+  {
+    if (prevObject != null && unsignedCompare(prevObject, value) >= 0) {
+      throw new ISE(
+          "Values must be sorted and unique. Element [%s] with value [%s] is before or equivalent to [%s]",
+          numWritten,
+          value == null ? null : StringUtils.fromUtf8(value),
+          StringUtils.fromUtf8(prevObject)
+      );
+    }
+
+    if (value == null) {
+      hasNulls = true;
+      return;
+    }
+
+    // if the bucket buffer is full, write the bucket
+    if (numWritten > 0 && (numWritten % bucketSize) == 0) {
+      resetScratch();
+      int written;
+      // write the bucket, growing scratch buffer as necessary
+      do {
+        written = writeBucket(scratch, bucketBuffer, bucketSize);
+        if (written < 0) {
+          growScratch();
+        }
+      } while (written < 0);
+      scratch.flip();
+      Channels.writeFully(valuesOut, scratch);
+
+      resetScratch();
+      // write end offset for current value
+      scratch.putInt((int) valuesOut.size());
+      scratch.flip();
+      Channels.writeFully(headerOut, scratch);
+    }
+
+    bucketBuffer[numWritten % bucketSize] = value;
+
+    ++numWritten;
+    prevObject = value;
+  }
+
+
+  @Override
+  public long getSerializedSize() throws IOException
+  {
+    if (!isClosed) {
+      flush();
+    }
+    int headerAndValues = Ints.checkedCast(headerOut.size() + valuesOut.size());
+    return Byte.BYTES +
+           Byte.BYTES +
+           Byte.BYTES +
+           VByte.estimateIntSize(numWritten) +
+           VByte.estimateIntSize(headerAndValues) +
+           headerAndValues;

Review Comment:
   I get scared of estimates of size, they are wrong sometimes.  You could alternatively build the header bytes (the 3 bytes and 2 VBytes) and then look at how many bytes it used to get at the size of the header, yes?



##########
processing/src/main/java/org/apache/druid/segment/data/EncodedStringDictionaryWriter.java:
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.column.StringEncodingStrategy;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.WritableByteChannel;
+
+public class EncodedStringDictionaryWriter implements DictionaryWriter<String>
+{
+  public static final byte VERSION = Byte.MAX_VALUE; // hopefully GenericIndexed never makes a version this high...

Review Comment:
   Why does it matter?  Even if GenericIndexed does, the thing that matters is that the version of the *column* itself never collides.  In order to ensure that, we should really start persisting all of the columns with the newly incremented version (even if they are still set to persist with GenericIndexed), but the downside to that is that it doesn't allow for rollback.  So, instead the new version can hopefully still persist a GenericIndexed version, we continue to make things with the old version in case we are in a state where we might want rollback and then in some future version, we should just never persist a String column without adding some String-column-specific version id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org