Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/01 02:25:11 UTC

[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r480613932



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param allocator the allocator for allocating memory for compressed buffer.
+   * @param unCompressedBuffer the buffer to compress.
+   *                           Implementation of this method should take care of releasing this buffer.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer);

Review comment:
       @emkornfield Thank you for starting this discussion and sharing your good ideas. 
   Your reasoning makes sense to me. 
   
   I guess I was looking at the problem from a different perspective. 
   
    IMO, the bottleneck of a compression codec is CPU, and the main purpose of compression is to reduce memory/network bandwidth consumption.
   
    Given these assumptions, we should do the compression as early as possible. The earliest possible place would be the `getFieldBuffers` method. In this PR, we do it in `VectorUnloader`, which is not optimal, but close to it. Similarly, we should do the decompression as late as possible. In this PR, we do it in `VectorLoader`, which is also close to optimal.
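    
    For illustration, a minimal (hypothetical) helper following this idea might compress each buffer right after it is obtained from `getFieldBuffers`. The helper class below is not part of the PR; it only assumes the `compress` signature shown in the diff above:
    
        import java.util.ArrayList;
        import java.util.List;
        
        import org.apache.arrow.memory.ArrowBuf;
        import org.apache.arrow.memory.BufferAllocator;
        import org.apache.arrow.vector.FieldVector;
        import org.apache.arrow.vector.compression.CompressionCodec;
        
        // Hypothetical helper, not part of this PR: compresses the buffers of one
        // vector as soon as they are obtained from getFieldBuffers().
        public class FieldBufferCompressor {
        
          public static List<ArrowBuf> compressFieldBuffers(
              FieldVector vector, CompressionCodec codec, BufferAllocator allocator) {
            List<ArrowBuf> compressed = new ArrayList<>();
            for (ArrowBuf buf : vector.getFieldBuffers()) {
              // Per the javadoc above, compress() takes care of releasing buf.
              compressed.add(codec.compress(allocator, buf));
            }
            return compressed;
          }
        }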
   
    Admittedly, we introduce additional copies with the compression framework. However, both additional copies operate on the compressed data, whose size is reduced, so the overhead should be small.
   
    The above reasoning assumes that the compression codec can effectively reduce the data size, which is not always true in practice. So perhaps we should make the decision based on the specific compression codec and real benchmark data?
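    
    To make the discussion concrete, here is a minimal sketch of what an implementation of the `compress` method shown above could look like, based on `java.util.zip.Deflater`. This is illustrative only: the class name is hypothetical, it is not the codec proposed in this PR, and it assumes the interface declares only the `compress` method quoted above and that the input buffer's writer index marks the end of its data:
    
        import java.io.ByteArrayOutputStream;
        import java.util.zip.Deflater;
        
        import org.apache.arrow.memory.ArrowBuf;
        import org.apache.arrow.memory.BufferAllocator;
        import org.apache.arrow.vector.compression.CompressionCodec;
        
        // Hypothetical sketch, not part of this PR.
        public class DeflateCompressionCodec implements CompressionCodec {
        
          @Override
          public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) {
            // Copy the uncompressed bytes onto the heap
            // (assumes the writer index marks the end of the data).
            int length = (int) unCompressedBuffer.readableBytes();
            byte[] input = new byte[length];
            unCompressedBuffer.getBytes(0, input);
        
            // Compress with java.util.zip.Deflater.
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(length);
            byte[] scratch = new byte[4096];
            while (!deflater.finished()) {
              int n = deflater.deflate(scratch);
              out.write(scratch, 0, n);
            }
            deflater.end();
            byte[] compressedBytes = out.toByteArray();
        
            // Write the compressed bytes into a newly allocated buffer.
            ArrowBuf compressedBuffer = allocator.buffer(compressedBytes.length);
            compressedBuffer.setBytes(0, compressedBytes);
            compressedBuffer.writerIndex(compressedBytes.length);
        
            // Per the javadoc, the implementation releases the uncompressed input buffer.
            unCompressedBuffer.close();
            return compressedBuffer;
          }
        }
    
    Whether such a codec actually pays off would, as noted above, depend on benchmarks for the specific codec and data.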




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org