Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/01/13 15:44:01 UTC

[GitHub] [accumulo] trietopsoft opened a new issue #2416: Hadoop compression codec configuration is ignored

trietopsoft opened a new issue #2416:
URL: https://github.com/apache/accumulo/issues/2416


   **Describe the bug**
   Hadoop and Accumulo provide a mechanism to override the codec implementation via a configuration property. The logic within [org.apache.accumulo.core.file.rfile.bcfile.Compression#createNewCodec](https://github.com/apache/accumulo/blob/45a4a93/core/src/main/java/org/apache/accumulo/core/file/rfile/bcfile/Compression.java#L762) currently discards the Hadoop configuration value in its first ternary expression (Java has no Elvis operator, so the check is written as `cond ? a : b`).
   
   For example, if _io.compression.codec.lz4.class_ is defined as a property in Hadoop's `core-site.xml`, the function sets _extClazz_ to **null** in _createNewCodec_ and does not check the system property, so the default codec class is used. The original developer's intent was probably to return the Hadoop configuration value if it was not null.
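   
   The intended lookup can be sketched as follows. This is a minimal, hypothetical illustration rather than Accumulo's actual code: the class `CodecClassResolver` and the method `resolveCodecClass` are invented names, a plain `Map` stands in for Hadoop's `Configuration`, and the precedence (Hadoop configuration, then system property, then default) is an assumption based on the description above.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical sketch; not the code in Compression#createNewCodec.
   final class CodecClassResolver {

     // Returns the codec class name to load for the given property key.
     // Assumed precedence: Hadoop configuration value, then JVM system
     // property, then the built-in default.
     static String resolveCodecClass(Map<String,String> hadoopConf, String key,
         String defaultClass) {
       String fromConf = hadoopConf.get(key);
       if (fromConf != null) {
         return fromConf; // the value the reported bug throws away
       }
       String fromSystem = System.getProperty(key);
       if (fromSystem != null) {
         return fromSystem; // the path the workaround relies on
       }
       return defaultClass;
     }

     public static void main(String[] args) {
       String key = "io.compression.codec.lz4.class";
       String defaultClass = "org.apache.hadoop.io.compress.Lz4Codec";

       // Stand-in for a core-site.xml override.
       Map<String,String> conf = new HashMap<>();
       conf.put(key, "com.intel.compression.hadoop.IntelCompressionCodec");

       System.out.println("Trying to load codec class "
           + resolveCodecClass(conf, key, defaultClass) + " for lz4");
     }
   }
   ```
   
   With this ordering, an override in `core-site.xml` takes effect on its own, and the `-D` system property still works when no Hadoop override is present.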
   
   **Workaround**
   Define a system property for the corresponding codec, set to the desired codec class, in the `ACCUMULO_GENERAL_OPTS` environment variable in `${ACCUMULO_HOME}/conf/accumulo-env.sh`.  In this case, there must be **no override** specified within the Hadoop configuration files (`core-site.xml` or elsewhere).
   
   Example Workaround:
   ```sh
   test -z "$ACCUMULO_GENERAL_OPTS" && export ACCUMULO_GENERAL_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -Djava.net.preferIPv4Stack=true -XX:+CMSClassUnloadingEnabled -Dio.compression.codec.lz4.class=com.intel.compression.hadoop.IntelCompressionCodec"
   ```
   
   **Versions (OS, Maven, Java, and others, as appropriate):**
    - Affected version(s) of this project: 1.10.0 1.10.1 2.x
    - OS: Any
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Add an override for a codec (e.g. `io.compression.codec.lz4.class`) in Hadoop's `core-site.xml`
   2. Expect a log message `Trying to load codec class *new codec class* for lz4`
   3. Actual log message `Trying to load codec class *default codec class* for lz4`
   
   **Workaround Reproduction**
   Steps to reproduce the behavior:
   1. Ensure no codec overrides exist in Hadoop's `core-site.xml`
   2. Define a system property in `ACCUMULO_GENERAL_OPTS` with the desired codec
   3. Expect a log message `Trying to load codec class *new codec class* for lz4`
   4. Actual log message `Trying to load codec class *new codec class* for lz4`
   
   **Expected behavior**
   Expect any codec configuration overrides supported by Accumulo and defined within Hadoop's `core-site.xml` to be used as the codec library for compressing/decompressing RFiles.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2416: Hadoop compression codec configuration is ignored

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2416:
URL: https://github.com/apache/accumulo/issues/2416#issuecomment-1013442755


   I seem to remember discussing this before. I've definitely noticed the issue before. However, I couldn't find a reference to an existing conversation about it. It might have just happened in my head :smiley_cat: 
   A refactor of this code to make the behavior a bit more clear and intuitive would be nice.





[GitHub] [accumulo] trietopsoft commented on issue #2416: Hadoop compression codec configuration is ignored

Posted by GitBox <gi...@apache.org>.
trietopsoft commented on issue #2416:
URL: https://github.com/apache/accumulo/issues/2416#issuecomment-1012554027


   A form of this feature has been around since the Accumulo [1.3.5](https://github.com/apache/accumulo/blob/1.3.5/src/core/src/main/java/org/apache/accumulo/core/file/rfile/bcfile/Compression.java#L221) release.  It has been present in Hadoop since [2.2.x](https://github.com/apache/hadoop/blob/e103c83765898f756f88c27b2243c8dd3098a989/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/Compression.java#L88) - this link has the code that implements this capability for the Hadoop TFile (it is virtually identical to the intended behavior in Accumulo).  Accumulo's version has had a bug since the initial commit.





[GitHub] [accumulo] milleruntime commented on issue #2416: Hadoop compression codec configuration is ignored

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #2416:
URL: https://github.com/apache/accumulo/issues/2416#issuecomment-1012519727


   > Hi @milleruntime - that property is specifying the codec **type**, not the actual **implementation**. In most cases, the distinction is irrelevant. However, alternate performance-based implementations are available and can be defined with the `io.compression.codec.${type}.class` Hadoop property. The code as written ([45a4a93](https://github.com/apache/accumulo/commit/45a4a93d2926760d6a916ae2da1d7e668ec4dcb1)) does not make use of the Hadoop configuration property.
   
   Thanks for the explanation. I think I see what you are trying to do, but I can't seem to find any Hadoop documentation about it. I haven't gone digging through the source code yet, though. Do you know of any relevant Hadoop documentation? Also, do you know if the `io.compression.codec` package is tied to a specific version of Hadoop?





[GitHub] [accumulo] trietopsoft commented on issue #2416: Hadoop compression codec configuration is ignored

Posted by GitBox <gi...@apache.org>.
trietopsoft commented on issue #2416:
URL: https://github.com/apache/accumulo/issues/2416#issuecomment-1012436458


   Hi @milleruntime - that property is specifying the codec **type**, not the actual **implementation**.  In most cases, the distinction is irrelevant.  However, alternate performance-based implementations are available and can be defined with the `io.compression.codec.${type}.class` Hadoop property.  The code as written (45a4a93) does not make use of the Hadoop configuration property.





[GitHub] [accumulo] milleruntime commented on issue #2416: Hadoop compression codec configuration is ignored

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #2416:
URL: https://github.com/apache/accumulo/issues/2416#issuecomment-1012421542


   Have you tried setting the Accumulo property `table.file.compress.type`? That is typically how you set the compression codec for files in Accumulo.

