Posted to dev@gobblin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/11/26 14:29:00 UTC

[jira] [Work logged] (GOBBLIN-1749) Add dependency for handling xz-compressed Avro file

     [ https://issues.apache.org/jira/browse/GOBBLIN-1749?focusedWorklogId=829084&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-829084 ]

ASF GitHub Bot logged work on GOBBLIN-1749:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Nov/22 14:28
            Start Date: 26/Nov/22 14:28
    Worklog Time Spent: 10m 
      Work Description: sekikn opened a new pull request, #3609:
URL: https://github.com/apache/gobblin/pull/3609

   * Add dependency on xz for handling xz-compressed Avro files
   
   * Fix unit test to ensure all codecs are correctly supported
   
   * Update AvroHdfsDataWriter's documentation to cover all compression codecs
   
   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1749
   
   
   ### Description
   - [x] Here are some details about my PR, including screenshots (if applicable):
   
   After upgrading Avro to 1.9.2, reading and writing xz-compressed Avro files fail by default. This PR fixes it.
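   
   As a rough illustration of the failure mode, the sketch below probes the classpath for `org.tukaani.xz.XZInputStream`, the class named in the `NoClassDefFoundError` from the Jira issue. The class name `XzClasspathCheck` and the helper `isOnClasspath` are hypothetical names made up for this example; they are not part of Gobblin or Avro, and this is not the change made by the PR.
   
   ```java
   // Hypothetical helper, for illustration only; not part of Gobblin or Avro.
   public class XzClasspathCheck {
       // Class named in the stack trace reported in GOBBLIN-1749.
       static final String XZ_INPUT_STREAM = "org.tukaani.xz.XZInputStream";
   
       /** Returns true if the named class can be located on the current classpath. */
       static boolean isOnClasspath(String className) {
           try {
               // initialize=false: only locate the class, do not run its static initializers.
               Class.forName(className, false, XzClasspathCheck.class.getClassLoader());
               return true;
           } catch (ClassNotFoundException | LinkageError e) {
               return false;
           }
       }
   
       public static void main(String[] args) {
           // Prints whether the xz library is visible to this JVM.
           System.out.println("xz available: " + isOnClasspath(XZ_INPUT_STREAM));
       }
   }
   ```
   
   Run against the classpath of a Gobblin distribution, a check like this would presumably report the class as missing before this change and present after it, assuming the added dependency ends up in the distribution's lib directory.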
   
   
   ### Tests
   - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   I updated AvroHdfsDataWriterTest to ensure that all codecs are supported.
   
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 829084)
    Remaining Estimate: 0h
            Time Spent: 10m

> Add dependency for handling xz-compressed Avro file
> ---------------------------------------------------
>
>                 Key: GOBBLIN-1749
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1749
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-core
>            Reporter: Kengo Seki
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> After [upgrading Avro to 1.9.2|GOBBLIN-1726] on master, xz-compressed Avro files are neither readable nor writable by default.
> For example, given the following Avro file, which is compressed with the xz codec,
> {code}
> $ java -jar avro-tools-1.11.1.jar getmeta /tmp/avro/weather.avro
> avro.schema	{"type":"record","name":"Weather","namespace":"test","doc":"A weather reading.","fields":[{"name":"station","type":"string"},{"name":"time","type":"long"},{"name":"temp","type":"int"}]}
> avro.codec	xz
> {code}
> reading that file fails on master as follows.
> {code}
> $ git status 
> On branch master
> Your branch is ahead of 'origin/master' by 285 commits.
>   (use "git push" to publish your local commits)
> nothing to commit, working tree clean
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules. They can conflict with other modules on Jackson and Avro (via transitive dependency) respectively.
> $ ./gradlew assemble
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.17.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ cat /tmp/sample.job 
> source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
> source.filebased.data.directory=/tmp/avro
> extract.table.type=SNAPSHOT_ONLY
> writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
> data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
> ...
> 2022-11-26 20:15:52 JST ERROR [TaskExecutor-0] org.apache.gobblin.runtime.Task  - Task task_EmbeddedGobblin_1669461352066_0 failed
> java.lang.NoClassDefFoundError: org/tukaani/xz/XZInputStream
> 	at org.apache.avro.file.XZCodec.decompress(XZCodec.java:74)
> ...
> {code}
> This issue doesn't occur on past releases.
> {code}
> $ curl -sLO https://downloads.apache.org/gobblin/apache-gobblin-0.16.0/apache-gobblin-incubating-sources-0.16.0.tgz
> $ tar xf apache-gobblin-incubating-sources-0.16.0.tgz 
> $ cd apache-gobblin-incubating-sources-0.16.0
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules
> $ curl -sL https://github.com/apache/gobblin/raw/master/gradle/wrapper/gradle-wrapper.jar -o gradle/wrapper/gradle-wrapper.jar
> $ ./gradlew assemble
> $ rm -rf /tmp/gobblin-dist
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.16.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
> ...
> {"station": "011990-99999", "time": -619524000000, "temp": 0}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619524000000, "temp": 0}
> {"station": "011990-99999", "time": -619506000000, "temp": 22}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619506000000, "temp": 22}
> {"station": "011990-99999", "time": -619484400000, "temp": -11}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619484400000, "temp": -11}
> {"station": "012650-99999", "time": -655531200000, "temp": 111}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655531200000, "temp": 111}
> {"station": "012650-99999", "time": -655509600000, "temp": 78}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655509600000, "temp": 78}
> {code}
> This is because [Avro 1.9.2 declares xz's scope as "provided"|https://github.com/apache/avro/blob/release-1.9.2/lang/java/avro/pom.xml#L207] for some reason, so the xz library is not pulled in transitively at runtime. [It was fixed in the next release|https://github.com/apache/avro/blob/release-1.10.0/lang/java/avro/pom.xml#L238-L242], but as long as Gobblin uses Avro 1.9.2, it would help users if Gobblin included this dependency itself.
> In addition, upgrading Avro to 1.9.2 makes zstd compression available. This should be documented, as it benefits users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)