Posted to dev@gobblin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/02/14 18:18:00 UTC
[jira] [Work logged] (GOBBLIN-1749) Add dependency for handling xz-compressed Avro file
[ https://issues.apache.org/jira/browse/GOBBLIN-1749?focusedWorklogId=845456&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-845456 ]
ASF GitHub Bot logged work on GOBBLIN-1749:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 14/Feb/23 18:17
Start Date: 14/Feb/23 18:17
Worklog Time Spent: 10m
Work Description: Will-Lo merged PR #3609:
URL: https://github.com/apache/gobblin/pull/3609
Issue Time Tracking
-------------------
Worklog Id: (was: 845456)
Time Spent: 40m (was: 0.5h)
> Add dependency for handling xz-compressed Avro file
> ---------------------------------------------------
>
> Key: GOBBLIN-1749
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1749
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-core
> Reporter: Kengo Seki
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> After [upgrading Avro to 1.9.2|GOBBLIN-1726] on master, xz-compressed Avro files are neither readable nor writable by default.
> For example, given the following Avro file, which is compressed with the xz codec,
> {code}
> $ java -jar avro-tools-1.11.1.jar getmeta /tmp/avro/weather.avro
> avro.schema {"type":"record","name":"Weather","namespace":"test","doc":"A weather reading.","fields":[{"name":"station","type":"string"},{"name":"time","type":"long"},{"name":"temp","type":"int"}]}
> avro.codec xz
> {code}
> reading that file fails on master as follows.
> {code}
> $ git status
> On branch master
> Your branch is ahead of 'origin/master' by 285 commits.
> (use "git push" to publish your local commits)
> nothing to commit, working tree clean
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle # Remove the gobblin-elasticsearch and gobblin-example submodules. They can conflict with other modules on Jackson and Avro (via transitive dependency) respectively.
> $ ./gradlew assemble
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.17.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ cat /tmp/sample.job
> source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
> source.filebased.data.directory=/tmp/avro
> extract.table.type=SNAPSHOT_ONLY
> writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
> data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job
> ...
> 2022-11-26 20:15:52 JST ERROR [TaskExecutor-0] org.apache.gobblin.runtime.Task - Task task_EmbeddedGobblin_1669461352066_0 failed
> java.lang.NoClassDefFoundError: org/tukaani/xz/XZInputStream
> at org.apache.avro.file.XZCodec.decompress(XZCodec.java:74)
> ...
> {code}
> This issue doesn't occur on past releases.
> {code}
> $ curl -sLO https://downloads.apache.org/gobblin/apache-gobblin-0.16.0/apache-gobblin-incubating-sources-0.16.0.tgz
> $ tar xf apache-gobblin-incubating-sources-0.16.0.tgz
> $ cd apache-gobblin-incubating-sources-0.16.0
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle # Remove the gobblin-elasticsearch and gobblin-example submodules
> $ curl -sL https://github.com/apache/gobblin/raw/master/gradle/wrapper/gradle-wrapper.jar -o gradle/wrapper/gradle-wrapper.jar
> $ ./gradlew assemble
> $ rm -rf /tmp/gobblin-dist
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.16.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job
> ...
> {"station": "011990-99999", "time": -619524000000, "temp": 0}
> 2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619524000000, "temp": 0}
> {"station": "011990-99999", "time": -619506000000, "temp": 22}
> 2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619506000000, "temp": 22}
> {"station": "011990-99999", "time": -619484400000, "temp": -11}
> 2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619484400000, "temp": -11}
> {"station": "012650-99999", "time": -655531200000, "temp": 111}
> 2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "012650-99999", "time": -655531200000, "temp": 111}
> {"station": "012650-99999", "time": -655509600000, "temp": 78}
> 2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "012650-99999", "time": -655509600000, "temp": 78}
> {code}
> This is because [Avro 1.9.2 declares xz's scope as "provided"|https://github.com/apache/avro/blob/release-1.9.2/lang/java/avro/pom.xml#L207] for some reason, so the xz library is not pulled in as a transitive dependency. [It was fixed in the next release|https://github.com/apache/avro/blob/release-1.10.0/lang/java/avro/pom.xml#L238-L242], but as long as Gobblin uses Avro 1.9.2, it would help users if Gobblin declared this dependency itself.
> In addition, upgrading Avro to 1.9.2 makes zstd compression available. This should be documented, since it benefits users.
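
A quick way to confirm the cause described above is to check whether an xz jar was packaged into the built distribution. The following is a minimal sketch, not part of the original report. It assumes the unpacked distribution keeps its jars under /tmp/gobblin-dist/lib (override with LIB_DIR if your layout differs) and that the library is packaged under its usual Maven artifact name, xz-<version>.jar. The has_jar helper is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# has_jar <dir> <glob>: succeed if at least one file in <dir> matches <glob>.
# Hypothetical helper for illustration; not part of Gobblin.
has_jar() {
  # $2 is deliberately left unquoted so the shell expands the glob.
  for f in "$1"/$2; do
    # When nothing matches, the pattern stays literal and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Assumption: jars of the unpacked distribution live in lib/.
LIB_DIR=${LIB_DIR:-/tmp/gobblin-dist/lib}

if has_jar "$LIB_DIR" 'xz-*.jar'; then
  echo "xz jar found in $LIB_DIR"
else
  echo "xz jar missing from $LIB_DIR"
fi
```

If the jar is reported missing on a build from master but found on a 0.16.0 build, that would be consistent with the NoClassDefFoundError for org/tukaani/xz/XZInputStream shown in the log.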
--
This message was sent by Atlassian Jira
(v8.20.10#820010)