Posted to dev@gobblin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/02/14 18:18:00 UTC

[jira] [Work logged] (GOBBLIN-1749) Add dependency for handling xz-compressed Avro file

     [ https://issues.apache.org/jira/browse/GOBBLIN-1749?focusedWorklogId=845456&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-845456 ]

ASF GitHub Bot logged work on GOBBLIN-1749:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Feb/23 18:17
            Start Date: 14/Feb/23 18:17
    Worklog Time Spent: 10m 
      Work Description: Will-Lo merged PR #3609:
URL: https://github.com/apache/gobblin/pull/3609




Issue Time Tracking
-------------------

    Worklog Id:     (was: 845456)
    Time Spent: 40m  (was: 0.5h)

> Add dependency for handling xz-compressed Avro file
> ---------------------------------------------------
>
>                 Key: GOBBLIN-1749
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1749
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-core
>            Reporter: Kengo Seki
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> After [upgrading Avro to 1.9.2|GOBBLIN-1726] on master, xz-compressed Avro files are neither readable nor writable by default.
> For example, given the following Avro file, which is compressed with the xz codec,
> {code}
> $ java -jar avro-tools-1.11.1.jar getmeta /tmp/avro/weather.avro
> avro.schema	{"type":"record","name":"Weather","namespace":"test","doc":"A weather reading.","fields":[{"name":"station","type":"string"},{"name":"time","type":"long"},{"name":"temp","type":"int"}]}
> avro.codec	xz
> {code}
> reading that file fails on master as follows.
> {code}
> $ git status 
> On branch master
> Your branch is ahead of 'origin/master' by 285 commits.
>   (use "git push" to publish your local commits)
> nothing to commit, working tree clean
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules. They can conflict with other modules on Jackson and Avro (via transitive dependency) respectively.
> $ ./gradlew assemble
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.17.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ cat /tmp/sample.job 
> source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
> source.filebased.data.directory=/tmp/avro
> extract.table.type=SNAPSHOT_ONLY
> writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
> data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
> ...
> 2022-11-26 20:15:52 JST ERROR [TaskExecutor-0] org.apache.gobblin.runtime.Task  - Task task_EmbeddedGobblin_1669461352066_0 failed
> java.lang.NoClassDefFoundError: org/tukaani/xz/XZInputStream
> 	at org.apache.avro.file.XZCodec.decompress(XZCodec.java:74)
> ...
> {code}
> This issue doesn't occur in past releases.
> {code}
> $ curl -sLO https://downloads.apache.org/gobblin/apache-gobblin-0.16.0/apache-gobblin-incubating-sources-0.16.0.tgz
> $ tar xf apache-gobblin-incubating-sources-0.16.0.tgz 
> $ cd apache-gobblin-incubating-sources-0.16.0
> $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules
> $ curl -sL https://github.com/apache/gobblin/raw/master/gradle/wrapper/gradle-wrapper.jar -o gradle/wrapper/gradle-wrapper.jar
> $ ./gradlew assemble
> $ rm -rf /tmp/gobblin-dist
> $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.16.0.tar.gz -C /tmp
> $ cd /tmp/gobblin-dist
> $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
> ...
> {"station": "011990-99999", "time": -619524000000, "temp": 0}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619524000000, "temp": 0}
> {"station": "011990-99999", "time": -619506000000, "temp": 22}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619506000000, "temp": 22}
> {"station": "011990-99999", "time": -619484400000, "temp": -11}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619484400000, "temp": -11}
> {"station": "012650-99999", "time": -655531200000, "temp": 111}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655531200000, "temp": 111}
> {"station": "012650-99999", "time": -655509600000, "temp": 78}
> 2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655509600000, "temp": 78}
> {code}
> This is because [Avro 1.9.2 declares xz's scope as "provided"|https://github.com/apache/avro/blob/release-1.9.2/lang/java/avro/pom.xml#L207] for some reason. [It was fixed in the next release|https://github.com/apache/avro/blob/release-1.10.0/lang/java/avro/pom.xml#L238-L242], but as long as Gobblin uses Avro 1.9.2, it would be helpful for users if Gobblin included this dependency itself.
> In addition, upgrading Avro to 1.9.2 makes zstd compression available. This should be documented, as it is beneficial for users.
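
The NoClassDefFoundError in the description means the xz library is absent from the runtime classpath. One way to check for this before running a job is sketched below. It is only an illustration and not part of Gobblin or the merged PR: the class name XzClasspathCheck and the helper isOnClasspath are hypothetical, and only the probed class name org.tukaani.xz.XZInputStream comes from the stack trace.

```java
public class XzClasspathCheck {
    // Class that Avro's XZCodec failed to load in the stack trace quoted above.
    static final String XZ_CLASS = "org.tukaani.xz.XZInputStream";

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, XzClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(XZ_CLASS + " available: " + isOnClasspath(XZ_CLASS));
    }
}
```

The result depends entirely on the classpath the check is run with, so it should be run with the same jars as the Gobblin distribution under test. If it reports the class as unavailable, reading or writing xz-compressed Avro files with that classpath would be expected to fail as shown in the description.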



--
This message was sent by Atlassian Jira
(v8.20.10#820010)