Posted to commits@hudi.apache.org by "Bhavani Sudha (Jira)" <ji...@apache.org> on 2020/05/20 07:47:00 UTC
[jira] [Commented] (HUDI-907) Test Presto mor query support changes in HDFS Env
[ https://issues.apache.org/jira/browse/HUDI-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111875#comment-17111875 ]
Bhavani Sudha commented on HUDI-907:
------------------------------------
[~bdscheller] These are the steps to recreate and test.
h4. Setup to test mor queries through Presto on HDFS data
I made some changes to this original presto patch - [https://github.com/bschell/presto/commit/a3fb658c1cd70fd72f0a3021b3d994fe383303aa]
* Rebased it on top of the latest Presto master, which brings in hudi as a compile-time dependency.
* I added changes to the function isHudiInputFormat and renamed it to isHudiParquetInputFormat. The new behavior: for a COW table query it takes the HoodieROTablePathFilter route, while for a MOR table query it invokes HoodieParquetRealtimeInputFormat.getSplits()
You can find the changes here - [https://github.com/bhasudha/presto/commit/ce961a6ee10e154dd98f28615d628c2cf995a3c7]
Next I took these changes and tried to run a query, and got a NoClassDefFoundError at runtime for AvroSchemaConverter. From here it would mean either
* adding additional deps on org.apache.parquet:parquet-avro and org.apache.avro:avro inside the presto-hive module, *OR*
* adding a compile-time dep on hudi-presto-bundle, which already shades these deps.
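For the second route, the dependency swap might look roughly like the fragment below. This is a sketch, not the actual patch; the version matches the locally built bundle described in the Docker setup section.

```xml
<!-- Sketch: in presto-hive/pom.xml, replace the hudi-hadoop-mr dependency
     with the bundle that already shades parquet-avro and avro.
     0.6.0-SNAPSHOT is the locally built bundle version used below. -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-presto-bundle</artifactId>
  <version>0.6.0-SNAPSHOT</version>
</dependency>
```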
I took the second route: changed the root Presto pom to depend on `hudi-presto-bundle` instead of `hudi-hadoop-mr`, and made similar changes inside the presto-hive module's pom. At this point, when trying to build Presto, I got conflicts between Hudi's version of parquet and Presto's version of parquet. So I tried relocating the shaded parquet inside hudi-presto-bundle and also added deps on `parquet-common`, `parquet-encoding`, `parquet-column`, `parquet-hadoop` etc. inside hudi-presto-bundle. The next build ran fine, but at runtime I saw a NoClassDefFoundError for `org/apache/parquet/format/TypeDefinedOrder`, which is a Thrift-generated class in parquet-format. At this point I was blocked.
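For reference, the relocation attempt inside hudi-presto-bundle's maven-shade-plugin config would look roughly like this (the shadedPattern prefix is illustrative, not the actual patch):

```xml
<!-- Sketch of a relocation entry in the <relocations> section of
     hudi-presto-bundle's maven-shade-plugin configuration. -->
<relocation>
  <pattern>org.apache.parquet.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.parquet.</shadedPattern>
</relocation>
```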
h5. *Docker set up*
* I built hudi locally with the changes (if any, as described above) in hudi-presto-bundle's pom.
* Published it to the local .m2 maven repo with `mvn install:install-file -Dfile=./hudi-presto-bundle-0.6.0-SNAPSHOT.jar -DgroupId=org.apache.hudi -DartifactId=hudi-presto-bundle -Dversion=0.6.0-SNAPSHOT -Dpackaging=jar`.
* Built Presto normally with the changes from your patch (described above). This picks up the above hudi version from the local .m2 repo.
* Copied presto-server/target/presto-server-0.236-SNAPSHOT.tar.gz and presto-cli-0.236-SNAPSHOT-executable.jar to a temporary directory and served it with a simple HTTP server ([https://www.pythonforbeginners.com/modules-in-python/how-to-use-simplehttpserver/]), e.g. `python -m SimpleHTTPServer 1234`. This serves as the webserver URL from which the local Docker Presto image is built in the next steps.
* Built a docker presto image using the following patch (replace x.x.x.x with your host ip):
{quote}diff --git a/docker/hoodie/hadoop/prestobase/Dockerfile b/docker/hoodie/hadoop/prestobase/Dockerfile
index 43b989e6..98b5dc7c 100644
--- a/docker/hoodie/hadoop/prestobase/Dockerfile
+++ b/docker/hoodie/hadoop/prestobase/Dockerfile
@@ -22,10 +22,9 @@ ARG HADOOP_VERSION=2.8.4
ARG HIVE_VERSION=2.3.3
FROM apachehudi/hudi-hadoop_${HADOOP_VERSION}-base:latest as hadoop-base
-ARG PRESTO_VERSION=0.217
-
+ARG PRESTO_VERSION=0.236
ENV PRESTO_VERSION ${PRESTO_VERSION}
-ENV PRESTO_HOME /opt/presto-server-${PRESTO_VERSION}
+ENV PRESTO_HOME /opt/presto-server-${PRESTO_VERSION}-SNAPSHOT
ENV PRESTO_CONF_DIR ${PRESTO_HOME}/etc
ENV PRESTO_LOG_DIR /var/log/presto
ENV PRESTO_JVM_MAX_HEAP 2G
@@ -53,11 +52,11 @@ RUN set -x \
gosu \
&& rm -rf /var/lib/apt/lists/* \
## presto-server
- && wget -q -O - https://repo1.maven.org/maven2/com/facebook/presto/presto-server/${PRESTO_VERSION}/presto-server-${PRESTO_VERSION}.tar.gz \
+ && wget -q -O - http://x.x.x.x:1234/presto-server-${PRESTO_VERSION}.tar.gz \
| tar -xzf - -C /opt/ \
&& mkdir -p /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/ \
## presto-client
- && wget -q -O /usr/local/bin/presto https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar \
+ && wget -q -O /usr/local/bin/presto http://x.x.x.x:1234/presto-cli-${PRESTO_VERSION}-executable.jar \
&& chmod +x /usr/local/bin/presto \
## user/dir/permmsion
&& adduser --shell /sbin/nologin --uid 1000 docker \
@@ -76,10 +75,6 @@ COPY bin/* /usr/local/bin/
COPY lib/* /usr/local/lib/
RUN chmod +x /usr/local/bin/entrypoint.sh
-ADD target/ /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/
-ENV HUDI_PRESTO_BUNDLE /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/hudi-presto-bundle.jar
-RUN cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
VOLUME ["${PRESTO_LOG_DIR}"]
WORKDIR ${PRESTO_HOME}
diff --git a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
index 58b55085..c457f646 100755
--- a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
+++ b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
@@ -54,10 +54,6 @@ do
conf_file=${template%.mustache}
cat ${conf_file}.mustache | mustache.sh > ${conf_file}
done
-
-# Copy the presto bundle at run time so that locally built bundle overrides the one that is present in the image
-cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
case "$1" in
"coordinator" | "worker" )
server_role="$1"
{quote}
Now build the image:
{quote}cd docker/hoodie/hadoop/prestobase
docker build .{quote}
* The next step is to push this to a local docker registry:
{quote}docker run -d -p 5000:5000 --restart=always --name registry registry:2
docker tag <ImageID> localhost:5000/prestobase:latest
docker push localhost:5000/prestobase:latest
{quote}
Now edit docker/compose/docker-compose_hadoop284_hive233_spark244.yml to replace `image: apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest` with `image: localhost:5000/prestobase:latest` (matching the tag pushed above), and run docker as described here - [https://hudi.apache.org/docs/docker_demo.html]
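The compose-file edit above can be scripted with sed. The snippet below demonstrates it on a stand-in file; in the real repo, point it at docker/compose/docker-compose_hadoop284_hive233_spark244.yml, and make sure the replacement tag matches whatever you actually pushed to the local registry:

```shell
# Stand-in for the real compose file; substitute the repo path in practice.
COMPOSE=$(mktemp)
echo '    image: apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest' > "$COMPOSE"

# Swap the published prestobase image for the locally pushed one.
sed -i 's|apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest|localhost:5000/prestobase:latest|' "$COMPOSE"

RESULT=$(cat "$COMPOSE")
echo "$RESULT"
rm -f "$COMPOSE"
```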
This way Presto queries can be tested in HDFS env locally using Hudi's docker setup.
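The artifact-serving step above can be sketched end to end as below. Everything here is illustrative: the placeholder file stands in for the real presto-server tarball, `python3 -m http.server` is the Python 3 equivalent of `python -m SimpleHTTPServer`, and port 1234 matches the Dockerfile patch:

```shell
# Stage artifacts in a scratch dir and serve them over HTTP, the way the
# patched Dockerfile's wget expects to fetch them.
STAGE=$(mktemp -d)
cd "$STAGE"
echo 'placeholder' > presto-server-0.236-SNAPSHOT.tar.gz  # stand-in tarball

python3 -m http.server 1234 >/dev/null 2>&1 &   # Python 3 SimpleHTTPServer
SERVER_PID=$!
sleep 1

# From the Docker build host, this URL would be http://x.x.x.x:1234/...
FETCHED=$(curl -fsS http://localhost:1234/presto-server-0.236-SNAPSHOT.tar.gz)
echo "$FETCHED"

kill "$SERVER_PID"
cd - >/dev/null
rm -rf "$STAGE"
```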
> Test Presto mor query support changes in HDFS Env
> -------------------------------------------------
>
> Key: HUDI-907
> URL: https://issues.apache.org/jira/browse/HUDI-907
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Components: Presto Integration
> Reporter: Bhavani Sudha
> Assignee: Bhavani Sudha
> Priority: Major
> Fix For: 0.5.3
>
>
> Test presto integration for HDFS environment as well in addition to S3.
>
> Blockers faced so far
> [~bdscheller] I tried to apply your presto patch to test mor queries on Presto. The way I set it up was to create a docker image from your presto patch and use that image in the hudi local docker environment. I observed a couple of issues there:
> * I got NoClassDefFoundError for these classes:
> ** org/apache/parquet/avro/AvroSchemaConverter
> ** org/apache/parquet/hadoop/ParquetFileReader
> ** org/apache/parquet/io/InputFile
> ** org/apache/parquet/format/TypeDefinedOrder
> I was able to get around the first three errors by shading org.apache.parquet inside hudi-presto-bundle and changing presto-hive to depend on the hudi-presto-bundle. However, for the last one shading didn't help because it's already a Thrift-generated class. I am wondering whether you also ran into similar issues while testing S3.
> Could you please elaborate on your test setup so we can do a similar thing for HDFS as well? If we need to add more changes to hudi-presto-bundle, we would need to prioritize that for the 0.5.3 release ASAP.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)