Posted to commits@hudi.apache.org by "Bhavani Sudha (Jira)" <ji...@apache.org> on 2020/05/20 07:47:00 UTC

[jira] [Commented] (HUDI-907) Test Presto mor query support changes in HDFS Env

    [ https://issues.apache.org/jira/browse/HUDI-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111875#comment-17111875 ] 

Bhavani Sudha commented on HUDI-907:
------------------------------------

[~bdscheller] These are the steps to recreate and test.  
h4. Setup to test mor queries through Presto on HDFS data

I made some changes to this original presto patch - [https://github.com/bschell/presto/commit/a3fb658c1cd70fd72f0a3021b3d994fe383303aa] 
 * Rebased it on top of the latest Presto master, which brings Hudi in as a compile-time dependency.

 * Added changes to the function isHudiInputFormat and renamed it to isHudiParquetInputFormat. The new behavior: a COW table query uses the HoodieROTablePathFilter route, while a MOR table query invokes HoodieParquetRealtimeInputFormat.getSplits().

You can find the changes here - [https://github.com/bhasudha/presto/commit/ce961a6ee10e154dd98f28615d628c2cf995a3c7] 
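The dispatch described above can be sketched roughly as follows. This is an illustrative Python sketch of the Java logic, not the actual Presto code; the function and constant names here (plan_splits, COW_INPUT_FORMAT, MOR_INPUT_FORMAT) are assumptions for illustration, though the two input-format class names themselves come from Hudi.

```python
# Hudi's two parquet input formats (real class names from hudi-hadoop-mr).
COW_INPUT_FORMAT = "org.apache.hudi.hadoop.HoodieParquetInputFormat"
MOR_INPUT_FORMAT = "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat"


def plan_splits(input_format_name, use_path_filter_route, get_realtime_splits):
    """Sketch of the renamed isHudiParquetInputFormat dispatch.

    COW tables go through the HoodieROTablePathFilter route (filtering out
    stale base files); MOR tables delegate split planning to
    HoodieParquetRealtimeInputFormat.getSplits().
    """
    if input_format_name == MOR_INPUT_FORMAT:
        return get_realtime_splits()
    if input_format_name == COW_INPUT_FORMAT:
        return use_path_filter_route()
    raise ValueError("not a Hudi parquet input format: " + input_format_name)
```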

 

Next I took these changes and tried to run a query. I got a NoClassDefFoundError at runtime for AvroSchemaConverter. From here, the options were either
 * adding additional deps on org.apache.parquet:parquet-avro and org.apache.avro:avro inside the presto-hive module *OR*
 * a compile-time dep on hudi-presto-bundle, which already shades these deps.
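For the second option, the dependency change would look roughly like the fragment below. This is a hedged sketch, not the exact pom change; the version matches the locally built snapshot used later in these steps.

```xml
<!-- Sketch: depend on the bundle that already shades parquet-avro and avro,
     instead of hudi-hadoop-mr. Version is the locally installed snapshot. -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-presto-bundle</artifactId>
  <version>0.6.0-SNAPSHOT</version>
</dependency>
```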

 

I took the second route and changed the root Presto pom to depend on `hudi-presto-bundle` instead of `hudi-hadoop-mr`, and made similar changes inside the presto-hive module's pom. At this point, building Presto failed with conflicts between Hudi's version of parquet and Presto's version of parquet. So I tried relocating the shaded parquet inside hudi-presto-bundle and also added deps on `parquet-common`, `parquet-encoding`, `parquet-column`, `parquet-hadoop` etc. inside hudi-presto-bundle. The next build ran fine, but at runtime I saw a NoClassDefFoundError for `org/apache/parquet/format/TypeDefinedOrder`, which is a Thrift-generated class in parquet-format. At this point I was blocked.

 
h5. *Docker setup*
 * I built Hudi locally with the changes (if any, as described above) in hudi-presto-bundle's pom.
 * Published it to the local .m2 Maven repo using `mvn install:install-file -Dfile=./hudi-presto-bundle-0.6.0-SNAPSHOT.jar -DgroupId=org.apache.hudi -DartifactId=hudi-presto-bundle -Dversion=0.6.0-SNAPSHOT -Dpackaging=jar`.
 * Built Presto normally with the changes from your patch (described above). This picks up the Hudi version published to the local .m2 repo.
 * Copied presto-server/target/presto-server-0.236-SNAPSHOT.tar.gz and presto-cli-0.236-SNAPSHOT-executable.jar to a temporary directory and ran a simple HTTP server there (see [https://www.pythonforbeginners.com/modules-in-python/how-to-use-simplehttpserver/]), e.g. `python -m SimpleHTTPServer 1234` (or `python3 -m http.server 1234` on Python 3). This serves as the webserver URL from which the local Docker Presto image is built in the next steps.
 * Build a Docker Presto image using this patch (replace x.x.x.x with your host IP):
{quote}diff --git a/docker/hoodie/hadoop/prestobase/Dockerfile b/docker/hoodie/hadoop/prestobase/Dockerfile
index 43b989e6..98b5dc7c 100644
--- a/docker/hoodie/hadoop/prestobase/Dockerfile
+++ b/docker/hoodie/hadoop/prestobase/Dockerfile
@@ -22,10 +22,9 @@ ARG HADOOP_VERSION=2.8.4
 ARG HIVE_VERSION=2.3.3
 FROM apachehudi/hudi-hadoop_${HADOOP_VERSION}-base:latest as hadoop-base
-ARG PRESTO_VERSION=0.217
-
+ARG PRESTO_VERSION=0.236
 ENV PRESTO_VERSION       ${PRESTO_VERSION}
-ENV PRESTO_HOME          /opt/presto-server-${PRESTO_VERSION}
+ENV PRESTO_HOME          /opt/presto-server-${PRESTO_VERSION}-SNAPSHOT
 ENV PRESTO_CONF_DIR      ${PRESTO_HOME}/etc
 ENV PRESTO_LOG_DIR       /var/log/presto
 ENV PRESTO_JVM_MAX_HEAP  2G
@@ -53,11 +52,11 @@ RUN set -x \
         gosu \
     && rm -rf /var/lib/apt/lists/* \
     ## presto-server
-    && wget -q -O - https://repo1.maven.org/maven2/com/facebook/presto/presto-server/${PRESTO_VERSION}/presto-server-${PRESTO_VERSION}.tar.gz \
+    && wget -q -O - http://x.x.x.x:1234/presto-server-${PRESTO_VERSION}.tar.gz \
         | tar -xzf - -C /opt/  \
     && mkdir -p /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/ \
     ## presto-client
-    && wget -q -O /usr/local/bin/presto https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar \
+    && wget -q -O /usr/local/bin/presto http://x.x.x.x:1234/presto-cli-${PRESTO_VERSION}-executable.jar \
     && chmod +x /usr/local/bin/presto \
     ## user/dir/permmsion
     && adduser --shell /sbin/nologin --uid 1000 docker \
@@ -76,10 +75,6 @@ COPY bin/*  /usr/local/bin/
 COPY lib/*  /usr/local/lib/
 RUN chmod +x /usr/local/bin/entrypoint.sh
-ADD target/ /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/
-ENV HUDI_PRESTO_BUNDLE /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/hudi-presto-bundle.jar
-RUN  cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
 VOLUME ["${PRESTO_LOG_DIR}"]
 WORKDIR ${PRESTO_HOME}
diff --git a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
index 58b55085..c457f646 100755
--- a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
+++ b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
@@ -54,10 +54,6 @@ do
     conf_file=${template%.mustache}
     cat ${conf_file}.mustache | mustache.sh > ${conf_file}
 done
-
-# Copy the presto bundle at run time so that locally built bundle overrides the one that is present in the image
-cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
 case "$1" in
     "coordinator" | "worker" )
         server_role="$1"{quote}
 

Now build the image:

{quote}cd docker/hoodie/hadoop/prestobase
docker build .{quote}
 * The next step is to push this to a local Docker registry:

{quote}docker run -d -p 5000:5000 --restart=always --name registry registry:2

docker tag <ImageID> localhost:5000/prestobase:latest

docker push localhost:5000/prestobase:latest
{quote}
Now edit docker/compose/docker-compose_hadoop284_hive233_spark244.yml to replace `image: apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest` with `image: localhost:5000/prestobase:latest` (this must match the tag pushed above), and run Docker as described here - [https://hudi.apache.org/docs/docker_demo.html]
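The compose change would look roughly like this fragment. This is a hedged sketch: the service name shown is an assumption about the demo compose file's layout, and the image tag must match whatever was pushed to the local registry in the `docker push` step above.

```yaml
# Sketch of the edited service entry in
# docker/compose/docker-compose_hadoop284_hive233_spark244.yml
# (service name assumed; only the image line changes):
  presto-coordinator-1:
    image: localhost:5000/prestobase:latest
```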

This way, Presto queries can be tested locally in an HDFS environment using Hudi's Docker setup.

> Test Presto mor query support changes in HDFS Env
> -------------------------------------------------
>
>                 Key: HUDI-907
>                 URL: https://issues.apache.org/jira/browse/HUDI-907
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Presto Integration
>            Reporter: Bhavani Sudha
>            Assignee: Bhavani Sudha
>            Priority: Major
>             Fix For: 0.5.3
>
>
> Test presto integration for HDFS environment as well in addition to S3.
>  
> Blockers faced so far
> [~bdscheller] I tried to apply your presto patch to test mor queries on Presto. The way I set it up was to create a Docker image from your Presto patch and use that image in the Hudi local Docker environment. I observed a couple of issues there:
>  * I got NoClassDefFoundError for these classes:
>  ** org/apache/parquet/avro/AvroSchemaConverter
>  ** org/apache/parquet/hadoop/ParquetFileReader
>  ** org/apache/parquet/io/InputFile
>  ** org/apache/parquet/format/TypeDefinedOrder
> I was able to get around the first three errors by shading org.apache.parquet inside hudi-presto-bundle and changing presto-hive to depend on the hudi-presto-bundle. However, for the last one shading didn't help, because it's already a Thrift-generated class. I am wondering if you also ran into similar issues while testing S3.
> Could you please elaborate on your test setup so we can do something similar for HDFS as well? If we need to add more changes to hudi-presto-bundle, we would need to prioritize that for the 0.5.3 release ASAP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)