Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/03 15:09:17 UTC

[GitHub] [hudi] nsivabalan opened a new issue #3395: [SUPPORT]

nsivabalan opened a new issue #3395:
URL: https://github.com/apache/hudi/issues/3395


   
   **Describe the problem you faced**
   
   
   I am an architect at a product-based IT firm, currently evaluating Hudi to build a refreshable data lake.
   I am running the setup on my local machine and using the Spark datasource to write to and read from a Hudi temp table.
   I have evaluated the CoW and MoR write mechanisms, but while trying to read the Hudi table using the read-optimized query type I am getting the below exception:
   
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Invalid query type :read_optimized
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
   ```
   
   Below is how I am trying to read from the Hudi location:
   
   ```
   spark.read
     .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
       DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
     .load(s"$basePath/$tableName")
     .show(50, false)
   ```
   
   Kindly suggest if I am doing anything wrong.
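   For context, the exception above is thrown when Hudi's `DefaultSource` sees a query-type string it does not accept. A purely illustrative, simplified sketch of that kind of dispatch (hypothetical names, not the actual Hudi source) is below; since `read_optimized` is itself a valid value in Hudi 0.7/0.8, one plausible explanation for hitting this error anyway is a mismatch between the Hudi classes compiled against and the ones actually loaded at runtime (a classpath/packaging issue):

   ```
   // Illustrative sketch only (hypothetical names, not the actual Hudi source):
   // the reader validates the query-type option value and throws on anything
   // it does not recognize.
   object QueryTypeSketch {
     private val ValidTypes = Set("snapshot", "read_optimized", "incremental")

     def resolve(queryType: String): String =
       if (ValidTypes.contains(queryType)) queryType
       else throw new IllegalArgumentException(s"Invalid query type :$queryType")
   }
   ```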
   
   
   
   **Environment Description**
   
   * Hudi version : 0.7.0
   
   * Spark version : 2.4.7
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   * Scala version : 2.12
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-926274331


   I realized @Ambarish-Giri was asking for help on Windows. Not sure how we can help there; I don't actually have access to a Windows machine.





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-891965879


   Hi @nsivabalan, I currently have a Hudi setup on my Windows laptop.
   I am in the evaluation phase.
   I am running my Spark program using spark.read() as mentioned above.
   
   Below are the code snippets for write and read:
   
   ```
   df.write
     .format("hudi")
     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .option("hoodie.upsert.shuffle.parallelism", "2")
     .mode(SaveMode.Append)
     .save(s"$basePath/$tableName/")
   ```
   
   and
   
   ```
   spark.read
     .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
       DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
     .load(s"$basePath/$tableName")
     .show(50, false)
   ```





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-892051015


   Can you paste the spark-shell launch command please? 





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-891928642


   ambarish: Can you try with spark2.11. Can you give us the spark-shell launch command as well. 





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-892350596


   Hi @nsivabalan,
   
   I am not executing the Spark job from spark-shell; I am running the Spark driver program from the IntelliJ IDE.
   
   I am a little confused by "spark-shell launch command":
   do you mean how I launch spark-shell, or the spark-submit command used to run the program?





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-893140263


   Sure @nsivabalan, eventually our test and prod environments will be EMR. But before doing actual testing and deriving the benchmarking metrics, as I said earlier, I am just evaluating Hudi to explore all its features in my local setup.
   
   For now, below are the libraries I am using:
   
   ```
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.7.0"
   libraryDependencies += "org.apache.hudi" %% "hudi-utilities-bundle" % "0.7.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   and below are the Spark config settings used when creating the SparkSession object:
   
   ```
   val spark: SparkSession = SparkSession.builder()
     .appName("hudi-datalake")
     .master("local[*]")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.shuffle.compress", "true")
     .config("spark.shuffle.spill.compress", "true")
     .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .getOrCreate()
   ```
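   As an aside (an illustrative check, not something from this thread): when mixing Scala 2.11 and 2.12 artifacts, it can help to log which Scala binary version the running JVM actually uses and compare it against the `_2.11`/`_2.12` suffix of the Hudi bundle on the classpath. This needs only plain Scala, no Spark:

   ```
   import scala.util.Properties

   // Prints the Scala binary version of the runtime, e.g. "2.12"; it should
   // match the suffix of the hudi-spark-bundle_2.xx artifact on the classpath.
   val scalaBinary: String =
     Properties.versionNumberString.split('.').take(2).mkString(".")
   println(s"Scala binary version: $scalaBinary")
   ```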





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-893136900


   Yes, for example, something like this:
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   Ref: https://hudi.apache.org/docs/quick-start-guide
   
   Also, is your production env (eventually) going to be Windows? If not, I would recommend trying it out on EMR or some other cluster, as those are well-tested environments. I am not sure Hudi is tested end to end on Windows. Just a suggestion.
   
   
   





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-991913226


   thanks for the update. closing the issue. 





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-895725694


   Hi @nsivabalan,
   
   The only difference I had was an older Hudi version (0.7.0), but I have now upgraded to 0.8.0 as well for verification.
   As suggested, I tried the two possible configurations for Spark 2, one with Scala 2.11 and the other with Scala 2.12, as below:
   
   ```
   scalaVersion := "2.12.11"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.8.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   and
   
   ```
   scalaVersion := "2.11.12"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.8.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   But still no luck.
   
   Are there any additional configurations required when querying a MoR table with the read-optimized query option? Currently I am using:
   
   ```
   spark.read
     .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
       DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
     .load(s"$basePath/$tableName")
     .show(50, false)
   ```
   





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-893140263


   Sure @nsivabalan, eventually our test and prod environments will be EMR. But before doing actual testing and deriving the benchmarking metrics, as I said earlier, I am just evaluating Hudi to explore all its features in my local setup.
   
   For now, below are the libraries I am using:
   
   ```
   scalaVersion := "2.12.11"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.7.0"
   libraryDependencies += "org.apache.hudi" %% "hudi-utilities-bundle" % "0.7.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   and below are the Spark config settings used when creating the SparkSession object:
   
   ```
   val spark: SparkSession = SparkSession.builder()
     .appName("hudi-datalake")
     .master("local[*]")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.shuffle.compress", "true")
     .config("spark.shuffle.spill.compress", "true")
     .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .getOrCreate()
   ```





[GitHub] [hudi] vinothchandar closed issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
vinothchandar closed issue #3395:
URL: https://github.com/apache/hudi/issues/3395


   





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-905203993


   Sure @nsivabalan, I have started testing on EMR and will update the ticket accordingly.








[GitHub] [hudi] nsivabalan edited a comment on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-891928642


   ambarish: Can you try with scala 2.11. Can you give us the spark-shell launch command as well. 





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-894945928


   Hi @nsivabalan, as suggested I have removed the utilities-bundle dependency and changed the write mode to Overwrite, but still no luck. I am getting the same exception:
   
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Invalid query type :read_optimized
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
   ```
   
   Below is the configuration for reference:
   
   ```
   scalaVersion := "2.12.11"
   
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.7.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   ```
   userSegDf.write
     .format("hudi")
     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .option("hoodie.upsert.shuffle.parallelism", "2")
     .option("hoodie.cleaner.policy", HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
     .option("hoodie.cleaner.fileversions.retained", "3")
     .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 2)
     .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
     .option(HoodieCompactionConfig.AUTO_CLEAN_PROP, "true")
     .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, 2)
     .mode(SaveMode.Overwrite)
     .save(s"$basePath/$tableName/")
   ```





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-894470471


   hudi-spark-bundle alone is good enough; you don't need the utilities bundle. Can you retry after removing the dependency on the utilities bundle? Also, can you set Overwrite as the save mode so the write starts from scratch?
   
   I don't have much experience running Hudi on Windows, so I will try my best to help you out.





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-895330634


   Or in other words:
   
   ```
   // spark-shell for spark 3
   hudi-spark3-bundle_2.12:0.8.0
   org.apache.spark:spark-avro_2.12:3.0.1
   
   // spark-shell for spark 2 with scala 2.12
   hudi-spark-bundle_2.12:0.8.0
   org.apache.spark:spark-avro_2.12:2.4.4
   
   // spark-shell for spark 2 with scala 2.11
   hudi-spark-bundle_2.11:0.8.0
   org.apache.spark:spark-avro_2.11:2.4.4
   ```





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-895329692


   Can you try with Scala version 2.11? If not, the bundle artifacts have to change.
   
   ```
   // spark-shell for spark 3
   spark-shell \
     --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // spark-shell for spark 2 with scala 2.12
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // spark-shell for spark 2 with scala 2.11
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   
   Here are the 3 different possible configurations wrt Spark and Scala version.
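   For reference, the spark-shell `--packages` coordinates above map to sbt roughly as follows (an illustrative sketch for the Spark 2 + Scala 2.12 case; the key point is that the `%%` operator appends the project's Scala binary suffix, so `scalaVersion` must agree with the bundle's `_2.11`/`_2.12` suffix):

   ```
   // build.sbt sketch: with scalaVersion 2.12.x, %% resolves these to
   // hudi-spark-bundle_2.12:0.8.0, spark-avro_2.12:2.4.7, etc., matching the
   // "spark 2 with scala 2.12" spark-shell line above.
   scalaVersion := "2.12.11"

   libraryDependencies ++= Seq(
     "org.apache.spark" %% "spark-core"        % "2.4.7",
     "org.apache.spark" %% "spark-sql"         % "2.4.7",
     "org.apache.spark" %% "spark-avro"        % "2.4.7",
     "org.apache.hudi"  %% "hudi-spark-bundle" % "0.8.0"
   )
   ```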





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-926281577


   Hi @vinothchandar, for POC purposes I was using Windows, where the code was not working directly from the IDE, whereas if packaged and run, it executed. Currently testing on EMR.





[GitHub] [hudi] vinothchandar commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-926273616


   Closing! Thanks everyone





[GitHub] [hudi] nsivabalan closed issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3395:
URL: https://github.com/apache/hudi/issues/3395


   





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-905186839


   Sorry, not sure how else we can help here. We don't have a Windows env to reproduce with. I would recommend trying it out with EMR and letting us know how it goes.





[GitHub] [hudi] nsivabalan edited a comment on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-895329692


   Can you try with Scala version 2.11? If not, the bundle artifacts have to change.
   
   ```
   // spark-shell for spark 3
   spark-shell \
     --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
     
   // spark-shell for spark 2 with scala 2.12
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
     
   // spark-shell for spark 2 with scala 2.11
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   
   Here are the 3 diff possible configurations wrt spark and scala version. 





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-926281577


   Hi @vinothchandar, for POC purposes I was using Windows, where the code was not working directly from the IDE, whereas if packaged and run, it executed. Currently testing on EMR.





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-891989921


   I have tried with Scala 2.11 as well; it's giving the same exception:
   
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Invalid query type :read_optimized
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
   ```





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-913392798


   it's working on EMR. 
   





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-912848073


   thanks.





[GitHub] [hudi] nsivabalan commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-893136900


   Yes, for example, something like this:
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   Ref: https://hudi.apache.org/docs/quick-start-guide
   
   Also, is your production env (eventually) going to be Windows? If not, I would recommend trying it out on EMR or some other cluster, as those are well-tested environments. I am not sure if Hudi is tested end to end on Windows. Just a suggestion.
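   Since you are on Spark 2.4.x with Scala 2.12, the equivalent launch would use the Spark 2 bundle rather than `hudi-spark3-bundle`. A sketch (the exact version coordinates here are illustrative, not verified against your setup):
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.7 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   Mixing a Spark 3 bundle with a Spark 2 runtime (or vice versa) is a common source of odd datasource errors, so it is worth double-checking that the bundle matches your Spark and Scala versions.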
   
   
   





[GitHub] [hudi] Ambarish-Giri commented on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-891973156


   I am not using Hive, Hadoop, or any cloud storage (S3, HDFS, or GCS) in the current setup.
   On my Windows laptop, I have just pointed HADOOP_HOME to the directory containing winutils.exe.





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3395: [SUPPORT] Issues with read optimized query MOR table

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3395:
URL: https://github.com/apache/hudi/issues/3395#issuecomment-893140263


   Sure @nsivabalan, eventually our test and prod environments will be EMR only. But before doing actual testing and deriving the benchmarking metrics, as I said earlier, I am just evaluating Hudi to explore all its features in my local setup.
    
   But for now, below are the libraries I am using:
   ```
   scalaVersion := "2.12.11"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
   libraryDependencies += "org.apache.hudi" %% "hudi-spark-bundle" % "0.7.0"
   libraryDependencies += "org.apache.hudi" %% "hudi-utilities-bundle" % "0.7.0"
   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.7"
   ```
   
   
   And while creating the SparkSession object, below are the Spark config settings:
   ```
   val spark: SparkSession = SparkSession.builder()
     .appName("hudi-datalake")
     .master("local[*]")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.shuffle.compress", "true")
     .config("spark.shuffle.spill.compress", "true")
     .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .getOrCreate()
   ```
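   For reference, this is roughly how a read-optimized query is issued against a MOR table with the option constants available in Hudi 0.7.0 (a sketch, assuming a Spark session with the Hudi bundle on the classpath; `basePath` and `tableName` are placeholders for your local paths):
   ```
   import org.apache.hudi.DataSourceReadOptions
   
   // Read-optimized queries apply to MOR tables: they serve only the
   // latest compacted base files and skip the (uncompacted) log files.
   val df = spark.read
     .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
       DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
     .load(s"$basePath/$tableName")
   df.show(50, false)
   ```
   If the option constant resolves differently at runtime (e.g. because an older bundle version is shadowing the one you compiled against), the literal key/value pair can be passed instead, which also helps confirm whether the classpath is the problem.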

