Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/26 09:22:46 UTC

[GitHub] [hudi] matthiasdg opened a new issue #3868: [SUPPORT] Querying hudi datasets from standalone metastore

matthiasdg opened a new issue #3868:
URL: https://github.com/apache/hudi/issues/3868


   **Describe the problem you faced**
   We're running Hudi (now 0.9 with Spark 3.1.2) on Azure Data Lake Storage Gen2 and are now trying to get the Hive integration working. Since we're running everything on Kubernetes, we took inspiration from https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1 and run `hive-standalone-metastore-3.0.0` (Dockerfile cf. https://gist.github.com/joshuarobinson/2d6142d1ff3750376d9559e259cd94e0, but with Hadoop 3.2.0 and hadoop-azure instead of the S3 jars). We also have a Spark thrift server (with hudi-spark3-bundle), but I think it can be kept out of the current problem.
   
   Syncing to this Hive metastore completes successfully, both in Spark with `DataSourceWriteOptions` and with the `hudi-hive-sync` tool, provided I have a `hive-site.xml` with the metastore's thrift URI and set the sync mode to `hms`.
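   
   For reference, the datasource-side sync options look roughly like this (a sketch; the database, table, and path values here are placeholders rather than our real ones):
   ```
   // Minimal sketch of a Hudi write with HMS-based hive sync enabled.
   // df is the DataFrame being written; basePath is the table's abfss path.
   import org.apache.spark.sql.SaveMode
   
   val basePath = "abfss://container@account.dfs.core.windows.net/path/to/table"
   df.write.format("org.apache.hudi").
     option("hoodie.datasource.hive_sync.enable", "true").
     option("hoodie.datasource.hive_sync.mode", "hms").
     option("hoodie.datasource.hive_sync.database", "mydb").
     option("hoodie.datasource.hive_sync.table", "mytable").
     option("hoodie.datasource.hive_sync.partition_fields", "year,month,day").
     option("hoodie.datasource.hive_sync.partition_extractor_class",
       "org.apache.hudi.hive.MultiPartKeysValueExtractor").
     mode(SaveMode.Append).
     save(basePath)
   ```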
   
   The problem occurs when I try to query data that has partition columns in the Hive metastore. I can list databases and tables, and query data from tables without partitions, but I get errors as soon as partitions come into play (I was using Spark in Scala with `builder.enableHiveSupport` and config settings like `hive.metastore.uris` and `spark.sql.hive.convertMetastoreParquet`, though it could be that the latter is no longer necessary with 0.9?).
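   
   For clarity, the query side is set up roughly like this (a sketch; the metastore URI is our port-forwarded k8s service, shown here as localhost):
   ```
   import org.apache.spark.sql.SparkSession
   
   // Sketch of the reader-side session; values are placeholders for our real config.
   val spark = SparkSession.builder()
     .master("local[2]")
     .config("hive.metastore.uris", "thrift://localhost:9083")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .enableHiveSupport()
     .getOrCreate()
   ```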
   
   Some examples. I have a table like:
   ```
   +-------------------------------+----------------------------------------------------+----------+
   |           col_name            |                     data_type                      | comment  |
   +-------------------------------+----------------------------------------------------+----------+
   | _hoodie_commit_time           | string                                             | NULL     |
   | _hoodie_commit_seqno          | string                                             | NULL     |
   | _hoodie_record_key            | string                                             | NULL     |
   | _hoodie_partition_path        | string                                             | NULL     |
   | _hoodie_file_name             | string                                             | NULL     |
   | data                          | struct<overview:struct<visitorTypeAmount:struct<uniqueVisitorAmount:int>>> | NULL     |
   | sensorId                      | string                                             | NULL     |
   | ts                            | timestamp                                          | NULL     |
   | hiveid                        | string                                             | NULL     |
   | hivets                        | string                                             | NULL     |
   | # Partition Information       |                                                    |          |
   | # col_name                    | data_type                                          | comment  |
   | hiveid                        | string                                             | NULL     |
   | hivets                        | string                                             | NULL     |
   |                               |                                                    |          |
   | # Detailed Table Information  |                                                    |          |
   | Database                      | degeyt70                                           |          |
   | Table                         | ctmmsm_2510                                        |          |
   | Created Time                  | Mon Oct 25 15:42:23 UTC 2021                       |          |
   | Last Access                   | UNKNOWN                                            |          |
   | Created By                    | Spark 2.2 or prior                                 |          |
   | Type                          | EXTERNAL                                           |          |
   | Provider                      | hudi                                               |          |
   | Table Properties              | [last_commit_time_sync=20211022175915]             |          |
   | Location                      | abfss://dev@stsdpglasshouse.dfs.core.windows.net/devs/degeyt70/partitiontests/datalakehouse/ctm.tf_msm |          |
   | Serde Library                 | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |          |
   | InputFormat                   | org.apache.hudi.hadoop.HoodieParquetInputFormat    |          |
   | OutputFormat                  | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |          |
   | Storage Properties            | [hoodie.query.as.ro.table=false]                   |          |
   +-------------------------------+----------------------------------------------------+----------+
   ```
   When querying, I get an error like: 
   ```
   [info]   Cause: java.lang.RuntimeException: Failed to cast value `2021/06/03` to `TimestampType` for partition column `ts`
   [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
   [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
   [info]   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:37)
   [info]   at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:404)
   ```
   (`ts` is a timestamp-micros field used to create year/month/day Hudi partitions; it is not a Hive partition column).
   
   Another example:
   ```
   +-------------------------------+----------------------------------------------------+----------+
   |           col_name            |                     data_type                      | comment  |
   +-------------------------------+----------------------------------------------------+----------+
   | _hoodie_commit_time           | string                                             | NULL     |
   | _hoodie_commit_seqno          | string                                             | NULL     |
   | _hoodie_record_key            | string                                             | NULL     |
   | _hoodie_partition_path        | string                                             | NULL     |
   | _hoodie_file_name             | string                                             | NULL     |
   | sensorId                      | bigint                                             | NULL     |
   | timestamp                     | timestamp                                          | NULL     |
   | value                         | double                                             | NULL     |
   | devId                         | string                                             | NULL     |
   | year                          | string                                             | NULL     |
   | month                         | string                                             | NULL     |
   | day                           | string                                             | NULL     |
   | # Partition Information       |                                                    |          |
   | # col_name                    | data_type                                          | comment  |
   | devId                         | string                                             | NULL     |
   | year                          | string                                             | NULL     |
   | month                         | string                                             | NULL     |
   | day                           | string                                             | NULL     |
   |                               |                                                    |          |
   | # Detailed Table Information  |                                                    |          |
   | Database                      | degeyt70                                           |          |
   | Table                         | vmmmsm                                             |          |
   | Owner                         | root                                               |          |
   | Created Time                  | Fri Oct 22 16:40:25 UTC 2021                       |          |
   | Last Access                   | UNKNOWN                                            |          |
   | Created By                    | Spark 2.2 or prior                                 |          |
   | Type                          | EXTERNAL                                           |          |
   | Provider                      | hudi                                               |          |
   | Table Properties              | [last_commit_time_sync=20211022181214]             |          |
   | Location                      | abfss://dev@stsdpglasshouse.dfs.core.windows.net/devs/degeyt70/partitiontests/datalakehouse/vmm.aq_msm |          |
   | Serde Library                 | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |          |
   | InputFormat                   | org.apache.hudi.hadoop.HoodieParquetInputFormat    |          |
   | OutputFormat                  | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |          |
   | Storage Properties            | [hoodie.query.as.ro.table=false]                   |          |
   | Partition Provider            | Catalog                                            |          |
   +-------------------------------+----------------------------------------------------+----------+
   ```
   Here I got:
   ```
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 15) (192.168.0.215 executor driver): java.io.IOException: Required column is missing in data file. Col: [devId]
   [info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:314)
   [info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:154)
   [info] 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:329)
   [info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
   [info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
   [info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   [info] 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
   [info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
   [info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   [info] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   [info] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   [info] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   [info] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   [info] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   [info] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   [info] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   [info] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   [info] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   [info] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   [info] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   [info] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   [info] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   [info] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [info] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [info] 	at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   Here the `devId` column it complains about is a Hive partition column, and I guess it's expected that it is not present in the parquet files.
   
   Another thing I noticed: for MoR `_rt` tables, a `describe` through Spark does not show the Hive partition columns after the regular columns (and I can query, though I can't use partition pruning); for `_ro` tables it does.
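   
   (For reference, I compare them with something like the following; the table names are illustrative:)
   ```
   // Describe both MoR views of the same table and compare the partition sections.
   spark.sql("describe formatted degeyt70.vmmmsm_rt").show(200, truncate = false)
   spark.sql("describe formatted degeyt70.vmmmsm_ro").show(200, truncate = false)
   ```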
   
   Yet another thing I tried was running Hive locally (I tried versions 2.3.9 and 3.1.2; both work) with the `hudi-hadoop-mr-bundle` and a `hive-site.xml` pointing to the remote metastore, and opening a hive shell for querying. That worked fine after I changed the metastore configuration cf. https://issues.apache.org/jira/browse/HIVE-21702 to be able to filter on Hive partition columns. (I get some MapReduce errors here for aggregations, but I'm guessing that's a separate configuration issue.)
   
   This makes me think the sync itself is OK and the problem is on the query side. Should it be possible to run Spark/Hudi queries against this standalone metastore? Could it be a Hive dependency issue?
   
   **Expected behavior**
   Being able to query successfully through this metastore (without a HiveServer), or through the Spark thrift server later on.
   
   **Environment Description**
   
   * Hudi version : 0.9
   
   * Spark version : 3.1.2
   
   * Hive version : in my code dependencies I apparently have 2.3.7, while this standalone metastore is 3.0.0. I'll try switching this next, but I wonder whether it will make much difference, given the Hive experiment I did above.
   
   * Hadoop version : 3.2.0 (minimum requirement for azure storage) 
   
   * Storage (HDFS/S3/GCS..) : Azure data lake gen 2 (abfss)
   
   * Running on Docker? (yes/no) : querying happens from my local machine against a port-forwarded k8s pod.
   





[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-953069335


   I got it working by providing a `--spark-datasource` parameter when syncing. What is quite confusing is that this actually **disables** `syncAsSparkDataSourceTable`, since the default value is true. The flag basically just toggles the default (described here: https://github.com/cbeust/jcommander/issues/378). Maybe it would be better to define an arity of 1 for booleans, so that you have to specify the value explicitly when you provide the flag (cf. https://jcommander.org/)?
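   
   For example, something along these lines (a sketch against the JCommander API, not Hudi's actual option class) would force an explicit value:
   ```
   import com.beust.jcommander.Parameter
   
   // Hypothetical option holder: with arity = 1 the flag always takes a value,
   // e.g. `--spark-datasource false`, instead of silently flipping the default.
   class SyncConfigSketch {
     @Parameter(names = Array("--spark-datasource"), arity = 1,
       description = "Whether to sync the table as a Spark data source table")
     var syncAsSparkDataSourceTable: Boolean = true
   }
   ```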
   
   So if I sync with `syncAsSparkDataSourceTable` I can't query my Hive tables with Spark SQL. When is this meant to be used, then? Since it's on by default, and queries like the examples on the Hudi website didn't work that way, I feel like I missed some documentation.





[GitHub] [hudi] codope commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
codope commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1035963670


   @matthiasdg Could you remove the partition extraction config (by default it is the slash-encoded day partition extractor) and try again? I have updated the gist with both failed and successful runs: https://gist.github.com/codope/c4487d35beb60e322316d9a18773103a
   
   The only difference in the successful run is that I'm using the default partition extractor class.
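   
   Concretely, the config being toggled is something like this (a sketch; the full write options are in the gist):
   ```
   // Partition extraction config referenced above. The failing run sets the
   // extractor explicitly; the successful run relies on Hudi's default,
   // org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.
   val partitionSyncOpts = Map(
     "hoodie.datasource.hive_sync.partition_extractor_class" ->
       "org.apache.hudi.hive.MultiPartKeysValueExtractor"  // drop this to use the default
   )
   ```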





[GitHub] [hudi] matthiasdg edited a comment on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg edited a comment on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-953069335


   I got it working by providing a `--spark-datasource` parameter when syncing. What is quite confusing is that this actually **disables** `syncAsSparkDataSourceTable`, since the default value is true. The flag basically just toggles the default (described here: https://github.com/cbeust/jcommander/issues/378). Maybe it would be better to define an arity of 1 for booleans, so that you have to specify the value explicitly when you provide the flag (cf. https://jcommander.org/)?
   
   So if I sync with `syncAsSparkDataSourceTable` I can't query my Hive tables with Spark SQL. When is this meant to be used, then? Since it's on by default, and queries like the examples on the Hudi website didn't work that way, I feel like I missed some documentation.
   
   Another assumption: since `syncAsSparkDataSourceTable` does not work, I still need to provide `spark.sql.hive.convertMetastoreParquet=false`, right? In that case I do get https://github.com/apache/hudi/issues/2544 when reading from a table where the timestamp option was used, so I'm using bigint for now.





[GitHub] [hudi] nsivabalan commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-961860610


   Hey, can you give us the full set of configs you used to create/write to the Hudi table, along with the sync configs used? That would help us triage the issue.





[GitHub] [hudi] nsivabalan commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1047400875


   @matthiasdg: any updates here, please?





[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1052138018


   I'll try to take a look somewhere in the coming week!





[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1022197453


   The issue is also present with Hudi 0.10.0.








[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-964200441


   Ok, should be doable.
   So typically we partition by time, and sometimes by an id as well. One example (using the datasource writer):
   ```
   df.write("org.apache.hudi")
   .options(
     Map("hoodie.insert.shuffle.parallelism" -> "4",
             "hoodie.upsert.shuffle.parallelism"->"4",
             "hoodie.embed.timeline.server.port"-> "27055",
            "hoodie.filesystem.view.remote.port" -> "27054"
     )
   )
   .options(
     Map(
       HoodieWriteConfig.TABLE_NAME                    -> "awv.tf_msm",
       DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY  -> "tijd_waarneming,unieke_id",
       DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "tijd_laatst_gewijzigd",
       DataSourceWriteOptions.TABLE_TYPE_OPT_KEY       -> HoodieTableType.MERGE_ON_READ
     )
   )
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
   .options(
     Map(
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "tijd_waarneming:TIMESTAMP",
     DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.keygen.CustomKeyGenerator].getName,
     "hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd",
     "hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "SCALAR",
     "hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit" -> "microseconds"
     )
   )
   .mode(SaveMode.Overwrite)
   ```
   (There could be a typo here or there; I reconstructed this through some abstraction layers.)
   We read raw files through Spark, hence have microsecond timestamps for all our data. We sometimes specify ports because we also run in k8s client mode, so the executor pods need to be able to reach the driver.
   
   The Spark session is something like:
   ```
   val conf = new SparkConf()
   conf.setAll(ConfigUtils.dfsADLS2_AuthConfig().iterator.toIterable)
   conf.set("spark.scheduler.mode", "FAIR")
   conf.setMaster("local[2]")
   conf.set("spark.ui.enabled", "false")
   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   
   val builder = SparkSession.builder()
   builder.config(conf)
   val session = builder.getOrCreate()
   ```
   where `dfsADLS2_AuthConfig()` returns something like:
   ```
   s"spark.hadoop.fs.azure.account.auth.type.$storageAccountKey"              → "OAuth",
       s"spark.hadoop.fs.azure.account.oauth2.client.endpoint.$storageAccountKey" → s"https://login.microsoftonline.com/$tenantId/oauth2/token",
       s"spark.hadoop.fs.azure.account.oauth.provider.type.$storageAccountKey"    → "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       s"spark.hadoop.fs.azure.account.oauth2.client.id.$storageAccountKey"       → clientId,
       s"spark.hadoop.fs.azure.account.oauth2.client.secret.$storageAccountKey"   → clientSecret
   ```
   
   For syncing to hive, I would typically use options like:
   ```
   --base-path abfss://dev@stsdpglasshouse.dfs.core.windows.net/datalakehouse/awv.tf_msm --database degeyt70 --sync-mode hms --partitioned-by year,month,day --spark-datasource --table awv_tf_msm --jdbc-url thrift://localhost:9083 --user hive --pass hive --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
   ```
   With the `--spark-datasource` option (which disables `syncAsSparkDataSourceTable`), I can query everything OK. Without it, it fails. Same behavior if I sync using `DataSourceWriteOptions`. I don't use the `--support-timestamp` option for now, since that does not work with `spark.sql.hive.convertMetastoreParquet=false`.
   
   (Data sample (it's open data):
   ```
       <meetpunt beschrijvende_id="H222L10" unieke_id="29">
           <lve_nr>55</lve_nr>
           <tijd_waarneming>2021-05-05T15:08:00+01:00</tijd_waarneming>
           <tijd_laatst_gewijzigd>2021-05-05T15:09:16+01:00</tijd_laatst_gewijzigd>
           <actueel_publicatie>1</actueel_publicatie>
           <beschikbaar>1</beschikbaar>
           <defect>0</defect>
           <geldig>0</geldig>
           <meetdata klasse_id="1">
               <verkeersintensiteit>0</verkeersintensiteit>
               <voertuigsnelheid_rekenkundig>0</voertuigsnelheid_rekenkundig>
               <voertuigsnelheid_harmonisch>252</voertuigsnelheid_harmonisch>
           </meetdata>
           <meetdata klasse_id="2">
               <verkeersintensiteit>0</verkeersintensiteit>
               <voertuigsnelheid_rekenkundig>0</voertuigsnelheid_rekenkundig>
               <voertuigsnelheid_harmonisch>252</voertuigsnelheid_harmonisch>
           </meetdata>
           <meetdata klasse_id="3">
               <verkeersintensiteit>0</verkeersintensiteit>
               <voertuigsnelheid_rekenkundig>0</voertuigsnelheid_rekenkundig>
               <voertuigsnelheid_harmonisch>252</voertuigsnelheid_harmonisch>
           </meetdata>
           <meetdata klasse_id="4">
               <verkeersintensiteit>0</verkeersintensiteit>
               <voertuigsnelheid_rekenkundig>0</voertuigsnelheid_rekenkundig>
               <voertuigsnelheid_harmonisch>252</voertuigsnelheid_harmonisch>
           </meetdata>
           <meetdata klasse_id="5">
               <verkeersintensiteit>0</verkeersintensiteit>
               <voertuigsnelheid_rekenkundig>0</voertuigsnelheid_rekenkundig>
               <voertuigsnelheid_harmonisch>252</voertuigsnelheid_harmonisch>
           </meetdata>
           <rekendata>
               <bezettingsgraad>0</bezettingsgraad>
               <beschikbaarheidsgraad>100</beschikbaarheidsgraad>
               <onrustigheid>0</onrustigheid>
           </rekendata>
       </meetpunt>
   ```
   But it happens for all the data sets I've tried so far (JSON, XML, ...), so it's not related to spark-xml.)
   
   
   





[GitHub] [hudi] codope commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
codope commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1030857030


   I tried to reproduce this issue using the above configs with the current master. The table gets created in Hive, but partition sync fails: https://gist.github.com/codope/c4487d35beb60e322316d9a18773103a








[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] Querying hudi datasets from standalone metastore

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-951981890


   Meanwhile I experimented with some other versions of the Hive metastore + MySQL running in Docker containers (e.g. 2.3.7, matching Spark). Same problems, like the Hive partition columns missing from the data:
   ```
   21/10/26 16:05:26 WARN HoodieFileIndex: Cannot do the partition prune for table abfss://dev@stsdpglasshouse.dfs.core.windows.net/devs/degeyt70/partitiontests/datalakehouse/vmm.aq_msm.The partitionFragments size (10893,2021,06,30) is not equal to the partition columns size(StructField(sensorId,LongType,false),StructField(timestamp,TimestampType,true))
   21/10/26 16:05:28 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 15) 1]
   java.io.IOException: Required column is missing in data file. Col: [hiveid]
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:314)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:154)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:329)
   ```
   Or is querying only supposed to work via JDBC?





[GitHub] [hudi] matthiasdg edited a comment on issue #3868: [SUPPORT] Querying hudi datasets from standalone metastore

Posted by GitBox <gi...@apache.org>.
matthiasdg edited a comment on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-951981890


   Meanwhile I experimented with some other versions of the Hive metastore + MySQL running in Docker containers (e.g. 2.3.7, matching Spark). Same problems, like the Hive partition columns missing from the data:
   ```
   21/10/26 16:05:26 WARN HoodieFileIndex: Cannot do the partition prune for table abfss://dev@stsdpglasshouse.dfs.core.windows.net/devs/degeyt70/partitiontests/datalakehouse/vmm.aq_msm.The partitionFragments size (10893,2021,06,30) is not equal to the partition columns size(StructField(sensorId,LongType,false),StructField(timestamp,TimestampType,true))
   21/10/26 16:05:28 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 15) 1]
   java.io.IOException: Required column is missing in data file. Col: [hiveid]
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:314)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:154)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:329)
   ```





[GitHub] [hudi] matthiasdg commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

Posted by GitBox <gi...@apache.org>.
matthiasdg commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-1069380960


   @nsivabalan @codope I finally had the time to have a look.
   I have the same problem in the spark-shell with both gist examples. We don't have a Hive server, only a metastore, so I used
   ```
   option("hoodie.datasource.hive_sync.mode", "hms").
   option("hoodie.datasource.hive_sync.jdbcurl", "thrift://localhost:9083").
   ```
   (I port-forwarded our metastore running on k8s, and used an Azure path to write to).
   
   I don't get any errors upon writing/syncing, regardless of the partition extractor.
   (For the slash-encoded day partition, I still had to replace the `hoodie.datasource.hive_sync.partition_fields` from the gist with a single value.)
   
   If I do something like `spark.sql("select * from hudi_mor_ts_ro").show`, I get in both cases the
   ```
   22/03/16 18:09:30 ERROR Executor: Exception in task 0.0 in stage 40.0 (TID 61)1]
   java.io.IOException: Required column is missing in data file. Col: [year]
   ```
   error I described earlier (`Col: [year]` in the case of the partition fields `year,month,day` with the MultiPartKeysValue extractor, or `Col: [date]` in the case of a single partition field `date` with the SlashEncoded extractor).
   
   Let me know if I should try something else (maybe run the metastore locally and write to local storage to see if that makes a difference)?
   
   
   

