Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/23 20:58:27 UTC

[GitHub] [hudi] rubenssoto opened a new issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

rubenssoto opened a new issue #2013:
URL: https://github.com/apache/hudi/issues/2013


   Hi Guys,
   
   I have a table that can be updated at any point in time, so I would like to try MoR tables. This table will be a source for my Redshift DW, so I need a way to pull the data incrementally.
   
   I saw that the Spark DataSource only queries MoR tables in batch, so it would be good to have full Hudi support in the Spark DataSource and as a Spark Structured Streaming source.
   
   I found some Jira tickets on this topic.
   
   https://issues.apache.org/jira/projects/HUDI/issues/HUDI-920?filter=allopenissues
   
   https://issues.apache.org/jira/projects/HUDI/issues/HUDI-1109?filter=allopenissues


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] garyli1019 commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-678881198


   @rubenssoto Hello, incremental pulling for MOR tables is currently under review and will be available in the 0.6.1 release, which will follow shortly after the 0.6.0 release.



[GitHub] [hudi] fripple commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

fripple commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767706110


   Yes, I'm using Spark as provided by AWS. Is there any way to make this work, or am I out of luck until AWS EMR supports Hudi 0.7?



[GitHub] [hudi] vinothchandar closed issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

vinothchandar closed issue #2013:
URL: https://github.com/apache/hudi/issues/2013


   



[GitHub] [hudi] nsivabalan commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767204986


   @garyli1019 : can you share any updates you have on this?



[GitHub] [hudi] bvaradar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-678842778


   @garyli1019 : I'll let you answer this question.



[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767265499


   This is now out in the 0.7.0 release. 
   
   See this test for examples: https://github.com/apache/hudi/blame/release-0.7.0/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala#L183
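
For readers who want the general shape of such a query without opening the test, here is a minimal PySpark sketch of an incremental read. The option keys are the ones quoted elsewhere in this thread, plus `hoodie.datasource.read.end.instanttime`. The helper names `incremental_read_options` and `read_incremental` and the sample instant value are hypothetical. The sketch assumes instant times are passed as strings in Hudi's commit timestamp format and that a Hudi bundle built for the cluster's Spark version is on the classpath.

```python
from typing import Dict, Optional


def incremental_read_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build Hudi DataSource read options for an incremental query.

    Values are coerced to str because instant times are assumed to be
    passed as strings such as "20210125000000".
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": str(begin_instant),
    }
    if end_instant is not None:
        # Optional upper bound on the commit range to read.
        options["hoodie.datasource.read.end.instanttime"] = str(end_instant)
    return options


def read_incremental(spark, base_path: str, begin_instant: str,
                     end_instant: Optional[str] = None):
    """Return a DataFrame of records committed after begin_instant.

    `spark` is an existing SparkSession; this function is not called here
    because it needs a running Spark cluster with Hudi available.
    """
    return (
        spark.read.format("org.apache.hudi")
        .options(**incremental_read_options(begin_instant, end_instant))
        .load(base_path)
    )


if __name__ == "__main__":
    # Hypothetical instant value, for illustration only.
    print(incremental_read_options("20210125000000"))
```

Keeping the option building separate from the Spark call makes it possible to check the configuration without a cluster.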



[GitHub] [hudi] fripple commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

fripple commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767624984


   @vinothchandar Will this work with a table created using an older version of hudi?
   
   When I run the following with Hudi 0.7.0 (Spark 3.0.0 on EMR 6.1, with spark-avro_2.12-3.0.0), I get the error shown after the code:
   incremental_read_options = {
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': beginTime - 1
   }
   incremental = spark.read.format("org.apache.hudi"). \
       options(**incremental_read_options). \
       load(basePath)
   
   
   An error occurred while calling o127.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadIncrementalRelation.$anonfun$buildFileIndex$7(MergeOnReadIncrementalRelation.scala:195)
   	at scala.collection.immutable.List.map(List.scala:286)
   	at org.apache.hudi.MergeOnReadIncrementalRelation.buildFileIndex(MergeOnReadIncrementalRelation.scala:189)
   	at org.apache.hudi.MergeOnReadIncrementalRelation.<init>(MergeOnReadIncrementalRelation.scala:80)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 131, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o127.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadIncrementalRelation.$anonfun$buildFileIndex$7(MergeOnReadIncrementalRelation.scala:195)
   	at scala.collection.immutable.List.map(List.scala:286)
   	at org.apache.hudi.MergeOnReadIncrementalRelation.buildFileIndex(MergeOnReadIncrementalRelation.scala:189)
   	at org.apache.hudi.MergeOnReadIncrementalRelation.<init>(MergeOnReadIncrementalRelation.scala:80)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   



[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767720824


   I think AWS has to support/recompile against their Spark version. cc @umehrot2 
   
   For now, can you test using Apache Spark?



[GitHub] [hudi] vinothchandar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767703987


   This error seems to be caused by using the AWS Spark distro. This change would work with any table written using previous versions. 
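
To make the `NoSuchMethodError` quoted in this thread easier to read, the sketch below decodes a JVM method descriptor into plain type names. It follows the descriptor grammar from the JVM specification; the function names are hypothetical and it is only a reading aid, not part of Hudi or Spark. Applied to the descriptor from the error, it shows that the Hudi jar calls a `PartitionedFile` constructor taking `(InternalRow, String, long, long, String[])`, which the Spark build on the cluster apparently does not provide.

```python
from typing import List, Tuple

# Single-letter primitive codes used in JVM descriptors.
_PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _parse_type(desc: str, pos: int) -> Tuple[str, int]:
    """Parse one type starting at desc[pos]; return (type name, next position)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: L<class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                            # primitive type or void
        name = _PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def parse_method_descriptor(desc: str) -> Tuple[List[str], str]:
    """Split a descriptor like '(JJ)V' into (parameter types, return type)."""
    if not desc.startswith("("):
        raise ValueError("method descriptor must start with '('")
    pos = 1
    params: List[str] = []
    while desc[pos] != ")":
        type_name, pos = _parse_type(desc, pos)
        params.append(type_name)
    return_type, _ = _parse_type(desc, pos + 1)
    return params, return_type


if __name__ == "__main__":
    descriptor = ("(Lorg/apache/spark/sql/catalyst/InternalRow;"
                  "Ljava/lang/String;JJ[Ljava/lang/String;)V")
    params, return_type = parse_method_descriptor(descriptor)
    print(params, return_type)
```

Comparing the decoded parameter list with the constructor actually present in the cluster's Spark jars is one way to confirm a binary mismatch like the one suspected here.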

