Posted to user@spark.apache.org by lfiaschi <lu...@gmail.com> on 2015/07/18 13:19:28 UTC

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table, and I would like to load
it into a pyspark RDD for ETL processing.

I realized that Google provides a Hadoop InputFormat / OutputFormat connector for BigQuery:

https://cloud.google.com/hadoop/writing-with-bigquery-connector
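From that page, the connector seems to be driven entirely by Hadoop job
properties. This is the configuration I have pieced together so far
(untested sketch; the mapred.bq.* key names come from the connector
examples, and the project/bucket/dataset/table values below are just
placeholders for my own):

    conf = {
        # Project billed for the read, plus a GCS bucket/path where the
        # connector stages its temporary export files (placeholders):
        "mapred.bq.project.id": "my-project-id",
        "mapred.bq.gcs.bucket": "my-temp-bucket",
        "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/tmp/bigquery_job",
        # The BigQuery table to load:
        "mapred.bq.input.project.id": "my-project-id",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }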

and pyspark should be able to consume this interface and create an RDD via
the SparkContext method "newAPIHadoopRDD":

http://spark.apache.org/docs/latest/api/python/pyspark.html
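If I read those docs correctly, the call would then look roughly like this
(again untested; it assumes the BigQuery connector jar is on the driver and
executor classpath, e.g. passed via --jars, and that
JsonTextBigQueryInputFormat hands each row to Python as a JSON string,
which is what the examples I found suggest):

    import json

    # Key is the row offset (LongWritable); value is the row serialized
    # as JSON text by JsonTextBigQueryInputFormat.
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)

    # Drop the offsets and parse each row into a Python dict for the
    # downstream ETL steps.
    rows = table_data.map(lambda kv: json.loads(kv[1]))
    print(rows.take(3))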

Unfortunately, the documentation on both sides is scarce, and I cannot tell
whether the sketch above is even close; putting the pieces together goes
beyond my current knowledge of Hadoop/Spark/BigQuery. Has anybody figured
out how to do this end to end?

Thanks!


