Posted to dev@hudi.apache.org by Atluri Yaswanth <ya...@gmail.com> on 2020/04/08 00:33:19 UTC
Pyspark with hudi scripts
Hi Team,
I would like to know whether there are any PySpark scripts for upserting data
into a Hudi dataset.
I am working with Scala now, but I want to use PySpark because my data is not in
a good format (I need to use various libraries on it).
Thanks in advance
yaswanth
Re: Pyspark with hudi scripts
Posted by Vinoth Govindarajan <vi...@gmail.com>.
Sorry, I mixed up the names in my last comment and forgot to provide the jar info.
Hi Yaswanth,
You need to include the following three jar files using the --jars option of either the spark-submit or pyspark command before you can use the "org.apache.hudi" format in your code to create Hudi datasets.
- https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.2-incubating/hudi-spark-bundle_2.11-0.5.2-incubating.jar
- https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar
- https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.5/spark-avro_2.11-2.4.5.jar
Note: Tested with Spark 2.4.5 and Scala 2.11. Make sure the Scala version matches across all the jar files.
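As a rough illustration of the --jars step, the Python sketch below assembles the launch command from the three jar file names listed above and checks that their Scala suffixes agree. The directory /opt/jars and the helper names jars_argument, scala_versions, and launch_command are assumptions made for this example, not part of Hudi or Spark; adjust the path to wherever the jars were downloaded.

```python
import re

# Hypothetical download location for the jars (an assumption for this sketch).
JAR_DIR = "/opt/jars"

# File names taken from the three Maven links above.
JARS = [
    "hudi-spark-bundle_2.11-0.5.2-incubating.jar",
    "avro-1.8.2.jar",
    "spark-avro_2.11-2.4.5.jar",
]


def jars_argument(jar_dir, jar_names):
    """Join the jar paths with commas, the separator --jars expects."""
    return ",".join(f"{jar_dir}/{name}" for name in jar_names)


def scala_versions(jar_names):
    """Collect the Scala suffixes (e.g. '2.11') found in the jar names.

    Jars without a Scala suffix, such as avro, are skipped.
    """
    found = set()
    for name in jar_names:
        match = re.search(r"_(\d+\.\d+)-", name)
        if match:
            found.add(match.group(1))
    return found


def launch_command(tool, jar_dir=JAR_DIR, jar_names=JARS):
    """Build the argument list for 'pyspark' or 'spark-submit'."""
    versions = scala_versions(jar_names)
    if len(versions) > 1:
        raise ValueError(f"Mixed Scala versions in jars: {sorted(versions)}")
    return [tool, "--jars", jars_argument(jar_dir, jar_names)]


# Print the command line so it can be pasted into a shell.
print(" ".join(launch_command("pyspark")))
```

Passing "spark-submit" instead of "pyspark" gives the same --jars argument; the script to run would then be appended to the list.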
Thanks,
Vinoth
On 2020/04/09 01:24:56, Vinoth Govindarajan <vi...@gmail.com> wrote:
> Hi Udit,
> You can use the scripts provided by Yaswanth for reading/writing the hudi dataset using pyspark.
>
> I need to understand your requirements little bit more to add formal support.
>
> Are you looking for a python command-line tool similar to deltastreamer (https://hudi.apache.org/docs/writing_data.html#deltastreamer) for both hudi reader/writer
> or interested in using Data Source APIs like
>
> # Hudi write options: record key, precombine field, operation, and table name
> hudiOpts = {
>     "hoodie.datasource.write.recordkey.field": "uuid",
>     "hoodie.datasource.write.precombine.field": "update_timestamp",
>     "hoodie.datasource.write.operation": "upsert",
>     "hoodie.table.name": "tmp.stock_ticker"
> }
> basePath = "/tmp/stock_ticker"
> # Upsert inputDF (an existing DataFrame) into the dataset at basePath
> (inputDF.write.format("org.apache.hudi")
>     .options(**hudiOpts)
>     .mode("append")
>     .save(basePath))
>
> # Read the dataset back; spark is the active SparkSession
> basePath = "/tmp/stock_ticker/*"
> outputDF = spark.read.format("org.apache.hudi").load(basePath)
>
> Thanks,
> Vinoth
Re: Pyspark with hudi scripts
Posted by Vinoth Govindarajan <vi...@gmail.com>.
Hi Udit,
You can use the scripts provided by Yaswanth for reading/writing the hudi dataset using pyspark.
I need to understand your requirements a little better before adding formal support.
Are you looking for a Python command-line tool similar to DeltaStreamer (https://hudi.apache.org/docs/writing_data.html#deltastreamer) for both reading and writing Hudi datasets,
or are you interested in using the Data Source APIs, like this:
# Hudi write options: record key, precombine field, operation, and table name
hudiOpts = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "update_timestamp",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.table.name": "tmp.stock_ticker"
}
basePath = "/tmp/stock_ticker"
# Upsert inputDF (an existing DataFrame) into the dataset at basePath
(inputDF.write.format("org.apache.hudi")
    .options(**hudiOpts)
    .mode("append")
    .save(basePath))
# Read the dataset back; spark is the active SparkSession
basePath = "/tmp/stock_ticker/*"
outputDF = spark.read.format("org.apache.hudi").load(basePath)
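For reuse across scripts, the write call above could be wrapped roughly as follows. This is only a sketch: hudi_write_options and upsert are hypothetical helper names, not Hudi APIs, and the option keys are simply the four used in the example above. It assumes df is a PySpark DataFrame and that the Hudi jars are already on the classpath.

```python
def hudi_write_options(table_name, record_key, precombine_field,
                       operation="upsert"):
    """Build the Hudi write options used in the example above.

    All arguments are plain strings; the keys are the four option
    names from the thread, nothing more.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def upsert(df, base_path, options):
    """Write df to the Hudi dataset at base_path in append mode.

    df is expected to be a PySpark DataFrame (an assumption of this
    sketch); the function only chains the DataFrameWriter calls shown
    in the thread.
    """
    (df.write.format("org.apache.hudi")
        .options(**options)
        .mode("append")
        .save(base_path))
```

For example, upsert(inputDF, "/tmp/stock_ticker", hudi_write_options("tmp.stock_ticker", "uuid", "update_timestamp")) would perform the same write as the snippet above.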
Thanks,
Vinoth
On 2020/04/09 00:39:49, Vinoth Chandar <vi...@apache.org> wrote:
> Thanks Udit! I also believe there will be a PR soon for pySpark and we
> should have formal support next release.
Re: Pyspark with hudi scripts
Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Udit! I also believe there will be a PR for PySpark soon, and we
should have formal support in the next release.
On Wed, Apr 8, 2020 at 4:49 PM Mehrotra, Udit <ud...@amazon.com.invalid>
wrote:
> Hi Yaswanth,
>
> PFA an example I prepared sometime back which can help you get started.
>
> Thanks,
> Udit
Re: Pyspark with hudi scripts
Posted by "Mehrotra, Udit" <ud...@amazon.com.INVALID>.
Hi Yaswanth,
Please find attached an example I prepared some time back that can help you get started.
Thanks,
Udit
On 4/8/20, 3:21 PM, "Atluri Yaswanth" <ya...@gmail.com> wrote:
Hi Team,
I would like to know are there any scripts in PySpark to upsert the data in
hudi dataset.
I am working with Scala now, but i want to use Pyspark as my data is not in
good format(i need to use various libraries inside).
Thanks in advance
yaswanth