Posted to user@beam.apache.org by Akanksha Sharma B <ak...@ericsson.com> on 2018/07/24 14:47:25 UTC

pipeline with parquet and sql

Hi,


Please consider the following pipeline:


The source is a Parquet file with hundreds of columns.

The sink is also Parquet. Multiple output Parquet files are generated after applying some SQL joins, and the joins to be applied differ for each output file. Let's assume we have a SQL query generator or a configuration file with the needed information.


Can this be implemented generically, so that there is no need for the schema of the Parquet files involved, nor for any intermediate POJO or Beam schema?

That is, the way Spark can handle it: read Parquet into a DataFrame, create a temp view, apply SQL queries to it, and write it back to Parquet.

As I understand it, Beam SQL needs a Beam Schema or POJOs, and ParquetIO needs Avro schemas. Ideally we don't want to see POJOs or schemas at all.
If there is a way we can achieve this with Beam, please do help.
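For reference, here is a rough sketch of how this currently has to be spelled out with the Beam Java SDK, with the Avro schema and a matching Beam schema supplied explicitly. The record layout ("id", "name"), paths and class name are placeholders, and it assumes a Beam release with beam-sdks-java-io-parquet and beam-sdks-java-extensions-sql available (exact package locations vary between versions):

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class ParquetSqlSketch {

  // Placeholder Avro schema; in practice this is exactly what we would like to avoid writing by hand.
  private static final String AVRO_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}";

  private static org.apache.avro.Schema avroSchema() {
    return new org.apache.avro.Schema.Parser().parse(AVRO_SCHEMA_JSON);
  }

  // A Beam schema mirroring the Avro one, declared by hand for the SQL step.
  private static final Schema BEAM_SCHEMA =
      Schema.builder().addInt64Field("id").addStringField("name").build();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Read Parquet as Avro GenericRecords (ParquetIO needs the Avro schema up front)
    // and convert each record to a Beam Row so that Beam SQL can work with it.
    PCollection<Row> rows =
        p.apply(ParquetIO.read(avroSchema()).from("/path/to/input/*.parquet"))
            .apply(MapElements.into(TypeDescriptor.of(Row.class))
                .via((GenericRecord r) -> Row.withSchema(BEAM_SCHEMA)
                    .addValues((Long) r.get("id"), r.get("name").toString())
                    .build()))
            .setRowSchema(BEAM_SCHEMA);

    // One query per output; the query text could come from a configuration file.
    // Joins over several inputs would use a PCollectionTuple with named tags instead of PCOLLECTION.
    PCollection<Row> result =
        rows.apply(SqlTransform.query("SELECT id, name FROM PCOLLECTION WHERE id > 0"));

    // Convert Rows back to GenericRecords and write Parquet, again with the Avro schema.
    result
        .apply(MapElements.into(TypeDescriptor.of(GenericRecord.class))
            .via((Row row) -> new GenericRecordBuilder(avroSchema())
                .set("id", row.getInt64("id"))
                .set("name", row.getString("name"))
                .build()))
        .setCoder(AvroCoder.of(avroSchema()))
        .apply(FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(avroSchema()))
            .to("/path/to/output/")
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}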

Regards,
Akanksha



Re: Schema Aware PCollections

Posted by Akanksha Sharma B <ak...@ericsson.com>.
Hi Anton,


Thank you !!!


Regards,

Akanksha

________________________________
From: Anton Kedin <ke...@google.com>
Sent: Wednesday, August 8, 2018 9:57:33 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: Schema Aware PCollections

Yes, this should be possible eventually. In fact, a limited version of this functionality is already supported for Beans (e.g. see this test<https://github.com/apache/beam/blob/20d95a57ad7e5a4c20b2d0824675afefe52dfe9c/sdks/java/core/src/test/java/org/apache/beam/sdk/schemas/JavaBeanSchemaTest.java>), but it's still experimental and there are no good end-to-end examples yet.

Regards,
Anton

On Wed, Aug 8, 2018 at 5:45 AM Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


(changed the email-subject to make it generic)


It is mentioned in Schema-Aware PCollections design doc (https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc)


"There are a number of existing data types from which schemas can be inferred. Protocol buffers, Avro objects, Json objects, POJOs, primitive Java types - all of these have schemas that can be inferred from the type itself at pipeline-construction time. We should be able to automatically infer these schemas with a minimum of involvement from the programmer. "

Can I assume that the following usecase will be possible sometime in future :-
"read parquet (along with inferred schema) into something like dataframe or Beam Rows. And vice versa for write i.e. get rows and write parquet based on Row's schema.""

Regards,
Akanksha

________________________________
From: Chamikara Jayalath <ch...@google.com>
Sent: Wednesday, August 1, 2018 3:57 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql



On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


Thanks. I understood the Parquet point. I will wait for couple of days on this topic. Even if this scenario cannot be achieved now, any design document or future plans towards this direction will also be helpful to me.


To summarize, I do not understand beam well enough, can someone please help me and comment whether the following fits with beam's model and future direction :-

"read parquet (along with inferred schema) into something like dataframe or Beam Rows. And vice versa for write i.e. get rows and write parquet based on Row's schema."

Beam currently does not have a standard message format. A Beam pipeline consists of PCollections and transforms (that converts PCollections to other PCollections). You can transform the PCollection read from Parquet using a ParDo and writing the resulting transform back to Parquet format. I think Schema aware PCollections [1] might be close to what you need but not sure if it fulfills your exact requirement.

Thanks,
Cham

[1]  https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E





Regards,

Akanksha


________________________________
From: Łukasz Gajowy <lu...@gmail.com>
Sent: Tuesday, July 31, 2018 12:43:32 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql

In terms of schema and ParquetIO source/sink, there was an answer in some previous thread:

Currently (without introducing any change in ParquetIO) there is no way to not pass the avro schema. It will probably be replaced with Beam's schema in the future ()

[1] https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E


On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below was understandable. I hope it is a valid use-case. Please let me know if I need to explain the scenario better.


Regards,

Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha


________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: user@beam.apache.org
Subject: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha



Re: Schema Aware PCollections

Posted by Anton Kedin <ke...@google.com>.
Yes, this should be possible eventually. In fact, a limited version of this
functionality is already supported for Beans (e.g. see this test
<https://github.com/apache/beam/blob/20d95a57ad7e5a4c20b2d0824675afefe52dfe9c/sdks/java/core/src/test/java/org/apache/beam/sdk/schemas/JavaBeanSchemaTest.java>),
but it's still experimental and there are no good end-to-end examples yet.
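A minimal sketch of the bean-based inference mentioned above (the bean and its fields are made up for illustration, and it assumes a Beam release that ships JavaBeanSchema):

import org.apache.beam.sdk.schemas.JavaBeanSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Beam infers the schema from the getters/setters at pipeline-construction time,
// so a PCollection<Order> becomes schema-aware without a hand-written Beam Schema.
@DefaultSchema(JavaBeanSchema.class)
public class Order {
  private long id;
  private String customer;

  public Order() {}

  public long getId() { return id; }
  public void setId(long id) { this.id = id; }

  public String getCustomer() { return customer; }
  public void setCustomer(String customer) { this.customer = customer; }
}

A PCollection of such beans can then be handed to schema-aware transforms (for example Beam SQL) without declaring the schema separately.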

Regards,
Anton

On Wed, Aug 8, 2018 at 5:45 AM Akanksha Sharma B <
akanksha.b.sharma@ericsson.com> wrote:

> Hi,
>
>
> (changed the email-subject to make it generic)
>
>
> It is mentioned in Schema-Aware PCollections design doc (
> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc
> )
>
>
> "There are a number of existing data types from which schemas can be
> inferred. Protocol buffers, Avro objects, Json objects, POJOs, primitive
> Java types - all of these have schemas that can be inferred from the type
> itself at pipeline-construction time. We should be able to automatically
> infer these schemas with a minimum of involvement from the programmer. "
>
> Can I assume that the following usecase will be possible sometime in
> future :-
> "read parquet (along with inferred schema) into something like dataframe
> or Beam Rows. And vice versa for write i.e. get rows and write parquet
> based on Row's schema.""
>
> Regards,
> Akanksha
>
> ------------------------------
> *From:* Chamikara Jayalath <ch...@google.com>
> *Sent:* Wednesday, August 1, 2018 3:57 PM
> *To:* user@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
>
> On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <
> akanksha.b.sharma@ericsson.com> wrote:
>
> Hi,
>
>
> Thanks. I understood the Parquet point. I will wait for couple of days on
> this topic. Even if this scenario cannot be achieved now, any design
> document or future plans towards this direction will also be helpful to me.
>
>
> To summarize, I do not understand beam well enough, can someone please
> help me and comment whether the following fits with beam's model and
> future direction :-
>
> "read parquet (along with inferred schema) into something like dataframe
> or Beam Rows. And vice versa for write i.e. get rows and write parquet
> based on Row's schema."
>
>
> Beam currently does not have a standard message format. A Beam pipeline
> consists of PCollections and transforms (that converts PCollections to
> other PCollections). You can transform the PCollection read from Parquet
> using a ParDo and writing the resulting transform back to Parquet format. I
> think Schema aware PCollections [1] might be close to what you need but not
> sure if it fulfills your exact requirement.
>
> Thanks,
> Cham
>
> [1]
> https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E
>
>
>
>
>
> Regards,
>
> Akanksha
>
>
> ------------------------------
> *From:* Łukasz Gajowy <lu...@gmail.com>
> *Sent:* Tuesday, July 31, 2018 12:43:32 PM
> *To:* user@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
> In terms of schema and ParquetIO source/sink, there was an answer in some
> previous thread:
>
> Currently (without introducing any change in ParquetIO) there is no way to
> not pass the avro schema. It will probably be replaced with Beam's schema
> in the future ()
>
> [1]
> https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
>
>
> On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com>
> wrote:
>
> Hi,
>
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below was understandable. I hope it is a
> valid use-case. Please let me know if I need to explain the scenario
> better.
>
>
> Regards,
>
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Tuesday, July 24, 2018 4:47:25 PM
> *To:* user@beam.apache.org
> *Subject:* pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
>
>
>


Schema Aware PCollections

Posted by Akanksha Sharma B <ak...@ericsson.com>.
Hi,


(changed the email-subject to make it generic)


It is mentioned in Schema-Aware PCollections design doc (https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc)


"There are a number of existing data types from which schemas can be inferred. Protocol buffers, Avro objects, Json objects, POJOs, primitive Java types - all of these have schemas that can be inferred from the type itself at pipeline-construction time. We should be able to automatically infer these schemas with a minimum of involvement from the programmer. "

Can I assume that the following use case will be possible sometime in the future:
"read parquet (along with the inferred schema) into something like a dataframe or Beam Rows, and vice versa for write, i.e. get Rows and write parquet based on the Row's schema."

Regards,
Akanksha

________________________________
From: Chamikara Jayalath <ch...@google.com>
Sent: Wednesday, August 1, 2018 3:57 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql



On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


Thanks. I understood the Parquet point. I will wait for couple of days on this topic. Even if this scenario cannot be achieved now, any design document or future plans towards this direction will also be helpful to me.


To summarize, I do not understand beam well enough, can someone please help me and comment whether the following fits with beam's model and future direction :-

"read parquet (along with inferred schema) into something like dataframe or Beam Rows. And vice versa for write i.e. get rows and write parquet based on Row's schema."

Beam currently does not have a standard message format. A Beam pipeline consists of PCollections and transforms (that converts PCollections to other PCollections). You can transform the PCollection read from Parquet using a ParDo and writing the resulting transform back to Parquet format. I think Schema aware PCollections [1] might be close to what you need but not sure if it fulfills your exact requirement.

Thanks,
Cham

[1]  https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E





Regards,

Akanksha


________________________________
From: Łukasz Gajowy <lu...@gmail.com>
Sent: Tuesday, July 31, 2018 12:43:32 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql

In terms of schema and ParquetIO source/sink, there was an answer in some previous thread:

Currently (without introducing any change in ParquetIO) there is no way to not pass the avro schema. It will probably be replaced with Beam's schema in the future ()

[1] https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E


On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below was understandable. I hope it is a valid use-case. Please let me know if I need to explain the scenario better.


Regards,

Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha


________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: user@beam.apache.org
Subject: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha



Re: pipeline with parquet and sql

Posted by Chamikara Jayalath <ch...@google.com>.
On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <
akanksha.b.sharma@ericsson.com> wrote:

> Hi,
>
>
> Thanks. I understood the Parquet point. I will wait for couple of days on
> this topic. Even if this scenario cannot be achieved now, any design
> document or future plans towards this direction will also be helpful to me.
>
>
> To summarize, I do not understand beam well enough, can someone please
> help me and comment whether the following fits with beam's model and
> future direction :-
>
> "read parquet (along with inferred schema) into something like dataframe
> or Beam Rows. And vice versa for write i.e. get rows and write parquet
> based on Row's schema."
>

Beam currently does not have a standard message format. A Beam pipeline
consists of PCollections and transforms (which convert PCollections into
other PCollections). You can transform the PCollection read from Parquet
using a ParDo and write the resulting PCollection back in Parquet format. I
think schema-aware PCollections [1] might be close to what you need, but I am
not sure whether they fulfill your exact requirement.
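For example, a bare-bones version of that ParDo step might look like the following sketch (imports and pipeline setup omitted; "pipeline", "avroSchema" and the "id" field are placeholders assumed to exist):

// Read Parquet as GenericRecords, keep a subset of them, and write Parquet again.
PCollection<GenericRecord> records =
    pipeline.apply(ParquetIO.read(avroSchema).from("/path/to/input/*.parquet"));

PCollection<GenericRecord> filtered =
    records.apply(ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        GenericRecord record = c.element();
        // Any per-record logic goes here; this sketch just filters on one field.
        if (((Long) record.get("id")) > 0L) {
          c.output(record);
        }
      }
    }));

filtered
    .setCoder(AvroCoder.of(avroSchema))
    .apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(avroSchema)).to("/path/to/output/"));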

Thanks,
Cham

[1]
https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E


>
>
>
> Regards,
>
> Akanksha
>
>
> ------------------------------
> *From:* Łukasz Gajowy <lu...@gmail.com>
> *Sent:* Tuesday, July 31, 2018 12:43:32 PM
> *To:* user@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
> In terms of schema and ParquetIO source/sink, there was an answer in some
> previous thread:
>
> Currently (without introducing any change in ParquetIO) there is no way to
> not pass the avro schema. It will probably be replaced with Beam's schema
> in the future ()
>
> [1]
> https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
>
>
> On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com>
> wrote:
>
> Hi,
>
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below was understandable. I hope it is a
> valid use-case. Please let me know if I need to explain the scenario
> better.
>
>
> Regards,
>
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Tuesday, July 24, 2018 4:47:25 PM
> *To:* user@beam.apache.org
> *Subject:* pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
>
>
>


Re: pipeline with parquet and sql

Posted by Akanksha Sharma B <ak...@ericsson.com>.
Hi,


Thanks. I understood the Parquet point. I will wait a couple of days on this topic. Even if this scenario cannot be achieved now, any design document or future plans in this direction would also be helpful to me.


To summarize (I do not understand Beam well enough), can someone please comment on whether the following fits Beam's model and future direction:

"read parquet (along with the inferred schema) into something like a dataframe or Beam Rows, and vice versa for write, i.e. get Rows and write parquet based on the Row's schema."



Regards,

Akanksha


________________________________
From: Łukasz Gajowy <lu...@gmail.com>
Sent: Tuesday, July 31, 2018 12:43:32 PM
To: user@beam.apache.org
Cc: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql

In terms of schema and ParquetIO source/sink, there was an answer in some previous thread:

Currently (without introducing any change in ParquetIO) there is no way to not pass the avro schema. It will probably be replaced with Beam's schema in the future ()

[1] https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E


On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com> wrote:

Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below was understandable. I hope it is a valid use-case. Please let me know if I need to explain the scenario better.


Regards,

Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha


________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: user@beam.apache.org
Subject: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha



Re: pipeline with parquet and sql

Posted by Łukasz Gajowy <lu...@gmail.com>.
Sorry, I sent an unfinished message.

In terms of schema and ParquetIO source/sink, there was an answer in some
previous thread [1]. Currently (without introducing any change in
ParquetIO) there is no way to not pass the avro schema. It will probably be
replaced with Beam's schema in the future [2].

[1]
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-4812
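Until that happens, the Avro schema at least does not have to be pasted into the code or mirrored as a POJO; it can be loaded from an .avsc file at pipeline-construction time, for example (paths are placeholders, and Schema.Parser#parse(File) throws IOException, so the enclosing method must declare it):

// Load the Avro schema from a file instead of embedding it in the code,
// then hand it to ParquetIO.read / ParquetIO.sink as usual.
org.apache.avro.Schema avroSchema =
    new org.apache.avro.Schema.Parser().parse(new java.io.File("/path/to/record.avsc"));

PCollection<GenericRecord> records =
    pipeline.apply(ParquetIO.read(avroSchema).from("/path/to/input/*.parquet"));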

On Tue, Jul 31, 2018 at 12:43 Łukasz Gajowy <lu...@gmail.com> wrote:

> In terms of schema and ParquetIO source/sink, there was an answer in some
> previous thread:
>
> Currently (without introducing any change in ParquetIO) there is no way to
> not pass the avro schema. It will probably be replaced with Beam's schema
> in the future ()
>
> [1]
> https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
>
>
> On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com>
> wrote:
>
>> Hi,
>>
>>
>> I am hoping to get some hints/pointers from the experts here.
>>
>> I hope the scenario described below was understandable. I hope it is a
>> valid use-case. Please let me know if I need to explain the scenario
>> better.
>>
>>
>> Regards,
>>
>> Akanksha
>>
>> ------------------------------
>> *From:* Akanksha Sharma B
>> *Sent:* Friday, July 27, 2018 9:44 AM
>> *To:* dev@beam.apache.org
>> *Subject:* Re: pipeline with parquet and sql
>>
>>
>> Hi,
>>
>>
>> Please consider following pipeline:-
>>
>>
>> Source is Parquet file, having hundreds of columns.
>>
>> Sink is Parquet. Multiple output parquet files are generated after
>> applying some sql joins. Sql joins to be applied differ for each output
>> parquet file. Lets assume we have a sql queries generator or some
>> configuration file with the needed info.
>>
>>
>> Can this be implemented generically, such that there is no need of the
>> schema of the parquet files involved or any intermediate POJO or beam
>> schema.
>>
>> i.e. the way spark can handle it - read parquet into dataframe, create
>> temp view and apply sql queries to it, and write it back to parquet.
>>
>> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO
>> needs avro schemas. Ideally we dont want to see POJOs or schemas.
>> If there is a way we can achieve this with beam, please do help.
>>
>> Regards,
>> Akanksha
>>
>> ------------------------------
>> *From:* Akanksha Sharma B
>> *Sent:* Tuesday, July 24, 2018 4:47:25 PM
>> *To:* user@beam.apache.org
>> *Subject:* pipeline with parquet and sql
>>
>>
>> Hi,
>>
>>
>> Please consider following pipeline:-
>>
>>
>> Source is Parquet file, having hundreds of columns.
>>
>> Sink is Parquet. Multiple output parquet files are generated after
>> applying some sql joins. Sql joins to be applied differ for each output
>> parquet file. Lets assume we have a sql queries generator or some
>> configuration file with the needed info.
>>
>>
>> Can this be implemented generically, such that there is no need of the
>> schema of the parquet files involved or any intermediate POJO or beam
>> schema.
>>
>> i.e. the way spark can handle it - read parquet into dataframe, create
>> temp view and apply sql queries to it, and write it back to parquet.
>>
>> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO
>> needs avro schemas. Ideally we dont want to see POJOs or schemas.
>> If there is a way we can achieve this with beam, please do help.
>>
>> Regards,
>> Akanksha
>>
>>
>>
>>


Re: pipeline with parquet and sql

Posted by Łukasz Gajowy <lu...@gmail.com>.
In terms of schema and ParquetIO source/sink, there was an answer in some
previous thread:

Currently (without introducing any change in ParquetIO) there is no way to
not pass the avro schema. It will probably be replaced with Beam's schema
in the future ()

[1]
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E


On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B <ak...@ericsson.com>
wrote:

> Hi,
>
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below was understandable. I hope it is a
> valid use-case. Please let me know if I need to explain the scenario
> better.
>
>
> Regards,
>
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
> ------------------------------
> *From:* Akanksha Sharma B
> *Sent:* Tuesday, July 24, 2018 4:47:25 PM
> *To:* user@beam.apache.org
> *Subject:* pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
>
>
>


Re: pipeline with parquet and sql

Posted by Akanksha Sharma B <ak...@ericsson.com>.
Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below is understandable and that it is a valid use case. Please let me know if I need to explain the scenario better.


Regards,

Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: dev@beam.apache.org
Subject: Re: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha


________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: user@beam.apache.org
Subject: pipeline with parquet and sql


Hi,


Please consider following pipeline:-


Source is Parquet file, having hundreds of columns.

Sink is Parquet. Multiple output parquet files are generated after applying some sql joins. Sql joins to be applied differ for each output parquet file. Lets assume we have a sql queries generator or some configuration file with the needed info.


Can this be implemented generically, such that there is no need of the schema of the parquet files involved or any intermediate POJO or beam schema.

i.e. the way spark can handle it - read parquet into dataframe, create temp view and apply sql queries to it, and write it back to parquet.

As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs avro schemas. Ideally we dont want to see POJOs or schemas.
If there is a way we can achieve this with beam, please do help.

Regards,
Akanksha