Posted to dev@kudu.apache.org by Andy Grove <an...@gmail.com> on 2016/05/06 00:51:37 UTC

Intro & kudu-spark support for persisting DataFrames

Hi,

I'm working with some colleagues at AgilData on Spark/Kudu integration and
we expect to be able to contribute a number of features to the code base.

To kick things off, here is a gerrit for discussion that adds support for
persisting a DataFrame to a Kudu table. It would be great to hear feedback
and feature requests for this capability.

http://gerrit.cloudera.org:8080/#/c/2969/
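
Roughly, the usage we have in mind looks like this (a sketch only; the data
source name and option keys below are illustrative, not the actual API in the
patch):

import org.apache.spark.sql.{DataFrame, SaveMode}

def persistToKudu(df: DataFrame): Unit = {
  df.write
    .format("org.kududb.spark")                 // hypothetical data source name
    .option("kudu.master", "kudu-master:7051")  // hypothetical option keys
    .option("kudu.table", "my_table")
    .mode(SaveMode.Append)
    .save()
}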

Thanks,

Andy.

Re: Intro & kudu-spark support for persisting DataFrames

Posted by Mike Percy <mp...@apache.org>.
Hey Andy, I think you may have solved this already, but here is how to apply
an after-the-fact update to a prior patch. (Assume you have patches 1, 2, and
3, where patch 3 is logically part of the change in patch 1.) You do an
interactive git rebase (git rebase -i) and squash the 3rd patch into the
first patch (making sure to retain the Change-Id of the first patch in the
git commit message). After the rebase you are left with only 2 commits
again. Then you just repush both commits to Gerrit. Actually, in this case it
is possible to repush only commit 1 (git checkout commit-1-hash; push to
gerrit; git checkout branch-name).
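
For concreteness, the sequence looks roughly like this (the remote name is a
placeholder for whatever your checkout calls the Gerrit remote):

git rebase -i HEAD~3
  # In the editor, move the line for patch 3 directly under patch 1 and
  # change its "pick" to "squash" (or "fixup"); keep patch 1's Change-Id
  # line in the combined commit message.
git push <gerrit-remote> HEAD:refs/for/master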

HTH,
Mike

On Fri, May 6, 2016 at 6:47 AM, Andy Grove <an...@gmail.com> wrote:

> Mike,
>
> Thanks. I have amended the original commit to assume the table already
> exists and have mostly addressed Dan's feedback, and then created a second
> commit to add deleteTable/createTable methods. I made CreateTableOptions a
> parameter to createTable but I expect more may be required.
>
> After doing all of this I realized that I did not address one of Dan's
> comments on the first commit and I'm not sure how to address that with
> Gerrit so I'll go ahead and issue a third commit to improve data type
> coverage later today.
>
> I'd like to add some tests too, so I'm working on getting the test suite to
> run at the moment.
>
> Thanks,
>
> Andy.
>
> On Thu, May 5, 2016 at 9:20 PM, Mike Percy <mp...@cloudera.com> wrote:
>
> > Hey Andy, thanks for posting your code to Gerrit. It's great to see more
> > work progressing on the Spark side of things!
> >
> > Some other stuff you guys might want to consider for the create table
> > feature, in addition to the partitioning details, is the ability to
> > specify the replication factor. In fact, one might want to specify any
> > of the stuff in the CreateTableRequestPB
> > <https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:
> >
> > message CreateTableRequestPB {
> >   required string name = 1;
> >   required SchemaPB schema = 2;
> >   optional RowOperationsPB split_rows = 6;
> >   optional PartitionSchemaPB partition_schema = 7;
> >   optional int32 num_replicas = 4;
> > }
> >
> > Ideally you could even specify encodings for the columns (dictionary
> > encoding, etc) as part of the schema.
> >
> > All of this does seem orthogonal to the DataFrame writing capability,
> > which is probably more straightforward in terms of features. Consider treating
> > these as two separate patches: one that assumes the table already exists,
> > and one that adds the table creation capability. Gerrit will treat two
> > separate commits as two separate code reviews, even if one depends on the
> > other (it can track that). Doing that will make it easier to review and
> > probably faster to get the different parts of it committed.
> >
> > Mike
> >
> >
> > --
> > Mike Percy
> > Software Engineer, Cloudera
> >
> >
> > On Thu, May 5, 2016 at 7:52 PM, Andy Grove <an...@gmail.com> wrote:
> >
> > > Hi Dan,
> > >
> > > Thanks for the feedback. I'm thinking that the PK could be specified by
> > > passing in a tuple of column indices to use for the key (assuming that
> > > composite keys are supported?).
> > >
> > > Maybe it would be useful for KuduContext to have createTable(name:
> > > String, schema: StructType) and dropTable(name: String) methods that
> > > can eventually be wrapped in Spark SQL?
> > >
> > > I'm just getting up to speed with Kudu by trying to run TPC-H via
> > > Spark, and the use case driving the current code was being able to
> > > take existing DataFrames (created by reading CSV files) and save them
> > > to Kudu so that we can then start testing the TPC-H SQL queries.
> > >
> > > We will be experimenting with both Spark SQL and our own SQL parser /
> > > query planner for Spark, which we are planning to open source too.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <da...@cloudera.com> wrote:
> > >
> > > > Hey Andy,
> > > >
> > > > Thanks for the patch!  I left you some specific feedback on the
> > > > gerrit review, but I want to discuss the high-level approach a bit.
> > > > I think the patch as it's written now is going to have limited use,
> > > > because it doesn't allow for specifying primary keys or partitioning,
> > > > which are critical for correctness and performance. In the long run
> > > > we will definitely want to be able to create tables through Spark
> > > > SQL, but perhaps we should start off with just inserting/updating
> > > > rows in existing tables.  It would be interesting to see how other
> > > > databases have solved this problem, since I'm sure we're not the only
> > > > ones with configuration options on table create.  The relational
> > > > databases in particular must have PK options.
> > > >
> > > > - Dan
> > > >
> > > > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <an...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm working with some colleagues at AgilData on Spark/Kudu
> > > > > integration and we expect to be able to contribute a number of
> > > > > features to the code base.
> > > > >
> > > > > To kick things off, here is a gerrit for discussion that adds
> > > > > support for persisting a DataFrame to a Kudu table. It would be
> > > > > great to hear feedback and feature requests for this capability.
> > > > >
> > > > > http://gerrit.cloudera.org:8080/#/c/2969/
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > >
> > >
> >
>

Re: Intro & kudu-spark support for persisting DataFrames

Posted by Andy Grove <an...@gmail.com>.
Mike,

Thanks. I have amended the original commit to assume the table already
exists and have mostly addressed Dan's feedback, and then created a second
commit to add deleteTable/createTable methods. I made CreateTableOptions a
parameter to createTable but I expect more may be required.
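
Roughly, the new methods have this shape (a sketch only, since the commit is
still under review; package names follow the Kudu Java client, which used
org.kududb rather than org.apache.kudu in incubator-era releases):

import org.apache.kudu.Schema
import org.apache.kudu.client.{CreateTableOptions, KuduClient}

class KuduContext(kuduMaster: String) {
  private val client = new KuduClient.KuduClientBuilder(kuduMaster).build()

  // CreateTableOptions carries partitioning, replication factor, etc., so
  // the signature can grow without changing shape.
  def createTable(name: String, schema: Schema, options: CreateTableOptions): Unit =
    client.createTable(name, schema, options)

  def deleteTable(name: String): Unit =
    client.deleteTable(name)
}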

After doing all of this I realized that I did not address one of Dan's
comments on the first commit and I'm not sure how to address that with
Gerrit so I'll go ahead and issue a third commit to improve data type
coverage later today.

I'd like to add some tests too, so I'm working on getting the test suite to
run at the moment.

Thanks,

Andy.

On Thu, May 5, 2016 at 9:20 PM, Mike Percy <mp...@cloudera.com> wrote:

> Hey Andy, thanks for posting your code to Gerrit. It's great to see more
> work progressing on the Spark side of things!
>
> Some other stuff you guys might want to consider for the create table
> feature, in addition to the partitioning details, is the ability to specify
> the replication factor. In fact, one might want to specify any of the stuff
> in the CreateTableRequestPB
> <https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:
>
> message CreateTableRequestPB {
>   required string name = 1;
>   required SchemaPB schema = 2;
>   optional RowOperationsPB split_rows = 6;
>   optional PartitionSchemaPB partition_schema = 7;
>   optional int32 num_replicas = 4;
> }
>
> Ideally you could even specify encodings for the columns (dictionary
> encoding, etc) as part of the schema.
>
> All of this does seem orthogonal to the DataFrame writing capability, which
> is probably more straightforward in terms of features. Consider treating
> these as two separate patches: one that assumes the table already exists,
> and one that adds the table creation capability. Gerrit will treat two
> separate commits as two separate code reviews, even if one depends on the
> other (it can track that). Doing that will make it easier to review and
> probably faster to get the different parts of it committed.
>
> Mike
>
>
> --
> Mike Percy
> Software Engineer, Cloudera
>
>
> On Thu, May 5, 2016 at 7:52 PM, Andy Grove <an...@gmail.com> wrote:
>
> > Hi Dan,
> >
> > Thanks for the feedback. I'm thinking that the PK could be specified by
> > passing in a tuple of column indices to use for the key (assuming that
> > composite keys are supported?).
> >
> > Maybe it would be useful for KuduContext to have createTable(name:
> > String, schema: StructType) and dropTable(name: String) methods that can
> > eventually be wrapped in Spark SQL?
> >
> > I'm just getting up to speed with Kudu by trying to run TPC-H via Spark,
> > and the use case driving the current code was being able to take existing
> > DataFrames (created by reading CSV files) and save them to Kudu so that
> > we can then start testing the TPC-H SQL queries.
> >
> > We will be experimenting with both Spark SQL and our own SQL parser /
> > query planner for Spark, which we are planning to open source too.
> >
> > Thanks,
> >
> > Andy.
> >
> > On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <da...@cloudera.com> wrote:
> >
> > > Hey Andy,
> > >
> > > Thanks for the patch!  I left you some specific feedback on the gerrit
> > > review, but I want to discuss the high-level approach a bit.  I think
> > > the patch as it's written now is going to have limited use, because it
> > > doesn't allow for specifying primary keys or partitioning, which are
> > > critical for correctness and performance. In the long run we will
> > > definitely want to be able to create tables through Spark SQL, but
> > > perhaps we should start off with just inserting/updating rows in
> > > existing tables.  It would be interesting to see how other databases
> > > have solved this problem, since I'm sure we're not the only ones with
> > > configuration options on table create.  The relational databases in
> > > particular must have PK options.
> > >
> > > - Dan
> > >
> > > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <an...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working with some colleagues at AgilData on Spark/Kudu
> > > > integration and we expect to be able to contribute a number of
> > > > features to the code base.
> > > >
> > > > To kick things off, here is a gerrit for discussion that adds support
> > > > for persisting a DataFrame to a Kudu table. It would be great to hear
> > > > feedback and feature requests for this capability.
> > > >
> > > > http://gerrit.cloudera.org:8080/#/c/2969/
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > >
> >
>

Re: Intro & kudu-spark support for persisting DataFrames

Posted by Mike Percy <mp...@cloudera.com>.
Hey Andy, thanks for posting your code to Gerrit. It's great to see more
work progressing on the Spark side of things!

Some other stuff you guys might want to consider for the create table
feature, in addition to the partitioning details, is the ability to specify
the replication factor. In fact, one might want to specify any of the stuff in
the CreateTableRequestPB
<https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:

message CreateTableRequestPB {
  required string name = 1;
  required SchemaPB schema = 2;
  optional RowOperationsPB split_rows = 6;
  optional PartitionSchemaPB partition_schema = 7;
  optional int32 num_replicas = 4;
}

Ideally you could even specify encodings for the columns (dictionary
encoding, etc) as part of the schema.
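
For example, with the Java client both of these can already be expressed at
table-creation time; a sketch (package names may differ in incubator-era
releases, which used org.kududb):

import org.apache.kudu.{ColumnSchema, Type}
import org.apache.kudu.client.CreateTableOptions

// Dictionary-encode a string column as part of the schema...
val nameCol = new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING)
  .encoding(ColumnSchema.Encoding.DICT_ENCODING)
  .build()

// ...and request a replication factor of 3 for the table.
val options = new CreateTableOptions().setNumReplicas(3)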

All of this does seem orthogonal to the DataFrame writing capability, which
is probably more straightforward in terms of features. Consider treating
these as two separate patches: one that assumes the table already exists,
and one that adds the table creation capability. Gerrit will treat two
separate commits as two separate code reviews, even if one depends on the
other (it can track that). Doing that will make it easier to review and
probably faster to get the different parts of it committed.

Mike


--
Mike Percy
Software Engineer, Cloudera


On Thu, May 5, 2016 at 7:52 PM, Andy Grove <an...@gmail.com> wrote:

> Hi Dan,
>
> Thanks for the feedback. I'm thinking that the PK could be specified by
> passing in a tuple of column indices to use for the key (assuming that
> composite keys are supported?).
>
> Maybe it would be useful for KuduContext to have createTable(name: String,
> schema: StructType) and dropTable(name: String) methods that can eventually
> be wrapped in Spark SQL?
>
> I'm just getting up to speed with Kudu by trying to run TPC-H via Spark, and
> the use case driving the current code was being able to take existing
> DataFrames (created by reading CSV files) and save them to Kudu so that we
> can then start testing the TPC-H SQL queries.
>
> We will be experimenting with both Spark SQL and our own SQL parser / query
> planner for Spark, which we are planning to open source too.
>
> Thanks,
>
> Andy.
>
> On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <da...@cloudera.com> wrote:
>
> > Hey Andy,
> >
> > Thanks for the patch!  I left you some specific feedback on the gerrit
> > review, but I want to discuss the high-level approach a bit.  I think the
> > patch as it's written now is going to have limited use, because it
> > doesn't allow for specifying primary keys or partitioning, which are
> > critical for correctness and performance. In the long run we will
> > definitely want to be able to create tables through Spark SQL, but
> > perhaps we should start off with just inserting/updating rows in existing
> > tables.  It would be interesting to see how other databases have solved
> > this problem, since I'm sure we're not the only ones with configuration
> > options on table create.  The relational databases in particular must
> > have PK options.
> >
> > - Dan
> >
> > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <an...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'm working with some colleagues at AgilData on Spark/Kudu integration
> > > and we expect to be able to contribute a number of features to the
> > > code base.
> > >
> > > To kick things off, here is a gerrit for discussion that adds support
> > > for persisting a DataFrame to a Kudu table. It would be great to hear
> > > feedback and feature requests for this capability.
> > >
> > > http://gerrit.cloudera.org:8080/#/c/2969/
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> >
>

Re: Intro & kudu-spark support for persisting DataFrames

Posted by Andy Grove <an...@gmail.com>.
Hi Dan,

Thanks for the feedback. I'm thinking that the PK could be specified by
passing in a tuple of column indices to use for the key (assuming that
composite keys are supported?).

Maybe it would be useful for KuduContext to have createTable(name: String,
schema: StructType) and dropTable(name: String) methods that can eventually
be wrapped in Spark SQL?
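
Concretely, I'm picturing a mapping along these lines (a sketch only; the
helper name and type coverage are illustrative):

import scala.collection.JavaConverters._
import org.apache.spark.sql.types._
import org.apache.kudu.{ColumnSchema, Schema, Type}

// Build a Kudu Schema from a Spark StructType, marking the columns at the
// given indices as the (composite) primary key. Kudu expects key columns
// first in the column list, so a real version would also reorder them.
def toKuduSchema(schema: StructType, keyIndices: Seq[Int]): Schema = {
  val cols = schema.fields.zipWithIndex.map { case (field, i) =>
    val kuduType = field.dataType match {
      case IntegerType => Type.INT32
      case LongType    => Type.INT64
      case DoubleType  => Type.DOUBLE
      case StringType  => Type.STRING
      case other       => sys.error(s"unsupported type: $other")
    }
    new ColumnSchema.ColumnSchemaBuilder(field.name, kuduType)
      .key(keyIndices.contains(i))
      .build()
  }
  new Schema(cols.toList.asJava)
}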

I'm just getting up to speed with Kudu by trying to run TPC-H via Spark, and
the use case driving the current code was being able to take existing
DataFrames (created by reading CSV files) and save them to Kudu so that we
can then start testing the TPC-H SQL queries.

We will be experimenting with both Spark SQL and our own SQL parser / query
planner for Spark, which we are planning to open source too.

Thanks,

Andy.

On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <da...@cloudera.com> wrote:

> Hey Andy,
>
> Thanks for the patch!  I left you some specific feedback on the gerrit
> review, but I want to discuss the high-level approach a bit.  I think the
> patch as it's written now is going to have limited use, because it doesn't
> allow for specifying primary keys or partitioning, which are critical for
> correctness and performance. In the long run we will definitely want to be
> able to create tables through Spark SQL, but perhaps we should start off
> with just inserting/updating rows in existing tables.  It would be
> interesting to see how other databases have solved this problem, since I'm
> sure we're not the only ones with configuration options on table create.  The
> relational databases in particular must have PK options.
>
> - Dan
>
> On Thu, May 5, 2016 at 5:51 PM, Andy Grove <an...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm working with some colleagues at AgilData on Spark/Kudu integration
> > and we expect to be able to contribute a number of features to the code
> > base.
> >
> > To kick things off, here is a gerrit for discussion that adds support for
> > persisting a DataFrame to a Kudu table. It would be great to hear
> > feedback and feature requests for this capability.
> >
> > http://gerrit.cloudera.org:8080/#/c/2969/
> >
> > Thanks,
> >
> > Andy.
> >
>

Re: Intro & kudu-spark support for persisting DataFrames

Posted by Dan Burkert <da...@cloudera.com>.
Hey Andy,

Thanks for the patch!  I left you some specific feedback on the gerrit
review, but I want to discuss the high-level approach a bit.  I think the
patch as it's written now is going to have limited use, because it doesn't
allow for specifying primary keys or partitioning, which are critical for
correctness and performance. In the long run we will definitely want to be
able to create tables through Spark SQL, but perhaps we should start off
with just inserting/updating rows in existing tables.  It would be
interesting to see how other databases have solved this problem, since I'm
sure we're not the only ones with configuration options on table create.  The
relational databases in particular must have PK options.
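
For what it's worth, writing into an existing table with the Java client is
already simple. A minimal sketch (error handling omitted; the master address,
table, and column names are placeholders):

import org.apache.kudu.client.KuduClient

val client  = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val table   = client.openTable("my_table")
val session = client.newSession()

val insert = table.newInsert()
val row = insert.getRow
row.addInt("id", 1)
row.addString("name", "example")
session.apply(insert)  // synchronous under the default flush mode

session.close()        // flushes any pending operations
client.shutdown()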

- Dan

On Thu, May 5, 2016 at 5:51 PM, Andy Grove <an...@gmail.com> wrote:

> Hi,
>
> I'm working with some colleagues at AgilData on Spark/Kudu integration and
> we expect to be able to contribute a number of features to the code base.
>
> To kick things off, here is a gerrit for discussion that adds support for
> persisting a DataFrame to a Kudu table. It would be great to hear feedback
> and feature requests for this capability.
>
> http://gerrit.cloudera.org:8080/#/c/2969/
>
> Thanks,
>
> Andy.
>