Posted to dev@spark.apache.org by Hamel Kothari <ha...@gmail.com> on 2016/02/26 17:44:33 UTC

More Robust DataSource Parameters

Hi devs,

Has there been any discussion around changing the DataSource parameters
argument to be something more sophisticated than Map[String, String]? As you
write more complex DataSources, they tend to need a variety of parameters in
varying formats, and having to coerce them all into strings becomes
suboptimal pretty fast.

Quite often I see this worked around by having a data source take a single
parameter that is a JSON string, which it then parses into the parameter
objects it actually needs. Unfortunately, hand-writing JSON strings is a
really error-prone process, so to get compile-time safety people add
convenience functions which take actual POJOs as parameters, serialize them
to JSON so they can be passed through the data source API, and then
deserialize them again in the constructors of their data sources. There's
also no real story around discoverability of options with the current
Map[String, String] setup, other than reading the source code of the
data source and hoping its author defined constants somewhere.
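
To make the workaround concrete, here's a minimal Scala sketch (the
source name, config class, and "config" option key are all made up, and
json4s just stands in for whatever serializer a data source might pick):

  import org.json4s.NoTypeHints
  import org.json4s.jackson.Serialization

  // Hypothetical parameter class a data source might ship.
  case class FancySourceConfig(
      hosts: Seq[String],     // endpoints to connect to
      table: String,          // table to read
      timeoutMs: Int = 10000) // request timeout

  implicit val formats = Serialization.formats(NoTypeHints)

  // Consumer-side convenience: POJO -> JSON string, because everything
  // has to squeeze through Map[String, String].
  val df = sqlContext.read
    .format("com.example.fancy")
    .option("config",
      Serialization.write(FancySourceConfig(Seq("host1:9042"), "events")))
    .load()

  // ...and inside the data source's RelationProvider, the string gets
  // parsed straight back into the typed object:
  //   Serialization.read[FancySourceConfig](parameters("config"))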

Rather than doing all of the above, we could adapt the DataSource API so
that RelationProviders are parameterized on a parameter class, an instance
of which is passed to the createRelation call. On the user's side, they
would just construct the appropriate configuration object and hand it to a
DataFrameReader.parameters call, and it would then be possible to guarantee
that enough parameters were provided to construct a DataFrame.
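
As a rough API sketch (every name below is hypothetical: neither
TypedRelationProvider nor DataFrameReader.parameters exists today, and
FancySourceConfig is the made-up config class from the sketch above):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.sources.BaseRelation

  // Hypothetical: a provider parameterized on its own config class,
  // mirroring the existing createRelation but with a typed argument.
  trait TypedRelationProvider[P] {
    def createRelation(sqlContext: SQLContext, parameters: P): BaseRelation
  }

  // The consumer gets compile-time checking instead of a string map:
  val df = sqlContext.read
    .format("com.example.fancy")
    .parameters(FancySourceConfig(Seq("host1:9042"), "events"))
    .load()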

The key challenge I see with this approach is that I'm not sure how to make
the above changes in a backwards compatible way that doesn't involve
duplicating a bunch of methods.
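
To illustrate the problem: even the most straightforward bridge I can
think of (a hedged sketch building on the trait above, nothing more)
leaves every provider carrying two createRelation surfaces forever:

  import org.apache.spark.sql.sources.RelationProvider

  // Hypothetical shim, not existing Spark API.
  trait TypedRelationProvider[P] extends RelationProvider {
    def createRelation(sqlContext: SQLContext, parameters: P): BaseRelation

    // Made-up hook mapping the legacy string map onto the typed class.
    def fromOptions(options: Map[String, String]): P

    // The legacy entry point delegates to the typed one, so both
    // methods stay on the public API surface indefinitely.
    override def createRelation(
        sqlContext: SQLContext,
        parameters: Map[String, String]): BaseRelation =
      createRelation(sqlContext, fromOptions(parameters))
  }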

Do people have thoughts regarding this approach? I'm happy to file a JIRA
and have the discussion there if it makes sense.

Best,
Hamel

Re: More Robust DataSource Parameters

Posted by Reynold Xin <rx...@databricks.com>.
Hi Hamel,

Sorry for the slow reply. Do you mind writing down the thoughts in a
document, with API sketches? I think the devil is really in the details of
the API for this one.

If we can design an API that is type-safe, supports all languages, and also
can be stable, then it sounds like a great idea.



On Sat, Feb 27, 2016 at 10:12 AM, Hamel Kothari <ha...@gmail.com>
wrote:

> Thanks for the flags Reynold.
>
> 1. For the 4+ languages, these are just on the consumption side (i.e. you
> can't write a data source in Python or SQL, correct), right? ? If this is
> correct and you can only write data sources in the JVM languages than that
> makes this story a lot easier. On the DataSource side we just require that
> the configuration object is JSON deserializable.
>
> Then on the consumption side (ie. from sqlContext.read):
>  - From Java/Scala these objects can be passed through to the DataSource
> natively since it's in the same JVM and people have access to the concrete
> parameter classes.
>  - On the Python side this object can be passed over via JSON which is
> deserialized and could be forced to generate explicit serialization
> failures when insufficient options are provided. The datasource provide
> could even (optionally) provide a python object which performs validation
> on the python side to make this easier for consumers.
> - In the SQL instance, since these objects are JSON serializable, we can
> alter the OPTIONS keyword to allow nested maps to create the JSON object.
>
> In all of these cases the solution proposed still worst case degrades to
> something equivalent to the Map[String, String] (except that it has nesting
> support), but in the best cases we have POJOs and optionally provided
> python objects which help facilitate this in a first class fashion.
>
> 2. Yeah agree this is a big problem, which is why I flagged it in the
> initial email. I'll put some more thought into how this can be done in a
> reasonable fashion (although any sugguestions wouild be greatly
> appreciated).
>
> With the above answer to #1 and contingent on finding a solution to the
> API stability part of it, would you be supportive of a change to do this?
> If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how to
> do #2 in a more sane way.
>
> On Fri, Feb 26, 2016 at 5:02 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> Thanks for the email. This sounds great in theory, but might run into two
>> major problems:
>>
>> 1. Need to support 4+ programming languages (SQL, Python, Java, Scala)
>>
>> 2. API stability (both backward and forward)
>>
>>
>>
>> On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari <ha...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> Has there been any discussion around changing the DataSource parameters
>>> arguments be something more sophisticated than Map[String, String]? As you
>>> write more complex DataSources there are likely to be a variety of
>>> parameters of varying formats which are needed and having to coerce them to
>>> be strings becomes suboptimal pretty fast.
>>>
>>> Quite often I see this combated by people specifying parameters which
>>> take in Json strings and then parse them into the parameter objects that
>>> they actually need. Unfortunately having people write Json strings can be a
>>> really error prone process so to ensure compile time safety people write
>>> convenience functions written which take in actual POJOs as parameters,
>>> serialize them to json so they can be passed into the data source API and
>>> then deserialize them in the constructors of their data sources. There's
>>> also no real story around discoverability of options with the current
>>> Map[String, String] setup other than looking at the source code of the
>>> datasource and hoping that they specified constants somewhere.
>>>
>>> Rather than doing all of the above, we could adapt the DataSource API to
>>> have RelationProviders be templated on a parameter class which could be
>>> provided to the createRelation call. On the user's side, they could just
>>> create the appropriate configuration object and provide that object to the
>>> DataFrameReader.parameters call and it would be possible to guarantee that
>>> enough parameters were provided to construct a DataFrame in that case.
>>>
>>> The key challenge I see with this approach is that I'm not sure how to
>>> make the above changes in a backwards compatible way that doesn't involve
>>> duplicating a bunch of methods.
>>>
>>> Do people have thoughts regarding this approach? I'm happy to file a
>>> JIRA and have the discussion there if it makes sense.
>>>
>>> Best,
>>> Hamel
>>>
>>
>>

Re: More Robust DataSource Parameters

Posted by Hamel Kothari <ha...@gmail.com>.
Thanks for the flags, Reynold.

1. For the 4+ languages, these are only a concern on the consumption side
(i.e. you can't write a data source in Python or SQL), correct? If so, and
data sources can only be written in JVM languages, that makes this story a
lot easier. On the DataSource side we just require that the configuration
object is JSON-deserializable.

Then on the consumption side (i.e. from sqlContext.read):
 - From Java/Scala, these objects can be passed through to the DataSource
natively, since everything lives in the same JVM and people have access to
the concrete parameter classes.
 - On the Python side, the object can be passed over as JSON, which is
deserialized on the JVM side and can be made to raise explicit
deserialization failures when insufficient options are provided (see the
sketch after this list). The data source provider could even (optionally)
ship a Python object which performs validation on the Python side to make
this easier for consumers.
 - In the SQL case, since these objects are JSON-serializable, we can
extend the OPTIONS keyword to allow nested maps that build up the JSON
object.
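
To sketch the JVM side of that JSON path (json4s is just a stand-in
deserializer here, and FancySourceConfig is the hypothetical config
class from my first email):

  import org.json4s.{MappingException, NoTypeHints}
  import org.json4s.jackson.Serialization

  implicit val formats = Serialization.formats(NoTypeHints)

  case class FancySourceConfig(
      hosts: Seq[String], table: String, timeoutMs: Int = 10000)

  // JSON arriving from the Python wrapper or from SQL OPTIONS, with
  // the required "table" field missing:
  val fromPython = """{"hosts": ["host1:9042"]}"""

  try {
    Serialization.read[FancySourceConfig](fromPython)
  } catch {
    // Insufficient options surface as an explicit failure instead of
    // a silent misconfiguration.
    case e: MappingException => println("bad options: " + e.getMessage)
  }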

In all of these cases, the proposed solution degrades in the worst case to
something equivalent to Map[String, String] (except that it has nesting
support), while in the best cases we have POJOs, and optionally provided
Python objects, which support this in a first-class fashion.

2. Yeah, agreed that this is a big problem, which is why I flagged it in
the initial email. I'll put some more thought into how this can be done in
a reasonable fashion (although any suggestions would be greatly
appreciated).

With the above answer to #1 and contingent on finding a solution to the API
stability part of it, would you be supportive of a change to do this? If
so, I'll submit a JIRA first and solicit/brainstorm some ideas on how to do
#2 in a more sane way.

Re: More Robust DataSource Parameters

Posted by Reynold Xin <rx...@databricks.com>.
Thanks for the email. This sounds great in theory, but might run into two
major problems:

1. Need to support 4+ programming languages (SQL, Python, Java, Scala)

2. API stability (both backward and forward)