Posted to dev@spark.apache.org by roehst <ro...@gmail.com> on 2016/10/15 00:38:32 UTC

On convenience methods

Hi, I sometimes write convenience methods for pre-processing data frames, and
I wonder if it makes sense to make a contribution -- should this be included
in Spark or supplied as Spark Packages/3rd party libraries?

Example:

Get all fields in a DataFrame schema of a certain type.

I end up writing something like getFieldsByDataType(dataFrame: DataFrame,
dataType: DataType): List[StructField], and maybe adding that to the Schema
class with implicits. Something like:

dataFrame.schema.fields.filter(_.dataType == dataType)

Should the Schema class (or its fields collection) expose a method like
"filterByDataType", so that we can write:

dataFrame.getFieldsByDataType(StringType)?
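
For concreteness, here is a rough, untested sketch of the kind of
enrichment I have in mind (all of these names are just suggestions):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

object SchemaConvenience {
  // Hypothetical enrichment: adds a getFieldsByDataType method to DataFrame.
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    def getFieldsByDataType(dataType: DataType): Seq[StructField] =
      df.schema.fields.filter(_.dataType == dataType).toSeq
  }
}

// After "import SchemaConvenience._" and
// "import org.apache.spark.sql.types.StringType":
//   val stringFields = dataFrame.getFieldsByDataType(StringType)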

Is it useful? Is it too bloated? Would it be acceptable? That is a small
contribution that a junior developer might be able to write, for example.
It adds more code, but it might make the library more user-friendly (not
that it isn't user-friendly already).

Just want to hear your thoughts on this question.

Thanks,
Rodrigo





Re: On convenience methods

Posted by Holden Karau <ho...@pigscanfly.ca>.
I think what Reynold means is that if it's easy for a developer to build
this convenience function using the current Spark API, it probably doesn't
need to go into Spark, unless it's being done to provide a similar API to a
system we are attempting to be semi-compatible with (e.g. if a
corresponding convenience function existed in the pandas API).
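
For example, an enrichment along these lines already works today purely in
user code or as a Spark Package, with no change to Spark itself (a rough
sketch, untested; the names are made up):

import org.apache.spark.sql.types.{DataType, StructField, StructType}

object SchemaSyntax {
  // User-side convenience: filter a schema's fields by data type,
  // e.g. df.schema.filterByDataType(StringType).
  implicit class RichStructType(val schema: StructType) extends AnyVal {
    def filterByDataType(dt: DataType): Array[StructField] =
      schema.fields.filter(_.dataType == dt)
  }
}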

On Tue, Oct 18, 2016 at 7:03 AM, roehst <ro...@gmail.com> wrote:

> Sorry, by API do you mean the use of 3rd-party libraries or user code, or
> something else?
>
> Thanks


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: On convenience methods

Posted by roehst <ro...@gmail.com>.
Sorry, by API do you mean the use of 3rd-party libraries or user code, or
something else?

Thanks





Re: On convenience methods

Posted by Reynold Xin <rx...@databricks.com>.
It is very difficult to give a general answer; we would need to discuss
each case. In general, it is not a good idea to provide things that are
trivially doable using existing APIs, unless it is for compatibility with
other frameworks (e.g. Pandas).
