Posted to dev@spark.apache.org by Olivier Girardot <o....@lateral-thoughts.com> on 2015/04/29 13:26:08 UTC

Pandas' Shift in Dataframe

Hi,
Is there any plan to add the "shift" method from pandas to Spark DataFrames?
Not that I think it would be an easy task...

cf.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html

Regards,

Olivier.
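
For readers unfamiliar with the method, here is a minimal sketch of what a positional shift does, written in plain Python so that it runs without pandas. The `shift` helper and its `fill_value` parameter are illustrative stand-ins, not the pandas or Spark API; pandas pads the vacated positions with NaN where this sketch uses None.

```python
def shift(values, periods=1, fill_value=None):
    """Return a new list with `values` moved by `periods` positions.

    Positive `periods` moves values towards the end, negative towards the
    start. Vacated positions are filled with `fill_value`.
    """
    n = len(values)
    if periods == 0:
        return list(values)
    if abs(periods) >= n:
        # Every value is shifted out of range.
        return [fill_value] * n
    if periods > 0:
        return [fill_value] * periods + list(values[:n - periods])
    return list(values[-periods:]) + [fill_value] * (-periods)


prices = [10, 20, 30]
print(shift(prices, 1))   # [None, 10, 20]
print(shift(prices, -1))  # [20, 30, None]
```

The result depends entirely on row order, which is why the operation is awkward for a distributed, set-oriented table.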

Re: Pandas' Shift in Dataframe

Posted by Olivier Girardot <o....@lateral-thoughts.com>.

To close this thread: rxin created a broader JIRA ticket to cover window
functions in DataFrames: https://issues.apache.org/jira/browse/SPARK-7322
Thanks, everyone.

On Wed, Apr 29, 2015 at 22:51, Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> To give you a broader idea of the current use case: I have a few
> transformations (sorts and column creations) aimed at a simple goal.
> My data is timestamped, and when two lines are identical, the time
> difference between them has to be more than X days for the line to be
> kept. So there are a few shifts, but very local ones: only -1 or +1.
>
> FYI regarding JIRA, I created one -
> https://issues.apache.org/jira/browse/SPARK-7247 - associated with this
> discussion.
> @rxin considering that, in my use case, the data is sorted beforehand,
> there might be a better way - but I guess some shuffle would be needed
> anyway...

Re: Pandas' Shift in Dataframe

Posted by Olivier Girardot <o....@lateral-thoughts.com>.

To give you a broader idea of the current use case: I have a few
transformations (sorts and column creations) aimed at a simple goal.
My data is timestamped, and when two lines are identical, the time
difference between them has to be more than X days for the line to be
kept. So there are a few shifts, but very local ones: only -1 or +1.

FYI regarding JIRA, I created one -
https://issues.apache.org/jira/browse/SPARK-7247 - associated with this
discussion.
@rxin considering that, in my use case, the data is sorted beforehand,
there might be a better way - but I guess some shuffle would be needed
anyway...
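
The use case above can be sketched in plain Python as follows. This is only one reading of the description: the function name `keep_spaced_rows`, the `(key, timestamp)` row shape, and the choice to compare each row with the row immediately before it (rather than with the last row that was kept) are all assumptions made for illustration.

```python
from datetime import date, timedelta


def keep_spaced_rows(rows, min_gap):
    """Keep a row unless an identical key precedes it within `min_gap`.

    `rows` is an iterable of (key, timestamp) tuples. Rows are sorted, then
    each row is compared with its predecessor, which is what a shift by +1
    provides in pandas.
    """
    rows = sorted(rows)
    previous = [None] + rows[:-1]  # the "shift(1)" view of the sorted rows
    kept = []
    for (key, ts), prev in zip(rows, previous):
        if prev is None or prev[0] != key or ts - prev[1] > min_gap:
            kept.append((key, ts))
    return kept


events = [
    ("a", date(2015, 4, 1)),
    ("a", date(2015, 4, 3)),   # 2 days after the previous "a": dropped
    ("a", date(2015, 4, 20)),  # 17 days after the previous "a": kept
    ("b", date(2015, 4, 2)),
]
print(keep_spaced_rows(events, timedelta(days=7)))
```

Only the neighbouring row is ever consulted, which matches the remark that the shifts are very local.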


On Wed, Apr 29, 2015 at 22:34, Evan R. Sparks <ev...@gmail.com> wrote:

> In general there's a tension between ordered data and the set-oriented
> data model underlying DataFrames. You can force a total ordering on the
> data, but it may come at a high cost with respect to performance.
>
> It would be good to get a sense of the use case you're trying to support,
> but I can imagine achieving a similar result by applying a
> datetime.timedelta (in Python terms) to a time attribute (your "axis")
> and then performing a join between the base table and this derived table
> to merge the data back together. This type of join could then be
> optimized if the use case is frequent enough to warrant it.
>
> - Evan

Re: Pandas' Shift in Dataframe

Posted by "Evan R. Sparks" <ev...@gmail.com>.

In general there's a tension between ordered data and the set-oriented
data model underlying DataFrames. You can force a total ordering on the
data, but it may come at a high cost with respect to performance.

It would be good to get a sense of the use case you're trying to support,
but I can imagine achieving a similar result by applying a
datetime.timedelta (in Python terms) to a time attribute (your "axis")
and then performing a join between the base table and this derived table
to merge the data back together. This type of join could then be
optimized if the use case is frequent enough to warrant it.

- Evan
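
The timedelta-plus-join idea can be sketched roughly as below, using plain dictionaries in place of tables. The function name `shift_by_join` and the dictionary representation are assumptions for illustration only; nothing here is Spark API.

```python
from datetime import date, timedelta


def shift_by_join(table, step):
    """Pair each value with the value recorded `step` earlier, if any.

    `table` maps a timestamp to a value. The derived table re-keys every
    value at `timestamp + step`; a left join on the timestamp then brings
    the earlier value alongside the current one.
    """
    derived = {ts + step: value for ts, value in table.items()}
    return {ts: (value, derived.get(ts)) for ts, value in table.items()}


readings = {
    date(2015, 4, 1): 1,
    date(2015, 4, 2): 2,
    date(2015, 4, 4): 4,  # note the gap: nothing was recorded on April 3
}
for day, pair in shift_by_join(readings, timedelta(days=1)).items():
    print(day, pair)
```

Unlike a positional shift, this only finds a match when a row exists exactly `step` earlier, so the April 4 row above gets no partner. On regularly spaced data the two approaches agree.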

On Wed, Apr 29, 2015 at 1:25 PM, Reynold Xin <rx...@databricks.com> wrote:

> In this case it's fine to discuss whether this would fit in Spark
> DataFrames' high-level direction before putting it in JIRA. Otherwise we
> might end up creating a lot of tickets just to ask whether something
> might be a good idea.
>
> About this specific feature -- I'm not sure what it means in general given
> we don't have an axis in Spark DataFrames. But I think it'd probably be
> good to be able to shift a column by one so we can support the end time /
> begin time case, although it'd require two passes over the data.

Re: Pandas' Shift in Dataframe

Posted by Reynold Xin <rx...@databricks.com>.

In this case it's fine to discuss whether this would fit in Spark
DataFrames' high-level direction before putting it in JIRA. Otherwise we
might end up creating a lot of tickets just to ask whether something
might be a good idea.

About this specific feature -- I'm not sure what it means in general given
we don't have an axis in Spark DataFrames. But I think it'd probably be
good to be able to shift a column by one so we can support the end time /
begin time case, although it'd require two passes over the data.
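
One possible reading of the "two passes" remark, sketched in plain Python with lists standing in for ordered partitions: a first pass records the last value of each partition, and a second pass shifts each partition locally, borrowing the boundary value from its predecessor. The function name `shift_partitions` and this decomposition are assumptions, not a description of how Spark implements anything.

```python
def shift_partitions(partitions, fill_value=None):
    """Shift an ordered, partitioned column by one position.

    `partitions` is a list of lists whose concatenation is the full column
    in order. The partition layout is preserved in the result.
    """
    # Pass 1: for each partition, find the value that precedes its first row.
    carry = []
    last_seen = fill_value
    for part in partitions:
        carry.append(last_seen)
        if part:  # empty partitions pass the boundary value along unchanged
            last_seen = part[-1]

    # Pass 2: shift each partition locally, using only its boundary value.
    return [
        [boundary] + part[:-1] if part else []
        for boundary, part in zip(carry, partitions)
    ]


begin_times = [[1, 2, 3], [4, 5], [], [6]]
print(shift_partitions(begin_times))  # [[None, 1, 2], [3, 4], [], [5]]
```

The second pass needs nothing beyond one boundary value per partition, so in this sketch no row has to move between partitions once the data is already ordered.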



On Wed, Apr 29, 2015 at 1:08 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> I can't comment on the direction of the DataFrame API (that's more for
> Reynold or Michael, I guess), but I just wanted to point out that JIRA
> would be the recommended central place for discussing a feature addition
> like that.
>
> Nick

Re: Pandas' Shift in Dataframe

Posted by Nicholas Chammas <ni...@gmail.com>.

I can't comment on the direction of the DataFrame API (that's more for
Reynold or Michael, I guess), but I just wanted to point out that JIRA
would be the recommended central place for discussing a feature addition
like that.

Nick

On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi Nicholas,
> Yes, I've already checked, and I've just created
> https://issues.apache.org/jira/browse/SPARK-7247
> I'm not even sure why this would be a good feature to add, except for the
> fact that some of the data scientists I'm working with are using it, and
> it would therefore be useful for me when translating pandas code to Spark...
>
> Isn't the goal of Spark DataFrames to offer all the features of pandas/R
> data frames using Spark?
>
> Regards,
>
> Olivier.

Re: Pandas' Shift in Dataframe

Posted by Olivier Girardot <o....@lateral-thoughts.com>.

Hi Nicholas,
Yes, I've already checked, and I've just created
https://issues.apache.org/jira/browse/SPARK-7247
I'm not even sure why this would be a good feature to add, except for the
fact that some of the data scientists I'm working with are using it, and
it would therefore be useful for me when translating pandas code to Spark...

Isn't the goal of Spark DataFrames to offer all the features of pandas/R
data frames using Spark?

Regards,

Olivier.

On Wed, Apr 29, 2015 at 21:09, Nicholas Chammas <ni...@gmail.com> wrote:

> You can check JIRA for any existing plans. If there aren't any, then feel
> free to create a JIRA and make the case there for why this would be a good
> feature to add.
>
> Nick

Re: Pandas' Shift in Dataframe

Posted by Nicholas Chammas <ni...@gmail.com>.

You can check JIRA for any existing plans. If there aren't any, then feel
free to create a JIRA and make the case there for why this would be a good
feature to add.

Nick

On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi,
> Is there any plan to add the "shift" method from pandas to Spark DataFrames?
> Not that I think it would be an easy task...
>
> cf.
>
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
>
> Regards,
>
> Olivier.
>