
Posted to user@beam.apache.org by Brian Hulette <bh...@google.com> on 2021/09/15 18:51:38 UTC

Re: Apache Beam to Pandas DataFrame

Hi Jackson,
Could you share the specific pandas features that you need in Beam
DataFrames? It's possible that we could prioritize implementing them, or
suggest alternatives.

Brian

On Tue, Jul 13, 2021 at 4:19 PM Jackson Fan <ja...@google.com> wrote:

> Hi Ning,
>
> Thanks for the information! I think I will need to stick with Beam
> DataFrames then. Is there an effective way to convert Beam DataFrames to
> pandas DataFrames? Most of the downstream operations are written with the
> assumption of a pandas DataFrame, and they are hard to change.
>
> Best,
>
> On Tue, Jul 13, 2021 at 3:44 PM Ning Kang <ni...@google.com> wrote:
>
>> Hi Jackson,
>>
>> ib.collect requires the InteractiveRunner in a REPL or notebook
>> environment. You can find more information about it here
>> <https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development#reading_and_visualizing_the_data>
>> .
>>
>> convert_to_dataframe comes from Beam's DataFrame support
>> <https://beam.apache.org/documentation/dsls/dataframes/overview/>. A Beam
>> DataFrame is different from a pandas DataFrame.
>>
>> On Tue, Jul 13, 2021 at 3:38 PM Jackson Fan <ja...@google.com>
>> wrote:
>>
>>> Dear Beam user community,
>>>
>>> I am new to Apache Beam and would like to leverage it for fast file
>>> processing. Right now I am struggling to convert a Beam object to a
>>> dataframe.
>>>
>>> I used convert_to_dataframe to convert the pipeline to a deferred
>>> DataFrame.
>>> I am wondering if there is a way to convert that further to a pandas
>>> DataFrame so that I can manipulate it with pandas code like head, etc.
>>>
>>> I received this error when I used methods like Pandas.head, pandas.regex,
>>> etc.:
>>> [image: image.png]
>>>
>>> I am wondering if I should use ib.collect and, if so, whether there are any
>>> extra restrictions I should be aware of. All the downstream methods I wrote
>>> are based on pandas DataFrames, so I would like to avoid any potential
>>> errors down the line.
>>>
>>> Thank you so much!
>>>
>>> --
>>>
>>>  •  Jackson Fan
>>>
>>>  •  PTM
>>>
>>>   •  Shopping
>>>
>>>  •  MTV
>>>
>>
>
> --
>
>  •  Jackson Fan
>
>  •  PTM
>
>   •  Shopping
>
>  •  MTV
>

Re: Apache Beam to Pandas DataFrame

Posted by Jackson Fan <ja...@google.com>.

Got it, perfect! Thank you both for the responses. I can actually test
these out.

Best,

On Thu, Sep 16, 2021 at 1:23 PM Brian Hulette <bh...@google.com> wrote:

> > here are the functions: reindex, filter, drop_duplicates, group_by and
> > concat.
>
> We actually support all of these except for reindex (because it is
> order-sensitive [1]). Although it's possible we didn't have implementations
> for some of them when you tried it out, a lot of new operations were added
> in the releases leading up to 2.32.0, when it exited experimental [2].
>
> [1]
> https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/#order-sensitive-operations
> [2] https://beam.apache.org/blog/beam-2.32.0/

-- 

 •  Jackson Fan

 •  PTM

  •  Shopping

 •  MTV

Re: Apache Beam to Pandas DataFrame

Posted by Brian Hulette <bh...@google.com>.

> here are the functions: reindex, filter, drop_duplicates, group_by and
> concat.

We actually support all of these except for reindex (because it is
order-sensitive [1]). It's possible we didn't have implementations for some
of them when you tried it out: a lot of new operations were added in the
releases leading up to 2.32.0, when the DataFrame API exited experimental
status [2].

[1]
https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/#order-sensitive-operations
[2] https://beam.apache.org/blog/beam-2.32.0/

On Wed, Sep 15, 2021 at 2:38 PM Ning Kang <ni...@google.com> wrote:

> Hi Jackson,
>
> Correct me if I'm wrong, Brian.
> Beam Dataframe is a DSL in Beam. A Beam Dataframe instance represents the
> distributed data set, it's not materialized. It has a similar API to pandas
> Dataframe so that users who are familiar with pandas can write the same
> expressions but execute it through Beam runners. So if you define a
> pipeline with Beam Dataframe, you can run it on a cluster with multiple
> workers distributedly.
>
> While a pandas Dataframe instance represents schema-ed data in memory.
> It's not distributed and it resides on a single machine.
>
> "ib.collect" materializes data of a Beam Dataframe/PCollection into a
> pandas Dataframe.
> Based on your use case, you are trying to use Beam Dataframe to build a
> pipeline and run it on scale. Thus you should not use "ib.collect" in your
> code.
>
> And it seems that the missing feature of Beam Dataframe you need is APIs
> like "pandas.head" and "pandas.regex".
> Please let us know if that's all the features you are trying to use. We
> can file JIRA tickets and prioritize implementing them.

Re: Apache Beam to Pandas DataFrame

Posted by Ning Kang <ni...@google.com>.

Hi Jackson,

Correct me if I'm wrong, Brian.
Beam DataFrame is a DSL in Beam. A Beam DataFrame instance represents a
distributed data set that is not materialized. It has an API similar to the
pandas DataFrame, so users who are familiar with pandas can write the same
expressions but execute them through Beam runners. If you define a pipeline
with Beam DataFrames, you can run it distributed across a cluster with
multiple workers.

A pandas DataFrame instance, by contrast, represents schema'd data in
memory. It is not distributed and resides on a single machine.

"ib.collect" materializes the data of a Beam DataFrame/PCollection into a
pandas DataFrame.
Based on your use case, you are trying to use Beam DataFrames to build a
pipeline and run it at scale, so you should not use "ib.collect" in your
code.
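
The deferred-versus-materialized distinction above can be illustrated with a
toy sketch in plain Python. This is not Beam's implementation: the class
ToyDeferredFrame and its methods are hypothetical names invented for the
example. The point is only that operations are recorded when you write them
and nothing is read or computed until you explicitly collect.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def _with_column(rows: Iterable[Row], name: str, fn: Callable[[Row], object]) -> Iterator[Row]:
    """Yield copies of each row with one extra column."""
    for row in rows:
        yield {**row, name: fn(row)}


class ToyDeferredFrame:
    """Records operations instead of running them (hypothetical, not a Beam class)."""

    def __init__(self, source: Callable[[], Iterable[Row]], ops: tuple = ()):
        self._source = source  # called only when collect() runs
        self._ops = ops        # recorded, not yet executed

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyDeferredFrame":
        return ToyDeferredFrame(self._source, self._ops + (("filter", predicate),))

    def assign(self, name: str, fn: Callable[[Row], object]) -> "ToyDeferredFrame":
        return ToyDeferredFrame(self._source, self._ops + (("assign", (name, fn)),))

    def collect(self) -> List[Row]:
        """Materialize: read the source and apply every recorded operation."""
        rows: Iterable[Row] = (dict(r) for r in self._source())
        for kind, arg in self._ops:
            if kind == "filter":
                rows = filter(arg, rows)
            else:
                name, fn = arg
                rows = _with_column(rows, name, fn)
        return list(rows)
```

In this toy, collect() pulls everything into the memory of one process,
which is the same trade-off described above for "ib.collect": convenient for
interactive exploration, but it gives up distributed execution.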

It seems that the features you are missing from Beam DataFrames are APIs
like "pandas.head" and "pandas.regex".
Please let us know if those are all the features you are trying to use. We
can file JIRA tickets and prioritize implementing them.

On Wed, Sep 15, 2021 at 11:52 AM Brian Hulette <bh...@google.com> wrote:

> Hi Jackson,
> Could you share the specific pandas features that you need in Beam
> DataFrames? It's possible that we could prioritize implementing them, or
> suggest alternatives.
>
> Brian

Re: Apache Beam to Pandas DataFrame

Posted by Jackson Fan <ja...@google.com>.

Hi Brian,

Thanks so much for the reply. I have already switched tech stacks, but here
are the functions I needed: reindex, filter, drop_duplicates, group_by and
concat. There is no rush since I no longer have an urgent need, but I would
be happy to be of any help.

Best,

On Wed, Sep 15, 2021 at 11:52 AM Brian Hulette <bh...@google.com> wrote:

> Hi Jackson,
> Could you share the specific pandas features that you need in Beam
> DataFrames? It's possible that we could prioritize implementing them, or
> suggest alternatives.
>
> Brian

-- 

 •  Jackson Fan

 •  PTM

  •  Shopping

 •  MTV