You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/16 01:37:06 UTC

[GitHub] [beam] TheNeuralBit opened a new pull request #13561: Add DataFrame Preview announcment blog post

TheNeuralBit opened a new pull request #13561:
URL: https://github.com/apache/beam/pull/13561


   
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam
 .apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.a
 pache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam
 .apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Dataflow/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website | Whitespace | Typescript
   --- | --- | --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/) <br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/be
 am_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | --- | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] pcoet commented on a change in pull request #13561: Add DataFrame Preview announcment blog post

Posted by GitBox <gi...@apache.org>.

pcoet commented on a change in pull request #13561:
URL: https://github.com/apache/beam/pull/13561#discussion_r545160926



##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We are proud to announce that a preview of the Beam Python SDK's new DataFrame
+API is now available in [Beam
+2.26.0](https://beam.apache.org/blog/beam-2.26.0/). Much like SqlTransform
+([Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/sql/SqlTransform.html),
+[Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform)),
+the DataFrame API gives Beam users a way to express complex
+relational logic much more concisely than previously possible.
+<!--more-->
+
+## A more expressive API
+Beam's new Dataframe API aims to be compatible with the well known

Review comment:
       Dataframe -> DataFrame

##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We are proud to announce that a preview of the Beam Python SDK's new DataFrame

Review comment:
       Maybe something like, "We're excited to announce ...."

##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We are proud to announce that a preview of the Beam Python SDK's new DataFrame
+API is now available in [Beam
+2.26.0](https://beam.apache.org/blog/beam-2.26.0/). Much like SqlTransform
+([Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/sql/SqlTransform.html),
+[Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform)),
+the DataFrame API gives Beam users a way to express complex
+relational logic much more concisely than previously possible.
+<!--more-->
+
+## A more expressive API
+Beam's new Dataframe API aims to be compatible with the well known
+[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
+DataFrame API, with a few caveats detailed below. With this new API a simple
+pipeline that reads NYC taxiride data from a CSV, performs a grouped
+aggregation, and writes the output to CSV, can be expressed very concisely:
+
+```
+from apache_beam.dataframe.io import read_csv
+
+with beam.Pipeline() as p:
+  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                    use_ncols=['passenger_count' , 'DOLocationID'])
+  # Count the number of passengers dropped off per LocationID
+  agg = df.groupby('DOLocationID').sum()
+  agg.to_csv(output)
+```
+
+Compare this to the same logic implemented as a conventional Beam python
+pipeline with a `CombinePerKey`:
+
+```
+with beam.Pipeline() as p:
+  (p | beam.io.ReadFromText("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                            skip_header_lines=1)
+     | beam.Map(lambda line: line.split(','))
+     # Parse CSV, create key - value pairs
+     | beam.Map(lambda splits: (int(splits[8] or 0),  # DOLocationID
+                                int(splits[3] or 0))) # passenger_count
+     # Sum values per key
+     | beam.CombinePerKey(sum)
+     | beam.MapTuple(lambda loc_id, pc: f'{loc_id}: {pc}')
+     | beam.io.WriteToText(known_args.output))
+```
+
+The DataFrame example is much easier to quickly inspect and understand, as it
+allows you to concisely express grouped aggregations without using the low-level
+`CombinePerKey`.
+
+In addition to being more expressive and concise, a pipeline written with the
+DataFrame API can often be more efficient than a conventional Beam pipeline.
+This is because the DataFrame API defers to the very efficient, columnar Pandas
+implementation as much as possible.
+
+## DataFrames as a DSL
+You may already be aware of [Beam
+SQL](https://beam.apache.org/documentation/dsls/sql/overview/), which is
+a Domain-Specific Language (DSL) built with Beam's Java SDK. SQL is
+considered a DSL because it's possible to express a full pipeline, including IOs
+and complex operations, entirely with SQL. 
+
+Similarly, the DataFrame API is a DSL built with the Python SDK. You can see
+that the above example is written without traditional Beam constructs like IOs,
+ParDo, or CombinePerKey. In fact the only traditional Beam type is the Pipeline
+instance! Otherwise this pipeline is written completely using the DataFrame API.
+This is possible because the DataFrame API doesn't just implement Pandas'
+computation operations, it also includes IOs based on the Pandas native
+implementations (`pd.read_{csv,parquet,...}` and `pd.DataFrame.to_{csv,parquet,...}`).
+
+Like SQL, it’s also possible to embed the DataFrame API into a larger pipeline
+by using
+[schemas](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema).
+A schema-aware PCollection can be converted to a DataFrame, processed, and the
+result converted back to another schema-aware PCollection.  For example, if you
+wanted to use traditional Beam IOs rather than one of the DataFrame IOs you
+could rewrite the above pipeline like this:
+
+```
+from apache_beam.dataframe.convert import to_dataframe
+from apache_beam.dataframe.convert import to_pcollection
+
+with beam.Pipeline() as p:
+  ...
+  schema_pc = (p | beam.ReadFromText(..)
+                 # Use beam.Select to assign a schema
+                 | beam.Select(DOLocationID=lambda line: int(...),
+                               passenger_count=lambda line: int(...)))
+  df = to_dataframe(schema_pc)
+  agg = df.groupby('DOLocationID').sum()
+  agg_pc = to_pcollection(pc)
+
+  # agg_pc has a schema based on the structure of agg
+  (agg_pc | beam.Map(lambda row: f'{row.DOLocationID}: {row.passenger_count}')
+          | beam.WriteToText(..))
+```
+
+It’s also possible to use the DataFrame API by passing a function to
+[`DataframeTransform`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform):
+
+```
+from apache_beam.dataframe.transforms import DataframeTransform
+
+with beam.Pipeline() as p:
+  ...
+  | beam.Select(DOLocationID=lambda line: int(..),
+                passenger_count=lambda line: int(..))
+  | DataframeTransform(lambda df: df.groupby('DOLocationID').sum())
+  | beam.Map(lambda row: f'{row.DOLocationID}: {row.passenger_count}')
+  ...
+```
+
+## Caveats
+As hinted above, there are some differences between Beam's DataFrame API and the
+Pandas API. The most significant difference is that the Beam  DataFrame API is
+*deferred*, just like the rest of the Beam API. This means that you can't
+`print()` a DataFrame instance in order to inspect the data, because we haven't
+computed the data yet! The computation doesn't take place until the pipeline is
+`run()`.  Before that, we only know about the shape/schema of the result (i.e.
+the names and types of the columns), and not the result itself.
+
+There are a few common exceptions you will likely see when attempting to use
+certain Pandas operations:
+
+- **NotImplementedError:** Indicates this is an operation or argument that we
+  haven't had time to look at yet. We've tried to make as many Pandas operations
+  as possible available in the Preview offering of this new API, but there's
+  still a long tail of operations to go.
+- **WontImplementError:** Indicates this is an operation or argument we do not
+  intend to support in the near-term because it's incompatible with the Beam
+  model. The largest class of operations that raise this error are those that
+  are order sensitive (e.g. shift, cummax, cummin, head, tail, etc..). These
+  cannot be trivially mapped to Beam because PCollections, representing
+  distributed datasets, are unordered. Note that even some of these operations
+  *may* get implemented in the future - we actually have some ideas for how we
+  might support order sensitive operations - but it's a ways off.
+
+Finally, it's important to note that this is a preview of a new feature that
+will get hardened over the next few Beam releases. We would love for you to try
+it out now and give us some feedback, but we do not yet recommend it for use in
+production workloads.
+
+## How to get involved
+The easiest way to get involved with this effort is to try out DataFrames and
+let us know what you think! You can send questions to user@beam.apache.org, or
+file bug reports and feature requests in [jira](https://issues.apache.org/jira).
+In particular, it would be really helpful to know if there’s an operation we
+haven’t implemented yet that you’d find useful, so that we can prioritize it.
+
+If you’d like to learn more about how the DataFrame API works under the hood and
+get involved with the development we recommend you take a look at the
+[design doc](http://s.apache.org/beam-dataframes)
+and our [Beam summit
+presentation](https://2020.beamsummit.org/sessions/simpler-python-pipelines/).
+From there the best way to help is to knock out some of those not implemented
+operations, we’re coordinating that work in

Review comment:
       "operations, we’re" -> "operations. We’re"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit merged pull request #13561: Add DataFrame Preview announcment blog post

Posted by GitBox <gi...@apache.org>.

TheNeuralBit merged pull request #13561:
URL: https://github.com/apache/beam/pull/13561


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] robertwb commented on a change in pull request #13561: Add DataFrame Preview announcment blog post

Posted by GitBox <gi...@apache.org>.

robertwb commented on a change in pull request #13561:
URL: https://github.com/apache/beam/pull/13561#discussion_r545472957



##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We're excited to announce that a preview of the Beam Python SDK's new DataFrame
+API is now available in [Beam
+2.26.0](https://beam.apache.org/blog/beam-2.26.0/). Much like `SqlTransform`
+([Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/sql/SqlTransform.html),
+[Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform)),
+the DataFrame API gives Beam users a way to express complex
+relational logic much more concisely than previously possible.
+<!--more-->
+
+## A more expressive API
+Beam's new DataFrame API aims to be compatible with the well known
+[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
+DataFrame API, with a few caveats detailed below. With this new API a simple
+pipeline that reads NYC taxiride data from a CSV, performs a grouped
+aggregation, and writes the output to CSV, can be expressed very concisely:
+
+```
+from apache_beam.dataframe.io import read_csv
+
+with beam.Pipeline() as p:
+  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                    use_ncols=['passenger_count' , 'DOLocationID'])
+  # Count the number of passengers dropped off per LocationID
+  agg = df.groupby('DOLocationID').sum()
+  agg.to_csv(output)
+```
+
+Compare this to the same logic implemented as a conventional Beam python
+pipeline with a `CombinePerKey`:
+
+```
+with beam.Pipeline() as p:
+  (p | beam.io.ReadFromText("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                            skip_header_lines=1)
+     | beam.Map(lambda line: line.split(','))
+     # Parse CSV, create key - value pairs
+     | beam.Map(lambda splits: (int(splits[8] or 0),  # DOLocationID
+                                int(splits[3] or 0))) # passenger_count
+     # Sum values per key
+     | beam.CombinePerKey(sum)
+     | beam.MapTuple(lambda loc_id, pc: f'{loc_id}: {pc}')

Review comment:
       Should this be a comma rather than ": " so it's 1:1 the same result?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on pull request #13561: Add DataFrame Preview announcment blog post

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on pull request #13561:
URL: https://github.com/apache/beam/pull/13561#issuecomment-746822718


   R: @robertwb @pcoet 
   
   Preview can be found here: http://apache-beam-website-pull-requests.storage.googleapis.com/13561/blog/dataframe-api-preview-available/index.html


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a change in pull request #13561: Add DataFrame Preview announcment blog post

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on a change in pull request #13561:
URL: https://github.com/apache/beam/pull/13561#discussion_r545493726



##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We're excited to announce that a preview of the Beam Python SDK's new DataFrame
+API is now available in [Beam
+2.26.0](https://beam.apache.org/blog/beam-2.26.0/). Much like `SqlTransform`
+([Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/sql/SqlTransform.html),
+[Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform)),
+the DataFrame API gives Beam users a way to express complex
+relational logic much more concisely than previously possible.
+<!--more-->
+
+## A more expressive API
+Beam's new DataFrame API aims to be compatible with the well known
+[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
+DataFrame API, with a few caveats detailed below. With this new API a simple
+pipeline that reads NYC taxiride data from a CSV, performs a grouped
+aggregation, and writes the output to CSV, can be expressed very concisely:
+
+```
+from apache_beam.dataframe.io import read_csv
+
+with beam.Pipeline() as p:
+  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                    use_ncols=['passenger_count' , 'DOLocationID'])
+  # Count the number of passengers dropped off per LocationID
+  agg = df.groupby('DOLocationID').sum()
+  agg.to_csv(output)
+```
+
+Compare this to the same logic implemented as a conventional Beam python
+pipeline with a `CombinePerKey`:
+
+```
+with beam.Pipeline() as p:
+  (p | beam.io.ReadFromText("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                            skip_header_lines=1)
+     | beam.Map(lambda line: line.split(','))
+     # Parse CSV, create key - value pairs
+     | beam.Map(lambda splits: (int(splits[8] or 0),  # DOLocationID
+                                int(splits[3] or 0))) # passenger_count
+     # Sum values per key
+     | beam.CombinePerKey(sum)
+     | beam.MapTuple(lambda loc_id, pc: f'{loc_id}: {pc}')

Review comment:
       Good catch, thank you!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org