Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/18 00:27:13 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #13561: Add DataFrame Preview announcment blog post

TheNeuralBit commented on a change in pull request #13561:
URL: https://github.com/apache/beam/pull/13561#discussion_r545493726



##########
File path: website/www/site/content/en/blog/dataframe-api-preview-available.md
##########
@@ -0,0 +1,178 @@
+---
+title:  "DataFrame API Preview now Available!"
+date: "2020-12-16T09:09:41-08:00"
+categories:
+  - blog
+authors:
+  - bhulette
+  - robertwb
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We're excited to announce that a preview of the Beam Python SDK's new DataFrame
+API is now available in [Beam
+2.26.0](https://beam.apache.org/blog/beam-2.26.0/). Much like `SqlTransform`
+([Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/sql/SqlTransform.html),
+[Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform)),
+the DataFrame API gives Beam users a way to express complex
+relational logic much more concisely than previously possible.
+<!--more-->
+
+## A more expressive API
+Beam's new DataFrame API aims to be compatible with the well-known
+[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
+DataFrame API, with a few caveats detailed below. With this new API, a simple
+pipeline that reads NYC taxi ride data from a CSV, performs a grouped
+aggregation, and writes the output to CSV can be expressed very concisely:
+
+```
+import apache_beam as beam
+from apache_beam.dataframe.io import read_csv
+
+with beam.Pipeline() as p:
+  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                    usecols=['passenger_count', 'DOLocationID'])
+  # Count the number of passengers dropped off per LocationID
+  agg = df.groupby('DOLocationID').sum()
+  agg.to_csv(output)
+```
+
+Compare this to the same logic implemented as a conventional Beam Python
+pipeline with a `CombinePerKey`:
+
+```
+with beam.Pipeline() as p:
+  (p | beam.io.ReadFromText("gs://apache-beam-samples/nyc_taxi/2019/*.csv",
+                            skip_header_lines=1)
+     | beam.Map(lambda line: line.split(','))
+     # Parse CSV, create key - value pairs
+     | beam.Map(lambda splits: (int(splits[8] or 0),  # DOLocationID
+                                int(splits[3] or 0))) # passenger_count
+     # Sum values per key
+     | beam.CombinePerKey(sum)
+     | beam.MapTuple(lambda loc_id, pc: f'{loc_id}: {pc}')

Review comment:
       Good catch, thank you!
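
For anyone checking the example output by hand, here is a minimal Beam-free sketch of the logic the quoted hunk describes: parse each CSV row, key by `DOLocationID`, and sum `passenger_count`, treating empty fields as 0. The sample rows and the helper name `sum_passengers_per_location` are made up for illustration; only the two column names come from the post. It is not the pipeline from the PR, just a way to sanity-check expected results on a small sample.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; only the two column names come from the quoted post.
SAMPLE_CSV = """passenger_count,DOLocationID
1,10
2,10
,20
3,20
"""


def sum_passengers_per_location(csv_text):
    """Sum passenger_count per DOLocationID, treating empty fields as 0."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        loc_id = int(row["DOLocationID"] or 0)
        totals[loc_id] += int(row["passenger_count"] or 0)
    return dict(totals)


totals = sum_passengers_per_location(SAMPLE_CSV)
for loc_id, pc in sorted(totals.items()):
    # Same "key: value" formatting as the MapTuple step in the quoted hunk.
    print(f"{loc_id}: {pc}")
```

Reading columns by name rather than by position (`splits[8]`, `splits[3]`) avoids depending on the column order of the real files, which this sketch does not assume.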



