Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/24 17:14:30 UTC

[GitHub] [beam] pcoet commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

pcoet commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658137020



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. In the future, the Beam DataFrame API may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, the actual computation of these values is deferred, so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not yet available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce non-deferred values or plots, which are incompatible with the deferred Beam programming model. Other operations produce deferred results and are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.
+

Review comment:
       Good suggestions. Thanks! I integrated your changes, made a few other minor tweaks, and pushed another commit.
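
For readers of the "Operations that produce deferred scalars" section in the diff above, here is a minimal sketch of why a truth test on a deferred value fails. It uses only the Python standard library and is not Beam code: `DeferredScalar`, `resolve`, and `try_branch` are hypothetical names invented for illustration, and the real behavior lives in Beam's `frame_base.py` as linked in that section.

```python
class DeferredScalar:
    """Toy stand-in for a deferred scalar. This is NOT Beam's actual class."""

    def __init__(self, compute):
        # `compute` is a zero-argument callable that would produce the value
        # once the pipeline actually runs.
        self._compute = compute

    def __bool__(self):
        # A truth test needs the value before it exists, so refuse it.
        raise TypeError(
            "Deferred scalar has no value yet; it cannot be used for branching."
        )

    def resolve(self):
        # Hypothetical helper standing in for running the pipeline
        # (or collecting the result locally) to obtain the value.
        return self._compute()


def try_branch(scalar):
    """Return "branched"/"not branched", or the error type name on failure."""
    try:
        if scalar:
            return "branched"
        return "not branched"
    except TypeError as exc:
        return type(exc).__name__


total = DeferredScalar(lambda: sum([1, 2, 3]))
outcome = try_branch(total)  # the truth test fails while the value is deferred
value = total.resolve()      # the value is available only after resolving
```

In this toy model, branching becomes possible only after `resolve()` returns a plain Python value, which loosely mirrors bringing a result into local memory before using it for control flow.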



