Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/24 15:32:16 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

TheNeuralBit commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658037149



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations

Review comment:
       ```suggestion
   ## Classes of Unsupported Operations
   ### Non-parallelizable operations
   ```
   WDYT about collecting all the classes of operations under a heading like this ("Using interactive Beam" would still be at the same level though)?

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values

Review comment:
       ```suggestion
   ## Using Interactive Beam to access the full pandas API
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.

Review comment:
       ```suggestion
   The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas, but there are a few differences to be aware of. This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.

Review comment:
       ```suggestion
   ```
   
   I don't know that we need a general explanation here, since these are all explained above.

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.
+

Review comment:
       Should there be a TODO here with a jira for adding an example?
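
If an example does get added, a plain-Python toy like the one below might be a starting point for the explanation. It is not Beam code: `DeferredScalar`, `deferred_sum`, and the `collect` method are hypothetical names made up for this sketch, and the `TypeError` only imitates the behavior the page describes for deferred scalars. It shows that a deferred result cannot drive control flow until it has been brought into local memory.

```python
class DeferredScalar:
    """Toy stand-in for a deferred result; hypothetical, not part of the Beam API."""

    def __init__(self, compute):
        # `compute` is a zero-argument callable that produces the value later.
        self._compute = compute

    def __bool__(self):
        # Imitates the documented behavior: truth tests on deferred values fail.
        raise TypeError("deferred value is not available for control flow")

    def collect(self):
        # Hypothetical analogue of bringing the result into local memory.
        return self._compute()


def deferred_sum(values):
    # Describe the computation without running it.
    return DeferredScalar(lambda: sum(values))


total = deferred_sum([3, 1, 5])

try:
    if total:  # truth test on the deferred value raises TypeError
        pass
    branched = True
except TypeError:
    branched = False

local_total = total.collect()  # an ordinary int once materialized
is_large = local_total > 5     # branching on the local value is fine
```

A real docs example would presumably call `ib.collect` on a deferred DataFrame instead; the toy only shows the shape of the workaround.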

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam Dataframe may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.

Review comment:
       ```suggestion
   Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` function that brings a `PCollection` or deferred DataFrame into local memory as a pandas DataFrame. After using `ib.collect` to materialize a deferred DataFrame, you can perform any operation in the pandas API, not just those that are supported in Beam.
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.

Review comment:
       ```suggestion
   ```
   
   I think this is incorrect (unless I'm misinterpreting?). `to_pcollection` and `to_dataframe` aren't helpful here.

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.

Review comment:
       ```suggestion
   Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). If you think we should prioritize this work, you can also [contact us](https://beam.apache.org/community/contact-us/) to let us know.
   ```
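
For the `nlargest` tip in the Workaround part of this section, here is a small plain-pandas sketch with made-up data. It only illustrates that the two spellings select the same rows when the sorted values are distinct; it does not run anything on Beam.

```python
import pandas as pd

# Hypothetical data for illustration only.
fares = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune", "Kyiv"],
    "fare": [12.5, 30.0, 7.25, 18.0],
})

# Order-sensitive spelling: sort everything, then take the first rows.
via_sort = fares.sort_values("fare", ascending=False).head(2)

# Equivalent spelling that asks directly for the two largest fares.
via_nlargest = fares.nlargest(2, "fare")
```

With ties in the `fare` column the two results could differ in which tied rows are kept, so the equivalence shown here is for distinct values.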

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.

Review comment:
       ```suggestion
   If you want to use a non-parallelizable operation, you can guard it with a `beam.dataframe.allow_non_parallel_operations` block, for example:
   
   \```
   with beam.dataframe.allow_non_parallel_operations():
      quantiles = df.quantile()
   \```
   
   Note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
   ```
   
   Might be nice to have a usage example here. (Note: my suggestion has an extraneous backslash on the code fences; otherwise GitHub renders this incorrectly.)



