
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/24 03:44:40 UTC

[GitHub] [beam] pcoet opened a new pull request #15074: added "Differences from Pandas" page for DataFrame

pcoet opened a new pull request #15074:
URL: https://github.com/apache/beam/pull/15074


   Added a page to document differences between Beam DataFrame and Pandas DataFrame.
   
   @TheNeuralBit 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] pcoet commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

pcoet commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658137020



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.
+

Review comment:
       Good suggestions. Thanks! I integrated your changes, made a few other minor tweaks, and pushed another commit.
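
For readers of the diff above: the "Operations that produce deferred scalars" section says that a truth test on a deferred scalar raises a `TypeError`. The following is a toy model of that behavior in plain Python, not Beam's implementation. The `DeferredScalar` class, its `evaluate` method, and the `deferred_sum` helper are hypothetical names introduced only for illustration.

```python
class DeferredScalar:
    """Toy stand-in for a deferred result; not Beam's actual class."""

    def __init__(self, compute):
        # `compute` is a zero-argument callable that produces the value later.
        self._compute = compute

    def __bool__(self):
        # The value does not exist yet, so a truth test cannot be answered.
        raise TypeError("Testing the truth value of a deferred scalar is not allowed.")

    def evaluate(self):
        # Stands in for running the pipeline and materializing the result.
        return self._compute()


def deferred_sum(values):
    # Hypothetical helper: wraps the sum so it is computed only on demand.
    return DeferredScalar(lambda: sum(values))


total = deferred_sum([1, 2, 3])

# Branching on the deferred value fails before any data is computed.
try:
    if total:
        pass
    branch_error = None
except TypeError as exc:
    branch_error = exc

# The value becomes usable only after explicit materialization.
materialized = total.evaluate()
```

In this model the error comes from `__bool__`, which is the hook Python calls for `if` and other truth tests. That is why the restriction applies to control flow but not to building further deferred expressions.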





[GitHub] [beam] TheNeuralBit commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658220348



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,90 @@
+---
+type: languages
+title: "Differences from pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for pandas, but there are a few differences to be aware of. This page describes divergences between the Beam and pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Classes of unsupported operations
+
+The sections below describe classes of operations that are not supported, or not yet supported, by Beam DataFrame. Workarounds are suggested, where applicable.
+
+### Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you can guard it with a `beam.dataframe.allow_non_parallel_operations` block. For example:
+
+    with beam.dataframe.allow_non_parallel_operations:
+      quantiles = df.quantile()

Review comment:
       ```suggestion
       from apache_beam import dataframe
       
       with dataframe.allow_non_parallel_operations():
         quantiles = df.quantile()
   ```
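
For anyone following the thread, here is a minimal plain-Python sketch of why the parentheses in the suggestion matter. It does not use Beam: `allow_non_parallel_operations`, `NonParallelOperation`, and `median` below are hypothetical stand-ins that only mimic the guard pattern, assuming the real helper behaves like a `contextlib`-style context manager factory.

```python
import contextlib

# Hypothetical stand-ins for illustration only; not Beam's implementation.
_allow_non_parallel = False


class NonParallelOperation(Exception):
    """Raised when a guarded operation runs outside the allow block."""


@contextlib.contextmanager
def allow_non_parallel_operations(allow=True):
    """Temporarily permit guarded operations, restoring the old setting on exit."""
    global _allow_non_parallel
    previous = _allow_non_parallel
    _allow_non_parallel = allow
    try:
        yield
    finally:
        _allow_non_parallel = previous


def median(values):
    """Toy non-parallelizable operation: it needs every value in one place."""
    if not _allow_non_parallel:
        raise NonParallelOperation("median requires all data on a single worker")
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Correct: the function is called, so `with` receives a context manager.
with allow_non_parallel_operations():
    middle = median([3, 1, 2])

# Incorrect: without the parentheses, `with` receives a plain function object,
# which does not implement the context manager protocol.
try:
    with allow_non_parallel_operations:
        median([3, 1, 2])
except (AttributeError, TypeError) as err:
    missing_call_error = type(err).__name__
```

The exception type for the missing call depends on the Python version, which is why the sketch catches both `AttributeError` and `TypeError`.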

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,90 @@
+---
+type: languages
+title: "Differences from pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for pandas, but there are a few differences to be aware of. This page describes divergences between the Beam and pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Classes of unsupported operations
+
+The sections below describe classes of operations that are not supported, or not yet supported, by Beam DataFrame. Workarounds are suggested, where applicable.

Review comment:
       ```suggestion
   The sections below describe classes of operations that are not supported, or not yet supported, by the Beam DataFrame API. Workarounds are suggested, where applicable.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] pcoet commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

pcoet commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658434588



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/overview.md
##########
@@ -112,22 +112,3 @@ pc1, pc2 = {'a': pc} | DataframeTransform(lambda a: expr1, expr2)
 
 {...} = {a: pc} | DataframeTransform(lambda a: {...})
 {{< /highlight >}}
-
-## Differences from standard Pandas {#differences_from_standard_pandas}
-
-Beam DataFrames are deferred, like the rest of the Beam API. As a result, there are some limitations on what you can do with Beam DataFrames, compared to the standard Pandas implementation:
-
-* Because all operations are deferred, the result of a given operation may not be available for control flow. For example, you can compute a sum, but you can't branch on the result.
-* Result columns must be computable without access to the data. For example, you can’t use [transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html).
-* PCollections in Beam are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are unsupported. For example, order-sensitive operations such as [shift](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html), [cummax](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummax.html), [cummin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummin.html), [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html), and [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) are not supported.
-
-With Beam DataFrames, computation doesn’t take place until the pipeline runs. Before that, only the shape or schema of the result is known, meaning that you can work with the names and types of the columns, but not the result data itself.
-
-There are a few common exceptions you may see when attempting to use certain Pandas operations:
-
-* **WontImplementError**: Indicates that this operation or argument isn’t supported because it’s incompatible with the Beam model. The largest class of operations that raise this error are order-sensitive operations.
-* **NotImplementedError**: Indicates this is an operation or argument that hasn’t been implemented yet. Many Pandas operations are already available through Beam DataFrames, but there’s still a long tail of unimplemented operations.
-* **NonParallelOperation**: Indicates that you’re attempting a non-parallel operation outside of an `allow_non_parallel_operations` block. Some operations don't lend themselves to parallel computation. They can still be used, but must be guarded in a `with beam.dataframe.allow_non_parallel_operations(True)` block.
-
-[pydoc_dataframe_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform
-[pydoc_sql_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform

Review comment:
       Ugh. Thanks for catching!





[GitHub] [beam] pcoet commented on pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

pcoet commented on pull request #15074:
URL: https://github.com/apache/beam/pull/15074#issuecomment-868084405


   Section removed and docs linked. Also fixed some extra whitespace that was breaking the build...



[GitHub] [beam] TheNeuralBit commented on pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on pull request #15074:
URL: https://github.com/apache/beam/pull/15074#issuecomment-868671326


   Thank you @pcoet! :tada: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658037149



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations

Review comment:
       ```suggestion
   ## Classes of Unsupported Operations
   ### Non-parallelizable operations
   ```
   WDYT about collecting all the classes of operations under a heading like this ("Using interactive Beam" would still be at the same level though)?

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values

Review comment:
       ```suggestion
   ## Using Interactive Beam to access the full pandas API
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.

Review comment:
       ```suggestion
   The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas, but there are a few differences to be aware of. This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.

Review comment:
       ```suggestion
   ```
   
   I don't know that we need a general explanation here, since these are all explained above.
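
For reference, the control-flow limitation that the paragraph mentions can be sketched in plain Python. This is only an illustrative stand-in: `DeferredScalar` is a hypothetical class, not Beam's implementation, and it assumes the behavior described in the page text (a truth test on a deferred value raises `TypeError`).

```python
class DeferredScalar:
    """Hypothetical stand-in for a value that is only known once a pipeline runs."""

    def __init__(self, compute):
        self._compute = compute  # callable evaluated later, not now

    def __bool__(self):
        # Branching needs a concrete value, which a deferred result does not have yet.
        raise TypeError("deferred value cannot be used in a truth test")

    def evaluate(self):
        # Stand-in for running the pipeline and collecting the result.
        return self._compute()


total = DeferredScalar(lambda: sum([1, 2, 3]))

try:
    if total:  # raises: the sum has not been computed yet
        pass
except TypeError as err:
    message = str(err)

# The concrete value is available only after evaluation.
result = total.evaluate()
```

In this sketch `evaluate` plays the role that collecting a result into local memory plays in the workarounds described in the page.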

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce non-deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.
+

Review comment:
       Should there be a TODO here with a jira for adding an example?
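
Until a real example lands, a rough sketch of the idea behind that paragraph may help. The toy model below is plain Python, not Beam: `ToyDeferredScalar` and `toy_collect` are hypothetical names invented for illustration, standing in for a deferred result and for the role `ib.collect` plays. It only shows why a deferred value can't drive an `if` until something materializes it.

```python
class ToyDeferredScalar:
    """Hypothetical stand-in for a deferred result (not a Beam class)."""

    def __init__(self, compute):
        # Store the recipe for the value; nothing is computed yet.
        self._compute = compute

    def __bool__(self):
        # A deferred value has no data yet, so truth tests are rejected.
        raise TypeError("deferred value is not available for control flow")


def toy_collect(deferred):
    """Hypothetical analogue of materializing a result into local memory."""
    return deferred._compute()


total = ToyDeferredScalar(lambda: sum([1, 2, 3]))

try:
    if total:  # branching on the deferred value fails
        pass
    branch_error = None
except TypeError as err:
    branch_error = str(err)

local_total = toy_collect(total)  # now an ordinary int
is_large = local_total > 5        # ordinary control flow works again
```

Once the value is local, it behaves like any other Python object. That is the property the page attributes to results brought into memory with `ib.collect`.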

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order sensitive. For example, Pandas users often call the order-sensitive [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) operation to peek at data, but if you just want to view a subset of elements, you can also use `sample`, which doesn’t require you to collect the data first. Similarly, you could use `nlargest` instead of `sort_values(...).head`.
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual computation of the values is deferred, and so the values are not available for control flow. For example, you can compute a sum with `Series.sum`, but you can’t immediately branch on the result, because the result data is not immediately available. `Series.is_unique` is a similar example. Using a deferred scalar for branching logic or truth tests raises a [TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame operations, and we’re actively working to support the remaining operations. But Pandas has a large API, and there are still gaps ([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke an operation that hasn’t been implemented yet, it will raise a `NotImplementedError`. Please [let us know](https://beam.apache.org/community/contact-us/) if you encounter a missing operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they produce non-deferred values that are incompatible with the Beam programming model. Other operations with deferred results are implemented, but the results aren’t available for control flow in the pipeline. A third class of operations can’t be implemented because they’re order sensitive, and Beam PCollections are unordered. For all these cases, [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html) can provide workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` operation that brings a dataset into local memory and makes it available for DataFrame operations that are order-sensitive or can’t be deferred.

Review comment:
       ```suggestion
   Interactive Beam is a module designed for use in interactive notebooks. The module, which by convention is imported as `ib`, provides an `ib.collect` function that brings a `PCollection` or deferred DataFrame into local memory as a pandas DataFrame. After using `ib.collect` to materialize a deferred DataFrame, you will be able to perform any operation in the pandas API, not just those that are supported in Beam.
   ```

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.

Review comment:
       ```suggestion
   ```
   
   I think this is incorrect (unless I'm misinterpreting?). `to_pcollection` and `to_dataframe` aren't helpful here.

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting DataFrames are not, meaning that result columns must be computable without access to the data. Some DataFrame operations can’t support this usage, so they can’t be implemented. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Currently there’s no workaround for this issue. But in the future, Beam DataFrame may support non-deferred column operations on categorical columns. This work is being tracked in [BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame APIs that produce non-deferred values or plots. If invoked, these operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.runners.interactive.interactive_beam.html), you can use `collect` to bring a dataset into local memory and then perform these operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection) to convert a deferred DataFrame to a PCollection, and you can use [to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe) to convert a PCollection to a deferred DataFrame. These methods provide additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are not supported. These operations raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also [contact us](https://beam.apache.org/community/contact-us/) to let us know we should prioritize this work.

Review comment:
       ```suggestion
   Order-sensitive operations may be supported in the future. To track progress on this issue, follow [BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). If you think we should prioritize this work, you can also [contact us](https://beam.apache.org/community/contact-us/) to let us know.
   ```
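
The page's order-sensitive section suggests `nlargest` in place of `sort_values(...).head`. A small plain-Python sketch may show why that substitution suits an unordered collection. This is not Beam code; the partition lists and the `top_n_across_partitions` helper are made up for illustration. The point is that a top-n result can be computed independently per partition and then merged, without relying on any global row order.

```python
import heapq


def top_n_across_partitions(partitions, n):
    """Hypothetical helper: top-n via per-partition top-n plus a merge."""
    # Each partition can be reduced independently (in parallel, in principle).
    partial = [heapq.nlargest(n, part) for part in partitions]
    # Merging the small partial results gives the global top-n.
    return heapq.nlargest(n, (x for part in partial for x in part))


# The same values split two different ways, in different orders.
split_a = [[5, 1, 9], [7, 3], [8, 2]]
split_b = [[2, 8, 3, 7], [9, 1, 5]]

result_a = top_n_across_partitions(split_a, 3)
result_b = top_n_across_partitions(split_b, 3)
```

Both splits give the same answer regardless of how the rows were partitioned or ordered, which is the property an order-sensitive operation such as `head` lacks.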

##########
File path: website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas DataFrame, but there are a few differences to be aware of. The Beam DataFrame API is adapted for deferred processing, and Beam doesn’t implement all of the Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data into a Beam DataFrame, you have to apply the source to a pipeline object. For example, to read input from a CSV file, you could use [read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but `df` is a deferred Beam DataFrame representing the contents of the file. The input filename can be any file pattern understood by [fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see [taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on subsets of data in parallel. Some DataFrame operations can’t be parallelized, and these operations raise a [NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation) error by default.
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to guard it with a `beam.dataframe.allow_non_parallel_operations(True)` block. But note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.

Review comment:
       ```suggestion
   If you want to use a non-parallelizable operation, you can guard it with a `beam.dataframe.allow_non_parallel_operations` block, for example:
   
   \```
   with beam.dataframe.allow_non_parallel_operations():
      quantiles = df.quantile()
   \```
   
   Note that this collects the entire input dataset on a single node, so there’s a risk of running out of memory. You should only use this workaround if you’re sure that the input is small enough to process on a single worker.
   ```
   
   Might be nice to have a usage example here. (Note: my suggestion has an extraneous backslash on the code fences; otherwise GitHub renders this incorrectly.)
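
For readers unfamiliar with the pattern, the guard can be pictured as a context manager that flips a flag for the duration of the block. The sketch below is a toy in plain Python and makes no claim about Beam's actual implementation; `_allow`, `toy_allow_non_parallel`, `toy_median`, and `ToyNonParallelOperation` are all hypothetical names.

```python
import contextlib

_allow = {"non_parallel": False}  # hypothetical module-level flag


class ToyNonParallelOperation(Exception):
    """Hypothetical error raised when an unguarded operation runs."""


@contextlib.contextmanager
def toy_allow_non_parallel(allow=True):
    # Set the flag for the block and restore the old value afterwards.
    old = _allow["non_parallel"]
    _allow["non_parallel"] = allow
    try:
        yield
    finally:
        _allow["non_parallel"] = old


def toy_median(values):
    # Stand-in for an operation that needs all of the data in one place.
    if not _allow["non_parallel"]:
        raise ToyNonParallelOperation("guard this call explicitly")
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


try:
    toy_median([3, 1, 2])
    unguarded_failed = False
except ToyNonParallelOperation:
    unguarded_failed = True

with toy_allow_non_parallel():
    guarded_result = toy_median([3, 1, 2])

flag_restored = _allow["non_parallel"] is False
```

Making the opt-in explicit and scoped keeps the single-worker cost visible at the call site, which matches the memory warning in the suggested text.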




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658371715



##########
File path: website/www/site/content/en/documentation/dsls/dataframes/overview.md
##########
@@ -112,22 +112,3 @@ pc1, pc2 = {'a': pc} | DataframeTransform(lambda a: expr1, expr2)
 
 {...} = {a: pc} | DataframeTransform(lambda a: {...})
 {{< /highlight >}}
-
-## Differences from standard Pandas {#differences_from_standard_pandas}
-
-Beam DataFrames are deferred, like the rest of the Beam API. As a result, there are some limitations on what you can do with Beam DataFrames, compared to the standard Pandas implementation:
-
-* Because all operations are deferred, the result of a given operation may not be available for control flow. For example, you can compute a sum, but you can't branch on the result.
-* Result columns must be computable without access to the data. For example, you can’t use [transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html).
-* PCollections in Beam are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are unsupported. For example, order-sensitive operations such as [shift](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html), [cummax](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummax.html), [cummin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummin.html), [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html), and [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) are not supported.
-
-With Beam DataFrames, computation doesn’t take place until the pipeline runs. Before that, only the shape or schema of the result is known, meaning that you can work with the names and types of the columns, but not the result data itself.
-
-There are a few common exceptions you may see when attempting to use certain Pandas operations:
-
-* **WontImplementError**: Indicates that this operation or argument isn’t supported because it’s incompatible with the Beam model. The largest class of operations that raise this error are order-sensitive operations.
-* **NotImplementedError**: Indicates this is an operation or argument that hasn’t been implemented yet. Many Pandas operations are already available through Beam DataFrames, but there’s still a long tail of unimplemented operations.
-* **NonParallelOperation**: Indicates that you’re attempting a non-parallel operation outside of an `allow_non_parallel_operations` block. Some operations don't lend themselves to parallel computation. They can still be used, but must be guarded in a `with beam.dataframe.allow_non_parallel_operations(True)` block.
-
-[pydoc_dataframe_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform
-[pydoc_sql_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform

Review comment:
       I don't think you meant to remove these; it looks like it broke some links:
   ![image](https://user-images.githubusercontent.com/675055/123351324-b69f8500-d511-11eb-8980-ff1666192d9f.png)
   





[GitHub] [beam] TheNeuralBit merged pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

Posted by GitBox <gi...@apache.org>.

TheNeuralBit merged pull request #15074:
URL: https://github.com/apache/beam/pull/15074


   

