Posted to dev@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2018/05/08 13:13:35 UTC

Documenting the various DataFrame/SQL join types

The documentation for DataFrame.join()
<https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join>
lists all the join types we support:

   - inner
   - cross
   - outer
   - full
   - full_outer
   - left
   - left_outer
   - right
   - right_outer
   - left_semi
   - left_anti
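Several of these names appear to be aliases for one another. The grouping below is a sketch based on how Spark 2.3 seems to resolve the join type string, so treat it as an assumption to verify rather than a reference. The names `JOIN_TYPE_ALIASES` and `canonical_join_type` are made up for illustration and are not part of the PySpark API.

```python
# Hypothetical lookup table, not a PySpark API. It groups the join type
# strings from the list above by the join they are understood to select.
JOIN_TYPE_ALIASES = {
    "inner": ["inner"],
    "cross": ["cross"],
    "full_outer": ["outer", "full", "full_outer"],
    "left_outer": ["left", "left_outer"],
    "right_outer": ["right", "right_outer"],
    "left_semi": ["left_semi"],
    "left_anti": ["left_anti"],
}


def canonical_join_type(name):
    """Return the group (a key of JOIN_TYPE_ALIASES) that a join type string belongs to."""
    for canonical, aliases in JOIN_TYPE_ALIASES.items():
        if name in aliases:
            return canonical
    raise ValueError("unknown join type: %r" % name)
```

Under this reading, the eleven documented strings collapse to seven distinct joins.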

Some of these join types are also listed in the SQL Programming Guide
<http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#supported-hive-features>.

Is it obvious to everyone what all these different join types are? For
example, I had never heard of a LEFT ANTI join until stumbling on it in the
PySpark docs. It’s quite handy! But I had to experiment with it a bit just
to understand what it does.
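For anyone else who hasn't met them, here is a rough model of what LEFT SEMI and LEFT ANTI joins do, written in plain Python rather than Spark so it runs anywhere. It is a sketch of the semantics as commonly described, not Spark's implementation. The helper names, the sample rows, and the choice to treat `None` keys as never matching (mirroring SQL NULL equality) are all assumptions made for illustration.

```python
def left_semi(left, right, key):
    """Keep each left row whose key appears in at least one right row."""
    # None keys are skipped on the assumption that NULL never equals NULL.
    right_keys = {row[key] for row in right if row[key] is not None}
    return [row for row in left if row[key] in right_keys]


def left_anti(left, right, key):
    """Keep each left row whose key appears in no right row."""
    right_keys = {row[key] for row in right if row[key] is not None}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: customer 1 has two orders, customer 2 has none.
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cat"},
]
orders = [
    {"id": 1, "item": "book"},
    {"id": 1, "item": "pen"},
    {"id": 3, "item": "lamp"},
]

with_orders = left_semi(customers, orders, "id")     # rows for Ann and Cat
without_orders = left_anti(customers, orders, "id")  # row for Bob only
```

In both cases only the left side's columns come back, and a left row appears at most once even when it has several matches on the right. That is the main difference from an inner join.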

I think it would be a good service to our users if we either documented
these join types clearly ourselves, or linked to an external resource that
documents them sufficiently. I’m happy to file a JIRA about this and do the
work myself. It would be great if the documentation could be expressed as a
series of simple doc tests, but brief prose describing how each join works
would still be valuable.
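To give a feel for the doc test format, here is a minimal sketch that documents a left outer join with a doctest. It uses a plain Python stand-in rather than a real DataFrame so that it runs without a Spark session. The `left_outer` helper and its sample data are hypothetical, and a real PySpark doctest would call `DataFrame.join()` instead.

```python
import doctest


def left_outer(left, right):
    """Join (key, value) pairs, keeping every left row.

    Left rows with no match on the right are padded with None.

    >>> left_outer([(1, "Ann"), (2, "Bob")], [(1, "cat")])
    [(1, 'Ann', 'cat'), (2, 'Bob', None)]
    """
    joined = []
    for key, l_val in left:
        matches = [r_val for r_key, r_val in right if r_key == key]
        # Emit one row per match, or a single None-padded row if none exist.
        for r_val in matches or [None]:
            joined.append((key, l_val, r_val))
    return joined


if __name__ == "__main__":
    doctest.testmod()
```

The docstring doubles as the documentation and the test: if the behaviour ever changes, the example fails instead of going quietly stale.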

Does this seem worthwhile to folks here? And does anyone want to offer
guidance on how best to provide this kind of documentation so that it’s
easy for users to find, regardless of the language they’re using?

Nick

Re: Documenting the various DataFrame/SQL join types

Posted by Nicholas Chammas <ni...@gmail.com>.
OK great, I’m happy to take this on.

Does it make sense to approach this by adding an example for each join type
here
<https://github.com/apache/spark/blob/master/examples/src/main/python/sql/basic.py>
(and perhaps also in the matching areas for Scala, Java, and R), and then
referencing the examples from the SQL Programming Guide
<https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md>
using include_example tags?

e.g.:

<div data-lang="python" markdown="1">
{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
</div>

And would this let me implement simple tests for the examples? It’s not
clear to me whether the comment blocks in that example file are used for
testing somehow.
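From the existing example files, the comment blocks appear to be `# $example on:<label>$` and `# $example off:<label>$` markers. The include_example tag seems to use them to choose which lines are shown in the guide, and they do not look test related. The sketch below models that kind of extraction in Python. It is not the actual docs plugin, and `extract_example`, the sample text, and the `left_anti_join` label are all hypothetical.

```python
def extract_example(source, label):
    """Return the lines between '$example on:label$' and '$example off:label$'."""
    on_marker = "$example on:%s$" % label
    off_marker = "$example off:%s$" % label
    selected = []
    inside = False
    for line in source.splitlines():
        if on_marker in line:
            inside = True
        elif off_marker in line:
            inside = False
        elif inside:
            selected.append(line)
    return "\n".join(selected)


# Hypothetical contents of an example file; it is only parsed, never executed.
sample = '''\
df = spark.read.json("people.json")
# $example on:left_anti_join$
result = left.join(right, "id", "left_anti")
# $example off:left_anti_join$
result.show()
'''

snippet = extract_example(sample, "left_anti_join")
```

If the markers do work this way, each join type could get its own label, and the guide would pull in just that snippet for each language tab.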

Just looking for some high level guidance.

Nick

On Tue, May 8, 2018 at 11:42 AM Reynold Xin <rx...@databricks.com> wrote:

> Would be great to document. Probably best with examples.

Re: Documenting the various DataFrame/SQL join types

Posted by Reynold Xin <rx...@databricks.com>.
Would be great to document. Probably best with examples.
