Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2014/08/05 02:26:36 UTC

[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1774

    [SPARK-2179] [SQL] Public API for DataTypes and Schema (Draft update for SQL programming guide)

    This is the draft update for the SQL programming guide. It adds documentation for the data type and schema APIs. You can access it at http://yhuai.github.io/site/sql-programming-guide.html.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark dataTypeDoc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1774
    
----
commit 29bc6688943b5639c2e2705cb65d6d1ceca881c0
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-05T00:19:47Z

    Draft doc for data type and schema APIs.

commit 31ba240ac37280072d97422275d4b2c2bf5f04a5
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-05T00:20:07Z

    Merge remote-tracking branch 'upstream/master' into dataTypeDoc

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-54113827
  
    @pwendell It seems it is not a part of our SQL programming guide. I can update it next week (I am out of town this week).



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790538
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -259,6 +342,40 @@ for teenName in teenNames.collect():
       print teenName
     {% endhighlight %}
     
    +Another way to turns an RDD to table is to use `applySchema`. Here is an example.
    --- End diff --
    
    Same - maybe do a replaceAll



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-54113233
  
    @yhuai can you close this now? I think it was fixed in another PR



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15827707
  
    --- Diff: python/pyspark/sql.py ---
    @@ -269,7 +269,7 @@ def __repr__(self):
     class StructType(DataType):
         """Spark SQL StructType
     
    -    The data type representing rows.
    +    The data type representing tuple or list values.
    --- End diff --
    
    This inconsistency is introduced by the difference between the JVM Row and Python Row. For a JVM Row (both Scala and Java), fields in it are nameless and we need to extract values by providing ordinals. However, a field in a Python Row has its name. Right now, in Python, if users have an `RDD[Row]`, they need to use `inferSchema` to create a `SchemaRDD`. If they have an `RDD[tuple]` or `RDD[list]`, they need to use `applySchema`.
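
The distinction described in this comment can be sketched in plain Python. This is not the PySpark API: `collections.namedtuple` stands in for a Python `Row` whose fields carry names, a bare tuple stands in for a nameless record read by ordinal, and `Person`, `schema_names`, and `field` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in for a Python Row: each field carries its own name,
# so field names can be read off the record itself.
Person = namedtuple("Person", ["name", "age"])
named_record = Person(name="Michael", age=29)

# Stand-in for a nameless row (or a plain tuple/list in Python):
# values have no names and must be read by ordinal.
positional_record = ("Michael", 29)

# Field names for positional data have to be supplied separately,
# which is the role an explicitly applied schema plays.
schema_names = ["name", "age"]


def field(record, name, names=schema_names):
    """Look up a value in a positional record using externally supplied names."""
    return record[names.index(name)]


print(named_record.age)                 # access by name
print(positional_record[1])             # access by ordinal
print(field(positional_record, "age"))  # ordinal resolved via the separate schema
```

Under these assumptions, the named record is the case where a schema could be inferred, and the positional record is the case where one has to be applied from outside.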



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790528
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -225,6 +260,54 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
     
     {% endhighlight %}
     
    +Another way to turns an RDD to table is to use `applySchema`. Here is an example.
    --- End diff --
    
    "to turn"; "to a table"



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-51138896
  
    QA results for PR 1774:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-54240312
  
    @marmbrus should I close it now or wait until you have the new PR for our SQL programming guide?



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790282
  
    --- Diff: python/pyspark/sql.py ---
    @@ -269,7 +269,7 @@ def __repr__(self):
     class StructType(DataType):
         """Spark SQL StructType
     
    -    The data type representing rows.
    +    The data type representing tuple or list values.
    --- End diff --
    
    @davies told me that we only accept tuples or lists as values of `StructType` for `applySchema`. We need to finalize which value types are acceptable before the release. https://issues.apache.org/jira/browse/SPARK-2854 is used to track it.



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai closed the pull request at:

    https://github.com/apache/spark/pull/1774



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-54240358
  
    You can close it.
    On Sep 2, 2014 6:13 PM, "Yin Huai" <no...@github.com> wrote:
    
    > @marmbrus <https://github.com/marmbrus> should I close it now or wait
    > until you have the new pr for our sql programming guide?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1774#issuecomment-54240312>.
    >



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790429
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
     teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
     {% endhighlight %}
     
    +Another way to turns an RDD to table is to use `applySchema`. Here is an example.
    --- End diff --
    
    "to turn"



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-51276850
  
    QA tests have started for PR 1774. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790159
  
    --- Diff: python/pyspark/sql.py ---
    @@ -269,7 +269,7 @@ def __repr__(self):
     class StructType(DataType):
         """Spark SQL StructType
     
    -    The data type representing rows.
    +    The data type representing tuple or list values.
    --- End diff --
    
    What's up with this change?



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-51135882
  
    QA tests have started for PR 1774. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790226
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
     teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
     {% endhighlight %}
     
    +Another way to turns an RDD to table is to use `applySchema`. Here is an example.
    --- End diff --
    
    It would be good to provide some motivation here.  Perhaps talk about programmatically creating a schema when it is not possible to statically define classes ahead of time.
    
    Related: an example where the schema is determined dynamically might make more sense (e.g. read from the first row of the file?), but maybe that is too complicated...
    
    Minor: Usually we just say "For example".
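
The motivation suggested in this comment, a schema that is only known once the data has been read, can be sketched in plain Python. This is an illustration under stated assumptions and not Spark SQL code: the list of `(name, converter)` pairs stands in for a programmatically built schema, and `lines`, `converters`, `build_schema`, and `to_row` are hypothetical names with made-up sample data.

```python
# Hypothetical input: the first line names the columns, so the schema
# is only known after the data has been read.
lines = ["name,age", "Michael,29", "Andy,30", "Justin,19"]

# Hypothetical mapping from column name to a converter; any column
# not listed here is kept as a string.
converters = {"age": int}


def build_schema(header):
    """Return a list of (field_name, converter) pairs derived from the header."""
    return [(name.strip(), converters.get(name.strip(), str))
            for name in header.split(",")]


def to_row(line, schema):
    """Convert one delimited line into a tuple that follows the schema order."""
    values = line.split(",")
    return tuple(convert(value.strip())
                 for (_, convert), value in zip(schema, values))


schema = build_schema(lines[0])
rows = [to_row(line, schema) for line in lines[1:]]

print([name for name, _ in schema])
print(rows)
```

No class describing a record is defined ahead of time here; the field names come from the data, which is the situation where applying an explicit schema would be useful.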



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-54114283
  
    I plan to use this branch as the starting point for the documentation I'll
    be writing this week.
    On Sep 1, 2014 11:28 PM, "Yin Huai" <no...@github.com> wrote:
    
    > @pwendell <https://github.com/pwendell> seems it is not a part of our sql
    > programming guide. I can update it next week (I am out of town this week).
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1774#issuecomment-54113827>.
    >



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1774#issuecomment-51281876
  
    QA results for PR 1774:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull



[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1774#discussion_r15790441
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
     teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
     {% endhighlight %}
     
    +Another way to turns an RDD to table is to use `applySchema`. Here is an example.
    +{% highlight scala %}
    +// sc is an existing SparkContext.
    +val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +
    +// Create an RDD
    +val people = sc.textFile("examples/src/main/resources/people.txt")
    +
    +// Import Spark SQL data types and Row.
    +import org.apache.spark.sql._
    +
    +// Define the schema that will be applied to the RDD.
    +val schema =
    +  StructType(
    +    StructField("name", StringType, true) ::
    +    StructField("age", IntegerType, true) :: Nil)
    +
    +// Convert records of the RDD (people) to rows.
    --- End diff --
    
    "to Rows"?

