Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2015/11/02 15:49:37 UTC

[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/9415

    [SPARK-11458][SQL] add word count example for Dataset

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark wordcount

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9415
    
----
commit 788c1a1675b1470d09175cdf31f2131e8d4767ac
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-11-02T14:40:52Z

    word count example for Dataset

----



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153367929
  
    **[Test build #44922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44922/consoleFull)** for PR 9415 at commit [`ca7f099`](https://github.com/apache/spark/commit/ca7f099cf1a6f16d6f89bd91fe76b2ce114ab49e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `case class Exchange(`
       * `class CoalescedPartitioner(val parent: Partitioner, val partitionStartIndices: Array[Int])`



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43712471
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    Actually `mapGroups` and `cogroup` are the only 2 aggregations we have in `Dataset`...



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153050637
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/
    Test PASSed.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43649263
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    I'm wondering whether GroupedDataset should just have `map` and `flatMap` to match the RDD API.  It is kind of annoying to have to wrap things in `Iterator` when the common case is probably returning a single item per group.
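
    For comparison, here is a minimal sketch of the same count written against the plain RDD API, where `groupBy` yields `(key, Iterable[value])` pairs and an ordinary `map` returns one result per group with no `Iterator` wrapping (illustrative only, not part of this PR):

    ```
    // Illustrative RDD-based sketch: after groupBy, a plain `map` produces one
    // tuple per group, with no need to wrap the result in an Iterator.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("RddWordCount"))
    val counts = sc.parallelize(Seq("hello world", "say hello to the world"))
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
    counts.collect().foreach(println)
    ```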



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153043608
  
    **[Test build #44813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)** for PR 9415 at commit [`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43649615
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    /cc @rxin, @mateiz



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43712005
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    For this particular example, why are we using mapGroups? We should be using aggregation; otherwise it's not a great example.
    
    I agree it would be nice to have methods that don't return an Iterator though.
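
    For illustration, an aggregation-based sketch using the existing DataFrame API; the column name `word` and the `toDF` conversion are just assumptions for this sketch, not part of the PR:

    ```
    // Aggregation-based sketch: let a built-in aggregate do the counting instead of
    // materializing every group on the caller side.
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WordCountAgg"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val words = sc.parallelize(Seq("hello world", "say hello to the world"))
      .flatMap(_.split(" "))
    val wordCounts = words.toDF("word")   // column name assumed for this sketch
      .groupBy("word")
      .count()
    wordCounts.show()
    ```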



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153361619
  
    **[Test build #44922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44922/consoleFull)** for PR 9415 at commit [`ca7f099`](https://github.com/apache/spark/commit/ca7f099cf1a6f16d6f89bd91fe76b2ce114ab49e).



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43648991
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    +    }
    +
    +    counts.foreach { case (word, count) => println(s"$word: $count") }
    --- End diff --
    
    We should `collect()` here so this doesn't run on the executors.
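
    A minimal sketch of that change, assuming the rest of the example stays as in the diff above:

    ```
    // Bring the results back to the driver before printing, so the println runs
    // locally instead of on the executors.
    counts.collect().foreach { case (word, count) => println(s"$word: $count") }
    ```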



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/9415



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153356857
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153050470
  
    **[Test build #44813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)** for PR 9415 at commit [`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43712012
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    +    }
    +
    +    counts.foreach { case (word, count) => println(s"$word: $count") }
    --- End diff --
    
    It can be just `counts.collect.foreach(println)`.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153040919
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153356924
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153050635
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153040844
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153368071
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44922/
    Test PASSed.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153368068
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/9415#issuecomment-153079772
  
    This is great!  It's really helpful to see the API in use and think about what parts might be hard for users.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r44048499
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    Yeah, agree.  I wasn't planning to commit all of these unless they end up being good examples of the API.  I mostly just think it's super helpful to see various ways that people might try to implement common tasks.
    
    Anyway, this will be much easier once #9499 is merged.



[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43762343
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    +  def main(args: Array[String]): Unit = {
    +    val sparkConf = new SparkConf().setAppName("DatasetWordCount")
    +    val sc = new SparkContext(sparkConf)
    +    val sqlContext = new SQLContext(sc)
    +
    +    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    +    import sqlContext.implicits._
    +
    +    val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS()
    +    val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1)
    +    val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
    +      case (word, iter) => Iterator(word -> iter.length)
    --- End diff --
    
    It's best not to use this example; otherwise you might find a lot of users writing word counts in this inefficient way. If you can't come up with a better one, maybe put a big warning in the example code saying that this is a really inefficient way to do word count, and that it is used here only for illustrative purposes.
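
    For contrast, the usual shuffle-efficient formulation with the RDD API (shown only for illustration, not part of this PR) combines partial counts map-side via `reduceByKey` instead of shipping every occurrence to its group:

    ```
    // Classic word count: reduceByKey merges partial counts before the shuffle, so
    // at most one (word, partialCount) pair per word and partition crosses the wire.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("EfficientWordCount"))
    val counts = sc.parallelize(Seq("hello world", "say hello to the world"))
      .flatMap(_.split(" "))
      .map(word => word -> 1)
      .reduceByKey(_ + _)
    counts.collect().foreach { case (word, count) => println(s"$word: $count") }
    ```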




[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9415#discussion_r43649975
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
    @@ -0,0 +1,42 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.sql
    +
    +import org.apache.spark.sql.{Dataset, SQLContext}
    +import org.apache.spark.{SparkContext, SparkConf}
    +
    +object DatasetWordCount {
    --- End diff --
    
    We should add some scaladoc to explain what the example is doing.
    
    ```
    Given a Dataset of Strings, we tokenize by splitting on whitespace and count the number of occurrences of each unique word.
    ```
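
    Applied to the example, that scaladoc might look like this (just a sketch of the suggestion above; the body stays as in the diff):

    ```
    /**
     * Given a Dataset of Strings, we tokenize by splitting on whitespace and count
     * the number of occurrences of each unique word.
     */
    object DatasetWordCount {
      def main(args: Array[String]): Unit = {
        // body unchanged from the diff above
      }
    }
    ```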

