Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2017/03/03 22:44:40 UTC

[GitHub] spark pull request #16856: [SPARK-19516][DOC] update public doc to use Spark...

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16856#discussion_r104255711
  
    --- Diff: docs/quick-start.md ---
    @@ -29,28 +30,28 @@ or Python. Start it by running the following in the Spark directory:
     
         ./bin/spark-shell
     
    -Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
    +Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let's make a new Dataset from the text of the README file in the Spark source directory:
     
     {% highlight scala %}
    -scala> val textFile = sc.textFile("README.md")
    -textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
    +scala> val textFile = spark.read.textFile("README.md")
    +textFile: org.apache.spark.sql.Dataset[String] = [value: string]
     {% endhighlight %}
     
    -RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
    +You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/index.html#org.apache.spark.sql.Dataset)_.
     
     {% highlight scala %}
    -scala> textFile.count() // Number of items in this RDD
    +scala> textFile.count() // Number of items in this Dataset
     res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs
     
    -scala> textFile.first() // First item in this RDD
    +scala> textFile.first() // First item in this Dataset
     res1: String = # Apache Spark
     {% endhighlight %}
     
    -Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
    +Now let's transform this Dataset to a new one. We will call the `filter` to return a new Dataset with a subset of the items in the file.
    --- End diff --
    
    Just "call `filter`"?


