Posted to reviews@spark.apache.org by jomach <gi...@git.apache.org> on 2017/10/12 16:05:52 UTC

[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...

GitHub user jomach opened a pull request:

    https://github.com/apache/spark/pull/19485

    [SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames Fix

    ## What changes were proposed in this pull request?
    
    Small rendering fix
    
    ## How was this patch tested?
     Reviewers


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jomach/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19485.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19485
    
----
commit f5941bf196a36afe8715d713fcaaf3f1a136d9e8
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-04T13:09:16Z

    SPARK-20055 Documentation
     -Added documentation for loading csv files into Dataframes

commit 812bdf7a44ed2e52c7012921814da6bb73d0033c
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-04T12:58:44Z

    SPARK-20055 Documentation
     - Some examples on how to create a dataframe with a csv file
    
    (cherry picked from commit e8ca1dc)

commit 4e4a02ba271bfb9811d31cd1909c942be4322682
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-04T13:09:16Z

    SPARK-20055 Documentation
     -Added documentation for loading csv files into Dataframes

commit a2ec38a7b86b9cf89f7f4b9cf6368b9864ef10c2
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-05T08:27:20Z

    SPARK-20055 Documentation
     - Some examples on how to create a dataframe with a csv file
    
    (cherry picked from commit a546421)

commit 793628bbedcc50c0845a3fd999d2720e2c63ea1d
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-05T08:32:15Z

    Merge remote-tracking branch 'origin/master'
    
    # Conflicts:
    #	docs/sql-programming-guide.md
    #	examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
    #	examples/src/main/r/RSparkSQLExample.R

commit cd69fa240d453a7b8344796349a2bf03a20ffbfc
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-10T05:52:37Z

    SPARK-20055 Documentation
     - Some examples on how to create a dataframe with a csv file

commit 68799ede999ec1874c80d242441032cd29a2f695
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-11T07:29:33Z

    SPARK-20055 Documentation
     - PR comments

commit 7ff1d84779acc50ab3c63d9bc0651ac53193f555
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-11T08:09:49Z

    SPARK-20055 Documentation
     - PR comments

commit 07d73fcac85529fa17e34b170f2941f0f579fe00
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-12T15:12:35Z

    Merge branch 'upstream/masterlocal'

commit 73b1d7aed4c0fd740d5fbdde569d6b3ff3b86271
Author: Jorge Machado <jo...@hotmail.com>
Date:   2017-10-12T16:03:51Z

    SPARK-20055 Documentation
     - PR comments

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Oh, @jomach, I should have been clearer. I actually left it so that a follow-up addressing https://github.com/apache/spark/pull/19429#issuecomment-335732059 could fix this newline issue as well. Would you be willing to address that comment here too?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @gatorsmile and @liancheng, could it be an option to leave a link in the API doc back to the new page for the options, and remove the option list from the API doc?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @gatorsmile, sure, a detailed doc is great and I definitely support it.
    
    The one thing I am worried about is duplication. If we add or change an option, we have to update both places together and... you know how it goes.
    
    Wouldn't it be nicer if we simply left a pointer and removed the duplication if possible? If I understood correctly, the options would also be described in more detail in the new chapter in the future, and I think simply redirecting to it might be feasible.
    
    I guess it shouldn't be too difficult to make a sub-chapter for options only, for example, like http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
    
    Otherwise, do you think there should be different contents for a different purpose, or do you want to leave the duplication just for now as something to be fixed soon? If so, I am okay with that.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Appreciate it. Thanks! 


---



[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19485#discussion_r145055873
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -462,7 +462,6 @@ names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames lo
     source type can be converted into other types using this syntax.
     
     To load a JSON file you can use:
    --- End diff --
    
    I guess we should revert this one too then.


---



[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...

Posted by jomach <gi...@git.apache.org>.
Github user jomach closed the pull request at:

    https://github.com/apache/spark/pull/19485


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Yes, I will do it. Give me a few days, please.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Sure, I'll be working on this over the weekend. Thanks!


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Less duplication is good, but could we have contents similar to http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets? It looks like the examples are quite different.
    
    Also, to my knowledge, we can shorten the link to, for example, `api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame` (not tested).
    
    You could check the HTML by following https://github.com/apache/spark/tree/master/docs#prerequisites. Adding a new chapter is actually not quite trivial, IMHO. Let's put our efforts together here.
    



---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    OK, so I will:
      - Create a new section for csv-datasets.
      - Add more example options to the code from JavaSQLDataSourceExample.java (and the .scala, .py and .r equivalents).
      - Refer to the links from the API.
    
    The effect will be that we will not see all the options on the .md page, and people will need to jump to the API. Do you agree with this?
    
    It would be cool if, from jekyllrb, we could create something like an iframe and pull the options from the Scala API... Any ideas?
    
    Please let me know if it is OK to proceed this way.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @jomach and @HyukjinKwon 
    
    I did not generate the doc. I think we should follow what we did for JDBC: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
    
    That is, list all the public options for each built-in data source. Thus, it makes sense to add a new chapter for CSV.



---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @gatorsmile WDYT?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Thanks for taking a look at this one. Actually, I thought we should add a chapter like http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
    
    And add a link to, for example, https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv for Python, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame for Scala and http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv-scala.collection.Seq- for Java to refer to the options, rather than duplicating the option list (which we would have to update in both places whenever we fix or add options).
    
    Probably, we should add some links to the JSON ones too.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    I meant adding a new chapter describing the options, removing the duplication, for example here
    https://github.com/apache/spark/blob/73d80ec49713605d6a589e688020f0fc2d6feab2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L513
    and then leaving a link to the new chapter instead.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    My only worry is duplication: we would have another place to update the doc for options. The rest sounds okay to me too.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    This is the API link you refer to: `https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame`
    
    I just quickly scanned them. The option descriptions are pretty rough. They are written for advanced developers who read the API docs and play with them. In the long term, we should follow what the mainstream RDBMS reference manuals do. Something like
    - https://dev.mysql.com/doc/refman/5.5/en/creating-tables.html
    - https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtable.html
    - https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN01503
    
    I prefer having something more human-friendly. The whole SQL doc needs a complete re-org. cc @jiangxb1987 Maybe you are the right person to take it on.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Thanks for the explanation. I guess there will be a big doc change soon? I will check those changes too.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Yes, I'm viewing the docs with Jekyll. I addressed that in my previous comment. I really don't think we should make an example as huge as the JSON one. It's a CSV...
    
    What do you think?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @HyukjinKwon Here is the entry, as the other one is closed / merged.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Just checked it with @liancheng. We both think creating a separate page sounds good.
    
    Also cc @rxin  


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @HyukjinKwon I came up with this. What do you think? What I don't like about it is that I did not find any way to read Javadocs into the markdown so that we don't have duplicates. Any ideas, or should we leave it as it is in this PR?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @gatorsmile: we will have a lot of duplication.
    
    Is that fine? I will create a completely new page, like the SQL programming guide, name it "Data sources guide", and add all the data sources with all their options (duplicating information from the API in the docs). Is that OK for everyone?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    The reference manual and the API docs are different things. Below is a link for DB2 LUW:
    http://www-01.ibm.com/support/docview.wss?uid=swg27038855


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Sure, please take your time.


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    So I removed the duplicated content and added the links. I deliberately did not add more examples, as the document is getting huge and it is hard to find things in it. What do you think?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @HyukjinKwon I did not understand what your suggestion is.
    
    @jomach Any reason you closed this PR? Or do you plan to open a new one?


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    @gatorsmile will do


---



[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19485
  
    Yup, I think that's what I initially intended in the JIRA. Not sure about the iframe idea, for now. I'd just keep it simple, with links.


---
