Posted to reviews@spark.apache.org by jomach <gi...@git.apache.org> on 2017/10/12 16:05:52 UTC
[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...
GitHub user jomach opened a pull request:
https://github.com/apache/spark/pull/19485
[SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames Fix
## What changes were proposed in this pull request?
Small rendering fix
## How was this patch tested?
Reviewers
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jomach/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19485.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19485
----
commit f5941bf196a36afe8715d713fcaaf3f1a136d9e8
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-04T13:09:16Z
SPARK-20055 Documentation
-Added documentation for loading csv files into Dataframes
commit 812bdf7a44ed2e52c7012921814da6bb73d0033c
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-04T12:58:44Z
SPARK-20055 Documentation
- Some examples on how to create a dataframe with a csv file
(cherry picked from commit e8ca1dc)
commit 4e4a02ba271bfb9811d31cd1909c942be4322682
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-04T13:09:16Z
SPARK-20055 Documentation
-Added documentation for loading csv files into Dataframes
commit a2ec38a7b86b9cf89f7f4b9cf6368b9864ef10c2
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-05T08:27:20Z
SPARK-20055 Documentation
- Some examples on how to create a dataframe with a csv file
(cherry picked from commit a546421)
commit 793628bbedcc50c0845a3fd999d2720e2c63ea1d
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-05T08:32:15Z
Merge remote-tracking branch 'origin/master'
# Conflicts:
# docs/sql-programming-guide.md
# examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
# examples/src/main/r/RSparkSQLExample.R
commit cd69fa240d453a7b8344796349a2bf03a20ffbfc
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-10T05:52:37Z
SPARK-20055 Documentation
- Some examples on how to create a dataframe with a csv file
commit 68799ede999ec1874c80d242441032cd29a2f695
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-11T07:29:33Z
SPARK-20055 Documentation
- PR comments
commit 7ff1d84779acc50ab3c63d9bc0651ac53193f555
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-11T08:09:49Z
SPARK-20055 Documentation
- PR comments
commit 07d73fcac85529fa17e34b170f2941f0f579fe00
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-12T15:12:35Z
Merge branch 'upstream/masterlocal'
commit 73b1d7aed4c0fd740d5fbdde569d6b3ff3b86271
Author: Jorge Machado <jo...@hotmail.com>
Date: 2017-10-12T16:03:51Z
SPARK-20055 Documentation
- PR comments
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Oh, @jomach, I should have been clearer. I actually left it so that a follow-up addressing https://github.com/apache/spark/pull/19429#issuecomment-335732059 could fix this newline issue together. Would you be willing to address that comment here too?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Could it be an option to leave a link in the API doc back to the new page for the options, and remove the option list from the API doc, @gatorsmile and @liancheng?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
@gatorsmile, sure, a detailed doc is great and I definitely support it.
The one thing I am worried about is duplication. If we add or change an option, we have to update both places together, and you know how that goes.
Wouldn't it be nicer if we simply leave a pointer and remove the duplication if possible? If I understood correctly, the options would also be described in more detail in the new chapter in the future, and I think simply redirecting to it might be feasible.
I guess it shouldn't be too difficult to make a sub-chapter for options only, for example, like http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
Otherwise, do you think there should be different contents for a different purpose, or do you want to leave the duplication just for now as something to be fixed soon? If so, I am okay.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
Appreciate it. Thanks!
---
[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19485#discussion_r145055873
--- Diff: docs/sql-programming-guide.md ---
@@ -462,7 +462,6 @@ names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames lo
source type can be converted into other types using this syntax.
To load a JSON file you can use:
--- End diff --
I guess we should revert this one too then.
---
[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...
Posted by jomach <gi...@git.apache.org>.
Github user jomach closed the pull request at:
https://github.com/apache/spark/pull/19485
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
Yes, I will do it. Give me a few days, please.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/19485
Sure, I'll be working on this over the weekend. Thanks!
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Less duplication is good, but could we have contents similar to http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets? It looks like the examples are quite different.
Also, to my knowledge, we can shorten the link to, for example, `api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame` (not tested).
You could check the HTML by following https://github.com/apache/spark/tree/master/docs#prerequisites. Adding a new chapter is actually not quite trivial, IMHO. Let's put our efforts here together.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
OK, so I will:
- Create a new section for csv-datasets
- Add more example options to the code from JavaSQLDataSourceExample.java (.scala, .py and .r)
- Reference the links from the API.
This will have the effect that we will not see all the options on the .md page, and people will need to jump into the API. Do you agree with this?
It would be cool if we could use jekyllrb to create something like an iframe and get the options from the Scala API. Any ideas?
Please let me know if it is OK to proceed this way.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
@jomach and @HyukjinKwon
I did not generate the doc. I think we should follow what we did for JDBC. http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
We should list all the public options for each built-in data source. Thus, it makes sense to add a new chapter for CSV.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
@gatorsmile WDYT?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Thanks for taking a look at this one. Actually, I thought we should add a chapter like http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
And add a link to, for example, https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv for Python, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame for Scala and http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv-scala.collection.Seq- for Java to refer to the options, rather than duplicating the option list (which we would have to update in both places whenever we fix or add options).
Probably, we should add some links to JSON ones too.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
I meant adding a new chapter describing options, removing duplication, for example here
https://github.com/apache/spark/blob/73d80ec49713605d6a589e688020f0fc2d6feab2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L513
and then leaving a link to the new chapter instead.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
My only worry is duplication: we would have another place to update the doc for options. The rest sounds okay to me too.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
This is the API link you refer to: `https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame`
I just quickly scanned them. The option descriptions are pretty rough. They are made for advanced devs who read the API docs and play with them. In the long term, we should follow what the mainstream RDBMS reference manuals do. Something like
- https://dev.mysql.com/doc/refman/5.5/en/creating-tables.html
- https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtable.html
- https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN01503
I prefer having something more human-friendly. The whole SQL doc needs a complete re-org. cc @jiangxb1987 Maybe you are the right person to take it.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Thanks for the explanation. I guess there will be a big doc change soon? I will check those changes too.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
Yes, I'm viewing the docs with Jekyll. I addressed that in my previous comment. I really don't think we should make a huge example as the JSON one does. It's a csv ...
What do you think?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
@HyukjinKwon Here is the enter as the other is closed / merged
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19485
Can one of the admins verify this patch?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
Just checked it with @liancheng. We both think creating a separate page sounds good.
Also cc @rxin
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
@HyukjinKwon I came up with this. What do you think? What I don't like about it is that I did not find any way to read the Javadocs into the markdown so that we don't have duplicates. Any ideas, or should we leave it as it is in this PR?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
@gatorsmile: we will have a lot of duplication.
Is that fine? I will create a completely new page, like the SQL programming guide, name it "Data sources guide", and add all the data sources with all the options (duplicating information from the API into the docs). Is that OK for everyone?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
The reference manual and API docs are different. Below is a link of DB2 LUW:
http://www-01.ibm.com/support/docview.wss?uid=swg27038855
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Sure, please take your time.
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
So I removed the duplicated stuff and added the links. I purposely did not add more examples, as the document is getting huge and it is hard to find things in it. What do you think?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19485
@HyukjinKwon I did not understand what your suggestion is.
@jomach Any reason you closed this PR? Or do you plan to open a new one?
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by jomach <gi...@git.apache.org>.
Github user jomach commented on the issue:
https://github.com/apache/spark/pull/19485
@gatorsmile will do
---
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19485
Yup, I think that's what I initially intended in the JIRA. Not sure about the iframe idea, for now. I'd just keep it simple, with links.
---