Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/03 07:22:34 UTC
[GitHub] [spark] itholic opened a new pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
itholic opened a new pull request #33907:
URL: https://github.com/apache/spark/pull/33907
### What changes were proposed in this pull request?
This PR proposes adding `thousands` argument to `ps.read_csv`.
csv_file:
```csv
name;age;job;money
Jorge;30;Developer;10,000,000
Bob;32;Developer;-2,000,000
John;29;Developer;
```
before:
```python
>>> ps.read_csv(path, sep=";").money
0 10,000,000
1 -2,000,000
2 None
Name: money, dtype: object
```
after:
```python
>>> ps.read_csv(path, sep=";", thousands=",").money
0 10000000.0
1 -2000000.0
2 NaN
Name: money, dtype: float64
```
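Until the `thousands` argument lands, a similar result can be obtained by stripping the separator and casting manually. A minimal sketch of that workaround in plain Python (the helper name `parse_thousands` is illustrative, not code from this PR; with pandas-on-Spark the equivalent chain would be roughly `ps.read_csv(path, sep=";").money.str.replace(",", "").astype(float)`):

```python
def parse_thousands(value, sep=","):
    """Parse a numeric string that may use `sep` as a grouping separator."""
    if value is None or value == "":
        return float("nan")
    return float(value.replace(sep, ""))

print(parse_thousands("10,000,000"))   # 10000000.0
print(parse_thousands("-2,000,000"))   # -2000000.0
```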
### Why are the changes needed?
We should match the pandas API as closely as possible.
### Does this PR introduce _any_ user-facing change?
Yes, now the `thousands` argument is supported for `ps.read_csv`.
### How was this patch tested?
Unit tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] itholic commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702568290
##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
index_spark_column_names = []
index_names = []
+ data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+ if thousands is not None:
Review comment:
It seems that pandas simply replaces the string specified by the `thousands` parameter with an empty string whenever the column can be cast to a numeric type, regardless of any locale standard.
For example, given csv file:
```csv
name;age;job;money
Jorge;30;Developer;10000,,,,,00,0,0,0,0
Bob;32;Developer;-1234,424,142424,0,,,,,,,
```
with pandas `read_csv`:
```python
>>> pd.read_csv(path, sep=";", thousands=",")
name age job money
0 Jorge 30 Developer 10000000000
1 Bob 32 Developer -12344241424240
```
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702591649
##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
index_spark_column_names = []
index_names = []
+ data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+ if thousands is not None:
+ if not isinstance(thousands, str) or len(thousands) != 1:
+ raise ValueError("Only length-1 comment characters supported")
+ new_data_spark_columns = list()
+ for scol in data_spark_columns:
+ scol_replaced = F.regexp_replace(scol, thousands, "")
Review comment:
This isn't really correct. It assumes that the string representation of other types is the same as the raw data, but that's not true.
For example, it wouldn't work if you set `timestampFormat` to something like `yyyy-MM` and set `thousands` to `-`.
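An illustrative sketch of the failure mode described above (hypothetical data, not code from the PR): blindly replacing the `thousands` character in every column also mangles non-numeric columns whose text contains that character, such as a `yyyy-MM` timestamp column when `thousands` is `-`.

```python
import re

thousands = "-"
# One CSV row: a yyyy-MM timestamp column and a numeric column.
row = {"month": "2021-09", "money": "2-000-000"}

# Blanket replacement over every column, as the reviewed code does.
cleaned = {k: re.sub(re.escape(thousands), "", v) for k, v in row.items()}
print(cleaned["money"])  # "2000000" -- intended
print(cleaned["month"])  # "202109"  -- the timestamp column is corrupted
```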
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356054
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47470/
[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350008
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142970/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350008
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142970/
[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356054
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47470/
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912321545
**[Test build #142970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).
[GitHub] [spark] itholic commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
itholic commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-913354874
Let me close this for now, since the current implementation is too hacky.
It seems we should first address this around CSVInferSchema and revisit later.
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350164
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47470/
[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912388470
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47471/
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912353119
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47471/
[GitHub] [spark] SparkQA removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912321545
**[Test build #142970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702591184
##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
index_spark_column_names = []
index_names = []
+ data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+ if thousands is not None:
+ if not isinstance(thousands, str) or len(thousands) != 1:
+ raise ValueError("Only length-1 comment characters supported")
+ new_data_spark_columns = list()
+ for scol in data_spark_columns:
+ scol_replaced = F.regexp_replace(scol, thousands, "")
+ cond = F.when(scol_replaced.isNull(), np.nan).otherwise(scol_replaced.cast("float"))
+ if len(sdf.select(cond).filter(cond.isNull()).head(1)) == 0:
Review comment:
Can we avoid launching a Spark job for every column?
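The per-column jobs could be collapsed into a single pass. A sketch of the idea in plain Python (the function, column names, and data are illustrative; in PySpark this would correspond to collecting all the per-column aggregates in one `sdf.select(...)` followed by a single `.head()` action rather than one filter-and-head per column):

```python
def castable_flags(rows, columns, thousands=","):
    """Single pass over the data: for each column, report whether every
    non-null value is castable to float after removing the separator."""
    flags = {c: True for c in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None:
                continue
            try:
                float(value.replace(thousands, ""))
            except ValueError:
                flags[col] = False
    return flags

rows = [
    {"name": "Jorge", "money": "10,000,000"},
    {"name": "Bob", "money": "-2,000,000"},
]
print(castable_flags(rows, ["name", "money"]))
# {'name': False, 'money': True}
```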
[GitHub] [spark] itholic closed pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
itholic closed pull request #33907:
URL: https://github.com/apache/spark/pull/33907
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912388470
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47471/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702555143
##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
index_spark_column_names = []
index_names = []
+ data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+ if thousands is not None:
Review comment:
I think you can set the `locale` option in CSV. We should probably find one locale standard that uses `,` as the grouping separator in numbers, e.g. see https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912339644
**[Test build #142970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912379933
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47471/
[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356012
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47470/