You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/03 07:22:34 UTC

[GitHub] [spark] itholic opened a new pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

itholic opened a new pull request #33907:
URL: https://github.com/apache/spark/pull/33907


   ### What changes were proposed in this pull request?
   
   This PR proposes adding `thousands` argument to `ps.read_csv`.
   
   csv_file:
   ```csv
   name;age;job;money
   Jorge;30;Developer;10,000,000
   Bob;32;Developer;-2,000,000
   John;29;Developer;
   ```
   
   before:
   ```python
   >>> ps.read_csv(path, sep=";").money
   0    10,000,000
   1    -2,000,000
   2          None
   Name: money, dtype: object
   ```
   
   after:
   ```python
   >>> ps.read_csv(path, sep=";", thousands=",").money
   0    10000000.0
   1    -2000000.0
   2           NaN
   Name: money, dtype: float64
   ```
   
   ### Why are the changes needed?
   
   We should support the API same as pandas, as much as possible.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, now the `thousands` argument is supported for `ps.read_csv`.
   
   ### How was this patch tested?
   
   Unittest


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702568290



##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
         index_spark_column_names = []
         index_names = []
 
+    data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+    if thousands is not None:

Review comment:
       I think seems like pandas just simply replace the string specified by `thousands` parameter to empty string, if the column is possible to cast to numeric type, regardless of locale standard ??
   
   For example, given csv file:
   
   ```csv
   name;age;job;money
   Jorge;30;Developer;10000,,,,,00,0,0,0,0
   Bob;32;Developer;-1234,424,142424,0,,,,,,,
   ```
   
   with pandas `read_csv`:
   
   ```python
   >>> pd.read_csv(path, sep=";", thousands=",")
       name  age        job           money
   0  Jorge   30  Developer     10000000000
   1    Bob   32  Developer -12344241424240
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702591649



##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
         index_spark_column_names = []
         index_names = []
 
+    data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+    if thousands is not None:
+        if not isinstance(thousands, str) or len(thousands) != 1:
+            raise ValueError("Only length-1 comment characters supported")
+        new_data_spark_columns = list()
+        for scol in data_spark_columns:
+            scol_replaced = F.regexp_replace(scol, thousands, "")

Review comment:
       This isn't really correct. It assumes that the string representation of other types are same as the raw data but that's not the true.
   
   For example, it wouldn't work if you set `timestampFormat` to something like `yyyy-MM` and set `thousands` to `-`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356054


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47470/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350008


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142970/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350008


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142970/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356054


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47470/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912321545


   **[Test build #142970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
itholic commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-913354874


   Let me close this for now, since current implementation way is too hacky.
   
   Seems like we should first handle around the CSVInferSchema and revisit later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912350164


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47470/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912388470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47471/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912353119


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47471/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912321545


   **[Test build #142970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702591184



##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
         index_spark_column_names = []
         index_names = []
 
+    data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+    if thousands is not None:
+        if not isinstance(thousands, str) or len(thousands) != 1:
+            raise ValueError("Only length-1 comment characters supported")
+        new_data_spark_columns = list()
+        for scol in data_spark_columns:
+            scol_replaced = F.regexp_replace(scol, thousands, "")
+            cond = F.when(scol_replaced.isNull(), np.nan).otherwise(scol_replaced.cast("float"))
+            if len(sdf.select(cond).filter(cond.isNull()).head(1)) == 0:

Review comment:
       can we avoid launching jobs for every column?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic closed pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
itholic closed pull request #33907:
URL: https://github.com/apache/spark/pull/33907


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912388470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47471/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702555143



##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
         index_spark_column_names = []
         index_names = []
 
+    data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+    if thousands is not None:

Review comment:
       I think you can set `locale` option in CSV. Should probably find one standard that adds `,` in numbers. e.g.) see https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912339644


   **[Test build #142970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142970/testReport)** for PR 33907 at commit [`23520b2`](https://github.com/apache/spark/commit/23520b2080c9376c9e70f88ac4ac25efd7d53b2a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #33907:
URL: https://github.com/apache/spark/pull/33907#discussion_r702568290



##########
File path: python/pyspark/pandas/namespace.py
##########
@@ -407,6 +410,20 @@ def read_csv(
         index_spark_column_names = []
         index_names = []
 
+    data_spark_columns = [scol_for(sdf, col) for col in column_labels.values()]
+    if thousands is not None:

Review comment:
       I think seems like pandas just simply replace the string specified by `thousands` parameter to empty string, if the column is possible to cast to numeric type, regardless of locale standard ??
   
   For example,
   
   ```csv
   name;age;job;money
   Jorge;30;Developer;10000,,00,0,0,0,0
   Bob;32;Developer;-1234,424,142424,0
   ```
   
   ```python
   >>> pd.read_csv(path, sep=";", thousands=",")
       name  age        job           money
   0  Jorge   30  Developer     10000000000
   1    Bob   32  Developer -12344241424240
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912379933


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47471/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #33907: [SPARK-36610][PYTHON] Add `thousands` argument to `ps.read_csv`.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33907:
URL: https://github.com/apache/spark/pull/33907#issuecomment-912356012


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47470/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org