Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/06/09 20:00:00 UTC

[jira] [Resolved] (SPARK-27982) In Spark 2.2.1, a filter on a particular column followed by a drop of the same column fails to filter all the records

     [ https://issues.apache.org/jira/browse/SPARK-27982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-27982.
-----------------------------------
    Resolution: Incomplete

Hi, [~karan970].
1. Please use the mailing lists for questions:
- https://spark.apache.org/contributing.html
2. Also, please do not set `Target Versions`.
3. Finally, Spark 2.2 is already EOL.

> In Spark 2.2.1, a filter on a particular column followed by a drop of the same column fails to filter all the records
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27982
>                 URL: https://issues.apache.org/jira/browse/SPARK-27982
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Karan Hebbar K S
>            Priority: Minor
>              Labels: newbie
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> The issue follows from the design of Spark: if a filter is applied on a column and is followed by a drop of the same column, then Spark filters only the first record and then drops the column, because the whole transformation (filter + drop) is applied to each record as it is read, both transformations falling in the same narrow stage.
>  
> This results in only a few records being filtered, with the rest neglected.
>  
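> As an illustrative aside (a minimal sketch, not part of the original report; the local SparkSession and sample rows are hypothetical), one can inspect how Spark plans a filter followed by a drop of the same column:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> spark = SparkSession.builder.master("local[*]").getOrCreate()
> df = spark.createDataFrame([("I", 1), ("U", 2), ("I", 3)], ["op", "key1"])
>
> # filter and drop are both narrow transformations; drop is planned as a
> # projection, so the predicate on "op" is evaluated before the column is
> # projected away
> df.filter(col("op") == "I").drop("op").explain()
>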
> Here is the sample code:
>
> from pyspark.sql.functions import col
>
> # "inserts" and "Path" are defined earlier in the job (not shown here)
> inserts_filtered = inserts.toDF().filter(col("op") == 'I')
> inserts_without_column_op = inserts_filtered.drop('op')
> inserts_without_column_op.repartition("partition_keys") \
>     .write.partitionBy("partition_keys") \
>     .mode("append") \
>     .parquet(Path)
>  
> The above lines of code write only one record where the column 'op' has the value 'I', neglecting the other records whose 'op' value is also 'I', because the column was dropped once the first record had been filtered.
>  
>  
> Below are the sample records, in CSV, that we are trying to convert to Parquet with partition keys:
>  
> Op,key1,key2,created_at,updated_at,name
> I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
> I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
> I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1
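>
> For reference, here is a self-contained reproduction sketch (not part of the original report; the file name "sample.csv", the partition column "key1", and the output path are hypothetical stand-ins), reading the sample rows above and applying the same filter-then-drop pipeline:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> spark = SparkSession.builder.master("local[*]").getOrCreate()
>
> # "sample.csv" holds the header and the three rows shown above
> inserts = spark.read.option("header", True).csv("sample.csv")
>
> # the column is named "Op" in the CSV header
> filtered = inserts.filter(col("Op") == "I")
> without_op = filtered.drop("Op")
>
> without_op.repartition("key1") \
>     .write.partitionBy("key1") \
>     .mode("append") \
>     .parquet("/tmp/spark27982_out")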



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org