Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/07/25 02:47:00 UTC

[jira] [Commented] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

    [ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386801#comment-17386801 ] 

Hyukjin Kwon commented on SPARK-36277:
--------------------------------------

Can you provide a self-contained reproducer?

> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
>
>
> While reading a DataFrame in DROPMALFORMED mode, I am not getting the right record count. dataframe.count() returns the record count of the actual file, including malformed records, even though the DataFrame is read in "dropmalformed" mode. Is there a way to overcome this in PySpark?
>   Here is a high-level overview of what I am doing. I am reading two DataFrames from one file, with and without a predefined schema. The issue is that when I read a DataFrame with a predefined schema and mode "dropmalformed", the record count of the DataFrame does not drop the malformed records: it matches the actual file, whereas I expect a lower count because there are a few malformed records. But when I select and display the records in the DataFrame, the malformed records are not shown, so the display is correct. The output is attached in the attachment.
> Code:
>  
> {code}
> import json
> import boto3
> from pyspark.sql.types import StructType
>
> # Fetch the schema definition JSON from S3 and build a StructType from it
> s3_obj = boto3.client('s3')
> s3_clientobj = s3_obj.get_object(Bucket='xyz', Key='data/test_files/schema_xyz.json')
> s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
> schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))
>
> # Read the same file twice: with the predefined schema in DROPMALFORMED mode,
> # and without a schema in PERMISSIVE mode
> extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", schema=schemaFromJson, mode="DROPMALFORMED")
> extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", mode="PERMISSIVE")
>
> extract_with_schema_df.select("col1", "col2").show()
> cnt1 = extract_with_schema_df.select("col1", "col2").count()
> print("count of the records with schema " + str(cnt1))
> cnt2 = extract_without_schema_df.select("col1", "col2").count()
> print("count of the records without schema " + str(cnt2))
> extract_without_schema_df.select("col1", "col2").show()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org