Posted to issues@spark.apache.org by "Maciej Szymkiewicz (Jira)" <ji...@apache.org> on 2022/01/21 23:38:00 UTC

[jira] [Comment Edited] (SPARK-37981) Deletes columns with all Null as default.

    [ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480291#comment-17480291 ] 

Maciej Szymkiewicz edited comment on SPARK-37981 at 1/21/22, 11:37 PM:
-----------------------------------------------------------------------

This doesn't seem valid.

{{dropFieldIfAllNull}} is a reader option. For writes, we use {{ignoreNullFields}}.

So your write code should use the appropriate option:

{code}
d3.write.option("ignoreNullFields", "false").json("d3.json")
{code}
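
For comparison, {{dropFieldIfAllNull}} applies on the read side. A minimal sketch, assuming a hypothetical directory of JSON files at {{data.json}}:

{code}
# dropFieldIfAllNull is a read option (default false); when true, fields
# that are null (or an empty array/struct) in every sampled record are
# dropped during schema inference.
df = spark.read.option("dropFieldIfAllNull", "true").json("data.json")
{code}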


was (Author: zero323):
This doesn't seem valid.

{{dropFieldIfAllNull}} is a reader option. For writes, we use {{ignoreNullFields}}.

So your code should be 

{code}
d3.write.option("ignoreNullFields", "false").json("d3.json")
{code}

> Deletes columns with all Null as default.
> -----------------------------------------
>
>                 Key: SPARK-37981
>                 URL: https://issues.apache.org/jira/browse/SPARK-37981
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>         Attachments: json_null.json
>
>
> Spark 3.2.1-RC2
> During write.json, Spark deletes columns whose values are all null by default.
>  
> Spark has dropFieldIfAllNull set to false by default, according to https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:python}
> import os
> 
> import pyspark
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> 
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
> 
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000")
> 
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
> 
> spark = get_spark_session("Falk", SparkConf())
> 
> # Helper: (row count, column count), pandas-style.
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> 
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> 
> # Read the original multiline JSON files.
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
> print(d3.shape())
> # (653610, 267)
> 
> # Write the DataFrame back out and read it again.
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
> print(d3.shape())
> # (653610, 186)  <- 81 columns lost in the round trip
> {code}
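> A minimal sketch of the write-side workaround from the comment above, assuming the same {{d3}} as in the repro; {{d3_full.json}} is a hypothetical output path:
> {code:python}
> # ignoreNullFields is a write option for the JSON source (default true);
> # disabling it keeps null-valued fields in the output records, so the
> # schema should survive the write/read round trip.
> d3.write.option("ignoreNullFields", "false").json("d3_full.json")
> d3_back = spark.read.json("d3_full.json/*.json")
> print(d3_back.shape())  # expected: (653610, 267)
> {code}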



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org