Posted to issues@spark.apache.org by "Maciej Szymkiewicz (Jira)" <ji...@apache.org> on 2022/01/21 23:38:00 UTC
[jira] [Comment Edited] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480291#comment-17480291 ]
Maciej Szymkiewicz edited comment on SPARK-37981 at 1/21/22, 11:37 PM:
-----------------------------------------------------------------------
This doesn't seem valid.
{{dropFieldIfAllNull}} is a reader option. For writes, we use {{ignoreNullFields}}.
So your write code should use the appropriate option:
{code}
d3.write.option("ignoreNullFields", "false").json("d3.json")
{code}
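For comparison, a minimal self-contained round trip (a sketch, not from the report above; the tiny two-column schema and the {{/tmp}} paths are made up for illustration) showing that an all-null column survives a write/read cycle only when {{ignoreNullFields}} is disabled on write:

{code:python}
# Sketch: how ignoreNullFields on write affects all-null columns.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("null-cols").getOrCreate()

schema = StructType([
    StructField("a", StringType()),
    StructField("b", StringType()),  # all-null column
])
df = spark.createDataFrame([("x", None), ("y", None)], schema)

# Default (ignoreNullFields=true): null fields are omitted from the JSON,
# so column "b" is gone after reading the output back.
df.write.mode("overwrite").json("/tmp/default.json")
print(spark.read.json("/tmp/default.json").columns)   # ['a']

# With ignoreNullFields=false: nulls are written explicitly, so "b" survives.
df.write.mode("overwrite").option("ignoreNullFields", "false").json("/tmp/keepnulls.json")
print(spark.read.json("/tmp/keepnulls.json").columns)  # ['a', 'b']
{code}

On the read side, {{dropFieldIfAllNull}} stays at its default (false), so the explicitly written nulls are enough to keep the column in the inferred schema.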
was (Author: zero323):
This doesn't seem valid.
{{dropFieldIfAllNull}} is a reader option. For writes, we use {{ignoreNullFields}}.
So your code should be
{code}
d3.write.option("ignoreNullFields", "false").json("d3.json")
{code}
> Deletes columns with all Null as default.
> -----------------------------------------
>
> Key: SPARK-37981
> URL: https://issues.apache.org/jira/browse/SPARK-37981
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Bjørn Jørgensen
> Priority: Major
> Attachments: json_null.json
>
>
> Spark 3.2.1-RC2
> During {{write.json}}, Spark deletes columns that contain only nulls by default.
>
> Spark does have {{dropFieldIfAllNull}} set to false by default, according to https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:python}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>         .set("sc.setLogLevel", "error")
>
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> # output: (653610, 267)
>
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> # output: (653610, 186)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org