Posted to issues@spark.apache.org by "Moshe Israel (JIRA)" <ji...@apache.org> on 2018/06/11 11:37:00 UTC

[jira] [Created] (SPARK-24517) Bug in loading unstructured data

Moshe Israel created SPARK-24517:
------------------------------------

             Summary: Bug in loading unstructured data
                 Key: SPARK-24517
                 URL: https://issues.apache.org/jira/browse/SPARK-24517
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.0
            Reporter: Moshe Israel


When spark.read loads an unstructured data set into a Spark dataframe, properties that are missing from a source record end up with the wrong value. I found the issue while loading data from Azure Cosmos DB, which is based on JSON files, but the issue might be relevant to other providers too.

I'll explain through an example. Let's assume we have a dataset of users with *20* JSON files that have the properties {name, age, isMale} and *40* more JSON files that have only the properties {name, age}. Loading the data creates a dataframe with *60* rows and three columns: {name, age, isMale}.
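The expected shape of that dataframe can be sketched in plain Python, without Spark. The helper to_rows and the sample documents below are hypothetical; the sketch only illustrates the intended semantics, where a property missing from a source document becomes None (standing in for null) in the corresponding row.

```python
import json

# Hypothetical sample data mirroring the example:
# 20 documents with isMale, 40 documents without it.
docs = [json.dumps({"name": f"user{i}", "age": 30, "isMale": True}) for i in range(20)]
docs += [json.dumps({"name": f"user{i}", "age": 30}) for i in range(20, 60)]

def to_rows(json_docs):
    """Parse JSON documents and align them to the union of all observed keys."""
    parsed = [json.loads(d) for d in json_docs]
    columns = []
    for record in parsed:
        for key in record:
            if key not in columns:
                columns.append(key)
    # dict.get returns None for an absent key, which models a null cell.
    rows = [{c: record.get(c) for c in columns} for record in parsed]
    return columns, rows

columns, rows = to_rows(docs)
missing = sum(1 for r in rows if r["isMale"] is None)
print(columns, len(rows), missing)
```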

Querying *df.filter(col("isMale").isNull())* returns 0 rows; expected 40 rows (the rows whose source files lack the property). It looks like the cell has no content, rather than a null, when the source row does not have the property.

Querying *df.where(df.isMale == True)* returns 60 rows (let's assume all users that have the property are male). That is, the result also includes the rows whose source does not have the property.
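For reference, the behaviour this report expects follows from SQL-style null semantics: comparing null with a value yields null, and a filter keeps only the rows whose predicate is true. Below is a minimal plain-Python sketch of that rule. The helpers sql_eq and sql_filter are hypothetical and are not Spark APIs; None stands in for null, and the sample rows are assumed to match the example above.

```python
def sql_eq(left, right):
    """SQL-style equality: the result is unknown (None) if either side is null."""
    if left is None or right is None:
        return None
    return left == right

def sql_filter(rows, predicate):
    """Keep only rows whose predicate is exactly True; unknown (None) is dropped."""
    return [row for row in rows if predicate(row) is True]

# Hypothetical data: 20 rows with isMale=True, 40 rows where it is null.
rows = [{"name": f"user{i}", "age": 30, "isMale": True} for i in range(20)]
rows += [{"name": f"user{i}", "age": 30, "isMale": None} for i in range(20, 60)]

# Models col("isMale").isNull()
null_rows = sql_filter(rows, lambda r: r["isMale"] is None)
# Models df.isMale == True
male_rows = sql_filter(rows, lambda r: sql_eq(r["isMale"], True))

print(len(null_rows), len(male_rows))  # prints: 40 20
```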



