Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2016/10/08 19:46:20 UTC

[jira] [Commented] (SPARK-10805) JSON Data Frame does not return correct string lengths

    [ https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558595#comment-15558595 ] 

Xiao Li commented on SPARK-10805:
---------------------------------

Finding the max length for each field is pretty expensive, because it means we need to read all the records. When you read the schema, the schema is inferred from the file. Even if we could find the max length, new records could be appended afterwards, which would make the value stale.
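
To make the cost concrete, here is a minimal sketch in plain Java, not the Spark API. The class name MaxFieldLengths, the helper record(), and the sample values are all made up for illustration. It only shows that a per-field maximum needs one pass over every record, and that the answer can change as soon as another record is appended.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MaxFieldLengths {

    // Scans every record once and tracks the longest string value seen per field.
    // No record can be skipped, because any one of them may hold the longest value.
    public static Map<String, Integer> maxLengths(List<Map<String, String>> records) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            for (Map.Entry<String, String> field : record.entrySet()) {
                String value = field.getValue();
                int length = (value == null) ? 0 : value.length();
                // Keep the larger of the stored maximum and this value's length.
                result.merge(field.getKey(), length, Integer::max);
            }
        }
        return result;
    }

    // Hypothetical helper that builds one record with the three fields from the report.
    public static Map<String, String> record(String id, String firstName, String lastName) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("_id", id);
        r.put("firstName", firstName);
        r.put("lastName", lastName);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(record("1", "John", "Johnson")); // made-up sample data
        System.out.println(maxLengths(records));     // {_id=1, firstName=4, lastName=7}

        // Appending a record invalidates the previously computed maximum.
        records.add(record("2", "Alexander", "Li"));
        System.out.println(maxLengths(records));     // {_id=1, firstName=9, lastName=7}
    }
}
```

The second call has to rescan (or the maximum has to be maintained incrementally), which is why a length computed at schema-inference time cannot be trusted later.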

CBO (cost-based optimization) is now being implemented, so this should be resolved as part of that work.

> JSON Data Frame does not return correct string lengths
> ------------------------------------------------------
>
>                 Key: SPARK-10805
>                 URL: https://issues.apache.org/jira/browse/SPARK-10805
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Jeff Li
>            Priority: Minor
>
> Here is the sample code to run the test:
> @Test
> public void runSchemaTest() throws Exception {
>     DataFrame jsonDataFrame = sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>     jsonDataFrame.printSchema();
>     StructType jsonSchema = jsonDataFrame.schema();
>     StructField[] dataFields = jsonSchema.fields();
>     // Print the name, type, and default size of every inferred field.
>     for (int fieldIndex = 0; fieldIndex < dataFields.length; fieldIndex++) {
>         StructField aField = dataFields[fieldIndex];
>         DataType aType = aField.dataType();
>         System.out.println("name: " + aField.name() + " type: " + aType.typeName()
>                 + " size: " + aType.defaultSize());
>     }
> }
> This prints:
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> In my case, the actual lengths are: _id, 1 character; firstName, 4 characters; lastName, 7 characters.
> The Spark JSON DataFrame should have a way to tell the maximum length of each JSON string element in the JSON document.



