Posted to user@spark.apache.org by Sam Elamin <hu...@gmail.com> on 2017/02/04 13:46:00 UTC

specifying schema on dataframe

Hi All

I would like to specify a schema when reading from a JSON file, but when I
try to map a number to a Double it fails. I tried FloatType and IntegerType
with no joy!


When the schema is inferred, customerid is set to String, and I would like to
cast it to Double.

so df1 comes back corrupted while df2 shows the data as expected.


Also, FYI, I need this to be generic as I would like to apply it to any JSON;
I specified the schema below as an example of the issue I am facing.

import org.apache.spark.sql.types.{BinaryType, StringType,
StructField, DoubleType, FloatType, StructType, LongType, DecimalType}

// Explicit schema: force customerid to DoubleType
val testSchema = StructType(Array(StructField("customerid", DoubleType)))
// df1 applies the explicit schema to a quoted number
val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
// df2 lets Spark infer the schema instead
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)
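
For reference, the likely output of the two show() calls (with Spark's
default permissive JSON parsing, a value that cannot be parsed under the
supplied schema comes back as null):

df1.show(1)
+----------+
|customerid|
+----------+
|      null|
+----------+

df2.show(1)
+----------+
|customerid|
+----------+
|    535137|
+----------+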


Any help would be appreciated. I am sure I am missing something obvious,
but for the life of me I can't tell what it is!


Kind Regards
Sam

Re: specifying schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
For anyone interested, here is how I resolved this issue.

I realised there were a lot of issues with trying to programmatically work
out the intended schema from a JSON string; so much could potentially go wrong.

Instead I'm saving the original data in Avro format, which can handle 64-bit
integers, then reading it back with the Avro package from Databricks, which
allows me to keep the original schema intact.
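
A minimal sketch of that approach, assuming the Databricks spark-avro
package is on the classpath (e.g. spark-shell --packages
com.databricks:spark-avro_2.11:3.2.0) and using a hypothetical output path:

// Write the raw data out as Avro; Avro's schema keeps 64-bit integers intact
df.write.format("com.databricks.spark.avro").save("/tmp/customers.avro")

// Reading it back restores the original schema with no inference step
val restored = spark.read.format("com.databricks.spark.avro").load("/tmp/customers.avro")
restored.printSchema()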

All the best
Sam


Re: specifying schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Hi Folks

I need some help here. This is what I have so far; it works fine for a flat
schema, but when I have a nested StructType I need to iterate over it and
use withColumn on the nested fields.


My question is: how do I replace the incorrect column? Even if I somehow
call the method recursively and it returns a dataframe, how do I then join it
with the parent dataframe?

Basically all I want to do is replace the nested column (which is of
StructType) with a new column which has the correct values (see the sketch
after the code below).

Any ideas?

def convertDataframe(originalDataFrame: DataFrame, schema: StructType, parentName: String = null): DataFrame = {
  var df = originalDataFrame
  for (column: StructField <- schema.fields) {
    var columnName = column.name
    if (column.dataType.isInstanceOf[StructType]) {
      logger.warn(s"nested record set is $columnName")

      // replace nested column here?
    } else {
      // qualify the name so nested fields resolve against their parent struct
      if (parentName != null) {
        columnName = s"$parentName.$columnName"
      }
      df = df.withColumn(columnName, df(columnName).cast(column.dataType))
    }
  }
  df
}
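
One way to handle the nested case, sketched under the assumption that the
struct is only one level deep: rebuild the whole struct column with
org.apache.spark.sql.functions.struct, casting each child field, then
overwrite the top-level column in a single withColumn call (withColumn
cannot replace a field inside a struct directly). castStructColumn is a
hypothetical helper name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: cast every child of a one-level-deep struct column to
// its target type, then replace the struct column wholesale.
def castStructColumn(df: DataFrame, name: String, target: StructType): DataFrame = {
  val children = target.fields.map { f =>
    col(s"$name.${f.name}").cast(f.dataType).as(f.name)
  }
  df.withColumn(name, struct(children: _*))
}

For deeper nesting the same idea applies recursively: when a child field is
itself a StructType, build its struct expression first and wrap it in the
parent.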



Re: specifying schema on dataframe

Posted by ayan guha <gu...@gmail.com>.
Yes it is :)

Do not forget to post back if you arrive at the solution... it may save
someone's day :)



-- 
Best Regards,
Ayan Guha

Re: specifying schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Ah ok


Thanks for clearing it up, Ayan! I will give that a go.



Thank you all for your help, this mailing list is awesome!


Re: specifying schema on dataframe

Posted by ayan guha <gu...@gmail.com>.
If I am not missing anything here, "So I know which columns are numeric and
which aren't because I have a StructType and all the internal StructFields
will tell me which ones have a DataType which is numeric and which aren't"
will lead to a list of fields which should be numeric.

Essentially, you will create a list of numeric fields from your "should-be"
struct type. Then you will load your raw data using the built-in json reader.
At this point, your data has the wrong schema. Now you will need to correct
it. How? You will loop over the list of numeric fields (or you can do it
directly on the struct type) and try to match the type. If you find a
mismatch, you add a withColumn clause to cast to the correct data type
(from your "should-be" struct).
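
A minimal Scala sketch of that loop, assuming the "should-be" schema is
available as a StructType named shouldBeSchema and the raw data was loaded
into a var named df (both hypothetical names):

import org.apache.spark.sql.types.NumericType

// Derive the numeric field list from the should-be schema, not a hardcoded list
val numericFields = shouldBeSchema.fields.filter(_.dataType.isInstanceOf[NumericType])

// Cast each such column to the exact type the should-be schema demands
numericFields.foreach { field =>
  df = df.withColumn(field.name, df(field.name).cast(field.dataType))
}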

HTH?

Best
Ayan

On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hu...@gmail.com> wrote:

> Yup sorry, I should have explained myself better.
>
> So I know which columns are numeric and which aren't, because I have a
> StructType and all the internal StructFields will tell me which ones have a
> DataType which is numeric and which aren't.
>
> So assuming I have a JSON string which has double quotes on numbers when
> it shouldn't, and I have the correct schema in a struct type,
>
>
> how can I iterate over them to programmatically create the new dataframe in
> the correct format?
>
> Do I iterate over the columns in the StructType? Or iterate over the
> columns in the dataframe and try to match them with the StructType?
>
> I hope I cleared things up. What I wouldn't do for a drawing board right
> now!
>
>
> On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <gu...@gmail.com> wrote:
>
>> Umm... I think the premise is you need to "know" beforehand which columns
>> are numeric... Unless you know it, how would you apply the schema?
>>
>> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>>> Thanks Ayan, but I meant how to derive the list automatically.
>>>
>>> In your example you are specifying the numeric columns, and I would like
>>> it to be applied to any schema, if that makes sense.
>>> On Mon, 6 Feb 2017 at 08:49, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> SImple (pyspark) example:
>>>>
>>>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>>> >>> df.printSchema()
>>>> root
>>>>  |-- customerid: string (nullable = true)
>>>>  |-- foo: string (nullable = true)
>>>>
>>>> >>> numeric_field_list = ['customerid']
>>>>
>>>> >>> for k in numeric_field_list:
>>>> ...     df = df.withColumn(k,df[k].cast("long"))
>>>> ...
>>>> >>> df.printSchema()
>>>> root
>>>>  |-- customerid: long (nullable = true)
>>>>  |-- foo: string (nullable = true)
>>>>
>>>>
>>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> Ok, thanks Michael!
>>>>
>>>>
>>>> Can I get an idea on where to start? Assuming I have the end schema and
>>>> the current dataframe...
>>>> How can I loop through it and create a new dataframe using
>>>> withColumn?
>>>>
>>>>
>>>> Am I iterating through the dataframe or the schema?
>>>>
>>>> I'm assuming it's easier to iterate through the columns in the old df.
>>>> For each column cast it correctly and generate a new df?
>>>>
>>>>
>>>> Would you recommend that?
>>>>
>>>> Regards
>>>> Sam
>>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
>>>> wrote:
>>>>
>>>> If you already have the expected schema, and you know that all numbers
>>>> will always be formatted as strings in the input JSON, you could probably
>>>> derive this list automatically.
>>>>
>>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>>> quotes?
>>>>
>>>>
>>>> I think this is likely to be a slower and less robust solution.  You
>>>> would have to make sure that you got all the corner cases right (i.e.
>>>> escaping and what not).
>>>>
>>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> I see, so for the connector I need to pass in an array/list of numerical
>>>> columns?
>>>>
>>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>>> quotes?
>>>>
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>> Specifying the schema when parsing JSON will only let you pick between
>>>> similar datatypes (i.e. should this be a short, long, float, double, etc.).
>>>> It will not let you perform conversions like string <-> number. This has to
>>>> be done with explicit casts after the data has been loaded.
>>>>
>>>> I think you can make a solution that uses select or withColumn
>>>> generic.  Just load the dataframe with a "parse schema" that treats numbers
>>>> as strings.  Then construct a list of columns that should be numbers and
>>>> apply the necessary conversions.
>>>>
>>>> import org.apache.spark.sql.functions.col
>>>> var df = spark.read.schema(parseSchema).json("...")
>>>> numericColumns.foreach { columnName =>
>>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>>> }
>>>>
>>>>
>>>>
>>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Michael
>>>>
>>>> I've been spending the past few days researching this
>>>>
>>>> The problem is the generated JSON has double quotes on fields that are
>>>> numbers, because the producing datastore doesn't want to lose precision.
>>>>
>>>> I can change the data type, true, but that would be specific to a job
>>>> rather than a generic streaming job. I'm writing a structured streaming
>>>> connector, and I have the schema the generated dataframe should match.
>>>>
>>>> Unfortunately using withColumn won't help me here since the solution
>>>> needs to be generic
>>>>
>>>> To summarise, assume I have the following JSON:
>>>>
>>>> [{
>>>> "customerid": "535137",
>>>> "foo": "bar"
>>>> }]
>>>>
>>>>
>>>> and I know the schema should be:
>>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>>
>>>> What's the best way of solving this?
>>>>
>>>> My current approach is to iterate over the JSON and identify which
>>>> fields are numbers and which aren't, then recreate the JSON.
>>>>
>>>> But to be honest that doesn't seem like the cleanest approach, so I'm happy
>>>> for advice on this.
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>>>> wrote:
>>>>
>>>> -dev
>>>>
>>>> You can use withColumn to change the type after the data has been
>>>> loaded
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>>> .
>>>>
>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Dirceu
>>>>
>>>> Thanks, you're right! That did work.
>>>>
>>>>
>>>> But now I'm facing an even bigger problem, since I don't have access to
>>>> change the underlying data; I just want to apply a schema over something
>>>> that was written via sparkContext.newAPIHadoopRDD.
>>>>
>>>> Basically I am reading in an RDD[JsonObject] and would like to convert
>>>> it into a dataframe which I pass the schema into.
>>>>
>>>> What's the best way to do this?
>>>>
>>>> I doubt removing all the quotes in the JSON is the best solution, is it?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>>> dirceu.semighini@gmail.com> wrote:
>>>>
>>>> Hi Sam
>>>> Remove the " from the number and it will work
>>>>
>>>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi All
>>>>
>>>> I would like to specify a schema when reading from a JSON file, but when
>>>> trying to map a number to a Double it fails. I tried FloatType and
>>>> IntegerType with no joy!
>>>>
>>>>
>>>> When the schema is inferred, customerid is set to String, and I would
>>>> like to cast it to Double.
>>>>
>>>> so df1 comes back corrupted while df2 shows the data as expected.
>>>>
>>>>
>>>> Also, FYI, I need this to be generic as I would like to apply it to any
>>>> JSON; I specified the schema below as an example of the issue I am facing.
>>>>
>>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> df1.show(1)
>>>> df2.show(1)
>>>>
>>>>
>>>> Any help would be appreciated. I am sure I am missing something obvious,
>>>> but for the life of me I can't tell what it is!
>>>>
>>>>
>>>> Kind Regards
>>>> Sam
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: specifying schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Yup sorry, I should have explained myself better.

So I know which columns are numeric and which aren't, because I have a
StructType and all the internal StructFields will tell me which ones have a
DataType which is numeric and which aren't.

So assuming I have a JSON string which has double quotes on numbers when it
shouldn't, and I have the correct schema in a struct type,


how can I iterate over them to programmatically create the new dataframe in
the correct format?

Do I iterate over the columns in the StructType? Or iterate over the
columns in the dataframe and try to match them with the StructType?

I hope I cleared things up. What I wouldn't do for a drawing board right
now!


On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <gu...@gmail.com> wrote:

> Umm....I think the premise is you need to "know" beforehand which columns
> are numeric.....Unless you know it, how would you apply the schema?
>
> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
>> Thanks ayan but I meant how to derive the list automatically
>>
>> In your example you are specifying the numeric columns and I would like
>> it to be applied to any schema if that makes sense
>> On Mon, 6 Feb 2017 at 08:49, ayan guha <gu...@gmail.com> wrote:
>>
>>> SImple (pyspark) example:
>>>
>>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>> >>> df.printSchema()
>>> root
>>>  |-- customerid: string (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>> >>> numeric_field_list = ['customerid']
>>>
>>> >>> for k in numeric_field_list:
>>> ...     df = df.withColumn(k,df[k].cast("long"))
>>> ...
>>> >>> df.printSchema()
>>> root
>>>  |-- customerid: long (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>>
>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>> Ok thanks Micheal!
>>>
>>>
>>> Can I get an idea on where to start? Assuming I have the end schema and
>>> the current dataframe...
>>> How can I loop through it and create a new dataframe using the
>>> WithColumn?
>>>
>>>
>>> Am I iterating through the dataframe or the schema?
>>>
>>> I'm assuming it's easier to iterate through the columns in the old df.
>>> For each column cast it correctly and generate a new df?
>>>
>>>
>>> Would you recommend that?
>>>
>>> Regards
>>> Sam
>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>> If you already have the expected schema, and you know that all numbers
>>> will always be formatted as strings in the input JSON, you could probably
>>> derive this list automatically.
>>>
>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>>
>>> I think this is likely to be a slower and less robust solution.  You
>>> would have to make sure that you got all the corner cases right (i.e.
>>> escaping and what not).
>>>
>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>> I see so for the connector I need to pass in an array/list of numerical
>>> columns?
>>>
>>> Wouldnt it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>> Specifying the schema when parsing JSON will only let you pick between
>>> similar datatypes (i.e should this be a short, long float, double etc).  It
>>> will not let you perform conversions like string <-> number.  This has to
>>> be done with explicit casts after the data has been loaded.
>>>
>>> I think you can make a solution that uses select or withColumn generic.
>>> Just load the dataframe with a "parse schema" that treats numbers as
>>> strings.  Then construct a list of columns that should be numbers and apply
>>> the necessary conversions.
>>>
>>> import org.apache.spark.sql.functions.col
>>> var df = spark.read.schema(parseSchema).json("...")
>>> numericColumns.foreach { columnName =>
>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>> }
>>>
>>>
>>>
>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>> Thanks Micheal
>>>
>>> I've been spending the past few days researching this
>>>
>>> The problem is the generated json has double quotes on fields that are
>>> numbers because the producing datastore doesn't want to lose precision
>>>
>>> I can change the data type true but that would be on specific to a job
>>> rather than a generic streaming job. I'm writing a structured streaming
>>> connector and I have the schema the generated dataframe should match.
>>>
>>> Unfortunately using withColumn won't help me here since the solution
>>> needs to be generic
>>>
>>> To summarise assume I have the following json
>>>
>>> [{
>>> "customerid": "535137",
>>> "foo": "bar"
>>> }]
>>>
>>>
>>> and I know the schema should be:
>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>
>>> Whats the best way of solving this?
>>>
>>> My current approach is to iterate over the JSON and identify which
>>> fields are numbers and which arent then recreate the json
>>>
>>> But to be honest that doesnt seem like the cleanest approach, so happy
>>> for advice on this
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>> -dev
>>>
>>> You can use withColumn to change the type after the data has been loaded
>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>> .
>>>
>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>> Hi Direceu
>>>
>>> Thanks your right! that did work
>>>
>>>
>>> But now im facing an even bigger problem since i dont have access to
>>> change the underlying data, I just want to apply a schema over something
>>> that was written via the sparkContext.newAPIHadoopRDD
>>>
>>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>>> into a dataframe which I pass the schema into
>>>
>>> Whats the best way to do this?
>>>
>>> I doubt removing all the quotes in the JSON is the best solution is it?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>> dirceu.semighini@gmail.com> wrote:
>>>
>>> Hi Sam
>>> Remove the " from the number that it will work
>>>
>>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>>> escreveu:
>>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a json but when
>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>> with no joy!
>>>
>>>
>>> When inferring the schema customer id is set to String, and I would like
>>> to cast it as Double
>>>
>>> so df1 is corrupted while df2 shows
>>>
>>>
>>> Also FYI I need this to be generic as I would like to apply it to any
>>> json, I specified the below schema as an example of the issue I am facing
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>>
>>> Any help would be appreciated, I am sure I am missing something obvious
>>> but for the life of me I cant tell what it is!
>>>
>>>
>>> Kind Regards
>>> Sam
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: specifing schema on dataframe

Posted by ayan guha <gu...@gmail.com>.
Umm....I think the premise is you need to "know" beforehand which columns
are numeric.....Unless you know it, how would you apply the schema?

On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hu...@gmail.com> wrote:

> Thanks ayan but I meant how to derive the list automatically
>
> In your example you are specifying the numeric columns and I would like it
> to be applied to any schema if that makes sense
> On Mon, 6 Feb 2017 at 08:49, ayan guha <gu...@gmail.com> wrote:
>
>> SImple (pyspark) example:
>>
>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>> >>> df.printSchema()
>> root
>>  |-- customerid: string (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>> >>> numeric_field_list = ['customerid']
>>
>> >>> for k in numeric_field_list:
>> ...     df = df.withColumn(k,df[k].cast("long"))
>> ...
>> >>> df.printSchema()
>> root
>>  |-- customerid: long (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>>
>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Ok thanks Micheal!
>>
>>
>> Can I get an idea on where to start? Assuming I have the end schema and
>> the current dataframe...
>> How can I loop through it and create a new dataframe using the WithColumn?
>>
>>
>> Am I iterating through the dataframe or the schema?
>>
>> I'm assuming it's easier to iterate through the columns in the old df.
>> For each column cast it correctly and generate a new df?
>>
>>
>> Would you recommend that?
>>
>> Regards
>> Sam
>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> If you already have the expected schema, and you know that all numbers
>> will always be formatted as strings in the input JSON, you could probably
>> derive this list automatically.
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>>
>> I think this is likely to be a slower and less robust solution.  You
>> would have to make sure that you got all the corner cases right (i.e.
>> escaping and what not).
>>
>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> I see so for the connector I need to pass in an array/list of numerical
>> columns?
>>
>> Wouldnt it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>>
>> Regards
>> Sam
>>
>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e should this be a short, long float, double etc).  It
>> will not let you perform conversions like string <-> number.  This has to
>> be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings.  Then construct a list of columns that should be numbers and apply
>> the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
>>
>>
>>
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Thanks Micheal
>>
>> I've been spending the past few days researching this
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers because the producing datastore doesn't want to lose precision
>>
>> I can change the data type true but that would be on specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here since the solution
>> needs to be generic
>>
>> To summarise assume I have the following json
>>
>> [{
>> "customerid": "535137",
>> "foo": "bar"
>> }]
>>
>>
>> and I know the schema should be:
>> StructType(Array(StructField("customerid",LongType,true),
>> StructField("foo",StringType,true)))
>>
>> Whats the best way of solving this?
>>
>> My current approach is to iterate over the JSON and identify which fields
>> are numbers and which arent then recreate the json
>>
>> But to be honest that doesnt seem like the cleanest approach, so happy
>> for advice on this
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> -dev
>>
>> You can use withColumn to change the type after the data has been loaded
>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>> .
>>
>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Hi Direceu
>>
>> Thanks your right! that did work
>>
>>
>> But now im facing an even bigger problem since i dont have access to
>> change the underlying data, I just want to apply a schema over something
>> that was written via the sparkContext.newAPIHadoopRDD
>>
>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>> into a dataframe which I pass the schema into
>>
>> Whats the best way to do this?
>>
>> I doubt removing all the quotes in the JSON is the best solution is it?
>>
>> Regards
>> Sam
>>
>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>> Hi Sam
>> Remove the " from the number that it will work
>>
>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>> escreveu:
>>
>> Hi All
>>
>> I would like to specify a schema when reading from a json but when trying
>> to map a number to a Double it fails, I tried FloatType and IntType with no
>> joy!
>>
>>
>> When inferring the schema customer id is set to String, and I would like
>> to cast it as Double
>>
>> so df1 is corrupted while df2 shows
>>
>>
>> Also FYI I need this to be generic as I would like to apply it to any
>> json, I specified the below schema as an example of the issue I am facing
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>>
>> Any help would be appreciated, I am sure I am missing something obvious
>> but for the life of me I cant tell what it is!
>>
>>
>> Kind Regards
>> Sam
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>


-- 
Best Regards,
Ayan Guha

Re: specifing schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Thanks ayan, but I meant how to derive the list automatically.

In your example you are specifying the numeric columns by hand, and I would
like this to work against any schema, if that makes sense.
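
A minimal sketch of one way to derive that list automatically, assuming the
expected schema is available as a StructType (expectedSchema here is a
hypothetical name):

import org.apache.spark.sql.types.{NumericType, StructType}

// Collect the names of every field whose expected type is numeric,
// so the cast list doesn't have to be hard-coded per job.
def numericColumns(expectedSchema: StructType): Array[String] =
  expectedSchema.fields.collect {
    case f if f.dataType.isInstanceOf[NumericType] => f.name
  }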
On Mon, 6 Feb 2017 at 08:49, ayan guha <gu...@gmail.com> wrote:

> SImple (pyspark) example:
>
> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
> >>> df.printSchema()
> root
>  |-- customerid: string (nullable = true)
>  |-- foo: string (nullable = true)
>
> >>> numeric_field_list = ['customerid']
>
> >>> for k in numeric_field_list:
> ...     df = df.withColumn(k,df[k].cast("long"))
> ...
> >>> df.printSchema()
> root
>  |-- customerid: long (nullable = true)
>  |-- foo: string (nullable = true)
>
>
> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Ok thanks Micheal!
>
>
> Can I get an idea on where to start? Assuming I have the end schema and
> the current dataframe...
> How can I loop through it and create a new dataframe using the WithColumn?
>
>
> Am I iterating through the dataframe or the schema?
>
> I'm assuming it's easier to iterate through the columns in the old df. For
> each column cast it correctly and generate a new df?
>
>
> Would you recommend that?
>
> Regards
> Sam
> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> If you already have the expected schema, and you know that all numbers
> will always be formatted as strings in the input JSON, you could probably
> derive this list automatically.
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> I think this is likely to be a slower and less robust solution.  You would
> have to make sure that you got all the corner cases right (i.e. escaping
> and what not).
>
> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> I see so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldnt it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e should this be a short, long float, double etc).  It
> will not let you perform conversions like string <-> number.  This has to
> be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings.  Then construct a list of columns that should be numbers and apply
> the necessary conversions.
>
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
>
>
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Thanks Micheal
>
> I've been spending the past few days researching this
>
> The problem is the generated json has double quotes on fields that are
> numbers because the producing datastore doesn't want to lose precision
>
> I can change the data type true but that would be on specific to a job
> rather than a generic streaming job. I'm writing a structured streaming
> connector and I have the schema the generated dataframe should match.
>
> Unfortunately using withColumn won't help me here since the solution needs
> to be generic
>
> To summarise assume I have the following json
>
> [{
> "customerid": "535137",
> "foo": "bar"
> }]
>
>
> and I know the schema should be:
>
> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>
> Whats the best way of solving this?
>
> My current approach is to iterate over the JSON and identify which fields
> are numbers and which arent then recreate the json
>
> But to be honest that doesnt seem like the cleanest approach, so happy for
> advice on this
>
> Regards
> Sam
>
> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
> .
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Hi Direceu
>
> Thanks your right! that did work
>
>
> But now im facing an even bigger problem since i dont have access to
> change the underlying data, I just want to apply a schema over something
> that was written via the sparkContext.newAPIHadoopRDD
>
> Basically I am reading in a RDD[JsonObject] and would like to convert it
> into a dataframe which I pass the schema into
>
> Whats the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
> Hi Sam
> Remove the " from the number that it will work
>
> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
> escreveu:
>
> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I cant tell what it is!
>
>
> Kind Regards
> Sam
>
>
>
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: specifing schema on dataframe

Posted by ayan guha <gu...@gmail.com>.
Simple (pyspark) example:

>>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>> df.printSchema()
root
 |-- customerid: string (nullable = true)
 |-- foo: string (nullable = true)

>>> numeric_field_list = ['customerid']

>>> for k in numeric_field_list:
...     df = df.withColumn(k,df[k].cast("long"))
...
>>> df.printSchema()
root
 |-- customerid: long (nullable = true)
 |-- foo: string (nullable = true)


On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hu...@gmail.com> wrote:

> Ok thanks Micheal!
>
>
> Can I get an idea on where to start? Assuming I have the end schema and
> the current dataframe...
> How can I loop through it and create a new dataframe using the WithColumn?
>
>
> Am I iterating through the dataframe or the schema?
>
> I'm assuming it's easier to iterate through the columns in the old df. For
> each column cast it correctly and generate a new df?
>
>
> Would you recommend that?
>
> Regards
> Sam
> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> If you already have the expected schema, and you know that all numbers
>> will always be formatted as strings in the input JSON, you could probably
>> derive this list automatically.
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>>
>> I think this is likely to be a slower and less robust solution.  You
>> would have to make sure that you got all the corner cases right (i.e.
>> escaping and what not).
>>
>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> I see so for the connector I need to pass in an array/list of numerical
>> columns?
>>
>> Wouldnt it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>>
>> Regards
>> Sam
>>
>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e should this be a short, long float, double etc).  It
>> will not let you perform conversions like string <-> number.  This has to
>> be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings.  Then construct a list of columns that should be numbers and apply
>> the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
>>
>>
>>
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Thanks Micheal
>>
>> I've been spending the past few days researching this
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers because the producing datastore doesn't want to lose precision
>>
>> I can change the data type true but that would be on specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here since the solution
>> needs to be generic
>>
>> To summarise assume I have the following json
>>
>> [{
>> "customerid": "535137",
>> "foo": "bar"
>> }]
>>
>>
>> and I know the schema should be:
>> StructType(Array(StructField("customerid",LongType,true),
>> StructField("foo",StringType,true)))
>>
>> Whats the best way of solving this?
>>
>> My current approach is to iterate over the JSON and identify which fields
>> are numbers and which arent then recreate the json
>>
>> But to be honest that doesnt seem like the cleanest approach, so happy
>> for advice on this
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> -dev
>>
>> You can use withColumn to change the type after the data has been loaded
>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>> .
>>
>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Hi Direceu
>>
>> Thanks your right! that did work
>>
>>
>> But now im facing an even bigger problem since i dont have access to
>> change the underlying data, I just want to apply a schema over something
>> that was written via the sparkContext.newAPIHadoopRDD
>>
>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>> into a dataframe which I pass the schema into
>>
>> Whats the best way to do this?
>>
>> I doubt removing all the quotes in the JSON is the best solution is it?
>>
>> Regards
>> Sam
>>
>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>> Hi Sam
>> Remove the " from the number that it will work
>>
>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>> escreveu:
>>
>> Hi All
>>
>> I would like to specify a schema when reading from a json but when trying
>> to map a number to a Double it fails, I tried FloatType and IntType with no
>> joy!
>>
>>
>> When inferring the schema customer id is set to String, and I would like
>> to cast it as Double
>>
>> so df1 is corrupted while df2 shows
>>
>>
>> Also FYI I need this to be generic as I would like to apply it to any
>> json, I specified the below schema as an example of the issue I am facing
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>>
>> Any help would be appreciated, I am sure I am missing something obvious
>> but for the life of me I cant tell what it is!
>>
>>
>> Kind Regards
>> Sam
>>
>>
>>
>>
>>
>>
>>


-- 
Best Regards,
Ayan Guha

Re: specifing schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Ok, thanks Michael!


Can I get an idea on where to start? Assuming I have the end schema and the
current dataframe...
How can I loop through it and create a new dataframe using withColumn?


Am I iterating through the dataframe or the schema?

I'm assuming it's easier to iterate through the columns in the old df: for
each column, cast it correctly and generate a new df?


Would you recommend that?

Regards
Sam
On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mi...@databricks.com>
wrote:

> If you already have the expected schema, and you know that all numbers
> will always be formatted as strings in the input JSON, you could probably
> derive this list automatically.
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> I think this is likely to be a slower and less robust solution.  You would
> have to make sure that you got all the corner cases right (i.e. escaping
> and what not).
>
> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> I see so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldnt it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e should this be a short, long float, double etc).  It
> will not let you perform conversions like string <-> number.  This has to
> be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings.  Then construct a list of columns that should be numbers and apply
> the necessary conversions.
>
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
>
>
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Thanks Micheal
>
> I've been spending the past few days researching this
>
> The problem is the generated json has double quotes on fields that are
> numbers because the producing datastore doesn't want to lose precision
>
> I can change the data type true but that would be on specific to a job
> rather than a generic streaming job. I'm writing a structured streaming
> connector and I have the schema the generated dataframe should match.
>
> Unfortunately using withColumn won't help me here since the solution needs
> to be generic
>
> To summarise assume I have the following json
>
> [{
> "customerid": "535137",
> "foo": "bar"
> }]
>
>
> and I know the schema should be:
>
> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>
> Whats the best way of solving this?
>
> My current approach is to iterate over the JSON and identify which fields
> are numbers and which arent then recreate the json
>
> But to be honest that doesnt seem like the cleanest approach, so happy for
> advice on this
>
> Regards
> Sam
>
> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
> .
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Hi Direceu
>
> Thanks your right! that did work
>
>
> But now im facing an even bigger problem since i dont have access to
> change the underlying data, I just want to apply a schema over something
> that was written via the sparkContext.newAPIHadoopRDD
>
> Basically I am reading in a RDD[JsonObject] and would like to convert it
> into a dataframe which I pass the schema into
>
> Whats the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
> Hi Sam
> Remove the " from the number that it will work
>
> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
> escreveu:
>
> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I cant tell what it is!
>
>
> Kind Regards
> Sam
>
>
>
>
>
>
>

Re: specifing schema on dataframe

Posted by Michael Armbrust <mi...@databricks.com>.
If you already have the expected schema, and you know that all numbers will
always be formatted as strings in the input JSON, you could probably derive
this list automatically.

Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?


I think this is likely to be a slower and less robust solution.  You would
have to make sure that you got all the corner cases right (i.e. escaping
and what not).

On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hu...@gmail.com> wrote:

> I see so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldnt it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e should this be a short, long float, double etc).  It
>> will not let you perform conversions like string <-> number.  This has to
>> be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings.  Then construct a list of columns that should be numbers and apply
>> the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
>>
>>
>>
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>>> Thanks Micheal
>>>
>>> I've been spending the past few days researching this
>>>
>>> The problem is the generated json has double quotes on fields that are
>>> numbers because the producing datastore doesn't want to lose precision
>>>
>>> I can change the data type true but that would be on specific to a job
>>> rather than a generic streaming job. I'm writing a structured streaming
>>> connector and I have the schema the generated dataframe should match.
>>>
>>> Unfortunately using withColumn won't help me here since the solution
>>> needs to be generic
>>>
>>> To summarise assume I have the following json
>>>
>>> [{
>>> "customerid": "535137",
>>> "foo": "bar"
>>> }]
>>>
>>>
>>> and I know the schema should be:
>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>
>>> Whats the best way of solving this?
>>>
>>> My current approach is to iterate over the JSON and identify which
>>> fields are numbers and which arent then recreate the json
>>>
>>> But to be honest that doesnt seem like the cleanest approach, so happy
>>> for advice on this
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>>> -dev
>>>>
>>>> You can use withColumn to change the type after the data has been
>>>> loaded
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>>> .
>>>>
>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Direceu
>>>>
>>>> Thanks your right! that did work
>>>>
>>>>
>>>> But now im facing an even bigger problem since i dont have access to
>>>> change the underlying data, I just want to apply a schema over something
>>>> that was written via the sparkContext.newAPIHadoopRDD
>>>>
>>>> Basically I am reading in a RDD[JsonObject] and would like to convert
>>>> it into a dataframe which I pass the schema into
>>>>
>>>> Whats the best way to do this?
>>>>
>>>> I doubt removing all the quotes in the JSON is the best solution is it?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>>> dirceu.semighini@gmail.com> wrote:
>>>>
>>>> Hi Sam
>>>> Remove the " from the number that it will work
>>>>
>>>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>>>> escreveu:
>>>>
>>>> Hi All
>>>>
>>>> I would like to specify a schema when reading from a json but when
>>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>>> with no joy!
>>>>
>>>>
>>>> When inferring the schema customer id is set to String, and I would
>>>> like to cast it as Double
>>>>
>>>> so df1 is corrupted while df2 shows
>>>>
>>>>
>>>> Also FYI I need this to be generic as I would like to apply it to any
>>>> json, I specified the below schema as an example of the issue I am facing
>>>>
>>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> df1.show(1)
>>>> df2.show(1)
>>>>
>>>>
>>>> Any help would be appreciated, I am sure I am missing something obvious
>>>> but for the life of me I cant tell what it is!
>>>>
>>>>
>>>> Kind Regards
>>>> Sam
>>>>
>>>>
>>>>
>>>>
>>
>

Re: specifing schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
I see so for the connector I need to pass in an array/list of numerical
columns?

Wouldn't it be simpler to just regex replace the numbers to remove the
quotes?


Regards
Sam

On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e should this be a short, long float, double etc).  It
> will not let you perform conversions like string <-> number.  This has to
> be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings.  Then construct a list of columns that should be numbers and apply
> the necessary conversions.
>
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
>
>
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com>
> wrote:
>
>> Thanks Micheal
>>
>> I've been spending the past few days researching this
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers because the producing datastore doesn't want to lose precision
>>
>> I can change the data type true but that would be on specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here since the solution
>> needs to be generic
>>
>> To summarise assume I have the following json
>>
>> [{
>> "customerid": "535137",
>> "foo": "bar"
>> }]
>>
>>
>> and I know the schema should be:
>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>
>> Whats the best way of solving this?
>>
>> My current approach is to iterate over the JSON and identify which fields
>> are numbers and which arent then recreate the json
>>
>> But to be honest that doesnt seem like the cleanest approach, so happy
>> for advice on this
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> -dev
>>>
>>> You can use withColumn to change the type after the data has been loaded
>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>> .
>>>
>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>>> wrote:
>>>
>>> Hi Direceu
>>>
>>> Thanks your right! that did work
>>>
>>>
>>> But now im facing an even bigger problem since i dont have access to
>>> change the underlying data, I just want to apply a schema over something
>>> that was written via the sparkContext.newAPIHadoopRDD
>>>
>>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>>> into a dataframe which I pass the schema into
>>>
>>> Whats the best way to do this?
>>>
>>> I doubt removing all the quotes in the JSON is the best solution is it?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>> dirceu.semighini@gmail.com> wrote:
>>>
>>> Hi Sam
>>> Remove the " from the number that it will work
>>>
>>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>>> escreveu:
>>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a json but when
>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>> with no joy!
>>>
>>>
>>> When inferring the schema customer id is set to String, and I would like
>>> to cast it as Double
>>>
>>> so df1 is corrupted while df2 shows
>>>
>>>
>>> Also FYI I need this to be generic as I would like to apply it to any
>>> json, I specified the below schema as an example of the issue I am facing
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>>
>>> Any help would be appreciated, I am sure I am missing something obvious
>>> but for the life of me I cant tell what it is!
>>>
>>>
>>> Kind Regards
>>> Sam
>>>
>>>
>>>
>>>
>

Re: specifing schema on dataframe

Posted by Michael Armbrust <mi...@databricks.com>.
Specifying the schema when parsing JSON will only let you pick between
similar datatypes (i.e. should this be a short, long, float, double, etc.).  It
will not let you perform conversions like string <-> number.  This has to
be done with explicit casts after the data has been loaded.

I think you can make a solution that uses select or withColumn generic.
Just load the dataframe with a "parse schema" that treats numbers as
strings.  Then construct a list of columns that should be numbers and apply
the necessary conversions.

import org.apache.spark.sql.functions.col
var df = spark.read.schema(parseSchema).json("...")
numericColumns.foreach { columnName =>
  df = df.withColumn(columnName, col(columnName).cast("long"))
}



On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hu...@gmail.com> wrote:

> Thanks Micheal
>
> I've been spending the past few days researching this
>
> The problem is the generated json has double quotes on fields that are
> numbers because the producing datastore doesn't want to lose precision
>
> I can change the data type true but that would be on specific to a job
> rather than a generic streaming job. I'm writing a structured streaming
> connector and I have the schema the generated dataframe should match.
>
> Unfortunately using withColumn won't help me here since the solution needs
> to be generic
>
> To summarise assume I have the following json
>
> [{
> "customerid": "535137",
> "foo": "bar"
> }]
>
>
> and I know the schema should be:
> StructType(Array(StructField("customerid",LongType,true),
> StructField("foo",StringType,true)))
>
> Whats the best way of solving this?
>
> My current approach is to iterate over the JSON and identify which fields
> are numbers and which arent then recreate the json
>
> But to be honest that doesnt seem like the cleanest approach, so happy for
> advice on this
>
> Regards
> Sam
>
> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> -dev
>>
>> You can use withColumn to change the type after the data has been loaded
>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>> .
>>
>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
>> wrote:
>>
>> Hi Direceu
>>
>> Thanks your right! that did work
>>
>>
>> But now im facing an even bigger problem since i dont have access to
>> change the underlying data, I just want to apply a schema over something
>> that was written via the sparkContext.newAPIHadoopRDD
>>
>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>> into a dataframe which I pass the schema into
>>
>> Whats the best way to do this?
>>
>> I doubt removing all the quotes in the JSON is the best solution is it?
>>
>> Regards
>> Sam
>>
>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>> Hi Sam
>> Remove the " from the number that it will work
>>
>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>> escreveu:
>>
>> Hi All
>>
>> I would like to specify a schema when reading from a json but when trying
>> to map a number to a Double it fails, I tried FloatType and IntType with no
>> joy!
>>
>>
>> When inferring the schema customer id is set to String, and I would like
>> to cast it as Double
>>
>> so df1 is corrupted while df2 shows
>>
>>
>> Also FYI I need this to be generic as I would like to apply it to any
>> json, I specified the below schema as an example of the issue I am facing
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>>
>> Any help would be appreciated, I am sure I am missing something obvious
>> but for the life of me I cant tell what it is!
>>
>>
>> Kind Regards
>> Sam
>>
>>
>>
>>

Re: specifing schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Thanks Michael

I've been spending the past few days researching this.

The problem is the generated JSON has double quotes around fields that are
numbers, because the producing datastore doesn't want to lose precision.

I can change the data type, true, but that would be specific to one job
rather than a generic streaming job. I'm writing a structured streaming
connector, and I have the schema the generated dataframe should match.

Unfortunately using withColumn won't help me here, since the solution needs
to be generic.

To summarise, assume I have the following JSON:

[{
"customerid": "535137",
"foo": "bar"
}]


and I know the schema should be:
StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))

What's the best way of solving this?

My current approach is to iterate over the JSON and identify which fields
are numbers and which aren't, then recreate the JSON.

But to be honest that doesn't seem like the cleanest approach, so I'm happy
for advice on this.

Regards
Sam
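
A hedged sketch tying together the "parse schema" approach described above:
derive a string-typed read schema from the expected one, load with it, then
cast back. It assumes a flat schema; parseSchemaFor, expectedSchema and
path are hypothetical placeholders:

import org.apache.spark.sql.types.{NumericType, StringType, StructType}

// Relax every numeric field to a string, so the JSON reader doesn't
// reject the quoted numbers as corrupt records.
def parseSchemaFor(expected: StructType): StructType =
  StructType(expected.fields.map { f =>
    if (f.dataType.isInstanceOf[NumericType]) f.copy(dataType = StringType)
    else f
  })

val raw = spark.read.schema(parseSchemaFor(expectedSchema)).json(path)
// Cast each column back to the type the expected schema demands.
val typed = expectedSchema.fields.foldLeft(raw) { (df, f) =>
  df.withColumn(f.name, df(f.name).cast(f.dataType))
}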

On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mi...@databricks.com>
wrote:

> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
> .
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com>
> wrote:
>
> Hi Direceu
>
> Thanks your right! that did work
>
>
> But now im facing an even bigger problem since i dont have access to
> change the underlying data, I just want to apply a schema over something
> that was written via the sparkContext.newAPIHadoopRDD
>
> Basically I am reading in a RDD[JsonObject] and would like to convert it
> into a dataframe which I pass the schema into
>
> Whats the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
> Hi Sam
> Remove the " from the number that it will work
>
> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
> escreveu:
>
> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I cant tell what it is!
>
>
> Kind Regards
> Sam
>
>
>
>

Re: specifing schema on dataframe

Posted by Michael Armbrust <mi...@databricks.com>.
-dev

You can use withColumn to change the type after the data has been loaded
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
.

On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hu...@gmail.com> wrote:

> Hi Direceu
>
> Thanks your right! that did work
>
>
> But now im facing an even bigger problem since i dont have access to
> change the underlying data, I just want to apply a schema over something
> that was written via the sparkContext.newAPIHadoopRDD
>
> Basically I am reading in a RDD[JsonObject] and would like to convert it
> into a dataframe which I pass the schema into
>
> Whats the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
>> Hi Sam
>> Remove the " from the number that it will work
>>
>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
>> escreveu:
>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a json but when
>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>> with no joy!
>>>
>>>
>>> When inferring the schema customer id is set to String, and I would like
>>> to cast it as Double
>>>
>>> so df1 is corrupted while df2 shows
>>>
>>>
>>> Also FYI I need this to be generic as I would like to apply it to any
>>> json, I specified the below schema as an example of the issue I am facing
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>>
>>> Any help would be appreciated, I am sure I am missing something obvious
>>> but for the life of me I cant tell what it is!
>>>
>>>
>>> Kind Regards
>>> Sam
>>>
>>
>

Re: specifing schema on dataframe

Posted by Sam Elamin <hu...@gmail.com>.
Hi Direceu

Thanks, you're right! That did work.


But now I'm facing an even bigger problem, since I don't have access to change
the underlying data; I just want to apply a schema over something that was
written via sparkContext.newAPIHadoopRDD.

Basically I am reading in an RDD[JsonObject] and would like to convert it
into a dataframe, passing the schema in.

What's the best way to do this?

I doubt removing all the quotes in the JSON is the best solution, is it?

Regards
Sam
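
A hedged sketch of one option, assuming the JsonObject's toString emits
valid JSON (as Gson's does) and a Spark 2.x build where
DataFrameReader.json still accepts an RDD[String]; jsonObjects and
mySchema are hypothetical names:

import org.apache.spark.rdd.RDD

// Serialise each JsonObject back to JSON text, then let the reader
// parse the strings with the supplied schema.
val jsonStrings: RDD[String] = jsonObjects.map(_.toString)
val df = spark.read.schema(mySchema).json(jsonStrings)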

On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

> Hi Sam
> Remove the " from the number that it will work
>
> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
> escreveu:
>
>> Hi All
>>
>> I would like to specify a schema when reading from a json but when trying
>> to map a number to a Double it fails, I tried FloatType and IntType with no
>> joy!
>>
>>
>> When inferring the schema customer id is set to String, and I would like
>> to cast it as Double
>>
>> so df1 is corrupted while df2 shows
>>
>>
>> Also FYI I need this to be generic as I would like to apply it to any
>> json, I specified the below schema as an example of the issue I am facing
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>>
>> Any help would be appreciated, I am sure I am missing something obvious
>> but for the life of me I cant tell what it is!
>>
>>
>> Kind Regards
>> Sam
>>
>

Re: specifing schema on dataframe

Posted by Dirceu Semighini Filho <di...@gmail.com>.
Hi Sam
Remove the " from the number that it will work

Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hu...@gmail.com>
escreveu:

> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I cant tell what it is!
>
>
> Kind Regards
> Sam
>
