Posted to dev@spark.apache.org by Shubham Chaurasia <sh...@gmail.com> on 2019/02/05 12:46:11 UTC

DataSourceV2 producing wrong date value in Custom Data Writer

Hi All,

I am using a custom DataSourceV2 implementation (Spark version 2.3.2).

Here is how I am trying to pass in a date type from the Spark shell.

scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
scala> df.write.format("com.shubham.MyDataSource").save


Below is the minimal write() method of my DataWriter implementation.

@Override
public void write(InternalRow record) throws IOException {
  ByteArrayOutputStream format = streamingRecordFormatter.format(record);
  System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
}

It prints an integer as output:

MyDataWriter.write: 17039


Is this a bug, or am I doing something wrong?

Thanks,
Shubham

Re: DataSourceV2 producing wrong date value in Custom Data Writer

Posted by Shubham Chaurasia <sh...@gmail.com>.
Thanks Ryan

On Tue, Feb 5, 2019 at 10:28 PM Ryan Blue <rb...@netflix.com> wrote:

> Shubham,
>
> DataSourceV2 passes Spark's internal representation to your source and
> expects Spark's internal representation back from the source. That's why
> you consume and produce InternalRow: "internal" indicates that Spark
> doesn't need to convert the values.
>
> Spark's internal representation for a date is the number of days since the
> Unix epoch (1970-01-01 = day 0).
>
> rb
>
> On Tue, Feb 5, 2019 at 4:46 AM Shubham Chaurasia <
> shubh.chaurasia@gmail.com> wrote:
>
>> Hi All,
>>
>> I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
>>
>> Here is how I am trying to pass in a date type from the Spark shell.
>>
>> scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
>> scala> df.write.format("com.shubham.MyDataSource").save
>>
>>
>> Below is the minimal write() method of my DataWriter implementation.
>>
>> @Override
>> public void write(InternalRow record) throws IOException {
>>   ByteArrayOutputStream format = streamingRecordFormatter.format(record);
>>   System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
>> }
>>
>> It prints an integer as output:
>>
>> MyDataWriter.write: 17039
>>
>>
>> Is this a bug, or am I doing something wrong?
>>
>> Thanks,
>> Shubham
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 producing wrong date value in Custom Data Writer

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Shubham,

DataSourceV2 passes Spark's internal representation to your source and
expects Spark's internal representation back from the source. That's why
you consume and produce InternalRow: "internal" indicates that Spark
doesn't need to convert the values.

Spark's internal representation for a date is the number of days since the
Unix epoch (1970-01-01 = day 0).
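
For example, a minimal sketch of a write() method that just converts that day
ordinal back to a printable date (assuming the date is in column 0 and is
non-null):

@Override
public void write(InternalRow record) throws IOException {
  // DateType values arrive as an int: days since 1970-01-01.
  int days = record.getInt(0);
  java.time.LocalDate date = java.time.LocalDate.ofEpochDay(days);
  System.out.println("MyDataWriter.write: " + date); // prints yyyy-MM-dd
}

If a downstream API needs a java.sql.Date instead, java.sql.Date.valueOf(date)
produces the equivalent value.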

rb

On Tue, Feb 5, 2019 at 4:46 AM Shubham Chaurasia <sh...@gmail.com>
wrote:

> Hi All,
>
> I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
>
> Here is how I am trying to pass in a date type from the Spark shell.
>
> scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
> scala> df.write.format("com.shubham.MyDataSource").save
>
>
> Below is the minimal write() method of my DataWriter implementation.
>
> @Override
> public void write(InternalRow record) throws IOException {
>   ByteArrayOutputStream format = streamingRecordFormatter.format(record);
>   System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
> }
>
> It prints an integer as output:
>
> MyDataWriter.write: 17039
>
>
> Is this a bug, or am I doing something wrong?
>
> Thanks,
> Shubham
>


-- 
Ryan Blue
Software Engineer
Netflix
