Posted to user@spark.apache.org by Shubham Chaurasia <sh...@gmail.com> on 2019/02/05 12:46:11 UTC
DataSourceV2 producing wrong date value in Custom Data Writer
Hi All,
I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
Here is how I am passing a date type from spark-shell:
scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
scala> df.write.format("com.shubham.MyDataSource").save
Below is the minimal write() method of my DataWriter implementation.
@Override
public void write(InternalRow record) throws IOException {
    ByteArrayOutputStream format = streamingRecordFormatter.format(record);
    System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
}
It prints an integer as output:
MyDataWriter.write: 17039
Is this a bug, or am I doing something wrong?
Thanks,
Shubham
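[The integer printed above is a day ordinal; as the replies below explain, Spark stores DateType internally as days since 1970-01-01. The sketch below, using only java.time and independent of Spark (the class name DayOrdinalSketch is illustrative), shows the encoding and its inverse:]

```java
import java.time.LocalDate;

// Round-trips a date through the day-ordinal encoding that Spark's
// DateType uses internally: an int counting days since 1970-01-01.
public class DayOrdinalSketch {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2019, 2, 5);
        long ordinal = date.toEpochDay();          // the int a DataWriter sees
        LocalDate decoded = LocalDate.ofEpochDay(ordinal);
        System.out.println(ordinal + " -> " + decoded); // ... -> 2019-02-05
    }
}
```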
Re: DataSourceV2 producing wrong date value in Custom Data Writer
Posted by Shubham Chaurasia <sh...@gmail.com>.
Thanks Ryan
On Tue, Feb 5, 2019 at 10:28 PM Ryan Blue <rb...@netflix.com> wrote:
> Shubham,
>
> DataSourceV2 passes Spark's internal representation to your source and
> expects Spark's internal representation back from the source. That's why
> you consume and produce InternalRow: "internal" indicates that Spark
> doesn't need to convert the values.
>
> Spark's internal representation for a date is the day ordinal from the
> Unix epoch date: 1970-01-01 = 0.
>
> rb
>
> On Tue, Feb 5, 2019 at 4:46 AM Shubham Chaurasia <
> shubh.chaurasia@gmail.com> wrote:
>
>> Hi All,
>>
>> I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
>>
>> Here is how I am passing a date type from spark-shell:
>>
>> scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
>> scala> df.write.format("com.shubham.MyDataSource").save
>>
>>
>> Below is the minimal write() method of my DataWriter implementation.
>>
>> @Override
>> public void write(InternalRow record) throws IOException {
>>     ByteArrayOutputStream format = streamingRecordFormatter.format(record);
>>     System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
>> }
>>
>> It prints an integer as output:
>>
>> MyDataWriter.write: 17039
>>
>>
>> Is this a bug, or am I doing something wrong?
>>
>> Thanks,
>> Shubham
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
Re: DataSourceV2 producing wrong date value in Custom Data Writer
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Shubham,
DataSourceV2 passes Spark's internal representation to your source and
expects Spark's internal representation back from the source. That's why
you consume and produce InternalRow: "internal" indicates that Spark
doesn't need to convert the values.
Spark's internal representation for a date is the day ordinal from the
Unix epoch date: 1970-01-01 = 0.
rb
On Tue, Feb 5, 2019 at 4:46 AM Shubham Chaurasia <sh...@gmail.com>
wrote:
> Hi All,
>
> I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
>
> Here is how I am passing a date type from spark-shell:
>
> scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
> scala> df.write.format("com.shubham.MyDataSource").save
>
>
> Below is the minimal write() method of my DataWriter implementation.
>
> @Override
> public void write(InternalRow record) throws IOException {
>     ByteArrayOutputStream format = streamingRecordFormatter.format(record);
>     System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
> }
>
> It prints an integer as output:
>
> MyDataWriter.write: 17039
>
>
> Is this a bug, or am I doing something wrong?
>
> Thanks,
> Shubham
>
--
Ryan Blue
Software Engineer
Netflix
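[Given the explanation above, the writer can decode the internal value itself with java.time. The sketch below is a hypothetical rework, not a confirmed fix: the class name InternalDateDecoder is illustrative, and a plain int stands in for InternalRow, since record.getInt(columnOrdinal) is how the value would be read in a real DataWriter:]

```java
import java.time.LocalDate;

// Decodes Spark's internal DateType value: an int counting days since
// 1970-01-01. In a real DataWriter the int would come from
// record.getInt(columnOrdinal) instead of being built here.
public class InternalDateDecoder {
    static LocalDate decode(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        int internal = (int) LocalDate.of(2019, 2, 5).toEpochDay();
        System.out.println("MyDataWriter.write: " + decode(internal));
        // prints: MyDataWriter.write: 2019-02-05
    }
}
```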