Posted to user@drill.apache.org by Minnow Noir <mi...@gmail.com> on 2016/10/01 01:26:11 UTC

Query dates in parquet file created by Spark job?

I'm trying to process data using Spark and then query it using Drill.

When I create a parquet file using a Spark 1.6.1 job, and then try to query
it in Drill 1.8.0, I notice that the dates are in an unknown format. All
string and other types seem fine. I'm using the java.sql.Date class because
I get "unsupported" errors when I use java.util.Date and try to save in
parquet format.  If I create the parquet file using CTAS in Drill, I don't
have this problem; this is strictly a problem exchanging data between the
two products.

For example, if I create an RDD of dates, convert it to a DataFrame, save
that DataFrame, and then read the file back into Spark, the values come back
correctly.

...
case class foo(dt: java.sql.Date)
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
val dates = test.map(x => foo(new java.sql.Date(format.parse(x).getTime)))
val df = dates.toDF
df.write.save("blah/test.parquet")
val df2 = sqlContext.read.parquet("blah/test.parquet")
df2.first
res10: org.apache.spark.sql.Row = [2016-06-08]
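For context on what actually lands in the file: Parquet's DATE logical type
stores each value as a 32-bit integer counting days since the Unix epoch
(1970-01-01). A quick sketch of that encoding in Python (illustrative only;
Spark and Drill each perform this conversion internally):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def encode_parquet_date(d: date) -> int:
    """Parquet DATE logical type: int32 days since the Unix epoch."""
    return (d - EPOCH).days

def decode_parquet_date(days: int) -> date:
    """Inverse of the encoding above."""
    return EPOCH + timedelta(days=days)

stored = encode_parquet_date(date(2016, 6, 8))
print(stored)                       # 16960
print(decode_parquet_date(stored))  # 2016-06-08
```

If a reader applies a different epoch or an extra offset when decoding that
integer, every date in the column shifts wholesale, which matches the kind of
corruption shown in the Drill output.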


However, if I query the same file using Drill, I get a different result:

select * from blah limit 1;
+-------------+---------------+
|     dt      |     dir0      |
+-------------+---------------+
| 349-06-19   | test.parquet  |
+-------------+---------------+

Any idea what I need to do in order to be able to query dates in
Spark-created parquet files with Drill?
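One workaround, purely my suggestion rather than anything confirmed in this
thread, is to sidestep the binary DATE encoding entirely: have Spark write the
column as an ISO-8601 string, and cast it on the Drill side (e.g.
SELECT CAST(dt AS DATE) FROM ...). The string form is unambiguous to both
engines. A minimal sketch of the round trip:

```python
from datetime import date, datetime

d = date(2016, 6, 8)

# Write side: store the date as text instead of a DATE column.
as_text = d.isoformat()            # '2016-06-08'

# Read side: what Drill's CAST(dt AS DATE) effectively does.
recovered = datetime.strptime(as_text, "%Y-%m-%d").date()

print(as_text, recovered == d)     # 2016-06-08 True
```

The cost is a larger on-disk column and a cast at query time, but it removes
any dependence on the two engines agreeing about the DATE representation.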

Thanks

Re: Query dates in parquet file created by Spark job?

Posted by Minnow Noir <mi...@gmail.com>.
Interesting. Thanks

On Fri, Sep 30, 2016 at 11:46 PM, Jinfeng Ni <jn...@apache.org> wrote:

> This seems to be the same issue as reported in DRILL-4203 [1]. You are
> right that it causes the problem of exchanging data between two different
> products.
>
> There has been a pull request under review. Hopefully, once DRILL-4203
> is fixed, the issue you saw will be fixed as well.
>
>
> [1] https://issues.apache.org/jira/browse/DRILL-4203
>
> On Fri, Sep 30, 2016 at 6:26 PM, Minnow Noir <mi...@gmail.com> wrote:

Re: Query dates in parquet file created by Spark job?

Posted by Jinfeng Ni <jn...@apache.org>.
This seems to be the same issue as reported in DRILL-4203 [1]. You are right
that it causes the problem of exchanging data between two different
products.

There has been a pull request under review. Hopefully, once DRILL-4203
is fixed, the issue you saw will be fixed as well.


[1] https://issues.apache.org/jira/browse/DRILL-4203


On Fri, Sep 30, 2016 at 6:26 PM, Minnow Noir <mi...@gmail.com> wrote: