Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2016/05/23 13:25:17 UTC

Reading Parquet Files Created Elsewhere

Hey all, I have some Parquet files that I believe were made in a MapReduce
job and work well in Impala. However, when I read them in Drill, the fields
that are strings come through as [B@25ddbb and the like. The exact string,
represented as a regex, would be /\[B@[a-f0-9]{8}/  (pointers, maybe?)

Well, I found I can cast those fields to VARCHAR and get the right data.
Is this the right approach? Why is this happening? Performance-wise, am I
hurting anything by doing the cast to VARCHAR?
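
For concreteness, this is roughly what I'm running now (the path and column
names are just placeholders, not my real schema):

-- cast the binary-looking columns back to readable strings
SELECT CAST(user_name AS VARCHAR) AS user_name,
       CAST(url AS VARCHAR)       AS url,
       event_ts
FROM dfs.`/data/events`
LIMIT 10;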


Any thoughts would be helpful...

John

Re: Reading Parquet Files Created Elsewhere

Posted by John Omernik <jo...@omernik.com>.
That did work faster: I can now get the 10 rows in 12 seconds as opposed to
25.

So in my 25-second query I CAST every column coming out of the Parquet file,
but do I need to do that? For the 12-second query I only CONVERT_FROM the
string values, and the view seems happier. So that's nice.
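
For reference, the view now looks roughly like this (path and column names
are placeholders; only the string columns get CONVERT_FROM, the rest pass
through untouched):

CREATE OR REPLACE VIEW dfs.tmp.`events_v` AS
SELECT CONVERT_FROM(user_name, 'UTF8') AS user_name,  -- string column
       CONVERT_FROM(url, 'UTF8')       AS url,        -- string column
       event_ts,                                      -- non-string, left alone
       bytes_sent
FROM dfs.`/data/events`;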

Thanks for the pointer. I am playing around with CTAS'ing this into another
table; I will try the CTAS (see the other thread) with CONVERT_FROM rather
than CAST.
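
The CTAS I have in mind is roughly this, built on the view above (names are
still placeholders, and I haven't confirmed this is the best approach):

-- materialize the converted data once so queries skip the per-row conversion
CREATE TABLE dfs.tmp.`events_converted` AS
SELECT * FROM dfs.tmp.`events_v`;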

Thanks!

On Mon, May 23, 2016 at 9:45 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> John,
>
> See if convert_from helps in this regard, I believe it is supposed to be
> faster than cast varchar.
>
> This is likely what will work on your data
> CONVERT_FROM(<column>, 'UTF8')
>
> Hopefully someone with more in depth knowledge of the Drill Parquet reader
> can comment.
>
> --Andries
>

Re: Reading Parquet Files Created Elsewhere

Posted by Andries Engelbrecht <ae...@maprtech.com>.
John,

See if CONVERT_FROM helps in this regard; I believe it is supposed to be
faster than a CAST to VARCHAR.

This is likely what will work on your data:
CONVERT_FROM(<column>, 'UTF8')
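
For example, something along these lines (column name and path are just
illustrative):

SELECT CONVERT_FROM(name_col, 'UTF8') AS name_col
FROM dfs.`/path/to/parquet`
LIMIT 10;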

Hopefully someone with more in depth knowledge of the Drill Parquet reader can comment.

--Andries



> On May 23, 2016, at 7:35 AM, John Omernik <jo...@omernik.com> wrote:
> 
> I am learning more about my data here, the data was created in a CDH
> version of the apache parquet-mr library. (Not sure version yet, getting
> that soon).  They used snappy and version 1.0 of the Parquet spec due to
> Impala needing it.  They are also using setEnableDictionary on the write.
> 
> Trying to figure things out right now
> 
> If I make a view and cast all string fields to a VARCHAR drill shows the
> right result, but it's slow.
> 
> (10 row select from raw = 1.9 seconds, 10 row select with CAST in a view =
> 25 seconds)
> 
> I've resigned myself to converting the table once for performance, which
> isn't an issue however I am getting different issues on that front  (I'll
> open a new thread for that)
> 
> Other than the cast(field AS VARCHAR) as field  is there any other (perhaps
> more performant) way to handle this situation?


Re: Reading Parquet Files Created Elsewhere

Posted by John Omernik <jo...@omernik.com>.
I am learning more about my data here: it was created with a CDH build of
the Apache parquet-mr library (not sure of the version yet, getting that
soon). They used Snappy and version 1.0 of the Parquet spec because Impala
needs it. They are also using setEnableDictionary on the write.

Trying to figure things out right now

If I make a view and CAST all the string fields to VARCHAR, Drill shows the
right results, but it's slow.

(A 10-row select from the raw files takes 1.9 seconds; a 10-row select with
the CAST in a view takes 25 seconds.)
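
The view I'm testing looks roughly like this (path and column names are
placeholders, not my real schema):

-- every string column gets a CAST; this is the slow version
CREATE OR REPLACE VIEW dfs.tmp.`events_cast_v` AS
SELECT CAST(user_name AS VARCHAR) AS user_name,
       CAST(url AS VARCHAR)       AS url,
       event_ts
FROM dfs.`/data/events`;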

I've resigned myself to converting the table once for performance, which
isn't a problem in itself; however, I am running into different issues on
that front (I'll open a new thread for that).

Other than CAST(field AS VARCHAR) AS field, is there any other (perhaps
more performant) way to handle this situation?





On Mon, May 23, 2016 at 8:31 AM, Todd <bi...@163.com> wrote:

>
> Looks like Impala encoded string as binary data, I think there is some
> configuration in Drill(I know spark has) that helps do the conversion.

Re: Reading Parquet Files Created Elsewhere

Posted by Todd <bi...@163.com>.
Looks like Impala encoded the strings as binary data. I think there is some
configuration in Drill (I know Spark has one) that helps do the conversion.
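
For comparison, the Spark setting I have in mind tells its Parquet reader to
treat BINARY columns as strings; in Spark SQL it looks like this (I am not
sure what the Drill equivalent would be, if one exists):

SET spark.sql.parquet.binaryAsString=true;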





At 2016-05-23 21:25:17, "John Omernik" <jo...@omernik.com> wrote:
>Hey all, I have some Parquet files that I believe were made in a Map Reduce
>job and work well in Impala, however, when I read them in Drill, the fields
>that are strings come through as [B@25ddbb etc. The exact string
>represented as regex would be /\[B@[a-f0-9]{8}/  (Pointers maybe?)
>
>Well, I found I  can cast those fields as Varchar... and get the right
>data... is this the right approach?  Why is this happening? Performance
>wise am I hurting something by doing the cast to Varchar?
>
>
>Any thoughts would be helpful...
>
>John