Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/04/03 21:05:29 UTC

Parquet File Weirdness

I have a table in Hive (no partitions, single level, stored as PARQUET,
hive-0.13).  When I query it in Hive, it works fine. When I run a
count(*) on it in Drill, it works (fast), but when I run a query, it seems
to return the same number of results, but they look like this...
Thoughts?  (These should be strings with emails, domains, etc.)





[B@4d8c55fe | [B@3861be78 | [B@191fd533 | [B@78e61427 | [B@49354a73 |
[B@49aae991 |

Re: Parquet File Weirdness

Posted by Steven Phillips <sp...@maprtech.com>.
Parquet has a few primitive types, one of which is a binary array. These
primitive types are used to store different "converted types". For example,
one of the converted types that uses a binary array is the "UTF8" string. I
believe that the Parquet files you are querying do not have the "converted
type" set for the columns, so Drill does not know how to interpret the
columns. It therefore treats them as VARBINARY rather than converting them
to VARCHAR. In Hive, the fact that they represent strings is stored in the
metastore, so Hive doesn't have this problem.

To display the data correctly in Drill, you'll need to cast the columns to
varchar, e.g.:

select cast(column as varchar(255)) ...
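
For illustration only: the `[B@4d8c55fe`-style values in the original post
match what Java prints for a byte[] by default (the type tag `[B`, then `@`
and a hex hash code), which fits the columns being returned as undecoded
VARBINARY. The sketch below is plain Java, not Drill code; the class name,
method name, and sample value are made up for the example. It shows the
difference between printing the raw bytes and decoding them as UTF-8, which
is roughly what the cast to varchar achieves.

```java
import java.nio.charset.StandardCharsets;

public class ByteArrayDisplay {
    // Decode raw bytes as UTF-8 text (the Java analogue of treating the
    // binary column as a string).
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample value standing in for one cell of the table.
        byte[] raw = "user@example.com".getBytes(StandardCharsets.UTF_8);

        // Default Object.toString() on a byte[]: "[B@" plus a hex hash code.
        // The hash part varies from run to run.
        System.out.println(raw.toString());

        // Decoded as UTF-8, the original string comes back.
        System.out.println(decodeUtf8(raw));
    }
}
```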

On Fri, Apr 3, 2015 at 12:34 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Are you reading the data using the Hive Storage plugin for Drill and using
> the Metastore, or are you directly querying the parquet files on the
> filesystem with Drill?
>
>
> —Andries
>
>


-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Parquet File Weirdness

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Are you reading the data using the Hive Storage plugin for Drill and using the Metastore, or are you directly querying the parquet files on the filesystem with Drill?


—Andries

