Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/08/05 17:58:04 UTC

Parquet Partitions

After reading about Parquet partition pruning in Drill 1.1, I was wondering
whether Drill still supports pruning based on "Hive-like" partitions. I have
a process that creates a Hive table backed by Parquet files, and it uses
partitions (directories). Do I need Drill to read that data through the Hive
plugin so that it is aware of the partitions and can prune, or can I just use
the DFS plugin, point it at the root of the Hive table, and let it infer the
schema and partitions from the directories that exist?

John

Re: Parquet Partitions

Posted by John Omernik <jo...@omernik.com>.

Interesting. I wonder if that could be added as a feature: you don't need
the metastore to infer those partitions, since you can see them in the
directory listing. I will play around and let you know what I find.
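
To make the idea concrete, here is a minimal Python sketch of inferring
partition names from a directory listing alone. It is only an illustration,
not Drill code; the function name infer_partitions is hypothetical, and it
assumes directory names follow the Hive key=value convention (for example
part1=val1).

```python
def infer_partitions(relative_dir):
    """Parse 'part1=val1/part2=val2' into {'part1': 'val1', 'part2': 'val2'}."""
    partitions = {}
    for component in relative_dir.strip("/").split("/"):
        # Only components shaped like key=value are treated as partitions;
        # anything else (e.g. a plain '2015' directory) is ignored.
        name, sep, value = component.partition("=")
        if sep and name:
            partitions[name] = value
    return partitions


# Directory path relative to the table root (illustrative values).
example = infer_partitions("part1=val1/part2=val2")
```

Directories that do not follow the key=value pattern yield no partition
columns in this sketch, which is why a positional fallback is still useful.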

Thanks!

John

On Wed, Aug 5, 2015 at 3:37 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> John,
>
> Drill has no idea about the names of your partitions, since that information
> is part of the Hive metastore. You can still get partition pruning if you
> modify your query as below:
>
> select * from dfs.hive_parq where dir0 = 'part1=val1';
>
> (dir0 corresponds to the part1 directory level and dir1 to the part2 level;
> each holds the full directory name as a string.)
>
> - Rahul

Re: Parquet Partitions

Posted by rahul challapalli <ch...@gmail.com>.

John,

Drill has no idea about the names of your partitions, since that information
is part of the Hive metastore. You can still get partition pruning if you
modify your query as below:

select * from dfs.hive_parq where dir0 = 'part1=val1';

(dir0 corresponds to the part1 directory level and dir1 to the part2 level;
each holds the full directory name as a string.)
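
As a rough illustration of what filtering on dir0 buys you, the Python
sketch below maps each file path to positional dir0/dir1 values and keeps
only the files under the matching first-level directory. It is not Drill's
implementation; the helper names and the file names are hypothetical, and it
assumes each dirN value is the full directory name (so a Hive-style
directory is matched as 'part1=val1').

```python
from pathlib import PurePosixPath


def dir_columns(table_root, file_path):
    """Return {'dir0': ..., 'dir1': ...} for a file under table_root."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    # Every path component except the file name is one directory level.
    return {f"dir{i}": part for i, part in enumerate(rel.parts[:-1])}


def prune(table_root, file_paths, **filters):
    """Keep only the files whose dirN values match every filter."""
    kept = []
    for path in file_paths:
        cols = dir_columns(table_root, path)
        if all(cols.get(name) == value for name, value in filters.items()):
            kept.append(path)
    return kept


# Hypothetical files laid out the way a Hive partitioned table would be.
files = [
    "/hive_parq/part1=val1/part2=val1/000000_0.parquet",
    "/hive_parq/part1=val1/part2=val2/000000_0.parquet",
    "/hive_parq/part1=val2/part2=val1/000000_0.parquet",
]

# Equivalent in spirit to: where dir0 = 'part1=val1'
selected = prune("/hive_parq", files, dir0="part1=val1")
```

Only the two files under part1=val1 remain in selected; the part1=val2
subtree never needs to be read.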

- Rahul

On Wed, Aug 5, 2015 at 1:21 PM, John Omernik <jo...@omernik.com> wrote:

> So, what I am getting at is this: say a table was created in Hive with
> Parquet files:
>
> CREATE TABLE hive_parq (field1 STRING, field2 STRING)
> PARTITIONED BY (part1 STRING, part2 STRING) STORED AS PARQUET;
>
> That creates a directory named hive_parq. Under it there will be directories
> such as part1=val1, under those part2=val1, part2=val2, and so on, and then
> the actual Parquet files.
>
> Without the Hive metastore, will Drill know that the table is partitioned
> based on the directory names? If I say select * from dfs.hive_parq where
> part1=val1, will it look only in the /hive_parq/part1=val1 folder, or will
> it look at all subdirectories, given that the partition fields are not part
> of the Parquet files and we don't have metastore information to work with?
>
> Thanks!

Re: Parquet Partitions

Posted by John Omernik <jo...@omernik.com>.

So, what I am getting at is this: say a table was created in Hive with
Parquet files:

CREATE TABLE hive_parq (field1 STRING, field2 STRING)
PARTITIONED BY (part1 STRING, part2 STRING) STORED AS PARQUET;

That creates a directory named hive_parq. Under it there will be directories
such as part1=val1, under those part2=val1, part2=val2, and so on, and then
the actual Parquet files.

Without the Hive metastore, will Drill know that the table is partitioned
based on the directory names? If I say select * from dfs.hive_parq where
part1=val1, will it look only in the /hive_parq/part1=val1 folder, or will
it look at all subdirectories, given that the partition fields are not part
of the Parquet files and we don't have metastore information to work with?

Thanks!

On Wed, Aug 5, 2015 at 3:13 PM, Ramana I N <in...@gmail.com> wrote:

> Yes. You can use the dfs plugin in this case.
>
> Regards
> Ramana

Re: Parquet Partitions

Posted by Ramana I N <in...@gmail.com>.

Yes. You can use the dfs plugin in this case.

Regards
Ramana


On Wed, Aug 5, 2015 at 1:02 PM, John Omernik <jo...@omernik.com> wrote:

> Would Drill know to partition prune based on directories if it didn't have
> the hive metastore to define the partitions at the directory level?

Re: Parquet Partitions

Posted by John Omernik <jo...@omernik.com>.

Would Drill know to partition prune based on directories if it didn't have
the hive metastore to define the partitions at the directory level?


On Wed, Aug 5, 2015 at 11:01 AM, Neeraja Rentachintala <
nrentachintala@maprtech.com> wrote:

> John
> Both would work, i.e., you can query the partitioned directories directly
> using the file system storage plugin or via the Hive table.

Re: Parquet Partitions

Posted by Neeraja Rentachintala <nr...@maprtech.com>.

John
Both would work, i.e., you can query the partitioned directories directly
using the file system storage plugin or via the Hive table.

On Wed, Aug 5, 2015 at 8:58 AM, John Omernik <jo...@omernik.com> wrote:

> After reading about Parquet partition pruning in Drill 1.1, I was wondering
> whether Drill still supports pruning based on "Hive-like" partitions. I have
> a process that creates a Hive table backed by Parquet files, and it uses
> partitions (directories). Do I need Drill to read that data through the Hive
> plugin so that it is aware of the partitions and can prune, or can I just use
> the DFS plugin, point it at the root of the Hive table, and let it infer the
> schema and partitions from the directories that exist?
>
> John
>