You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by "Akshay Bhasin (BLOOMBERG/ 731 LEX)" <ab...@bloomberg.net> on 2021/04/21 18:32:30 UTC

Feature/Question

Hi Drill Community, 

I'm Akshay and I'm using Drill for a project I'm working on.

There is this particular use case I want to implement - I want to know if its possible.

1) Currently, we have a partition of file system and we create a view on top of it. For example, we have below directory structure - 

/home/product/product_name/year/month/day/*parquet
/home/product/product_name_2/year/month/day/*parquet
/home/product/product_name_3/year/month/day/*parquetdev

Now, we create a view over it - 
Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as month, `dir3` as day, * from dfs.`/home/product`;

Then, we can query all the data dynamically - 
SELECT * from temp LIMIT 5;

2) Now I want to replicate this behavior via s3. I want to ask if its possible - I was able to create a logical directory. But s3 inherently does not support directories only objects. 

Therefore, I was curious to know if it is supported/way to do this. I was unable to find any documentation on your website related to partitioning data on s3.

Thanks for your help.
Best,
Akshay

Re: Feature/Question

Posted by Ted Dunning <te...@gmail.com>.

Akshay,

Yes. It is possible to do what you want from a few different angles.

As you have noted, S3 doesn't have directories. Not really. On the other
hand, people simulate this using naming schemes and S3 has some support for
this.

One of the simplest ways to deal with this is to create a view that
explicitly mentions every S3 object that you have in your table. The
contents of this view can get a bit cumbersome, but that shouldn't be a
problem since users never need to know. You will need to set up a scheduled
action to update this view occasionally, but that is pretty simple.

The other way is to use a naming scheme with a delimiter such as /. This is
described at
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
If you do that and have files named (for instance) foo/a.json, foo/b.json,
foo/c.json and you query

    select * from s3.`foo`

you should see the contents of a.json, b.json and c.json. See here for
commentary
<https://stackoverflow.com/questions/44785065/apache-drill-how-to-query-all-files-in-an-s3-bucket>

I haven't tried this, however, so I am simply going on the reports of
others. If this works for you, please report your success back here.

On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasin16@bloomberg.net> wrote:

> Hi Drill Community,
>
> I'm Akshay and I'm using Drill for a project I'm working on.
>
> There is this particular use case I want to implement - I want to know if
> its possible.
>
> 1) Currently, we have a partition of file system and we create a view on
> top of it. For example, we have below directory structure -
>
> /home/product/product_name/year/month/day/*parquet
> /home/product/product_name_2/year/month/day/*parquet
> /home/product/product_name_3/year/month/day/*parquetdev
>
> Now, we create a view over it -
> Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as
> month, `dir3` as day, * from dfs.`/home/product`;
>
> Then, we can query all the data dynamically -
> SELECT * from temp LIMIT 5;
>
> 2) Now I want to replicate this behavior via s3. I want to ask if its
> possible - I was able to create a logical directory. But s3 inherently does
> not support directories only objects.
>
> Therefore, I was curious to know if it is supported/way to do this. I was
> unable to find any documentation on your website related to partitioning
> data on s3.
>
> Thanks for your help.
> Best,
> Akshay