You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Rafael Jaimes III <ra...@gmail.com> on 2021/08/20 00:57:04 UTC

querying nested parquet directory structure

Hi all,

I have a large dataset of parquet files that are nested within several
subdirectories. For example:

study1
|----data1
    |----2020-01-01
        |---0001.parquet
|----data2

study2
|----dataA
|----dataB

Is it possible for Drill to report back the "directories" as "tables"? For
example to perform a query and return something that tells me the directory
structure?

I've read something about creating workspaces, but to do so for each of the
directories seems onerous, and also requires going into the storage plugin
configuration.

The alternative would be to implement some logic and traverse the file
system, outside of Drill, and then use that information to drive the
"tables" for the queries. Although, that seems unintuitive provided Drill's
ability to traverse the file system, infer schema, create cache, and so on.

Thanks,
Rafael

Re: querying nested parquet directory structure

Posted by Charles Givre <cg...@gmail.com>.

Rafael, 
Thanks for the pointers James.  The INFORMATION_SCHEMA provides a lot of info and you can filter by whether a file is a directory or not. 


There's one other things you should be aware of and that is when you are querying nested directories, Drill adds an implicit field called dirN, where n is the level of nesting, starting at zero.  

So ... Using the structure provided below, you could do something like:

SELECT *, dir0, dir1 
FROM dfs.studies
WHERE dir1='study1'

There are a few functions for directory traversals as this can be an expensive operation.  I can point you to the docs for those as well. 
Best,
-- C



> On Aug 20, 2021, at 8:34 AM, James Turton <ja...@somecomputer.xyz.INVALID> wrote:
> 
> Some pointers, in case you're not already aware of them.
> 
> https://drill.apache.org/docs/querying-the-information-schema/
> 
> show files in dfs.foo;
> show files in dfs.`/foo/bar`;
> 
> select * from information_schema.`files`;
> 
> In my experience, be careful of performance when using the last one.  Especially if you've set the option storage.list_files_recursively = true;
> 
> 
> On 2021/08/20 13:57, Rafael Jaimes III wrote:
>> Thanks Charles.
>> 
>> I'm wondering more along the lines if you don't know the name of the directories. In your example, you have to know that data1 and study1 are called that. How do you find this information? Sure you can examine the file system separate from Drill.
>> 
>> Is there information of the file system path names within Drill, such as in INFORMATION_SCHEMA or similar?
>> 
>> In short I'm wondering if it's possible to have a command like LIST TABLES; and have returned study1.data1 , study1.data2
>> 
>> On August 20, 2021 6:49:42 AM EDT, luoc <lu...@apache.org> wrote:
>>> Best practices. Schema-free in Drill.
>>> 
>>>> 在 2021年8月20日，12:04，Charles Givre <cg...@gmail.com> 写道：
>>>> 
>>>> Hi Rafael,
>>>> If you're asking what I think you're asking, it sounds as if you'd like to query multiple files in a nested directory.  If that's the case, I have some good news...
>>>> Drill allows you to query entire directories as if they were one big file.  Effectively Drill performs a UNION on those files, so the end result is that they appear to be one big table.
>>>> Thus, with the structure you provided, you could do the following:
>>>> 
>>>> SELECT ...
>>>> FROM dfs.`<path>/study1/data1`
>>>> 
>>>> That would roll up all the files under that directory path.  Now, there are some tricks that you should be aware of.  The first are implicit columns.  These can help you figure out the directory structure as well as some basic filtering.  There are also some specific functions that are unique to querying directories.  Take a look at the links below for references about the implicit fields as well as the directory functions.
>>>> 
>>>> https://drill.apache.org/docs/querying-a-file-system-introduction/ <https://drill.apache.org/docs/querying-a-file-system-introduction/>
>>>> https://drill.apache.org/docs/querying-directories/ <https://drill.apache.org/docs/querying-directories/>
>>>> 
>>>> Best,
>>>> -- C
>>>> 
>>>> 
>>>> 
>>>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <ra...@gmail.com> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I have a large dataset of parquet files that are nested within several
>>>>> subdirectories. For example:
>>>>> 
>>>>> study1
>>>>> |----data1
>>>>>   |----2020-01-01
>>>>>       |---0001.parquet
>>>>> |----data2
>>>>> 
>>>>> study2
>>>>> |----dataA
>>>>> |----dataB
>>>>> 
>>>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>>>> example to perform a query and return something that tells me the directory
>>>>> structure?
>>>>> 
>>>>> I've read something about creating workspaces, but to do so for each of the
>>>>> directories seems onerous, and also requires going into the storage plugin
>>>>> configuration.
>>>>> 
>>>>> The alternative would be to implement some logic and traverse the file
>>>>> system, outside of Drill, and then use that information to drive the
>>>>> "tables" for the queries. Although, that seems unintuitive provided Drill's
>>>>> ability to traverse the file system, infer schema, create cache, and so on.
>>>>> 
>>>>> Thanks,
>>>>> Rafael
>

Re: querying nested parquet directory structure

Posted by James Turton <ja...@somecomputer.xyz.INVALID>.

Some pointers, in case you're not already aware of them.

https://drill.apache.org/docs/querying-the-information-schema/

show files in dfs.foo;
show files in dfs.`/foo/bar`;

select * from information_schema.`files`;

In my experience, be careful of performance when using the last one.  
Especially if you've set the option storage.list_files_recursively = true;


On 2021/08/20 13:57, Rafael Jaimes III wrote:
> Thanks Charles.
>
> I'm wondering more along the lines if you don't know the name of the directories. In your example, you have to know that data1 and study1 are called that. How do you find this information? Sure you can examine the file system separate from Drill.
>
> Is there information of the file system path names within Drill, such as in INFORMATION_SCHEMA or similar?
>
> In short I'm wondering if it's possible to have a command like LIST TABLES; and have returned study1.data1 , study1.data2
>
> On August 20, 2021 6:49:42 AM EDT, luoc <lu...@apache.org> wrote:
>> Best practices. Schema-free in Drill.
>>
>>> 在 2021年8月20日，12:04，Charles Givre <cg...@gmail.com> 写道：
>>>
>>> Hi Rafael,
>>> If you're asking what I think you're asking, it sounds as if you'd like to query multiple files in a nested directory.  If that's the case, I have some good news...
>>> Drill allows you to query entire directories as if they were one big file.  Effectively Drill performs a UNION on those files, so the end result is that they appear to be one big table.
>>> Thus, with the structure you provided, you could do the following:
>>>
>>> SELECT ...
>>> FROM dfs.`<path>/study1/data1`
>>>
>>> That would roll up all the files under that directory path.  Now, there are some tricks that you should be aware of.  The first are implicit columns.  These can help you figure out the directory structure as well as some basic filtering.  There are also some specific functions that are unique to querying directories.  Take a look at the links below for references about the implicit fields as well as the directory functions.
>>>
>>> https://drill.apache.org/docs/querying-a-file-system-introduction/ <https://drill.apache.org/docs/querying-a-file-system-introduction/>
>>> https://drill.apache.org/docs/querying-directories/ <https://drill.apache.org/docs/querying-directories/>
>>>
>>> Best,
>>> -- C
>>>
>>>
>>>
>>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <ra...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have a large dataset of parquet files that are nested within several
>>>> subdirectories. For example:
>>>>
>>>> study1
>>>> |----data1
>>>>    |----2020-01-01
>>>>        |---0001.parquet
>>>> |----data2
>>>>
>>>> study2
>>>> |----dataA
>>>> |----dataB
>>>>
>>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>>> example to perform a query and return something that tells me the directory
>>>> structure?
>>>>
>>>> I've read something about creating workspaces, but to do so for each of the
>>>> directories seems onerous, and also requires going into the storage plugin
>>>> configuration.
>>>>
>>>> The alternative would be to implement some logic and traverse the file
>>>> system, outside of Drill, and then use that information to drive the
>>>> "tables" for the queries. Although, that seems unintuitive provided Drill's
>>>> ability to traverse the file system, infer schema, create cache, and so on.
>>>>
>>>> Thanks,
>>>> Rafael

Re: querying nested parquet directory structure

Posted by Rafael Jaimes III <ra...@gmail.com>.

Thanks Charles.

I'm wondering more along the lines if you don't know the name of the directories. In your example, you have to know that data1 and study1 are called that. How do you find this information? Sure you can examine the file system separate from Drill.

Is there information of the file system path names within Drill, such as in INFORMATION_SCHEMA or similar?

In short I'm wondering if it's possible to have a command like LIST TABLES; and have returned study1.data1 , study1.data2

On August 20, 2021 6:49:42 AM EDT, luoc <lu...@apache.org> wrote:
>
>Best practices. Schema-free in Drill.
>
>> 在 2021年8月20日，12:04，Charles Givre <cg...@gmail.com> 写道：
>> 
>> Hi Rafael, 
>> If you're asking what I think you're asking, it sounds as if you'd like to query multiple files in a nested directory.  If that's the case, I have some good news...
>> Drill allows you to query entire directories as if they were one big file.  Effectively Drill performs a UNION on those files, so the end result is that they appear to be one big table. 
>> Thus, with the structure you provided, you could do the following:
>> 
>> SELECT ...
>> FROM dfs.`<path>/study1/data1`
>> 
>> That would roll up all the files under that directory path.  Now, there are some tricks that you should be aware of.  The first are implicit columns.  These can help you figure out the directory structure as well as some basic filtering.  There are also some specific functions that are unique to querying directories.  Take a look at the links below for references about the implicit fields as well as the directory functions.  
>> 
>> https://drill.apache.org/docs/querying-a-file-system-introduction/ <https://drill.apache.org/docs/querying-a-file-system-introduction/>
>> https://drill.apache.org/docs/querying-directories/ <https://drill.apache.org/docs/querying-directories/>
>> 
>> Best,
>> -- C
>> 
>> 
>> 
>>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <ra...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I have a large dataset of parquet files that are nested within several
>>> subdirectories. For example:
>>> 
>>> study1
>>> |----data1
>>>   |----2020-01-01
>>>       |---0001.parquet
>>> |----data2
>>> 
>>> study2
>>> |----dataA
>>> |----dataB
>>> 
>>> Is it possible for Drill to report back the "directories" as "tables"? For
>>> example to perform a query and return something that tells me the directory
>>> structure?
>>> 
>>> I've read something about creating workspaces, but to do so for each of the
>>> directories seems onerous, and also requires going into the storage plugin
>>> configuration.
>>> 
>>> The alternative would be to implement some logic and traverse the file
>>> system, outside of Drill, and then use that information to drive the
>>> "tables" for the queries. Although, that seems unintuitive provided Drill's
>>> ability to traverse the file system, infer schema, create cache, and so on.
>>> 
>>> Thanks,
>>> Rafael
>> 
>

Re: querying nested parquet directory structure

Posted by luoc <lu...@apache.org>.

Best practices. Schema-free in Drill.

> 在 2021年8月20日，12:04，Charles Givre <cg...@gmail.com> 写道：
> 
> Hi Rafael, 
> If you're asking what I think you're asking, it sounds as if you'd like to query multiple files in a nested directory.  If that's the case, I have some good news...
> Drill allows you to query entire directories as if they were one big file.  Effectively Drill performs a UNION on those files, so the end result is that they appear to be one big table. 
> Thus, with the structure you provided, you could do the following:
> 
> SELECT ...
> FROM dfs.`<path>/study1/data1`
> 
> That would roll up all the files under that directory path.  Now, there are some tricks that you should be aware of.  The first are implicit columns.  These can help you figure out the directory structure as well as some basic filtering.  There are also some specific functions that are unique to querying directories.  Take a look at the links below for references about the implicit fields as well as the directory functions.  
> 
> https://drill.apache.org/docs/querying-a-file-system-introduction/ <https://drill.apache.org/docs/querying-a-file-system-introduction/>
> https://drill.apache.org/docs/querying-directories/ <https://drill.apache.org/docs/querying-directories/>
> 
> Best,
> -- C
> 
> 
> 
>> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <ra...@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> I have a large dataset of parquet files that are nested within several
>> subdirectories. For example:
>> 
>> study1
>> |----data1
>>   |----2020-01-01
>>       |---0001.parquet
>> |----data2
>> 
>> study2
>> |----dataA
>> |----dataB
>> 
>> Is it possible for Drill to report back the "directories" as "tables"? For
>> example to perform a query and return something that tells me the directory
>> structure?
>> 
>> I've read something about creating workspaces, but to do so for each of the
>> directories seems onerous, and also requires going into the storage plugin
>> configuration.
>> 
>> The alternative would be to implement some logic and traverse the file
>> system, outside of Drill, and then use that information to drive the
>> "tables" for the queries. Although, that seems unintuitive provided Drill's
>> ability to traverse the file system, infer schema, create cache, and so on.
>> 
>> Thanks,
>> Rafael
>

Re: querying nested parquet directory structure

Posted by Charles Givre <cg...@gmail.com>.

Hi Rafael, 
If you're asking what I think you're asking, it sounds as if you'd like to query multiple files in a nested directory.  If that's the case, I have some good news...
Drill allows you to query entire directories as if they were one big file.  Effectively Drill performs a UNION on those files, so the end result is that they appear to be one big table. 
Thus, with the structure you provided, you could do the following:

SELECT ...
FROM dfs.`<path>/study1/data1`

That would roll up all the files under that directory path.  Now, there are some tricks that you should be aware of.  The first are implicit columns.  These can help you figure out the directory structure as well as some basic filtering.  There are also some specific functions that are unique to querying directories.  Take a look at the links below for references about the implicit fields as well as the directory functions.  

https://drill.apache.org/docs/querying-a-file-system-introduction/ <https://drill.apache.org/docs/querying-a-file-system-introduction/>
https://drill.apache.org/docs/querying-directories/ <https://drill.apache.org/docs/querying-directories/>

Best,
-- C

> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <ra...@gmail.com> wrote:
> 
> Hi all,
> 
> I have a large dataset of parquet files that are nested within several
> subdirectories. For example:
> 
> study1
> |----data1
>    |----2020-01-01
>        |---0001.parquet
> |----data2
> 
> study2
> |----dataA
> |----dataB
> 
> Is it possible for Drill to report back the "directories" as "tables"? For
> example to perform a query and return something that tells me the directory
> structure?
> 
> I've read something about creating workspaces, but to do so for each of the
> directories seems onerous, and also requires going into the storage plugin
> configuration.
> 
> The alternative would be to implement some logic and traverse the file
> system, outside of Drill, and then use that information to drive the
> "tables" for the queries. Although, that seems unintuitive provided Drill's
> ability to traverse the file system, infer schema, create cache, and so on.
> 
> Thanks,
> Rafael