Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/10/17 12:19:06 UTC

directory structure containing multiple file types

Hi,

I have a single directory structure containing both .avro and .json files.
Their content is the same and they use the same schema (Avro files
explicitly and JSON files implicitly).

When I query the directory, Drill returns an error informing me that the
Avro files cannot be read as JSON files.

I assumed that the file extension would dictate the reader, but some other
rules seem to apply.

Can someone tell me if I need to do something special to make this work, or
if this is a known limitation?

Regards,
 -Stefan

Re: directory structure containing multiple file types

Posted by Aman Sinha <am...@apache.org>.
With regard to the last comment on directory-based pruning, please watch
DRILL-3759 (https://issues.apache.org/jira/browse/DRILL-3759). I don't
have a timeline for it yet, but hopefully it will be in the next Drill
release.

Aman

Re: directory structure containing multiple file types

Posted by Dhruv Gohil <yo...@gmail.com>.
"What's needed in Drill to truly eliminate ETL" +1 but in another thread ;-)
We have a few 'hacks' we want to share there: our workarounds for various Drill limitations on multi-directory queries (99% of our workload), e.g. avoiding empty-directory failures, building queries with directory pruning that 'works', etc.


Re: directory structure containing multiple file types

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Ted,

Your approach only works for a single directory, not a directory structure.

I will create an improvement request later today.

I would welcome a session on "What's needed in Drill to truly eliminate
ETL" (Just an idea)

Regards,
 -Stefan

Re: directory structure containing multiple file types

Posted by Stefán Baxter <st...@activitystream.com>.
Thank you Jacques, I will.

Re: directory structure containing multiple file types

Posted by Jacques Nadeau <ja...@dremio.com>.
Stefan, can you open a JIRA for reading multiple file types in a single
directory? It isn't the most common case we've run across but is definitely
something that should be addressed.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

Re: directory structure containing multiple file types

Posted by Ted Dunning <te...@gmail.com>.
Yes. This is a pain in the butt.

One thing that might work for you is to use a union of different
wildcards. Here is an example where I have a directory with both CSV and
JSON files.

select * from (
    select columns[0] as a, columns[1] as b
      from dfs.tdunning.`foo1/*.csv`
) union (
    select j.a, j.b
      from dfs.tdunning.`foo1/*.json` j
);

Note that each record in a CSV file consists of a single value (called
columns), which is an array. Each record from a JSON file is a structure.
I have to extract these components in order to get data that can be union'ed.
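
Applied to the case from the original question (.avro and .json files in
one directory), the same union-of-wildcards idea might look like the sketch
below. It is untested and rests on assumptions: the workspace dfs.tdunning
and the directory foo1 are reused from the example above, the field names a
and b are hypothetical, the Avro format is assumed to be enabled for the
storage plugin, and both file types are assumed to expose those fields with
compatible types.

```sql
-- Sketch only: workspace, directory and field names (a, b) are assumptions.
-- Each branch uses a wildcard that matches files of a single format,
-- so no single scan has to read mixed file types.
select v.a, v.b
  from dfs.tdunning.`foo1/*.avro` v
union all
select j.a, j.b
  from dfs.tdunning.`foo1/*.json` j;
```

union all keeps duplicate rows, whereas union removes them, so pick
whichever matches the intent. Like the query above, this covers a single
directory only.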

Re: directory structure containing multiple file types

Posted by Stefán Baxter <st...@activitystream.com>.
Thanks Abhishek,

I think Drill is still quite far from eliminating ETL, and the list of
obstacles on the way there seems to be growing. (Yeah, disappointment got me
for a bit.)

Regards,
 -Stefan

Re: directory structure containing multiple file types

Posted by Abhishek Girish <ab...@gmail.com>.
When querying a directory on a file system, Drill expects all files within
it to be of the same format/type. Heterogeneous types aren't supported
afaik.

I've seen a case where Drill would start off querying but would fail later,
and another case where it would fail right away. I think this is a known
limitation: if the first file read is of type JSON, all remaining files
are expected to be of type JSON. I don't think the schema being the same
matters by itself.

-Abhishek
