Posted to dev@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2016/02/25 21:29:30 UTC

Avro support in Drill - Missing support for the IN operator and other frustrating things

Hi,

This query targets Avro files in the latest 1.5 release:

0: jdbc:drill:zk=local> select count(*) from
dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to =
'Customer/4-2492847';
+---------+
| EXPR$0  |
+---------+
| 5788    |
+---------+

0: jdbc:drill:zk=local> select count(*) from
dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to IN
('Customer/4-2492847');
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+

This shows that the IN operator does not work with Avro (it works with Parquet).
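
A possible stopgap, assuming equality predicates keep returning correct results as in the first query above, is to expand the IN list into OR'd equality comparisons before the query is sent. The helper below is only a hypothetical sketch: the function name and the quoting rule are assumptions, it has not been verified against the Avro reader, and it is not the workaround posted on the JIRA issue.

```python
def expand_in_predicate(column, values):
    """Build "(col = 'a' OR col = 'b')" from a column name and string values.

    Hypothetical helper: it only rewrites the predicate text and does not
    talk to Drill.
    """
    if not values:
        raise ValueError("need at least one value")
    # Double any single quotes so each value stays a valid SQL string literal.
    literals = ["'" + v.replace("'", "''") + "'" for v in values]
    return "(" + " OR ".join(f"{column} = {lit}" for lit in literals) + ")"


# Rebuild the failing query from this thread with equality instead of IN.
where = expand_in_predicate("s.sold_to", ["Customer/4-2492847"])
query = (
    "select count(*) from dfs.asa.`/streaming/venuepoint/transactions/` as s "
    "where " + where
)
print(query)
```

With a single value this produces the same predicate as the first query, which returned 5788 rows; whether a longer OR chain behaves correctly on Avro would still need to be checked.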

This finally tips us over. We have invested hundreds of hours moving all
streaming/fresh data from JSON to Avro but the Avro part of Drill is broken
in too many ways to recommend its use to anyone.

Attempts to report Avro errors and shortcomings, like the missing support
for dirX, have had no results.

I think it would be prudent to warn people on the Drill website that the
Avro support is experimental, at best.

- Stefán Baxter

Re: Avro support in Drill - Missing support for the IN operator and other frustrating things

Posted by Jason Altekruse <al...@gmail.com>.
Hey Stefan,

It is possible that this is the case. A quick look at the code seems to
indicate that the Avro reader does not override the default behavior for
estimating the approximate row count of files. I believe there is still a
small issue with the code that handles tiny files. Are the files you are
dealing with at least a few megabytes?

Can you see how many minor fragments are listed under the scan operation in
the query profile? If there are multiple fragments then the scan is
parallelized.

- Jason

On Mon, Feb 29, 2016 at 1:58 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi Jason,
>
> Is it possible that the Avro plugin does not use any parallelism and that
> all the target files are scanned sequentially by the same process? (1.5)
>
> - Stefán

Re: Avro support in Drill - Missing support for the IN operator and other frustrating things

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Jason,

Is it possible that the Avro plugin does not use any parallelism and that
all the target files are scanned sequentially by the same process? (This is
on 1.5.)

- Stefán

Re: Avro support in Drill - Missing support for the IN operator and other frustrating things

Posted by Stefán Baxter <st...@activitystream.com>.
Thank you Jason.

I do realize that this is an open-source project and that everyone is doing
their best.

There are just a few things I wish I had realized before switching over
from JSON to Avro that have caused us a lot of problems and taken a long
time.

Your work is appreciated and I apologize for letting my frustration get the
better of me.

- Stefán

Re: Avro support in Drill - Missing support for the IN operator and other frustrating things

Posted by Jason Altekruse <al...@gmail.com>.
Stefan,

I'm sorry that we have not been better about getting back to the issues you
have filed against the Avro reader. We do appreciate all of the effort you
have put into filing thorough bugs and being active in the discussions on
the list. I have responded on the bug you filed on this issue [1] with a
workaround and will be posting a patch shortly with a fix.

- Jason

[1] - https://issues.apache.org/jira/browse/DRILL-4441
