Posted to dev@parquet.apache.org by Patrick Woody <pa...@gmail.com> on 2016/10/04 15:33:08 UTC

StreamBytesInput with DictionaryFilter

Hey all,

I'm running a parquet-mr build off of master and seeing some interesting
behavior when using a DictionaryFilter to prune row groups. Basically, if I
have an And or Or filter, the DictionaryPage object gets re-used. This seems
to be a problem for StreamBytesInput because the stream is exhausted
after the first toByteArray call. My current workaround is to synchronize
and re-use the byte array after the first read, but I'd be curious
what people think the best approach to solving this is, and whether we should
be reusing the BytesInput at all.
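
A minimal sketch of the failure mode and of the caching workaround described
above, using only java.io. OneShotBytes and CachedBytes are hypothetical
stand-ins, not the parquet-mr classes, and the real StreamBytesInput may fail
differently on a second read (for example by throwing rather than returning
an empty array).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a stream-backed BytesInput (assumption, not parquet-mr code).
class OneShotBytes {
    private final InputStream in;

    OneShotBytes(InputStream in) {
        this.in = in;
    }

    // Drains the underlying stream. A second call finds the stream
    // already at end-of-file and returns an empty array.
    byte[] toByteArray() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical wrapper for the workaround: read once under a lock, then reuse the array.
class CachedBytes {
    private final OneShotBytes source;
    private byte[] cached;

    CachedBytes(OneShotBytes source) {
        this.source = source;
    }

    synchronized byte[] toByteArray() {
        if (cached == null) {
            cached = source.toByteArray();
        }
        return cached;
    }
}

class StreamReuseDemo {
    public static void main(String[] args) {
        byte[] dict = {1, 2, 3, 4};

        // Two reads of the same stream-backed object, as happens when both
        // sides of an And/Or filter look at the same dictionary page.
        OneShotBytes raw = new OneShotBytes(new ByteArrayInputStream(dict));
        System.out.println("first read:  " + raw.toByteArray().length + " bytes");
        System.out.println("second read: " + raw.toByteArray().length + " bytes");

        // With caching, both reads see the full content.
        CachedBytes cached = new CachedBytes(new OneShotBytes(new ByteArrayInputStream(dict)));
        System.out.println("cached first:  " + cached.toByteArray().length + " bytes");
        System.out.println("cached second: " + cached.toByteArray().length + " bytes");
    }
}
```

The cached array is shared between callers, so this only holds up if nobody
mutates it; handing out a copy per call would be the safer variant.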

Best,
Patrick

Re: StreamBytesInput with DictionaryFilter

Posted by Patrick Woody <pa...@gmail.com>.

Hi Ryan,

Apologies for the delay! I've filed it here
https://issues.apache.org/jira/browse/PARQUET-743 with the information from
the thread.

Thanks
Patrick

On Wed, Oct 5, 2016 at 4:39 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Patrick,
>
> Can you please open an issue for this? I think we should fix this before
> the 1.9.0 release. Thanks!
>
> rb

Re: StreamBytesInput with DictionaryFilter

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Patrick,

Can you please open an issue for this? I think we should fix this before
the 1.9.0 release. Thanks!

rb

On Tue, Oct 4, 2016 at 11:58 AM, Patrick Woody <pa...@gmail.com>
wrote:

> Looking a bit more - it looks like this is because decompression converts
> to a StreamBytesInput automatically. The current tests run with the
> uncompressed codec, so it doesn't hit this issue. I've put up a commit here
> that demonstrates the issue and my current workaround:
> https://github.com/palantir/parquet-mr/pull/10/commits/70cc00cba5c294d4c860bd4cd2c48c2d083a5809
>
> Thanks,
> Patrick



-- 
Ryan Blue
Software Engineer
Netflix

Re: StreamBytesInput with DictionaryFilter

Posted by Patrick Woody <pa...@gmail.com>.

Looking a bit more, it looks like this is because decompression converts
the bytes to a StreamBytesInput automatically. The current tests run with the
uncompressed codec, so they don't hit this issue. I've put up a commit
that demonstrates the issue and my current workaround:
https://github.com/palantir/parquet-mr/pull/10/commits/70cc00cba5c294d4c860bd4cd2c48c2d083a5809
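
A rough illustration of why only the compressed path would be affected,
assuming the uncompressed path hands back bytes held in an array while the
decompression path hands back a live stream. GZIP from java.util.zip is used
here purely as a stand-in codec, and CodecPathDemo with its helpers is
hypothetical, not parquet-mr code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CodecPathDemo {
    // Compresses data with GZIP (stand-in for any real page codec).
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompression path: the result is a stream that can be consumed only once.
    static InputStream decompressingStream(byte[] compressed) {
        try {
            return new GZIPInputStream(new ByteArrayInputStream(compressed));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads everything that is left in the stream.
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] page = "dictionary page bytes".getBytes(StandardCharsets.UTF_8);

        // Uncompressed path: the bytes live in an array, so every reader
        // can open a fresh view and see the full content.
        System.out.println("array, read 1: " + drain(new ByteArrayInputStream(page)).length);
        System.out.println("array, read 2: " + drain(new ByteArrayInputStream(page)).length);

        // Compressed path: one decompressing stream shared by two readers.
        InputStream shared = decompressingStream(gzip(page));
        System.out.println("stream, read 1: " + drain(shared).length);
        System.out.println("stream, read 2: " + drain(shared).length);
    }
}
```

A test that only exercises the array-backed path would pass even though the
second reader on the stream-backed path gets nothing.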

Thanks,
Patrick
