Posted to dev@parquet.apache.org by Karthikeyan Muthukumar <mk...@gmail.com> on 2015/04/08 03:13:03 UTC

Writing directly to Parquet without Avro/Thrift/ProtoBuf

Hi,
In my MapReduce program, I have my model defined in Avro and have been
using the AvroParquet Input/Output format classes to serialize Parquet
files with the Avro model. I have faced no issues with that.
I'm being told that using an Avro model and writing to Parquet is
inefficient and that writing directly to Parquet is a better option.
I have two questions:
1) What are the advantages, if any, of writing directly to Parquet and not
through Avro?
2) The majority of the material on the web about Parquet is about writing to
Parquet using one of the available WriteSupport implementations, such as Avro.
Are there any examples/pointers to code for writing/reading Parquet files directly?

PS: My data model is not very complex. It has a bunch of primitives, some
Maps (String -> Number) and Lists (of Strings). No multi-level nested
structures.

Thanks & Regards
MK
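
For reference, the Avro-based setup described above boils down to a few lines of
MapReduce driver configuration. This is an untested sketch; it assumes the
pre-rename parquet.avro package names in use at the time and a hypothetical
Avro-generated class called MyRecord:

import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetOutputFormat;

// Hypothetical driver fragment ("conf" is the job's Configuration). The reducer
// emits MyRecord values; AvroParquetOutputFormat translates the Avro schema
// into a Parquet schema and writes Parquet files.
Job job = Job.getInstance(conf, "avro-to-parquet");
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyRecord.getClassSchema());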

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Honestly, though, don't bother unless you have actually identified the Avro layer
as a significant bottleneck. This is going to take some work and may be fun /
educational, but I worry about hearsay-based perf optimizations...

On Thursday, April 16, 2015, Karthikeyan Muthukumar <mk...@gmail.com>
wrote:

> Thanks Jacques and Ryan for the insights!
> Im going to try something based on RecordConsumer model.
> MK
>
> On Thu, Apr 9, 2015 at 12:57 PM, Ryan Blue <blue@cloudera.com
> <javascript:;>> wrote:
>
> > Excellent point about unions at too high of a level, which I never
> thought
> > about. The best practice is definitely to add the new column with a
> default
> > instead of versioning the entire record! I wonder if there is something
> we
> > can do about that.
> >
> > rb
> >
> > On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
> >
> >> I agree with what Ryan said.  In terms of effort of implementation,
> using
> >> the existing object models are great.
> >>
> >> However, as you try to tune your application,  you may find suboptimal
> >> transformation patterns to the physical format.  This is always a
> possible
> >> risk when working through an abstraction.  The example I've seen
> >> previously
> >> is that people might create a union at a level higher than is necessary.
> >> For example, imagine
> >>
> >> old: {
> >>    first:string
> >>    last:string
> >> }
> >>
> >> new: {
> >>    first:string
> >>    last:string
> >>    twitter_handle:string
> >> }
> >>
> >> People are inclined to union (old,new).  Last I checked, the default
> Avro
> >> behavior in this situation would be to create five columns: old_first,
> >> old_last, new_first, and new_last (names are actually nested as
> group0.x,
> >> group1.x or something similar).  Depending on what is being done, this
> can
> >> be suboptimal as a logical query of "select table.first from table" now
> >> has
> >> to read two columns, manage two possibly different encoding schemes,
> etc.
> >> This will be even more impactful as we implement things like indices in
> >> the
> >> physical layer.
> >>
> >> In short, if you are using an abstraction, be aware that the physical
> >> layout may not be as optimal as it would have been if you had hand-tuned
> >> the schema with your particular application in mind.  The flip-side is
> you
> >> save time and aggravation in implementation.
> >>
> >> Make sense?
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Karthikeyan Muthukumar <mk...@gmail.com>.
Thanks Jacques and Ryan for the insights!
I'm going to try something based on the RecordConsumer model.
MK

On Thu, Apr 9, 2015 at 12:57 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Excellent point about unions at too high of a level, which I never thought
> about. The best practice is definitely to add the new column with a default
> instead of versioning the entire record! I wonder if there is something we
> can do about that.
>
> rb
>
> On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
>
>> I agree with what Ryan said.  In terms of effort of implementation, using
>> the existing object models are great.
>>
>> However, as you try to tune your application,  you may find suboptimal
>> transformation patterns to the physical format.  This is always a possible
>> risk when working through an abstraction.  The example I've seen
>> previously
>> is that people might create a union at a level higher than is necessary.
>> For example, imagine
>>
>> old: {
>>    first:string
>>    last:string
>> }
>>
>> new: {
>>    first:string
>>    last:string
>>    twitter_handle:string
>> }
>>
>> People are inclined to union (old,new).  Last I checked, the default Avro
>> behavior in this situation would be to create five columns: old_first,
>> old_last, new_first, and new_last (names are actually nested as group0.x,
>> group1.x or something similar).  Depending on what is being done, this can
>> be suboptimal as a logical query of "select table.first from table" now
>> has
>> to read two columns, manage two possibly different encoding schemes, etc.
>> This will be even more impactful as we implement things like indices in
>> the
>> physical layer.
>>
>> In short, if you are using an abstraction, be aware that the physical
>> layout may not be as optimal as it would have been if you had hand-tuned
>> the schema with your particular application in mind.  The flip-side is you
>> save time and aggravation in implementation.
>>
>> Make sense?
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Ryan Blue <bl...@cloudera.com>.
Excellent point about unions at too high of a level, which I never 
thought about. The best practice is definitely to add the new column 
with a default instead of versioning the entire record! I wonder if 
there is something we can do about that.

rb
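
To make the suggested alternative concrete: instead of a union of an "old" and a
"new" record, the evolved schema keeps a single record type and adds the field
with a default, so files written with the old schema still read cleanly. A sketch
of what that looks like in Avro (illustrative only):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": ["null", "string"], "default": null}
  ]
}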

On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
> I agree with what Ryan said.  In terms of effort of implementation, using
> the existing object models are great.
>
> However, as you try to tune your application,  you may find suboptimal
> transformation patterns to the physical format.  This is always a possible
> risk when working through an abstraction.  The example I've seen previously
> is that people might create a union at a level higher than is necessary.
> For example, imagine
>
> old: {
>    first:string
>    last:string
> }
>
> new: {
>    first:string
>    last:string
>    twitter_handle:string
> }
>
> People are inclined to union (old,new).  Last I checked, the default Avro
> behavior in this situation would be to create five columns: old_first,
> old_last, new_first, and new_last (names are actually nested as group0.x,
> group1.x or something similar).  Depending on what is being done, this can
> be suboptimal as a logical query of "select table.first from table" now has
> to read two columns, manage two possibly different encoding schemes, etc.
> This will be even more impactful as we implement things like indices in the
> physical layer.
>
> In short, if you are using an abstraction, be aware that the physical
> layout may not be as optimal as it would have been if you had hand-tuned
> the schema with your particular application in mind.  The flip-side is you
> save time and aggravation in implementation.
>
> Make sense?


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Ryan Blue <bl...@cloudera.com>.
MK,

Here's a link to the Avro reader. It sounds like you're familiar with 
Avro, so that might be the easiest to read.

 
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroIndexedRecordConverter.java

What is your use case? Are you building something with its own tailored 
data layer on top of Parquet? We're always interested in hearing about 
projects that need their own data model. We can at least help you out 
with the tricky parts, like higher-level type representations.

rb
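
Related to question 2 in the original post: the quickest way to read a Parquet
file back without any Avro/Thrift dependency is the bundled "example" object
model. An untested sketch, assuming the pre-rename parquet.* package names and
an arbitrary example path:

import org.apache.hadoop.fs.Path;
import parquet.example.data.Group;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.example.GroupReadSupport;

// Reads records back as generic Groups; no generated classes involved.
ParquetReader<Group> reader =
    new ParquetReader<Group>(new Path("employees.parquet"), new GroupReadSupport());
Group g;
while ((g = reader.read()) != null) {
  System.out.println(g);  // prints field names and values, one record at a time
}
reader.close();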

On 04/09/2015 09:47 AM, Karthikeyan Muthukumar wrote:
> Thanks Jacques & Ryan.
> Can any of you please point/provide some code snippets for writing to
> Parquet with own object model(NOT using Avro etc)?
>
> I don't have any complex unions etc, my data model is very simple with a
> bunch of primitives, and a few Arrays and Maps (of Strings -> Numbers).
> Looking through existing code in Drill or Hive, requires getting into
> context of those technologies, which often takes much more time than whats
> actually needed.
> I would greatly appreciate a small java code snippet for writing a simple
> json like this to Parquet (without Avro etc):
> {"Name": "Ram", "Age": 30, "Departments": ["Sales", "Marketing"],
> "Ratings": {"Dec": 100, "Nov": 50, "Oct": 200}}
>
> Thanks a lot!
> MK
>
>
> On Wed, Apr 8, 2015 at 9:03 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
>> I agree with what Ryan said.  In terms of effort of implementation, using
>> the existing object models are great.
>>
>> However, as you try to tune your application,  you may find suboptimal
>> transformation patterns to the physical format.  This is always a possible
>> risk when working through an abstraction.  The example I've seen previously
>> is that people might create a union at a level higher than is necessary.
>> For example, imagine
>>
>> old: {
>>    first:string
>>    last:string
>> }
>>
>> new: {
>>    first:string
>>    last:string
>>    twitter_handle:string
>> }
>>
>> People are inclined to union (old,new).  Last I checked, the default Avro
>> behavior in this situation would be to create five columns: old_first,
>> old_last, new_first, and new_last (names are actually nested as group0.x,
>> group1.x or something similar).  Depending on what is being done, this can
>> be suboptimal as a logical query of "select table.first from table" now has
>> to read two columns, manage two possibly different encoding schemes, etc.
>> This will be even more impactful as we implement things like indices in the
>> physical layer.
>>
>> In short, if you are using an abstraction, be aware that the physical
>> layout may not be as optimal as it would have been if you had hand-tuned
>> the schema with your particular application in mind.  The flip-side is you
>> save time and aggravation in implementation.
>>
>> Make sense?
>>
>>
>> On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <bl...@cloudera.com> wrote:
>>
>>> On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
>>>
>>>> Thanks Jacques and Alex.
>>>> I have been successfully using Avro model to write to Parquet files and
>>>> found that quite logical, because Avro is quite rich.
>>>> Are there any functional or performance impacts of using Avro model
>> based
>>>> Parquet files, specifically w.r.t accessing the generated Parquet files
>>>> through other tools like Drill, SparkSQL etc?
>>>> Thanks & Regards
>>>> MK
>>>>
>>>
>>> Hi MK,
>>>
>>> If Avro is the data model you're interested in using in your application,
>>> then parquet-avro is a good choice.
>>>
>>> For an application, it is perfectly reasonable to use Avro objects. There
>>> are a few reasons for this:
>>> 1. You have existing code based on the Avro format and object model
>>> 2. You want to use Avro-generated classes (avro-specific)
>>> 3. You want to use your own Java classes via reflection (avro-reflect)
>>> 4. You want compatibility with both storage formats
>>>
>>> Similarly, you could use parquet-thrift if you preferred using Thrift
>>> objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
>>>
>>> The only reason you would want to build your own object model is if you
>>> are doing a translation step later. For example, Hive can translate Avro
>>> objects to the form it expects, but instead we implemented a Hive object
>>> model to go directly from Parquet to Hive's representation. That's faster
>>> and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
>>> and others have their own data models.
>>>
>>> rb
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Karthikeyan Muthukumar <mk...@gmail.com>.
Thanks Jacques & Ryan.
Can either of you please point to or provide some code snippets for writing to
Parquet with your own object model (NOT using Avro etc.)?

I don't have any complex unions; my data model is very simple, with a
bunch of primitives and a few Arrays and Maps (of Strings -> Numbers).
Looking through existing code in Drill or Hive requires getting into the
context of those technologies, which often takes much more time than what's
actually needed.
I would greatly appreciate a small Java code snippet for writing a simple
JSON record like this to Parquet (without Avro etc.):
{"Name": "Ram", "Age": 30, "Departments": ["Sales", "Marketing"],
"Ratings": {"Dec": 100, "Nov": 50, "Oct": 200}}

Thanks a lot!
MK
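
For reference, an untested sketch of what that direct write could look like with a
hand-rolled WriteSupport, which is the RecordConsumer-based approach discussed in
this thread. It assumes the pre-rename parquet.* package names from that era; the
Employee class, field names, and output path are made up for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

// Plain value class standing in for the JSON record above (made up for illustration).
class Employee {
  String name;                   // "Ram"
  int age;                       // 30
  List<String> departments;      // ["Sales", "Marketing"]
  Map<String, Integer> ratings;  // {"Dec": 100, "Nov": 50, "Oct": 200}
}

class EmployeeWriteSupport extends WriteSupport<Employee> {
  // The Parquet schema is declared directly; no Avro/Thrift schema is involved.
  private final MessageType schema = MessageTypeParser.parseMessageType(
      "message Employee {"
    + "  required binary Name (UTF8);"
    + "  required int32 Age;"
    + "  repeated binary Departments (UTF8);"
    + "  optional group Ratings (MAP) {"
    + "    repeated group map (MAP_KEY_VALUE) {"
    + "      required binary key (UTF8);"
    + "      required int32 value;"
    + "    }"
    + "  }"
    + "}");

  private RecordConsumer consumer;

  @Override
  public WriteContext init(Configuration conf) {
    return new WriteContext(schema, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.consumer = recordConsumer;
  }

  @Override
  public void write(Employee e) {
    consumer.startMessage();
    consumer.startField("Name", 0);
    consumer.addBinary(Binary.fromString(e.name));
    consumer.endField("Name", 0);
    consumer.startField("Age", 1);
    consumer.addInteger(e.age);
    consumer.endField("Age", 1);
    // Repeated field: start it once, then add one value per list element.
    if (e.departments != null && !e.departments.isEmpty()) {
      consumer.startField("Departments", 2);
      for (String d : e.departments) {
        consumer.addBinary(Binary.fromString(d));
      }
      consumer.endField("Departments", 2);
    }
    // Map: an outer group, then one repeated key/value group per entry.
    if (e.ratings != null && !e.ratings.isEmpty()) {
      consumer.startField("Ratings", 3);
      consumer.startGroup();
      consumer.startField("map", 0);
      for (Map.Entry<String, Integer> r : e.ratings.entrySet()) {
        consumer.startGroup();
        consumer.startField("key", 0);
        consumer.addBinary(Binary.fromString(r.getKey()));
        consumer.endField("key", 0);
        consumer.startField("value", 1);
        consumer.addInteger(r.getValue());
        consumer.endField("value", 1);
        consumer.endGroup();
      }
      consumer.endField("map", 0);
      consumer.endGroup();
      consumer.endField("Ratings", 3);
    }
    consumer.endMessage();
  }
}

Wiring it up is then just a ParquetWriter over the custom WriteSupport (employee
being an instance of the class above):

  ParquetWriter<Employee> writer =
      new ParquetWriter<Employee>(new Path("employees.parquet"), new EmployeeWriteSupport());
  writer.write(employee);
  writer.close();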


On Wed, Apr 8, 2015 at 9:03 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I agree with what Ryan said.  In terms of effort of implementation, using
> the existing object models are great.
>
> However, as you try to tune your application,  you may find suboptimal
> transformation patterns to the physical format.  This is always a possible
> risk when working through an abstraction.  The example I've seen previously
> is that people might create a union at a level higher than is necessary.
> For example, imagine
>
> old: {
>   first:string
>   last:string
> }
>
> new: {
>   first:string
>   last:string
>   twitter_handle:string
> }
>
> People are inclined to union (old,new).  Last I checked, the default Avro
> behavior in this situation would be to create five columns: old_first,
> old_last, new_first, and new_last (names are actually nested as group0.x,
> group1.x or something similar).  Depending on what is being done, this can
> be suboptimal as a logical query of "select table.first from table" now has
> to read two columns, manage two possibly different encoding schemes, etc.
> This will be even more impactful as we implement things like indices in the
> physical layer.
>
> In short, if you are using an abstraction, be aware that the physical
> layout may not be as optimal as it would have been if you had hand-tuned
> the schema with your particular application in mind.  The flip-side is you
> save time and aggravation in implementation.
>
> Make sense?
>
>
> On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <bl...@cloudera.com> wrote:
>
> > On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
> >
> >> Thanks Jacques and Alex.
> >> I have been successfully using Avro model to write to Parquet files and
> >> found that quite logical, because Avro is quite rich.
> >> Are there any functional or performance impacts of using Avro model
> based
> >> Parquet files, specifically w.r.t accessing the generated Parquet files
> >> through other tools like Drill, SparkSQL etc?
> >> Thanks & Regards
> >> MK
> >>
> >
> > Hi MK,
> >
> > If Avro is the data model you're interested in using in your application,
> > then parquet-avro is a good choice.
> >
> > For an application, it is perfectly reasonable to use Avro objects. There
> > are a few reasons for this:
> > 1. You have existing code based on the Avro format and object model
> > 2. You want to use Avro-generated classes (avro-specific)
> > 3. You want to use your own Java classes via reflection (avro-reflect)
> > 4. You want compatibility with both storage formats
> >
> > Similarly, you could use parquet-thrift if you preferred using Thrift
> > objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
> >
> > The only reason you would want to build your own object model is if you
> > are doing a translation step later. For example, Hive can translate Avro
> > objects to the form it expects, but instead we implemented a Hive object
> > model to go directly from Parquet to Hive's representation. That's faster
> > and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
> > and others have their own data models.
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Jacques Nadeau <ja...@apache.org>.
I agree with what Ryan said.  In terms of implementation effort, using
the existing object models is a great option.

However, as you try to tune your application, you may find suboptimal
patterns in the translation to the physical format.  This is always a possible
risk when working through an abstraction.  The example I've seen previously
is that people might create a union at a higher level than is necessary.
For example, imagine

old: {
  first:string
  last:string
}

new: {
  first:string
  last:string
  twitter_handle:string
}

People are inclined to union (old, new).  Last I checked, the default Avro
behavior in this situation would be to create five columns: old_first,
old_last, new_first, new_last, and new_twitter_handle (the names are actually
nested as group0.x, group1.x or something similar).  Depending on what is being
done, this can be suboptimal, as a logical query of "select table.first from
table" now has to read two columns, manage two possibly different encoding
schemes, etc.
This will be even more impactful as we implement things like indices in the
physical layer.

In short, if you are using an abstraction, be aware that the physical
layout may not be as optimal as it would have been if you had hand-tuned
the schema with your particular application in mind.  The flip-side is you
save time and aggravation in implementation.

Make sense?
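
To make the cost concrete, the resulting Parquet schema for union(old, new) ends up
looking roughly like the sketch below (branch names are illustrative; as noted above,
the exact nesting depends on the writer). "first" exists as two separate columns, so a
query for it has to read and reconcile both:

message union_of_old_and_new {
  optional group member0 {
    required binary first (UTF8);
    required binary last (UTF8);
  }
  optional group member1 {
    required binary first (UTF8);
    required binary last (UTF8);
    required binary twitter_handle (UTF8);
  }
}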


On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <bl...@cloudera.com> wrote:

> On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
>
>> Thanks Jacques and Alex.
>> I have been successfully using Avro model to write to Parquet files and
>> found that quite logical, because Avro is quite rich.
>> Are there any functional or performance impacts of using Avro model based
>> Parquet files, specifically w.r.t accessing the generated Parquet files
>> through other tools like Drill, SparkSQL etc?
>> Thanks & Regards
>> MK
>>
>
> Hi MK,
>
> If Avro is the data model you're interested in using in your application,
> then parquet-avro is a good choice.
>
> For an application, it is perfectly reasonable to use Avro objects. There
> are a few reasons for this:
> 1. You have existing code based on the Avro format and object model
> 2. You want to use Avro-generated classes (avro-specific)
> 3. You want to use your own Java classes via reflection (avro-reflect)
> 4. You want compatibility with both storage formats
>
> Similarly, you could use parquet-thrift if you preferred using Thrift
> objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
>
> The only reason you would want to build your own object model is if you
> are doing a translation step later. For example, Hive can translate Avro
> objects to the form it expects, but instead we implemented a Hive object
> model to go directly from Parquet to Hive's representation. That's faster
> and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
> and others have their own data models.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Ryan Blue <bl...@cloudera.com>.
On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
> Thanks Jacques and Alex.
> I have been successfully using Avro model to write to Parquet files and
> found that quite logical, because Avro is quite rich.
> Are there any functional or performance impacts of using Avro model based
> Parquet files, specifically w.r.t accessing the generated Parquet files
> through other tools like Drill, SparkSQL etc?
> Thanks & Regards
> MK

Hi MK,

If Avro is the data model you're interested in using in your 
application, then parquet-avro is a good choice.

For an application, it is perfectly reasonable to use Avro objects. 
There are a few reasons for this:
1. You have existing code based on the Avro format and object model
2. You want to use Avro-generated classes (avro-specific)
3. You want to use your own Java classes via reflection (avro-reflect)
4. You want compatibility with both storage formats

Similarly, you could use parquet-thrift if you preferred using Thrift 
objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)

The only reason you would want to build your own object model is if you 
are doing a translation step later. For example, Hive can translate Avro 
objects to the form it expects, but instead we implemented a Hive object 
model to go directly from Parquet to Hive's representation. That's 
faster and doesn't require copying the data. This is why Drill, 
SparkSQL, Hive, and others have their own data models.

rb

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.
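
Outside of MapReduce, staying with parquet-avro is also only a few lines. An
untested sketch with the pre-rename package names and a made-up schema and path:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

// Build a record against an ad-hoc Avro schema and hand it to the writer;
// parquet-avro converts the Avro schema to a Parquet schema internally.
Schema schema = new Schema.Parser().parse(
    "{\"type\": \"record\", \"name\": \"Person\", \"fields\": ["
  + "{\"name\": \"Name\", \"type\": \"string\"}, {\"name\": \"Age\", \"type\": \"int\"}]}");

GenericRecord rec = new GenericData.Record(schema);
rec.put("Name", "Ram");
rec.put("Age", 30);

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(new Path("person.parquet"), schema);
writer.write(rec);
writer.close();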

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Karthikeyan Muthukumar <mk...@gmail.com>.
Thanks Jacques and Alex.
I have been successfully using the Avro model to write to Parquet files and
found it quite logical, because Avro is quite rich.
Are there any functional or performance impacts of using Avro-model-based
Parquet files, specifically w.r.t. accessing the generated Parquet files
through other tools like Drill, SparkSQL, etc.?
Thanks & Regards
MK


On Tue, Apr 7, 2015 at 9:30 PM, Jacques Nadeau <ja...@apache.org> wrote:

> You can write to Parquet using the RecordConsumer model.  It's lower level
> so not everyone will have appetite for it but it can be more efficient
> depending on your particular application.
>
> On Tue, Apr 7, 2015 at 6:22 PM, Alex Levenson <
> alexlevenson@twitter.com.invalid> wrote:
>
> > You have to write to parquet through *some* object model. Whether it's
> > thrift, avro, or plain java objects, you need some way to represent a
> > schema. While using plain java objects might seem more direct, the plain
> > java object support is done via reflection, so using avro makes more
> sense
> > when you've already got an avro schema.
> >
> > Does that make sense?
> >
> > On Tue, Apr 7, 2015 at 6:13 PM, Karthikeyan Muthukumar <
> > mkarthikswamy@gmail.com> wrote:
> >
> > > Hi,
> > > In my mapreduce program, I have my model defined in Avro and have been
> > > using the AvroParquet Input/Output format classes to serialize Parquet
> > > files with Avro model. I have faced no issues with that.
> > > I'm being told that using a Avro model and writing to Parquet is
> > > in-efficient and writing directly to Parquet is a better option.
> > > I have two questions:
> > > 1) What are the advantages, if any, of writing directly to Parquet and
> > not
> > > through Avro?
> > > 2) Majority of the material on the web about Parquet is about writing
> to
> > > Parquet using one of the available WriteSupport like Avro. Are there
> any
> > > examples/pointers to code related to writing/reading direct Parquet
> > files.
> > >
> > > PS: My data model is not very complex. It has a bunch of primitives,
> some
> > > Maps (String -> Number) and Lists (of Strings). No multi-level nested
> > > structures.
> > >
> > > Thanks & Regards
> > > MK
> > >
> >
> >
> >
> > --
> > Alex Levenson
> > @THISWILLWORK
> >
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Jacques Nadeau <ja...@apache.org>.
You can write to Parquet using the RecordConsumer model.  It's lower level,
so not everyone will have the appetite for it, but it can be more efficient
depending on your particular application.

On Tue, Apr 7, 2015 at 6:22 PM, Alex Levenson <
alexlevenson@twitter.com.invalid> wrote:

> You have to write to parquet through *some* object model. Whether it's
> thrift, avro, or plain java objects, you need some way to represent a
> schema. While using plain java objects might seem more direct, the plain
> java object support is done via reflection, so using avro makes more sense
> when you've already got an avro schema.
>
> Does that make sense?
>
> On Tue, Apr 7, 2015 at 6:13 PM, Karthikeyan Muthukumar <
> mkarthikswamy@gmail.com> wrote:
>
> > Hi,
> > In my mapreduce program, I have my model defined in Avro and have been
> > using the AvroParquet Input/Output format classes to serialize Parquet
> > files with Avro model. I have faced no issues with that.
> > I'm being told that using a Avro model and writing to Parquet is
> > in-efficient and writing directly to Parquet is a better option.
> > I have two questions:
> > 1) What are the advantages, if any, of writing directly to Parquet and
> not
> > through Avro?
> > 2) Majority of the material on the web about Parquet is about writing to
> > Parquet using one of the available WriteSupport like Avro. Are there any
> > examples/pointers to code related to writing/reading direct Parquet
> files.
> >
> > PS: My data model is not very complex. It has a bunch of primitives, some
> > Maps (String -> Number) and Lists (of Strings). No multi-level nested
> > structures.
> >
> > Thanks & Regards
> > MK
> >
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>

Re: Writing directly to Parquet without Avro/Thrift/ProtoBuf

Posted by Alex Levenson <al...@twitter.com.INVALID>.
You have to write to Parquet through *some* object model. Whether it's
Thrift, Avro, or plain Java objects, you need some way to represent a
schema. While using plain Java objects might seem more direct, the plain
Java object support is done via reflection, so using Avro makes more sense
when you've already got an Avro schema.

Does that make sense?
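
On the "plain Java objects via reflection" point: that path still goes through Avro,
which can derive a schema from a POJO with avro-reflect. A small illustrative sketch
(Person here is any hypothetical plain Java class):

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// Derive an Avro schema from a plain Java class by reflection.
Schema derived = ReflectData.get().getSchema(Person.class);
System.out.println(derived.toString(true));  // pretty-print the inferred schema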

On Tue, Apr 7, 2015 at 6:13 PM, Karthikeyan Muthukumar <
mkarthikswamy@gmail.com> wrote:

> Hi,
> In my mapreduce program, I have my model defined in Avro and have been
> using the AvroParquet Input/Output format classes to serialize Parquet
> files with Avro model. I have faced no issues with that.
> I'm being told that using a Avro model and writing to Parquet is
> in-efficient and writing directly to Parquet is a better option.
> I have two questions:
> 1) What are the advantages, if any, of writing directly to Parquet and not
> through Avro?
> 2) Majority of the material on the web about Parquet is about writing to
> Parquet using one of the available WriteSupport like Avro. Are there any
> examples/pointers to code related to writing/reading direct Parquet files.
>
> PS: My data model is not very complex. It has a bunch of primitives, some
> Maps (String -> Number) and Lists (of Strings). No multi-level nested
> structures.
>
> Thanks & Regards
> MK
>



-- 
Alex Levenson
@THISWILLWORK