Posted to dev@hive.apache.org by Ashutosh Chauhan <as...@gmail.com> on 2010/05/21 21:22:37 UTC

column names from Object Inspector in serialize() method of custom serde

Hi,

I am writing a custom serde to write data to an external table.
In the serialize() method of my serde I am handed an object and an
ObjectInspector. Since this object represents a row, I assume that the
ObjectInspector is a StructObjectInspector, and I use it to get the
fields out of the struct. When I call field.getFieldName() I expect it
to give me the real column name from my table schema in the metastore.
Instead, I get names like _col1, _col2, _col3 ..

The workaround is to store the column names in a list in the
initialize() method and then use that list to get the names in
serialize(). This is what I am doing now and it works. It seems the
HBase serde does something similar. But it was counterintuitive to me
that getFieldName() returns made-up names rather than the real column
names. If this is not the expected behavior, then I am probably doing
something wrong in my serde, and I would appreciate it if someone could
confirm that. But if this is how things are currently implemented, then
I think it's a bug and I will open a JIRA for it.
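
To make the workaround above concrete, here is a minimal sketch of the idea. It is a stand-in only: the class ColumnNameCachingSketch is hypothetical and does not implement Hive's SerDe interface, a plain List stands in for the row object, and the property key "columns" is assumed to be the value behind Constants.LIST_COLUMNS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a custom serde; it does not implement Hive's SerDe interface.
class ColumnNameCachingSketch {
    private List<String> columnNames = new ArrayList<>();

    // Plays the role of initialize(): read the table's column names once and cache them.
    // Assumption: "columns" is the property key behind Constants.LIST_COLUMNS.
    void initialize(Properties tableProperties) {
        String columns = tableProperties.getProperty("columns", "");
        columnNames = columns.isEmpty()
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(columns.split(",")));
    }

    // Plays the role of serialize(): pair the cached names with row values by position,
    // ignoring whatever field names the incoming row carries.
    String serialize(List<Object> rowValues) {
        if (rowValues.size() != columnNames.size()) {
            throw new IllegalArgumentException("row has " + rowValues.size()
                    + " fields but the table has " + columnNames.size() + " columns");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < rowValues.size(); i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(columnNames.get(i)).append('=').append(rowValues.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("columns", "id,name,city");
        ColumnNameCachingSketch serde = new ColumnNameCachingSketch();
        serde.initialize(props);
        // prints "id=1,name=alice,city=paris"
        System.out.println(serde.serialize(Arrays.<Object>asList(1, "alice", "paris")));
    }
}
```

The names are matched to values purely by position, so the sketch fails fast when the field count and the column count disagree.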

Thanks,
Ashutosh

PS: I am posting this on the dev list, but if folks think it's more
appropriate for the user list, feel free to move it there when
replying.

Re: column names from Object Inspector in serialize() method of custom serde

Posted by Ashutosh Chauhan <as...@gmail.com>.
Thanks John and Arvind. Your explanations make sense. These are early
days for me in getting used to the "Hive way" of doing things in the
world of serdes, storage handlers, meta-hooks, object inspectors, etc. :)

Ashutosh

On Thu, May 27, 2010 at 10:56, Arvind Prabhakar <ar...@cloudera.com> wrote:
> John, Ashutosh,
>
> I agree with John's evaluation on this. Consider the case of writing to a
> partition of a table. Clearly, the columns being written to will not be the
> same as what are defined in the metadata for the entire table. Moreover,
> there are cases where intermediate tables (files) may be produced during a
> particular operation which are not defined by the user. In such cases you
> are dealing with either a subset of columns of a table or columns of an
> intermediate transient table. And since Struct OIs insist on having names
> for fields, it follows that to cover the general case we can use any unique
> names where necessary.
>
> The actual data pipeline underneath the Hive query is already semantically
> verified to fit the appropriate type definitions and hence adding the column
> names would not add any value to the runtime. It will add to the overall
> processing overhead.
>
> Arvind
>
> On Wed, May 26, 2010 at 6:29 PM, John Sichi <js...@facebook.com> wrote:
>
>> Hey Ashutosh,
>>
>> You're right, currently the target table column names come in via
>> initialize in the Properties parameter, e.g.
>> props.getProperty(Constants.LIST_COLUMNS), whereas the object inspector gets
>> _col1, _col2, _col3.  (And of course, if you have a custom mapping string
>> like HBase, then that comes in through the initialize Properties parameter
>> via your own private property name.)
>>
>> I haven't looked into the details of why this is, but probably the object
>> inspector references an internally produced row from whatever was upstream
>> (rather than being derived from the target table itself, although the number
>> of columns has to match).  I'm not sure this is a bug per se, just something
>> to be aware of.  In general, you should try to precompute any data
>> structures needed during initialize so that serialize can be as lean as
>> possible, meaning you probably don't want to be looking at the field names
>> in there anyway.
>>
>> Opinions from other hive devs?
>>
>> JVS

Re: column names from Object Inspector in serialize() method of custom serde

Posted by Arvind Prabhakar <ar...@cloudera.com>.
John, Ashutosh,

I agree with John's evaluation on this. Consider the case of writing to a
partition of a table. Clearly, the columns being written will not be the
same as those defined in the metadata for the entire table. Moreover,
there are cases where intermediate tables (files) that are not defined by
the user may be produced during a particular operation. In such cases you
are dealing with either a subset of the columns of a table or the columns
of an intermediate transient table. And since Struct OIs insist on having
names for fields, it follows that to cover the general case we can use any
unique names where necessary.

The actual data pipeline underneath the Hive query has already been
semantically verified to fit the appropriate type definitions, so carrying
the column names along would not add any value at runtime. It would only
add to the overall processing overhead.
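
As a rough illustration of the point about placeholder names, the sketch below generates unique field names for a row of a given width and lines them up with the target table's columns by position. Everything here is hypothetical: the class and method names are made up for this example, and the zero-based "_col" numbering is an assumption, not a statement of how Hive generates its internal names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Hive.
class InternalNameMappingSketch {

    // Produce unique placeholder names for a row with the given number of fields.
    // Assumption: names are "_col" followed by a zero-based index.
    static List<String> placeholderNames(int fieldCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            names.add("_col" + i);
        }
        return names;
    }

    // Line up placeholder names with table columns by position.
    // Only the number of fields has to match; the names themselves carry no meaning.
    static Map<String, String> mapByPosition(List<String> internalNames,
                                             List<String> tableColumns) {
        if (internalNames.size() != tableColumns.size()) {
            throw new IllegalArgumentException("field count " + internalNames.size()
                    + " does not match column count " + tableColumns.size());
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < internalNames.size(); i++) {
            mapping.put(internalNames.get(i), tableColumns.get(i));
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> tableColumns = List.of("id", "name");
        System.out.println(mapByPosition(placeholderNames(tableColumns.size()), tableColumns));
    }
}
```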

Arvind


Re: column names from Object Inspector in serialize() method of custom serde

Posted by John Sichi <js...@facebook.com>.
Hey Ashutosh,

You're right, currently the target table column names come in via initialize in the Properties parameter, e.g. props.getProperty(Constants.LIST_COLUMNS), whereas the object inspector gets _col1, _col2, _col3.  (And of course, if you have a custom mapping string like HBase, then that comes in through the initialize Properties parameter via your own private property name.)

I haven't looked into the details of why this is, but probably the object inspector references an internally produced row from whatever was upstream (rather than being derived from the target table itself, although the number of columns has to match).  I'm not sure this is a bug per se, just something to be aware of.  In general, you should try to precompute any data structures needed during initialize so that serialize can be as lean as possible, meaning you probably don't want to be looking at the field names in there anyway.
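
A sketch of what "precompute during initialize" could look like for a serde with its own mapping string: the mapping is parsed and encoded once, and serialize only copies bytes. This is illustrative only. The class PrecomputedMappingSketch, the property name "example.columns.mapping", and the key=value output format are all hypothetical and do not come from Hive or the HBase handler.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a serde with a private mapping property; not Hive's SerDe interface.
class PrecomputedMappingSketch {
    // Hypothetical private property name, chosen for this example only.
    static final String MAPPING_PROPERTY = "example.columns.mapping";

    private byte[][] encodedTargets = new byte[0][];

    // Parse and encode the mapping once, so serialize() does no string parsing.
    void initialize(Properties props) {
        String mapping = props.getProperty(MAPPING_PROPERTY, "");
        String[] targets = mapping.isEmpty() ? new String[0] : mapping.split(",");
        encodedTargets = new byte[targets.length][];
        for (int i = 0; i < targets.length; i++) {
            encodedTargets[i] = targets[i].trim().getBytes(StandardCharsets.UTF_8);
        }
    }

    // Write "target=value" lines using the pre-encoded targets, matched by position.
    byte[] serialize(List<?> rowValues) {
        if (rowValues.size() != encodedTargets.length) {
            throw new IllegalArgumentException("row has " + rowValues.size()
                    + " fields but the mapping has " + encodedTargets.length + " entries");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encodedTargets.length; i++) {
            byte[] value = String.valueOf(rowValues.get(i)).getBytes(StandardCharsets.UTF_8);
            out.write(encodedTargets[i], 0, encodedTargets[i].length);
            out.write('=');
            out.write(value, 0, value.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(MAPPING_PROPERTY, "cf:a, cf:b");
        PrecomputedMappingSketch serde = new PrecomputedMappingSketch();
        serde.initialize(props);
        byte[] bytes = serde.serialize(Arrays.asList(1, "x"));
        System.out.print(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

With the per-column work done up front, the per-row path never needs to consult field names at all.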

Opinions from other hive devs?

JVS


Re: column names from Object Inspector in serialize() method of custom serde

Posted by Ashutosh Chauhan <as...@gmail.com>.
No one replied on dev list. Reposting on user list. If I am unclear,
let me know and I will provide more details of my use case.

Ashutosh
