Posted to user@hive.apache.org by Josh Ferguson <jo...@besquared.net> on 2009/03/01 10:14:56 UTC

Malformed Rows

It's currently possible to load data whose rows have fewer columns
than the table's schema specifies. The end result is that when you go
to SELECT from the table, you get array-out-of-bounds exceptions
during deserialization of the rows. There's currently no way to
delete rows from tables, so I'm pretty sure that if you do this your
table is corrupt and unreadable.

I want to suggest that there be an option to catch these
deserialization failures and log them somewhere, rather than letting
them destroy the entire SELECT query.

How are other people dealing with this?

Josh Ferguson
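The padding behavior Josh asks for can be sketched outside Hive. This is a hypothetical illustration, not Hive's actual serde code, assuming the default ^A (\x01) field delimiter:

```python
# Hypothetical sketch, not Hive code: deserialize a delimited text row
# and pad missing trailing columns with None instead of raising an
# IndexError when the row is shorter than the schema.
def deserialize_row(line, num_columns, delim="\x01"):
    fields = line.rstrip("\n").split(delim)
    # Pad short rows so a SELECT never reads past the end of the row.
    if len(fields) < num_columns:
        fields += [None] * (num_columns - len(fields))
    return fields[:num_columns]
```

For example, a two-field row read against a four-column schema comes back as the two values followed by two nulls, and an over-long row is truncated to the schema width.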

Re: Malformed Rows

Posted by Josh Ferguson <jo...@besquared.net>.
I'm fine with setting them to null values. I've been avoiding this
situation by just putting in NULL and NULL:NULL where I'm missing
values/arrays and maps. I noticed Hive uses something like \N for
null values. Either way, adding null values at the end where things
are missing would be fine with me; I just don't want my SELECT
statements to crash and make my tables unreadable.

Josh Ferguson
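Josh's workaround of writing explicit null markers can be illustrated like this; a minimal sketch assuming Hive's default \N null string and ^A (\x01) field delimiter:

```python
# Sketch of the workaround: when writing a text row, emit Hive's
# default null marker \N for any missing trailing values so every row
# carries the full column count.
def serialize_row(values, num_columns, delim="\x01", null="\\N"):
    # Pad the value list to the schema width, then render nulls as \N.
    padded = list(values) + [None] * (num_columns - len(values))
    return delim.join(null if v is None else str(v) for v in padded)
```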

On Mar 1, 2009, at 1:35 AM, Zheng Shao wrote:

> Hi Josh,
>
> Thanks for the quick reply.
>
> I recently added LazySimpleSerDe to handle most of the cases for
> DynamicSerDe/TCTLSeparatedProtocol, to provide better efficiency
> and a much simpler code stack.
>
> Currently LazySimpleSerDe does not support map<> or array<>, and
> that's why you hit DynamicSerDe with your table.
>
> I think the best way to solve the problem would be to extend
> LazySimpleSerDe to handle map and array, so we can deprecate
> DynamicSerDe/TCTLSeparatedProtocol.
>
>
> By the way, another reason that we want to set missing columns at
> the end of the row to NULL instead of reporting an error is that it
> seamlessly supports adding new columns to the metadata of the table
> (e.g. data from newer partitions of a table will contain more
> columns than older partitions).
>
> Zheng
>
> On Sun, Mar 1, 2009 at 1:26 AM, Josh Ferguson <jo...@besquared.net>  
> wrote:
> CREATE TABLE activities
> (occurred_at INT, actor_id STRING, actee_id STRING, properties  
> MAP<STRING, STRING>)
> PARTITIONED BY (account STRING, application STRING, dataset STRING,  
> hour INT)
> CLUSTERED BY (actor_id, actee_id) INTO 32 BUCKETS
> ROW FORMAT DELIMITED
> COLLECTION ITEMS TERMINATED BY '44'
> MAP KEYS TERMINATED BY '58'
> STORED AS TEXTFILE;
>
> On Mar 1, 2009, at 1:24 AM, Zheng Shao wrote:
>
>> Hi Josh,
>>
>> Can you post the stack trace here? I want to know which serde  
>> created this problem.
>>
>> The expected behavior is that missing columns will have a value of  
>> NULL. If it's not like that, then there must be something wrong.
>>
>>
>> Zheng
>>
>> On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net>  
>> wrote:
>> It's currently possible to load data whose rows have fewer columns
>> than the table's schema specifies. The end result is that when you
>> go to SELECT from the table, you get array-out-of-bounds exceptions
>> during deserialization of the rows. There's currently no way to
>> delete rows from tables, so I'm pretty sure that if you do this
>> your table is corrupt and unreadable.
>>
>> I want to suggest that there be an option to catch these
>> deserialization failures and log them somewhere, rather than
>> letting them destroy the entire SELECT query.
>>
>> How are other people dealing with this?
>>
>> Josh Ferguson
>>
>>
>>
>> -- 
>> Yours,
>> Zheng
>
>
>
>
> -- 
> Yours,
> Zheng


RE: Malformed Rows

Posted by Joydeep Sen Sarma <js...@facebook.com>.
There's also a JIRA open to ignore exceptions from the execution engine (up to a threshold). That would be easy to implement and would help fix this particular scenario as well.
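The threshold idea could look roughly like this; a hypothetical sketch, not the actual JIRA patch:

```python
# Hypothetical sketch of "ignore up to a threshold of exceptions":
# skip malformed rows during a scan, but re-raise once too many fail,
# so a handful of bad rows can't kill the whole SELECT while a badly
# corrupted file still surfaces as an error.
def scan(rows, deserialize, max_errors=100):
    errors = 0
    for raw in rows:
        try:
            yield deserialize(raw)
        except Exception:
            errors += 1
            if errors > max_errors:
                raise  # threshold crossed: surface the failure
            # below threshold: a real implementation would log the row here
```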

________________________________
From: Zheng Shao [mailto:zshao9@gmail.com]
Sent: Sunday, March 01, 2009 1:35 AM
To: hive-user@hadoop.apache.org
Subject: Re: Malformed Rows

Hi Josh,

Thanks for the quick reply.

I recently added LazySimpleSerDe to handle most of the cases for DynamicSerDe/TCTLSeparatedProtocol, to provide better efficiency and a much simpler code stack.

Currently LazySimpleSerDe does not support map<> or array<>, and that's why you hit DynamicSerDe with your table.

I think the best way to solve the problem would be to extend LazySimpleSerDe to handle map and array, so we can deprecate DynamicSerDe/TCTLSeparatedProtocol.


By the way, another reason that we want to set missing columns at the end of the row to NULL instead of reporting an error is that it seamlessly supports adding new columns to the metadata of the table (e.g. data from newer partitions of a table will contain more columns than older partitions).

Zheng
On Sun, Mar 1, 2009 at 1:26 AM, Josh Ferguson <jo...@besquared.net> wrote:
CREATE TABLE activities
(occurred_at INT, actor_id STRING, actee_id STRING, properties MAP<STRING, STRING>)
PARTITIONED BY (account STRING, application STRING, dataset STRING, hour INT)
CLUSTERED BY (actor_id, actee_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '44'
MAP KEYS TERMINATED BY '58'
STORED AS TEXTFILE;

On Mar 1, 2009, at 1:24 AM, Zheng Shao wrote:

Hi Josh,

Can you post the stack trace here? I want to know which serde created this problem.

The expected behavior is that missing columns will have a value of NULL. If it's not like that, then there must be something wrong.


Zheng
On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net> wrote:
It's currently possible to load data whose rows have fewer columns than the table's schema specifies. The end result is that when you go to SELECT from the table, you get array-out-of-bounds exceptions during deserialization of the rows. There's currently no way to delete rows from tables, so I'm pretty sure that if you do this your table is corrupt and unreadable.

I want to suggest that there be an option to catch these deserialization failures and log them somewhere, rather than letting them destroy the entire SELECT query.

How are other people dealing with this?

Josh Ferguson



--
Yours,
Zheng




--
Yours,
Zheng

Re: Malformed Rows

Posted by Zheng Shao <zs...@gmail.com>.
Hi Josh,

Thanks for the quick reply.

I recently added LazySimpleSerDe to handle most of the cases for
DynamicSerDe/TCTLSeparatedProtocol, to provide better efficiency and a
much simpler code stack.

Currently LazySimpleSerDe does not support map<> or array<>, and
that's why you hit DynamicSerDe with your table.

I think the best way to solve the problem would be to extend
LazySimpleSerDe to handle map and array, so we can deprecate
DynamicSerDe/TCTLSeparatedProtocol.


By the way, another reason that we want to set missing columns at the
end of the row to NULL instead of reporting an error is that it
seamlessly supports adding new columns to the metadata of the table
(e.g. data from newer partitions of a table will contain more columns
than older partitions).

Zheng
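The schema-evolution point above can be made concrete with a toy stand-in for the serde; the fourth column name here is a made-up example, not from the thread:

```python
# Toy stand-in for the serde, illustrating why NULL-padding enables
# adding columns: old partitions were written with three columns, and
# the table schema later grew a fourth ("country" is a hypothetical
# example). Reading old data under the new schema simply yields NULL
# (None) for the column the old rows never had.
NEW_SCHEMA = ["occurred_at", "actor_id", "actee_id", "country"]

def read_row(line, schema, delim="\x01"):
    fields = line.split(delim)
    # NULL-pad so old, narrower rows still match the new schema.
    fields += [None] * (len(schema) - len(fields))
    return dict(zip(schema, fields))
```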

On Sun, Mar 1, 2009 at 1:26 AM, Josh Ferguson <jo...@besquared.net> wrote:

> CREATE TABLE activities
> (occurred_at INT, actor_id STRING, actee_id STRING, properties MAP<STRING,
> STRING>)
> PARTITIONED BY (account STRING, application STRING, dataset STRING, hour
> INT)
> CLUSTERED BY (actor_id, actee_id) INTO 32 BUCKETS
> ROW FORMAT DELIMITED
> COLLECTION ITEMS TERMINATED BY '44'
> MAP KEYS TERMINATED BY '58'
> STORED AS TEXTFILE;
>
> On Mar 1, 2009, at 1:24 AM, Zheng Shao wrote:
>
> Hi Josh,
>
> Can you post the stack trace here? I want to know which serde created this
> problem.
>
> The expected behavior is that missing columns will have a value of NULL. If
> it's not like that, then there must be something wrong.
>
>
> Zheng
>
> On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net> wrote:
>
>> It's currently possible to load data whose rows have fewer columns
>> than the table's schema specifies. The end result is that when you go
>> to SELECT from the table, you get array-out-of-bounds exceptions
>> during deserialization of the rows. There's currently no way to
>> delete rows from tables, so I'm pretty sure that if you do this your
>> table is corrupt and unreadable.
>>
>> I want to suggest that there be an option to catch these
>> deserialization failures and log them somewhere, rather than letting
>> them destroy the entire SELECT query.
>>
>> How are other people dealing with this?
>>
>> Josh Ferguson
>>
>
>
>
> --
> Yours,
> Zheng
>
>
>


-- 
Yours,
Zheng

Re: Malformed Rows

Posted by Josh Ferguson <jo...@besquared.net>.
CREATE TABLE activities
(occurred_at INT, actor_id STRING, actee_id STRING, properties  
MAP<STRING, STRING>)
PARTITIONED BY (account STRING, application STRING, dataset STRING,  
hour INT)
CLUSTERED BY (actor_id, actee_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '44'
MAP KEYS TERMINATED BY '58'
STORED AS TEXTFILE;
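For reference, a toy parser for one text row of the table above, assuming Hive's default ^A (\x01) field delimiter plus the ',' (ASCII 44) collection and ':' (ASCII 58) map-key delimiters from the DDL; this is an illustration, not Hive's serde:

```python
# Toy parser (not Hive's serde) for a row of the activities table:
# fields split on ^A, map entries on ',' (ASCII 44), and map keys from
# values on ':' (ASCII 58). The length check before touching the map
# field is what a tolerant deserializer needs to avoid indexing past
# the end of a short row.
def parse_activity(line):
    fields = line.split("\x01")
    occurred_at, actor_id, actee_id = fields[0], fields[1], fields[2]
    props = {}
    if len(fields) > 3 and fields[3]:
        for entry in fields[3].split(","):    # collection items delimiter
            key, value = entry.split(":", 1)  # map key delimiter
            props[key] = value
    return int(occurred_at), actor_id, actee_id, props
```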

On Mar 1, 2009, at 1:24 AM, Zheng Shao wrote:

> Hi Josh,
>
> Can you post the stack trace here? I want to know which serde  
> created this problem.
>
> The expected behavior is that missing columns will have a value of  
> NULL. If it's not like that, then there must be something wrong.
>
>
> Zheng
>
> On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net>  
> wrote:
> It's currently possible to load data whose rows have fewer columns
> than the table's schema specifies. The end result is that when you go
> to SELECT from the table, you get array-out-of-bounds exceptions
> during deserialization of the rows. There's currently no way to
> delete rows from tables, so I'm pretty sure that if you do this your
> table is corrupt and unreadable.
>
> I want to suggest that there be an option to catch these
> deserialization failures and log them somewhere, rather than letting
> them destroy the entire SELECT query.
>
> How are other people dealing with this?
>
> Josh Ferguson
>
>
>
> -- 
> Yours,
> Zheng


Re: Malformed Rows

Posted by Josh Ferguson <jo...@besquared.net>.
java.lang.ArrayIndexOutOfBoundsException: 3
	at org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol.readMapBegin(TCTLSeparatedProtocol.java:608)
	at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeTypeMap.deserialize(DynamicSerDeTypeMap.java:88)
	at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeTypeMap.deserialize(DynamicSerDeTypeMap.java:35)
	at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.deserialize(DynamicSerDeFieldList.java:194)
	at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.deserialize(DynamicSerDeStructBase.java:59)
	at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.deserialize(DynamicSerDe.java:126)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:300)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:266)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:207)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:305)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
	at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Failed with exception java.lang.ArrayIndexOutOfBoundsException: 3

On Mar 1, 2009, at 1:24 AM, Zheng Shao wrote:

> Hi Josh,
>
> Can you post the stack trace here? I want to know which serde  
> created this problem.
>
> The expected behavior is that missing columns will have a value of  
> NULL. If it's not like that, then there must be something wrong.
>
>
> Zheng
>
> On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net>  
> wrote:
> It's currently possible to load data whose rows have fewer columns
> than the table's schema specifies. The end result is that when you go
> to SELECT from the table, you get array-out-of-bounds exceptions
> during deserialization of the rows. There's currently no way to
> delete rows from tables, so I'm pretty sure that if you do this your
> table is corrupt and unreadable.
>
> I want to suggest that there be an option to catch these
> deserialization failures and log them somewhere, rather than letting
> them destroy the entire SELECT query.
>
> How are other people dealing with this?
>
> Josh Ferguson
>
>
>
> -- 
> Yours,
> Zheng


Re: Malformed Rows

Posted by Zheng Shao <zs...@gmail.com>.
Hi Josh,

Can you post the stack trace here? I want to know which serde created this
problem.

The expected behavior is that missing columns will have a value of NULL. If
it's not like that, then there must be something wrong.


Zheng

On Sun, Mar 1, 2009 at 1:14 AM, Josh Ferguson <jo...@besquared.net> wrote:

> It's currently possible to load data whose rows have fewer columns
> than the table's schema specifies. The end result is that when you go
> to SELECT from the table, you get array-out-of-bounds exceptions
> during deserialization of the rows. There's currently no way to
> delete rows from tables, so I'm pretty sure that if you do this your
> table is corrupt and unreadable.
>
> I want to suggest that there be an option to catch these
> deserialization failures and log them somewhere, rather than letting
> them destroy the entire SELECT query.
>
> How are other people dealing with this?
>
> Josh Ferguson
>



-- 
Yours,
Zheng