Posted to user@hive.apache.org by Tim Robertson <ti...@gmail.com> on 2011/02/21 05:48:04 UTC

UDFRowSequence called in Map() ?

Hi all,

I am using UDFRowSequence as follows:

CREATE TEMPORARY FUNCTION rowSequence AS
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
SET mapred.reduce.tasks=1;
CREATE TABLE temp_tc1_test
as
SELECT
  rowSequence() AS id,
  data_resource_id,
  local_id,
  local_parent_id,
  name,
  author
FROM normalized;

I see 2 jobs, the first of which runs with 2 map() and 0 reduce()
tasks on my small test data.  I believe rowSequence() is being called
in the map and not the reduce, as the results have duplicate IDs:

select * from temp_tc1_test where id=8915;
8915	167	1148	1113	Cytospora elaeagni	Allesch.
8915	168	7	6	Achromadora inflata	Abebe & Coomans, 1996

Is there any way to ensure the UDF is called in the reduce?

Thanks,
Tim

Re: UDFRowSequence called in Map() ?

Posted by John Sichi <js...@fb.com>.
There's no explicit way to enforce it, but in practice you can get it to work by invoking the UDF in an outer select, typically with an ORDER BY or SORT BY on the inner select, as in this example:

http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad#Prepare_Range_Partitioning
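
Applied to the query from the original question, the outer-select approach might look roughly like the sketch below. The table and column names are taken from the question. The choice of ORDER BY columns and the subquery alias t are assumptions made for illustration. ORDER BY in Hive produces a total order through a single reducer, so the outer select, and with it rowSequence(), would normally be evaluated in that one reduce task. This relies on how the plan happens to be built rather than on any guarantee.

```sql
CREATE TEMPORARY FUNCTION rowSequence AS
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- Only strictly needed if SORT BY is used in the inner select;
-- ORDER BY already funnels all rows through one reducer.
SET mapred.reduce.tasks=1;

CREATE TABLE temp_tc1_test
AS
SELECT
  rowSequence() AS id,   -- outer select: expected to run on the reduce side
  t.data_resource_id,
  t.local_id,
  t.local_parent_id,
  t.name,
  t.author
FROM (
  -- Inner select forces a shuffle; the sort columns are an assumption.
  SELECT data_resource_id, local_id, local_parent_id, name, author
  FROM normalized
  ORDER BY data_resource_id, local_id
) t;
```

A quick check afterwards is to compare SELECT COUNT(*) with SELECT COUNT(DISTINCT id) on temp_tc1_test; the two should match if no IDs were duplicated.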

Note also this semi-related JIRA issue, which I'm currently working on:

https://issues.apache.org/jira/browse/HIVE-1994

JVS
