Posted to user@hive.apache.org by Ritesh Agrawal <ra...@netflix.com> on 2013/07/30 00:19:49 UTC

UDAF terminatePartial structure

Hi all,

I am writing my first UDAF. In my terminatePartial() function, I need to store several pieces of data with different types. Below is the list of items I need to store:
1. C1: list of doubles
2. C2: list of doubles
3. C3: double
4. Show: list of strings


I am wondering whether I can use a simple HashMap and store these different objects in it. Will it be serialized automatically, or will I need to write my own serialization method? Also, is there an example of a UDAF that shows how to use a map-type structure for storing partial results?

Thanks

Ritesh

Re: UDAF terminatePartial structure

Posted by Robin Morris <rd...@baynote.com>.
There are limitations on what can be passed between terminatePartial() and merge(). I'm not sure that you can pass Java arrays (i.e. your double[] c1) through all the Hive reflection gubbins; [D in your stack trace is the JVM's name for double[], and the NoSuchMethodException means the reflection machinery was looking for a no-arg constructor on it. Try using ArrayList<>s instead, but be warned: you need to make explicit deep copies of anything passed in to merge().
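
Roughly like this (an untested sketch covering two of your fields):

import java.util.ArrayList;

// ArrayList-based partial result instead of raw Java arrays.
public static class PartialResult {
	ArrayList<Double> c1;               // instead of double[]
	ArrayList<ArrayList<Double>> c2;    // instead of double[][]
}

// Deep-copy helper for merge(): copy the incoming lists rather than
// keeping references to them, per the warning above.
public static PartialResult deepCopy(PartialResult other) {
	PartialResult copy = new PartialResult();
	copy.c1 = new ArrayList<Double>(other.c1);
	copy.c2 = new ArrayList<ArrayList<Double>>();
	for (ArrayList<Double> row : other.c2) {
		copy.c2.add(new ArrayList<Double>(row));
	}
	return copy;
}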

Robin




Re: UDAF terminatePartial structure

Posted by Ritesh Agrawal <ra...@netflix.com>.
Hi Robin, Igor,

Thanks for the suggestions and links. Based on the examples I found, my UDAF is below. However, I am getting the following error when trying to run it, and I am not sure what the error means:

============= ERROR ====================
FAILED: Hive Internal Error: java.lang.RuntimeException(java.lang.NoSuchMethodException: [D.<init>())
java.lang.RuntimeException: java.lang.NoSuchMethodException: [D.<init>()
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
        at org.apache.hadoop.hive.serde2.objectinspector.ReflectionStructObjectInspector.create(ReflectionStructObjectInspector.java:170)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:225)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:127)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:221)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:127)


============= UDAF CODE ==================
package com.netflix.hive.udaf;

import java.io.IOException;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

@Description(
		name = "MFFoldIn",
		value = "_FUNC_(expr, nb) - Computes latent features for a given item/user based user/item vectors",
		extended = "Example:\n"
)

public class MFFoldIn extends UDAF {
	
	public static class MFFoldInEvaluator implements UDAFEvaluator{
		public static class PartialResult{
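			// Partial aggregate passed from terminatePartial() to merge() between tasks.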
			double[] c1;
			double[][] c2;
			double[][] c3;
			double wm;
			double lambda;
			int itemCount;
			double[][] varco;
			Set<Long> observedShows;
			
			public int getDimensionsCount() throws Exception{
				if(c1 != null) return c1.length;
				throw new Exception("Unknown dimension count");
			}
		}
		
		private UserVecBuilder builder;
		
		public void init() {
			builder = null;
		}
		
		public boolean iterate(DoubleWritable wm, DoubleWritable lambda,
				IntWritable itemCount, String itemSquaredFile, 
				DoubleWritable weight, List<Double> lf,
				Long item) throws IOException{
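			// Called once per input row: unbox the latent-factor list into a
			// primitive array and accumulate it in the builder.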
			
			double[] lflist = new double[lf.size()];
			for(int i=0; i<lf.size(); i++)
				lflist[i] = lf.get(i).doubleValue();
			
			if(builder == null) builder = new UserVecBuilder();
			
			if(!builder.isReady()){
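				// First row for this group: capture the scalar parameters and
				// read the item covariance matrices from the given file.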
				builder.setW_m(wm.get());
				builder.setLambda(lambda.get());
				builder.setItemRowCount(itemCount.get());
				builder.readItemCovarianceMatFiles(itemSquaredFile, lflist.length);				
			}

				
			builder.add(item, lflist, weight.get());
			
			return true;
			
		}
		
		public PartialResult terminatePartial(){
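			// Package the builder's current state as the partial aggregate for this group.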
			PartialResult partial = new PartialResult();
			partial.c1 = builder.getComponent1();
			partial.c2 = builder.getComponent2();
			partial.c3 = builder.getComponent3();
			partial.wm = builder.getW_m();
			partial.lambda = builder.getLambda();
			partial.observedShows = builder.getObservedShows();
			partial.itemCount = builder.getItemRowCount();
			partial.varco = builder.getVarCovar();
			return partial;
		}
		
		public boolean merge(PartialResult other){
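			// Fold a partial aggregate from another task into this builder; a null partial is a no-op.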
			if(other == null) return true;
			if(builder == null) builder = new UserVecBuilder();
			
			if(!builder.isReady()){
				builder.setW_m(other.wm);
				builder.setLambda(other.lambda);
				builder.setItemRowCount(other.itemCount);
				builder.setItemCovarianceMat(other.varco);
				builder.setComponent1(other.c1);
				builder.setComponent2(other.c2);
				builder.setComponent3(other.c3);
				builder.setObservedShows(other.observedShows);
			}else{
				builder.merge(other.c1, other.c2, other.c3, other.observedShows);
			}
			return true;
		}
		
		public double[] terminate(){
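			// Produce the final user vector, or null if no rows were seen.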
			if(builder == null) return null;
			return builder.build();
		}
		
	}

}




Re: UDAF terminatePartial structure

Posted by Igor Tatarinov <ig...@decide.com>.
I found this Cloudera example helpful:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-contrib/0.7.0-cdh3u0/org/apache/hadoop/hive/contrib/udaf/example/UDAFExampleMaxMinNUtil.java#UDAFExampleMaxMinNUtil.Evaluator

igor
decide.com




Re: UDAF terminatePartial structure

Posted by Ritesh Agrawal <ra...@netflix.com>.
Hi Robin,

Thanks for the suggestion. I did find such an example in the Hadoop: The Definitive Guide book. However, I am now totally confused.

The book extends UDAF instead of AbstractGenericUDAFResolver. Which one is recommended?

Also, the example in the book uses DoubleWritable as the return type for the terminate() function. However, I will be returning an ArrayList of doubles. Do I always need to return objects that are derived from Writable?

Ritesh


Re: UDAF terminatePartial structure

Posted by Robin Morris <rd...@baynote.com>.
I believe a map will be passed correctly from terminatePartial() to merge(), but it seems a bit of overkill.

Why not define a class within your UDAF which has 4 public data members,
and return instances of that class from terminatePartial()?
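
For instance (an untested sketch; the names just mirror your four items):

import java.util.List;

// Four public members matching the items you listed.
public static class PartialResult {
	public List<Double> c1;
	public List<Double> c2;
	public double c3;
	public List<String> show;
}

terminatePartial() would fill one of these in and return it, and merge() would take a PartialResult as its argument.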

Robin

