You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2009/07/14 16:57:47 UTC
Classpath question with testcase and UDF
Hey all,
I am working on a UDF that can be used with prepared statements ad a
technique to go from Hive->SQL. The usage would be something like
from src SELECT
dboutput("jdbc:mysql://localhost:20","user","pass","INSERT INTO A
(col1,col2) VALUES (?,?)", key,value)
--Sniplet--
public class GenericUDFDBOutput extends GenericUDF {
private static Log LOG =
LogFactory.getLog(GenericUDFDBOutput.class.getName());
ObjectInspector[] argumentOIs;
GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
Connection connection = null;
private String url;
private String user;
private String pass;
private IntWritable result = new IntWritable(-1);
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentTypeException {
this.argumentOIs = arguments;
public Object evaluate(DeferredObject[] arguments) throws HiveException {
System.err.println("we are here");
url = arguments[0].get().toString()+";create=true";
url = "jdbc:derby:;databaseName=test_dboutput_db;create=true";
user = arguments[1].get().toString();
pass = arguments[2].get().toString();
try {
try {
Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
} catch (InstantiationException e) {
System.out.println(e);
} catch (IllegalAccessException e) {
System.out.println(e);
} catch (ClassNotFoundException e) {
System.out.println(e);
}
connection = DriverManager.getConnection(url, user, pass);
PreparedStatement ps = connection.prepareStatement(arguments[3].get()
.toString());
for (int i = 4; i < arguments.length; ++i) {
PrimitiveObjectInspector poi = ((PrimitiveObjectInspector)
arguments[i]);
ps.setObject(i - 3, poi.getPrimitiveJavaObject(arguments[i]));
}
ps.execute();
ps.close();
--end snip--
In my test case I am attempting to use the derby embedded driver since
I figured it would probably already be in the classpath.
org.apache.derby.jdbc.EmbeddedDriver. It is not. I took several stabs
and adding it to the ant files but I get the feeling that the
TestCliDriver is not inheriting those paths. Any suggestions?
Thanks
Re: Classpath question with testcase and UDF
Posted by Frederick Oko <fr...@gmail.com>.
DBOutputFormat is transactional per reducer (bulk at reducer close) so
granted to prevent the case of polluting your external DB w/ incomplete
results you would have to use it in queries that would take a single
reducer.
On Jul 16, 2009 7:12 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
Zheng,
Thanks for the tips. I used your suggestions.
https://issues.apache.org/jira/browse/HIVE-645.
Zheng we are only outputting one query per row. The usage is to dump a
table:
people
id name
1 Ed
2 Zheng
from people
select dboutput('jdbc:mysql', 'user','password', 'INSERT INTO people
(id,name) VALUES (?,?)', id, name);
@Frederick. There is no transaction here since we are making a
connection per row. A huge transactional dump would be tricky. Is
DBOutputFormat transactional.
The UDF returns 0 for success 1 for failure so you can do the status like:
from people
select id, dboutput('jdbc:mysql', 'user','password', 'INSERT INTO
people (id,name) VALUES (?,?)', id, name);
1,0
2,0
Edward
On Thu, Jul 16, 2009 at 8:23 PM, Zheng Shao<zs...@gmail.com> wrote: > Hi
Edward, > > We currently d...
Re: Classpath question with testcase and UDF
Posted by Edward Capriolo <ed...@gmail.com>.
Zheng,
Thanks for the tips. I used your suggestions.
https://issues.apache.org/jira/browse/HIVE-645.
Zheng we are only outputting one query per row. The usage is to dump a table:
people
id name
1 Ed
2 Zheng
from people
select dboutput('jdbc:mysql', 'user','password', 'INSERT INTO people
(id,name) VALUES (?,?)', id, name);
@Frederick. There is no transaction here since we are making a
connection per row. A huge transactional dump would be tricky. Is
DBOutputFormat transactional.
The UDF returns 0 for success 1 for failure so you can do the status like:
from people
select id, dboutput('jdbc:mysql', 'user','password', 'INSERT INTO
people (id,name) VALUES (?,?)', id, name);
1,0
2,0
Edward
On Thu, Jul 16, 2009 at 8:23 PM, Zheng Shao<zs...@gmail.com> wrote:
> Hi Edward,
>
> We currently don't allow UDF/GenericUDF to output multiple rows with a
> single call to evaluate(...).
> Is that feature going to block you?
>
>
> Also, if you expect an argument to be a String, we should do:
> 1. Check the type of that argument in the intialize by doing:
> if (!(arguments[i] instanceof StringObjectInspector)) {
> throw new UDFArgumentTypeException(i, "..."));
> }
> 2. in evaluate function
> Use: ((StringObjectInspector)argumentOI[i]).getJavaObject(arguments[i].get())
> instead of arguments[i].get().toString()
>
> The reason for 2 is that the in-memory objects in Hive can have
> different format (which is encapsulated by ObjectInspector).
> This is for both efficiency and ease-of-integration with other systems.
>
> We have to always use the ObjectInspector to access the value of the
> parameters in GenericUDF.
>
>
> If the ObjectInspector stuff seems too complicated, you can try UDF
> which is much simpler.
>
>
> Zheng
>
> On Tue, Jul 14, 2009 at 7:57 AM, Edward Capriolo<ed...@gmail.com> wrote:
>> Hey all,
>>
>> I am working on a UDF that can be used with prepared statements ad a
>> technique to go from Hive->SQL. The usage would be something like
>> from src SELECT
>> dboutput("jdbc:mysql://localhost:20","user","pass","INSERT INTO A
>> (col1,col2) VALUES (?,?)", key,value)
>>
>> --Sniplet--
>> public class GenericUDFDBOutput extends GenericUDF {
>> private static Log LOG =
>> LogFactory.getLog(GenericUDFDBOutput.class.getName());
>>
>> ObjectInspector[] argumentOIs;
>> GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
>> Connection connection = null;
>> private String url;
>> private String user;
>> private String pass;
>> private IntWritable result = new IntWritable(-1);
>>
>> public ObjectInspector initialize(ObjectInspector[] arguments)
>> throws UDFArgumentTypeException {
>>
>> this.argumentOIs = arguments;
>>
>>
>>
>> public Object evaluate(DeferredObject[] arguments) throws HiveException {
>>
>> System.err.println("we are here");
>> url = arguments[0].get().toString()+";create=true";
>> url = "jdbc:derby:;databaseName=test_dboutput_db;create=true";
>> user = arguments[1].get().toString();
>> pass = arguments[2].get().toString();
>> try {
>> try {
>> Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
>> } catch (InstantiationException e) {
>> System.out.println(e);
>> } catch (IllegalAccessException e) {
>> System.out.println(e);
>> } catch (ClassNotFoundException e) {
>> System.out.println(e);
>> }
>> connection = DriverManager.getConnection(url, user, pass);
>> PreparedStatement ps = connection.prepareStatement(arguments[3].get()
>> .toString());
>> for (int i = 4; i < arguments.length; ++i) {
>> PrimitiveObjectInspector poi = ((PrimitiveObjectInspector)
>> arguments[i]);
>> ps.setObject(i - 3, poi.getPrimitiveJavaObject(arguments[i]));
>> }
>> ps.execute();
>> ps.close();
>>
>> --end snip--
>>
>> In my test case I am attempting to use the derby embedded driver since
>> I figured it would probably already be in the classpath.
>> org.apache.derby.jdbc.EmbeddedDriver. It is not. I took several stabs
>> and adding it to the ant files but I get the feeling that the
>> TestCliDriver is not inheriting those paths. Any suggestions?
>>
>> Thanks
>>
>
>
>
> --
> Yours,
> Zheng
>
Re: Classpath question with testcase and UDF
Posted by Zheng Shao <zs...@gmail.com>.
Hi Edward,
We currently don't allow UDF/GenericUDF to output multiple rows with a
single call to evaluate(...).
Is that feature going to block you?
Also, if you expect an argument to be a String, we should do:
1. Check the type of that argument in the intialize by doing:
if (!(arguments[i] instanceof StringObjectInspector)) {
throw new UDFArgumentTypeException(i, "..."));
}
2. in evaluate function
Use: ((StringObjectInspector)argumentOI[i]).getJavaObject(arguments[i].get())
instead of arguments[i].get().toString()
The reason for 2 is that the in-memory objects in Hive can have
different format (which is encapsulated by ObjectInspector).
This is for both efficiency and ease-of-integration with other systems.
We have to always use the ObjectInspector to access the value of the
parameters in GenericUDF.
If the ObjectInspector stuff seems too complicated, you can try UDF
which is much simpler.
Zheng
On Tue, Jul 14, 2009 at 7:57 AM, Edward Capriolo<ed...@gmail.com> wrote:
> Hey all,
>
> I am working on a UDF that can be used with prepared statements ad a
> technique to go from Hive->SQL. The usage would be something like
> from src SELECT
> dboutput("jdbc:mysql://localhost:20","user","pass","INSERT INTO A
> (col1,col2) VALUES (?,?)", key,value)
>
> --Sniplet--
> public class GenericUDFDBOutput extends GenericUDF {
> private static Log LOG =
> LogFactory.getLog(GenericUDFDBOutput.class.getName());
>
> ObjectInspector[] argumentOIs;
> GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
> Connection connection = null;
> private String url;
> private String user;
> private String pass;
> private IntWritable result = new IntWritable(-1);
>
> public ObjectInspector initialize(ObjectInspector[] arguments)
> throws UDFArgumentTypeException {
>
> this.argumentOIs = arguments;
>
>
>
> public Object evaluate(DeferredObject[] arguments) throws HiveException {
>
> System.err.println("we are here");
> url = arguments[0].get().toString()+";create=true";
> url = "jdbc:derby:;databaseName=test_dboutput_db;create=true";
> user = arguments[1].get().toString();
> pass = arguments[2].get().toString();
> try {
> try {
> Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
> } catch (InstantiationException e) {
> System.out.println(e);
> } catch (IllegalAccessException e) {
> System.out.println(e);
> } catch (ClassNotFoundException e) {
> System.out.println(e);
> }
> connection = DriverManager.getConnection(url, user, pass);
> PreparedStatement ps = connection.prepareStatement(arguments[3].get()
> .toString());
> for (int i = 4; i < arguments.length; ++i) {
> PrimitiveObjectInspector poi = ((PrimitiveObjectInspector)
> arguments[i]);
> ps.setObject(i - 3, poi.getPrimitiveJavaObject(arguments[i]));
> }
> ps.execute();
> ps.close();
>
> --end snip--
>
> In my test case I am attempting to use the derby embedded driver since
> I figured it would probably already be in the classpath.
> org.apache.derby.jdbc.EmbeddedDriver. It is not. I took several stabs
> and adding it to the ant files but I get the feeling that the
> TestCliDriver is not inheriting those paths. Any suggestions?
>
> Thanks
>
--
Yours,
Zheng
Re: Classpath question with testcase and UDF
Posted by Frederick Oko <fr...@gmail.com>.
Clearly this is a flexible way to sidestep OOB-secondpass or custom JDBC but
my inquiry was more on philosophical grounds where conceptually I prefer to
not mix the output specification in the actual query. W/o using INSERT won't
this also have to print something for every row? Doesn't this approach allow
export of records before the query is known to have fully suceeded such that
you could have exported only partial results before subsequent Hive/Hadoop
failure w/o any rollback oversight or control knowledge?
On Jul 14, 2009 3:19 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
On Tue, Jul 14, 2009 at 5:40 PM, Frederick Oko<fr...@gmail.com>
wrote: > Is this just a work...
Frederick,
A bulk operation would perform better. I was going to build a more
complete version and start a Jira on it, I guess I let the cat out of
the bag on this one.
My Notes:
My DataFlow looks like this..
RawWebLogs (5 MInute interval) -> Hive raw_web_log partition by date and
hour
Every hour I want to summarize the previous hour for fast access.
These summaries might be used directly by a web-application. For
example I want a summary of http_status and count. This would have
from 5-10 rows.
200 65300
304 4000
404 1003
A medium summary like a page count may look like this and have <50,000 rows
/page1 9000
/page2 5454
-> Large Summaries > 100,000 rows
raw-> Medium Summaries < 100,000 rows
-> Small Summaries < 1,000 rows
Small Summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast
Operations like sorting could be handled in a client
browser/AJAX
Medium Summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast
operations like sorting column pruning should be handled
server side
Large Summaries:
Should be in a partition.
Loading into a relational database should be throttled.
Operations like sorting or column pruning might need to be
done by
another summary or should be pre-sorted
Currently I am dropping these summaries into a partition of another
table, some are very very small.
raw_web_log_http_status partition date, partition hour
The UDF would be targeting small & medium summaries. Performance
should not be an issue since the number of inserts should be small.
Also I think if performance is an issue the jdbc driver could be a
connection pool.
Large inserts would likely have issues that custom grammar or UDF
could not address either, like throttling, replication, and shards.
My goal is to be able to dump my summaries into a hive partition
and/or a mysql table. This way my front end web server does not need
to have exotic hadoop/hive access to show people stats. Do you think I
am on the right track?
Edward
Re: Classpath question with testcase and UDF
Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jul 14, 2009 at 5:40 PM, Frederick Oko<fr...@gmail.com> wrote:
> Is this just a workaround for not being able to as easily extend the grammar
> with something like 'INSERT TABLE JDBC(...' which would process over the
> final result set in bulk instead of a per record evaluate()?
>
> On Jul 14, 2009 8:57 AM, "Edward Capriolo" <ed...@gmail.com> wrote:
>
> Thanks.
> I was using add file. That did the trick.
>
> 2009/7/14 He Yongqiang <he...@software.ict.ac.cn>:
>
>> Try "add jar your-db-connector-jar" first. > > Yongqiang > > On 09-7-14
>> 下午10:57, "Edward Capriolo"...
Frederick,
A bulk operation would perform better. I was going to build a more
complete version and start a Jira on it, I guess I let the cat out of
the bag on this one.
My Notes:
My DataFlow looks like this..
RawWebLogs (5 MInute interval) -> Hive raw_web_log partition by date and hour
Every hour I want to summarize the previous hour for fast access.
These summaries might be used directly by a web-application. For
example I want a summary of http_status and count. This would have
from 5-10 rows.
200 65300
304 4000
404 1003
A medium summary like a page count may look like this and have <50,000 rows
/page1 9000
/page2 5454
-> Large Summaries > 100,000 rows
raw-> Medium Summaries < 100,000 rows
-> Small Summaries < 1,000 rows
Small Summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast
Operations like sorting could be handled in a client browser/AJAX
Medium Summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast
operations like sorting column pruning should be handled server side
Large Summaries:
Should be in a partition.
Loading into a relational database should be throttled.
Operations like sorting or column pruning might need to be done by
another summary or should be pre-sorted
Currently I am dropping these summaries into a partition of another
table, some are very very small.
raw_web_log_http_status partition date, partition hour
The UDF would be targeting small & medium summaries. Performance
should not be an issue since the number of inserts should be small.
Also I think if performance is an issue the jdbc driver could be a
connection pool.
Large inserts would likely have issues that custom grammar or UDF
could not address either, like throttling, replication, and shards.
My goal is to be able to dump my summaries into a hive partition
and/or a mysql table. This way my front end web server does not need
to have exotic hadoop/hive access to show people stats. Do you think I
am on the right track?
Edward
Re: Classpath question with testcase and UDF
Posted by Frederick Oko <fr...@gmail.com>.
Is this just a workaround for not being able to as easily extend the grammar
with something like 'INSERT TABLE JDBC(...' which would process over the
final result set in bulk instead of a per record evaluate()?
On Jul 14, 2009 8:57 AM, "Edward Capriolo" <ed...@gmail.com> wrote:
Thanks.
I was using add file. That did the trick.
2009/7/14 He Yongqiang <he...@software.ict.ac.cn>:
> Try "add jar your-db-connector-jar" first. > > Yongqiang > > On 09-7-14
下午10:57, "Edward Capriolo"...
Re: Classpath question with testcase and UDF
Posted by Edward Capriolo <ed...@gmail.com>.
Thanks.
I was using add file. That did the trick.
2009/7/14 He Yongqiang <he...@software.ict.ac.cn>:
> Try "add jar your-db-connector-jar" first.
>
> Yongqiang
>
> On 09-7-14 下午10:57, "Edward Capriolo" <ed...@gmail.com> wrote:
>
>> Hey all,
>>
>> I am working on a UDF that can be used with prepared statements ad a
>> technique to go from Hive->SQL. The usage would be something like
>> from src SELECT
>> dboutput("jdbc:mysql://localhost:20","user","pass","INSERT INTO A
>> (col1,col2) VALUES (?,?)", key,value)
>>
>> --Sniplet--
>> public class GenericUDFDBOutput extends GenericUDF {
>> private static Log LOG =
>> LogFactory.getLog(GenericUDFDBOutput.class.getName());
>>
>> ObjectInspector[] argumentOIs;
>> GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
>> Connection connection = null;
>> private String url;
>> private String user;
>> private String pass;
>> private IntWritable result = new IntWritable(-1);
>>
>> public ObjectInspector initialize(ObjectInspector[] arguments)
>> throws UDFArgumentTypeException {
>>
>> this.argumentOIs = arguments;
>>
>>
>>
>> public Object evaluate(DeferredObject[] arguments) throws HiveException {
>>
>> System.err.println("we are here");
>> url = arguments[0].get().toString()+";create=true";
>> url = "jdbc:derby:;databaseName=test_dboutput_db;create=true";
>> user = arguments[1].get().toString();
>> pass = arguments[2].get().toString();
>> try {
>> try {
>> Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
>> } catch (InstantiationException e) {
>> System.out.println(e);
>> } catch (IllegalAccessException e) {
>> System.out.println(e);
>> } catch (ClassNotFoundException e) {
>> System.out.println(e);
>> }
>> connection = DriverManager.getConnection(url, user, pass);
>> PreparedStatement ps = connection.prepareStatement(arguments[3].get()
>> .toString());
>> for (int i = 4; i < arguments.length; ++i) {
>> PrimitiveObjectInspector poi = ((PrimitiveObjectInspector)
>> arguments[i]);
>> ps.setObject(i - 3, poi.getPrimitiveJavaObject(arguments[i]));
>> }
>> ps.execute();
>> ps.close();
>>
>> --end snip--
>>
>> In my test case I am attempting to use the derby embedded driver since
>> I figured it would probably already be in the classpath.
>> org.apache.derby.jdbc.EmbeddedDriver. It is not. I took several stabs
>> and adding it to the ant files but I get the feeling that the
>> TestCliDriver is not inheriting those paths. Any suggestions?
>>
>> Thanks
>>
>>
>
>
>
Re: Classpath question with testcase and UDF
Posted by He Yongqiang <he...@software.ict.ac.cn>.
Try "add jar your-db-connector-jar" first.
Yongqiang
On 09-7-14 下午10:57, "Edward Capriolo" <ed...@gmail.com> wrote:
> Hey all,
>
> I am working on a UDF that can be used with prepared statements ad a
> technique to go from Hive->SQL. The usage would be something like
> from src SELECT
> dboutput("jdbc:mysql://localhost:20","user","pass","INSERT INTO A
> (col1,col2) VALUES (?,?)", key,value)
>
> --Sniplet--
> public class GenericUDFDBOutput extends GenericUDF {
> private static Log LOG =
> LogFactory.getLog(GenericUDFDBOutput.class.getName());
>
> ObjectInspector[] argumentOIs;
> GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
> Connection connection = null;
> private String url;
> private String user;
> private String pass;
> private IntWritable result = new IntWritable(-1);
>
> public ObjectInspector initialize(ObjectInspector[] arguments)
> throws UDFArgumentTypeException {
>
> this.argumentOIs = arguments;
>
>
>
> public Object evaluate(DeferredObject[] arguments) throws HiveException {
>
> System.err.println("we are here");
> url = arguments[0].get().toString()+";create=true";
> url = "jdbc:derby:;databaseName=test_dboutput_db;create=true";
> user = arguments[1].get().toString();
> pass = arguments[2].get().toString();
> try {
> try {
> Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
> } catch (InstantiationException e) {
> System.out.println(e);
> } catch (IllegalAccessException e) {
> System.out.println(e);
> } catch (ClassNotFoundException e) {
> System.out.println(e);
> }
> connection = DriverManager.getConnection(url, user, pass);
> PreparedStatement ps = connection.prepareStatement(arguments[3].get()
> .toString());
> for (int i = 4; i < arguments.length; ++i) {
> PrimitiveObjectInspector poi = ((PrimitiveObjectInspector)
> arguments[i]);
> ps.setObject(i - 3, poi.getPrimitiveJavaObject(arguments[i]));
> }
> ps.execute();
> ps.close();
>
> --end snip--
>
> In my test case I am attempting to use the derby embedded driver since
> I figured it would probably already be in the classpath.
> org.apache.derby.jdbc.EmbeddedDriver. It is not. I took several stabs
> and adding it to the ant files but I get the feeling that the
> TestCliDriver is not inheriting those paths. Any suggestions?
>
> Thanks
>
>