Posted to user@hive.apache.org by Adam O'Donnell <ad...@immunet.com> on 2010/02/15 04:43:01 UTC

Working UDF for GeoIP lookup?

Anyone have a working UDF jar for GeoIP lookups using MaxMind's data?
I saw one being discussed a few months ago, but haven't seen it in any
contrib branches.

Also, I tried building a simple one using Perl, but it looks like a
bug in Amazon's Hive distribution is mangling the Perl modules with an
odd filename transformation.  For example, add file
s3://bucket/file.pm becomes s3_bucket_file.pm, which makes distributing a
Perl module a bit sticky.

Any pointers on either the UDF or the amazon hive issue?

Adam

Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.

Just an FYI,
I have an implementation of this on my site if anyone wants to kick the tires.

http://www.jointhegrid.com/hive-udf-geo-ip-jtg/

http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/
http://www.jointhegrid.com/svn/geo-ip-java/

Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.
I was thinking to make UDFs like this:

select geo_lookup('databasefile', fieldx, 'municipality') from table;

Alternatively, we can embed the data file inside the jar, so the UDF becomes:

geo_lookup('ip', 'municipality');

As to ant: since the licensing prevents geo_ip_java from being
bundled with Apache, I am not looking to build this into Hive. I am simply
going to build a NetBeans project and use the Hive jars as libraries.

Edward
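
(A minimal sketch of the first signature above, as a hedged illustration.
It assumes MaxMind's legacy Java API, com.maxmind.geoip.LookupService; the
package, class, and field names are illustrative, not the actual
jointhegrid code.)

package com.example.hive.udf;  // hypothetical package

import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;

import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class GeoLookup extends UDF {
  // Opened once per task and reused across rows; GEOIP_MEMORY_CACHE loads
  // the whole .dat file into memory, so reopening per row would be slow.
  private LookupService lookup;

  public String evaluate(String databaseFile, String ip, String field)
      throws IOException {
    if (lookup == null) {
      lookup = new LookupService(databaseFile, LookupService.GEOIP_MEMORY_CACHE);
    }
    Location loc = lookup.getLocation(ip);
    if (loc == null) {
      return null;  // unknown IP
    }
    if ("municipality".equalsIgnoreCase(field)) {
      return loc.city;
    } else if ("region".equalsIgnoreCase(field)) {
      return loc.region;
    }
    return loc.countryName;  // default field
  }
}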

Re: Working UDF for GeoIP lookup?

Posted by Adam O'Donnell <ad...@immunet.com>.
Edward:

How can I help?  I got most of the UDF built myself last night, and
today I was sorting out ant build issues.  My main frustration is
trying to get it to play nice in Amazon's environment.

How did you solve the issue of selecting only parts of the GeoIP City
data on a single lookup?  Did you just do multiple lookups, one for
each piece of data?

Also, did your jar run on Amazon's Elastic MapReduce?




-- 
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070

Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.

My mistake! My cluster (Hadoop 0.18.3 + Hive 0.4.0rc2) is doing this fine. I am
working off a 0.5.0 trunk build locally; it might be a bug, or I might need to
move to the latest trunk.

Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.

'./file' is not working either.

My UDF does work when I specify the entire local path, but that is
not actually using the file shipped by 'add file'.

This works:
add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
select geoip(first, 'COUNTRY_NAME',
'/home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat'
) from a;

Re: Working UDF for GeoIP lookup?

Posted by Eric Arenas <ea...@rocketmail.com>.
Hi Ed,

I created a similar UDF some time ago, and if I am not mistaken you have to assume that your file is going to be in the same directory, as in:

path_of_dat_file = "./name_of_file";

And it worked for me.

Let me know if this solves your issue; if not, I will look into my old
code and see how I did it.

regards
Eric Arenas
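
(A tiny sketch of Eric's suggestion, as a hedged illustration: resources
shipped with 'add file' generally show up in the task's working directory
under their base name, so a relative lookup is a sensible fallback. The
class name is hypothetical.)

import java.io.File;

public class DatFileResolver {
  // Try the path exactly as the query supplied it; if that fails, fall
  // back to the base name in the current working directory, which is
  // where 'add file' resources typically land on the task nodes.
  public static File resolve(String path) {
    File f = new File(path);
    return f.exists() ? f : new File("./" + f.getName());
  }
}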





Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Feb 15, 2010 at 12:02 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <ad...@immunet.com> wrote:
>> Edward:
>>
>> I don't have access to the individual data nodes, so I can't install the
>> pure perl module. I tried distributing it via the add file command, but that
>> is mangling the file name, which causes perl to not load the module as the
>> file name and package name dont match.  Kinda frustrating, but it is really
>> all about trying to work around an issue on amazon's elastic map reduce.  I
>> love the service in general, but some issues are frustrating.
>>
>> Sent from my iPhone
>>
>> On Feb 15, 2010, at 6:05, Edward Capriolo <ed...@gmail.com> wrote:
>>
>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <ad...@immunet.com> wrote:
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Carl
>>>>
>>>> How about this... .can I run a standard hadoop streaming job against a
>>>> hive table that is stored as a sequence file?  The idea would be I
>>>> would break my hive query into two separate tasks and do a hadoop
>>>> streaming job in between, then pick up the hive job afterwards.
>>>> Thoughts?
>>>>
>>>> Adam
>>>>
>>>
>>> I actually did do this with a streaming job. The UDF was tied up with
>>> the apache/gpl issues.
>>>
>>> Here is how I did this. 1 install geo-ip-perl on all datanodes
>>>
>>>  ret = qp.run(
>>>   " FROM ( "+
>>>   " FROM raw_web_data_hour "+
>>>   " SELECT transform( remote_ip ) "+
>>>   " USING 'perl geo_state.pl' "+
>>>   " AS ip, country_code3, region "+
>>>   " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>>>   " ) a " +
>>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION
>>> (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>>>   " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>>>   " GROUP BY a.country_code3,a.region,a.ip "
>>>   );
>>>
>>>
>>> #!/usr/bin/perl
>>> use Geo::IP;
>>> use strict;
>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>>> GEOIP_STANDARD);
>>> while (<STDIN>){
>>>  #my $record = $gi->record_by_name("209.191.139.200");
>>>  chomp($_);
>>>  my $record = $gi->record_by_name($_);
>>>  print STDERR "was sent $_ \n" ;
>>>  if (defined $record) {
>>>   print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n"
>>>  ;
>>>   print STDERR "return " . $record->region . "\n" ;
>>>  } else {
>>>   print "??\n";
>>>   print STDERR "return was undefined \n";
>>>  }
>>>
>>> }
>>>
>>> Good luck.
>>
>
> Sorry to hear that your having problems. It is a fairly simple UDF,
> for those familiar writing udf/genudf. You probably could embed the
> lookup data file in the jar as well. I meant to build/host this on my
> site, but I have not got around to it. If you want to tag team it, I
> am interested.
>
So I started working on this:
I packaged geo-ip into a jar:
http://www.jointhegrid.com/svn/geo-ip-java/
And I am building a Hive UDF:
http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/

I am running into a problem: I am trying to have the UDF work with two
signatures:

geoip('209.191.139.200', 'STATE_NAME');
geoip('209.191.139.200', 'STATE_NAME', 'path/to/datafile');

For the first invocation I have bundled the data into the JAR file. I
have verified that I can access it:
http://www.jointhegrid.com/svn/geo-ip-java/trunk/src/LoadInternalData.java

I am trying to do the same thing inside my UDF, but I get FileNotFound
exceptions. I have also tried adding the file to the distributed
cache:

add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/dist/geo-ip-java.jar;
add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/hive-udf-geo-ip-jtg/dist/hive-udf-geo-ip-jtg.jar;
create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
select geoip(first, 'COUNTRY_NAME', 'GeoIP.dat') from a;

Any hints? I did notice a JIRA about UDFs reading from the distributed
cache, so that may be an issue. I still wonder, though, why I cannot pull
the file out of the jar.

-ed
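
(A hedged sketch of the bundled-data case: an entry packed inside a jar is
not a java.io.File, so handing LookupService a bare file name throws
FileNotFoundException even though the resource is on the classpath. One
workaround is to copy the classpath resource to a temp file first. The
class name and resource name are assumptions, not the actual jointhegrid
code.)

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.maxmind.geoip.LookupService;

public class BundledGeoIP {
  public static LookupService open(String resource) throws Exception {
    // e.g. resource = "GeoIP.dat", packed at the root of the jar
    InputStream in = BundledGeoIP.class.getClassLoader()
        .getResourceAsStream(resource);
    if (in == null) {
      throw new IllegalStateException(resource + " not found on classpath");
    }
    // Spill the jar entry to a real file that LookupService can open.
    File tmp = File.createTempFile("geoip", ".dat");
    tmp.deleteOnExit();
    OutputStream out = new FileOutputStream(tmp);
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    out.close();
    in.close();
    return new LookupService(tmp.getAbsolutePath(),
        LookupService.GEOIP_MEMORY_CACHE);
  }
}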

Re: Working UDF for GeoIP lookup?

Posted by Adam O'Donnell <ad...@immunet.com>.
That would make sense.  Let's talk off list.




-- 
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070

Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.

Sorry to hear that you're having problems. It is a fairly simple UDF
for anyone familiar with writing a UDF/GenericUDF. You probably could embed the
lookup data file in the jar as well. I meant to build and host this on my
site, but I have not gotten around to it. If you want to tag-team it, I
am interested.

Re: Working UDF for GeoIP lookup?

Posted by "Adam J. O'Donnell" <ad...@immunet.com>.
Edward:

I don't have access to the individual datanodes, so I can't install
the pure-Perl module. I tried distributing it via the add file
command, but that mangles the file name, which causes Perl not to
load the module, since the file name and package name don't match.  Kinda
frustrating, but it is really all about trying to work around an issue
in Amazon's Elastic MapReduce.  I love the service in general, but
some issues are frustrating.

Sent from my iPhone


Re: Working UDF for GeoIP lookup?

Posted by Edward Capriolo <ed...@gmail.com>.

I actually did do this with a streaming job. The UDF was tied up with
the Apache/GPL licensing issues.

Here is how I did this: first, install the Geo::IP Perl module on all
datanodes, then run a TRANSFORM query that streams each IP through a
small Perl script.

  ret = qp.run(
    " FROM ( "+
    " FROM raw_web_data_hour "+
    " SELECT transform( remote_ip ) "+
    " USING 'perl geo_state.pl' "+
    " AS ip, country_code3, region "+
    " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
    " ) a " +
    " INSERT OVERWRITE TABLE raw_web_data_hour_geo " +
    " PARTITION (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') " +
    " SELECT a.country_code3, a.region, a.ip, count(1) as theCount " +
    " GROUP BY a.country_code3, a.region, a.ip "
    );


#!/usr/bin/perl
use Geo::IP;
use strict;

# Open the MaxMind city database (installed on every datanode).
my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat", GEOIP_STANDARD);

# Hive's TRANSFORM sends one IP per line on stdin; emit tab-separated
# ip, country_code3, region on stdout. STDERR ends up in the task logs.
while (<STDIN>){
  #my $record = $gi->record_by_name("209.191.139.200");
  chomp($_);
  my $record = $gi->record_by_name($_);
  print STDERR "was sent $_\n";
  if (defined $record) {
    print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
    print STDERR "return " . $record->region . "\n";
  } else {
    # Lookup missed: emit a placeholder so the row still produces output.
    print "??\n";
    print STDERR "return was undefined\n";
  }
}

Good luck.

Re: Working UDF for GeoIP lookup?

Posted by Adam O'Donnell <ad...@immunet.com>.

How about this... can I run a standard Hadoop streaming job against a
Hive table that is stored as a SequenceFile?  The idea would be that I
would break my Hive query into two separate tasks, run a Hadoop
streaming job in between, then pick up the Hive job afterwards.
Thoughts?

Adam
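
(One way to picture the in-between step, sketched under the assumption that
a Hive-managed SequenceFile table is just SequenceFiles under the warehouse
directory. SequenceFileAsTextInputFormat is the same input format a
streaming job would pick with -inputformat; the paths and class name are
hypothetical.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class HiveTableScan {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HiveTableScan.class);
    // Deliver keys and values as Text, just as streaming's
    // '-inputformat SequenceFileAsTextInputFormat' would.
    conf.setInputFormat(SequenceFileAsTextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // Hypothetical warehouse location of the table's files.
    FileInputFormat.addInputPath(conf,
        new Path("/user/hive/warehouse/raw_web_data_hour"));
    FileOutputFormat.setOutputPath(conf, new Path("/tmp/geo_step"));
    // No mapper/reducer set: the identity classes pass rows through.
    JobClient.runJob(conf);
  }
}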

Re: Working UDF for GeoIP lookup?

Posted by Carl Steinbach <ca...@cloudera.com>.
Hi Adam,

> Yeah, I want to do something along those lines, but the Hive
> distribution I am using (Amazon's) is mangling file names to the point
> that I can't fetch additional libraries.  This makes grabbing the
> required Perl module a bit challenging.
>

Apache Hive's 'add' command does not yet support referencing files in
HDFS or S3. You can only reference files in the local file system.
The ability to reference S3 files in EMR Hive is a feature that the folks
at Amazon added, and since this feature hasn't been open sourced
I can't explain its odd behavior. Probably the best place for questions
about EMR Hive is the Amazon Web Services user forums, where
I noticed you already posted a question :)



> Can you rename a file on the local filesystem after issuing an add
> file command?  Something along the lines of: add file
> s3://bucket/file.pm#file.pm?
>

On Apache Hive the command 'add file /foo/bar/baz.txt' causes Hive to
check that /foo/bar/baz.txt exists, and to save the path to a list of
resources that should be added to the DistributedCache whenever a
query is executed. Hive expects /foo/bar/baz.txt to exist as long as
the path appears in the output of 'list FILE'. There is no support in
Apache Hive for renaming files when they are placed in the DistributedCache.
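
(For contrast, a hedged aside: the '#' rename asked about above does exist
one layer down. In a raw MapReduce job, DistributedCache honors a URI
fragment as the symlink name in each task's working directory; Hive itself
does not expose this. The bucket and file below are hypothetical.)

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheRename {
  public static void configure(JobConf conf) throws Exception {
    // The fragment after '#' names the symlink each task sees, i.e. the
    // rename that 'add file s3://bucket/file.pm#file.pm' reaches for.
    DistributedCache.addCacheFile(new URI("s3://bucket/file.pm#file.pm"), conf);
    DistributedCache.createSymlink(conf);
  }
}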

Hope this helps.

Carl

Re: Working UDF for GeoIP lookup?

Posted by Adam O'Donnell <ad...@immunet.com>.
Carl:

> There's also a Cloudera Blog post from a while back about analyzing GeoIP
> data using Pig here:
> http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/
>
> While less efficient than a UDF, I think you can probably call this
> Perl script from a Hive TRANSFORM query without making any changes.
> See http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform

Yeah, I want to do something along those lines, but the Hive
distribution I am using (Amazon's) is mangling file names to the point
that I can't fetch additional libraries.  This makes grabbing the
required Perl module a bit challenging.

Can you rename a file on the local filesystem after issuing an add
file command?  Something along the lines of: add file
s3://bucket/file.pm#file.pm?

Adam

Re: Working UDF for GeoIP lookup?

Posted by Carl Steinbach <ca...@cloudera.com>.
Hi Adam,

> Anyone have a working UDF jar for GeoIP lookups using MaxMind's data?
> I saw one being discussed a few months ago, but haven't seen it in any
> contrib branches.
>

The original discussion about a Hive GeoIP UDF is here:
http://markmail.org/message/acqj3nal4opbpcmw#query:+page:1+mid:4w5ly57x6zcysol6+state:results

Looks like the idea of including it in contrib was nixed because of
licensing issues, but I bet Ed would be willing to share his work with you.

There's also a Cloudera Blog post from a while back about analyzing GeoIP
data using Pig here:
http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/

While less efficient than a UDF, I think you can probably call this
Perl script from a Hive TRANSFORM query without making any changes.
See http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform

Thanks.

Carl