Posted to user@drill.apache.org by Paul Mogren <PM...@commercehub.com> on 2015/06/27 00:12:33 UTC

Connecting to Hive provided by AWS EMR

I have scoured the Drill website and mailing list, and Google, and have
come up with no advice. Can you help?

I started up an EMR cluster with AWS Hive 0.13.1 installed,

started the metastore service: hive/bin/hive --service metastore,

created a table:
CREATE TABLE apache_log (
  host STRING,
  IDENTITY STRING,
  USER STRING,
  TIME STRING,
  request STRING,
  STATUS STRING,
  SIZE STRING,
  referrer STRING,
  agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") ([0-9]*) ([0-9]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE;

And loaded a small amount of data:
LOAD DATA LOCAL INPATH 'access_log_1' OVERWRITE INTO TABLE apache_log;
(source: http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1)

I can query this data from the Hive console or from SquirrelSQL using the
AWS Hive JDBC4 driver from
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html

I configured a Drill storage plugin:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://172.24.7.81:10000",
    "hive.metastore.sasl.enabled": "false"
  }
}


But all I get from Drill is socket timeouts reading from the Hive
metastore, whether I try to query the apache_log table or Drill's
INFORMATION_SCHEMA.

My guess is that I need to swap in some AWS-provided Hive-related jar
files for those included with Drill. I'm looking for suggestions on
that approach, or on anything else I might be overlooking.

Thanks,
Paul

Re: Connecting to Hive provided by AWS EMR

Posted by Paul Mogren <PM...@commercehub.com>.

Thanks Venki! I should be using port 9083. Port 10000 services the JDBC
connector, and I didn’t realize they’d be different.

I’m not sure what your other comments refer to, or what I would be
specifying. From my recollection of the docs, I didn’t think I would need
any further Hive config in Drill. At first I left the other default config
elements, but whittled it down to what I had before (except the port
number) and it works.


Paul
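
For reference, the working configuration described above would presumably
look like the following: the plugin definition from the original post with
only the metastore port changed from 10000 to 9083. The IP address is the one
from the original post, and 9083 assumes the EMR metastore service was
started on its default port.

```json
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://172.24.7.81:9083",
    "hive.metastore.sasl.enabled": "false"
  }
}
```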


On 6/26/15, 6:38 PM, "Venki Korukanti" <ve...@gmail.com> wrote:

>Hi,
>
>What port is your Hive metastore listening on? The default port is 9083. In
>your case you provided 10000 (as part of hive.metastore.uris). Can you
>double-check that it is the correct one?
>
>Also, you need to provide fs.default.name and other S3-related settings in
>the Hive storage plugin config.
>
>Thanks
>Venki

Re: Connecting to Hive provided by AWS EMR

Posted by Venki Korukanti <ve...@gmail.com>.

Hi,

What port is your Hive metastore listening on? The default port is 9083. In
your case you provided 10000 (as part of hive.metastore.uris). Can you
double-check that it is the correct one?

Also, you need to provide fs.default.name and other S3-related settings in
the Hive storage plugin config.

Thanks
Venki
