Posted to user@hive.apache.org by zeekvfu <wz...@gmail.com> on 2013/12/26 09:29:47 UTC

use custom MapReduce script to load data into table with partitions

Hi all,

I'd like to know whether Hive supports using a custom MapReduce script
to load data into a table with static/hybrid/dynamic partitions.

I'm working on a log analysis program (for instance, over Tomcat logs).
First I import the raw log file into Hive; then I need to do some
analysis (e.g. parse URIs to detect vulnerabilities) and store the results
in another Hive table (probably adding several columns). I've achieved what
I've described above using a custom MapReduce script written in Perl.

What disappoints me is that the query is quite slow, so I decided to speed
it up using Hive (dynamic) partitions. Unfortunately, I didn't succeed.
That made me wonder whether Hive supports using a custom MapReduce
script to load data into a table with partitions at all.

Here is the related HiveQL script:
------------------------------------------------------------------------------------------
------------------------------ original table, tomcat0
CREATE EXTERNAL TABLE if not exists tomcat0
(full_time STRING, client_ip STRING, hyphen STRING, server_ip STRING,
method STRING, uri STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
'quote.delim'='("|\\[|\\])',
'field.delim'=' ',
'serialization.null.format'='-')
STORED AS TEXTFILE
LOCATION '/var/hive_data/logdata_external';

load data local inpath '/home/hduser/tomcat_log' overwrite INTO TABLE
tomcat0;

------------------------------ target table, tomcat1
CREATE TABLE if not exists tomcat1
(time TIMESTAMP, time_zone STRING, client_ip STRING, hyphen STRING,
server_ip STRING, method STRING, uri STRING, protocol STRING,
uri_attack_type INT)
PARTITIONED BY (partition_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add file /home/hduser/Hive.pl/Tomcat.pl;

INSERT overwrite TABLE tomcat1 partition(partition_date)
SELECT
TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol,
uri_attack_type, partition_date)
FROM tomcat0
distribute by partition_date limit 2;
------------------------------------------------------------------------------------------

A short explanation:
I'm trying to use dynamic partitioning by date (not a Hive timestamp), and the
partition column is generated by the custom MapReduce script. I don't know
whether this is acceptable in Hive.
I also tried adding the static partition in advance:
 alter table tomcat1 add partition (partition_date='2013-08-31');
but still no luck. :-(
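(As far as I understand, a dynamic partition insert also requires these session
settings, which I list here for completeness; the property names are from the
Hive configuration documentation, so please correct me if I have them wrong.
Nonstrict mode should be needed here because the insert has no static
partition column at all:)

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```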

The typical input/output of the custom MapReduce script, Tomcat.pl:
INPUT (7 tab-separated columns):
31/Aug/2012:00:14:20+0800  202.104.70.250  -  192.168.202.245  GET  /wxcs/mains.html?1=1'  HTTP/1.1
OUTPUT (10 tab-separated columns):
2012-08-31 00:14:20  +0800  202.104.70.250  -  192.168.202.245  GET  /wxcs/mains.html?1=1'  HTTP/1.1  1  2012-08-31
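To make the script's contract concrete, here is a minimal Python sketch of a
transform with the same input/output shape as above. This is an illustration
only, not my actual Tomcat.pl (which is in Perl); the date parsing and the
"attack" heuristic (flagging URIs that contain a single quote) are simplified
assumptions:

```python
import re
import sys

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
          "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
          "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

# Matches e.g. "31/Aug/2012:00:14:20+0800"
TIME_RE = re.compile(r"^(\d{2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2})([+-]\d{4})$")

def transform_line(line):
    """Turn one 7-column input row into one 10-column output row."""
    full_time, client_ip, hyphen, server_ip, method, uri, protocol = \
        line.rstrip("\n").split("\t")
    day, mon, year, hms, tz = TIME_RE.match(full_time).groups()
    date = "%s-%s-%s" % (year, MONTHS[mon], day)
    # Naive attack heuristic for illustration: flag URIs containing a quote.
    attack = "1" if "'" in uri else "0"
    # The dynamic partition column (partition_date) must be emitted last.
    return "\t".join([date + " " + hms, tz, client_ip, hyphen, server_ip,
                      method, uri, protocol, attack, date])

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform_line(line))
```

Hive's TRANSFORM streams tab-separated columns to the script's stdin and reads
tab-separated columns back from its stdout, with the dynamic partition column
last; locally, something like `head raw_log.tsv | python3 transform.py`
(hypothetical file names) mimics that pipeline.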

Error messages are typically like this:
------------------------------------------------------------------------------------------
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]:
An error occurred while reading or writing to your custom script. It may
have crashed with an error.
        at
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:410)
        at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
        at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
        at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
        ... 9 more
------------------------------------------------------------------------------------------

I don't think the problem lies in my Perl script itself, though that is what
Hive complains about.
Sorry for such a long e-mail. :-(
Any input is appreciated.  :-)

Re: use custom MapReduce script to load data into table with partitions

Posted by zeekvfu <wz...@gmail.com>.
On Thu, Dec 26, 2013 at 7:03 PM, Satish Mittal <sa...@inmobi.com> wrote:

> You could use HCatalog for reading/writing data into Hive metastore
> tables. HCatalog provides the HCatLoader/HCatStorer interface for Pig and
> the HCatInputFormat/HCatOutputFormat interface for MR jobs to perform table
> read/write, including support for static/dynamic partitions.



Thanks for your reply.
I've checked HCatalog. Since HCatalog is built on top of the Hive metastore
and incorporates Hive's DDL, the partitioning mechanism of Hive and HCatalog
should be the same.

I should have stated my question more clearly and concisely:
does Hive support using a custom MapReduce script to load data into a table
with static/hybrid/dynamic partitions, where the partition column doesn't
come from the original table but is a completely new column generated by the
MapReduce script?

Thanks.

Re: use custom MapReduce script to load data into table with partitions

Posted by Satish Mittal <sa...@inmobi.com>.
Hi,

You could use HCatalog for reading/writing data into Hive metastore tables.
HCatalog provides the HCatLoader/HCatStorer interface for Pig and the
HCatInputFormat/HCatOutputFormat interface for MR jobs to perform table
read/write, including support for static/dynamic partitions.

Thanks,
Satish


On Thu, Dec 26, 2013 at 1:59 PM, zeekvfu <wz...@gmail.com> wrote:

