Posted to user@hive.apache.org by Xiaobin She <xi...@gmail.com> on 2012/02/07 11:03:57 UTC

What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Hi all,

Sorry if it is not appropriate to send one thread to two mailing lists.

I'm trying to use Hadoop and Hive to do some log analytics jobs.

Our system generates lots of logs every day; for example, it produced about
370 GB of logs (spread across many log files) yesterday, and the volume
grows every day.

We want to use Hadoop and Hive to replace our old log analysis system.

We distinguish our logs by logid; we have a log collector which collects
logs from clients and then generates log files.

For every logid there is one log file per hour, and for some logids this
hourly log file can be 1-2 GB.

I have set up a test cluster with Hadoop and Hive and have run some tests;
the results look good for us.

For reference, we will create one table in Hive for every logid, partitioned
by hour.
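
For illustration, a minimal sketch of how one such hourly-partitioned table
could be created with the hive command-line tool (the table name, columns,
and delimiter here are just placeholders, not our real schema):

    #!/bin/bash
    # Sketch only: one hourly-partitioned Hive table per logid.
    # log_10001, the columns and the delimiter are made-up placeholders.
    LOGID=10001
    hive -e "
      CREATE TABLE IF NOT EXISTS log_${LOGID} (
        ts     STRING,
        client STRING,
        msg    STRING
      )
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"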

Now I have a question: what's the best practice for loading log files into
HDFS or the Hive warehouse directory?

My first thought is: at the beginning of every hour, compress the log file
of the last hour for every logid and then use the hive command-line tool to
load these compressed log files into HDFS,

using commands like "LOAD DATA LOCAL INPATH '$logname' OVERWRITE INTO
TABLE $tablename PARTITION (dt='$h')".
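
To make this concrete, here is a rough sketch of the hourly load script I
have in mind (the directory layout, the logid list file, and the table
naming are hypothetical):

    #!/bin/bash
    # Sketch only: compress last hour's file for each logid, then load it.
    # /data/logs/<logid>/<logid>-YYYYMMDDHH.log and /etc/loglist.txt are
    # made-up paths for illustration.
    H=$(date -d '1 hour ago' +%Y%m%d%H)
    for LOGID in $(cat /etc/loglist.txt); do
      LOGFILE="/data/logs/${LOGID}/${LOGID}-${H}.log"
      gzip "${LOGFILE}"
      hive -e "LOAD DATA LOCAL INPATH '${LOGFILE}.gz'
               OVERWRITE INTO TABLE log_${LOGID} PARTITION (dt='${H}')"
    done

Each LOAD DATA LOCAL is essentially a copy of the file into HDFS plus a
metastore update, so the gzip step is where most of the local work happens.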

I think this can work, and I have run some tests on our 3-node test
cluster.

But the problem is that there are lots of logids, which means there are lots
of log files, so every hour we will have to load lots of files into HDFS.
And there is another problem: we will run an hourly analysis job on these
hourly collected log files. Because there are so many log files, if we load
them all at the same time at the beginning of every hour, I think there will
be a burst of network traffic and a data delivery latency problem.

By data delivery latency I mean that it will take some time for the log
files to be copied into HDFS, and this will delay the start of our hourly
log analysis job.

So I want to figure out whether we can write or append logs to a compressed
file that is already located in HDFS. I have posted a thread about this on
the mailing list, and from what I have learned, this is not possible.


So, what's the best practice for loading logs into HDFS while using Hive to
do log analytics?

Or what are the common methods for handling the problem I have described
above?

Can anyone give me some advice?

Thank you very much for your help!

Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Posted by be...@gmail.com.
Hi
    Assume your external table's location points to an HDFS directory, say
/ext_tables/table1/. You can write custom code in Flume that ingests data
into sub-directories within that parent folder, like
/ext_tables/table1/2012-01-01/12 (currentDate/currentHour). Configure the
collector to dump into HDFS every hour, with a maximum buffer matching your
block size. What would happen is: within an hour, if the buffer fills up,
the data is dumped into HDFS at that instant; and whatever the buffer size
is, at the end of every hour Flume dumps its data into HDFS. Then at every
hour (give a delay of 5 minutes to be on the safe side, i.e. at n hours 05
minutes) issue a DDL in Hive to add a partition with location
/ext_tables/table1/currentDate/previousHour.

The hour partition may then contain one or more blocks/files, depending on
your input data.
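
A rough sketch of that hourly DDL, run from cron (the table name table1,
the dt/hr partition columns, and the directory layout are only the examples
used above; the table is assumed to be declared PARTITIONED BY (dt STRING,
hr STRING)):

    #!/bin/bash
    # Sketch only: register the previous hour's Flume output dir as a partition.
    DT=$(date -d '1 hour ago' +%Y-%m-%d)
    HR=$(date -d '1 hour ago' +%H)
    hive -e "ALTER TABLE table1
             ADD IF NOT EXISTS PARTITION (dt='${DT}', hr='${HR}')
             LOCATION '/ext_tables/table1/${DT}/${HR}'"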

How would this approach fit your use case?
 
Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: alo alt <wg...@googlemail.com>
Date: Tue, 7 Feb 2012 15:27:18 
To: <us...@hive.apache.org>
Reply-To: common-user@hadoop.apache.org
Cc: <co...@hadoop.apache.org>; 佘晓彬<xi...@gmail.com>
Subject: Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

[Quoted reply from alo alt trimmed; it appears in full as the next message below.]


Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Posted by alo alt <wg...@googlemail.com>.
Yes.
You can use partitioned tables in Hive to append data as new partitions without moving the data. For Flume you can define small sinks, but you're right: a file in HDFS is only closed and fully written once Flume sends the close. Please note that the gzip codec has no sync marker inside, so you have to wait until Flume has closed the file in HDFS before you can process it. Snappy would fit, but I have no long-term experience with it in a production environment.

For the block sizing you're right, but I think you can work around that.

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 3:09 PM, Xiaobin She wrote:

> [Quoted reply from Xiaobin She trimmed; it appears in full later in this thread.]


Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Posted by Xiaobin She <xi...@gmail.com>.
Hi Bejoy and Alex,

Thank you for your advice.

Actually I had looked at Scribe first, and I had heard of Flume.

I looked at Flume's user guide just now, and Flume seems promising. As Bejoy
said, the Flume collector can dump data into HDFS when the collector buffer
reaches a particular size or after a particular time interval. This is good,
and I think it can solve the data delivery latency problem.

But what about compression?

From Flume's user guide I see that Flume supports compression of log files,
but if Flume does not wait until the collector has collected a full hour of
logs before compressing and sending them to HDFS, then it will send only
part of the hour's logs to HDFS at a time, am I right?

So if I want to use these data in Hive (assume I have an external table in
Hive), I have to specify at least two partition keys while creating the
table: one for day-month-hour, and one for some shorter time interval like
ten minutes. Then I add Hive partitions to the existing external table with
the specified partition keys.
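
Something like the following is what I have in mind; the table name,
columns, and directory layout are only made-up examples:

    #!/bin/bash
    # Sketch only: external table partitioned by hour plus a ten-minute slot.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS log_10001_ext (
        ts     STRING,
        client STRING,
        msg    STRING
      )
      PARTITIONED BY (dt STRING, ten_min STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/ext_tables/log_10001';

      ALTER TABLE log_10001_ext
      ADD IF NOT EXISTS PARTITION (dt='2012020712', ten_min='10')
      LOCATION '/ext_tables/log_10001/2012020712/10'"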

Is the above process right?

If this is right, then there could be other problems: the ten-minute logs,
after compression, may not be big enough to fill an HDFS block, which could
produce lots of small files (for some of our logids this is likely), or if I
set the time interval to half an hour, then at the end of the hour it may
still cause the data delivery latency problem.

This does not seem like a very good solution; am I making a mistake or
misunderstanding something here?

Thank you very much!





2012/2/7 alo alt <wg...@googlemail.com>

> [Quoted message from alo alt trimmed; it appears in full later in this thread.]

Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Posted by alo alt <wg...@googlemail.com>.
Hi,

A first start with Flume:
http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html

Facebook's Scribe could also work for you.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:

> [Original message from Xiaobin She trimmed; it appears in full at the top of this thread.]


Re: What's the best practice for loading logs into HDFS while using Hive to do log analytics?

Posted by be...@gmail.com.
Hi
      If you are looking for a solution to ingest data into HDFS in near real time, i.e. as soon as it is generated, you should look into Cloudera Flume. It makes real-time data ingestion into HDFS possible. You can configure the Flume collector to dump data into HDFS when the collector buffer reaches a particular size or after a particular time interval.

Currently your log collector gives you a file for each logid every hour. You may want to consider a design that replaces the log collector with Flume. You can make Flume ingest data into per-hour sub-directories in HDFS. Once that is done, all that is left every hour is (see the sketch below):
- issue the DDLs in Hive to add the new partitions for those tables
- trigger your Hive jobs for the previous hour
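
For example, a minimal sketch of that hourly driver, assuming Flume writes
to /flume/logs/<logid>/<date>/<hour>, one hourly-partitioned table per
logid, and a per-logid report table; all of these names are placeholders:

    #!/bin/bash
    # Sketch only: run from cron a few minutes after the hour.
    DT=$(date -d '1 hour ago' +%Y-%m-%d)
    HR=$(date -d '1 hour ago' +%H)
    for LOGID in $(cat /etc/loglist.txt); do
      # 1. register the new partition for the previous hour
      hive -e "ALTER TABLE log_${LOGID}
               ADD IF NOT EXISTS PARTITION (dt='${DT}', hr='${HR}')
               LOCATION '/flume/logs/${LOGID}/${DT}/${HR}'"
      # 2. trigger the hourly analysis job for that logid
      hive -e "INSERT OVERWRITE TABLE report_${LOGID} PARTITION (dt='${DT}', hr='${HR}')
               SELECT client, count(*)
               FROM log_${LOGID}
               WHERE dt='${DT}' AND hr='${HR}'
               GROUP BY client"
    done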

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Xiaobin She <xi...@gmail.com>
Date: Tue, 7 Feb 2012 18:03:57 
To: <co...@hadoop.apache.org>; <us...@hive.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: What's the best practice for loading logs into HDFS while using Hive
 to do log analytics?

[Original message from Xiaobin She quoted in full; see the top of this thread.]

