Posted to common-user@hadoop.apache.org by Blargy <zm...@hotmail.com> on 2010/06/29 02:10:21 UTC

Importing log files from various machines

I am currently looking into importing all of our application log files (~100+
host machines) into HDFS. Can someone point me in the right direction or
walk me through the process of how I can accomplish this? Any good reading
material on this subject? Videos?

I hope I don't need to physically copy all of the log files to one target
machine before importing. 

Thanks
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Importing-log-files-from-various-machines-tp929423p929423.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: Importing log files from various machines

Posted by Steve Loughran <st...@apache.org>.

S. Venkatesh wrote:
> You could write a simple map-only job with each map pulling a bunch of
> files from each of the servers. You could use an NLineInputFormat and
> tweak N based on the # of maps, # of files, etc.
> 

The problem is that you can't request which physical host the work runs on, 
a problem you also hit with other work-scheduling issues:

  * rebalancing data on HDDs on a single node
  * checksumming blocks (which is treated as a special case in the 
datanodes, not as jobscheduler work)
  * machine health checks

It would be nice to be able to push out work to specific nodes, 
more for management than MR work. I can do that already (cron is 
always handy), but such work doesn't co-operate with the jobscheduler, 
whereas I would like idle task trackers to pick up the management 
tasks for that node.

For now, I'd set up the log upload as cron jobs that run on the machines 
themselves and copy the data to the dfs: filestore, then analyse it with 
MR. It's good to clean your log dirs up anyway to prevent outages.
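
As a rough illustration of the cron approach, here is a minimal Python sketch of the script each machine might run. It only picks files that have not been modified recently (so a log still being written is skipped) and deletes a file only after its upload succeeded. The function names, the age threshold and the `upload` callable are assumptions made for this example and are not part of Hadoop; the actual copy into HDFS would be whatever client command you already use.

```python
import time
from pathlib import Path


def files_ready_for_upload(log_dir, min_age_seconds, now=None):
    """Return log files not modified for at least min_age_seconds, sorted by name.

    Skipping recently modified files avoids uploading a log that the
    application is still writing to.
    """
    now = time.time() if now is None else now
    ready = []
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file() and now - path.stat().st_mtime >= min_age_seconds:
            ready.append(path)
    return ready


def upload_and_clean(log_dir, min_age_seconds, upload, now=None):
    """Upload each ready file with the given callable; delete it only on success."""
    uploaded = []
    for path in files_ready_for_upload(log_dir, min_age_seconds, now):
        try:
            upload(path)  # hypothetical callable that copies one file into HDFS
        except Exception:
            # Leave the file in place so the next cron run retries it.
            continue
        path.unlink()
        uploaded.append(path.name)
    return uploaded
```

Run from cron at whatever interval suits your log rotation. A failed upload leaves the file untouched, so it is simply retried on the next run.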

-steve

Re: Importing log files from various machines

Posted by "S. Venkatesh" <ve...@innerzeal.com>.

You could write a simple map-only job with each map pulling a bunch of
files from each of the servers. You could use an NLineInputFormat and
tweak N based on the # of maps, # of files, etc.
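
To make the "tweak N" part concrete, here is a small Python sketch of the arithmetic. It assumes the job input is a text file with one log file location per line (a hypothetical format), and it only mimics the way N consecutive lines are grouped into one map's share of the input. It is not Hadoop code, and the function names are made up for this example.

```python
import math


def lines_per_map(num_files, num_maps):
    """Choose N so that num_files input lines need at most num_maps maps."""
    if num_files <= 0 or num_maps <= 0:
        raise ValueError("num_files and num_maps must be positive")
    return math.ceil(num_files / num_maps)


def split_lines(lines, n):
    """Group lines into consecutive chunks of at most n lines, one chunk per map."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]
```

For example, with 250 files to fetch and room for 100 maps, lines_per_map(250, 100) gives N = 3, and split_lines on the 250-line list then yields 84 chunks, so each map pulls at most three files.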

Venkatesh

-- 
Regards,
Venkatesh

“Perfection (in design) is achieved not when there is nothing more to
add, but rather when there is nothing more to take away.”
- Antoine de Saint-Exupéry

Re: Importing log files from various machines

Posted by Marc Limotte <ms...@gmail.com>.

Just use the hadoop client tools.  That is, install the hadoop package and
configure it to point to your running cluster.  You don't need to start any
hadoop processes on the node with your logs.  Just use the command line
(hadoop dfs -put) or (hadoop distcp) to move the files from each application
server directly into your HDFS cluster.
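
If you want to script this, a minimal Python sketch might look like the following. It builds the `hadoop dfs -put` command line for one file and places it under a per-host directory so that logs from different servers do not collide. The directory layout, the function names and the example paths are assumptions for illustration only; the sketch also assumes the hadoop client is on the PATH and that the destination directory already exists in HDFS.

```python
import socket
import subprocess
from pathlib import Path


def build_put_command(local_file, hdfs_dir, hostname=None):
    """Return the argv list for uploading one local file with `hadoop dfs -put`.

    The destination is <hdfs_dir>/<hostname>/<file name>, so logs from
    different application servers do not overwrite each other.
    """
    host = hostname or socket.gethostname()
    dest = f"{hdfs_dir.rstrip('/')}/{host}/{Path(local_file).name}"
    return ["hadoop", "dfs", "-put", str(local_file), dest]


def upload(local_file, hdfs_dir, hostname=None):
    """Run the upload; raises CalledProcessError if the hadoop client fails."""
    subprocess.run(build_put_command(local_file, hdfs_dir, hostname), check=True)
```

Calling upload("/var/log/myapp/app.log.1", "/logs") on a machine named web01 would run the put with /logs/web01/app.log.1 as the destination.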

Marc
