Posted to common-user@hadoop.apache.org by Sunil Kulkarni <su...@persistent.co.in> on 2010/01/21 07:53:38 UTC

Does anyone have a sample program for Hadoop Streaming using shell scripting for Map/Reduce?

Hi,

I am new to Hadoop. Presently, I am reading the Hadoop Streaming documentation.

Does anyone have a sample program for Hadoop Streaming that uses a shell script for Map/Reduce?

Please help me with this.


----
Thanks,
Sunil


Re: Does anyone have a sample program for Hadoop Streaming using shell scripting for Map/Reduce?

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
Here is a sample I am working on: a distributed s3-fetch Pig script which
streams data through a Python script.

s3fetch.pig:
DEFINE CMD `s3fetch.py` SHIP('/root/s3fetch.py');
r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
grp_r1 = GROUP r1 BY filename PARALLEL 5;
r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
r3 = STREAM r2 THROUGH CMD;
STORE r3 INTO '/op/s3fetch_debug_log';

And here is my s3fetch.py:
import os
import sys

# Each line on stdin is a filename: copy it from S3 into HDFS and echo
# the command and its output so it ends up in the debug log.
for word in sys.stdin:
  word = word.rstrip()
  cmd = ('/usr/local/hadoop-0.20.0/bin/hadoop fs -cp '
         's3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.')
  sys.stdout.write('\n\n' + word + ':\t' + cmd + '\n')
  (input_str, out_err) = os.popen4(cmd)
  for line in out_err.readlines():
    sys.stdout.write('\t' + word + '::\t' + line)



On Thu, Jan 21, 2010 at 10:48 PM, Alexey Tigarev
<al...@gmail.com>wrote:

> On Thu, Jan 21, 2010 at 8:53 AM, Sunil Kulkarni
> <su...@persistent.co.in> wrote:
> > I am new to Hadoop. Presently, I am reading the Hadoop Streaming
> > documentation.
> > Does anyone have a sample program for Hadoop Streaming that uses a shell
> > script for Map/Reduce?
> > Please help me with this.
>
> Here's an article with a simple example,
> "Finding Similar Items with Amazon Elastic MapReduce, Python, and
> Hadoop Streaming", Pete Skomoroch:
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294
>
> Hope this helps.
>
> --
> Best regards, Alexey Tigarev
> <ti...@nlp.od.ua> Jabber: tigra@jabber.od.ua Skype: t__gra
>
> How a programmer can become a freelancer and earn their first $1000 on oDesk:
> http://freelance-start.com/earn-first-1000-on-odesk
>

Re: Does anyone have a sample program for Hadoop Streaming using shell scripting for Map/Reduce?

Posted by Alexey Tigarev <al...@gmail.com>.
On Thu, Jan 21, 2010 at 8:53 AM, Sunil Kulkarni
<su...@persistent.co.in> wrote:
> I am new to Hadoop. Presently, I am reading the Hadoop Streaming documentation.
> Does anyone have a sample program for Hadoop Streaming that uses a shell script for Map/Reduce?
> Please help me with this.

Here's an article with a simple example,
"Finding Similar Items with Amazon Elastic MapReduce, Python, and
Hadoop Streaming", Pete Skomoroch:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294

Hope this helps.
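
For a shell-only starting point, here is a minimal sketch of a streaming job
that uses small shell scripts as the mapper and reducer (a word-count style
job; the HADOOP_HOME location, the streaming jar wildcard, and the /ip and /op
HDFS paths are assumptions, so adjust them to your install):

# Hypothetical install location; adjust to your cluster.
HADOOP_HOME=/usr/local/hadoop-0.20.0

# mapper.sh: split each input line into words and emit "word<TAB>1".
cat > mapper.sh <<'EOF'
#!/bin/sh
awk '{for (i = 1; i <= NF; i++) print $i "\t1"}'
EOF

# reducer.sh: sum the counts per word (the framework sorts mapper output by key).
cat > reducer.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'
EOF
chmod +x mapper.sh reducer.sh

# Submit the job; -file ships both scripts to every task node.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input /ip/wordcount_input \
  -output /op/wordcount_output \
  -mapper mapper.sh \
  -reducer reducer.sh \
  -file mapper.sh \
  -file reducer.sh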

-- 
Best regards, Alexey Tigarev
<ti...@nlp.od.ua> Jabber: tigra@jabber.od.ua Skype: t__gra

How a programmer can become a freelancer and earn their first $1000 on oDesk:
http://freelance-start.com/earn-first-1000-on-odesk

RE: HDFS data storage across multiple disks on slave

Posted by Arv Mistry <ar...@kindsight.net>.
Sorry, that was a typo; it is actually "/opt1/dfs/data".

Cheers Arv

-----Original Message-----
From: Wang Xu [mailto:gnawux@gmail.com] 
Sent: January 21, 2010 11:36 AM
To: common-user@hadoop.apache.org
Subject: Re: HDFS data storage across multiple disks on slave

On Fri, Jan 22, 2010 at 12:25 AM, Arv Mistry <ar...@kindsight.net> wrote:
> The setup I have is a single slave with two disks, 500G each. In the
> hdfs-site.xml file I specify the two disks for dfs.data.dir, i.e.
> /opt/dfs/data,opt1/dfs/data.

It looks like you should configure "/opt1/dfs/data" rather than "opt1/dfs/data".

-- 
Wang Xu
Samuel Goldwyn  - "I'm willing to admit that I may not always be
right, but I am never wrong." -
http://www.brainyquote.com/quotes/authors/s/samuel_goldwyn.html

Re: HDFS data storage across multiple disks on slave

Posted by Wang Xu <gn...@gmail.com>.
On Fri, Jan 22, 2010 at 12:25 AM, Arv Mistry <ar...@kindsight.net> wrote:
> The setup I have is a single slave with two disks, 500G each. In the
> hdfs-site.xml file I specify the two disks for dfs.data.dir, i.e.
> /opt/dfs/data,opt1/dfs/data.

It looks like you should configure "/opt1/dfs/data" rather than "opt1/dfs/data".
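
For reference, a sketch of what the corrected entry might look like in
hdfs-site.xml (directory names taken from this thread; both paths must be
absolute and comma-separated):

<property>
  <name>dfs.data.dir</name>
  <value>/opt/dfs/data,/opt1/dfs/data</value>
</property>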

-- 
Wang Xu
Samuel Goldwyn  - "I'm willing to admit that I may not always be
right, but I am never wrong." -
http://www.brainyquote.com/quotes/authors/s/samuel_goldwyn.html

Re: HDFS data storage across multiple disks on slave

Posted by Allen Wittenauer <aw...@linkedin.com>.

You should be able to modify the dfs.data.dir property and bounce the
datanode process.
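
As a rough sketch, and assuming a 0.20-style tarball install under
/usr/local/hadoop-0.20.0 (the path used elsewhere in this thread), that would
mean adding the second directory to dfs.data.dir in hdfs-site.xml on the slave
and then restarting just the datanode daemon; the blocks already on the first
disk should stay where they are, and the datanode starts using the new
directory as well:

# Restart only the datanode on this slave after updating hdfs-site.xml.
/usr/local/hadoop-0.20.0/bin/hadoop-daemon.sh stop datanode
/usr/local/hadoop-0.20.0/bin/hadoop-daemon.sh start datanode

# Check that the configured capacity now reflects both disks.
/usr/local/hadoop-0.20.0/bin/hadoop dfsadmin -report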

On 1/21/10 9:30 AM, "Arv Mistry" <ar...@kindsight.net> wrote:

> I had a typo in my original email, which I've corrected below.
> But assuming I wanted to add another disk to my slave, should I be able
> to do that without losing my current data? Does anyone have any
> documentation or a link they could send me that describes this?
> 
> Appreciate your help,
> 
> Cheers Arv
> 
> -----Original Message-----
> From: Arv Mistry [mailto:arv@kindsight.net]
> Sent: January 21, 2010 11:25 AM
> To: common-user@hadoop.apache.org; common-dev-info@hadoop.apache.org
> Subject: HDFS data storage across multiple disks on slave
> 
> Hi,
> 
> I'm using Hadoop 0.20 and trying to understand how Hadoop stores the
> data.
> The setup I have is a single slave with two disks, 500G each. In the
> hdfs-site.xml file I specify the two disks for dfs.data.dir, i.e.
> /opt/dfs/data,/opt1/dfs/data.
> 
> Now, a couple of things: when I do a report, i.e. ./hadoop dfsadmin
> -report,
> it only says I have a configured capacity of 500G; should that not be
> twice that, since there are two 500G disks?
> 
> And when I look at the data being written, it's only written to
> /opt/dfs/data. There is no directory /opt1/dfs/data. Should that not
> have been created when I formatted HDFS?
> 
> Could anyone tell me if there is an easy way to add this second disk to
> HDFS and preserve the existing data? And any ideas what I did wrong that
> it didn't get created/used?
> 
> Any insight would be appreciated.
> 
> Cheers Arv 


RE: HDFS data storage across multiple disks on slave

Posted by Arv Mistry <ar...@kindsight.net>.
I had a typo in my original email, which I've corrected below.
But assuming I wanted to add another disk to my slave, should I be able
to do that without losing my current data? Does anyone have any
documentation or a link they could send me that describes this?

Appreciate your help,

Cheers Arv

-----Original Message-----
From: Arv Mistry [mailto:arv@kindsight.net] 
Sent: January 21, 2010 11:25 AM
To: common-user@hadoop.apache.org; common-dev-info@hadoop.apache.org
Subject: HDFS data storage across multiple disks on slave

Hi,

I'm using Hadoop 0.20 and trying to understand how Hadoop stores the
data.
The setup I have is a single slave with two disks, 500G each. In the
hdfs-site.xml file I specify the two disks for dfs.data.dir, i.e.
/opt/dfs/data,/opt1/dfs/data.

Now, a couple of things: when I do a report, i.e. ./hadoop dfsadmin
-report,
it only says I have a configured capacity of 500G; should that not be
twice that, since there are two 500G disks?

And when I look at the data being written, it's only written to
/opt/dfs/data. There is no directory /opt1/dfs/data. Should that not
have been created when I formatted HDFS?

Could anyone tell me if there is an easy way to add this second disk to
HDFS and preserve the existing data? And any ideas what I did wrong that
it didn't get created/used?

Any insight would be appreciated.

Cheers Arv 

HDFS data storage across multiple disks on slave

Posted by Arv Mistry <ar...@kindsight.net>.
Hi,

I'm using Hadoop 0.20 and trying to understand how Hadoop stores the
data.
The setup I have is a single slave with two disks, 500G each. In the
hdfs-site.xml file I specify the two disks for dfs.data.dir, i.e.
/opt/dfs/data,opt1/dfs/data.

Now, a couple of things: when I do a report, i.e. ./hadoop dfsadmin
-report,
it only says I have a configured capacity of 500G; should that not be
twice that, since there are two 500G disks?

And when I look at the data being written, it's only written to
/opt/dfs/data. There is no directory /opt1/dfs/data. Should that not
have been created when I formatted HDFS?

Could anyone tell me if there is an easy way to add this second disk to
HDFS and preserve the existing data? And any ideas what I did wrong that
it didn't get created/used?

Any insight would be appreciated.

Cheers Arv