Posted to hdfs-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2012/11/28 06:17:13 UTC

advice

Hi,
  Lately, I have been writing a lot of algorithms in the MapReduce abstraction
in Python (Hadoop Streaming).
I have got the hang of it (I think)...
I have a couple of questions:
1) By not using the Java libraries, what power of Hadoop am I missing?
2) I know that this is just the tip of the iceberg. Can someone point out,
from practical usage, which concepts I should focus on next (maybe practising
combiners or HDFS?) to improve on my current practical knowledge, and then of
course the not-so-practical part as well?
Sorry for being so vague.
Thanks
Jamal

Re: advice

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Jamal,

              Please see my answers inline below,

On Wed, Nov 28, 2012 at 10:47 AM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   Lately, I have been writing a lot of algorithms in the MapReduce abstraction
> in Python (Hadoop Streaming).
> I have got the hang of it (I think)...
> I have a couple of questions:
> 1) By not using the Java libraries, what power of Hadoop am I missing?
>
Though I am not completely sure:
-> I believe you do not get as fine-grained control over the job when using the
Streaming API as you do with the Java API.
-> Using Java, in the reduce phase the values are automatically grouped (as an
Iterator) for a given key, whereas in Streaming jobs the user has to take care
of grouping/processing the values by key (see the sketch after this list).
-> In the normal case the framework calls the map function once per input line,
but in Streaming your script reads stdin itself, so you have more control and
can process multiple lines at a time.
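For example, here is a minimal word-count sketch of a Streaming mapper and
reducer in Python. The task, the file names mapper.py / reducer.py, and the
tab delimiter are just the usual Streaming defaults plus my own assumptions,
not anything specific to your jobs:

    #!/usr/bin/env python
    # mapper.py (hypothetical name) -- emits one "word<TAB>1" pair per word.
    # With Streaming, the script itself loops over the input lines; the
    # framework just pipes the raw records to stdin, unlike the Java API
    # where map() is invoked once per record by the framework.
    import sys

    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("%s\t1\n" % word)

    #!/usr/bin/env python
    # reducer.py (hypothetical name) -- sums the counts per word.
    # Hadoop sorts the mapper output by key before piping it in, but it does
    # NOT group the values: the script has to detect where one key ends and
    # the next begins, which the Java Reducer's Iterator does for you.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, sep, count = line.rstrip("\n").partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                sys.stdout.write("%s\t%d\n" % (current_word, current_count))
            current_word = word
            current_count = int(count)

    # emit the final key after the input is exhausted
    if current_word is not None:
        sys.stdout.write("%s\t%d\n" % (current_word, current_count))

You can test the pair locally before submitting a job, e.g.:

    cat input.txt | python mapper.py | sort | python reducer.py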

> 2) I know that this is just the tip of the iceberg. Can someone point out,
> from practical usage, which concepts I should focus on next (maybe practising
> combiners or HDFS?) to improve on my current practical knowledge, and then of
> course the not-so-practical part as well?
> Sorry for being so vague.
>
-> It is better to start with the basics of the HDFS and MapReduce
architectures, and then move on to concepts like combiners, partitioners,
record readers, input formats, output formats, etc. (a sample Streaming command
using a combiner is sketched below).
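For instance, a combiner in a Streaming job is just another script passed on
the command line. This is only a sketch: the streaming jar location and the
HDFS paths below are made up and depend on your installation, and reusing the
reducer as the combiner only works because summing counts is associative and
commutative:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/jamal/wordcount-input \
        -output /user/jamal/wordcount-output \
        -mapper mapper.py \
        -combiner reducer.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py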

Best,
Mahesh Balija,
Calsoft Labs.

Re: advice

Posted by Simone Leo <si...@crs4.it>.
On 11/28/2012 06:17 AM, jamal sasha wrote:
>    Lately, I have been writing a lot of algorithms in the MapReduce
> abstraction in Python (Hadoop Streaming).
> By not using the Java libraries, what power of Hadoop am I missing?

In the Pydoop docs we have a section where several approaches to Hadoop
programming are discussed (with a focus on Python, of course):

http://pydoop.sourceforge.net/docs/for_dumbo_users.html

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo@crs4.it
http://www.crs4.it
