Posted to common-user@hadoop.apache.org by s d <s....@gmail.com> on 2009/05/19 17:36:39 UTC

Hadoop & Python

Hi,
How robust is using Hadoop with Python over the streaming protocol? Are there
any disadvantages (performance, flexibility)? It just strikes me that Python
is so much more convenient when it comes to deploying and crunching text
files.
Thanks,

Re: Hadoop & Python

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, May 21, 2009 at 5:19 AM, Dan Milstein <dm...@hubspot.com> wrote:

> One thing about the | sort | sh combiner.sh approach: you do have to be
> careful about memory if you're doing that -- if a mapper instance sees a
> large number of rows, you'll be asking sort to sort *all* of those before
> passing them to the combiner.  Hadoop itself only hands off some bounded
> number of output keys at a time to the combiner, which is much safer for
> large data sets.
>

The unix "sort" utility is already fairly smart about this. It sorts in a
configurable memory buffer and spills to /tmp by default. The manpage doesn't
say which algorithm it actually uses, but I presume it's a merge sort. I think
the default buffer size is pretty small, so you may get better performance
with "sort -S 512M" or so.
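
For anyone unfamiliar with the spill-and-merge idea, here is a minimal Python
sketch of an external merge sort. It is not a description of what "sort" does
internally (the manpage doesn't say). The function name external_sort and the
max_lines_in_memory parameter are made up for illustration, and every input
line is assumed to end with a newline.

```python
import heapq
import tempfile
from itertools import islice


def external_sort(lines, max_lines_in_memory=100000):
    """Yield newline-terminated lines in sorted order.

    At most max_lines_in_memory lines are sorted in memory at a time;
    each sorted run is spilled to a temporary file, and the runs are
    then merged lazily.
    """
    lines = iter(lines)
    runs = []
    while True:
        # Sort one bounded chunk in memory.
        chunk = sorted(islice(lines, max_lines_in_memory))
        if not chunk:
            break
        # Spill the sorted run to a temp file (the system temp directory).
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    try:
        # heapq.merge reads the runs lazily, one line per run at a time.
        for line in heapq.merge(*runs):
            yield line
    finally:
        for run in runs:
            run.close()
```

The memory bound here plays the role that the -S buffer plays for "sort": a
larger bound means fewer runs to merge, at the cost of more memory per mapper.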

-Todd


Re: Hadoop & Python

Posted by Dan Milstein <dm...@hubspot.com>.
One thing about the | sort | sh combiner.sh approach: you do have to  
be careful about memory if you're doing that -- if a mapper instance  
sees a large number of rows, you'll be asking sort to sort *all* of  
those before passing them to the combiner.  Hadoop itself only hands  
off some bounded number of output keys at a time to the combiner,  
which is much safer for large data sets.

In dumbo itself, Klaas added "combine a chunk at a time", to address  
this problem.
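
A rough Python sketch of what combining a chunk at a time can look like is
below. This is not Dumbo's actual code; the name combine_in_chunks and the
max_keys bound are assumptions for illustration, and the values are assumed
to be counts that can simply be summed.

```python
from collections import defaultdict


def combine_in_chunks(pairs, max_keys=10000):
    """Sum counts per key while holding at most max_keys keys in memory.

    Whenever the buffer reaches max_keys distinct keys, its partial sums
    are emitted and the buffer is emptied, so memory use stays bounded no
    matter how many rows the mapper sees.
    """
    buffer = defaultdict(int)
    for key, count in pairs:
        buffer[key] += count
        if len(buffer) >= max_keys:
            for item in sorted(buffer.items()):
                yield item
            buffer.clear()
    # Emit whatever is left in the final, partially filled chunk.
    for item in sorted(buffer.items()):
        yield item
```

Because a key can show up in more than one chunk, the output holds partial
sums only, and the reducer still has to add them up.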

(and, yes, overall, getting combines fully supported in streaming is  
awesome)

-D

On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:

> Whoops, should have googled it first.  Looks like this is now fixed in
> trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
> adding something like "| sort | sh combiner.sh" to the call of the mapper
> script (via Klaas Bosteels)
>
> Would be great to get this patched into distributions like EMR and Cloudera


Re: Hadoop & Python

Posted by Peter Skomoroch <pe...@gmail.com>.
Direct link to HADOOP-4842:

https://issues.apache.org/jira/browse/HADOOP-4842

On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch
<pe...@gmail.com>wrote:

> Whoops, should have googled it first.  Looks like this is now fixed in
> trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
> adding something like "| sort | sh combiner.sh" to the call of the mapper
> script (via Klaas Bosteels)
>
> Would be great to get this patched into distributions like EMR and Cloudera


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop & Python

Posted by Peter Skomoroch <pe...@gmail.com>.
Whoops, should have googled it first.  Looks like this is now fixed in
trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
adding something like "| sort | sh combiner.sh" to the call of the mapper
script (via Klaas Bosteels)
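
To make the workaround concrete, here is a minimal sketch of a Python script
that could stand in for combiner.sh. It assumes the mapper emits tab-separated
"key<TAB>count" lines and that "sort" has already placed equal keys next to
each other. The function name combine_sorted is made up for illustration.

```python
import sys
from itertools import groupby


def combine_sorted(lines):
    """Sum the counts of adjacent records that share a key.

    Expects "key<TAB>count" lines already sorted by key, as produced by
    piping mapper output through sort.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda record: record[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


if __name__ == "__main__":
    # Read sorted mapper output on stdin, write combined records to stdout.
    for record in combine_sorted(sys.stdin):
        print(record)
```

A quick local check, with hypothetical file names, would be something like
"cat input.txt | python mapper.py | sort | python combiner.py". How the pipe
gets wired into the streaming job's mapper command depends on your setup and
is not shown here.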

Would be great to get this patched into distributions like EMR and Cloudera

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
<pe...@gmail.com>wrote:

> One area I'm curious about is the requirement that any combiners in
> Streaming jobs be java classes.  Are there any plans to change this in the
> future?  Prototyping streaming jobs in Python is great, and the ability to
> use a Python combiner would help performance a lot without needing to move
> to Java.


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop & Python

Posted by Peter Skomoroch <pe...@gmail.com>.
One area I'm curious about is the requirement that any combiners in
Streaming jobs be Java classes.  Are there any plans to change this in the
future?  Prototyping streaming jobs in Python is great, and the ability to
use a Python combiner would help performance a lot without needing to move
to Java.



On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <aa...@cloudera.com> wrote:

> S d,
>
>  It is totally fine to use Python streaming if it does the job you are
> after, there will be a slight performance hit, but that is noise assuming
> your cluster is a small one. If you are operating a large cluster
> continuously, then once your logic is stabilized using Python it might make
> sense to convert/operationalize some jobs to Java (or C pipes) to improve
> performance for purpose of finishing quicker or reducing number of servers
> needed.
>
>  You should also take a look at PIG and Hive, they are both higher level
> languages and very easy to learn:
>
> http://www.cloudera.com/hadoop-training-pig-introduction
>
> http://www.cloudera.com/hadoop-training-hive-introduction
>
> -- amr


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop & Python

Posted by s d <s....@gmail.com>.
Thanks. For what number of servers and what file sizes would the
performance hit still be minor? I am concerned about implementing it all only
to rewrite it later to scale economically.
Thanks for all the information.

On Tue, May 19, 2009 at 1:30 PM, Amr Awadallah <aa...@cloudera.com> wrote:

> S d,
>
>  It is totally fine to use Python streaming if it does the job you are
> after, there will be a slight performance hit, but that is noise assuming
> your cluster is a small one. If you are operating a large cluster
> continuously, then once your logic is stabilized using Python it might make
> sense to convert/operationalize some jobs to Java (or C pipes) to improve
> performance for purpose of finishing quicker or reducing number of servers
> needed.
>
>  You should also take a look at PIG and Hive, they are both higher level
> languages and very easy to learn:
>
> http://www.cloudera.com/hadoop-training-pig-introduction
>
> http://www.cloudera.com/hadoop-training-hive-introduction
>
> -- amr

Re: Hadoop & Python

Posted by Amr Awadallah <aa...@cloudera.com>.
S d,

  It is totally fine to use Python streaming if it does the job you are
after. There will be a slight performance hit, but that is noise assuming
your cluster is a small one. If you are operating a large cluster
continuously, then once your logic has stabilized in Python it might make
sense to convert some jobs to Java (or C++ via Hadoop Pipes) to improve
performance, either to finish sooner or to reduce the number of servers
needed.

  You should also take a look at Pig and Hive. Both are higher-level
languages and very easy to learn:

http://www.cloudera.com/hadoop-training-pig-introduction

http://www.cloudera.com/hadoop-training-hive-introduction

-- amr

s d wrote:
> Thanks.
> So in the overall scheme of things, what is the general feeling about using
> python for this? I like the ease of deploying and reading python compared
> with Java but want to make sure using python over hadoop is scalable & is
> standard practice and not something done only for prototyping and small
> scale tests.

Re: Hadoop & Python

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I have used streaming with PHP before to process a data set of about 1TB
without any problems at all.

Billy


"s d" <s....@gmail.com> wrote in message 
news:24b53fa00905191035w41b115c1q94502ee82be4393b@mail.gmail.com...
> Thanks.
> So in the overall scheme of things, what is the general feeling about 
> using
> python for this? I like the ease of deploying and reading python compared
> with Java but want to make sure using python over hadoop is scalable & is
> standard practice and not something done only for prototyping and small
> scale tests.



Re: Hadoop & Python

Posted by Zak Stone <zs...@gmail.com>.
Dumbo certainly makes Python Streaming much nicer; there's more info here:

http://wiki.github.com/klbostee/dumbo
http://dumbotics.com/

For example, Dumbo makes it easy to implement combiners in Python.

Zak


On Tue, May 19, 2009 at 8:17 PM, Alex Loddengaard <al...@cloudera.com> wrote:
> You might also check out Dumbo, which is a Hadoop Python module.
>
> <http://www.audioscrobbler.net/development/dumbo/>
>
> Alex

Re: Hadoop & Python

Posted by Alex Loddengaard <al...@cloudera.com>.
You might also check out Dumbo, which is a Hadoop Python module.

<http://www.audioscrobbler.net/development/dumbo/>

Alex

On Tue, May 19, 2009 at 10:35 AM, s d <s....@gmail.com> wrote:

> Thanks.
> So in the overall scheme of things, what is the general feeling about using
> python for this? I like the ease of deploying and reading python compared
> with Java but want to make sure using python over hadoop is scalable & is
> standard practice and not something done only for prototyping and small
> scale tests.

Re: Hadoop & Python

Posted by s d <s....@gmail.com>.
Thanks.
So in the overall scheme of things, what is the general feeling about using
Python for this? I like the ease of deploying and reading Python compared
with Java, but I want to make sure that using Python with Hadoop is scalable
and standard practice, not something done only for prototyping and
small-scale tests.


On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <al...@cloudera.com> wrote:

> Streaming is slightly slower than native Java jobs.  Otherwise Python works
> great in streaming.
>
> Alex

Re: Hadoop & Python

Posted by Alex Loddengaard <al...@cloudera.com>.
Streaming is slightly slower than native Java jobs.  Otherwise Python works
great in streaming.

Alex

On Tue, May 19, 2009 at 8:36 AM, s d <s....@gmail.com> wrote:

> Hi,
> How robust is using hadoop with python over the streaming protocol? Any
> disadvantages (performance? flexibility?) ?  It just strikes me that python
> is so much more convenient when it comes to deploying and crunching text
> files.
> Thanks,
>