Posted to mapreduce-user@hadoop.apache.org by Hassen Riahi <ha...@cern.ch> on 2011/06/20 22:13:26 UTC

mapreduce and python

Dear all,

Is it possible to have binary input to map code written in Python?

Thank you
Hassen

Re: mapreduce and python

Posted by Hassen Riahi <ha...@cern.ch>.
I'm trying these solutions... Thanks for the suggestions.



Re: mapreduce and python

Posted by Harsh J <ha...@cloudera.com>.
I'd like to +1 using Dumbo for all things Python and Hadoop
MapReduce. It's one of the better ways to do things.

Do look at the initial conversation here:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
as well.

The feature/bug fixes mentioned in that post are present in Apache
Hadoop 0.21 (which isn't yet deemed suitable for production use)
and are also available in other, production-ready Hadoop
distributions such as Cloudera's, which is based on 0.20.2:
https://ccp.cloudera.com/display/SUPPORT/Downloads




-- 
Harsh J

Re: mapreduce and python

Posted by Jeremy Lewi <je...@lewi.us>.
Hassen,

I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes
to implement mappers and reducers in Python.

TypedBytes is a Hadoop encoding format that allows binary data
(including lists and maps) to be serialized in a form that can safely
be passed to mappers/reducers on the command line through Hadoop
Streaming.
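For concreteness, here is a minimal sketch of that framing, based on the type codes documented for Hadoop's TypedBytes format. Only bytes, int, string, and vector are handled; this is an illustrative toy, not the typedbytes library Dumbo actually uses:

```python
import struct

# Type codes from the Hadoop TypedBytes format (subset, for illustration):
# 0 = bytes, 3 = int (32-bit big-endian), 7 = UTF-8 string, 8 = vector.
BYTES, INT, STRING, VECTOR = 0, 3, 7, 8

def encode(obj):
    """Serialize a Python object to TypedBytes-style bytes."""
    if isinstance(obj, bytes):
        return struct.pack(">Bi", BYTES, len(obj)) + obj
    if isinstance(obj, bool):
        raise TypeError("bool omitted from this sketch")
    if isinstance(obj, int):
        return struct.pack(">Bi", INT, obj)
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return struct.pack(">Bi", STRING, len(data)) + data
    if isinstance(obj, (list, tuple)):
        body = b"".join(encode(item) for item in obj)
        return struct.pack(">Bi", VECTOR, len(obj)) + body
    raise TypeError("unsupported type: %r" % type(obj))

def decode(buf, pos=0):
    """Return (value, next_pos) for the record starting at buf[pos]."""
    code = buf[pos]
    pos += 1
    if code in (BYTES, STRING, VECTOR):
        (n,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if code == BYTES:
            return buf[pos:pos + n], pos + n
        if code == STRING:
            return buf[pos:pos + n].decode("utf-8"), pos + n
        items = []  # VECTOR: n nested records follow
        for _ in range(n):
            value, pos = decode(buf, pos)
            items.append(value)
        return items, pos
    if code == INT:
        (value,) = struct.unpack_from(">i", buf, pos)
        return value, pos + 4
    raise ValueError("unhandled type code: %d" % code)
```

The real implementation covers the full format (longs, doubles, maps, and so on); the sketch just shows the framing: one type-code byte, an optional 32-bit big-endian length, then the payload.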

Dumbo is a Python library that makes it easy to implement your mappers
and reducers in Python. In particular, it handles decoding
TypedBytes-encoded data into native Python types.
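A minimal Dumbo-style job might look like the sketch below. The mapper/reducer bodies and the "payload_size" metric are illustrative assumptions, not Hassen's actual job; dumbo.run is Dumbo's usual entry point:

```python
def mapper(key, value):
    # With TypedBytes input, `value` arrives as a native Python object
    # (bytes, list, dict, ...) rather than a decoded text line.
    yield "payload_size", len(value)

def reducer(key, values):
    # Sum the per-record sizes emitted by the mapper.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo  # requires the Dumbo package on the job's PYTHONPATH
    dumbo.run(mapper, reducer)
```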

J


Re: mapreduce and python

Posted by Joe Stein <ch...@allthingshadoop.com>.
Hassen,

I have lots of binary data that I parse using Python streaming.

The way I do this is to stream the binary data into sequence files (I
save the binary data object as the key and null as the value).

Each key then gets written back to me line by line, key by key, for an
entire block when streaming.

To have this work in streaming on the command line you need to use
-inputformat SequenceFileAsTextInputFormat
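A sketch of a streaming mapper for this setup, assuming each record arrives on stdin as key, a tab, then value (with "(null)" as the value, per the layout above). Note that exactly how a binary key is rendered as text depends on the key class's toString():

```python
import sys

def parse_record(line):
    """Split one streaming input line into (key, value).

    SequenceFileAsTextInputFormat renders each record as text, and
    streaming delivers it to the mapper as one tab-separated line.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

def main(stream=sys.stdin):
    # Hypothetical mapper loop: the binary payload travels in the key.
    for line in stream:
        key, _value = parse_record(line)
        # ... decode and process the payload carried in `key` here ...
        print(key)

if __name__ == "__main__":
    main()
```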

To create the sequence files I have a jar file that reads from a
BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer.

I am not sure whether this will work for your data; if not, write your
own InputFormat.

good luck!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


RE: mapreduce and python

Posted by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com>.
You might want to chase down leads around https://issues.apache.org/jira/browse/MAPREDUCE-606. It looks like there is a patch for it on Jira, but I am not quite sure whether it is working. If keeping this in Python matters to you, it might be worth tinkering with the patch...

HTH,
Matt

-----Original Message-----
From: Hassen Riahi [mailto:hassen.riahi@cern.ch] 
Sent: Monday, June 20, 2011 3:50 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: mapreduce and python



Re: mapreduce and python

Posted by Hassen Riahi <ha...@cern.ch>.
Thanks, Matt, for your reply...

I would like to use a binary file as input to map code written in
Python.
All the examples I found take a text file as input to the map code (in
Python). Any ideas? Is it feasible?

Hassen



RE: mapreduce and python

Posted by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com>.
Hassen,

If you would like to use Python, I would suggest looking into the Hadoop Streaming API examples.

Matt

-----Original Message-----
From: Hassen Riahi [mailto:hassen.riahi@cern.ch] 
Sent: Monday, June 20, 2011 3:13 PM
To: mapreduce-user@hadoop.apache.org
Subject: mapreduce and python

This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
applicable U.S. export laws and regulations.