Posted to hdfs-user@hadoop.apache.org by Sundeep Kambhampati <ka...@cse.ohio-state.edu> on 2013/01/26 08:10:13 UTC

Executing a Python program inside Map Function

Is it possible to run a Python script inside a Map function which is written
in Java?

I want to run a Python script which is on my local disk, and I want to
use the output of that script for further processing in the Map function to
produce <key, value> pairs.
Can someone give me some idea of how to do it?


Regards
Sundeep

Re: Executing a Python program inside Map Function

Posted by Preethi Vinayak Ponangi <vi...@gmail.com>.
It is possible to run a Python script from your map function; just make
sure the script is available in your DistributedCache.
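
As an illustration (not from the original reply), a minimal driver-side
sketch for shipping the script through the DistributedCache might look like
the following; the class name, HDFS path and symlink name are placeholders,
and it uses the org.apache.hadoop.filecache.DistributedCache API of that era:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class ScriptJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "python-in-mapper");   // job name is illustrative
    // Ship the script to every task node; the "#myscript.py" fragment makes
    // it appear as ./myscript.py in each task's working directory.
    DistributedCache.addCacheFile(
        new URI("hdfs:///user/sundeep/scripts/myscript.py#myscript.py"),
        job.getConfiguration());
    DistributedCache.createSymlink(job.getConfiguration());
    // ... set mapper class, input/output formats and paths, then submit:
    // job.waitForCompletion(true);
  }
}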

I think you may be missing something in the design of such a job. You are
assuming that your file size is small enough that you can run this script
on your local file system and use the processed output in your Hadoop job.
But what if the local file size increases significantly in a few days? In
that case, you might actually be better off running this Python script as
part of Hadoop Streaming. Stream the Python script through the Streaming
API to get the benefit of distributed processing.
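
For reference, a bare-bones Streaming invocation along those lines might look
like this; the streaming jar location and the HDFS input/output paths below
are placeholders, not taken from the original question:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/sundeep/input \
    -output /user/sundeep/output \
    -mapper myscript.py \
    -file myscript.py

The -file option ships the local script to every task node, so the mapper can
invoke it by name.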

Hope this helps.

Vinayak.

On Sat, Jan 26, 2013 at 1:10 AM, Sundeep Kambhampati <
kambhamp@cse.ohio-state.edu> wrote:

> Is it possible to run a Python script inside a Map function which is written
> in Java?
>
> I want to run a Python script which is on my local disk, and I want to
> use the output of that script for further processing in the Map function to
> produce <key, value> pairs.
> Can someone give me some idea of how to do it?
>
>
> Regards
> Sundeep
>

Re: Executing a Python program inside Map Function

Posted by Harsh J <ha...@cloudera.com>.
Java provides the Process class to help you launch processes and
read/write from/to them:
http://docs.oracle.com/javase/6/docs/api/java/lang/Process.html. You
can use this to spawn your program from your code, write input into
the process's stdin, and read its output via its stdout, etc. The
hadoop-streaming parts of Apache Hadoop are very similar in their
operation, but they allow little control back in the launching Java map
class, which you seem to require.

The tasks (both Map and Reduce types) provide entry and exit API points
(configure()/setup() and cleanup()), allowing you to spawn a process
before map reads start and end it afterwards, letting you manage your
spawned process more cleanly.
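
As a rough sketch (not from the original reply), assuming the script has been
shipped to the task's working directory (e.g. via the DistributedCache as
./myscript.py), and that it reads one line on stdin and writes exactly one
line on stdout per input record, flushing after each line, a mapper along
those lines might look like:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PythonCallingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Process proc;
  private BufferedWriter toScript;
  private BufferedReader fromScript;

  @Override
  protected void setup(Context context) throws IOException {
    // Spawn the script once per task, before any records are read.
    proc = new ProcessBuilder("python", "./myscript.py").start();
    toScript = new BufferedWriter(new OutputStreamWriter(proc.getOutputStream()));
    fromScript = new BufferedReader(new InputStreamReader(proc.getInputStream()));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Feed the record to the script's stdin and read its one-line reply,
    // then emit it as the output key (the pairing here is illustrative).
    toScript.write(value.toString());
    toScript.newLine();
    toScript.flush();
    String result = fromScript.readLine();
    if (result != null) {
      context.write(new Text(result), value);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Close the script's stdin so it exits, then reap the process.
    toScript.close();
    proc.waitFor();
    fromScript.close();
  }
}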

On Sat, Jan 26, 2013 at 12:40 PM, Sundeep Kambhampati
<ka...@cse.ohio-state.edu> wrote:
> Is it possible to run a Python script inside a Map function which is written
> in Java?
>
> I want to run a Python script which is on my local disk, and I want to use
> the output of that script for further processing in the Map function to produce
> <key, value> pairs.
> Can someone give me some idea of how to do it?
>
>
> Regards
> Sundeep



-- 
Harsh J
