Posted to mapreduce-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/02/16 23:47:44 UTC

executing hadoop commands from python?

Hi,

  This might be more of a Python-centric question, but I was wondering if
anyone has tried this out...

I am trying to run a few Hadoop commands from a Python program...

For example, from the command line you can do:

      bin/hadoop dfs -ls /hdfs/query/path

it lists all the files under that HDFS path, very much like Unix ls.


Now I am trying to do the same thing from Python and then manipulate the
output.

     exec_str = "path/to/hadoop/bin/hadoop dfs -ls " + query_path
     os.system(exec_str)

Now I am trying to capture that output so I can work with it in Python,
for example to count the number of files.
I looked into the subprocess module, but these are not native shell
commands, so I am not sure whether the same approach applies.
How can I solve this?
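
For what it's worth, subprocess handles any executable, not just shell
builtins, so the same approach does apply. A minimal sketch (the hadoop
binary path is a placeholder as above, and query_path is assumed to be
set already):

     import subprocess

     # "path/to/hadoop" is a placeholder; point it at your installation.
     cmd = ["path/to/hadoop/bin/hadoop", "dfs", "-ls", query_path]
     output = subprocess.check_output(cmd)  # Python 2.7+; raises on non-zero exit
     lines = output.splitlines()[1:]        # skip the "Found N items" header line
     print len(lines)                       # rough file count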

Thanks

Re: executing hadoop commands from python?

Posted by anuj maurice <an...@gmail.com>.
I was stuck with a similar issue before and couldn't come up with a more
viable alternative than this: if the output of the Hadoop command is not
too big, you can read it into your Python script and process it there.

I use the following code snippet to clean the output of ls and store it
in a Python list for processing.
In your case you can call len() on the list to get the file count, as
shown after the snippet.

import commands  # Python 2 only; this module was removed in Python 3

# List the HDFS paths; send stderr to /dev/null to drop log noise.
fscommand = "hadoop dfs -ls /path/in/%s/*/ 2> /dev/null" % ("hdfs")
hadoop_cmd = commands.getoutput(fscommand)
lines = hadoop_cmd.split("\n")[1:]  # skip the "Found N items" header
# Keep the last three space-separated tokens of each line, stripped.
strlines = [map(lambda a: a.strip(), line.split(' ')[-3:]) for line in lines]
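
In your case, for example:

file_count = len(strlines)
print "file count: %d" % file_count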




On Sun, Feb 17, 2013 at 4:17 AM, jamal sasha <ja...@gmail.com> wrote:



-- 
regards,
Anuj Maurice

Re: executing hadoop commands from python?

Posted by Harsh J <ha...@cloudera.com>.
Instead of 'scraping' this way, consider using a library such as
Pydoop (http://pydoop.sourceforge.net), which provides Pythonic APIs
for interacting with Hadoop components. Other libraries are covered
at http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
as well.
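
For example, the listing above becomes roughly the following (a sketch,
assuming Pydoop is installed and HDFS is reachable from the client; the
exact call signatures may differ between Pydoop versions):

import pydoop.hdfs as hdfs

files = hdfs.ls("/hdfs/query/path")  # list of absolute path strings
print len(files)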

On Sun, Feb 17, 2013 at 4:17 AM, jamal sasha <ja...@gmail.com> wrote:



--
Harsh J
