
Posted to common-user@hadoop.apache.org by openresearch <Qi...@openresearchinc.com> on 2009/05/14 18:39:55 UTC

hadoop streaming binary input / image processing

All,

I have read some recommendations regarding image (binary input) processing
using Hadoop Streaming, which only accepts text out of the box for now.
http://hadoop.apache.org/core/docs/current/streaming.html
https://issues.apache.org/jira/browse/HADOOP-1722
http://markmail.org/message/24woaqie2a6mrboc

However, I have not found a straight answer.

One recommendation is to put the image data on HDFS, but then we have to run
"hadoop dfs -get" for each file/dir and process it locally, which is very
expensive.

Another recommendation is to "...put them in a centralized place where all
the hadoop nodes can access them (via .e.g, NFS mount)..." Obviously, I/O
will become a bottleneck, which defeats the purpose of distributed processing.

I also notice that an enhancement ticket is open for hadoop-core. Has it been
committed to any svn branch (0.21)? Can somebody show me an example of how to
take *.jpg files (from HDFS) and process them in a distributed fashion
using streaming?

Many thanks

-Qiming
-- 
View this message in context: http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

Re: hadoop streaming binary input / image processing

Posted by Zak Stone <zs...@gmail.com>.

Hi Qiming,

You might consider using Dumbo, which is a Python wrapper for Hadoop
Streaming. The associated typedbytes module makes it easy for
streaming programs to work with binary data:

http://wiki.github.com/klbostee/dumbo
http://wiki.github.com/klbostee/typedbytes
http://dumbotics.com/2009/03/03/indexing-typed-bytes/
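
As a rough illustration of why typed bytes suit binary data, here is a
minimal sketch in plain Python. It does not use the typedbytes module's own
API. It assumes the layout described for HADOOP-1722: a one-byte type code,
and for a raw byte sequence (type code 0) a four-byte big-endian length
followed by the bytes themselves. The function names are hypothetical.

```python
import io
import struct

BYTES_TYPE_CODE = 0  # assumed type code for a raw byte sequence


def write_bytes_record(stream, payload):
    """Write one value: type code, 4-byte big-endian length, then payload."""
    stream.write(struct.pack(">bi", BYTES_TYPE_CODE, len(payload)))
    stream.write(payload)


def read_bytes_record(stream):
    """Read one value back; return None when the stream is exhausted."""
    header = stream.read(5)
    if len(header) < 5:
        return None
    code, length = struct.unpack(">bi", header)
    if code != BYTES_TYPE_CODE:
        raise ValueError("unexpected type code: %d" % code)
    return stream.read(length)


if __name__ == "__main__":
    buf = io.BytesIO()
    # Newlines and tabs inside the payload are harmless, because the
    # length prefix (not a delimiter) marks where the value ends.
    write_bytes_record(buf, b"\xff\xd8 fake image bytes \n\t")
    buf.seek(0)
    print(read_bytes_record(buf))
```

Because each value carries its own length, the payload never has to be
escaped, which is the main difference from line-oriented text streaming.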

If you are using an older version of Hadoop (such as 0.18.3), you will
need to apply the following patches to Hadoop to make typedbytes work:

https://issues.apache.org/jira/browse/HADOOP-1722
https://issues.apache.org/jira/browse/HADOOP-5450

The commands you use to apply the patches might look something like this:

cd <HADOOP_HOME>
patch -p0 < HADOOP-1722-branch-0.18.patch
patch -p0 < HADOOP-5450.patch
ant package

The guy who put Dumbo together, Klaas Bosteels, is incredibly helpful,
and he continues to improve this useful project.

Zak


Re: hadoop streaming binary input / image processing

Posted by jason hadoop <ja...@gmail.com>.

My apologies, Piotr. I was referring to the streaming case, where the script
pulls the file out of a shared file system, not to using an input split that
contains the image data, as you suggest.


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: hadoop streaming binary input / image processing

Posted by Piotr Praczyk <pi...@gmail.com>.

It depends on which API you use. When writing an InputSplit implementation,
you can specify which nodes the data resides on. I am new to Hadoop, but as
far as I know, doing this should enable data locality. Moreover, implementing
a subclass of TextInputFormat and adding some encoding on the fly should not
affect locality.


Piotr


Re: hadoop streaming binary input / image processing

Posted by jason hadoop <ja...@gmail.com>.

A downside of this approach is that you are unlikely to get data locality
for data on shared file systems, compared with data coming from an input
split. That being said, from your script, *hadoop dfs -get FILE -* will
write the file to standard out.
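
A minimal sketch of that idea, assuming the streaming job's input is a text
file with one HDFS path per line and that "hadoop dfs -get FILE -" behaves
as described above. The function names and the size-only "processing" step
are placeholders, not part of Hadoop.

```python
import subprocess
import sys

# Command as quoted in this thread; adjust it for your Hadoop version.
HADOOP_GET = ["hadoop", "dfs", "-get"]


def build_command(hdfs_path):
    """Return the argv that streams one HDFS file to standard output."""
    return HADOOP_GET + [hdfs_path, "-"]


def fetch_bytes(argv):
    """Run argv and return everything it wrote to standard output."""
    return subprocess.run(argv, stdout=subprocess.PIPE, check=True).stdout


def map_paths(lines, fetch=fetch_bytes):
    """Yield 'path<TAB>size' for every non-empty HDFS path in lines."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        data = fetch(build_command(path))
        # Placeholder processing: report how many bytes were fetched.
        yield "%s\t%d" % (path, len(data))


# As a streaming mapper, the script would end with something like:
#     for record in map_paths(sys.stdin):
#         print(record)
```

Each map task still pulls the file over the network, so this keeps the
locality downside mentioned above. It only avoids writing a temporary
local copy.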


Re: hadoop streaming binary input / image processing

Posted by Piotr Praczyk <pi...@gmail.com>.

Just in addition to my previous post...

You don't have to store the encoded files in a file system, of course, since
you can write your own InputFormat which will do this on the fly... the
overhead should not be that big.
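
To put a rough number on the size part of that overhead, the sketch below
compares a payload before and after text encoding, using Python's standard
base64 module. Base85 appears only as one example of a denser text-safe
alphabet; it is an assumption of this sketch, not something proposed in this
thread, and CPU cost is not measured.

```python
import base64
import os


def encoded_sizes(payload):
    """Return (raw, base64, base85) sizes in bytes for one payload."""
    return (
        len(payload),
        len(base64.b64encode(payload)),
        len(base64.b85encode(payload)),
    )


if __name__ == "__main__":
    # Random bytes stand in for image data.
    raw, b64, b85 = encoded_sizes(os.urandom(3000))
    print("raw=%d base64=%d (x%.2f) base85=%d (x%.2f)"
          % (raw, b64, b64 / raw, b85, b85 / raw))
```

Base64 maps every 3 input bytes to 4 characters and base85 maps every 4 to
5, so the encoded data grows by about one third and one quarter
respectively.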

Piotr


Re: hadoop streaming binary input / image processing

Posted by Piotr Praczyk <pi...@gmail.com>.

Hi

If you want to read the files from HDFS and cannot pass the binary data,
you can encode it (base64, for example, but you can think about something
more efficient, since the range of characters acceptable in the input
string is wider than that used by base64). It should solve the problem until
some kind of binary input is supported (is it going to happen?).
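
A minimal sketch of the encoding idea, assuming one image per text line in
the form "name<TAB>base64 data" and file names that contain no tabs or
newlines. The function names are hypothetical.

```python
import base64


def encode_record(name, payload):
    """Pack a file name and its raw bytes into one tab-separated line."""
    return "%s\t%s" % (name, base64.b64encode(payload).decode("ascii"))


def decode_record(line):
    """Recover (name, raw bytes) from a line produced by encode_record."""
    name, encoded = line.rstrip("\n").split("\t", 1)
    return name, base64.b64decode(encoded)


if __name__ == "__main__":
    # The payload contains a newline and a tab, which would break a
    # line-oriented record if it were passed through unencoded.
    line = encode_record("a.jpg", b"\xff\xd8\n\t\x00")
    print(line)
    print(decode_record(line))
```

The base64 alphabet contains no tab or newline characters, so each encoded
image stays on a single line and the streaming script can split it safely.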

Piotr
