Posted to common-user@hadoop.apache.org by Dan Tamowski <ta...@gmail.com> on 2008/03/21 16:29:47 UTC

Hadoop For Image Analysis/Vectorization

Hello,

Forgive me if I am missing something in the documentation, but nothing is
jumping out at me.

I am exploring the use of Hadoop for image analysis and/or image
vectorization and have a few questions. I anticipate that there will be a
large collection of image files as input with an equal number of output
files. All files will be in raw binary format and are independent of each
other. What I am trying to figure out is:

-Does Hadoop/MR offer a clean abstraction for both consuming and producing a
large number of files? (I know it can handily consume a large number of
files, but all examples of output seem to form a single file)
-Does Hadoop provide the input/output formats relevant to this or would I
have to create my own? (e.g. non-splittable binary input, and binary output)
-Is this issue even well-suited to Hadoop in the first place? This type of
job may only need the map phase, and not the reduce phase, so maybe I'm
looking in the wrong place.

Thank you for your time. Also, I only subscribe to the digest, if you have
questions for me regarding this, please cc me at tamowski.d@gmail.com.

Dan

Re: Hadoop For Image Analysis/Vectorization

Posted by Ted Dunning <td...@veoh.com>.


On 3/21/08 8:29 AM, "Dan Tamowski" <ta...@gmail.com> wrote:

> -Does Hadoop/MR offer a clean abstraction for both consuming and producing a
> large number of files? (I know it can handily consume a large number of
> files, but all examples of output seem to form a single file)

Yes.

It works very well if your definition of large is less than hundreds of
thousands of files and your files are reasonably large (>> 1 MB).  If that is
not the case, then pasting your files together with a synchronization string
between them that you can scan for quickly works pretty well.
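
For what it's worth, SequenceFile already does that pasting for you: it
stores (key, value) records and embeds periodic sync markers, which play the
role of the synchronization string you can scan for quickly.  Here is a
rough, untested sketch of packing a local directory of images into one
SequenceFile, keyed by file name (the paths are made up):

// Untested sketch: pack many small local image files into one SequenceFile
// so Hadoop can split it efficiently.  Paths below are hypothetical.
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("images.seq"), Text.class, BytesWritable.class);
    try {
      for (File f : new File("/local/images").listFiles()) {
        byte[] bytes = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(bytes);
        } finally {
          in.close();
        }
        // key = original file name, value = raw image bytes
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}

With the images in a SequenceFile you can feed the job with the stock
SequenceFileInputFormat and each map call sees one (name, bytes) record.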

> -Does Hadoop provide the input/output formats relevant to this or would I
> have to create my own? (e.g. non-splittable binary input, and binary output)

It has input formats for handling multiple input files (with an obvious name
that I am spacing on at the moment).  Building a glue factory to paste
otherwise unsplittable files together and pull them apart at map time would
be pretty easy.
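
The other direction, marking input as unsplittable so that each map call
gets a whole file, only needs a small subclass of FileInputFormat.
Something like this (untested; these class names are mine, not anything
that ships with Hadoop):

// Untested sketch: a non-splittable input format that hands each whole
// file to one map call as (file name, raw bytes), using the current
// org.apache.hadoop.mapred API.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // never split: one map task per image file
  }

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) return false;
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      // Slurp the entire file into the value; fine while files fit in memory.
      byte[] bytes = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(0, bytes);
      } finally {
        in.close();
      }
      key.set(path.getName());
      value.set(bytes, 0, bytes.length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}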

> -Is this issue even well-suited to Hadoop in the first place? This type of
> job may only need the map phase, and not the reduce phase, so maybe I'm
> looking in the wrong place.

Hadoop is surprisingly beneficial in these cases, and you may find a reduce
phase more useful than you expect, if only to concatenate and/or summarize
your results.

The benefit in map-only jobs comes from moving the computation close to the
data without explicit management by you.  You can come close to full disk
bandwidth without having to know where your data lives, how to spawn tasks,
how many nodes there are, how failures are handled, or many other things.
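
Running map-only is just a matter of setting the reduce count to zero, in
which case each map task writes its output straight to HDFS, which also
answers your first question: you get one output file per map.  A driver
might look like this (untested; the mapper class and paths are
placeholders):

// Untested sketch of a map-only job driver.  VectorizeMapper and the
// input/output paths are hypothetical placeholders.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class VectorizeDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(VectorizeDriver.class);
    job.setJobName("image-vectorize");

    job.setInputFormat(WholeFileInputFormat.class); // from the sketch above
    job.setMapperClass(VectorizeMapper.class);      // hypothetical mapper
    job.setNumReduceTasks(0);                       // map-only: no sort/shuffle

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path("/user/dan/images"));
    FileOutputFormat.setOutputPath(job, new Path("/user/dan/vectors"));

    JobClient.runJob(job);
  }
}

If you later decide you want a single concatenated result, set the reduce
count back to 1 with the identity reducer and the framework will merge the
map outputs for you.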