Posted to common-user@hadoop.apache.org by tog <gu...@gmail.com> on 2009/06/03 10:32:03 UTC

Image indexing/searching with Hadoop and MPI

Hi there,

This is a kind of newbie question (at least as far as Hadoop is concerned).
I was wondering if there are any Hadoop-based projects around dealing with
image indexing and searching? We are working in this area, and it might be
interesting to have a look at such a project.
My second question concerns scientific computing with Hadoop. Has anyone
tried to use Hadoop to parallelize a scientific application? I know there is
Hama, but it does not seem very active these days (I might be wrong ;) ).
Some time ago, I heard of an attempt to implement MPI on top of Hadoop; was
that really the plan, and is there any update?
Anyway, I would be interested in any papers/feedback on the performance of
scientific applications running on large clusters using Hadoop.

Best Regards
Guillaume

Re: Image indexing/searching with Hadoop and MPI

Posted by Owen O'Malley <ow...@gmail.com>.
> OK, I can understand your point, but I am sure that some people have been
> trying to use the map/reduce programming model to do CFD or other kinds of
> scientific computing.
> Any experience in this area from the list?

I know of one project that assumes it has an entire Hadoop cluster,
generates the hostnames in the mappers, and uses that host list in
the reducer to launch an MPI job. They do it because it provides
higher efficiency for very small data transfers. The alternative
was a long chain of map/reduce jobs, each phase of which produces very
small output. I wouldn't recommend using MPI under map/reduce in
general, since it involves making a lot of assumptions about your
application. In particular, to avoid killing your cluster you
shouldn't use checkpoints in your application, and should just rerun the
application from the beginning on failures. That implies that the
application can't run very long (an upper bound of probably 30 minutes on
2000 nodes).
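
For what it's worth, here is a rough sketch of that pattern, using the
org.apache.hadoop.mapred API. This is not that project's actual code; the
mpirun arguments, the host-file path, and the application path are just
placeholders, and the job would be driven by a dummy input with one split
per node you want to occupy. Each map task reports the host it landed on,
and a single reducer collects the hosts and shells out to mpirun.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MpiUnderMapReduce {

  // Each map task ignores its input and just reports the hostname of the
  // node it was scheduled on.
  public static class HostMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private static final IntWritable KEY = new IntWritable(1);

    public void map(LongWritable offset, Text ignored,
                    OutputCollector<IntWritable, Text> out,
                    Reporter reporter) throws IOException {
      out.collect(KEY, new Text(InetAddress.getLocalHost().getHostName()));
    }
  }

  // A single reducer sees every hostname, writes an MPI host file, and
  // launches the MPI job on those nodes.
  public static class LaunchReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> hosts,
                       OutputCollector<IntWritable, Text> out,
                       Reporter reporter) throws IOException {
      PrintWriter hostFile = new PrintWriter(new FileWriter("/tmp/mpi-hosts"));
      int n = 0;
      while (hosts.hasNext()) {
        hostFile.println(hosts.next().toString());
        n++;
      }
      hostFile.close();

      // Hand the gathered hosts to mpirun (paths are placeholders).
      Process mpi = Runtime.getRuntime().exec(new String[] {
          "mpirun", "-np", Integer.toString(n),
          "--hostfile", "/tmp/mpi-hosts", "/path/to/mpi-app" });
      try {
        out.collect(key, new Text("mpirun exited with " + mpi.waitFor()));
      } catch (InterruptedException e) {
        throw new IOException(e.toString());
      }
    }
  }
}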

That said, if you want to run other styles of applications, you really
want a two-level scheduler, where the first-level scheduler allocates
nodes (or partial nodes) to jobs (or frameworks). Effectively, that is
what Hadoop On Demand (HOD) was doing with Torque, but I suspect there
will be a more performant solution than HOD within the next year.

-- Owen

Re: Image indexing/searching with Hadoop and MPI

Posted by tog <gu...@gmail.com>.
On Wed, Jun 3, 2009 at 5:17 PM, Edward J. Yoon <ed...@apache.org> wrote:

> > This is a kind of newbie question (at least as far as Hadoop is
> > concerned).
> > I was wondering if there are any Hadoop-based projects around dealing
> > with image indexing and searching? We are working in this area, and it
> > might be interesting to have a look at such a project.
>
> There is a text-search engine library called Lucene. See also the
> Nutch project. Otherwise, did you mean something like content-based
> image indexing and searching using image attributes such as color,
> texture, etc., rather than the text of the image tag?


Yes, this is exactly what I mean. I am looking for a project doing
content-based image indexing using, for example, GIST, BOF, ...
Does such a project exist?


>
>
> I think MPI programming isn't suitable for the distributed HDFS and
> map/reduce programming model, since MPI requires heavy communication
> among the nodes.


OK, I can understand your point, but I am sure that some people have been
trying to use the map/reduce programming model to do CFD or other kinds of
scientific computing.
Any experience in this area from the list?

Cheers
Guillaume

Re: Image indexing/searching with Hadoop and MPI

Posted by "Edward J. Yoon" <ed...@apache.org>.
> This is a kind of newbie question (at least as far as Hadoop is concerned).
> I was wondering if there are any Hadoop-based projects around dealing with
> image indexing and searching? We are working in this area, and it might be
> interesting to have a look at such a project.

There is a text-search engine library called Lucene. See also the
Nutch project. Otherwise, did you mean something like content-based
image indexing and searching using image attributes such as color,
texture, etc., rather than the text of the image tag?

> My second question concerns scientific computing with Hadoop. Has anyone
> tried to use Hadoop to parallelize a scientific application? I know there
> is Hama, but it does not seem very active these days (I might be wrong ;) ).
> Some time ago, I heard of an attempt to implement MPI on top of Hadoop; was
> that really the plan, and is there any update?
> Anyway, I would be interested in any papers/feedback on the performance of
> scientific applications running on large clusters using Hadoop.

I think MPI programming isn't suitable for the distributed HDFS and
map/reduce programming model, since MPI requires heavy communication
among the nodes.

FYI, in Hama the basic matrix operations are currently implemented on
top of the map/reduce programming model: for example, the matrix
get/set methods, the matrix norms, matrix-matrix
multiplication/addition, and matrix transpose. In the near future, SVD,
eigenvalue decomposition, and some graph algorithms will be
implemented. All the operations are executed sequentially.
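
This is not Hama's actual code, but to give an idea of the shape of such an
operation, here is a minimal sketch of one of the simplest ones (matrix
transpose) as a single map/reduce pass, using the org.apache.hadoop.mapred
API over a plain-text "row col value" representation. Hama itself keeps
matrices in HBase rather than text files, which this sketch ignores.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TransposeSketch {

  // Input line "i j a_ij" is re-keyed as cell (j, i) of the transpose.
  public static class SwapMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, DoubleWritable> out,
                    Reporter reporter) throws IOException {
      String[] f = line.toString().trim().split("\\s+");
      out.collect(new Text(f[1] + " " + f[0]),
                  new DoubleWritable(Double.parseDouble(f[2])));
    }
  }

  // Identity reduce: exactly one value arrives per cell of the transpose.
  public static class CellReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text cell, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> out,
                       Reporter reporter) throws IOException {
      while (values.hasNext()) {
        out.collect(cell, values.next());
      }
    }
  }
}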

Thanks.




-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org