Posted to common-user@hadoop.apache.org by Steve Loughran <st...@apache.org> on 2009/03/02 12:19:15 UTC

Re: How does NVidia GPU compare to Hadoop/MapReduce

Dan Zinngrabe wrote:
> On Fri, Feb 27, 2009 at 11:21 AM, Doug Cutting <cu...@apache.org> wrote:
>> I think they're complementary.
>>
>> Hadoop's MapReduce lets you run computations on up to thousands of computers
>> potentially processing petabytes of data.  It gets data from the grid to
>> your computation, reliably stores output back to the grid, and supports
>> grid-global computations (e.g., sorting).
>>
>> CUDA can make computations on a single computer run faster by using its GPU.
>>  It does not handle co-ordination of multiple computers, e.g., the flow of
>> data in and out of a distributed filesystem, distributed reliability, global
>> computations, etc.
>>
>> So you might use CUDA within mapreduce to more efficiently run
>> compute-intensive tasks over petabytes of data.
>>
>> Doug
> 
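To make Doug's point concrete, here is a minimal sketch of a map task that
hands each record to a native GPU routine. The gpuHash native method and
the "gpuhash" library are hypothetical stand-ins for whatever JNI binding
you would write around your CUDA kernel; the Hadoop side is just the
standard Mapper API.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class GpuHashMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical JNI entry point wrapping a CUDA kernel; not a real library.
    private static native byte[] gpuHash(byte[] input);

    static {
      System.loadLibrary("gpuhash"); // assumed name of the native wrapper
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Text.getBytes() returns the backing array; only getLength() bytes are valid.
      byte[] record = new byte[value.getLength()];
      System.arraycopy(value.getBytes(), 0, record, 0, record.length);

      // Offload the per-record computation to the card, then hand the digest
      // back to the framework as an ordinary key/value pair.
      context.write(value, new Text(gpuHash(record)));
    }
  }

Nothing changes from Hadoop's point of view: the framework still schedules
tasks, moves data, and handles failures, which is exactly the division of
labour Doug describes.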
> I actually did some work with this several months ago, using a
> consumer-level NVIDIA card. I found a couple of interesting things:
> - I used JOGL and OpenGL shaders rather than CUDA, as at least at the
> time there was no reasonable way to talk to CUDA from Java. That
> made a number of things more complicated; CUDA certainly makes things
> simpler. For the particular problem I was working with, GLSL was fine,
> though CUDA would have simplified things.
> - The problem set I was working with involved creating and searching
> large volumes of hashes - 3-4 TB of them at a time.
> - Only 2 of my nodes in an 8 node cluster had accelerators, but they
> had a dramatic effect on performance. I do not have any of my test
> results handy, but for this particular problem the accelerators cut
> the job time in half or more.

That's interesting, as it means the power budget of the overall workload 
ought to be lower.
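
As a back-of-envelope check on "half or more", assuming the job was
throughput-bound and the work was spread evenly: halving the job time means
the 8-node cluster had to match the throughput of 16 plain nodes, so the 2
accelerated nodes had to cover what 10 plain nodes would have. The figures
below are purely illustrative.

  // Illustrative only: infers the per-node speedup from Dan's numbers.
  double plainNodes = 6;        // nodes without accelerators
  double gpuNodes = 2;          // nodes with accelerators
  double targetThroughput = 16; // halving job time ~ doubling throughput
  // 6 + 2k = 16  =>  k = 5: each GPU node matched roughly five plain nodes.
  double perNodeSpeedup = (targetThroughput - plainNodes) / gpuNodes;

If each accelerated node really is doing the work of several plain ones,
the power argument follows directly.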

> 
> I would agree with Doug that the two are complementary, though there
> are some similarities. Working with the GPU means you are limited by
> how much texture memory is available for storage (compared to HDFS,
> not much!), and the cost of getting data on and off the card can be
> high. Like many Hadoop jobs, the overhead of getting data in and
> starting a task can easily be greater than the length of the task
> itself. For what I was doing, it was a good fit - but for many, many
> problems it would not be the right solution.

Yes, and you will need more disk I/O capacity per node if each node is 
capable of more computation, unless you have very CPU-intensive workloads.
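
Dan's transfer-cost point is worth quantifying. A rough break-even test,
where every figure below is an assumption for illustration rather than a
measurement: offloading only pays if shipping the data across the bus both
ways, plus the kernel time, still beats doing the work on the CPU.

  // All figures are assumed for illustration, not measured.
  double bytesPerTask   = 64e6;  // data shipped to the card per task
  double busBytesPerSec = 4e9;   // assumed host-to-device bandwidth
  double gpuComputeSec  = 0.05;  // assumed kernel time on the card
  double cpuComputeSec  = 0.50;  // assumed time for the same work on the CPU

  // Copy on, compute, copy off; the pessimistic version charges the bus
  // for both directions even if the result is small.
  double gpuTotalSec = 2 * bytesPerTask / busBytesPerSec + gpuComputeSec;
  boolean worthOffloading = gpuTotalSec < cpuComputeSec; // here: 0.082 < 0.5

The same arithmetic applies one level up: a node that computes several
times faster needs several times the input bandwidth to stay busy, hence
the disk I/O point above.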