Posted to common-user@hadoop.apache.org by Bill Q <bi...@gmail.com> on 2013/11/18 17:29:45 UTC

Writing applications in YARN

Hi,
I just started writing applications on top of YARN, and I have a few
questions I would love to get some opinions on.

What needs to be done: I have a set of files, and an intensive
computation needs to be run on each of them separately. The amount of
computation is determined by the file size.

What I have: 20 machines, 5 of which have more memory (128G) than the
others (32G).

The approach:
1. Identify the machines with 128G of memory and create a 64G container
on each of them; identify the machines with 32G of memory and create a
28G container on each of them (see the first sketch below for how I
picture the requests).

Then two options:
2a). Copy the larger files to the local disks of the 128G machines, and
the smaller files to the local disks of the 32G machines. Then create
containers that point to those local files accordingly (sketched below).

2b). Copy the files to HDFS, and somehow tell HDFS which data nodes I
would prefer each file to be placed on. This sounds like a lot of work
with block location tracking, but there might be better alternatives.

Would steps 1 and 2a) be applicable?
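
For 2a), once a container comes back allocated on the node that holds
the file, I picture launching it with a command that points at the local
path, roughly like this (a sketch only; the binary and file paths are
made up, and "container" is the Container the RM handed back for that
node):

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    NMClient nm = NMClient.createNMClient();
    nm.init(new YarnConfiguration());
    nm.start();

    // Run the processing binary against the file on the node's local disk,
    // redirecting stdout/stderr into the container's log directory.
    List<String> commands = Collections.singletonList(
        "/opt/myapp/bin/process /data/local/big-file-001"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");

    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
        null, null, commands, null, null, null);
    nm.startContainer(container, ctx);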

What about step 2b)?

In general, in YARN applications, can the system create containers on
the nodes that host the data, so that it automatically leverages data
locality?

Many thanks for your opinions and insights.



-- 
Many thanks.


Bill