Posted to mapreduce-user@hadoop.apache.org by Ananth Gundabattula <ag...@gmail.com> on 2014/12/14 08:08:01 UTC

Controlling the block placement and the file placement in HDFS writes

Hello All,


I was wondering whether the following issues can be solved by extending
HDFS classes with custom implementations.


Here are my requirements:

1. Is there a way to ensure that all file blocks belonging to a particular
HDFS directory & file go to the same physical datanode (and their
corresponding replicas as well)?

2. Is there a way to control the volume that is used to write a file
block?


Here are the finer details to give some background for the above two
queries:

We are using the Impala engine to analyze our data, and to do so we
generate the files in Parquet format using our custom data processing
pipelines. Ideally, all of the files belonging to the same partition would
be processed by the same node (as our queries can be partitioned by a key
that is common across all of them). In this regard, we would like to
ensure that a given file block in a particular path always lands on the
same physical node, so that the Impala workers processing a given query
send less data across nodes to assemble the result. Hence the first
question.


The second aspect is that our queries are time based, and this time window
follows the familiar pattern of old data rarely being queried. Hence we
would like to keep the most recent data in the HDFS cache (Impala is
helping us manage this aspect via its command set), have the
next-most-recent chunks of data land on an SSD that is present on every
datanode, and let the remaining blocks, which are "very old but in large
quantities", land on spinning disks. The decision to choose a given volume
would be based on the file name, as we control the filename used to
generate the file.

Please note that the criterion for both of the above is the name of the
file, not its size.
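
To make this concrete, the decision we want to plug in looks roughly like
the sketch below. This is purely illustrative: the date-in-filename
convention, the tier names, and the age thresholds are our own
assumptions.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative sketch of the tiering decision, keyed on the file name.
// Assumes our pipelines encode the partition date in the name,
// e.g. "events-2014-12-10.parquet" (our own naming convention).
public class TierChooser {

    public enum Tier { CACHE, SSD, DISK }

    public static Tier chooseTier(String fileName, Date now) throws ParseException {
        // Pull the yyyy-MM-dd portion out of the file name.
        String datePart = fileName.replaceAll("^.*?(\\d{4}-\\d{2}-\\d{2}).*$", "$1");
        Date fileDate = new SimpleDateFormat("yyyy-MM-dd").parse(datePart);
        long ageInDays = (now.getTime() - fileDate.getTime()) / (24L * 3600 * 1000);
        if (ageInDays <= 7)  return Tier.CACHE; // most recent: HDFS cache, managed via Impala
        if (ageInDays <= 30) return Tier.SSD;   // next most recent: SSD volume
        return Tier.DISK;                       // old but plentiful: spinning disks
    }
}

The open question is where in HDFS we could hook such a decision in, which
leads to the two interfaces below.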

For the first query, I have looked into the BlockPlacementPolicy
interface. The method chooseTarget() in that interface gives me a list of
datanodes to choose from, and I would return one of them as the target. My
dilemma is: given the path of a directory, will the DataNodeDescriptors
passed into chooseTarget() remain the same for every invocation against
that directory, or is that not something that can be controlled?
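
For illustration, the selection logic we have in mind is something like
the following, kept deliberately independent of the exact chooseTarget()
signature (which seems to vary between Hadoop versions). The helper and
its hashing scheme are hypothetical, and it only works if the same
candidate set is presented on every invocation for a given path; that is
precisely what I am unsure about.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the deterministic choice we would want chooseTarget() to make.
// Assumes a non-empty candidate list and stable cluster membership.
public class DirectoryAffinityChooser {

    public static String chooseNodeForPath(String srcPath, List<String> datanodeHosts) {
        // Key on the parent directory so that every block of every file in
        // the same partition directory maps to the same datanode.
        String dir = srcPath.substring(0, Math.max(srcPath.lastIndexOf('/'), 0));
        // Sort to get a stable ordering regardless of the order in which
        // the namenode happens to hand us the candidates.
        List<String> sorted = new ArrayList<String>(datanodeHosts);
        Collections.sort(sorted);
        int index = (dir.hashCode() & Integer.MAX_VALUE) % sorted.size();
        return sorted.get(index);
    }
}

If this is viable, my understanding is that a custom policy would be wired
in via the dfs.block.replicator.classname property, but please correct me
if that is not the intended extension point.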

For the second query, I have looked at the VolumeChoosingPolicy interface,
and it appears the only handle I get is the list of current volumes; there
is no information about the incoming file.
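
To illustrate what does seem possible with that contract: as far as we can
tell from the 2.x source, chooseVolume() receives only the candidate
volumes and the replica size, so the most we could key on is the volumes
themselves, e.g. their mount paths. The sketch below prefers volumes
mounted under an "/ssd" path (our own mount-naming convention) and falls
back to round-robin; class and method names are as we read them in the
2.x tree and may differ in other versions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.VolumeChoosingPolicy;

// Sketch only: note that chooseVolume() is never told which file the
// replica belongs to, which is exactly the limitation described above.
public class SsdFirstVolumeChoosingPolicy<V extends FsVolumeSpi>
        implements VolumeChoosingPolicy<V> {

    private final RoundRobinVolumeChoosingPolicy<V> roundRobin =
            new RoundRobinVolumeChoosingPolicy<V>();

    @Override
    public V chooseVolume(List<V> volumes, long replicaSize) throws IOException {
        // All we can inspect is the volume itself (here: its base path)
        // and the size of the incoming replica.
        List<V> ssdVolumes = new ArrayList<V>();
        for (V v : volumes) {
            if (v.getBasePath().contains("/ssd")) {
                ssdVolumes.add(v);
            }
        }
        if (!ssdVolumes.isEmpty()) {
            return roundRobin.chooseVolume(ssdVolumes, replicaSize);
        }
        return roundRobin.chooseVolume(volumes, replicaSize);
    }
}

This would presumably be plugged in through
dfs.datanode.fsdataset.volume.choosing.policy, the same property used for
AvailableSpaceVolumeChoosingPolicy, but it still cannot route by file
name, which is what we actually need.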

Any pointers regarding the above two aspects would be immensely helpful.

Thanks a lot for your time.

Regards,
Ananth