Posted to hdfs-user@hadoop.apache.org by Ittai Zeidman <it...@fashiontraffic.com> on 2012/08/21 05:42:19 UTC

HDFS directory locking and directory polling

Hi,
I'm a noob to HDFS so I might not have understood its purpose, but I've
read around and I think it can solve a problem I'm having.
I'm not sure, so I'd appreciate feedback on whether this is the right
course.
My current use case involves two machines, one hosting an FTP server and
another hosting my application.
Entities external to this issue constantly upload files to the FTP server.
My application listens recursively on a specific directory on the FTP
server, via Apache Camel, downloads each new file locally as it appears and
processes it.
I now need to scale my application so it resides on two different machines,
and the immediate problem arises from the following constraint:
Given the following directory structure:
FTP
 Root-interest-directory
  Dir1
  Dir2
At any given time only one file from each directory can be processed by any
machine, and a given file can be processed by only a single machine
(barring failure and retry).
This problem is currently solved by an in-memory map from dir name to a
queue of files, so I only work on one file of a dir at a time. Since there
is only one machine, it is by definition the only one holding the files.
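
A minimal sketch of that in-memory structure, assuming plain strings for
dir and file names. The class name DirectoryWorkQueue and its methods are
hypothetical, made up for illustration; this is not the actual application
code.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

/** Hypothetical single-process scheduler: at most one in-flight file per dir. */
final class DirectoryWorkQueue {
    // dir name -> files waiting to be processed, in arrival order
    private final Map<String, Queue<String>> pending = new HashMap<>();
    // dirs that currently have a file being processed
    private final Set<String> busyDirs = new HashSet<>();

    /** Queue a newly seen file under its directory. */
    synchronized void enqueue(String dir, String file) {
        pending.computeIfAbsent(dir, d -> new ArrayDeque<>()).add(file);
    }

    /** Next file of dir, or empty if dir is busy or has nothing queued. */
    synchronized Optional<String> next(String dir) {
        Queue<String> q = pending.get(dir);
        if (q == null || q.isEmpty() || busyDirs.contains(dir)) {
            return Optional.empty();
        }
        busyDirs.add(dir);
        return Optional.of(q.remove());
    }

    /** Mark dir as free again once its current file has been processed. */
    synchronized void done(String dir) {
        busyDirs.remove(dir);
    }
}
```

The busy set lives in one JVM's memory, which is why the guarantee
disappears as soon as a second machine runs its own copy.
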
That guarantee of course does not hold, or at least does not scale well,
if multiple machines poll the same FTP server.
What I'm interested in doing with HDFS is to have my applications poll a
directory on HDFS recursively, and once a new file appears they should try
to lock its directory; whoever wins gets to copy the file.
I saw some old HDFS issues about directory locking and file locking but I
wasn't sure whether this functionality is available in the form I
described above.
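
The polling half of that idea can be sketched as follows. To keep it
runnable without a cluster, this uses java.nio.file on a local directory
as a stand-in for HDFS; on HDFS the listing would go through the Hadoop
FileSystem API (for example listStatus) instead, which is not shown here.
RecursivePoller is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical poller: reports files under root that earlier passes did not see. */
final class RecursivePoller {
    private final Path root;
    private final Set<Path> seen = new HashSet<>();

    RecursivePoller(Path root) {
        this.root = root;
    }

    /** One polling pass over the whole tree below root. */
    List<Path> poll() throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(seen::add) // Set.add is true only the first time
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

Each call to poll() returns only the files that appeared since the
previous call. The seen set is again per process, so with two machines
both would see the same new file and would still need a lock to decide
who copies it.
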
I think my questions are:
1. Can I, easily, recursively poll an HDFS directory? (I'm looking into
hadoop-camel for this)
2. Can I, easily, lock an HDFS directory?
3. If the answer to 2 is no, will creating a hostname.lock file in an HDFS
directory by each node work as a manual locking mechanism?
4. Should I try to find a different tool for the job?
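
To make question 3 concrete, here is a rough sketch of a lock file taken
by atomic create-if-absent. It uses java.nio.file on a local directory as
a stand-in; on HDFS the analogous call would presumably be
FileSystem.createNewFile(Path) or create(path, false), and whether those
give the same guarantee should be checked against the Hadoop docs. The
class DirLock and the lock name ".lock" are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical per-directory lock backed by a single marker file. */
final class DirLock {
    // Assumption: all nodes contend for the same file name in each directory.
    static final String LOCK_NAME = ".lock";

    /** Returns true only for the one caller whose create succeeds. */
    static boolean tryLock(Path dir, String owner) throws IOException {
        Path lock = dir.resolve(LOCK_NAME);
        try {
            // Existence check and creation happen as one atomic operation.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            return false; // somebody else holds the lock
        }
        // Record who holds the lock, e.g. the hostname.
        Files.write(lock, owner.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    /** Release the lock; harmless if it is not held. */
    static void unlock(Path dir) throws IOException {
        Files.deleteIfExists(dir.resolve(LOCK_NAME));
    }
}
```

One thing the sketch assumes: every node contends for the same lock name.
If each node created its own hostname.lock, the names would differ and the
creates would never collide, so the hostname is stored inside a shared
lock file instead. A node that crashes while holding the lock leaves the
file behind, so some expiry or cleanup would still be needed.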

I can of course try to find different tools like DB locks and so on, but I
fear the other solutions I've thought of don't scale well and feel very
contrived.

Would appreciate any feedback,
Ittai

-- 
Ittai Zeidman
Server team leader, Fashion Traffic <http://fashiontraffic.com/>
Follow us on: Twitter  <http://twitter.com/fashiontraffic>| Facebook
<http://www.facebook.com/newfashiontraffic>
| Tumblr <http://fashiontrafficblog.tumblr.com/>