Posted to common-user@hadoop.apache.org by Rasit OZDAS <ra...@gmail.com> on 2009/02/10 14:05:32 UTC

Copying a file to specified nodes

Hi,

We have thousands of files, each dedicated to a user.  (Each user has
access to other users' files, but rarely needs them.)
Each user runs map-reduce jobs on the cluster.
So we want to spread each user's files evenly across the cluster,
so that every machine can take part in processing them (assuming that
user is the only one running jobs).
For this we would initially need to copy each file to a specified node:
User A :   first file: Node 1, second file: Node 2, etc.
User B :   first file: Node 1, second file: Node 2, etc.

I know Hadoop also creates replicas, but with our approach at least one
copy of each file would be in the right place
(and we're willing to control the placement of the other replicas too).

Rebalancing is not a problem either, assuming it takes into account
how heavily each machine is used.
It might even lead to a better organization of the files.

How can we copy files to specified nodes?
Or do you have a better solution for us?

I couldn't find a solution to this; probably such an option doesn't exist.
But I wanted to get an expert's opinion on it.

Thanks in advance..
Rasit

Re: Copying a file to specified nodes

Posted by Rasit OZDAS <ra...@gmail.com>.
Yes, I've tried the long solution: when I execute ./hadoop dfs -put ...
from a datanode, one copy always gets written to that datanode.

But I think I would have to use SSH for this.
Does anybody know a better way?
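
In case it helps, something like this is what I have in mind (just a
sketch; the hostnames, the local directory and the HDFS directory are
made up, and it assumes passwordless SSH to each datanode):

#!/bin/bash
# Round-robin one user's files over the datanodes.  Running
# "hadoop dfs -put" on a datanode makes the first replica of each
# block land on that datanode; the namenode still places the
# remaining replicas as usual.

NODES=(node1 node2 node3)          # hypothetical datanode hostnames
LOCAL_DIR=/local/userA             # hypothetical local dir with user A's files
HDFS_DIR=/data/users/userA         # hypothetical HDFS target dir

i=0
for f in "$LOCAL_DIR"/*; do
  node=${NODES[$((i % ${#NODES[@]}))]}
  name=$(basename "$f")
  scp "$f" "$node:/tmp/$name"                 # stage the file on the chosen node
  ssh "$node" "hadoop dfs -put /tmp/$name $HDFS_DIR/$name && rm /tmp/$name"
  i=$((i + 1))
done

Of course the namenode still decides where the other replicas go, so only
the first copy is guaranteed to land on the node we picked.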

Thanks,
Rasit

-- 
M. Raşit ÖZDAŞ

Re: Copying a file to specified nodes

Posted by Rasit OZDAS <ra...@gmail.com>.
Thanks, Jeff.
After looking at the JIRA link you gave and doing some investigation:

It seems that this JIRA ticket didn't draw much attention, so it will
take a long time to be addressed.
After some more investigation I found out that when I copy a file to
HDFS from a specific DataNode, the first copy is written to that
DataNode itself. This approach will take a while to implement, I think,
but we definitely need this feature, so if we have no other choice
we'll go through with it.
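
To confirm where the copies actually end up, fsck seems to be enough
(the path here is just an example):

  hadoop fsck /data/users/userA/file1 -files -blocks -locations

The -locations output lists the datanodes that hold each block, so it's
easy to check that the node the put was run from is among them.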

Any further info (or comments on my solution) is appreciated.

Cheers,
Rasit

-- 
M. Raşit ÖZDAŞ

Re: Copying a file to specified nodes

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Rasit,

I'm not sure I fully understand your description of the problem, but
you might want to check out the JIRA ticket for making the replica
placement algorithms in HDFS pluggable
(https://issues.apache.org/jira/browse/HADOOP-3799) and add your use
case there.

Regards,
Jeff
