Posted to hdfs-user@hadoop.apache.org by Arjun Bakshi <ba...@mail.uc.edu> on 2014/07/24 19:41:24 UTC

Building custom block placement policy. What is srcPath?

Hi,

I want to write a block placement policy that takes the size of the file 
being placed into account, something like what is done in the CoHadoop or 
BEEMR papers. I have the following questions:

1- What is srcPath in chooseTarget? Is it the path to the original 
un-chunked file, or is it a path to a single block, or something else? I 
added some code to BlockPlacementPolicyDefault to print out the value of 
srcPath, but the results look odd.

2- Will a simple new File(srcPath) do?

3- I've spent time looking at the Hadoop source code. I can't find a way 
to go from srcPath in chooseTarget to a file size. Every function that I 
think could do it, in FSNamesystem, FSDirectory, etc., is either 
non-public or cannot be called from inside the blockmanagement package 
or the block placement class.

How do I go from srcPath in the block placement class to the size of the 
file being placed?

Thank you,

AB

Re: Building custom block placement policy. What is srcPath?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Inline.

On Fri, Jul 25, 2014 at 2:55 AM, Arjun Bakshi <ba...@mail.uc.edu> wrote:
> Hi,
>
> Thanks for the reply. It cleared up a few things.
>
> I hadn't thought of under-replication situations, but I'll give them some
> thought now. That case should be easier since, as you've mentioned, by that
> time the NameNode knows all the blocks that came from the same file as the
> under-replicated block.
>
> For the most part, I was thinking of when a new file is being placed on the
> cluster. I think this is what you called in-progress files. Say a new 1GB
> file needs to be placed onto the cluster. I want to make the system take
> the file's 1GB size into account while placing all of its blocks onto nodes
> in the cluster.

You are assuming that all files are "loaded" into the cluster from an
existing file on another FS, such as a local FS via the command
'hadoop fs -put', and that you can therefore "know" in advance what the
entire file length is going to be. That assumption is incorrect.

Programs can also write streams of arbitrary data based on their needs.
An HDFS writer can simply create a new file and write any number of
bytes to its output stream. To HDFS this is no different from a "load";
it treats both kinds of writes the same way - it is merely the client's
goal that differs.
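
For example, a writer along these lines (a minimal sketch against the
public FileSystem API; the output path and stdin source are just made
up for illustration) keeps writing until its input runs out, and
neither the client nor the NameNode knows the final length until
close():

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class StreamingWriter {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FSDataOutputStream out = fs.create(new Path("/tmp/stream.dat"));
      byte[] buf = new byte[4096];
      int n;
      // Copy stdin into HDFS. Each time the current block fills up,
      // the client asks the NameNode for a new one; the total length
      // is unknown until the stream is closed.
      while ((n = System.in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      out.close(); // only now is the file length final
    }
  }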

> I'm not clear on where the file is broken down into blocks/chunks; in terms
> of which class, which file system (local or HDFS), or where in the process
> flow. Knowing that will help me come up with a solution. Where is the last
> place, in terms of a function or point in the process flow, where I can find
> the name of the original file that is being placed on the system?

The client program chunks the file as it writes. You can look at the
DFSOutputStream class for the client implementation.

> I'm reading the NameNode and FSNamesystem code just to see if I can do what
> I want from there. Any suggestions will be appreciated.
>
> Thank you,
>
> AB
>
>
>
>
> On 07/24/2014 02:12 PM, Harsh J wrote:
>>
>> Hello,
>>
>> (Inline)
>>
>> On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi <ba...@mail.uc.edu>
>> wrote:
>>>
>>> Hi,
>>>
>>> I want to write a block placement policy that takes the size of the file
>>> being placed into account, something like what is done in the CoHadoop or
>>> BEEMR papers. I have the following questions:
>>>
>>> 1- What is srcPath in chooseTarget? Is it the path to the original
>>> un-chunked file, or is it a path to a single block, or something else? I
>>> added some code to BlockPlacementPolicyDefault to print out the value of
>>> srcPath, but the results look odd.
>>
>> The arguments are documented in the interface javadoc:
>>
>> https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61
>>
>> The srcPath is the HDFS path of the file for which block placement
>> targets are being requested.
>>
>>> 2- Will a simple new File(srcPath) do?
>>
>> Please rephrase? The srcPath is not a local file, if that's what you meant.
>>
>>> 3- I've spent time looking at the Hadoop source code. I can't find a way
>>> to go from srcPath in chooseTarget to a file size. Every function that I
>>> think could do it, in FSNamesystem, FSDirectory, etc., is either
>>> non-public or cannot be called from inside the blockmanagement package
>>> or the block placement class.
>>
>> Within the context of a new file creation, the block placement policy
>> is called each time a new block is requested. At that point the file is
>> not complete, so there is no way to determine its actual length; only
>> the requested block size is known. I'm not certain that
>> BlockPlacementPolicy is what will solve your goal.
>>
>>> How do I go from srcPath in the block placement class to the size of the
>>> file being placed?
>>
>> Are you targeting in-progress files or completed files? Completed files
>> would result in placement policy calls only if there is
>> under-replication, replica loss, etc., affecting the original set of
>> block replicas. Only for such operations would you have a possibility of
>> determining the actual full length of the file (as explained above).
>>
>>> Thank you,
>>>
>>> AB
>>
>>
>>
>



-- 
Harsh J

Re: Building custom block placement policy. What is srcPath?

Posted by Arjun Bakshi <ba...@mail.uc.edu>.
Hi,

Thanks for the reply. It cleared up a few things.

I hadn't thought of under-replication situations, but I'll give them 
some thought now. That case should be easier since, as you've mentioned, 
by that time the NameNode knows all the blocks that came from the same 
file as the under-replicated block.

For the most part, I was thinking of when a new file is being placed on 
the cluster. I think this is what you called in-progress files. Say a 
new 1GB file needs to be placed onto the cluster. I want to make the 
system take the file's 1GB size into account while placing all of its 
blocks onto nodes in the cluster.

I'm not clear on where the file is broken down into blocks/chunks; in 
terms of which class, which file system (local or HDFS), or where in the 
process flow. Knowing that will help me come up with a solution. Where 
is the last place, in terms of a function or point in the process flow, 
where I can find the name of the original file that is being placed on 
the system?

I'm reading the NameNode and FSNamesystem code just to see if I can do 
what I want from there. Any suggestions will be appreciated.

Thank you,

AB



On 07/24/2014 02:12 PM, Harsh J wrote:
> Hello,
>
> (Inline)
>
> On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi <ba...@mail.uc.edu> wrote:
>> Hi,
>>
>> I want to write a block placement policy that takes the size of the file
>> being placed into account, something like what is done in the CoHadoop or
>> BEEMR papers. I have the following questions:
>>
>> 1- What is srcPath in chooseTarget? Is it the path to the original
>> un-chunked file, or is it a path to a single block, or something else? I
>> added some code to BlockPlacementPolicyDefault to print out the value of
>> srcPath, but the results look odd.
> The arguments are documented in the interface javadoc:
> https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61
>
> The srcPath is the HDFS path of the file for which block placement
> targets are being requested.
>
>> 2- Will a simple new File(srcPath) do?
> Please rephrase? The srcPath is not a local file, if that's what you meant.
>
>> 3- I've spent time looking at the Hadoop source code. I can't find a way
>> to go from srcPath in chooseTarget to a file size. Every function that I
>> think could do it, in FSNamesystem, FSDirectory, etc., is either
>> non-public or cannot be called from inside the blockmanagement package or
>> the block placement class.
> Within the context of a new file creation, the block placement policy
> is called each time a new block is requested. At that point the file is
> not complete, so there is no way to determine its actual length; only
> the requested block size is known. I'm not certain that
> BlockPlacementPolicy is what will solve your goal.
>
>> How do I go from srcPath in the block placement class to the size of the
>> file being placed?
> Are you targeting in-progress files or completed files? Completed files
> would result in placement policy calls only if there is
> under-replication, replica loss, etc., affecting the original set of
> block replicas. Only for such operations would you have a possibility of
> determining the actual full length of the file (as explained above).
>
>> Thank you,
>>
>> AB
>
>


Re: Building custom block placement policy. What is srcPath?

Posted by Harsh J <ha...@cloudera.com>.
Hello,

(Inline)

On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi <ba...@mail.uc.edu> wrote:
> Hi,
>
> I want to write a block placement policy that takes the size of the file
> being placed into account, something like what is done in the CoHadoop or
> BEEMR papers. I have the following questions:
>
> 1- What is srcPath in chooseTarget? Is it the path to the original
> un-chunked file, or is it a path to a single block, or something else? I
> added some code to BlockPlacementPolicyDefault to print out the value of
> srcPath, but the results look odd.

The arguments are documented in the interface javadoc:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61

The srcPath is the HDFS path of the file for which block placement
targets are being requested.
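
As a very rough sketch, a custom policy that extends the default one
sees srcPath (and the per-block size) on every call. Note this is
untested, and the chooseTarget parameter list has changed across Hadoop
releases, so copy the exact signature from the BlockPlacementPolicy in
your own source tree rather than from here:

  import java.util.List;
  import java.util.Set;

  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;
  import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
  import org.apache.hadoop.net.Node;

  public class SizeAwarePlacementPolicy extends BlockPlacementPolicyDefault {
    private static final Log LOG =
        LogFactory.getLog(SizeAwarePlacementPolicy.class);

    @Override
    public DatanodeStorageInfo[] chooseTarget(String srcPath,
        int numOfReplicas, Node writer, List<DatanodeStorageInfo> chosen,
        boolean returnChosenNodes, Set<Node> excludedNodes,
        long blocksize) {
      // srcPath is an HDFS path such as "/user/ab/input.dat";
      // blocksize is the size requested for this one block, not the
      // final length of the (still being written) file.
      LOG.info("chooseTarget: srcPath=" + srcPath
          + " blocksize=" + blocksize);
      return super.chooseTarget(srcPath, numOfReplicas, writer, chosen,
          returnChosenNodes, excludedNodes, blocksize);
    }
  }

Setting dfs.block.replicator.classname in hdfs-site.xml is what points
the NameNode at such a class.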

> 2- Will a simple new File(srcPath) do?

Please rephrase? The srcPath is not a local file, if that's what you meant.
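
If the intent was to resolve srcPath at all, the direction would be the
Hadoop Path/FileSystem API rather than java.io.File, which only ever
looks at the local disk. A small sketch (a client-side context is
assumed):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SrcPathProbe {
    public static void main(String[] args) throws Exception {
      String srcPath = args[0]; // e.g. "/user/ab/input.dat"
      FileSystem fs = FileSystem.get(new Configuration());
      // new File(srcPath) would search the local filesystem and find
      // nothing; Path + FileSystem consult the NameNode instead.
      System.out.println(fs.exists(new Path(srcPath)));
    }
  }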

> 3- I've spent time looking at the Hadoop source code. I can't find a way
> to go from srcPath in chooseTarget to a file size. Every function that I
> think could do it, in FSNamesystem, FSDirectory, etc., is either
> non-public or cannot be called from inside the blockmanagement package or
> the block placement class.

Within the context of a new file creation, the block placement policy
is called each time a new block is requested. At that point the file is
not complete, so there is no way to determine its actual length; only
the requested block size is known. I'm not certain that
BlockPlacementPolicy is what will solve your goal.

> How do I go from srcPath in the block placement class to the size of the
> file being placed?

Are you targeting in-progress files or completed files? Completed files
would result in placement policy calls only if there is
under-replication, replica loss, etc., affecting the original set of
block replicas. Only for such operations would you have a possibility of
determining the actual full length of the file (as explained above).
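
For a completed file the length is reachable through the ordinary
client API, as in this sketch; inside the blockmanagement package there
is no equivalent public accessor, which is the wall you ran into:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FileLength {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Only meaningful once the file is complete; for a file still
      // being written, getLen() reflects just the bytes visible so far.
      long len = fs.getFileStatus(new Path(args[0])).getLen();
      System.out.println(args[0] + " is " + len + " bytes");
    }
  }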

> Thank you,
>
> AB



-- 
Harsh J
