Posted to hdfs-dev@hadoop.apache.org by Thodoris Zois <zo...@ics.forth.gr> on 2018/04/23 16:05:25 UTC

Read or Save specific blocks of a file

Hello list,

I have a file on HDFS that is divided into 10 blocks (partitions).

Is there any way to retrieve data from a specific block (e.g., using
its block ID)?

Besides that, is there any option to write the contents of each block
(or of a single block) into a separate file?

Thank you very much,
Thodoris 


 

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org


Re: Read or Save specific blocks of a file

Posted by Thodoris Zois <zo...@ics.forth.gr>.
Thank you very much for your answers. I will probably search for the
block ID and then read the block directly from the local file system.
I need it for a specific purpose!

- Thodoris





RE: Read or Save specific blocks of a file

Posted by Takanobu Asanuma <ta...@yahoo-corp.jp>.
In addition to others' comments, I think the fsck command below is the easiest way to find the block locations of a file.

$ hdfs fsck /path/to/the/data -blocks -files -locations

Thanks,
- Takanobu
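For scripting, the block IDs can be pulled out of the fsck output with a short filter. The sample line below is an assumption about the output format (it varies across Hadoop releases), so treat this as a sketch rather than a parser for every version:

```python
import re

# Hypothetical line from `hdfs fsck <path> -blocks -files -locations`;
# the exact format differs between Hadoop versions.
SAMPLE_FSCK_LINE = (
    "0. BP-1234567-10.0.0.1-1500000000000:blk_1073741825_1001 "
    "len=134217728 Live_repl=3"
)

def extract_block_ids(fsck_output: str) -> list:
    """Pull blk_<id> tokens out of fsck output.

    Block files on disk are named blk_<numeric id>; the trailing
    _<generation stamp> is not part of the data file name, so it is
    stripped here.
    """
    return re.findall(r"(blk_\d+)_\d+", fsck_output)

print(extract_block_ids(SAMPLE_FSCK_LINE))  # ['blk_1073741825']
```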


Re: Read or Save specific blocks of a file

Posted by Jim Clampffer <ja...@gmail.com>.
If you want to read replicas from a specific DN after determining the
block bounds via getFileBlockLocations, you could abuse the rack
locality infrastructure: generate a dummy topology script so that the
NN orders replicas such that the client tries to read from the DNs you
prefer first.
It won't guarantee a read from a specific DN, and it is a terrible
idea in a multi-tenant/production cluster, but if you have a very
specific goal in mind, or want to learn more about the storage layer,
it may be an interesting exercise.
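A topology script can be any executable: Hadoop invokes it (via net.topology.script.file.name) with datanode IPs/hostnames as arguments and reads one rack path per argument from stdout. A toy sketch of the trick described above; the IPs and rack names are made up:

```python
#!/usr/bin/env python3
"""Toy topology script: steer the NN toward preferred datanodes.

Hadoop calls this script with one or more datanode addresses as
arguments and expects one rack path per address on stdout. Mapping the
DNs you want read first onto the client's "rack" makes the NN sort
their replicas first. The addresses below are hypothetical.
"""
import sys

# Hypothetical set of datanodes to pretend are close to the client.
PREFERRED = {"10.0.0.11", "10.0.0.12"}

def rack_for(host: str) -> str:
    """Return a fake rack path: 'near' for preferred DNs, 'far' otherwise."""
    return "/near-rack" if host in PREFERRED else "/far-rack"

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```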


Re: Read or Save specific blocks of a file

Posted by Arpit Agarwal <aa...@hortonworks.com>.
Hi,

Perhaps I missed something in the question: FileSystem#getFileBlockLocations, followed by open, seek to the start of the target block, and read. This will let you read the contents of a specific block using the public APIs.
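The seek/read recipe above is just offset arithmetic. A minimal sketch, using an ordinary in-memory stream as a stand-in for the FSDataInputStream returned by FileSystem#open (the block size and contents here are made up; a real dfs.blocksize is typically 128 MB):

```python
import io

BLOCK_SIZE = 8  # stand-in for the real dfs.blocksize

def read_block(stream, block_index: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Seek to the start of the target block and read at most one block.

    With a real FSDataInputStream the idea is the same:
    seek(block_index * block_size) followed by a bounded read.
    """
    stream.seek(block_index * block_size)
    return stream.read(block_size)

# Three full "blocks" of 8 bytes each, plus a short tail block.
data = io.BytesIO(b"AAAAAAAABBBBBBBBCCCCCCCCDD")
print(read_block(data, 1))  # b'BBBBBBBB'
print(read_block(data, 3))  # b'DD' (the last block is shorter)
```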





Re: Read or Save specific blocks of a file

Posted by Daniel Templeton <da...@cloudera.com>.
I'm not aware of a way to work with blocks using the public APIs. The
easiest way to do it is probably to retrieve the block IDs and then
grab those blocks directly from the datanodes' local file systems.

Daniel
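Once you have a block ID, locating the replica on a datanode's disk amounts to searching the configured data directories for a file named blk_<id>. The nested finalized/subdir layout mimicked below matches recent HDFS versions but should be treated as an assumption; the sketch builds a throwaway tree so it runs anywhere:

```python
import pathlib
import tempfile

def find_block_file(data_dir: pathlib.Path, block_id: int):
    """Return the path of blk_<block_id> under a datanode data dir, or None.

    HDFS stores finalized replicas in nested subdir*/ directories, so a
    recursive glob for the exact file name is the simplest search.
    """
    matches = list(data_dir.rglob(f"blk_{block_id}"))
    return matches[0] if matches else None

# Throwaway tree imitating a datanode data directory (hypothetical names).
root = pathlib.Path(tempfile.mkdtemp())
replica_dir = (root / "current" / "BP-1-10.0.0.1-1" / "current"
               / "finalized" / "subdir0")
replica_dir.mkdir(parents=True)
(replica_dir / "blk_1073741825").write_bytes(b"block contents")

found = find_block_file(root, 1073741825)
print(found.read_bytes())  # b'block contents'
```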


