Posted to user@hadoop.apache.org by Thodoris Zois <zo...@ics.forth.gr> on 2018/04/23 16:05:48 UTC

Read or save specific blocks of a file

Hello list,

I have a file on HDFS that is divided into 10 blocks (partitions). 

Is there any way to retrieve data from a specific block (e.g., using
its blockID)?

Besides that, is there any option to write the contents of each block
(or of one block) into separate files?

Thank you very much,
Thodoris 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: Read or save specific blocks of a file

Posted by ayan guha <gu...@gmail.com>.
Is this a recommended way of reading data in the long run? I think it might
be better to write, or look for, an InputFormat which supports the need.

By the way, a block is designed to be an HDFS-internal representation that
enables certain features. It would be interesting to understand the use case
where a client app really needs to know about it; without that context, it
sounds like a questionable design.

Best
Ayan

On Fri, 4 May 2018 at 1:46 am, Thodoris Zois <zo...@ics.forth.gr> wrote:

> Hello Madhav,
>
> What I did is pretty straightforward. Let's say that your HDFS block is
> 128 MB and you store a 256 MB file named Test.csv in HDFS.
>
> First use the command: `hdfs fsck Test.csv -locations -blocks -files`. It
> will return some very useful information, including the list of blocks.
> Let's say that you want to read the first block (block 0). On the line
> that corresponds to block 0 you can find the IP of the machine that holds
> this specific block on its local file system, as well as the blockName
> (BP-1737920335-xxx.xxx.x.x-1510660262864) and blockID (e.g.,
> blk_1073760915_20091) that will help you recognize it later. So what you
> need from fsck is the blockName, the blockID, and the IP of the machine
> that has the specific block you are interested in.
>
> Once you have these, you have everything you need. All you have to do is
> connect to that IP and execute: `find
> /data/hdfs-data/datanode/current/blockName/current/finalized/subdir0/ -name
> blockID`. That command will return the full path where you can find the
> contents of your file Test.csv that correspond to one block in HDFS.
>
> After I get the full path, I copy the file, remove the last line (because
> there is a good chance that the last line continues into the next block),
> and store it again in HDFS with the desired name. Then I can access one
> block of file Test.csv from HDFS. That's all; if you need any further
> information, do not hesitate to contact me.
>
> - Thodoris
>
>
> On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
>
> Thodoris,
>
> I certainly would be interested in knowing how you were able to identify
> individual blocks and read from them. My understanding was that the HDFS
> protocol abstracts this from consumers to prevent potential data
> corruption issues. I would appreciate it if you could share some details
> of your approach.
>
> Thanks!
> madhav
>
> On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois <zo...@ics.forth.gr> wrote:
>
> That’s what I did :) If you need further information I can post my
> solution.
>
> - Thodoris
>
> On 30 Apr 2018, at 22:23, David Quiroga <qu...@gmail.com> wrote:
>
> There might be a better way... but I wonder if it might be possible to
> access the node where the block is stored and read it from the local file
> system rather than from HDFS.
>
> On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <zo...@ics.forth.gr> wrote:
>
> Hello list,
>
> I have a file on HDFS that is divided into 10 blocks (partitions).
>
> Is there any way to retrieve data from a specific block (e.g., using
> its blockID)?
>
> Besides that, is there any option to write the contents of each block
> (or of one block) into separate files?
>
> Thank you very much,
> Thodoris
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: user-help@hadoop.apache.org
>
>
>
--
Best Regards,
Ayan Guha

Re: Read or save specific blocks of a file

Posted by Thodoris Zois <zo...@ics.forth.gr>.
Hello Madhav,

What I did is pretty straightforward. Let's say that your HDFS block
is 128 MB and you store a 256 MB file named Test.csv in HDFS.

First use the command: `hdfs fsck Test.csv -locations -blocks -files`.
It will return some very useful information, including the list of
blocks. Let's say that you want to read the first block (block 0). On
the line that corresponds to block 0 you can find the IP of the machine
that holds this specific block on its local file system, as well as the
blockName (BP-1737920335-xxx.xxx.x.x-1510660262864) and blockID (e.g.,
blk_1073760915_20091) that will help you recognize it later. So what
you need from fsck is the blockName, the blockID, and the IP of the
machine that has the specific block you are interested in.
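The fsck step above can be scripted. A minimal sketch in Python; note that the exact layout of an fsck block line varies across Hadoop versions, so the example line and regex below are assumptions modeled on the fields mentioned above (blockName, blockID, replica IPs):

```python
import re

def parse_fsck_block_line(line):
    """Pull blockName (BP-...), blockID (blk_<id>_<genstamp>) and replica IPs
    out of one block line of `hdfs fsck <path> -files -blocks -locations`."""
    m = re.search(r'(BP-[\w.\-]+):(blk_\d+_\d+)', line)
    if m is None:
        return None  # not a block line
    # Every IP:port inside a DatanodeInfoWithStorage[...] entry is a replica.
    ips = re.findall(r'(\d+\.\d+\.\d+\.\d+):\d+', line)
    return {"blockName": m.group(1), "blockID": m.group(2), "ips": ips}

# A made-up line in the shape fsck printed for the author:
line = ("0. BP-1737920335-10.0.0.5-1510660262864:blk_1073760915_20091 "
        "len=134217728 Live_repl=1 "
        "[DatanodeInfoWithStorage[10.0.0.7:9866,DS-8c2a,DISK]]")
info = parse_fsck_block_line(line)
```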
Once you have these, you have everything you need. All you have to do
is connect to that IP and execute: `find /data/hdfs-
data/datanode/current/blockName/current/finalized/subdir0/ -name
blockID`. That command will return the full path where you can find
the contents of your file Test.csv that correspond to one block in
HDFS.
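A small helper (an illustration, not from the thread) that assembles that find invocation from the fsck output. Two hedges: the datanode data directory is site-specific (it comes from dfs.datanode.data.dir in hdfs-site.xml; /data/hdfs-data/datanode is this cluster's value), and on many versions the on-disk data file is named blk_<id> without the trailing generation stamp (the stamp appears on the .meta file), so matching a prefix and searching all of finalized/ rather than only subdir0/ is more robust:

```python
def build_find_command(block_name, block_id,
                       data_dir="/data/hdfs-data/datanode"):
    # Search the whole finalized/ tree: blocks are spread across subdir*/
    # directories, not only subdir0/.
    root = f"{data_dir}/current/{block_name}/current/finalized/"
    # blk_1073760915_20091 -> blk_1073760915 (drop the generation stamp,
    # which usually only appears in the .meta file name on disk).
    base = "_".join(block_id.split("_")[:2])
    return ["find", root, "-name", base + "*"]

cmd = build_find_command("BP-1737920335-10.0.0.5-1510660262864",
                         "blk_1073760915_20091")
# Run it on the datanode, e.g. with subprocess.run(cmd, capture_output=True).
```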
After I get the full path, I copy the file, remove the last line
(because there is a good chance that the last line continues into the
next block), and store it again in HDFS with the desired name. Then I
can access one block of file Test.csv from HDFS. That's all; if you
need any further information, do not hesitate to contact me.
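The last-line trim is easy to get wrong, so here is the byte-level version of that step (a sketch, not the author's script). Note the mirror-image caveat: every block after the first also starts with the tail of a line from the previous block, which this step does not fix.

```python
def trim_partial_last_line(data: bytes) -> bytes:
    """Drop everything after the last newline: the final line of a block
    usually continues into the next block."""
    end = data.rfind(b"\n")
    return data[:end + 1] if end != -1 else b""

# 'gam' is the start of a row whose remainder sits in the next block.
chunk = b"alpha,1\nbeta,2\ngam"
trimmed = trim_partial_last_line(chunk)
```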
- Thodoris

On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
> Thodoris,
> 
> 
> I certainly would be interested in knowing how you were able to
> identify individual blocks and read from them. My understanding was
> that the HDFS protocol abstracts this from consumers to prevent
> potential data corruption issues. I would appreciate it if you could
> share some details of your approach.
> 
> 
> Thanks!
> madhav
> On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois <zo...@ics.forth.gr>
> wrote:
> > That’s what I did :) If you need further information I can post my
> > solution.
> > 
> > - Thodoris
> > On 30 Apr 2018, at 22:23, David Quiroga <qu...@gmail.com>
> > wrote:
> > 
> > > There might be a better way... but I wonder if it might be
> > > possible to access the node where the block is stored and read it
> > > from the local file system rather than from HDFS.
> > > On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <zois@ics.forth.g
> > > r> wrote:
> > > > Hello list,
> > > >
> > > > I have a file on HDFS that is divided into 10 blocks (partitions).
> > > >
> > > > Is there any way to retrieve data from a specific block (e.g.,
> > > > using its blockID)?
> > > >
> > > > Besides that, is there any option to write the contents of each
> > > > block (or of one block) into separate files?
> > > >
> > > > Thank you very much,
> > > > Thodoris
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> > > > For additional commands, e-mail: user-help@hadoop.apache.org

Re: Read or save specific blocks of a file

Posted by Thodoris Zois <zo...@ics.forth.gr>.
That’s what I did :) If you need further information I can post my solution.

- Thodoris

> On 30 Apr 2018, at 22:23, David Quiroga <qu...@gmail.com> wrote:
> 
> There might be a better way... but I wonder if it might be possible to access the node where the block is stored and read it from the local file system rather than from HDFS.
> 
>> On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <zo...@ics.forth.gr> wrote:
>> Hello list,
>> 
>> I have a file on HDFS that is divided into 10 blocks (partitions). 
>> 
>> Is there any way to retrieve data from a specific block (e.g., using
>> its blockID)?
>> 
>> Besides that, is there any option to write the contents of each block
>> (or of one block) into separate files?
>> 
>> Thank you very much,
>> Thodoris 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: user-help@hadoop.apache.org
>> 
> 

Re: Read or save specific blocks of a file

Posted by David Quiroga <qu...@gmail.com>.
There might be a better way... but I wonder if it might be possible to
access the node where the block is stored and read it from the local file
system rather than from HDFS.

On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <zo...@ics.forth.gr> wrote:

> Hello list,
>
> I have a file on HDFS that is divided into 10 blocks (partitions).
>
> Is there any way to retrieve data from a specific block (e.g., using
> its blockID)?
>
> Besides that, is there any option to write the contents of each block
> (or of one block) into separate files?
>
> Thank you very much,
> Thodoris
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: user-help@hadoop.apache.org
>
>