Posted to hdfs-user@hadoop.apache.org by Sesha Kumar <se...@gmail.com> on 2012/01/16 15:20:08 UTC

Data processing in DFSClient

Hey guys,

Sorry for the typo in my last message. I have corrected it.

I would like to perform some additional processing on the data that is
streamed to DFSClient. To my knowledge, the class DFSInputStream manages the
stream operations on the client side whenever a file is being read, but I
don't know which class should be modified to add this additional processing
capability to the datanode. Please clarify.


Thanks in advance
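
If the extra processing can run on the client instead of the datanode, one
option that avoids patching Hadoop is to wrap the stream the client reads
from, since an HDFS read stream is an ordinary java.io.InputStream. The
sketch below is only an illustration under that assumption: the class name
UppercasingInputStream and the uppercasing transform are hypothetical
stand-ins, and a ByteArrayInputStream would replace the real HDFS stream
when trying it out.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper: applies a per-byte transform to data read from the wrapped stream. */
final class UppercasingInputStream extends FilterInputStream {

    UppercasingInputStream(InputStream in) {
        super(in);
    }

    /** Placeholder for the "additional processing": uppercase ASCII letters. */
    private static byte process(byte b) {
        return (b >= 'a' && b <= 'z') ? (byte) (b - 32) : b;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        // -1 signals end of stream and must be passed through untouched.
        return b < 0 ? b : (process((byte) b) & 0xFF);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = process(buf[i]);
        }
        return n;
    }

    /** Helper for experiments: drains the source through the wrapper into a String. */
    static String readAll(InputStream source) {
        try (InputStream in = new UppercasingInputStream(source)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This keeps the processing out of the HDFS source tree entirely. It does not
help if the work has to happen on the datanode before the bytes are sent.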

Re: Data processing in DFSClient

Posted by Joey Echeverria <jo...@cloudera.com>.
Personally, I would just use HAR :) It sounds like an interesting
project, though. You might find this document helpful:

http://kazman.shidler.hawaii.edu/ArchDoc.html

It was designed to help contributors navigate the HDFS source tree.

-Joey

On Thu, Jan 19, 2012 at 11:52 AM, Sesha Kumar <se...@gmail.com> wrote:
>  I'm currently working on this paper where we try to improve the performance
> of HDFS by combining small files into a single file (like HAR), but this
> merged file contains at the beginning of each block an index file which is
> similar to HAR index file. Datanode uses this index file to obtain the small
> file from the block. We use some metadata to find out the block containing
> the desired file in the merged file.



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Data processing in DFSClient

Posted by Sesha Kumar <se...@gmail.com>.
I'm currently working on a paper in which we try to improve the
performance of HDFS by combining small files into a single file (like HAR),
but the merged file contains, at the beginning of each block, an index
that is similar to the HAR index file. The datanode uses this index to
obtain the small file from the block. We use some metadata to find out which
block of the merged file contains the desired file.
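
A minimal sketch of the layout described above, to make the idea concrete.
The byte layout here is an assumption made for illustration (entry count,
then name/offset/length records, then the file contents); it is not the
paper's format and not the real HAR index format. The class and method
names are hypothetical, and a plain byte array stands in for an HDFS block.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

/** Hypothetical block layout: [count][nameLen, name, offset, length]* followed by file bytes. */
final class MergedBlock {

    /** Packs small files into one byte array with the index at the front. */
    static byte[] pack(Map<String, byte[]> files) {
        int indexSize = 4; // entry count
        int payloadSize = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            indexSize += 4 + name.length + 4 + 4; // name length, name, offset, length
            payloadSize += e.getValue().length;
        }

        ByteBuffer block = ByteBuffer.allocate(indexSize + payloadSize);
        block.putInt(files.size());
        int offset = indexSize; // offsets are absolute positions within the block
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            block.putInt(name.length).put(name).putInt(offset).putInt(e.getValue().length);
            offset += e.getValue().length;
        }
        for (byte[] data : files.values()) {
            block.put(data);
        }
        return block.array();
    }

    /** Scans the index at the start of the block; returns the file's bytes or null if absent. */
    static byte[] extract(byte[] blockBytes, String wanted) {
        ByteBuffer block = ByteBuffer.wrap(blockBytes);
        int count = block.getInt();
        for (int i = 0; i < count; i++) {
            byte[] name = new byte[block.getInt()];
            block.get(name);
            int offset = block.getInt();
            int length = block.getInt();
            if (wanted.equals(new String(name, StandardCharsets.UTF_8))) {
                return Arrays.copyOfRange(blockBytes, offset, offset + length);
            }
        }
        return null;
    }
}
```

In this sketch the extraction step only needs the one block, which is the
property that would let a datanode serve a small file without consulting
any other block. Small files that straddle a block boundary are not handled.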

RE: Data processing in DFSClient

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
I did not fully understand your idea, but before that happens the file will be split into blocks. How will you extract the small file then?
Why can't you create a tar file plus one index file that maintains the index of each file, or something like that?

________________________________
From: Sesha Kumar [sesha911@gmail.com]
Sent: Wednesday, January 18, 2012 8:24 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Data processing in DFSClient


Sorry for the delay. I'm trying to implement an IEEE paper which combines a bunch of files into a single file and when the file is requested the datanode extracts the desired file from the block and sends the file to DFSClient.

Re: Data processing in DFSClient

Posted by Sesha Kumar <se...@gmail.com>.
Sorry for the delay. I'm trying to implement an IEEE paper that combines a
bunch of files into a single file. When one of those files is requested, the
datanode extracts it from the block and sends it to DFSClient.

Re: Data processing in DFSClient

Posted by Joey Echeverria <jo...@cloudera.com>.
Sesha,

What kind of processing are you attempting to do? Maybe it makes more sense
to just implement a MapReduce job rather than modifying the datanodes?

-Joey

On Mon, Jan 16, 2012 at 9:20 AM, Sesha Kumar <se...@gmail.com> wrote:

> Hey guys,
>
> Sorry for the typo in my last message. I have corrected it.
>
> I would like to perform some additional processing on the data that is
> streamed to DFSClient. To my knowledge, the class DFSInputStream manages the
> stream operations on the client side whenever a file is being read, but I
> don't know which class should be modified to add this additional processing
> capability to the datanode. Please clarify.
>
>
> Thanks in advance
>




RE: Data processing in DFSClient

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Hi Sesha,

Take a look at org.apache.hadoop.hdfs.server.datanode.BlockSender.java

Regards,
Uma
________________________________
From: Sesha Kumar [sesha911@gmail.com]
Sent: Monday, January 16, 2012 7:50 PM
To: hdfs-user@hadoop.apache.org
Subject: Data processing in DFSClient

Hey guys,

Sorry for the typo in my last message. I have corrected it.

I would like to perform some additional processing on the data that is streamed to DFSClient. To my knowledge, the class DFSInputStream manages the stream operations on the client side whenever a file is being read, but I don't know which class should be modified to add this additional processing capability to the datanode. Please clarify.


Thanks in advance