Posted to mapreduce-user@hadoop.apache.org by Thoihen Maibam <th...@gmail.com> on 2013/05/11 12:49:14 UTC

Hadoop noob question

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo, etc. upload
bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for
processing,
and after processing how they download those files from HDFS to the local file
system?

I don't think they use the command line hadoop fs -put to upload files, as it
would take too long. Or do they divide the data into, say, 10 parts of 10
petabytes each, compress them, and then use hadoop fs -put?

Or do they use some other tool to upload huge files?

Please help me.

Thanks
thoihen

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Just wanted to bring one thing up.

Using distcp to upload a local file to hdfs might not work if launched from a
gateway host. Gateway hosts are typically configured only to submit jobs and are
only aware of the NN and JT, so mappers running on various data nodes might not
have access to the local fs of the gateway host where the file sits.

distcp from the local fs is possible when the data is first loaded onto the
local fs of one of the datanodes and distcp is then run from there.
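As a rough sketch of that workaround (the staging path and NameNode address
below are illustrative assumptions, not taken from the thread), once the data
sits on a datanode's local fs you would run something like:

hadoop distcp file:///data/staging/logs hdfs://namenode:8020/user/etl/incoming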

Thanks,
Rahul


On Sun, May 12, 2013 at 7:51 PM, Chris Mawata <ch...@gmail.com> wrote:

>  It is being read sequentially but is it not potentially being written on
> multiple drives and since reading is typically faster than writing don't
> you still get a little benefit of parallelism?
>
>
> On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
>
> I had said that if you use distcp to copy data *from localFS to HDFS* then
> you won't be able to exploit parallelism, as the entire file is present on
> a single machine. So no multiple TTs.
>
>  Please comment if you think I am wrong somewhere.
>
>  Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>>  Yes, it's a MR job under the hood. My question was that you wrote
>> that using distcp you lose the benefits of parallel processing of Hadoop.
>> I think the MR job of distcp divides files into individual map tasks based
>> on the total size of the transfer, so multiple mappers would still be
>> spawned if the size of the transfer is huge, and they would work in parallel.
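On the parallelism point above: distcp also lets you cap or raise the number
of map tasks with its -m option. A minimal sketch (the cluster addresses and
paths are placeholder assumptions):

hadoop distcp -m 20 hdfs://src-nn:8020/data/logs hdfs://dst-nn:8020/backup/logs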
>>
>>  Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
>>  On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>>  Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>>  Thanks to both of you!
>>>>
>>>>   Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com> wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>>  example:
>>>>>
>>>>>  hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
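The same command works in the reverse direction (local fs to HDFS) by swapping
the schemes, subject to Rahul's caveat at the top of this message about where
the mappers run. A hedged sketch reusing the host and paths from Nitin's
example:

hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/user/myhome/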
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>>  @Tariq, can you point me to some resource that shows how distcp is
>>>>>> used to upload files from local to hdfs?
>>>>>>
>>>>>>  Isn't distcp a MR job? Wouldn't it need the data to be already
>>>>>> present in Hadoop's fs?
>>>>>>
>>>>>>   Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>>>
>>>>>>> You're welcome :)
>>>>>>>
>>>>>>>  Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>>  Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>>  And, the bigger the files, the less the metadata, and hence the lower
>>>>>>>>> the memory consumption.
>>>>>>>>>
>>>>>>>>>  Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>  IMHO, I think the statement about the NN with regard to block
>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>> files with a combined size of 10 TB, you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> Can distcp be used to copy local-to-hdfs?
>>>>>>>>>>
>>>>>>>>>>  Thanks,
>>>>>>>>>>  Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely right, Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this:
>>>>>>>>>>>>
>>>>>>>>>>>>  Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>>> each object around 200B of metadata gets created. So the NN should be
>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>> in-memory. Actually, memory is the most important metric when it comes to
>>>>>>>>>>>> the NN.
>>>>>>>>>>>>
>>>>>>>>>>>>  Am I correct @Nitin?
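As a rough back-of-the-envelope illustration of that 200 B figure (the file
counts here are made-up assumptions): 100 million files of one block each come
to roughly 200 million namespace objects, which at 200 B apiece is on the order
of 40 GB of NameNode heap, all of it resident in memory.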
>>>>>>>>>>>>
>>>>>>>>>>>>  @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed out.
>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe, for this
>>>>>>>>>>>> purpose.
>>>>>>>>>>>>
>>>>>>>>>>>>  Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The NN would still be in the picture because it will be writing a lot
>>>>>>>>>>>>> of metadata for each individual file, so you will need an NN capable enough
>>>>>>>>>>>>> to store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>>> the NN, but a lot of metadata about the data will be on the NN, so it's
>>>>>>>>>>>>> always a good idea to have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  @Nitin , parallel dfs writes to hdfs are great, but I could
>>>>>>>>>>>>>> not understand the meaning of a capable NN. As I know, the NN would not be
>>>>>>>>>>>>>> part of the actual data write pipeline, meaning the data would not
>>>>>>>>>>>>>> travel through the NN; the dfs would contact the NN from time to time to
>>>>>>>>>>>>>> get the locations of the DNs where the data blocks should be stored.
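A minimal, hedged way to see the block-to-datanode mapping that the NN
maintains (the path is an assumption):

hadoop fsck /user/etl/somefile -files -blocks -locations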
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is it safe? .. there is no direct answer, yes or no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  when you say you have files worth 10TB and you
>>>>>>>>>>>>>>> want to upload them to HDFS, several factors come into the picture
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  1) Is the machine in the same network as your hadoop
>>>>>>>>>>>>>>> cluster?
>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  and most importantly I assume that you have a capable
>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  I would definitely not write files sequentially to HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>> features and speed up the process.
>>>>>>>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
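A minimal sketch of what "put in a parallel manner" can look like from a shell
(the directory names and the parallelism level of 8 are assumptions, not from
the thread):

ls /data/staging/*.gz | xargs -P 8 -I {} hadoop fs -put {} /user/etl/incoming/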
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   @Nitin Pawar , thanks for clearing my doubts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  But I have one more question: say I have 10 TB of data in
>>>>>>>>>>>>>>>> the pipeline.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size when
>>>>>>>>>>>>>>>> using the hadoop command line? Can the hadoop put command work with huge data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. It's an accumulating process, and most of the companies
>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>> frequency basis, then retained on hdfs for some duration as
>>>>>>>>>>>>>>>>> needed, and from there it is sent to archival or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  For data management products, you can look at Falcon,
>>>>>>>>>>>>>>>>> which is open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  In any case, if you want to write files to hdfs there
>>>>>>>>>>>>>>>>> are a few options available to you:
>>>>>>>>>>>>>>>>> 1) write your own dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs (see the sketch below)
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools that come with support to write to
>>>>>>>>>>>>>>>>> hdfs, like flume etc.
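As a hedged illustration of option 3 (assuming webhdfs is enabled; the host,
port, and path are placeholders, and 50070 was the default NameNode HTTP port
at the time), a file is created in two steps:

curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/etl/file.txt?op=CREATE"
(the response is a 307 redirect; PUT the file body to the returned Location)
curl -i -X PUT -T file.txt "<Location URL from the redirect above>"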
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>>>>> Yahoo, etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop
>>>>>>>>>>>>>>>>>> HDFS cluster for processing,
>>>>>>>>>>>>>>>>>>  and after processing how they download those files from
>>>>>>>>>>>>>>>>>> HDFS to the local file system?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I don't think they use the command line
>>>>>>>>>>>>>>>>>> hadoop fs -put to upload files, as it would take too long. Or do they divide
>>>>>>>>>>>>>>>>>> the data into, say, 10 parts of 10 petabytes each, compress them, and then
>>>>>>>>>>>>>>>>>> use hadoop fs -put?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Or do they use some other tool to upload huge files?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Please help me.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Thanks
>>>>>>>>>>>>>>>>>>  thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>   --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>   --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>
>

Re: Hadoop noob question

Posted by Chris Mawata <ch...@gmail.com>.
It is being read sequentially, but is it not potentially being written on
multiple drives? And since reading is typically faster than writing, don't
you still get a little benefit of parallelism?

On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
> I had said that if you use distcp to copy data *from localFS to HDFS* 
> then you won't be able to exploit parallelism as the entire file is 
> present on a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com <http://cloudfront.blogspot.com>
>                                                         maisnam ns
>                                                         <maisnam.ns@gmail.com
>                                                         <ma...@gmail.com>>
>                                                         wrote:
>
>                                                             @Nitin
>                                                             Pawar ,
>                                                             thanks for
>                                                             clearing
>                                                             my doubts .
>
>                                                             But I have
>                                                             one more
>                                                             question ,
>                                                             say I have
>                                                             10 TB data
>                                                             in the
>                                                             pipeline .
>
>                                                             Is it
>                                                             perfectly
>                                                             OK to use
>                                                             hadopo fs
>                                                             put
>                                                             command to
>                                                             upload
>                                                             these
>                                                             files of
>                                                             size 10 TB
>                                                             and is
>                                                             there any
>                                                             limit to
>                                                             the file
>                                                             size using
>                                                             hadoop
>                                                             command
>                                                             line . Can
>                                                             hadoop put
>                                                             command
>                                                             line work
>                                                             with huge
>                                                             data.
>
>                                                             Thanks in
>                                                             advance
>
>
>                                                             On Sat,
>                                                             May 11,
>                                                             2013 at
>                                                             4:24 PM,
>                                                             Nitin
>                                                             Pawar
>                                                             <nitinpawar432@gmail.com
>                                                             <ma...@gmail.com>>
>                                                             wrote:
>
>                                                                 first
>                                                                 of all
>                                                                 ..
>                                                                 most
>                                                                 of the
>                                                                 companies
>                                                                 do not
>                                                                 get
>                                                                 100 PB
>                                                                 of
>                                                                 data
>                                                                 in one
>                                                                 go.
>                                                                 Its an
>                                                                 accumulating
>                                                                 process and
>                                                                 most
>                                                                 of the
>                                                                 companies
>                                                                 do
>                                                                 have a
>                                                                 data
>                                                                 pipeline
>                                                                 in
>                                                                 place
>                                                                 where
>                                                                 the
>                                                                 data
>                                                                 is
>                                                                 written to
>                                                                 hdfs
>                                                                 on a
>                                                                 frequency
>                                                                 basis
>                                                                 and
>                                                                  then
>                                                                 its
>                                                                 retained
>                                                                 on
>                                                                 hdfs
>                                                                 for
>                                                                 some
>                                                                 duration
>                                                                 as per
>                                                                 needed
>                                                                 and
>                                                                 from
>                                                                 there
>                                                                 its
>                                                                 sent
>                                                                 to
>                                                                 archivers
>                                                                 or
>                                                                 deleted.
>
>                                                                 For
>                                                                 data
>                                                                 management
>                                                                 products,
>                                                                 you
>                                                                 can
>                                                                 look
>                                                                 at
>                                                                 falcon
>                                                                 which
>                                                                 is
>                                                                 open
>                                                                 sourced by
>                                                                 inmobi
>                                                                 along
>                                                                 with
>                                                                 hortonworks.
>
>
>                                                                 In any
>                                                                 case,
>                                                                 if you
>                                                                 want
>                                                                 to
>                                                                 write
>                                                                 files
>                                                                 to
>                                                                 hdfs
>                                                                 there
>                                                                 are
>                                                                 few
>                                                                 options available
>                                                                 to you
>                                                                 1)
>                                                                 Write
>                                                                 your
>                                                                 dfs
>                                                                 client
>                                                                 which
>                                                                 writes
>                                                                 to dfs
>                                                                 2) use
>                                                                 hdfs proxy
>                                                                 3)
>                                                                 there
>                                                                 is webhdfs
>                                                                 4)
>                                                                 command line
>                                                                 hdfs
>                                                                 5)
>                                                                 data
>                                                                 collection
>                                                                 tools
>                                                                 come
>                                                                 with
>                                                                 support to
>                                                                 write
>                                                                 to
>                                                                 hdfs
>                                                                 like
>                                                                 flume etc
>
>
>                                                                 On
>                                                                 Sat,
>                                                                 May
>                                                                 11,
>                                                                 2013
>                                                                 at
>                                                                 4:19
>                                                                 PM,
>                                                                 Thoihen Maibam
>                                                                 <thoihen123@gmail.com
>                                                                 <ma...@gmail.com>>
>                                                                 wrote:
>
>                                                                     Hi
>                                                                     All,
>
>                                                                     Can anyone
>                                                                     help
>                                                                     me
>                                                                     know
>                                                                     how does
>                                                                     companies
>                                                                     like
>                                                                     Facebook
>                                                                     ,Yahoo
>                                                                     etc upload
>                                                                     bulk
>                                                                     files
>                                                                     say to
>                                                                     the tune
>                                                                     of
>                                                                     100 petabytes
>                                                                     to
>                                                                     Hadoop
>                                                                     HDFS
>                                                                     cluster
>                                                                     for processing
>                                                                     and after
>                                                                     processing
>                                                                     how they
>                                                                     download
>                                                                     those
>                                                                     files
>                                                                     from
>                                                                     HDFS
>                                                                     to
>                                                                     local
>                                                                     file
>                                                                     system.
>
>                                                                     I
>                                                                     don't
>                                                                     think
>                                                                     they
>                                                                     might
>                                                                     be
>                                                                     using
>                                                                     the command
>                                                                     line
>                                                                     hadoop
>                                                                     fs
>                                                                     put to
>                                                                     upload
>                                                                     files
>                                                                     as
>                                                                     it
>                                                                     would
>                                                                     take
>                                                                     too long
>                                                                     or
>                                                                     do
>                                                                     they
>                                                                     divide
>                                                                     say 10
>                                                                     parts
>                                                                     each
>                                                                     10
>                                                                     petabytes
>                                                                     and compress
>                                                                     and use
>                                                                     the command
>                                                                     line
>                                                                     hadoop
>                                                                     fs put
>
>                                                                     Or
>                                                                     if
>                                                                     they
>                                                                     use any
>                                                                     tool
>                                                                     to
>                                                                     upload
>                                                                     huge
>                                                                     files.
>
>                                                                     Please
>                                                                     help
>                                                                     me .
>
>                                                                     Thanks
>                                                                     thoihen
>
>
>
>
>                                                                 -- 
>                                                                 Nitin
>                                                                 Pawar
>
>
>
>
>
>                                                         -- 
>                                                         Nitin Pawar
>
>
>
>
>
>                                                 -- 
>                                                 Nitin Pawar
>
>
>
>
>
>                                         -- 
>                                         Nitin Pawar
>
>
>
>
>
>
>
>
>
>                 -- 
>                 Nitin Pawar
>
>
>
>
>
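
A minimal sketch of the "parallel put" approach described in the quoted thread
above. Everything here is an illustrative assumption (the paths, the file
layout, the degree of parallelism), not something taken from the original
posts:

# push every .gz file under /data/staging into HDFS, at most 8 puts at a time;
# each file is an independent "hadoop fs -put", so a failed file can be retried on its own
find /data/staging -name '*.gz' | xargs -P 8 -I{} hadoop fs -put {} /user/thoihen/incoming/

For option 3 in the list above (webhdfs), the usual flow is a two-step HTTP
PUT: the NameNode answers the first request with a redirect to a DataNode, and
the second request streams the file there. Host names, ports and paths below
are placeholders:

# step 1: ask the NameNode where to write (the response carries a Location: header)
curl -i -X PUT "http://<namenode-host>:<http-port>/webhdfs/v1/user/thoihen/file1.gz?op=CREATE"
# step 2: send the bytes to the DataNode URL taken from that Location: header
curl -i -X PUT -T file1.gz "<datanode-redirect-url>"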


Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
This is what I would say :

The number of maps is decided as follows. Since it’s a good idea to get
each map to copy a reasonable amount of data to minimize overheads in task
setup, each map copies at least 256 MB (unless the total size of the input
is less, in which case one map handles it all). For example, 1 GB of files
will be given four map tasks. When the data size is very large, it becomes
necessary to limit the number of maps in order to limit bandwidth and
cluster utilization. By default, the maximum number of maps is 20 per
(tasktracker) cluster node. For example, copying 1,000 GB of files to a
100-node cluster will allocate 2,000 maps (20 per node), so each will copy
512 MB on average. This can be reduced by specifying the -m argument to
*distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB
on average.

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com
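
As a quick illustration of the -m knob described above (cluster addresses and
paths are placeholders, not taken from the thread):

# inter-cluster copy capped at 1,000 map tasks
hadoop distcp -m 1000 hdfs://nn1:8020/data/events hdfs://nn2:8020/data/events

# distcp also accepts a file:// source for pushing local data into HDFS, with the
# caveat discussed in this thread that the maps must be able to read that local path
hadoop distcp file:///data/staging hdfs://nn1:8020/user/thoihen/staging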


On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Soon after replying I realized something else related to this.
>
> Say we have a single file in HDFS (hdfs configured for default block size
> 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it
> from the current hdfs to another one , then
> whether there would be any parallelism or just a single map task would be
> fired?
>
> As per what I have read , a mapper is launched for a complete file or a
> set of files. It doesn't operate at block level. So no parallelism even if
> the file resides in HDFS.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> yeah you are right I mis read your earlier post.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> I had said that if you use distcp to copy data *from localFS to HDFS*then you won't be able to exploit parallelism as entire file is present on
>>> a single machine. So no multiple TTs.
>>>
>>> Please comment if you think I am wring somewhere.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Yes , it's a MR job under the hood . my question was that you wrote
>>>> that using distcp you loose the benefits  of parallel processing of Hadoop.
>>>> I think the MR job of distcp divides files into individual map tasks based
>>>> on the total size of the transfer , so multiple mappers would still be
>>>> spawned if the size of transfer is huge and they would work in parallel.
>>>>
>>>> Correct me if there is anything wrong!
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> No. distcp is actually a mapreduce job under the hood.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> Thanks to both of you!
>>>>>>
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> you can do that using file:///
>>>>>>>
>>>>>>> example:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>>>> used to upload files from local to hdfs.
>>>>>>>>
>>>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>>>> present in the hadoop's fs?
>>>>>>>>
>>>>>>>>  Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You'r welcome :)
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Tariq!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>>>
>>>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>>>> consumption.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> IMHO,I think the statement about NN with regard to block
>>>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>>>> files of combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>>>
>>>>>>>>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> absolutely rite Mohammad
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about
>>>>>>>>>>>>>> this :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>>>> NN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>>>> purpose.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>>>> Tariq
>>>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> NN would still be in picture because it will be writing a
>>>>>>>>>>>>>>> lot of meta data for each individual file. so you will need a NN capable
>>>>>>>>>>>>>>> enough which can store the metadata for your entire dataset. Data will
>>>>>>>>>>>>>>> never go to NN but lot of metadata about data will be on NN so its always
>>>>>>>>>>>>>>> good idea to have a strong NN.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you
>>>>>>>>>>>>>>>>> want to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop
>>>>>>>>>>>>>>>>> cluster?
>>>>>>>>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable
>>>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>>>> you can hdfs put command in parallel manner and in my
>>>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in
>>>>>>>>>>>>>>>>>> the pipeline .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload
>>>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB
>>>>>>>>>>>>>>>>>>> of data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For data management products, you can look at falcon
>>>>>>>>>>>>>>>>>>> which is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there
>>>>>>>>>>>>>>>>>>> are few options available to you
>>>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like
>>>>>>>>>>>>>>>>>>>> Facebook ,Yahoo etc upload bulk files say to the tune of 100 petabytes to
>>>>>>>>>>>>>>>>>>>> Hadoop HDFS cluster for processing
>>>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think they might be using the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files as it would take too long or do they divide
>>>>>>>>>>>>>>>>>>>> say 10 parts each 10 petabytes and  compress and use the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
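
A rough back-of-the-envelope reading of the ~200 B-per-object figure quoted in
this thread (the file sizes below are illustrative assumptions):

10 TB as 1 GB files with 64 MB blocks:  ~10,000 files + ~164,000 blocks = ~174,000 objects, roughly 35 MB of NN heap
10 TB as 1 MB files:                    ~10.5M files  + ~10.5M blocks   = ~21M objects, roughly 4 GB of NN heap

which is why the thread keeps stressing bigger files (less metadata) and a
capable NameNode.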

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
This is what I would say :

The number of maps is decided as follows. Since it’s a good idea to get
each map to copy a reasonable amount of data to minimize overheads in task
setup, each map copies at least 256 MB (unless the total size of the input
is less, in which case one map handles it all). For example, 1 GB of files
will be given four map tasks. When the data size is very large, it becomes
necessary to limit the number of maps in order to limit bandwidth and
cluster utilization. By default, the maximum number of maps is 20 per
(tasktracker) cluster node. For example, copying 1,000 GB of files to a
100-node cluster will allocate 2,000 maps (20 per node), so each will copy
512 MB on average. This can be reduced by specifying the-m argument to *
distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB
on average.

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Soon after replying I realized something else related to this.
>
> Say we have a single file in HDFS (hdfs configured for default block size
> 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it
> from the current hdfs to another one , then
> whether there would be any parallelism or just a single map task would be
> fired?
>
> As per what I have read , a mapper is launcher for a complete file or a
> set of files. It doesn't operate at block level.So no parallelism even if
> the file resides in HDFS.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> yeah you are right I mis read your earlier post.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> I had said that if you use distcp to copy data *from localFS to HDFS*then you won't be able to exploit parallelism as entire file is present on
>>> a single machine. So no multiple TTs.
>>>
>>> Please comment if you think I am wring somewhere.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Yes , it's a MR job under the hood . my question was that you wrote
>>>> that using distcp you loose the benefits  of parallel processing of Hadoop.
>>>> I think the MR job of distcp divides files into individual map tasks based
>>>> on the total size of the transfer , so multiple mappers would still be
>>>> spawned if the size of transfer is huge and they would work in parallel.
>>>>
>>>> Correct me if there is anything wrong!
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> No. distcp is actually a mapreduce job under the hood.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> Thanks to both of you!
>>>>>>
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> you can do that using file:///
>>>>>>>
>>>>>>> example:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>>>> used to upload files from local to hdfs.
>>>>>>>>
>>>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>>>> present in the hadoop's fs?
>>>>>>>>
>>>>>>>>  Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You'r welcome :)
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Tariq!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>>>
>>>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>>>> consumption.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> IMHO,I think the statement about NN with regard to block
>>>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>>>> files of combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>>>
>>>>>>>>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> absolutely rite Mohammad
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about
>>>>>>>>>>>>>> this :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>>>> NN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>>>> purpose.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>>>> Tariq
>>>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> NN would still be in picture because it will be writing a
>>>>>>>>>>>>>>> lot of meta data for each individual file. so you will need a NN capable
>>>>>>>>>>>>>>> enough which can store the metadata for your entire dataset. Data will
>>>>>>>>>>>>>>> never go to NN but lot of metadata about data will be on NN so its always
>>>>>>>>>>>>>>> good idea to have a strong NN.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you
>>>>>>>>>>>>>>>>> want to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop
>>>>>>>>>>>>>>>>> cluster?
>>>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable
>>>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>>>> you can run the hdfs put command in parallel and in my
>>>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in
>>>>>>>>>>>>>>>>>> the pipeline .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB
>>>>>>>>>>>>>>>>>>> of data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For data management products, you can look at falcon
>>>>>>>>>>>>>>>>>>> which is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there
>>>>>>>>>>>>>>>>>>> are few options available to you
>>>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like
>>>>>>>>>>>>>>>>>>>> Facebook ,Yahoo etc upload bulk files say to the tune of 100 petabytes to
>>>>>>>>>>>>>>>>>>>> Hadoop HDFS cluster for processing
>>>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think they might be using the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files as it would take too long or do they divide
>>>>>>>>>>>>>>>>>>>> say 10 parts each 10 petabytes and  compress and use the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
This is what I would say:

The number of maps is decided as follows. Since it’s a good idea to get
each map to copy a reasonable amount of data to minimize overheads in task
setup, each map copies at least 256 MB (unless the total size of the input
is less, in which case one map handles it all). For example, 1 GB of files
will be given four map tasks. When the data size is very large, it becomes
necessary to limit the number of maps in order to limit bandwidth and
cluster utilization. By default, the maximum number of maps is 20 per
(tasktracker) cluster node. For example, copying 1,000 GB of files to a
100-node cluster will allocate 2,000 maps (20 per node), so each will copy
512 MB on average. This can be reduced by specifying the -m argument to *
distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB
on average.
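
For illustration, a minimal sketch of how that might look on the command
line (the namenode addresses and paths here are made up for the example,
not taken from any real setup):

  hadoop distcp -m 1000 hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs

With -m 1000 the job is capped at 1,000 map tasks, so for a 1,000 GB source
each map copies roughly 1 GB.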

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Soon after replying I realized something else related to this.
>
> Say we have a single file in HDFS (hdfs configured for default block size
> 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it
> from the current hdfs to another one, then would there be any parallelism,
> or would just a single map task be fired?
>
> As per what I have read, a mapper is launched for a complete file or a
> set of files. It doesn't operate at block level. So no parallelism even if
> the file resides in HDFS.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> yeah you are right, I misread your earlier post.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism as the entire file is present on
>>> a single machine. So no multiple TTs.
>>>
>>> Please comment if you think I am wrong somewhere.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Yes, it's a MR job under the hood. My question was that you wrote
>>>> that using distcp you lose the benefits of parallel processing of Hadoop.
>>>> I think the MR job of distcp divides files into individual map tasks based
>>>> on the total size of the transfer , so multiple mappers would still be
>>>> spawned if the size of transfer is huge and they would work in parallel.
>>>>
>>>> Correct me if there is anything wrong!
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> No. distcp is actually a mapreduce job under the hood.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> Thanks to both of you!
>>>>>>
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> you can do that using file:///
>>>>>>>
>>>>>>> example:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>>>> used to upload files from local to hdfs.
>>>>>>>>
>>>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>>>> present in the hadoop's fs?
>>>>>>>>
>>>>>>>>  Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You're welcome :)
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Tariq!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>>>
>>>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>>>> consumption.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> IMHO,I think the statement about NN with regard to block
>>>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>>>> files of combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>>>
>>>>>>>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> absolutely right Mohammad
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about
>>>>>>>>>>>>>> this :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>>>> NN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>>>> purpose.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>>>> Tariq
>>>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> NN would still be in picture because it will be writing a
>>>>>>>>>>>>>>> lot of meta data for each individual file. so you will need a NN capable
>>>>>>>>>>>>>>> enough which can store the metadata for your entire dataset. Data will
>>>>>>>>>>>>>>> never go to NN but lot of metadata about data will be on NN so its always
>>>>>>>>>>>>>>> good idea to have a strong NN.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> when you say you have files worth 10TB and you
>>>>>>>>>>>>>>>>> want to upload to HDFS, several factors come into picture
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop
>>>>>>>>>>>>>>>>> cluster?
>>>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable
>>>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>>>> you can run the hdfs put command in parallel and in my
>>>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in
>>>>>>>>>>>>>>>>>> the pipeline .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB
>>>>>>>>>>>>>>>>>>> of data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For data management products, you can look at falcon
>>>>>>>>>>>>>>>>>>> which is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there
>>>>>>>>>>>>>>>>>>> are few options available to you
>>>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like
>>>>>>>>>>>>>>>>>>>> Facebook ,Yahoo etc upload bulk files say to the tune of 100 petabytes to
>>>>>>>>>>>>>>>>>>>> Hadoop HDFS cluster for processing
>>>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think they might be using the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files as it would take too long or do they divide
>>>>>>>>>>>>>>>>>>>> say 10 parts each 10 petabytes and  compress and use the command line
>>>>>>>>>>>>>>>>>>>> hadoop fs put
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Soon after replying I realized something else related to this.

Say we have a single file in HDFS (hdfs configured for default block size
64 MB) and the size of the file is 1 GB. Now if we use distcp to move it
from the current hdfs to another one, then would there be any parallelism,
or would just a single map task be fired?

As per what I have read, a mapper is launched for a complete file or a set
of files. It doesn't operate at block level. So no parallelism even if the
file resides in HDFS.
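
(As an aside, a quick way to check would be to run something along these
lines, where the namenode addresses and the path are made up for the
example, and then look at how many map tasks the job reports on the
JobTracker page:

  hadoop distcp hdfs://nn1:8020/user/rahul/onegbfile hdfs://nn2:8020/backup/

If distcp really splits work per file rather than per block, the job should
launch just one map for this single file.)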

Thanks,
Rahul


On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> yeah you are right, I misread your earlier post.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism as the entire file is present on
>> a single machine. So no multiple TTs.
>>
>> Please comment if you think I am wrong somewhere.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Yes, it's a MR job under the hood. My question was that you wrote that
>>> using distcp you lose the benefits of parallel processing of Hadoop. I
>>> think the MR job of distcp divides files into individual map tasks based on
>>> the total size of the transfer , so multiple mappers would still be spawned
>>> if the size of transfer is huge and they would work in parallel.
>>>
>>> Correct me if there is anything wrong!
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> No. distcp is actually a mapreduce job under the hood.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Thanks to both of you!
>>>>>
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> you can do that using file:///
>>>>>>
>>>>>> example:
>>>>>>
>>>>>>
>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>>> used to upload files from local to hdfs.
>>>>>>>
>>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>>> present in the hadoop's fs?
>>>>>>>
>>>>>>>  Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> You're welcome :)
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Tariq!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>>
>>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>>> consumption.
>>>>>>>>>>
>>>>>>>>>> Warm Regards,
>>>>>>>>>> Tariq
>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> IMHO,I think the statement about NN with regard to block
>>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>>> files of combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>>
>>>>>>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rahul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> absolutely right Mohammad
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this
>>>>>>>>>>>>> :
>>>>>>>>>>>>>
>>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>>> NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>>> purpose.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>>> Tariq
>>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> NN would still be in picture because it will be writing a lot
>>>>>>>>>>>>>> of meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> when you say you have files worth 10TB and you want
>>>>>>>>>>>>>>>> to upload to HDFS, several factors come into picture
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop
>>>>>>>>>>>>>>>> cluster?
>>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable
>>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>>> you can run the hdfs put command in parallel and in my
>>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in
>>>>>>>>>>>>>>>>> the pipeline .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB
>>>>>>>>>>>>>>>>>> of data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For data management products, you can look at falcon
>>>>>>>>>>>>>>>>>> which is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think they might be using the command line
>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files as it would take too long or do they divide
>>>>>>>>>>>>>>>>>>> say 10 parts each 10 petabytes and  compress and use the command line
>>>>>>>>>>>>>>>>>>> hadoop fs put
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think they might be using the command line
>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files as it would take too long or do they divide
>>>>>>>>>>>>>>>>>>> say 10 parts each 10 petabytes and  compress and use the command line
>>>>>>>>>>>>>>>>>>> hadoop fs put
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
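
Regarding the suggestion quoted above to run the put command "in parallel manner": a
minimal sketch of what that can look like from a shell, assuming a made-up staging
layout (one local subdirectory per chunk of data) and a made-up HDFS target path:

# run four concurrent uploads, one "hadoop fs -put" per local subdirectory
ls -d /data/staging/part_* | xargs -P 4 -I {} hadoop fs -put {} /user/thoihen/incoming/

Each put streams through its own HDFS write pipeline, which is where the speed-up comes
from; the NameNode still records metadata for every file created, which is the "capable
NN" point made in the quoted thread.
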

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yeah, you are right. I misread your earlier post.

Thanks,
Rahul


On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com> wrote:

> I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism as entire file is present on
> a single machine. So no multiple TTs.
>
> Please comment if you think I am wring somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Yes , it's a MR job under the hood . my question was that you wrote that
>> using distcp you loose the benefits  of parallel processing of Hadoop. I
>> think the MR job of distcp divides files into individual map tasks based on
>> the total size of the transfer , so multiple mappers would still be spawned
>> if the size of transfer is huge and they would work in parallel.
>>
>> Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks to both of you!
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>> example:
>>>>>
>>>>>
>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>> used to upload files from local to hdfs.
>>>>>>
>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>> present in the hadoop's fs?
>>>>>>
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> You'r welcome :)
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>> consumption.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> IMHO,I think the statement about NN with regard to block metadata
>>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely rite Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>>
>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>> NN.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>
>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>> purpose.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> NN would still be in picture because it will be writing a lot
>>>>>>>>>>>>> of meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want
>>>>>>>>>>>>>>> to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>> you can hdfs put command in parallel manner and in my
>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>
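
On option 3 ("there is webhdfs") from the list quoted above: a rough sketch of pushing a
single file over the WebHDFS REST API with curl, assuming a made-up namenode host, the
default WebHDFS port of that era (50070) and simple user.name authentication; the exact
URLs depend on your cluster and Hadoop version:

# step 1: ask the namenode for a datanode to write to; the reply is a 307 redirect
curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/logs/file1.log?op=CREATE&user.name=thoihen"

# step 2: PUT the actual file contents to the Location URL returned in step 1
curl -i -X PUT -T file1.log "<datanode URL from the Location header>"

curl moves one file at a time, so for continuous bulk ingestion the aggregation tools
mentioned earlier in the thread (Flume, Chukwa, Scribe) are the more usual route.
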

Re: Hadoop noob question

Posted by Chris Mawata <ch...@gmail.com>.
It is being read sequentially, but is it not potentially being written on
multiple drives? And since reading is typically faster than writing, don't
you still get a little benefit of parallelism?

On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
> I had said that if you use distcp to copy data *from localFS to HDFS* 
> then you won't be able to exploit parallelism as entire file is 
> present on a single machine. So no multiple TTs.
>
> Please comment if you think I am wring somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>
>> Yes , it's a MR job under the hood . my question was that you wrote that using distcp
>> you loose the benefits  of parallel processing of Hadoop. I think the MR job of distcp
>> divides files into individual map tasks based on the total size of the transfer , so
>> multiple mappers would still be spawned if the size of transfer is huge and they would
>> work in parallel.
>>
>> Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks to both of you!
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>> example:
>>>>>
>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Tariq can you point me to some resource which shows how distcp is used to upload
>>>>>> files from local to hdfs.
>>>>>>
>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already present in the
>>>>>> hadoop's fs?
>>>>>>
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> You'r welcome :)
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory consumption.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> IMHO,I think the statement about NN with regard to block metadata is more like a
>>>>>>>>>> general statement. Even if you put lots of small files of combined size 10 TB , you
>>>>>>>>>> need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely rite Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>>
>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for each object around
>>>>>>>>>>>> 200B of metadata get created. So the NN should be powerful enough to handle that
>>>>>>>>>>>> much metadata, since it is going to be in-memory. Actually memory is the most
>>>>>>>>>>>> important metric when it comes to NN.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>
>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you don't
>>>>>>>>>>>> actually just do a "put". You could use something like "distcp" for parallel
>>>>>>>>>>>> copying. A better approach would be to use a data aggregation tool like Flume or
>>>>>>>>>>>> Chukwa, as Nitin has already pointed. Facebook uses their own data aggregation
>>>>>>>>>>>> tool, called Scribe for this purpose.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> NN would still be in picture because it will be writing a lot of meta data for
>>>>>>>>>>>>> each individual file. so you will need a NN capable enough which can store the
>>>>>>>>>>>>> metadata for your entire dataset. Data will never go to NN but lot of metadata
>>>>>>>>>>>>> about data will be on NN so its always good idea to have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not understand the
>>>>>>>>>>>>>> meaning of capable NN. As I know , the NN would not be a part of the actual data
>>>>>>>>>>>>>> write pipeline , means that the data would not travel through the NN , the dfs
>>>>>>>>>>>>>> would contact the NN from time to time to get locations of DN as where to store
>>>>>>>>>>>>>> the data blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want to upload  to HDFS,
>>>>>>>>>>>>>>> several factors come into picture
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop cluster. By that I
>>>>>>>>>>>>>>> mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would prefer to write
>>>>>>>>>>>>>>> files in parallel to hdfs to utilize the DFS write features to speed up the
>>>>>>>>>>>>>>> process. you can hdfs put command in parallel manner and in my experience it has
>>>>>>>>>>>>>>> not failed when we write a lot of data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these files of size
>>>>>>>>>>>>>>>> 10 TB and is there any limit to the file size  using hadoop command line . Can
>>>>>>>>>>>>>>>> hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data in one go. Its
>>>>>>>>>>>>>>>>> an accumulating process and most of the companies do have a data pipeline in
>>>>>>>>>>>>>>>>> place where the data is written to hdfs on a frequency basis and  then its
>>>>>>>>>>>>>>>>> retained on hdfs for some duration as per needed and from there its sent to
>>>>>>>>>>>>>>>>> archivers or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For data management products, you can look at falcon which is open sourced by
>>>>>>>>>>>>>>>>> inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few options available
>>>>>>>>>>>>>>>>> to you
>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs like flume etc
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc upload bulk
>>>>>>>>>>>>>>>>>> files say to the tune of 100 petabytes to Hadoop HDFS cluster for processing
>>>>>>>>>>>>>>>>>> and after processing how they download those files from HDFS to local file
>>>>>>>>>>>>>>>>>> system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs put to upload
>>>>>>>>>>>>>>>>>> files as it would take too long or do they divide say 10 parts each 10
>>>>>>>>>>>>>>>>>> petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please help
>                                                                     me .
>
>                                                                     Thanks
>                                                                     thoihen
>
>
>
>
>                                                                 -- 
>                                                                 Nitin
>                                                                 Pawar
>
>
>
>
>
>                                                         -- 
>                                                         Nitin Pawar
>
>
>
>
>
>                                                 -- 
>                                                 Nitin Pawar
>
>
>
>
>
>                                         -- 
>                                         Nitin Pawar
>
>
>
>
>
>
>
>
>
>                 -- 
>                 Nitin Pawar
>
>
>
>
>


Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yeah, you are right. I misread your earlier post.

Thanks,
Rahul


On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com> wrote:

> I had said that if you use distcp to copy data *from localFS to HDFS* then
> you won't be able to exploit parallelism, as the entire file is present on
> a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Yes, it's an MR job under the hood. My question was that you wrote that
>> using distcp you lose the benefits of parallel processing of Hadoop. I
>> think the MR job of distcp divides files into individual map tasks based on
>> the total size of the transfer, so multiple mappers would still be spawned
>> if the size of the transfer is huge and they would work in parallel.
>>
>> Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
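
For a concrete feel of how that split is controlled, a hedged sketch: distcp's -m option caps the number of simultaneous map tasks, and the cluster and path names below are invented purely for illustration.

    hadoop distcp -m 20 hdfs://nn1:8020/data/raw hdfs://nn2:8020/backup/raw
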
>>
>>
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks to both of you!
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>> example:
>>>>>
>>>>>
>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
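
The command above copies out of HDFS to the local filesystem; the reverse direction can be written the same way. A hedged sketch with made-up paths, noting that the local path has to be visible on whichever node(s) actually run the copy:

    hadoop distcp file:///data/staging/bigfile.dat hdfs://localhost:8020/ingest/
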
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>> used to upload files from local to HDFS?
>>>>>>
>>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>>> present in Hadoop's fs?
>>>>>>
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> You're welcome :)
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>> And the bigger the files, the less metadata there is, and hence less
>>>>>>>>> memory consumption.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> IMHO, I think the statement about the NN with regard to block metadata
>>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>>> combined size 10 TB, you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely right, Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>>
>>>>>>>>>>>> Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>>> each object around 200 B of metadata gets created. So the NN should be
>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>> in-memory. Actually, memory is the most important metric when it comes to
>>>>>>>>>>>> the NN.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct @Nitin?
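
To put a rough number on that, a back-of-the-envelope sketch using the ~200 B per object figure above (the file counts are invented): 100 million files averaging two blocks each is about 300 million namespace objects, so roughly

    300,000,000 objects x 200 B/object = 60,000,000,000 B  (about 60 GB of NameNode heap)

just for metadata, which is why NN memory is the metric to watch.
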
>>>>>>>>>>>>
>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed out.
>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe, for this
>>>>>>>>>>>> purpose.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
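
For a flavour of the aggregation-tool route, a minimal Flume-style sketch; the agent, source, channel and sink names, the spool directory and the HDFS path are all made up, so treat it as an outline rather than a tested configuration.

    # list the components of this (hypothetical) agent
    agent.sources  = spool
    agent.channels = mem
    agent.sinks    = hdfsSink

    # pick up completed files dropped into a local directory
    agent.sources.spool.type     = spooldir
    agent.sources.spool.spoolDir = /data/incoming
    agent.sources.spool.channels = mem

    # buffer events in memory between source and sink
    agent.channels.mem.type     = memory
    agent.channels.mem.capacity = 10000

    # write the events out to HDFS as plain data
    agent.sinks.hdfsSink.type          = hdfs
    agent.sinks.hdfsSink.channel       = mem
    agent.sinks.hdfsSink.hdfs.path     = hdfs://namenode:8020/ingest
    agent.sinks.hdfsSink.hdfs.fileType = DataStream
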
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> NN would still be in the picture because it will be writing a lot
>>>>>>>>>>>>> of metadata for each individual file. So you will need a NN capable
>>>>>>>>>>>>> enough to store the metadata for your entire dataset. Data will never go
>>>>>>>>>>>>> to the NN, but a lot of metadata about the data will be on the NN, so it's
>>>>>>>>>>>>> always a good idea to have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin, parallel dfs writes to hdfs are great, but I could not
>>>>>>>>>>>>>> understand the meaning of a capable NN. As I know, the NN would not be a
>>>>>>>>>>>>>> part of the actual data write pipeline, meaning that the data would not
>>>>>>>>>>>>>> travel through the NN; the dfs client would contact the NN from time to
>>>>>>>>>>>>>> time to get locations of DNs as to where to store the data blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you say you have files worth 10 TB and you want
>>>>>>>>>>>>>>> to upload them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>
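
A minimal shell sketch of that idea; the staging layout and target path are invented, and each put simply runs as a background job so several streams write into HDFS at once.

    for part in /data/staging/part-*; do
        hadoop fs -put "$part" /ingest/ &
    done
    wait
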
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size using
>>>>>>>>>>>>>>>> the hadoop command line? Can the hadoop put command line work with huge data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First of all, most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. It's an accumulating process, and most of the companies do
>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>> frequency basis, then retained on hdfs for some duration as needed, and
>>>>>>>>>>>>>>>>> from there it is sent to archivers or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are a
>>>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>> hdfs like flume etc
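
To give option 3 some shape, a hedged WebHDFS sketch; the host, port and paths are placeholders, and the two-step redirect is how the REST API hands the actual write off to a datanode.

    # step 1: ask the namenode to create the file; it answers with a 307 redirect
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/ingest/file1.dat?op=CREATE"

    # step 2: stream the data to the datanode URL returned in the Location header
    curl -i -X PUT -T file1.dat "<datanode-location-from-step-1>"
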
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Hadoop noob question

Posted by Chris Mawata <ch...@gmail.com>.
It is being read sequentially, but is it not potentially being written on
multiple drives? And since reading is typically faster than writing, don't
you still get a little benefit of parallelism?

On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
> I had said that if you use distcp to copy data *from localFS to HDFS* 
> then you won't be able to exploit parallelism as entire file is 
> present on a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com <http://cloudfront.blogspot.com>
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee 
> <rahul.rec.dgp@gmail.com <ma...@gmail.com>> wrote:
>
>     Yes, it's an MR job under the hood. My question was that you
>     wrote that using distcp you lose the benefits of parallel
>     processing of Hadoop. I think the MR job of distcp divides files
>     into individual map tasks based on the total size of the transfer,
>     so multiple mappers would still be spawned if the size of the
>     transfer is huge and they would work in parallel.
>
>     Correct me if there is anything wrong!
>
>     Thanks,
>     Rahul
>
>
>     On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq
>     <dontariq@gmail.com <ma...@gmail.com>> wrote:
>
>         No. distcp is actually a mapreduce job under the hood.
>
>         Warm Regards,
>         Tariq
>         cloudfront.blogspot.com <http://cloudfront.blogspot.com>
>
>
>         On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee
>         <rahul.rec.dgp@gmail.com <ma...@gmail.com>> wrote:
>
>             Thanks to both of you!
>
>             Rahul
>
>
>             On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar
>             <nitinpawar432@gmail.com <ma...@gmail.com>>
>             wrote:
>
>                 you can do that using file:///
>
>                 example:
>
>                 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
>
>                 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee
>                 <rahul.rec.dgp@gmail.com
>                 <ma...@gmail.com>> wrote:
>
>                     @Tariq can you point me to some resource which
>                     shows how distcp is used to upload files from
>                     local to HDFS?
>
>                     Isn't distcp an MR job? Wouldn't it need the data
>                     to be already present in Hadoop's fs?
>
>                     Rahul
>
>
>                     On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq
>                     <dontariq@gmail.com <ma...@gmail.com>>
>                     wrote:
>
>                         You're welcome :)
>
>                         Warm Regards,
>                         Tariq
>                         cloudfront.blogspot.com
>                         <http://cloudfront.blogspot.com>
>
>
>                         On Sat, May 11, 2013 at 10:46 PM, Rahul
>                         Bhattacharjee <rahul.rec.dgp@gmail.com
>                         <ma...@gmail.com>> wrote:
>
>                             Thanks Tariq!
>
>
>                             On Sat, May 11, 2013 at 10:34 PM, Mohammad
>                             Tariq <dontariq@gmail.com
>                             <ma...@gmail.com>> wrote:
>
>                                 @Rahul : Yes. distcp can do that.
>
>                                 And the bigger the files, the less
>                                 metadata and hence less memory consumption.
>
>                                 Warm Regards,
>                                 Tariq
>                                 cloudfront.blogspot.com
>                                 <http://cloudfront.blogspot.com>
>
>
>                                 On Sat, May 11, 2013 at 9:40 PM, Rahul
>                                 Bhattacharjee <rahul.rec.dgp@gmail.com
>                                 <ma...@gmail.com>> wrote:
>
>                                     IMHO, I think the statement about
>                                     the NN with regard to block metadata
>                                     is more like a general statement.
>                                     Even if you put lots of small
>                                     files of combined size 10 TB, you
>                                     need to have a capable NN.
>
>                                     Can distcp be used to copy
>                                     local-to-HDFS?
>
>                                     Thanks,
>                                     Rahul
>
>
>                                     On Sat, May 11, 2013 at 9:35 PM,
>                                     Nitin Pawar
>                                     <nitinpawar432@gmail.com
>                                     <ma...@gmail.com>>
>                                     wrote:
>
>                                         absolutely right, Mohammad
>
>
>                                         On Sat, May 11, 2013 at 9:33
>                                         PM, Mohammad Tariq
>                                         <dontariq@gmail.com
>                                         <ma...@gmail.com>>
>                                         wrote:
>
>                                             Sorry for barging in guys.
>                                             I think Nitin is talking
>                                             about this :
>
>                                             Every file and block in
>                                             HDFS is treated as an
>                                             object and for each object
>                                             around 200B of metadata
>                                             get created. So the NN
>                                             should be powerful enough
>                                             to handle that much
>                                             metadata, since it is
>                                             going to be in-memory.
>                                             Actually memory is the
>                                             most important metric when
>                                             it comes to NN.
>
>                                             Am I correct @Nitin?
>
>                                             @Thoihen : As Nitin has
>                                             said, when you talk about
>                                             that much data you don't
>                                             actually just do a "put".
>                                             You could use something
>                                             like "distcp" for parallel
>                                             copying. A better approach
>                                             would be to use a data
>                                             aggregation tool like
>                                             Flume or Chukwa, as Nitin
>                                             has already pointed.
>                                             Facebook uses their own
>                                             data aggregation tool,
>                                             called Scribe for this
>                                             purpose.
>
>                                             Warm Regards,
>                                             Tariq
>                                             cloudfront.blogspot.com
>                                             <http://cloudfront.blogspot.com>
>
>
>                                             On Sat, May 11, 2013 at
>                                             9:20 PM, Nitin Pawar
>                                             <nitinpawar432@gmail.com
>                                             <ma...@gmail.com>>
>                                             wrote:
>
>                                                 NN would still be in
>                                                 picture because it
>                                                 will be writing a lot
>                                                 of meta data for each
>                                                 individual file. so
>                                                 you will need a NN
>                                                 capable enough which
>                                                 can store the metadata
>                                                 for your entire
>                                                 dataset. Data will
>                                                 never go to NN but lot
>                                                 of metadata about data
>                                                 will be on NN so its
>                                                 always good idea to
>                                                 have a strong NN.
>
>
>                                                 On Sat, May 11, 2013
>                                                 at 9:11 PM, Rahul
>                                                 Bhattacharjee
>                                                 <rahul.rec.dgp@gmail.com
>                                                 <ma...@gmail.com>>
>                                                 wrote:
>
>                                                     @Nitin , parallel
>                                                     dfs to write to
>                                                     hdfs is great ,
>                                                     but could not
>                                                     understand the
>                                                     meaning of capable
>                                                     NN. As I know ,
>                                                     the NN would not
>                                                     be a part of the
>                                                     actual data write
>                                                     pipeline , means
>                                                     that the data
>                                                     would not travel
>                                                     through the NN ,
>                                                     the dfs would
>                                                     contact the NN
>                                                     from time to time
>                                                     to get locations
>                                                     of DN as where to
>                                                     store the data blocks.
>
>                                                     Thanks,
>                                                     Rahul
>
>
>
>                                                     On Sat, May 11,
>                                                     2013 at 4:54 PM,
>                                                     Nitin Pawar
>                                                     <nitinpawar432@gmail.com
>                                                     <ma...@gmail.com>>
>                                                     wrote:
>
>                                                         is it safe? ..
>                                                         there is no
>                                                         direct answer
>                                                         yes or no
>
>                                                         when you say ,
>                                                         you have files
>                                                         worth 10TB
>                                                         files and you
>                                                         want to upload
>                                                          to HDFS,
>                                                         several
>                                                         factors come
>                                                         into picture
>
>                                                         1) Is the
>                                                         machine in the
>                                                         same network
>                                                         as your hadoop
>                                                         cluster?
>                                                         2) If there
>                                                         guarantee that
>                                                         network will
>                                                         not go down?
>
>                                                         and Most
>                                                         importantly I
>                                                         assume that
>                                                         you have a
>                                                         capable hadoop
>                                                         cluster. By
>                                                         that I mean
>                                                         you have a
>                                                         capable namenode.
>
>                                                         I would
>                                                         definitely not
>                                                         write
>                                                         files sequentially in
>                                                         HDFS. I would
>                                                         prefer to
>                                                         write files in
>                                                         parallel to
>                                                         hdfs to
>                                                         utilize the
>                                                         DFS write
>                                                         features to
>                                                         speed up the
>                                                         process.
>                                                         you can hdfs
>                                                         put command in
>                                                         parallel
>                                                         manner and in
>                                                         my experience
>                                                         it has not
>                                                         failed when we
>                                                         write a lot of
>                                                         data.
>
>
>                                                         On Sat, May
>                                                         11, 2013 at
>                                                         4:38 PM,
>                                                         maisnam ns
>                                                         <maisnam.ns@gmail.com
>                                                         <ma...@gmail.com>>
>                                                         wrote:
>
>                                                             @Nitin
>                                                             Pawar ,
>                                                             thanks for
>                                                             clearing
>                                                             my doubts .
>
>                                                             But I have
>                                                             one more
>                                                             question ,
>                                                             say I have
>                                                             10 TB data
>                                                             in the
>                                                             pipeline .
>
>                                                             Is it
>                                                             perfectly
>                                                             OK to use
>                                                             hadopo fs
>                                                             put
>                                                             command to
>                                                             upload
>                                                             these
>                                                             files of
>                                                             size 10 TB
>                                                             and is
>                                                             there any
>                                                             limit to
>                                                             the file
>                                                             size using
>                                                             hadoop
>                                                             command
>                                                             line . Can
>                                                             hadoop put
>                                                             command
>                                                             line work
>                                                             with huge
>                                                             data.
>
>                                                             Thanks in
>                                                             advance
>
>
>                                                             On Sat,
>                                                             May 11,
>                                                             2013 at
>                                                             4:24 PM,
>                                                             Nitin
>                                                             Pawar
>                                                             <nitinpawar432@gmail.com
>                                                             <ma...@gmail.com>>
>                                                             wrote:
>
>                                                                 first
>                                                                 of all
>                                                                 ..
>                                                                 most
>                                                                 of the
>                                                                 companies
>                                                                 do not
>                                                                 get
>                                                                 100 PB
>                                                                 of
>                                                                 data
>                                                                 in one
>                                                                 go.
>                                                                 Its an
>                                                                 accumulating
>                                                                 process and
>                                                                 most
>                                                                 of the
>                                                                 companies
>                                                                 do
>                                                                 have a
>                                                                 data
>                                                                 pipeline
>                                                                 in
>                                                                 place
>                                                                 where
>                                                                 the
>                                                                 data
>                                                                 is
>                                                                 written to
>                                                                 hdfs
>                                                                 on a
>                                                                 frequency
>                                                                 basis
>                                                                 and
>                                                                  then
>                                                                 its
>                                                                 retained
>                                                                 on
>                                                                 hdfs
>                                                                 for
>                                                                 some
>                                                                 duration
>                                                                 as per
>                                                                 needed
>                                                                 and
>                                                                 from
>                                                                 there
>                                                                 its
>                                                                 sent
>                                                                 to
>                                                                 archivers
>                                                                 or
>                                                                 deleted.
>
>                                                                 For
>                                                                 data
>                                                                 management
>                                                                 products,
>                                                                 you
>                                                                 can
>                                                                 look
>                                                                 at
>                                                                 falcon
>                                                                 which
>                                                                 is
>                                                                 open
>                                                                 sourced by
>                                                                 inmobi
>                                                                 along
>                                                                 with
>                                                                 hortonworks.
>
>
>                                                                 In any
>                                                                 case,
>                                                                 if you
>                                                                 want
>                                                                 to
>                                                                 write
>                                                                 files
>                                                                 to
>                                                                 hdfs
>                                                                 there
>                                                                 are
>                                                                 few
>                                                                 options available
>                                                                 to you
>                                                                 1)
>                                                                 Write
>                                                                 your
>                                                                 dfs
>                                                                 client
>                                                                 which
>                                                                 writes
>                                                                 to dfs
>                                                                 2) use
>                                                                 hdfs proxy
>                                                                 3)
>                                                                 there
>                                                                 is webhdfs
>                                                                 4)
>                                                                 command line
>                                                                 hdfs
>                                                                 5)
>                                                                 data
>                                                                 collection
>                                                                 tools
>                                                                 come
>                                                                 with
>                                                                 support to
>                                                                 write
>                                                                 to
>                                                                 hdfs
>                                                                 like
>                                                                 flume etc
>
>
>                                                                 On
>                                                                 Sat,
>                                                                 May
>                                                                 11,
>                                                                 2013
>                                                                 at
>                                                                 4:19
>                                                                 PM,
>                                                                 Thoihen Maibam
>                                                                 <thoihen123@gmail.com
>                                                                 <ma...@gmail.com>>
>                                                                 wrote:
>
>                                                                     Hi
>                                                                     All,
>
>                                                                     Can anyone
>                                                                     help
>                                                                     me
>                                                                     know
>                                                                     how does
>                                                                     companies
>                                                                     like
>                                                                     Facebook
>                                                                     ,Yahoo
>                                                                     etc upload
>                                                                     bulk
>                                                                     files
>                                                                     say to
>                                                                     the tune
>                                                                     of
>                                                                     100 petabytes
>                                                                     to
>                                                                     Hadoop
>                                                                     HDFS
>                                                                     cluster
>                                                                     for processing
>                                                                     and after
>                                                                     processing
>                                                                     how they
>                                                                     download
>                                                                     those
>                                                                     files
>                                                                     from
>                                                                     HDFS
>                                                                     to
>                                                                     local
>                                                                     file
>                                                                     system.
>
>                                                                     I
>                                                                     don't
>                                                                     think
>                                                                     they
>                                                                     might
>                                                                     be
>                                                                     using
>                                                                     the command
>                                                                     line
>                                                                     hadoop
>                                                                     fs
>                                                                     put to
>                                                                     upload
>                                                                     files
>                                                                     as
>                                                                     it
>                                                                     would
>                                                                     take
>                                                                     too long
>                                                                     or
>                                                                     do
>                                                                     they
>                                                                     divide
>                                                                     say 10
>                                                                     parts
>                                                                     each
>                                                                     10
>                                                                     petabytes
>                                                                     and compress
>                                                                     and use
>                                                                     the command
>                                                                     line
>                                                                     hadoop
>                                                                     fs put
>
>                                                                     Or
>                                                                     if
>                                                                     they
>                                                                     use any
>                                                                     tool
>                                                                     to
>                                                                     upload
>                                                                     huge
>                                                                     files.
>
>                                                                     Please
>                                                                     help
>                                                                     me .
>
>                                                                     Thanks
>                                                                     thoihen
>
>
>
>
>                                                                 -- 
>                                                                 Nitin
>                                                                 Pawar
>
>
>
>
>
>                                                         -- 
>                                                         Nitin Pawar
>
>
>
>
>
>                                                 -- 
>                                                 Nitin Pawar
>
>
>
>
>
>                                         -- 
>                                         Nitin Pawar
>
>
>
>
>
>
>
>
>
>                 -- 
>                 Nitin Pawar
>
>
>
>
>
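
For concreteness, here are rough sketches of two of the upload paths Nitin
lists in the quoted thread above (options 3 and 4). The host names, ports
and paths are made up for illustration and are not from this thread.

# 3) WebHDFS: creating a file is a two-step REST call. The first PUT returns
#    a 307 redirect to a datanode; the second PUT streams the bytes to the
#    URL given in that Location header.
curl -i -X PUT "http://namenode.example.com:50070/webhdfs/v1/user/thoihen/big.log?op=CREATE&overwrite=true"
curl -i -X PUT -T /data/staging/big.log "<the Location URL returned by the first call>"

# 4) Command line: several "hadoop fs -put" uploads run in parallel from a
#    staging directory, which is the parallel put described above.
ls /data/staging | xargs -P 8 -I {} hadoop fs -put /data/staging/{} /user/thoihen/incoming/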


Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yeah, you are right. I misread your earlier post.

Thanks,
Rahul
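
For reference, the direction being asked about in the quoted exchange below
would look like this; the paths and namenode address are made up for
illustration:

hadoop distcp file:///data/staging/logs hdfs://localhost:8020/user/rahul/logs

distcp accepts a file:// URI as the source in the same way the example later
in the thread uses file:// as the destination.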


On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <do...@gmail.com> wrote:

> I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism as entire file is present on
> a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Yes , it's a MR job under the hood . my question was that you wrote that
>> using distcp you lose the benefits of parallel processing of Hadoop. I
>> think the MR job of distcp divides files into individual map tasks based on
>> the total size of the transfer , so multiple mappers would still be spawned
>> if the size of transfer is huge and they would work in parallel.
>>
>> Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks to both of you!
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>> example:
>>>>>
>>>>>
>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>> used to upload files from local to hdfs.
>>>>>>
>>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>>> present in the hadoop's fs?
>>>>>>
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> You're welcome :)
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>>> consumption.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> IMHO,I think the statement about NN with regard to block metadata
>>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely right Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>>
>>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>>> NN.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>
>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed.
>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe for this
>>>>>>>>>>>> purpose.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> NN would still be in picture because it will be writing a lot
>>>>>>>>>>>>> of meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want
>>>>>>>>>>>>>>> to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>> you can hdfs put command in parallel manner and in my
>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>> hdfs like flume etc
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
I had said that if you use distcp to copy data *from localFS to HDFS* then
you won't be able to exploit parallelism as entire file is present on a
single machine. So no multiple TTs.

Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com
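
As a hedged sketch of the knob being discussed: distcp builds a list of files
to copy and splits that list across map tasks, and -m caps how many maps run
at once. The cluster addresses and paths below are hypothetical.

hadoop distcp -m 20 hdfs://nn1:8020/data/big hdfs://nn2:8020/data/big

A single source file sitting on one machine's local filesystem still lands in
a single map, which is the limitation described above.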


On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Yes , it's a MR job under the hood . my question was that you wrote that
> using distcp you lose the benefits of parallel processing of Hadoop. I
> think the MR job of distcp divides files into individual map tasks based on
> the total size of the transfer , so multiple mappers would still be spawned
> if the size of transfer is huge and they would work in parallel.
>
> Correct me if there is anything wrong!
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> No. distcp is actually a mapreduce job under the hood.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks to both of you!
>>>
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> you can do that using file:///
>>>>
>>>> example:
>>>>
>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>> used to upload files from local to hdfs.
>>>>>
>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>> present in the hadoop's fs?
>>>>>
>>>>>  Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> You're welcome :)
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Tariq!
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>
>>>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>>>> consumption.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> IMHO,I think the statement about NN with regard to block metadata
>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>>>>
>>>>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> absolutely right Mohammad
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>
>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>> NN.
>>>>>>>>>>>
>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>
>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> NN would still be in picture because it will be writing a lot
>>>>>>>>>>>> of meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want
>>>>>>>>>>>>>> to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>> you can hdfs put command in parallel manner and in my
>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>
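
As a rough back-of-envelope for the "around 200B of metadata" point quoted
above (the block size and file sizes are assumptions, not numbers from the
thread; only the ~200 B/object figure comes from the discussion):

# 10 TB stored as large files with 128 MB blocks
echo $(( 10 * 1024 * 1024 / 128 ))   # 81920 blocks, i.e. roughly 16 MB of NN heap at ~200 B per object

# the same 10 TB as ten million 1 MB files means ~10M file objects plus ~10M
# block objects, i.e. on the order of gigabytes of NN heap

That difference is why the thread keeps coming back to "bigger files, less
metadata" and to having a capable namenode.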

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
I had said that if you use distcp to copy data *from localFS to HDFS* then
you won't be able to exploit parallelism, as the entire file is present on a
single machine. So no multiple TTs.

Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com
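
For reference, the local-to-HDFS direction being discussed would look roughly like the line below (made-up paths, and it assumes the file:// path is readable wherever the copy actually runs). The catch is the one above: the file:// source sits on a single machine, so however many maps distcp launches, they all pull from that one box.

hadoop distcp file:///data/bigfile.dat hdfs://namenode:8020/user/thoihen/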


On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Yes, it's a MR job under the hood. My question was that you wrote that
> using distcp you lose the benefits of parallel processing of Hadoop. I
> think the MR job of distcp divides files into individual map tasks based on
> the total size of the transfer , so multiple mappers would still be spawned
> if the size of transfer is huge and they would work in parallel.
>
> Correct me if there is anything wrong!
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> No. distcp is actually a mapreduce job under the hood.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks to both of you!
>>>
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> you can do that using file:///
>>>>
>>>> example:
>>>>
>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>> used to upload files from local to hdfs.
>>>>>
>>>>> isn't distcp a MR job ? wouldn't it need the data to be already
>>>>> present in the hadoop's fs?
>>>>>
>>>>>  Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> You're welcome :)
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Tariq!
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>
>>>>>>>> And, the bigger the files, the lesser the metadata, and hence the lesser memory
>>>>>>>> consumption.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> IMHO, I think the statement about NN with regard to block metadata
>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>> combined size 10 TB, you need to have a capable NN.
>>>>>>>>>
>>>>>>>>> can distcp be used to copy local to hdfs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> absolutely right Mohammad
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>>
>>>>>>>>>>> Every file and block in HDFS is treated as an object and for
>>>>>>>>>>> each object around 200B of metadata get created. So the NN should be
>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>> in-memory. Actually memory is the most important metric when it comes to
>>>>>>>>>>> NN.
>>>>>>>>>>>
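As a rough back-of-the-envelope illustration (the block size, file sizes and the ~200B-per-object figure are assumptions, not measurements): 10 TB stored as 128 MB files is roughly 80,000 files of one block each, i.e. about 160,000 namespace objects and on the order of 32 MB of namenode heap. The same 10 TB as 1 MB files is roughly 10 million files plus 10 million blocks, i.e. about 20 million objects and on the order of 4 GB of heap. The object count, not the raw data volume, is what the NN has to be sized for.
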
>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>
>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> NN would still be in picture because it will be writing a lot
>>>>>>>>>>>> of meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could
>>>>>>>>>>>>> not understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want
>>>>>>>>>>>>>> to upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size using
>>>>>>>>>>>>>>> the hadoop command line? Can the hadoop put command work with huge data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>> data in one go. It's an accumulating process and most of the companies do
>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yes, it's a MR job under the hood. My question was that you wrote that
using distcp you lose the benefits of parallel processing of Hadoop. I
think the MR job of distcp divides files into individual map tasks based on
the total size of the transfer, so multiple mappers would still be spawned
if the size of the transfer is huge and they would work in parallel.

Correct me if there is anything wrong!

Thanks,
Rahul
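
To make that concrete, distcp exposes the -m flag to cap the number of simultaneous copy maps; a sketch with made-up paths (the flag is standard distcp, everything else is illustrative):

hadoop distcp -m 20 hdfs://nn1:8020/data/2013-05-11 hdfs://nn2:8020/backup/2013-05-11

With many large files the job can use all 20 maps; a single huge file still goes to one map, since classic distcp splits work at file granularity.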


On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <do...@gmail.com> wrote:

> No. distcp is actually a mapreduce job under the hood.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks to both of you!
>>
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> you can do that using file:///
>>>
>>> example:
>>>
>>>
>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> @Tariq can you point me to some resource which shows how distcp is used
>>>> to upload files from local to hdfs.
>>>>
>>>> isn't distcp a MR job ? wouldn't it need the data to be already present
>>>> in the hadoop's fs?
>>>>
>>>>  Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> You're welcome :)
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> Thanks Tariq!
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>
>>>>>>> And, the bigger the files, the lesser the metadata, and hence the lesser memory
>>>>>>> consumption.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> IMHO, I think the statement about NN with regard to block metadata
>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>> combined size 10 TB, you need to have a capable NN.
>>>>>>>>
>>>>>>>> can distcp be used to copy local to hdfs?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> absolutely right Mohammad
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>
>>>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>>>
>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>
>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>>>
>>>>>>>>>> Warm Regards,
>>>>>>>>>> Tariq
>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>>> have a strong NN.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>>
>>>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size using
>>>>>>>>>>>>>> the hadoop command line? Can the hadoop put command work with huge data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, most of the companies do not get 100 PB of
>>>>>>>>>>>>>>> data in one go. It's an accumulating process and most of the companies do
>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For data management products, you can look at falcon which
>>>>>>>>>>>>>>> is open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are
>>>>>>>>>>>>>>> few options available to you
>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>> HDFS to local file system.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop
>>>>>>>>>>>>>>>> fs put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
No. distcp is actually a mapreduce job under the hood.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks to both of you!
>
> Rahul
>
>
> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> you can do that using file:///
>>
>> example:
>>
>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>
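For the local-to-HDFS direction that started this thread, the same URI-scheme trick works the other way around; a hedged example (host, port and paths are illustrative, and the file:// path has to be readable from the nodes that run the copy tasks):

  hadoop distcp file:///data/staging hdfs://namenode:8020/user/thoihen/staging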
>>
>>
>>
>>
>>
>>
>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> @Tariq can you point me to some resource which shows how distcp is used
>>> to upload files from local to hdfs.
>>>
>>> isn't distcp a MR job ? wouldn't it need the data to be already present
>>> in the hadoop's fs?
>>>
>>>  Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> You're welcome :)
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Thanks Tariq!
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>
>>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>>> consumption.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>>
>>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> absolutely right Mohammad
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>
>>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>>
>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>
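A rough, illustrative calculation of that point (assuming 128 MB blocks and around 200 bytes of namenode heap per file/block object): 10 TB stored as 1 GB files is roughly 10,000 files and 80,000 blocks, i.e. on the order of 90,000 objects and only a few tens of MB of NN heap; the same 10 TB stored as 1 MB files is roughly 10 million files plus 10 million blocks, which already needs several GB of heap. The object count, not the raw volume, is what sizes the NN.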
>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>>> have a strong NN.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rahul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>>
>>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>
>>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>
>>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I
>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>> you can run the hdfs put command in a parallel manner and in my
>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload
>>>>>>>>>>>>> these files of size 10 TB and is there any limit to the file size  using
>>>>>>>>>>>>> hadoop command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks to both of you!

Rahul


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to hdfs.
>>
>> isn't distcp a MR job ? wouldn't it need the data to be already present
>> in the hadoop's fs?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> You're welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> @Rahul : Yes. distcp can do that.
>>>>>
>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>> consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>
>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> absolutely right Mohammad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>> have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>
>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>>> to speed up the process.
>>>>>>>>>>> you can run the hdfs put command in a parallel manner and in my experience
>>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload these
>>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
@Rahul : I'm sorry I answered this on a wrong thread by mistake. You could
do that as Nitin has shown.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to hdfs.
>>
>> isn't distcp a MR job ? wouldn't it need the data to be already present
>> in the hadoop's fs?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> You're welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> @Rahul : Yes. distcp can do that.
>>>>>
>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>> consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>
>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> absolutely right Mohammad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>> have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>
>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>>> to speed up the process.
>>>>>>>>>>> you can run the hdfs put command in a parallel manner and in my experience
>>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use hadoop fs put command to upload these
>>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks to both of you!

Rahul


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to hdfs.
>>
>> isn't distcp a MR job ? wouldn't it need the data to be already present
>> in the hadoop's fs?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> You'r welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> @Rahul : Yes. distcp can do that.
>>>>>
>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>> consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>
>>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> absolutely rite Mohammad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>> have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>
>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>>> to speed up the process.
>>>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
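
As a rough illustration of the parallel "put" and WebHDFS options quoted above,
an upload could be scripted along these lines. The paths, hostnames, port
numbers and user name are assumptions made for this sketch, not details from
the thread, and WebHDFS has to be enabled on the cluster (dfs.webhdfs.enabled):

# push every file in a local staging directory with four parallel "hadoop fs -put" streams
ls /data/staging | xargs -P 4 -I {} hadoop fs -put /data/staging/{} /user/thoihen/raw/

# WebHDFS upload: the first PUT asks the NameNode, which answers with a 307
# redirect to a DataNode; the file body is then PUT to the URL it returned
curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/raw/part-001.gz?op=CREATE&user.name=thoihen"
curl -i -X PUT -T part-001.gz "<Location header returned by the NameNode>"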

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could
do that as Nitin has shown.
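
For modest volumes the plain FsShell copies cover both directions as well; the
paths here are only placeholders:

hadoop fs -get /user/thoihen/output/part-r-00000 /data/results/    # HDFS -> local
hadoop fs -put /data/results/summary.csv /user/thoihen/reports/    # local -> HDFS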

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to hdfs.
>>
>> isn't distcp a MR job ? wouldn't it need the data to be already present
>> in the hadoop's fs?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> You'r welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> @Rahul : Yes. distcp can do that.
>>>>>
>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>> consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>
>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> absolutely rite Mohammad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>> have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>
>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>>> to speed up the process.
>>>>>>>>>>> you can run the hdfs put command in parallel and in my experience
>>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
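
To put the NameNode sizing comments quoted above into rough numbers (the ~200 B
per namespace object figure is the one quoted in the thread; the rest is
back-of-envelope arithmetic, not a measurement):

  10 TB as ~10,000 files of 1 GB each, with 64 MB blocks
    -> ~10,000 file objects + ~164,000 block objects ~= 174,000 objects
    -> 174,000 x ~200 B ~= 35 MB of NameNode heap, which is negligible

  the same 10 TB as 100 million small files
    -> at least 200 million objects -> tens of GB of heap

So it is the number of files and blocks, not the raw volume, that decides how
capable the NameNode has to be.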

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks to both of you!

Rahul


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <ni...@gmail.com>wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to hdfs.
>>
>> isn't distcp a MR job ? wouldn't it need the data to be already present
>> in the hadoop's fs?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> You'r welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> @Rahul : Yes. distcp can do that.
>>>>>
>>>>> And, bigger the files lesser the metadata hence lesser memory
>>>>> consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>>> more like a general statement. Even if you put lots of small files of
>>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>>
>>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> absolutely rite Mohammad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
>>>>>>>> their own data aggregation tool, called Scribe for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>>> have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>>
>>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>>> to speed up the process.
>>>>>>>>>>> you can run the hdfs put command in parallel and in my experience
>>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>>> pipeline .
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>> data in one go. Its an accumulating process and most of the companies do
>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>> frequency basis and  then its retained on hdfs for some duration as per
>>>>>>>>>>>>> needed and from there its sent to archivers or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>>> options available to you
>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
You can do that using the file:/// scheme.

example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
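
For the opposite direction, uploading from the local filesystem into HDFS, the
source and destination are simply swapped. A sketch with placeholder paths and
NameNode address (the optional -m flag caps the number of map tasks distcp
launches):

hadoop distcp -m 20 file:///data/staging hdfs://namenode:8020/user/thoihen/staging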



On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> @Tariq can you point me to some resource which shows how distcp is used to
> upload files from local to hdfs.
>
> isn't distcp a MR job ? wouldn't it need the data to be already present in
> the hadoop's fs?
>
> Rahul
>
>
> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> You'r welcome :)
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks Tariq!
>>>
>>>
>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> @Rahul : Yes. distcp can do that.
>>>>
>>>> And, bigger the files lesser the metadata hence lesser memory
>>>> consumption.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>> more like a general statement. Even if you put lots of small files of
>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>
>>>>> can distcp be used to copy local - to - hdfs ?
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> absolutely rite Mohammad
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>
>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>
>>>>>>> Am I correct @Nitin?
>>>>>>>
>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>> have a strong NN.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>
>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>
>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>
>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>
>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>> to speed up the process.
>>>>>>>>>> you can run the hdfs put command in parallel and in my experience
>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>
>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>> pipeline .
>>>>>>>>>>>
>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>>
>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>
>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>> options available to you
>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Nitin Pawar

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
you can do that using file:///

example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/



On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> @Tariq can you point me to some resource which shows how distcp is used to
> upload files from local to hdfs.
>
> isn't distcp a MR job ? wouldn't it need the data to be already present in
> the hadoop's fs?
>
> Rahul
>
>
> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> You'r welcome :)
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks Tariq!
>>>
>>>
>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> @Rahul : Yes. distcp can do that.
>>>>
>>>> And, bigger the files lesser the metadata hence lesser memory
>>>> consumption.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>> more like a general statement. Even if you put lots of small files of
>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>
>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> absolutely rite Mohammad
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>
>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>
>>>>>>> Am I correct @Nitin?
>>>>>>>
>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>> have a strong NN.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>
>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>
>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>
>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>
>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>> to speed up the process.
>>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>
>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>> pipeline .
>>>>>>>>>>>
>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>>
>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>
>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>> options available to you
>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Nitin Pawar
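
For the "webhdfs" option mentioned in the quoted list above, an upload can be
driven with plain curl against the WebHDFS REST API; a rough sketch, assuming a
hypothetical NameNode web endpoint at namenode:50070 (WebHDFS answers the first
request with a redirect to a DataNode, which is where the bytes actually go):

# step 1: ask the NameNode to create the file; the reply is a 307 redirect
curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/data/file1.log?op=CREATE"

# step 2: send the file body to the DataNode URL from the Location header of step 1
curl -i -X PUT -T file1.log "<DataNode URL returned in the Location header>"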

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
You can do that using the file:/// scheme.

For example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
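
A minimal sketch of the opposite direction (local filesystem into HDFS); the
host name and paths below are illustrative, not taken from the thread:

# distcp with a file:// source still runs the copy as a MapReduce job
hadoop distcp file:///data/staging hdfs://namenode:8020/user/thoihen/incoming

# or, for a handful of large files, plain puts launched in parallel from a shell
hadoop fs -put /data/staging/part1.gz /user/thoihen/incoming/ &
hadoop fs -put /data/staging/part2.gz /user/thoihen/incoming/ &
wait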



On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> @Tariq can you point me to some resource which shows how distcp is used to
> upload files from local to hdfs.
>
> isn't distcp a MR job ? wouldn't it need the data to be already present in
> the hadoop's fs?
>
> Rahul
>
>
> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> You'r welcome :)
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks Tariq!
>>>
>>>
>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> @Rahul : Yes. distcp can do that.
>>>>
>>>> And, bigger the files lesser the metadata hence lesser memory
>>>> consumption.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>>> more like a general statement. Even if you put lots of small files of
>>>>> combined size 10 TB , you need to have a capable NN.
>>>>>
>>>>> can disct cp be used to copy local - to - hdfs ?
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> absolutely rite Mohammad
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>
>>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>>
>>>>>>> Am I correct @Nitin?
>>>>>>>
>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>>> have a strong NN.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>>
>>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>>
>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>>
>>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>
>>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>>> to speed up the process.
>>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>>
>>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>>> pipeline .
>>>>>>>>>>>
>>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>>
>>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>>
>>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>>> options available to you
>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs
>>>>>>>>>>>> like flume etc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone help me know how does companies like Facebook
>>>>>>>>>>>>> ,Yahoo etc upload bulk files say to the tune of 100 petabytes to Hadoop
>>>>>>>>>>>>> HDFS cluster for processing
>>>>>>>>>>>>> and after processing how they download those files from HDFS
>>>>>>>>>>>>> to local file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Nitin Pawar
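
A quick back-of-envelope on the NameNode memory point quoted in this thread
(roughly 200 bytes of heap per file/block object); the file sizes and the
128 MB block size below are assumptions for illustration only:

  10 TB as 10,000 x 1 GB files     ->  10,000 files + 80,000 blocks
                                       ~90,000 objects x 200 B  =  ~18 MB of heap
  10 TB as 10,000,000 x 1 MB files ->  10,000,000 files + 10,000,000 blocks
                                       ~20,000,000 objects x 200 B  =  ~4 GB of heap

which is the arithmetic behind "bigger the files lesser the metadata hence
lesser memory consumption" earlier in the thread.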

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
@Tariq, can you point me to some resource which shows how distcp is used to
upload files from the local filesystem to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to be already present in
Hadoop's fs?

Rahul


On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You'r welcome :)
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks Tariq!
>>
>>
>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> @Rahul : Yes. distcp can do that.
>>>
>>> And, bigger the files lesser the metadata hence lesser memory
>>> consumption.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>> more like a general statement. Even if you put lots of small files of
>>>> combined size 10 TB , you need to have a capable NN.
>>>>
>>>> can disct cp be used to copy local - to - hdfs ?
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> absolutely rite Mohammad
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>
>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>
>>>>>> Am I correct @Nitin?
>>>>>>
>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>> have a strong NN.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>
>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>
>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>
>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>
>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>> to speed up the process.
>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>
>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>> pipeline .
>>>>>>>>>>
>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>
>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>
>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>> options available to you
>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>>>> flume etc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>>>> cluster for processing
>>>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>>>> local file system.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>
>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> thoihen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by shashwat shriparv <dw...@gmail.com>.
In our case we have our own custom-written HDFS client to write the data and
download it.
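
For anyone without a custom client, the stock command line covers the same
round trip; a small sketch with illustrative paths (upload the input, then
pull the job output back to the local filesystem):

hadoop fs -mkdir /user/thoihen/input
hadoop fs -put /data/local/logs/*.gz /user/thoihen/input/

# after the job finishes
hadoop fs -get /user/thoihen/output/part-* /data/local/results/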

Thanks & Regards

∞
Shashwat Shriparv



On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You'r welcome :)
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks Tariq!
>>
>>
>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> @Rahul : Yes. distcp can do that.
>>>
>>> And, bigger the files lesser the metadata hence lesser memory
>>> consumption.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>> more like a general statement. Even if you put lots of small files of
>>>> combined size 10 TB , you need to have a capable NN.
>>>>
>>>> can disct cp be used to copy local - to - hdfs ?
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> absolutely rite Mohammad
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>
>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>
>>>>>> Am I correct @Nitin?
>>>>>>
>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>> have a strong NN.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>
>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>
>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>
>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>
>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>> to speed up the process.
>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>
>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>> pipeline .
>>>>>>>>>>
>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>
>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>
>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>> options available to you
>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>>>> flume etc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>>>> cluster for processing
>>>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>>>> local file system.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>
>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> thoihen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
@Tariq, can you point me to some resource which shows how distcp is used to
upload files from the local filesystem to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to be already present in
Hadoop's fs?
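
For reference, a distcp invocation of that kind would look roughly like the
sketch below; the host name, port and paths are made-up placeholders, and the
source directory is assumed to be readable on the machines that actually run
the copy.

    # distcp takes Hadoop filesystem URIs, so a local directory can be
    # given as the source with the file:// scheme
    hadoop distcp file:///data/incoming hdfs://namenode:8020/user/hadoop/incoming

    # the plain, single-process way of doing the same copy, for comparison
    hadoop fs -put /data/incoming /user/hadoop/incoming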

Rahul


On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You'r welcome :)
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks Tariq!
>>
>>
>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> @Rahul : Yes. distcp can do that.
>>>
>>> And, bigger the files lesser the metadata hence lesser memory
>>> consumption.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>> more like a general statement. Even if you put lots of small files of
>>>> combined size 10 TB , you need to have a capable NN.
>>>>
>>>> can disct cp be used to copy local - to - hdfs ?
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> absolutely rite Mohammad
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>
>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>
>>>>>> Am I correct @Nitin?
>>>>>>
>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>> have a strong NN.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>
>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>
>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>
>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>
>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>> to speed up the process.
>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>
>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>> pipeline .
>>>>>>>>>>
>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>
>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>
>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>> options available to you
>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>>>> flume etc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>>>> cluster for processing
>>>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>>>> local file system.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>
>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> thoihen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by shashwat shriparv <dw...@gmail.com>.
In our case we have written our own HDFS client to write the data and
download it.

*Thanks & Regards    *

∞
Shashwat Shriparv
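
Their client itself is not shown here; as a rough sketch of the same kind of
upload/download done over plain HTTP, the WebHDFS REST interface (option 3 in
Nitin's list quoted below) can be driven with curl. The host name, the default
50070 namenode HTTP port and the user.name value are placeholders.

    # 1) ask the namenode for a write location; it replies with a
    #    307 redirect whose Location header points at a datanode
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/hadoop/data.csv?op=CREATE&user.name=hadoop"

    # 2) send the actual bytes to the URL returned in that Location header
    curl -i -X PUT -T data.csv "$LOCATION_FROM_STEP_1"

    # download: op=OPEN redirects to a datanode holding the data; -L follows it
    curl -L "http://namenode:50070/webhdfs/v1/user/hadoop/data.csv?op=OPEN&user.name=hadoop" -o data.csv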



On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You'r welcome :)
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks Tariq!
>>
>>
>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> @Rahul : Yes. distcp can do that.
>>>
>>> And, bigger the files lesser the metadata hence lesser memory
>>> consumption.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> IMHO,I think the statement about NN with regard to block metadata is
>>>> more like a general statement. Even if you put lots of small files of
>>>> combined size 10 TB , you need to have a capable NN.
>>>>
>>>> can disct cp be used to copy local - to - hdfs ?
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> absolutely rite Mohammad
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>
>>>>>> Every file and block in HDFS is treated as an object and for each
>>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>>
>>>>>> Am I correct @Nitin?
>>>>>>
>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> NN would still be in picture because it will be writing a lot of
>>>>>>> meta data for each individual file. so you will need a NN capable enough
>>>>>>> which can store the metadata for your entire dataset. Data will never go to
>>>>>>> NN but lot of metadata about data will be on NN so its always good idea to
>>>>>>> have a strong NN.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>>
>>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>>
>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>>
>>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>
>>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>>> to speed up the process.
>>>>>>>>> you can hdfs put command in parallel manner and in my experience
>>>>>>>>> it has not failed when we write a lot of data.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>>
>>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>>> pipeline .
>>>>>>>>>>
>>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>>
>>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>>
>>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>>> options available to you
>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>>>> flume etc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>>>> cluster for processing
>>>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>>>> local file system.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think they might be using the command line hadoop fs
>>>>>>>>>>>> put to upload files as it would take too long or do they divide say 10
>>>>>>>>>>>> parts each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>>
>>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me .
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> thoihen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Tariq!
>
>
> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> @Rahul : Yes. distcp can do that.
>>
>> And, bigger the files lesser the metadata hence lesser memory consumption.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> IMHO,I think the statement about NN with regard to block metadata is
>>> more like a general statement. Even if you put lots of small files of
>>> combined size 10 TB , you need to have a capable NN.
>>>
>>> can disct cp be used to copy local - to - hdfs ?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> absolutely rite Mohammad
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>
>>>>> Every file and block in HDFS is treated as an object and for each
>>>>> object around 200B of metadata get created. So the NN should be powerful
>>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>>> Actually memory is the most important metric when it comes to NN.
>>>>>
>>>>> Am I correct @Nitin?
>>>>>
>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>> don't actually just do a "put". You could use something like "distcp" for
>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>>> data aggregation tool, called Scribe for this purpose.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> NN would still be in picture because it will be writing a lot of meta
>>>>>> data for each individual file. so you will need a NN capable enough which
>>>>>> can store the metadata for your entire dataset. Data will never go to NN
>>>>>> but lot of metadata about data will be on NN so its always good idea to
>>>>>> have a strong NN.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>>> get locations of DN as where to store the data blocks.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>>
>>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>>
>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>>
>>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>
>>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>>> to speed up the process.
>>>>>>>> you can hdfs put command in parallel manner and in my experience it
>>>>>>>> has not failed when we write a lot of data.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>>
>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>>
>>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>>> pipeline .
>>>>>>>>>
>>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> first of all .. most of the companies do not get 100 PB of data
>>>>>>>>>> in one go. Its an accumulating process and most of the companies do have a
>>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>>
>>>>>>>>>> For data management products, you can look at falcon which is
>>>>>>>>>> open sourced by inmobi along with hortonworks.
>>>>>>>>>>
>>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>>> options available to you
>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>> 3) there is webhdfs
>>>>>>>>>> 4) command line hdfs
>>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>>> flume etc
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>>> cluster for processing
>>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>>> local file system.
>>>>>>>>>>>
>>>>>>>>>>> I don't think they might be using the command line hadoop fs put
>>>>>>>>>>> to upload files as it would take too long or do they divide say 10 parts
>>>>>>>>>>> each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>>
>>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>>
>>>>>>>>>>> Please help me .
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> thoihen
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Tariq!


On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com> wrote:

> @Rahul : Yes. distcp can do that.
>
> And, bigger the files lesser the metadata hence lesser memory consumption.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> IMHO,I think the statement about NN with regard to block metadata is more
>> like a general statement. Even if you put lots of small files of combined
>> size 10 TB , you need to have a capable NN.
>>
>> can disct cp be used to copy local - to - hdfs ?
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> absolutely rite Mohammad
>>>
>>>
>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>
>>>> Every file and block in HDFS is treated as an object and for each
>>>> object around 200B of metadata get created. So the NN should be powerful
>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>> Actually memory is the most important metric when it comes to NN.
>>>>
>>>> Am I correct @Nitin?
>>>>
>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>> don't actually just do a "put". You could use something like "distcp" for
>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>> data aggregation tool, called Scribe for this purpose.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> NN would still be in picture because it will be writing a lot of meta
>>>>> data for each individual file. so you will need a NN capable enough which
>>>>> can store the metadata for your entire dataset. Data will never go to NN
>>>>> but lot of metadata about data will be on NN so its always good idea to
>>>>> have a strong NN.
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>> get locations of DN as where to store the data blocks.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>
>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>
>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>
>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>
>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>> to speed up the process.
>>>>>>> you can hdfs put command in parallel manner and in my experience it
>>>>>>> has not failed when we write a lot of data.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>
>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>
>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>> pipeline .
>>>>>>>>
>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>
>>>>>>>>> For data management products, you can look at falcon which is open
>>>>>>>>> sourced by inmobi along with hortonworks.
>>>>>>>>>
>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>> options available to you
>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>> 2) use hdfs proxy
>>>>>>>>> 3) there is webhdfs
>>>>>>>>> 4) command line hdfs
>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>> flume etc
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>> cluster for processing
>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>> local file system.
>>>>>>>>>>
>>>>>>>>>> I don't think they might be using the command line hadoop fs put
>>>>>>>>>> to upload files as it would take too long or do they divide say 10 parts
>>>>>>>>>> each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>
>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>
>>>>>>>>>> Please help me .
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> thoihen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Tariq!


On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <do...@gmail.com> wrote:

> @Rahul : Yes. distcp can do that.
>
> And, bigger the files lesser the metadata hence lesser memory consumption.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> IMHO,I think the statement about NN with regard to block metadata is more
>> like a general statement. Even if you put lots of small files of combined
>> size 10 TB , you need to have a capable NN.
>>
>> can disct cp be used to copy local - to - hdfs ?
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> absolutely rite Mohammad
>>>
>>>
>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>
>>>> Every file and block in HDFS is treated as an object and for each
>>>> object around 200B of metadata get created. So the NN should be powerful
>>>> enough to handle that much metadata, since it is going to be in-memory.
>>>> Actually memory is the most important metric when it comes to NN.
>>>>
>>>> Am I correct @Nitin?
>>>>
>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>> don't actually just do a "put". You could use something like "distcp" for
>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>>> data aggregation tool, called Scribe for this purpose.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> NN would still be in picture because it will be writing a lot of meta
>>>>> data for each individual file. so you will need a NN capable enough which
>>>>> can store the metadata for your entire dataset. Data will never go to NN
>>>>> but lot of metadata about data will be on NN so its always good idea to
>>>>> have a strong NN.
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>>> part of the actual data write pipeline , means that the data would not
>>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>>> get locations of DN as where to store the data blocks.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>>
>>>>>>> when you say , you have files worth 10TB files and you want to
>>>>>>> upload  to HDFS, several factors come into picture
>>>>>>>
>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>> 2) If there guarantee that network will not go down?
>>>>>>>
>>>>>>> and Most importantly I assume that you have a capable hadoop
>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>
>>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>>> to speed up the process.
>>>>>>> you can hdfs put command in parallel manner and in my experience it
>>>>>>> has not failed when we write a lot of data.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>>
>>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>>
>>>>>>>> But I have one more question , say I have 10 TB data in the
>>>>>>>> pipeline .
>>>>>>>>
>>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>>
>>>>>>>>> For data management products, you can look at falcon which is open
>>>>>>>>> sourced by inmobi along with hortonworks.
>>>>>>>>>
>>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>>> options available to you
>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>> 2) use hdfs proxy
>>>>>>>>> 3) there is webhdfs
>>>>>>>>> 4) command line hdfs
>>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>>> flume etc
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>>> cluster for processing
>>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>>> local file system.
>>>>>>>>>>
>>>>>>>>>> I don't think they might be using the command line hadoop fs put
>>>>>>>>>> to upload files as it would take too long or do they divide say 10 parts
>>>>>>>>>> each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>>
>>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>>
>>>>>>>>>> Please help me .
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> thoihen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
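
A quick sketch of the parallel "hadoop fs -put" approach Nitin describes in the
quoted thread above. The local layout under /data/incoming, the target HDFS
directory, and the use of plain shell backgrounding are only illustrative
assumptions; in practice you would also cap the number of concurrent uploads.

    # upload each local part in its own process instead of one long sequential copy
    for part in /data/incoming/part-*; do
      hadoop fs -put "$part" /user/thoihen/incoming/ &
    done
    wait   # block until all background uploads have finished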

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
@Rahul: Yes, distcp can do that.

And the bigger the files, the less metadata gets created, and hence the lower
the memory consumption on the NameNode.
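
For reference, a local-to-HDFS distcp invocation looks roughly like the sketch
below; the source directory and NameNode address are placeholders. Keep in mind
that a file:// source only works if that path is readable from the nodes that
end up running the copy tasks.

    hadoop distcp file:///data/incoming hdfs://namenode:8020/user/thoihen/incoming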

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> IMHO,I think the statement about NN with regard to block metadata is more
> like a general statement. Even if you put lots of small files of combined
> size 10 TB , you need to have a capable NN.
>
> can disct cp be used to copy local - to - hdfs ?
>
> Thanks,
> Rahul
>
>
> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> absolutely rite Mohammad
>>
>>
>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>
>>> Every file and block in HDFS is treated as an object and for each object
>>> around 200B of metadata get created. So the NN should be powerful enough to
>>> handle that much metadata, since it is going to be in-memory. Actually
>>> memory is the most important metric when it comes to NN.
>>>
>>> Am I correct @Nitin?
>>>
>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>> don't actually just do a "put". You could use something like "distcp" for
>>> parallel copying. A better approach would be to use a data aggregation tool
>>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>>> data aggregation tool, called Scribe for this purpose.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> NN would still be in picture because it will be writing a lot of meta
>>>> data for each individual file. so you will need a NN capable enough which
>>>> can store the metadata for your entire dataset. Data will never go to NN
>>>> but lot of metadata about data will be on NN so its always good idea to
>>>> have a strong NN.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>>> part of the actual data write pipeline , means that the data would not
>>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>>> get locations of DN as where to store the data blocks.
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> is it safe? .. there is no direct answer yes or no
>>>>>>
>>>>>> when you say , you have files worth 10TB files and you want to upload
>>>>>>  to HDFS, several factors come into picture
>>>>>>
>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>> 2) If there guarantee that network will not go down?
>>>>>>
>>>>>> and Most importantly I assume that you have a capable hadoop cluster.
>>>>>> By that I mean you have a capable namenode.
>>>>>>
>>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>>> to speed up the process.
>>>>>> you can hdfs put command in parallel manner and in my experience it
>>>>>> has not failed when we write a lot of data.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>>
>>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>>
>>>>>>> But I have one more question , say I have 10 TB data in the pipeline
>>>>>>> .
>>>>>>>
>>>>>>> Is it perfectly OK to use hadopo fs put command to upload these
>>>>>>> files of size 10 TB and is there any limit to the file size  using hadoop
>>>>>>> command line . Can hadoop put command line work with huge data.
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>>> from there its sent to archivers or deleted.
>>>>>>>>
>>>>>>>> For data management products, you can look at falcon which is open
>>>>>>>> sourced by inmobi along with hortonworks.
>>>>>>>>
>>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>>> options available to you
>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>> 2) use hdfs proxy
>>>>>>>> 3) there is webhdfs
>>>>>>>> 4) command line hdfs
>>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>>> flume etc
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo
>>>>>>>>> etc upload bulk files say to the tune of 100 petabytes to Hadoop HDFS
>>>>>>>>> cluster for processing
>>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>>> local file system.
>>>>>>>>>
>>>>>>>>> I don't think they might be using the command line hadoop fs put
>>>>>>>>> to upload files as it would take too long or do they divide say 10 parts
>>>>>>>>> each 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>>
>>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>>
>>>>>>>>> Please help me .
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> thoihen
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
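
To make option 3 from the quoted list above (WebHDFS) a bit more concrete, a
file can be pushed over plain HTTP in two steps, assuming WebHDFS is enabled on
the cluster (dfs.webhdfs.enabled=true); the host names, port, and path below are
placeholders for illustration.

    # 1) ask the NameNode for a write location; it answers with a 307 redirect
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/logs/file1.txt?op=CREATE"
    # 2) send the file body to the DataNode URL from the Location header of that reply
    curl -i -X PUT -T file1.txt "<Location header URL from step 1>"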

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
IMHO, the statement about the NN with regard to block metadata is more of a
general point: even if you load lots of small files with a combined size of
10 TB, you still need a capable NN.
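
A rough back-of-the-envelope illustration of that point, using the ~200 B per
namespace object estimate quoted further down in this thread (simplified to one
block per file):

    10 TB as   ~80,000 files of 128 MB -> ~160,000 objects   * 200 B = ~32 MB of NN heap
    10 TB as ~10 million files of 1 MB -> ~21 million objects * 200 B = ~4 GB of NN heap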

Can distcp be used to copy from the local file system to HDFS?

Thanks,
Rahul


On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <ni...@gmail.com>wrote:

> absolutely rite Mohammad
>
>
> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Sorry for barging in guys. I think Nitin is talking about this :
>>
>> Every file and block in HDFS is treated as an object and for each object
>> around 200B of metadata get created. So the NN should be powerful enough to
>> handle that much metadata, since it is going to be in-memory. Actually
>> memory is the most important metric when it comes to NN.
>>
>> Am I correct @Nitin?
>>
>> @Thoihen : As Nitin has said, when you talk about that much data you
>> don't actually just do a "put". You could use something like "distcp" for
>> parallel copying. A better approach would be to use a data aggregation tool
>> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
>> data aggregation tool, called Scribe for this purpose.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> NN would still be in picture because it will be writing a lot of meta
>>> data for each individual file. so you will need a NN capable enough which
>>> can store the metadata for your entire dataset. Data will never go to NN
>>> but lot of metadata about data will be on NN so its always good idea to
>>> have a strong NN.
>>>
>>>
>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>>> understand the meaning of capable NN. As I know , the NN would not be a
>>>> part of the actual data write pipeline , means that the data would not
>>>> travel through the NN , the dfs would contact the NN from time to time to
>>>> get locations of DN as where to store the data blocks.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> is it safe? .. there is no direct answer yes or no
>>>>>
>>>>> when you say , you have files worth 10TB files and you want to upload
>>>>>  to HDFS, several factors come into picture
>>>>>
>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>> 2) If there guarantee that network will not go down?
>>>>>
>>>>> and Most importantly I assume that you have a capable hadoop cluster.
>>>>> By that I mean you have a capable namenode.
>>>>>
>>>>> I would definitely not write files sequentially in HDFS. I would
>>>>> prefer to write files in parallel to hdfs to utilize the DFS write features
>>>>> to speed up the process.
>>>>> you can hdfs put command in parallel manner and in my experience it
>>>>> has not failed when we write a lot of data.
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>>
>>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>>
>>>>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>>>>
>>>>>> Is it perfectly OK to use hadopo fs put command to upload these files
>>>>>> of size 10 TB and is there any limit to the file size  using hadoop command
>>>>>> line . Can hadoop put command line work with huge data.
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>>> from there its sent to archivers or deleted.
>>>>>>>
>>>>>>> For data management products, you can look at falcon which is open
>>>>>>> sourced by inmobi along with hortonworks.
>>>>>>>
>>>>>>> In any case, if you want to write files to hdfs there are few
>>>>>>> options available to you
>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>> 2) use hdfs proxy
>>>>>>> 3) there is webhdfs
>>>>>>> 4) command line hdfs
>>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>>> flume etc
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>>>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>>>>>> for processing
>>>>>>>> and after processing how they download those files from HDFS to
>>>>>>>> local file system.
>>>>>>>>
>>>>>>>> I don't think they might be using the command line hadoop fs put to
>>>>>>>> upload files as it would take too long or do they divide say 10 parts each
>>>>>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>>
>>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>>
>>>>>>>> Please help me .
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> thoihen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
Absolutely right, Mohammad.


On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Sorry for barging in guys. I think Nitin is talking about this :
>
> Every file and block in HDFS is treated as an object and for each object
> around 200B of metadata get created. So the NN should be powerful enough to
> handle that much metadata, since it is going to be in-memory. Actually
> memory is the most important metric when it comes to NN.
>
> Am I correct @Nitin?
>
> @Thoihen : As Nitin has said, when you talk about that much data you don't
> actually just do a "put". You could use something like "distcp" for
> parallel copying. A better approach would be to use a data aggregation tool
> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
> data aggregation tool, called Scribe for this purpose.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> NN would still be in picture because it will be writing a lot of meta
>> data for each individual file. so you will need a NN capable enough which
>> can store the metadata for your entire dataset. Data will never go to NN
>> but lot of metadata about data will be on NN so its always good idea to
>> have a strong NN.
>>
>>
>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>> understand the meaning of capable NN. As I know , the NN would not be a
>>> part of the actual data write pipeline , means that the data would not
>>> travel through the NN , the dfs would contact the NN from time to time to
>>> get locations of DN as where to store the data blocks.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> is it safe? .. there is no direct answer yes or no
>>>>
>>>> when you say , you have files worth 10TB files and you want to upload
>>>>  to HDFS, several factors come into picture
>>>>
>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>> 2) If there guarantee that network will not go down?
>>>>
>>>> and Most importantly I assume that you have a capable hadoop cluster.
>>>> By that I mean you have a capable namenode.
>>>>
>>>> I would definitely not write files sequentially in HDFS. I would prefer
>>>> to write files in parallel to hdfs to utilize the DFS write features to
>>>> speed up the process.
>>>> you can hdfs put command in parallel manner and in my experience it has
>>>> not failed when we write a lot of data.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>
>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>
>>>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>>>
>>>>> Is it perfectly OK to use hadopo fs put command to upload these files
>>>>> of size 10 TB and is there any limit to the file size  using hadoop command
>>>>> line . Can hadoop put command line work with huge data.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>> from there its sent to archivers or deleted.
>>>>>>
>>>>>> For data management products, you can look at falcon which is open
>>>>>> sourced by inmobi along with hortonworks.
>>>>>>
>>>>>> In any case, if you want to write files to hdfs there are few options
>>>>>> available to you
>>>>>> 1) Write your dfs client which writes to dfs
>>>>>> 2) use hdfs proxy
>>>>>> 3) there is webhdfs
>>>>>> 4) command line hdfs
>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>> flume etc
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>>>>> for processing
>>>>>>> and after processing how they download those files from HDFS to
>>>>>>> local file system.
>>>>>>>
>>>>>>> I don't think they might be using the command line hadoop fs put to
>>>>>>> upload files as it would take too long or do they divide say 10 parts each
>>>>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>
>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>
>>>>>>> Please help me .
>>>>>>>
>>>>>>> Thanks
>>>>>>> thoihen
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Re: Hadoop noob question

Posted by Shahab Yunus <sh...@gmail.com>.
@Thoihen. If the data that you are trying to load is not streaming, or the
loading is not real-time in nature, then why don't you use Sqoop? It is
relatively easy to use, with not much of a learning curve.
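
For example, pulling a single table out of a relational database into HDFS
with Sqoop might look roughly like this; the JDBC URL, credentials, table
name and target directory are only placeholders:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 8

The --num-mappers option controls how many map tasks do the transfer in
parallel.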

Regards,
Shahab


On Sat, May 11, 2013 at 12:03 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Sorry for barging in guys. I think Nitin is talking about this :
>
> Every file and block in HDFS is treated as an object and for each object
> around 200B of metadata get created. So the NN should be powerful enough to
> handle that much metadata, since it is going to be in-memory. Actually
> memory is the most important metric when it comes to NN.
>
> Am I correct @Nitin?
>
> @Thoihen : As Nitin has said, when you talk about that much data you don't
> actually just do a "put". You could use something like "distcp" for
> parallel copying. A better approach would be to use a data aggregation tool
> like Flume or Chukwa, as Nitin has already pointed. Facebook uses their own
> data aggregation tool, called Scribe for this purpose.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> NN would still be in picture because it will be writing a lot of meta
>> data for each individual file. so you will need a NN capable enough which
>> can store the metadata for your entire dataset. Data will never go to NN
>> but lot of metadata about data will be on NN so its always good idea to
>> have a strong NN.
>>
>>
>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> @Nitin , parallel dfs to write to hdfs is great , but could not
>>> understand the meaning of capable NN. As I know , the NN would not be a
>>> part of the actual data write pipeline , means that the data would not
>>> travel through the NN , the dfs would contact the NN from time to time to
>>> get locations of DN as where to store the data blocks.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> is it safe? .. there is no direct answer yes or no
>>>>
>>>> when you say , you have files worth 10TB files and you want to upload
>>>>  to HDFS, several factors come into picture
>>>>
>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>> 2) If there guarantee that network will not go down?
>>>>
>>>> and Most importantly I assume that you have a capable hadoop cluster.
>>>> By that I mean you have a capable namenode.
>>>>
>>>> I would definitely not write files sequentially in HDFS. I would prefer
>>>> to write files in parallel to hdfs to utilize the DFS write features to
>>>> speed up the process.
>>>> you can hdfs put command in parallel manner and in my experience it has
>>>> not failed when we write a lot of data.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>>
>>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>>
>>>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>>>
>>>>> Is it perfectly OK to use hadopo fs put command to upload these files
>>>>> of size 10 TB and is there any limit to the file size  using hadoop command
>>>>> line . Can hadoop put command line work with huge data.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> first of all .. most of the companies do not get 100 PB of data in
>>>>>> one go. Its an accumulating process and most of the companies do have a
>>>>>> data pipeline in place where the data is written to hdfs on a frequency
>>>>>> basis and  then its retained on hdfs for some duration as per needed and
>>>>>> from there its sent to archivers or deleted.
>>>>>>
>>>>>> For data management products, you can look at falcon which is open
>>>>>> sourced by inmobi along with hortonworks.
>>>>>>
>>>>>> In any case, if you want to write files to hdfs there are few options
>>>>>> available to you
>>>>>> 1) Write your dfs client which writes to dfs
>>>>>> 2) use hdfs proxy
>>>>>> 3) there is webhdfs
>>>>>> 4) command line hdfs
>>>>>> 5) data collection tools come with support to write to hdfs like
>>>>>> flume etc
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>>>>> for processing
>>>>>>> and after processing how they download those files from HDFS to
>>>>>>> local file system.
>>>>>>>
>>>>>>> I don't think they might be using the command line hadoop fs put to
>>>>>>> upload files as it would take too long or do they divide say 10 parts each
>>>>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>>
>>>>>>> Or if they use any tool to upload huge files.
>>>>>>>
>>>>>>> Please help me .
>>>>>>>
>>>>>>> Thanks
>>>>>>> thoihen
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>

Re: Hadoop noob question

Posted by Mohammad Tariq <do...@gmail.com>.
Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object
around 200 bytes of metadata gets created. So the NN should be powerful
enough to handle that much metadata, since all of it is held in memory.
Memory is actually the most important metric when it comes to the NN.
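
As a rough, illustrative calculation: a namespace of about 100 million files
and blocks, at roughly 200 bytes of metadata each, already comes to around
20 GB of NameNode heap for the namespace alone.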

Am I correct @Nitin?

@Thoihen : As Nitin has said, when you talk about that much data you don't
actually just do a "put". You could use something like "distcp" for
parallel copying. A better approach would be to use a data aggregation tool
like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its
own data aggregation tool, called Scribe, for this purpose.
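
To give a feel for the distcp route, a copy between two clusters is a
one-liner, and the work is automatically spread across many map tasks. The
namenode addresses and paths here are placeholders:

    hadoop distcp hdfs://nn1:8020/user/etl/input hdfs://nn2:8020/backup/input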

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <ni...@gmail.com>wrote:

> NN would still be in picture because it will be writing a lot of meta data
> for each individual file. so you will need a NN capable enough which can
> store the metadata for your entire dataset. Data will never go to NN but
> lot of metadata about data will be on NN so its always good idea to have a
> strong NN.
>
>
> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> @Nitin , parallel dfs to write to hdfs is great , but could not
>> understand the meaning of capable NN. As I know , the NN would not be a
>> part of the actual data write pipeline , means that the data would not
>> travel through the NN , the dfs would contact the NN from time to time to
>> get locations of DN as where to store the data blocks.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> is it safe? .. there is no direct answer yes or no
>>>
>>> when you say , you have files worth 10TB files and you want to upload
>>>  to HDFS, several factors come into picture
>>>
>>> 1) Is the machine in the same network as your hadoop cluster?
>>> 2) If there guarantee that network will not go down?
>>>
>>> and Most importantly I assume that you have a capable hadoop cluster. By
>>> that I mean you have a capable namenode.
>>>
>>> I would definitely not write files sequentially in HDFS. I would prefer
>>> to write files in parallel to hdfs to utilize the DFS write features to
>>> speed up the process.
>>> you can hdfs put command in parallel manner and in my experience it has
>>> not failed when we write a lot of data.
>>>
>>>
>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com>wrote:
>>>
>>>> @Nitin Pawar , thanks for clearing my doubts .
>>>>
>>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>>
>>>> Is it perfectly OK to use hadopo fs put command to upload these files
>>>> of size 10 TB and is there any limit to the file size  using hadoop command
>>>> line . Can hadoop put command line work with huge data.
>>>>
>>>> Thanks in advance
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> first of all .. most of the companies do not get 100 PB of data in one
>>>>> go. Its an accumulating process and most of the companies do have a data
>>>>> pipeline in place where the data is written to hdfs on a frequency basis
>>>>> and  then its retained on hdfs for some duration as per needed and from
>>>>> there its sent to archivers or deleted.
>>>>>
>>>>> For data management products, you can look at falcon which is open
>>>>> sourced by inmobi along with hortonworks.
>>>>>
>>>>> In any case, if you want to write files to hdfs there are few options
>>>>> available to you
>>>>> 1) Write your dfs client which writes to dfs
>>>>> 2) use hdfs proxy
>>>>> 3) there is webhdfs
>>>>> 4) command line hdfs
>>>>> 5) data collection tools come with support to write to hdfs like flume
>>>>> etc
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>>>> for processing
>>>>>> and after processing how they download those files from HDFS to local
>>>>>> file system.
>>>>>>
>>>>>> I don't think they might be using the command line hadoop fs put to
>>>>>> upload files as it would take too long or do they divide say 10 parts each
>>>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>>>
>>>>>> Or if they use any tool to upload huge files.
>>>>>>
>>>>>> Please help me .
>>>>>>
>>>>>> Thanks
>>>>>> thoihen
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
The NN would still be in the picture because it will be writing a lot of
metadata for each individual file, so you will need a NN capable enough to
store the metadata for your entire dataset. The data itself never goes to
the NN, but a lot of metadata about the data lives on the NN, so it is
always a good idea to have a strong NN.
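
As a quick sanity check on an existing cluster (just a sketch, assuming you
have shell access to a client node), you can see how many objects the NN is
already tracking and how the datanodes look:

# directory count, file count and bytes under the root of HDFS
hadoop fs -count /
# overall capacity and per-datanode report
hadoop dfsadmin -report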


On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> @Nitin , parallel dfs to write to hdfs is great , but could not understand
> the meaning of capable NN. As I know , the NN would not be a part of the
> actual data write pipeline , means that the data would not travel through
> the NN , the dfs would contact the NN from time to time to get locations of
> DN as where to store the data blocks.
>
> Thanks,
> Rahul
>
>
>
> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> is it safe? .. there is no direct answer yes or no
>>
>> when you say , you have files worth 10TB files and you want to upload  to
>> HDFS, several factors come into picture
>>
>> 1) Is the machine in the same network as your hadoop cluster?
>> 2) If there guarantee that network will not go down?
>>
>> and Most importantly I assume that you have a capable hadoop cluster. By
>> that I mean you have a capable namenode.
>>
>> I would definitely not write files sequentially in HDFS. I would prefer
>> to write files in parallel to hdfs to utilize the DFS write features to
>> speed up the process.
>> you can hdfs put command in parallel manner and in my experience it has
>> not failed when we write a lot of data.
>>
>>
>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com> wrote:
>>
>>> @Nitin Pawar , thanks for clearing my doubts .
>>>
>>> But I have one more question , say I have 10 TB data in the pipeline .
>>>
>>> Is it perfectly OK to use hadopo fs put command to upload these files of
>>> size 10 TB and is there any limit to the file size  using hadoop command
>>> line . Can hadoop put command line work with huge data.
>>>
>>> Thanks in advance
>>>
>>>
>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> first of all .. most of the companies do not get 100 PB of data in one
>>>> go. Its an accumulating process and most of the companies do have a data
>>>> pipeline in place where the data is written to hdfs on a frequency basis
>>>> and  then its retained on hdfs for some duration as per needed and from
>>>> there its sent to archivers or deleted.
>>>>
>>>> For data management products, you can look at falcon which is open
>>>> sourced by inmobi along with hortonworks.
>>>>
>>>> In any case, if you want to write files to hdfs there are few options
>>>> available to you
>>>> 1) Write your dfs client which writes to dfs
>>>> 2) use hdfs proxy
>>>> 3) there is webhdfs
>>>> 4) command line hdfs
>>>> 5) data collection tools come with support to write to hdfs like flume
>>>> etc
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>>> for processing
>>>>> and after processing how they download those files from HDFS to local
>>>>> file system.
>>>>>
>>>>> I don't think they might be using the command line hadoop fs put to
>>>>> upload files as it would take too long or do they divide say 10 parts each
>>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>>
>>>>> Or if they use any tool to upload huge files.
>>>>>
>>>>> Please help me .
>>>>>
>>>>> Thanks
>>>>> thoihen
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Re: Hadoop noob question

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
@Nitin , writing to hdfs with parallel dfs clients is great, but I could not
understand the meaning of a capable NN. As far as I know, the NN is not part
of the actual data write pipeline, meaning that the data does not travel
through the NN; the dfs client only contacts the NN from time to time to get
the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul



On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <ni...@gmail.com>wrote:

> is it safe? .. there is no direct answer yes or no
>
> when you say , you have files worth 10TB files and you want to upload  to
> HDFS, several factors come into picture
>
> 1) Is the machine in the same network as your hadoop cluster?
> 2) If there guarantee that network will not go down?
>
> and Most importantly I assume that you have a capable hadoop cluster. By
> that I mean you have a capable namenode.
>
> I would definitely not write files sequentially in HDFS. I would prefer to
> write files in parallel to hdfs to utilize the DFS write features to speed
> up the process.
> you can hdfs put command in parallel manner and in my experience it has
> not failed when we write a lot of data.
>
>
> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com> wrote:
>
>> @Nitin Pawar , thanks for clearing my doubts .
>>
>> But I have one more question , say I have 10 TB data in the pipeline .
>>
>> Is it perfectly OK to use hadopo fs put command to upload these files of
>> size 10 TB and is there any limit to the file size  using hadoop command
>> line . Can hadoop put command line work with huge data.
>>
>> Thanks in advance
>>
>>
>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> first of all .. most of the companies do not get 100 PB of data in one
>>> go. Its an accumulating process and most of the companies do have a data
>>> pipeline in place where the data is written to hdfs on a frequency basis
>>> and  then its retained on hdfs for some duration as per needed and from
>>> there its sent to archivers or deleted.
>>>
>>> For data management products, you can look at falcon which is open
>>> sourced by inmobi along with hortonworks.
>>>
>>> In any case, if you want to write files to hdfs there are few options
>>> available to you
>>> 1) Write your dfs client which writes to dfs
>>> 2) use hdfs proxy
>>> 3) there is webhdfs
>>> 4) command line hdfs
>>> 5) data collection tools come with support to write to hdfs like flume
>>> etc
>>>
>>>
>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:
>>>
>>>> Hi All,
>>>>
>>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>>> for processing
>>>> and after processing how they download those files from HDFS to local
>>>> file system.
>>>>
>>>> I don't think they might be using the command line hadoop fs put to
>>>> upload files as it would take too long or do they divide say 10 parts each
>>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>>
>>>> Or if they use any tool to upload huge files.
>>>>
>>>> Please help me .
>>>>
>>>> Thanks
>>>> thoihen
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
Is it safe? There is no direct yes or no answer.

When you say you have files worth 10 TB and you want to upload them to
HDFS, several factors come into the picture:

1) Is the machine in the same network as your hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable hadoop cluster. By
that I mean you have a capable namenode.

I would definitely not write files sequentially to HDFS. I would prefer to
write files in parallel to hdfs to utilize the DFS write features and speed
up the process.
You can run the hdfs put command in a parallel manner, and in my experience
it has not failed when we write a lot of data.
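
For example, a minimal sketch of a parallel put (the paths and the level of
parallelism here are just assumptions: the files are assumed to sit under
/data/incoming on the client box, /user/thoihen/incoming is assumed to
already exist in HDFS, and -P 8 should be tuned to what your network can
take):

# run 8 hadoop fs -put commands at a time, one per file
ls /data/incoming | xargs -P 8 -I {} \
  hadoop fs -put /data/incoming/{} /user/thoihen/incoming/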


On Sat, May 11, 2013 at 4:38 PM, maisnam ns <ma...@gmail.com> wrote:

> @Nitin Pawar , thanks for clearing my doubts .
>
> But I have one more question , say I have 10 TB data in the pipeline .
>
> Is it perfectly OK to use hadopo fs put command to upload these files of
> size 10 TB and is there any limit to the file size  using hadoop command
> line . Can hadoop put command line work with huge data.
>
> Thanks in advance
>
>
> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> first of all .. most of the companies do not get 100 PB of data in one
>> go. Its an accumulating process and most of the companies do have a data
>> pipeline in place where the data is written to hdfs on a frequency basis
>> and  then its retained on hdfs for some duration as per needed and from
>> there its sent to archivers or deleted.
>>
>> For data management products, you can look at falcon which is open
>> sourced by inmobi along with hortonworks.
>>
>> In any case, if you want to write files to hdfs there are few options
>> available to you
>> 1) Write your dfs client which writes to dfs
>> 2) use hdfs proxy
>> 3) there is webhdfs
>> 4) command line hdfs
>> 5) data collection tools come with support to write to hdfs like flume etc
>>
>>
>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:
>>
>>> Hi All,
>>>
>>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>>> for processing
>>> and after processing how they download those files from HDFS to local
>>> file system.
>>>
>>> I don't think they might be using the command line hadoop fs put to
>>> upload files as it would take too long or do they divide say 10 parts each
>>> 10 petabytes and  compress and use the command line hadoop fs put
>>>
>>> Or if they use any tool to upload huge files.
>>>
>>> Please help me .
>>>
>>> Thanks
>>> thoihen
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Re: Hadoop noob question

Posted by maisnam ns <ma...@gmail.com>.
@Nitin Pawar , thanks for clearing my doubts.

But I have one more question: say I have 10 TB of data in the pipeline.

Is it perfectly OK to use the hadoop fs put command to upload these files
of size 10 TB, and is there any limit to the file size when using the
hadoop command line? Can the hadoop put command line work with huge data?

Thanks in advance


On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <ni...@gmail.com>wrote:

> first of all .. most of the companies do not get 100 PB of data in one go.
> Its an accumulating process and most of the companies do have a data
> pipeline in place where the data is written to hdfs on a frequency basis
> and  then its retained on hdfs for some duration as per needed and from
> there its sent to archivers or deleted.
>
> For data management products, you can look at falcon which is open sourced
> by inmobi along with hortonworks.
>
> In any case, if you want to write files to hdfs there are few options
> available to you
> 1) Write your dfs client which writes to dfs
> 2) use hdfs proxy
> 3) there is webhdfs
> 4) command line hdfs
> 5) data collection tools come with support to write to hdfs like flume etc
>
>
> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:
>
>> Hi All,
>>
>> Can anyone help me know how does companies like Facebook ,Yahoo etc
>> upload bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster
>> for processing
>> and after processing how they download those files from HDFS to local
>> file system.
>>
>> I don't think they might be using the command line hadoop fs put to
>> upload files as it would take too long or do they divide say 10 parts each
>> 10 petabytes and  compress and use the command line hadoop fs put
>>
>> Or if they use any tool to upload huge files.
>>
>> Please help me .
>>
>> Thanks
>> thoihen
>>
>
>
>
> --
> Nitin Pawar
>

Re: Hadoop noob question

Posted by Nitin Pawar <ni...@gmail.com>.
First of all, most companies do not get 100 PB of data in one go. It's an
accumulating process, and most companies have a data pipeline in place
where the data is written to hdfs on a regular frequency, retained on hdfs
for as long as needed, and from there sent to archival storage or deleted.

For data management products, you can look at Falcon, which was open
sourced by InMobi along with Hortonworks.

In any case, if you want to write files to hdfs there are a few options
available to you:
1) Write your own dfs client which writes to dfs
2) Use the hdfs proxy
3) Use webhdfs (a quick sketch of options 3 and 4 follows below)
4) Use the hdfs command line
5) Use data collection tools that come with support for writing to hdfs,
like Flume etc.
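
A rough illustration of options 3 and 4 (host names and paths here are
placeholders, webhdfs has to be enabled with dfs.webhdfs.enabled, and
50070/50075 are assumed to be the default NN/DN HTTP ports):

# option 4: plain command line
hadoop fs -put /local/path/bigfile.dat /user/thoihen/bigfile.dat

# option 3: webhdfs is a two-step create; the NN answers with a redirect
# (Location header) pointing at a datanode, and the data is then PUT there
curl -i -X PUT \
  "http://namenode.example.com:50070/webhdfs/v1/user/thoihen/bigfile.dat?op=CREATE"
curl -i -X PUT -T /local/path/bigfile.dat \
  "<the exact URL returned in the Location header of the first call>"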


On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <th...@gmail.com>wrote:

> Hi All,
>
> Can anyone help me know how does companies like Facebook ,Yahoo etc upload
> bulk files say to the tune of 100 petabytes to Hadoop HDFS cluster for
> processing
> and after processing how they download those files from HDFS to local file
> system.
>
> I don't think they might be using the command line hadoop fs put to upload
> files as it would take too long or do they divide say 10 parts each 10
> petabytes and  compress and use the command line hadoop fs put
>
> Or if they use any tool to upload huge files.
>
> Please help me .
>
> Thanks
> thoihen
>



-- 
Nitin Pawar
