Posted to common-user@hadoop.apache.org by Austin Chungath <au...@gmail.com> on 2012/05/03 08:11:19 UTC

Best practice to migrate HDFS from 0.20.205 to CDH3u3

Hi,
I am migrating from Apache Hadoop 0.20.205 to CDH3u3.
I don't want to lose the data that is in the HDFS of Apache Hadoop
0.20.205.
How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
What are the best practices/techniques to do this?

Thanks & Regards,
Austin

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
$DuplicationException: Invalid input, there are duplicated files in the
sources: hftp://ub13:50070/tmp/Rtmp1BU9Kb/file6abc6ccb6551/_logs/history,
hftp://ub13:50070/tmp/Rtmp3yCJhu/file1ca96d9331/_logs/history

Any idea what the problem is here?
They are different files; how are they conflicting?

Thanks & Regards
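A minimal sketch, assuming the failing job was launched with several source
directories that each contain the same relative path (_logs/history), which
DistCp would map onto the same destination path and reject as duplicates.
Listing the source tree shows which entries collide, and copying a common
parent instead keeps the relative paths distinct (newnn:8020 is a placeholder
for whatever fs.default.name is on the CDH3 cluster):

$ hadoop fs -lsr /tmp | grep '_logs/history'
$ hadoop distcp -i -log /tmp/distcp_logs hftp://ub13:50070/tmp hdfs://newnn:8020/tmp_copy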

On Tue, May 8, 2012 at 11:52 PM, Adam Faris <af...@linkedin.com> wrote:

> Hi Austin,
>
> I'm glad that helped out.  Regarding the -p flag for distcp, here's the
> online documentation
>
> http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index
>
> You can also get this info from running 'hadoop distcp' without any flags.
> --------
> -p[rbugp]       Preserve
>                       r: replication number
>                       b: block size
>                       u: user
>                       g: group
>                       p: permission
> --------
>
> -- Adam
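A couple of hedged examples of combining those letters (hostnames, paths and
the 8020 port below are placeholders):

$ hadoop distcp -ppgu hftp://src-nn:50070/data hdfs://dst-nn:8020/data     # keep permission, group, user
$ hadoop distcp -prbugp hftp://src-nn:50070/data hdfs://dst-nn:8020/data   # plus replication count and block size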
>
> On May 7, 2012, at 10:55 PM, Austin Chungath wrote:
>
> > Thanks Adam,
> >
> > That was very helpful. Your second point solved my problems :-)
> > The hdfs port number was wrong.
> > I didn't use the option -ppgu; what does it do?
> >
> >
> >
> > On Mon, May 7, 2012 at 8:07 PM, Adam Faris <af...@linkedin.com> wrote:
> >
> >> Hi Austin,
> >>
> >> I don't know about using CDH3, but we use distcp for moving data between
> >> different versions of apache grids and several things come to mind.
> >>
> >> 1) you should use the -i flag to ignore checksum differences on the
> >> blocks.  I'm not 100% sure, but I want to say hftp doesn't support checksums
> >> on the blocks as they go across the wire.
> >>
> >> 2) you should read from hftp but write to hdfs.  Also make sure to check
> >> your port numbers.  For example I can read from hftp on port 50070 and
> >> write to hdfs on port 9000.  You'll find the hftp port in hdfs-site.xml and
> >> the hdfs port in core-site.xml on Apache releases (see the config sketch
> >> after this message).
> >>
> >> 3) Do you have security (kerberos) enabled on 0.20.205? Does CDH3
> support
> >> security?  If security is enabled on 0.20.205 and CDH3 does not support
> >> security, you will need to disable security on 0.20.205.  This is
> because
> >> you are unable to write from a secure to unsecured grid.
> >>
> >> 4) use the -m flag to limit your mappers so you don't DDOS your network
> >> backbone.
> >>
> >> 5) why isn't your vendor helping you with the data migration? :)
> >>
> >> Otherwise something like this should get you going.
> >>
> >> hadoop distcp -i -ppgu -log /tmp/mylog -m 20 \
> >>   hftp://mynamenode.grid.one:50070/path/to/my/src/data \
> >>   hdfs://mynamenode.grid.two:9000/path/to/my/dst
> >>
> >> -- Adam
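For reference, a sketch of where those two ports live on an Apache 0.20.x
release, assuming the default property names (hostnames are placeholders):

hdfs-site.xml, the HTTP address that hftp:// URIs use:

  <property>
    <name>dfs.http.address</name>
    <value>mynamenode.grid.one:50070</value>
  </property>

core-site.xml, the NameNode RPC address that hdfs:// URIs use:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://mynamenode.grid.two:9000</value>
  </property>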
> >>
> >> On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:
> >>
> >>> things to check
> >>>
> >>> 1) when you launch distcp jobs, all the datanodes of the older hdfs are
> >>> live and connected
> >>> 2) when you launch distcp, no data is being written/moved/deleted in hdfs
> >>> 3) you can use the option -log to log errors into a directory and use -i to
> >>> ignore errors
> >>>
> >>> also you can try using distcp with the hdfs protocol instead of hftp ... for
> >>> more you can refer to
> >>>
> >>
> https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
> >>>
> >>>
> >>>
> >>> If it failed, there should be some error.
> >>> On Mon, May 7, 2012 at 4:44 PM, Austin Chungath <au...@gmail.com>
> >> wrote:
> >>>
> >>>> ok that was a lame mistake.
> >>>> $ hadoop distcp hftp://localhost:50070/tmp
> >> hftp://localhost:60070/tmp_copy
> >>>> I had spelled hdfs instead of "hftp"
> >>>>
> >>>> $ hadoop distcp hftp://localhost:50070/docs/index.html
> >>>> hftp://localhost:60070/user/hadoop
> >>>> 12/05/07 16:38:09 INFO tools.DistCp:
> >>>> srcPaths=[hftp://localhost:50070/docs/index.html]
> >>>> 12/05/07 16:38:09 INFO tools.DistCp:
> >>>> destPath=hftp://localhost:60070/user/hadoop
> >>>> With failures, global counters are inaccurate; consider running with
> -i
> >>>> Copy failed: java.io.IOException: Not supported
> >>>> at
> org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
> >>>> at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
> >>>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
> >>>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >>>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
> >>>>
> >>>> Any idea why this error is coming?
> >>>> I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
> >>>> (/user/hadoop)
> >>>>
> >>>> Thanks & Regards,
> >>>> Austin
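The "Not supported" above is thrown by HftpFileSystem.delete: hftp is a
read-only filesystem, so it can serve as a distcp source but not as the
destination. A minimal sketch of the same copy with an hdfs:// destination,
assuming the CDH3 NameNode's RPC address is localhost:8020 (use whatever
fs.default.name is set to on the CDH3 side):

$ hadoop distcp -i hftp://localhost:50070/docs/index.html hdfs://localhost:8020/user/hadoop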
> >>>>
> >>>> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <au...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> So I decided to try and move using distcp.
> >>>>>
> >>>>> $ hadoop distcp hdfs://localhost:54310/tmp
> >> hdfs://localhost:8021/tmp_copy
> >>>>> 12/05/07 14:57:38 INFO tools.DistCp:
> >>>> srcPaths=[hdfs://localhost:54310/tmp]
> >>>>> 12/05/07 14:57:38 INFO tools.DistCp:
> >>>>> destPath=hdfs://localhost:8021/tmp_copy
> >>>>> With failures, global counters are inaccurate; consider running with
> -i
> >>>>> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
> >>>>> org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
> >> (client
> >>>> =
> >>>>> 63, server = 61)
> >>>>>
> >>>>> I found that we can do distcp like above only if both are of the same
> >>>>> hadoop version.
> >>>>> so I tried:
> >>>>>
> >>>>> $ hadoop distcp hftp://localhost:50070/tmp
> >>>> hdfs://localhost:60070/tmp_copy
> >>>>> 12/05/07 15:02:44 INFO tools.DistCp:
> >>>> srcPaths=[hftp://localhost:50070/tmp]
> >>>>> 12/05/07 15:02:44 INFO tools.DistCp:
> >>>>> destPath=hdfs://localhost:60070/tmp_copy
> >>>>>
> >>>>> But this process seemed to hang at this stage. What might I be doing
> >>>>> wrong?
> >>>>>
> >>>>> hftp://<dfs.http.address>/<path>
> >>>>> hftp://localhost:50070 is dfs.http.address of 0.20.205
> >>>>> hdfs://localhost:60070 is dfs.http.address of cdh3u3
> >>>>>
> >>>>> Thanks and regards,
> >>>>> Austin
> >>>>>
> >>>>>
> >>>>> On Fri, May 4, 2012 at 4:30 AM, Michel Segel <
> >> michael_segel@hotmail.com
> >>>>> wrote:
> >>>>>
> >>>>>> Ok... So riddle me this...
> >>>>>> I currently have a replication factor of 3.
> >>>>>> I reset it to two.
> >>>>>>
> >>>>>> What do you have to do to get the replication factor of 3 down to 2?
> >>>>>> Do I just try to rebalance the nodes?
> >>>>>>
> >>>>>> The point is that you are looking at a very small cluster.
> >>>>>> You may want to start the new cluster with a replication factor of 2
> >>>>>> and then, when the data is moved over, increase it to a factor of 3. Or
> >>>>>> maybe not.
> >>>>>>
> >>>>>> I do a distcp to copy the data, and after each distcp I do an fsck for
> >>>>>> a sanity check and then remove the files I copied. As I gain more room,
> >>>>>> I can then slowly drop nodes, do an fsck, rebalance and then repeat.
> >>>>>>
> >>>>>> Even though this is a dev cluster, the OP wants to retain the data.
> >>>>>>
> >>>>>> There are other options depending on the amount and size of the new
> >>>>>> hardware. I mean, make one machine a RAID 5 machine and copy data to it,
> >>>>>> clearing it off the cluster.
> >>>>>>
> >>>>>> If 8 TB was the amount of raw disk used, that would be about 2.67 TB of
> >>>>>> actual data at a replication factor of 3. Let's say 3 TB. Going RAID 5,
> >>>>>> how much disk is that?  So you could fit it on one machine, depending on
> >>>>>> hardware, or maybe 2 machines...  Now you can rebuild the initial cluster
> >>>>>> and then move the data back. Then rebuild those machines. Lots of
> >>>>>> options... ;-)
> >>>>>>
> >>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>>
> >>>>>> Mike Segel
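To make the arithmetic above concrete: RAID 5 across n disks gives roughly
(n-1)/n of the raw capacity, so under the assumption of four 1 TB disks per
machine, one RAID 5 set yields about 3 TB usable, enough to hold the ~2.67 TB
of unreplicated data from the example on a single machine.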
> >>>>>>
> >>>>>> On May 3, 2012, at 11:26 AM, Suresh Srinivas <
> suresh@hortonworks.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> This probably is a more relevant question in CDH mailing lists.
> That
> >>>>>> said,
> >>>>>>> what Edward is suggesting seems reasonable. Reduce replication
> >> factor,
> >>>>>>> decommission some of the nodes and create a new cluster with those
> >>>> nodes
> >>>>>>> and do distcp.
> >>>>>>>
> >>>>>>> Could you share with us the reasons you want to migrate from Apache
> >>>> 205?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Suresh
> >>>>>>>
> >>>>>>> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <
> >>>> edlinuxguru@gmail.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Honestly that is a hassle; going from 205 to cdh3u3 is probably more
> >>>>>>>> of a cross-grade than an upgrade or downgrade. I would just stick it
> >>>>>>>> out. But yes, like Michael said, two clusters on the same gear and
> >>>>>>>> distcp. If you are using RF=3 you could also lower your replication to
> >>>>>>>> rf=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving
> >>>>>>>> stuff.
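A minimal sketch of that replication change, run against the existing data to
free up headroom before the copy (the path is a placeholder; -R recurses and
-w waits for the re-replication to complete):

$ hadoop dfs -setrep -R -w 2 /user/data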
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <
> >>>>>> michael_segel@hotmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Ok... When you get your new hardware...
> >>>>>>>>>
> >>>>>>>>> Set up one server as your new NN, JT, SN.
> >>>>>>>>> Set up the others as a DN.
> >>>>>>>>> (Cloudera CDH3u3)
> >>>>>>>>>
> >>>>>>>>> On your existing cluster...
> >>>>>>>>> Remove your old log files, temp files on HDFS anything you don't
> >>>> need.
> >>>>>>>>> This should give you some more space.
> >>>>>>>>> Start copying some of the directories/files to the new cluster.
> >>>>>>>>> As you gain space, decommission a node, rebalance, add node to
> new
> >>>>>>>> cluster...
> >>>>>>>>>
> >>>>>>>>> It's a slow process.
> >>>>>>>>>
> >>>>>>>>> Should I remind you to make sure you up your bandwidth setting, and
> >>>>>>>>> to clean up the HDFS directories when you repurpose the nodes?
> >>>>>>>>>
> >>>>>>>>> Does this make sense?
> >>>>>>>>>
> >>>>>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>>>>>
> >>>>>>>>> Mike Segel
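A sketch of the decommission/rebalance step and of the bandwidth setting
mentioned above, assuming the 0.20-era property name and that the setting
meant is the balancer/decommission throttle (hostnames, paths and the value
are examples). Add the datanode being drained to the file referenced by
dfs.hosts.exclude, then:

$ hadoop dfsadmin -refreshNodes    # start decommissioning the excluded datanode
$ hadoop balancer                  # spread blocks across the remaining nodes

The throttle itself lives in hdfs-site.xml:

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>104857600</value>
  </property>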
> >>>>>>>>>
> >>>>>>>>> On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Yeah I know :-)
> >>>>>>>>>> and this is not a production cluster ;-) and yes there is more
> >>>>>> hardware
> >>>>>>>>>> coming :-)
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <
> >>>>>> michael_segel@hotmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Well, you've kind of painted yourself in to a corner...
> >>>>>>>>>>> Not sure why you didn't get a response from the Cloudera lists,
> >>>> but
> >>>>>>>> it's a
> >>>>>>>>>>> generic question...
> >>>>>>>>>>>
> >>>>>>>>>>> 8 out of 10 TB. Are you talking effective storage or actual
> >> disks?
> >>>>>>>>>>> And please tell me you've already ordered more hardware..
> Right?
> >>>>>>>>>>>
> >>>>>>>>>>> And please tell me this isn't your production cluster...
> >>>>>>>>>>>
> >>>>>>>>>>> (Strong hint to Strata and Cloudera... You really want to accept my
> >>>>>>>>>>> upcoming proposal talk... ;-)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>>>>>>>
> >>>>>>>>>>> Mike Segel
> >>>>>>>>>>>
> >>>>>>>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <
> austincv@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Yes. This was first posted on the cloudera mailing list. There
> >>>>>> were no
> >>>>>>>>>>>> responses.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But this is not related to cloudera as such.
> >>>>>>>>>>>>
> >>>>>>>>>>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in
> >>>>>> apache
> >>>>>>>>>>>> hadoop 0.20.205
> >>>>>>>>>>>>
> >>>>>>>>>>>> There is an upgrade namenode option when we are migrating to a
> >>>>>> higher
> >>>>>>>>>>>> version say from 0.20 to 0.20.205
> >>>>>>>>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
> >>>>>>>>>>>> Is this possible?
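For context, the upgrade option mentioned above is the NameNode metadata
upgrade used when moving to a newer release (which, as noted, is not the
direction being attempted here). A minimal sketch of how that path is
normally run:

$ hadoop namenode -upgrade           # start the NameNode and convert the on-disk metadata
$ hadoop dfsadmin -finalizeUpgrade   # once satisfied, make the upgrade permanent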
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
> >>>>>>>> prash1784@gmail.com
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so
> >>>> would
> >>>>>> not
> >>>>>>>>>>> know
> >>>>>>>>>>>>> much, but you might find some help moving this to Cloudera
> >>>> mailing
> >>>>>>>> list.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <
> >>>>>> austincv@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> There is only one cluster. I am not copying between
> clusters.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB
> >> storage
> >>>>>>>>>>> capacity
> >>>>>>>>>>>>>> and has about 8 TB of data.
> >>>>>>>>>>>>>> Now how can I migrate the same cluster to use cdh3 and use
> >> that
> >>>>>>>> same 8
> >>>>>>>>>>> TB
> >>>>>>>>>>>>>> of data.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2
> >> TB
> >>>>>> of
> >>>>>>>> free
> >>>>>>>>>>>>>> space
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
> >>>>>>>> nitinpawar432@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> you can actually look at the distcp
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> but this means that you have two different clusters available
> >>>>>>>>>>>>>>> to do the migration
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
> >>>>>>>> austincv@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the suggestions,
> >>>>>>>>>>>>>>>> My concerns are that I can't actually copyToLocal from the
> >>>> dfs
> >>>>>>>>>>>>> because
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> data is huge.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I
> >>>> can
> >>>>>> do
> >>>>>>>> a
> >>>>>>>>>>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to
> >> use
> >>>>>> CDH3
> >>>>>>>>>>>>> now,
> >>>>>>>>>>>>>>>> which is based on 0.20
> >>>>>>>>>>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info
> >>>> has
> >>>>>> to
> >>>>>>>> be
> >>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>> by 0.20's namenode.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Any idea how I can achieve what I am trying to do?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
> >>>>>>>>>>>>> nitinpawar432@gmail.com
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> i can think of following options
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) write a simple get and put code which gets the data
> from
> >>>>>> DFS
> >>>>>>>> and
> >>>>>>>>>>>>>>> loads
> >>>>>>>>>>>>>>>>> it in dfs
> >>>>>>>>>>>>>>>>> 2) see if distcp between both versions is compatible
> >>>>>>>>>>>>>>>>> 3) this is what I had done (and my data was hardly a few
> >>>>>>>>>>>>>>>>> hundred GB) .. did a dfs -copyToLocal and then in the new grid
> >>>>>>>>>>>>>>>>> did a copyFromLocal
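A minimal sketch of option 3, assuming enough local or NFS disk to stage the
data (paths are placeholders):

$ hadoop dfs -copyToLocal /user/data /mnt/staging/data     # run against the old grid
$ hadoop dfs -copyFromLocal /mnt/staging/data /user/data   # run against the new grid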
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
> >>>>>>>>>>>>> austincv@gmail.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> >>>>>>>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of
> >> Apache
> >>>>>>>>>>>>> hadoop
> >>>>>>>>>>>>>>>>>> 0.20.205.
> >>>>>>>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have
> >> on
> >>>>>>>>>>>>>> 0.20.205.
> >>>>>>>>>>>>>>>>>> What is the best practice/ techniques to do this?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks & Regards,
> >>>>>>>>>>>>>>>>>> Austin
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>> Nitin Pawar
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Nitin Pawar
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Nitin Pawar
> >>
> >>
>
>


Posted by Nitin Pawar <ni...@gmail.com>.
things to check

1) when you launch distcp jobs all the datanodes of the older hdfs are live and
connected
2) when you launch distcp no data is being written/moved/deleted in hdfs
3) you can use the -log option to log errors into a directory and use -i to
ignore errors

also you can try using distcp with the hdfs protocol instead of hftp ... for
more you can refer to
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd



if it failed there should be some error
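
For reference, a minimal sketch of how the -log and -i options from the
list above fit into a distcp invocation; the host names, ports and paths
here are placeholders, not the actual setup in this thread:

$ hadoop distcp -i -log hdfs://new-namenode:8020/tmp/distcp_logs \
    hftp://old-namenode:50070/data \
    hdfs://new-namenode:8020/data
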
On Mon, May 7, 2012 at 4:44 PM, Austin Chungath <au...@gmail.com> wrote:

> ok that was a lame mistake.
> $ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
> I had spelled hdfs instead of "hftp"
>
> $ hadoop distcp hftp://localhost:50070/docs/index.html
> hftp://localhost:60070/user/hadoop
> 12/05/07 16:38:09 INFO tools.DistCp:
> srcPaths=[hftp://localhost:50070/docs/index.html]
> 12/05/07 16:38:09 INFO tools.DistCp:
> destPath=hftp://localhost:60070/user/hadoop
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: java.io.IOException: Not supported
> at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
> at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
>
> Any idea why this error is coming?
> I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
> (/user/hadoop)
>
> Thanks & Regards,
> Austin
>
> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <au...@gmail.com>
> wrote:
>
> > Thanks,
> >
> > So I decided to try and move using distcp.
> >
> > $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
> > 12/05/07 14:57:38 INFO tools.DistCp:
> srcPaths=[hdfs://localhost:54310/tmp]
> > 12/05/07 14:57:38 INFO tools.DistCp:
> > destPath=hdfs://localhost:8021/tmp_copy
> > With failures, global counters are inaccurate; consider running with -i
> > Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
> > org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client
> =
> > 63, server = 61)
> >
> > I found that we can do distcp like above only if both are of the same
> > hadoop version.
> > so I tried:
> >
> > $ hadoop distcp hftp://localhost:50070/tmp
> hdfs://localhost:60070/tmp_copy
> > 12/05/07 15:02:44 INFO tools.DistCp:
> srcPaths=[hftp://localhost:50070/tmp]
> > 12/05/07 15:02:44 INFO tools.DistCp:
> > destPath=hdfs://localhost:60070/tmp_copy
> >
> > But this process seemed to be hangs at this stage. What might I be doing
> > wrong?
> >
> > hftp://<dfs.http.address>/<path>
> > hftp://localhost:50070 is dfs.http.address of 0.20.205
> > hdfs://localhost:60070 is dfs.http.address of cdh3u3
> >
> > Thanks and regards,
> > Austin
> >
> >
> > On Fri, May 4, 2012 at 4:30 AM, Michel Segel <michael_segel@hotmail.com
> >wrote:
> >
> >> Ok... So riddle me this...
> >> I currently have a replication factor of 3.
> >> I reset it to two.
> >>
> >> What do you have to do to get the replication factor of 3 down to 2?
> >> Do I just try to rebalance the nodes?
> >>
> >> The point is that you are looking at a very small cluster.
> >> You may want to start the be cluster with a replication factor of 2 and
> >> then when the data is moved over, increase it to a factor of 3. Or maybe
> >> not.
> >>
> >> I do a distcp to. Copy the data and after each distcp, I do an fsck for
> a
> >> sanity check and then remove the files I copied. As I gain more room, I
> can
> >> then slowly drop nodes, do an fsck, rebalance and then repeat.
> >>
> >> Even though this us a dev cluster, the OP wants to retain the data.
> >>
> >> There are other options depending on the amount and size of new
> hardware.
> >> I mean make one machine a RAID 5 machine, copy data to it clearing off
> >> the cluster.
> >>
> >> If 8TB was the amount of disk used, that would be 2.6666 TB used.
> >> Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it
> >> on one machine, depending on hardware, or maybe 2 machines...  Now you
> can
> >> rebuild initial cluster and then move data back. Then rebuild those
> >> machines. Lots of options... ;-)
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 3, 2012, at 11:26 AM, Suresh Srinivas <su...@hortonworks.com>
> >> wrote:
> >>
> >> > This probably is a more relevant question in CDH mailing lists. That
> >> said,
> >> > what Edward is suggesting seems reasonable. Reduce replication factor,
> >> > decommission some of the nodes and create a new cluster with those
> nodes
> >> > and do distcp.
> >> >
> >> > Could you share with us the reasons you want to migrate from Apache
> 205?
> >> >
> >> > Regards,
> >> > Suresh
> >> >
> >> > On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <
> edlinuxguru@gmail.com
> >> >wrote:
> >> >
> >> >> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
> >> >> or a cross-grade then an upgrade or downgrade. I would just stick it
> >> >> out. But yes like Michael said two clusters on the same gear and
> >> >> distcp. If you are using RF=3 you could also lower your replication
> to
> >> >> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
> >> >> stuff.
> >> >>
> >> >>
> >> >> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <
> >> michael_segel@hotmail.com>
> >> >> wrote:
> >> >>> Ok... When you get your new hardware...
> >> >>>
> >> >>> Set up one server as your new NN, JT, SN.
> >> >>> Set up the others as a DN.
> >> >>> (Cloudera CDH3u3)
> >> >>>
> >> >>> On your existing cluster...
> >> >>> Remove your old log files, temp files on HDFS anything you don't
> need.
> >> >>> This should give you some more space.
> >> >>> Start copying some of the directories/files to the new cluster.
> >> >>> As you gain space, decommission a node, rebalance, add node to new
> >> >> cluster...
> >> >>>
> >> >>> It's a slow process.
> >> >>>
> >> >>> Should I remind you to make sure you up you bandwidth setting, and
> to
> >> >> clean up the hdfs directories when you repurpose the nodes?
> >> >>>
> >> >>> Does this make sense?
> >> >>>
> >> >>> Sent from a remote device. Please excuse any typos...
> >> >>>
> >> >>> Mike Segel
> >> >>>
> >> >>> On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com>
> >> wrote:
> >> >>>
> >> >>>> Yeah I know :-)
> >> >>>> and this is not a production cluster ;-) and yes there is more
> >> hardware
> >> >>>> coming :-)
> >> >>>>
> >> >>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <
> >> michael_segel@hotmail.com
> >> >>> wrote:
> >> >>>>
> >> >>>>> Well, you've kind of painted yourself in to a corner...
> >> >>>>> Not sure why you didn't get a response from the Cloudera lists,
> but
> >> >> it's a
> >> >>>>> generic question...
> >> >>>>>
> >> >>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
> >> >>>>> And please tell me you've already ordered more hardware.. Right?
> >> >>>>>
> >> >>>>> And please tell me this isn't your production cluster...
> >> >>>>>
> >> >>>>> (Strong hint to Strata and Cloudea... You really want to accept my
> >> >>>>> upcoming proposal talk... ;-)
> >> >>>>>
> >> >>>>>
> >> >>>>> Sent from a remote device. Please excuse any typos...
> >> >>>>>
> >> >>>>> Mike Segel
> >> >>>>>
> >> >>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <au...@gmail.com>
> >> >> wrote:
> >> >>>>>
> >> >>>>>> Yes. This was first posted on the cloudera mailing list. There
> >> were no
> >> >>>>>> responses.
> >> >>>>>>
> >> >>>>>> But this is not related to cloudera as such.
> >> >>>>>>
> >> >>>>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in
> >> apache
> >> >>>>>> hadoop 0.20.205
> >> >>>>>>
> >> >>>>>> There is an upgrade namenode option when we are migrating to a
> >> higher
> >> >>>>>> version say from 0.20 to 0.20.205
> >> >>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
> >> >>>>>> Is this possible?
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
> >> >> prash1784@gmail.com
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so
> would
> >> not
> >> >>>>> know
> >> >>>>>>> much, but you might find some help moving this to Cloudera
> mailing
> >> >> list.
> >> >>>>>>>
> >> >>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <
> >> austincv@gmail.com>
> >> >>>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>>> There is only one cluster. I am not copying between clusters.
> >> >>>>>>>>
> >> >>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage
> >> >>>>> capacity
> >> >>>>>>>> and has about 8 TB of data.
> >> >>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that
> >> >> same 8
> >> >>>>> TB
> >> >>>>>>>> of data.
> >> >>>>>>>>
> >> >>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB
> >> of
> >> >> free
> >> >>>>>>>> space
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
> >> >> nitinpawar432@gmail.com>
> >> >>>>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> you can actually look at the distcp
> >> >>>>>>>>>
> >> >>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
> >> >>>>>>>>>
> >> >>>>>>>>> but this means that you have two different set of clusters
> >> >> available
> >> >>>>> to
> >> >>>>>>>> do
> >> >>>>>>>>> the migration
> >> >>>>>>>>>
> >> >>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
> >> >> austincv@gmail.com>
> >> >>>>>>>>> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Thanks for the suggestions,
> >> >>>>>>>>>> My concerns are that I can't actually copyToLocal from the
> dfs
> >> >>>>>>> because
> >> >>>>>>>>> the
> >> >>>>>>>>>> data is huge.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I
> can
> >> do
> >> >> a
> >> >>>>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
> >> >>>>>>>>>>
> >> >>>>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to use
> >> CDH3
> >> >>>>>>> now,
> >> >>>>>>>>>> which is based on 0.20
> >> >>>>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info
> has
> >> to
> >> >> be
> >> >>>>>>>> used
> >> >>>>>>>>>> by 0.20's namenode.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Any idea how I can achieve what I am trying to do?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks.
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
> >> >>>>>>> nitinpawar432@gmail.com
> >> >>>>>>>>>>> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> i can think of following options
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 1) write a simple get and put code which gets the data from
> >> DFS
> >> >> and
> >> >>>>>>>>> loads
> >> >>>>>>>>>>> it in dfs
> >> >>>>>>>>>>> 2) see if the distcp  between both versions are compatible
> >> >>>>>>>>>>> 3) this is what I had done (and my data was hardly few
> hundred
> >> >> GB)
> >> >>>>>>> ..
> >> >>>>>>>>>> did a
> >> >>>>>>>>>>> dfs -copyToLocal and then in the new grid did a
> copyFromLocal
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
> >> >>>>>>> austincv@gmail.com
> >> >>>>>>>>>
> >> >>>>>>>>>>> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> Hi,
> >> >>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> >> >>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache
> >> >>>>>>> hadoop
> >> >>>>>>>>>>>> 0.20.205.
> >> >>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on
> >> >>>>>>>> 0.20.205.
> >> >>>>>>>>>>>> What is the best practice/ techniques to do this?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Thanks & Regards,
> >> >>>>>>>>>>>> Austin
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Nitin Pawar
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Nitin Pawar
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>
> >> >>
> >>
> >
> >
>



-- 
Nitin Pawar

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
ok that was a lame mistake.
$ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
I had spelled hdfs instead of "hftp"

$ hadoop distcp hftp://localhost:50070/docs/index.html
hftp://localhost:60070/user/hadoop
12/05/07 16:38:09 INFO tools.DistCp:
srcPaths=[hftp://localhost:50070/docs/index.html]
12/05/07 16:38:09 INFO tools.DistCp:
destPath=hftp://localhost:60070/user/hadoop
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Not supported
at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Any idea why this error is coming?
I am copying one file from 0.20.205 (/docs/index.html ) to cdh3u3
(/user/hadoop)

Thanks & Regards,
Austin
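
The "Not supported" above comes out of HftpFileSystem.delete(): hftp is a
read-only filesystem, so it can be a distcp source but not a destination.
A hedged sketch of the direction that works, run from the CDH3 side; 8020
is only an assumed namenode RPC port, not confirmed in this thread:

$ hadoop distcp -i hftp://old-namenode:50070/docs/index.html \
    hdfs://new-namenode:8020/user/hadoop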

On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <au...@gmail.com> wrote:

> Thanks,
>
> So I decided to try and move using distcp.
>
> $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
> 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
> 12/05/07 14:57:38 INFO tools.DistCp:
> destPath=hdfs://localhost:8021/tmp_copy
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
> org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client =
> 63, server = 61)
>
> I found that we can do distcp like above only if both are of the same
> hadoop version.
> so I tried:
>
> $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
> 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
> 12/05/07 15:02:44 INFO tools.DistCp:
> destPath=hdfs://localhost:60070/tmp_copy
>
> But this process seemed to be hangs at this stage. What might I be doing
> wrong?
>
> hftp://<dfs.http.address>/<path>
> hftp://localhost:50070 is dfs.http.address of 0.20.205
> hdfs://localhost:60070 is dfs.http.address of cdh3u3
>
> Thanks and regards,
> Austin
>
>
> On Fri, May 4, 2012 at 4:30 AM, Michel Segel <mi...@hotmail.com>wrote:
>
>> Ok... So riddle me this...
>> I currently have a replication factor of 3.
>> I reset it to two.
>>
>> What do you have to do to get the replication factor of 3 down to 2?
>> Do I just try to rebalance the nodes?
>>
>> The point is that you are looking at a very small cluster.
>> You may want to start the be cluster with a replication factor of 2 and
>> then when the data is moved over, increase it to a factor of 3. Or maybe
>> not.
>>
>> I do a distcp to. Copy the data and after each distcp, I do an fsck for a
>> sanity check and then remove the files I copied. As I gain more room, I can
>> then slowly drop nodes, do an fsck, rebalance and then repeat.
>>
>> Even though this us a dev cluster, the OP wants to retain the data.
>>
>> There are other options depending on the amount and size of new hardware.
>> I mean make one machine a RAID 5 machine, copy data to it clearing off
>> the cluster.
>>
>> If 8TB was the amount of disk used, that would be 2.6666 TB used.
>> Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it
>> on one machine, depending on hardware, or maybe 2 machines...  Now you can
>> rebuild initial cluster and then move data back. Then rebuild those
>> machines. Lots of options... ;-)
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 3, 2012, at 11:26 AM, Suresh Srinivas <su...@hortonworks.com>
>> wrote:
>>
>> > This probably is a more relevant question in CDH mailing lists. That
>> said,
>> > what Edward is suggesting seems reasonable. Reduce replication factor,
>> > decommission some of the nodes and create a new cluster with those nodes
>> > and do distcp.
>> >
>> > Could you share with us the reasons you want to migrate from Apache 205?
>> >
>> > Regards,
>> > Suresh
>> >
>> > On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxguru@gmail.com
>> >wrote:
>> >
>> >> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
>> >> or a cross-grade then an upgrade or downgrade. I would just stick it
>> >> out. But yes like Michael said two clusters on the same gear and
>> >> distcp. If you are using RF=3 you could also lower your replication to
>> >> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
>> >> stuff.
>> >>
>> >>
>> >> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <
>> michael_segel@hotmail.com>
>> >> wrote:
>> >>> Ok... When you get your new hardware...
>> >>>
>> >>> Set up one server as your new NN, JT, SN.
>> >>> Set up the others as a DN.
>> >>> (Cloudera CDH3u3)
>> >>>
>> >>> On your existing cluster...
>> >>> Remove your old log files, temp files on HDFS anything you don't need.
>> >>> This should give you some more space.
>> >>> Start copying some of the directories/files to the new cluster.
>> >>> As you gain space, decommission a node, rebalance, add node to new
>> >> cluster...
>> >>>
>> >>> It's a slow process.
>> >>>
>> >>> Should I remind you to make sure you up you bandwidth setting, and to
>> >> clean up the hdfs directories when you repurpose the nodes?
>> >>>
>> >>> Does this make sense?
>> >>>
>> >>> Sent from a remote device. Please excuse any typos...
>> >>>
>> >>> Mike Segel
>> >>>
>> >>> On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com>
>> wrote:
>> >>>
>> >>>> Yeah I know :-)
>> >>>> and this is not a production cluster ;-) and yes there is more
>> hardware
>> >>>> coming :-)
>> >>>>
>> >>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <
>> michael_segel@hotmail.com
>> >>> wrote:
>> >>>>
>> >>>>> Well, you've kind of painted yourself in to a corner...
>> >>>>> Not sure why you didn't get a response from the Cloudera lists, but
>> >> it's a
>> >>>>> generic question...
>> >>>>>
>> >>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>> >>>>> And please tell me you've already ordered more hardware.. Right?
>> >>>>>
>> >>>>> And please tell me this isn't your production cluster...
>> >>>>>
>> >>>>> (Strong hint to Strata and Cloudea... You really want to accept my
>> >>>>> upcoming proposal talk... ;-)
>> >>>>>
>> >>>>>
>> >>>>> Sent from a remote device. Please excuse any typos...
>> >>>>>
>> >>>>> Mike Segel
>> >>>>>
>> >>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <au...@gmail.com>
>> >> wrote:
>> >>>>>
>> >>>>>> Yes. This was first posted on the cloudera mailing list. There
>> were no
>> >>>>>> responses.
>> >>>>>>
>> >>>>>> But this is not related to cloudera as such.
>> >>>>>>
>> >>>>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in
>> apache
>> >>>>>> hadoop 0.20.205
>> >>>>>>
>> >>>>>> There is an upgrade namenode option when we are migrating to a
>> higher
>> >>>>>> version say from 0.20 to 0.20.205
>> >>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
>> >>>>>> Is this possible?
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
>> >> prash1784@gmail.com
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so would
>> not
>> >>>>> know
>> >>>>>>> much, but you might find some help moving this to Cloudera mailing
>> >> list.
>> >>>>>>>
>> >>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <
>> austincv@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> There is only one cluster. I am not copying between clusters.
>> >>>>>>>>
>> >>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage
>> >>>>> capacity
>> >>>>>>>> and has about 8 TB of data.
>> >>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that
>> >> same 8
>> >>>>> TB
>> >>>>>>>> of data.
>> >>>>>>>>
>> >>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB
>> of
>> >> free
>> >>>>>>>> space
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
>> >> nitinpawar432@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> you can actually look at the distcp
>> >>>>>>>>>
>> >>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>> >>>>>>>>>
>> >>>>>>>>> but this means that you have two different set of clusters
>> >> available
>> >>>>> to
>> >>>>>>>> do
>> >>>>>>>>> the migration
>> >>>>>>>>>
>> >>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
>> >> austincv@gmail.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks for the suggestions,
>> >>>>>>>>>> My concerns are that I can't actually copyToLocal from the dfs
>> >>>>>>> because
>> >>>>>>>>> the
>> >>>>>>>>>> data is huge.
>> >>>>>>>>>>
>> >>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can
>> do
>> >> a
>> >>>>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
>> >>>>>>>>>>
>> >>>>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to use
>> CDH3
>> >>>>>>> now,
>> >>>>>>>>>> which is based on 0.20
>> >>>>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info has
>> to
>> >> be
>> >>>>>>>> used
>> >>>>>>>>>> by 0.20's namenode.
>> >>>>>>>>>>
>> >>>>>>>>>> Any idea how I can achieve what I am trying to do?
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks.
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
>> >>>>>>> nitinpawar432@gmail.com
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> i can think of following options
>> >>>>>>>>>>>
>> >>>>>>>>>>> 1) write a simple get and put code which gets the data from
>> DFS
>> >> and
>> >>>>>>>>> loads
>> >>>>>>>>>>> it in dfs
>> >>>>>>>>>>> 2) see if the distcp  between both versions are compatible
>> >>>>>>>>>>> 3) this is what I had done (and my data was hardly few hundred
>> >> GB)
>> >>>>>>> ..
>> >>>>>>>>>> did a
>> >>>>>>>>>>> dfs -copyToLocal and then in the new grid did a copyFromLocal
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
>> >>>>>>> austincv@gmail.com
>> >>>>>>>>>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi,
>> >>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>> >>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache
>> >>>>>>> hadoop
>> >>>>>>>>>>>> 0.20.205.
>> >>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on
>> >>>>>>>> 0.20.205.
>> >>>>>>>>>>>> What is the best practice/ techniques to do this?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks & Regards,
>> >>>>>>>>>>>> Austin
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Nitin Pawar
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Nitin Pawar
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>>
>
>

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
Thanks,

So I decided to try and move using distcp.

$ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client =
63, server = 61)

I found that we can do distcp like above only if both are of the same
hadoop version.
so I tried:

$ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
12/05/07 15:02:44 INFO tools.DistCp:
destPath=hdfs://localhost:60070/tmp_copy

But this process seemed to hang at this stage. What might I be doing
wrong?

hftp://<dfs.http.address>/<path>
hftp://localhost:50070 is dfs.http.address of 0.20.205
hdfs://localhost:60070 is dfs.http.address of cdh3u3

Thanks and regards,
Austin
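
For what it's worth, a sketch of the scheme/port pairing that usually
explains a hang at this point: hftp:// goes with the HTTP port from
dfs.http.address (and is read-only), while hdfs:// goes with the RPC port
from fs.default.name, not the HTTP port. The values below are placeholders:

hftp://<namenode-host>:<dfs.http.address port>/<path>    (source, read-only)
hdfs://<namenode-host>:<fs.default.name port>/<path>     (destination)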


On Fri, May 4, 2012 at 4:30 AM, Michel Segel <mi...@hotmail.com>wrote:

> Ok... So riddle me this...
> I currently have a replication factor of 3.
> I reset it to two.
>
> What do you have to do to get the replication factor of 3 down to 2?
> Do I just try to rebalance the nodes?
>
> The point is that you are looking at a very small cluster.
> You may want to start the be cluster with a replication factor of 2 and
> then when the data is moved over, increase it to a factor of 3. Or maybe
> not.
>
> I do a distcp to. Copy the data and after each distcp, I do an fsck for a
> sanity check and then remove the files I copied. As I gain more room, I can
> then slowly drop nodes, do an fsck, rebalance and then repeat.
>
> Even though this us a dev cluster, the OP wants to retain the data.
>
> There are other options depending on the amount and size of new hardware.
> I mean make one machine a RAID 5 machine, copy data to it clearing off the
> cluster.
>
> If 8TB was the amount of disk used, that would be 2.6666 TB used.
> Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it
> on one machine, depending on hardware, or maybe 2 machines...  Now you can
> rebuild initial cluster and then move data back. Then rebuild those
> machines. Lots of options... ;-)
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 3, 2012, at 11:26 AM, Suresh Srinivas <su...@hortonworks.com>
> wrote:
>
> > This probably is a more relevant question in CDH mailing lists. That
> said,
> > what Edward is suggesting seems reasonable. Reduce replication factor,
> > decommission some of the nodes and create a new cluster with those nodes
> > and do distcp.
> >
> > Could you share with us the reasons you want to migrate from Apache 205?
> >
> > Regards,
> > Suresh
> >
> > On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxguru@gmail.com
> >wrote:
> >
> >> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
> >> or a cross-grade then an upgrade or downgrade. I would just stick it
> >> out. But yes like Michael said two clusters on the same gear and
> >> distcp. If you are using RF=3 you could also lower your replication to
> >> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
> >> stuff.
> >>
> >>
> >> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <michael_segel@hotmail.com
> >
> >> wrote:
> >>> Ok... When you get your new hardware...
> >>>
> >>> Set up one server as your new NN, JT, SN.
> >>> Set up the others as a DN.
> >>> (Cloudera CDH3u3)
> >>>
> >>> On your existing cluster...
> >>> Remove your old log files, temp files on HDFS anything you don't need.
> >>> This should give you some more space.
> >>> Start copying some of the directories/files to the new cluster.
> >>> As you gain space, decommission a node, rebalance, add node to new
> >> cluster...
> >>>
> >>> It's a slow process.
> >>>
> >>> Should I remind you to make sure you up you bandwidth setting, and to
> >> clean up the hdfs directories when you repurpose the nodes?
> >>>
> >>> Does this make sense?
> >>>
> >>> Sent from a remote device. Please excuse any typos...
> >>>
> >>> Mike Segel
> >>>
> >>> On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com>
> wrote:
> >>>
> >>>> Yeah I know :-)
> >>>> and this is not a production cluster ;-) and yes there is more
> hardware
> >>>> coming :-)
> >>>>
> >>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <
> michael_segel@hotmail.com
> >>> wrote:
> >>>>
> >>>>> Well, you've kind of painted yourself in to a corner...
> >>>>> Not sure why you didn't get a response from the Cloudera lists, but
> >> it's a
> >>>>> generic question...
> >>>>>
> >>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
> >>>>> And please tell me you've already ordered more hardware.. Right?
> >>>>>
> >>>>> And please tell me this isn't your production cluster...
> >>>>>
> >>>>> (Strong hint to Strata and Cloudea... You really want to accept my
> >>>>> upcoming proposal talk... ;-)
> >>>>>
> >>>>>
> >>>>> Sent from a remote device. Please excuse any typos...
> >>>>>
> >>>>> Mike Segel
> >>>>>
> >>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <au...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Yes. This was first posted on the cloudera mailing list. There were
> no
> >>>>>> responses.
> >>>>>>
> >>>>>> But this is not related to cloudera as such.
> >>>>>>
> >>>>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in
> apache
> >>>>>> hadoop 0.20.205
> >>>>>>
> >>>>>> There is an upgrade namenode option when we are migrating to a
> higher
> >>>>>> version say from 0.20 to 0.20.205
> >>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
> >>>>>> Is this possible?
> >>>>>>
> >>>>>>
> >>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
> >> prash1784@gmail.com
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so would
> not
> >>>>> know
> >>>>>>> much, but you might find some help moving this to Cloudera mailing
> >> list.
> >>>>>>>
> >>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <
> austincv@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> There is only one cluster. I am not copying between clusters.
> >>>>>>>>
> >>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage
> >>>>> capacity
> >>>>>>>> and has about 8 TB of data.
> >>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that
> >> same 8
> >>>>> TB
> >>>>>>>> of data.
> >>>>>>>>
> >>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of
> >> free
> >>>>>>>> space
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
> >> nitinpawar432@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> you can actually look at the distcp
> >>>>>>>>>
> >>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
> >>>>>>>>>
> >>>>>>>>> but this means that you have two different set of clusters
> >> available
> >>>>> to
> >>>>>>>> do
> >>>>>>>>> the migration
> >>>>>>>>>
> >>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
> >> austincv@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the suggestions,
> >>>>>>>>>> My concerns are that I can't actually copyToLocal from the dfs
> >>>>>>> because
> >>>>>>>>> the
> >>>>>>>>>> data is huge.
> >>>>>>>>>>
> >>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can
> do
> >> a
> >>>>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
> >>>>>>>>>>
> >>>>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to use
> CDH3
> >>>>>>> now,
> >>>>>>>>>> which is based on 0.20
> >>>>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info has
> to
> >> be
> >>>>>>>> used
> >>>>>>>>>> by 0.20's namenode.
> >>>>>>>>>>
> >>>>>>>>>> Any idea how I can achieve what I am trying to do?
> >>>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
> >>>>>>> nitinpawar432@gmail.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> i can think of following options
> >>>>>>>>>>>
> >>>>>>>>>>> 1) write a simple get and put code which gets the data from DFS
> >> and
> >>>>>>>>> loads
> >>>>>>>>>>> it in dfs
> >>>>>>>>>>> 2) see if the distcp  between both versions are compatible
> >>>>>>>>>>> 3) this is what I had done (and my data was hardly few hundred
> >> GB)
> >>>>>>> ..
> >>>>>>>>>> did a
> >>>>>>>>>>> dfs -copyToLocal and then in the new grid did a copyFromLocal
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
> >>>>>>> austincv@gmail.com
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> >>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache
> >>>>>>> hadoop
> >>>>>>>>>>>> 0.20.205.
> >>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on
> >>>>>>>> 0.20.205.
> >>>>>>>>>>>> What is the best practice/ techniques to do this?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks & Regards,
> >>>>>>>>>>>> Austin
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Nitin Pawar
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Nitin Pawar
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>
>

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Michel Segel <mi...@hotmail.com>.
Ok... So riddle me this...
I currently have a replication factor of 3.
I reset it to two.

What do you have to do to get the replication factor of 3 down to 2?
Do I just try to rebalance the nodes?

The point is that you are looking at a very small cluster.
You may want to start the new cluster with a replication factor of 2 and then when the data is moved over, increase it to a factor of 3. Or maybe not.

I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance and then repeat.

Even though this is a dev cluster, the OP wants to retain the data. 

There are other options depending on the amount and size of new hardware.
I mean make one machine a RAID 5 machine, copy data to it clearing off the cluster.

If 8TB was the amount of raw disk used, that would be about 2.67 TB of actual data at a replication factor of 3.
Let's say 3TB. Going raid 5, how much disk is that?  So you could fit it on one machine, depending on hardware, or maybe 2 machines...  Now you can rebuild initial cluster and then move data back. Then rebuild those machines. Lots of options... ;-)

Sent from a remote device. Please excuse any typos...

Mike Segel
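
A hedged sketch of the sequence described above, with paths and thresholds
as placeholders; once the target replication is lowered the namenode
schedules deletion of the now over-replicated blocks on its own, so the
rebalance is mainly for after nodes are decommissioned:

$ hadoop dfs -setrep -R -w 2 /         # lower replication across the namespace
$ hadoop fsck / -blocks -locations     # sanity check after each distcp/cleanup pass
$ hadoop balancer -threshold 10        # even out the remaining datanodes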

On May 3, 2012, at 11:26 AM, Suresh Srinivas <su...@hortonworks.com> wrote:

> This probably is a more relevant question in CDH mailing lists. That said,
> what Edward is suggesting seems reasonable. Reduce replication factor,
> decommission some of the nodes and create a new cluster with those nodes
> and do distcp.
> 
> Could you share with us the reasons you want to migrate from Apache 205?
> 
> Regards,
> Suresh
> 
> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <ed...@gmail.com>wrote:
> 
>> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
>> or a cross-grade then an upgrade or downgrade. I would just stick it
>> out. But yes like Michael said two clusters on the same gear and
>> distcp. If you are using RF=3 you could also lower your replication to
>> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
>> stuff.
>> 
>> 
>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <mi...@hotmail.com>
>> wrote:
>>> Ok... When you get your new hardware...
>>> 
>>> Set up one server as your new NN, JT, SN.
>>> Set up the others as a DN.
>>> (Cloudera CDH3u3)
>>> 
>>> On your existing cluster...
>>> Remove your old log files, temp files on HDFS anything you don't need.
>>> This should give you some more space.
>>> Start copying some of the directories/files to the new cluster.
>>> As you gain space, decommission a node, rebalance, add node to new
>> cluster...
>>> 
>>> It's a slow process.
>>> 
>>> Should I remind you to make sure you up you bandwidth setting, and to
>> clean up the hdfs directories when you repurpose the nodes?
>>> 
>>> Does this make sense?
>>> 
>>> Sent from a remote device. Please excuse any typos...
>>> 
>>> Mike Segel
>>> 
>>> On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com> wrote:
>>> 
>>>> Yeah I know :-)
>>>> and this is not a production cluster ;-) and yes there is more hardware
>>>> coming :-)
>>>> 
>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_segel@hotmail.com
>>> wrote:
>>>> 
>>>>> Well, you've kind of painted yourself in to a corner...
>>>>> Not sure why you didn't get a response from the Cloudera lists, but
>> it's a
>>>>> generic question...
>>>>> 
>>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>>>>> And please tell me you've already ordered more hardware.. Right?
>>>>> 
>>>>> And please tell me this isn't your production cluster...
>>>>> 
>>>>> (Strong hint to Strata and Cloudea... You really want to accept my
>>>>> upcoming proposal talk... ;-)
>>>>> 
>>>>> 
>>>>> Sent from a remote device. Please excuse any typos...
>>>>> 
>>>>> Mike Segel
>>>>> 
>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <au...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Yes. This was first posted on the cloudera mailing list. There were no
>>>>>> responses.
>>>>>> 
>>>>>> But this is not related to cloudera as such.
>>>>>> 
>>>>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in apache
>>>>>> hadoop 0.20.205
>>>>>> 
>>>>>> There is an upgrade namenode option when we are migrating to a higher
>>>>>> version say from 0.20 to 0.20.205
>>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
>>>>>> Is this possible?
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
>> prash1784@gmail.com
>>>>>> wrote:
>>>>>> 
>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so would not
>>>>> know
>>>>>>> much, but you might find some help moving this to Cloudera mailing
>> list.
>>>>>>> 
>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <au...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> There is only one cluster. I am not copying between clusters.
>>>>>>>> 
>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage
>>>>> capacity
>>>>>>>> and has about 8 TB of data.
>>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that
>> same 8
>>>>> TB
>>>>>>>> of data.
>>>>>>>> 
>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of
>> free
>>>>>>>> space
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
>> nitinpawar432@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> you can actually look at the distcp
>>>>>>>>> 
>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>> 
>>>>>>>>> but this means that you have two different set of clusters
>> available
>>>>> to
>>>>>>>> do
>>>>>>>>> the migration
>>>>>>>>> 
>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
>> austincv@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks for the suggestions,
>>>>>>>>>> My concerns are that I can't actually copyToLocal from the dfs
>>>>>>> because
>>>>>>>>> the
>>>>>>>>>> data is huge.
>>>>>>>>>> 
>>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do
>> a
>>>>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
>>>>>>>>>> 
>>>>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to use CDH3
>>>>>>> now,
>>>>>>>>>> which is based on 0.20
>>>>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info has to
>> be
>>>>>>>> used
>>>>>>>>>> by 0.20's namenode.
>>>>>>>>>> 
>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> 
>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com
>>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> i can think of following options
>>>>>>>>>>> 
>>>>>>>>>>> 1) write a simple get and put code which gets the data from DFS
>> and
>>>>>>>>> loads
>>>>>>>>>>> it in dfs
>>>>>>>>>>> 2) see if the distcp  between both versions are compatible
>>>>>>>>>>> 3) this is what I had done (and my data was hardly few hundred
>> GB)
>>>>>>> ..
>>>>>>>>>> did a
>>>>>>>>>>> dfs -copyToLocal and then in the new grid did a copyFromLocal
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
>>>>>>> austincv@gmail.com
>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache
>>>>>>> hadoop
>>>>>>>>>>>> 0.20.205.
>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on
>>>>>>>> 0.20.205.
>>>>>>>>>>>> What is the best practice/ techniques to do this?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Austin
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>> 

Re: Problem with cluster

Posted by Ravi Prakash <ra...@gmail.com>.
Hi Pat,

0.20.205 is the stable version before 1.0. 1.0 is not substantially different
from 0.20. Any reasons you don't wanna use it?

I don't think "occasional HDFS corruption" is a known issue. That would be,
umm... let's just say pretty severe. Are you sure you've configured it
properly?

Your task is killing the Hadoop daemons? :-o You might wanna check with the
developers of Mahout / bixo if that is a known issue. Obviously it should
not happen. Hadoop daemons are known to be quite long lasting (many months
at least), and there are ways you can set up security to prevent tasks from
doing that (but guessing you have 2 nodes, maybe you don't want to invest
in that)

The message is displayed when the DN is trying to shut down but cannot
because it is waiting on some (apparently 1) thread.

HTH
Ravi
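
If it helps to see which thread the datanode is still waiting on, a rough
sketch (the pid and the DataXceiver thread name are assumptions to verify,
not something taken from this thread):

$ jps                                      # find the DataNode pid
$ jstack <datanode-pid> > dn-threads.txt
$ grep -B2 -A20 DataXceiver dn-threads.txt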

On Thu, May 3, 2012 at 12:09 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I'm trying to use a small cluster to make sure I understand the setup and
> have my code running before going to a big cluster. I have two machines.
> I've followed the tutorial here:
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
> I have been using 0.20.203 -- is this the most stable version of pre-1.0
> code?
>
> The cluster seemed fine for some time except for the occasional HDFS
> corruption, a know issue. I have run  mostly mahout code unaltered with
> success.
>
> However I am now getting some consistent errors with mahout and bixo (only
> recently started using this). When I start a job from the master, say a
> command line mahout job, the slave dies pretty quickly. It looks like
> spawned threads never complete and kill the slave. Hadoop may recover or it
> may not depending on what it is doing.
>
> In any case when I go to the slave and do ps -e I get a huge list of
>
>   "fuser <defunct>" with a long list of pids.
>
>
> The datanode logs on the slave have this warning:
>
>   pat@occam:~$ tail -f
>   hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
>   2012-05-03 08:39:39,035 INFO
>   org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>   threadgroup to exit, active threads is 1
>   2012-05-03 08:39:40,035 INFO
>   org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>   threadgroup to exit, active threads is 1
>   2012-05-03 08:39:41,035 INFO
>   org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>   threadgroup to exit, active threads is 1
>   2012-05-03 08:39:42,036 INFO
>   org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
>   threadgroup to exit, active threads is 1
>   etc....
>
> So far I have removed the slave from the master's config and set
> replication to 1 and all works, just slower.
>
> Any ideas? and should I upgrade to a newer version?
>
>
>
>

Problem with cluster

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I'm trying to use a small cluster to make sure I understand the setup 
and have my code running before going to a big cluster. I have two 
machines. I've followed the tutorial here: 
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 
I have been using 0.20.203 -- is this the most stable version of pre-1.0 
code?

The cluster seemed fine for some time except for the occasional HDFS 
corruption, a known issue. I have run mostly mahout code unaltered with 
success.

However I am now getting some consistent errors with mahout and bixo 
(only recently started using this). When I start a job from the master, 
say a command line mahout job, the slave dies pretty quickly. It looks 
like spawned threads never complete and kill the slave. Hadoop may 
recover or it may not depending on what it is doing.

In any case when I go to the slave and do ps -e I get a huge list of

    "fuser <defunct>" with a long list of pids.


The datanode logs on the slave have this warning:

    pat@occam:~$ tail -f
    hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
    2012-05-03 08:39:39,035 INFO
    org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
    threadgroup to exit, active threads is 1
    2012-05-03 08:39:40,035 INFO
    org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
    threadgroup to exit, active threads is 1
    2012-05-03 08:39:41,035 INFO
    org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
    threadgroup to exit, active threads is 1
    2012-05-03 08:39:42,036 INFO
    org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for
    threadgroup to exit, active threads is 1
    etc....

So far I have removed the slave from the master's config and set 
replication to 1 and all works, just slower.

Any ideas? and should I upgrade to a newer version?
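
A hedged sketch of a quick health check that helps tell real corruption
(missing or corrupt blocks) apart from transient under-replication while a
slave is flapping; nothing below is specific to this cluster:

$ hadoop fsck / -files -blocks -locations | tail -n 30   # block status summary
$ hadoop dfsadmin -report                                # live/dead nodes and capacity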




Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Suresh Srinivas <su...@hortonworks.com>.
This probably is a more relevant question in CDH mailing lists. That said,
what Edward is suggesting seems reasonable. Reduce replication factor,
decommission some of the nodes and create a new cluster with those nodes
and do distcp.

Could you share with us the reasons you want to migrate from Apache 205?

Regards,
Suresh
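
A minimal sketch of the decommission step mentioned above, assuming
dfs.hosts.exclude already points at an exclude file in hdfs-site.xml; the
file path and host name below are placeholders:

$ echo "node-to-remove.example.com" >> /etc/hadoop/conf/dfs.exclude
$ hadoop dfsadmin -refreshNodes
$ hadoop dfsadmin -report    # wait for the node to show up as Decommissioned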

On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <ed...@gmail.com>wrote:

> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
> or a cross-grade then an upgrade or downgrade. I would just stick it
> out. But yes like Michael said two clusters on the same gear and
> distcp. If you are using RF=3 you could also lower your replication to
> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
> stuff.
>
>
> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <mi...@hotmail.com>
> wrote:
> > Ok... When you get your new hardware...
> >
> > Set up one server as your new NN, JT, SN.
> > Set up the others as a DN.
> > (Cloudera CDH3u3)
> >
> > On your existing cluster...
> > Remove your old log files, temp files on HDFS anything you don't need.
> > This should give you some more space.
> > Start copying some of the directories/files to the new cluster.
> > As you gain space, decommission a node, rebalance, add node to new
> cluster...
> >
> > It's a slow process.
> >
> > Should I remind you to make sure you up you bandwidth setting, and to
> clean up the hdfs directories when you repurpose the nodes?
> >
> > Does this make sense?
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
> > On May 3, 2012, at 5:46 AM, Austin Chungath <au...@gmail.com> wrote:
> >
> >> Yeah I know :-)
> >> and this is not a production cluster ;-) and yes there is more hardware
> >> coming :-)
> >>
> >> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_segel@hotmail.com
> >wrote:
> >>
> >>> Well, you've kind of painted yourself in to a corner...
> >>> Not sure why you didn't get a response from the Cloudera lists, but
> it's a
> >>> generic question...
> >>>
> >>> 8 out of 10 TB. Are you talking effective storage or actual disks?
> >>> And please tell me you've already ordered more hardware.. Right?
> >>>
> >>> And please tell me this isn't your production cluster...
> >>>
> >>> (Strong hint to Strata and Cloudea... You really want to accept my
> >>> upcoming proposal talk... ;-)
> >>>
> >>>
> >>> Sent from a remote device. Please excuse any typos...
> >>>
> >>> Mike Segel
> >>>
> >>> On May 3, 2012, at 5:25 AM, Austin Chungath <au...@gmail.com>
> wrote:
> >>>
> >>>> Yes. This was first posted on the cloudera mailing list. There were no
> >>>> responses.
> >>>>
> >>>> But this is not related to cloudera as such.
> >>>>
> >>>> cdh3 is based on apache hadoop 0.20 as the base. My data is in apache
> >>>> hadoop 0.20.205
> >>>>
> >>>> There is an upgrade namenode option when we are migrating to a higher
> >>>> version say from 0.20 to 0.20.205
> >>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3)
> >>>> Is this possible?
> >>>>
> >>>>
> >>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <
> prash1784@gmail.com
> >>>> wrote:
> >>>>
> >>>>> Seems like a matter of upgrade. I am not a Cloudera user so would not
> >>> know
> >>>>> much, but you might find some help moving this to Cloudera mailing
> list.
> >>>>>
> >>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <au...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> There is only one cluster. I am not copying between clusters.
> >>>>>>
> >>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage
> >>> capacity
> >>>>>> and has about 8 TB of data.
> >>>>>> Now how can I migrate the same cluster to use cdh3 and use that
> same 8
> >>> TB
> >>>>>> of data.
> >>>>>>
> >>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of
> free
> >>>>>> space
> >>>>>>
> >>>>>>
> >>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <
> nitinpawar432@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> you can actually look at the distcp
> >>>>>>>
> >>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
> >>>>>>>
> >>>>>>> but this means that you have two different set of clusters
> available
> >>> to
> >>>>>> do
> >>>>>>> the migration
> >>>>>>>
> >>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <
> austincv@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for the suggestions,
> >>>>>>>> My concerns are that I can't actually copyToLocal from the dfs
> >>>>> because
> >>>>>>> the
> >>>>>>>> data is huge.
> >>>>>>>>
> >>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do
> a
> >>>>>>>> namenode upgrade. I don't have to copy data out of dfs.
> >>>>>>>>
> >>>>>>>> But here I am having Apache hadoop 0.20.205 and I want to use CDH3
> >>>>> now,
> >>>>>>>> which is based on 0.20
> >>>>>>>> Now it is actually a downgrade as 0.20.205's namenode info has to
> be
> >>>>>> used
> >>>>>>>> by 0.20's namenode.
> >>>>>>>>
> >>>>>>>> Any idea how I can achieve what I am trying to do?
> >>>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <
> >>>>> nitinpawar432@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> i can think of following options
> >>>>>>>>>
> >>>>>>>>> 1) write a simple get and put code which gets the data from DFS
> and
> >>>>>>> loads
> >>>>>>>>> it in dfs
> >>>>>>>>> 2) see if the distcp  between both versions are compatible
> >>>>>>>>> 3) this is what I had done (and my data was hardly few hundred
> GB)
> >>>>> ..
> >>>>>>>> did a
> >>>>>>>>> dfs -copyToLocal and then in the new grid did a copyFromLocal
> >>>>>>>>>
> >>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <
> >>>>> austincv@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> >>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache
> >>>>> hadoop
> >>>>>>>>>> 0.20.205.
> >>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on
> >>>>>> 0.20.205.
> >>>>>>>>>> What is the best practice/ techniques to do this?
> >>>>>>>>>>
> >>>>>>>>>> Thanks & Regards,
> >>>>>>>>>> Austin
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Nitin Pawar
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Nitin Pawar
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
>

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Edward Capriolo <ed...@gmail.com>.
Honestly that is a hassle, going from 205 to cdh3u3 is probably more
of a cross-grade than an upgrade or downgrade. I would just stick it
out. But yes like Michael said two clusters on the same gear and
distcp. If you are using RF=3 you could also lower your replication to
rf=2 ('hadoop dfs -setrep -R 2 /') to clear headroom as you are moving
stuff.



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Michel Segel <mi...@hotmail.com>.
Ok... When you get your new hardware...

Set up one server as your new NN, JT, SN.
Set up the others as a DN.
(Cloudera CDH3u3)

On your existing cluster... 
Remove your old log files, temp files on HDFS anything you don't need.
This should give you some more space.
Start copying some of the directories/files to the new cluster.
As you gain space, decommission a node, rebalance, add node to new cluster...

It's a slow process. 

Should I remind you to make sure you up your bandwidth setting, and to clean up the HDFS directories when you repurpose the nodes?
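
For reference, a rough sketch of those knobs (the property names are the stock
0.20 ones; the exclude file path and hostname are made up):

  # hdfs-site.xml on the old namenode: point dfs.hosts.exclude at an exclude
  # file, and raise dfs.balance.bandwidthPerSec (the default is only 1 MB/s)
  echo "dn05.example.com" >> /etc/hadoop/conf/excludes   # node being repurposed
  hadoop dfsadmin -refreshNodes    # starts decommissioning the excluded node
  hadoop balancer                  # spread blocks across the remaining nodes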

Does this make sense?

Sent from a remote device. Please excuse any typos...

Mike Segel


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
Yeah I know :-)
and this is not a production cluster ;-) and yes there is more hardware
coming :-)


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Michel Segel <mi...@hotmail.com>.
Well, you've kind of painted yourself into a corner...
Not sure why you didn't get a response from the Cloudera lists, but it's a generic question...

8 out of 10 TB. Are you talking effective storage or actual disks? 
And please tell me you've already ordered more hardware... Right?

And please tell me this isn't your production cluster...

(Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-)


Sent from a remote device. Please excuse any typos...

Mike Segel


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
Yes. This was first posted on the Cloudera mailing list. There were no
responses.

But this is not specific to Cloudera as such.

CDH3 uses Apache Hadoop 0.20 as its base. My data is in Apache Hadoop
0.20.205.

There is a namenode upgrade option when migrating to a higher version,
say from 0.20 to 0.20.205,
but here I am going down from 0.20.205 to 0.20 (CDH3).
Is this possible?
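
For anyone following the thread, the "upgrade option" is the -upgrade startup
flag. The normal flow between compatible Apache releases looks roughly like
this (script names as in a stock 0.20 tarball); note there is an -upgrade and
a -rollback, but no -downgrade:

  bin/stop-all.sh                    # stop the old version
  bin/start-dfs.sh -upgrade          # start the new version, converting the namenode metadata
  hadoop dfsadmin -finalizeUpgrade   # only after verifying; start-dfs.sh -rollback reverts until then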



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Prashant Kommireddi <pr...@gmail.com>.
Seems like a matter of upgrade. I am not a Cloudera user so would not know
much, but you might find some help moving this to Cloudera mailing list.


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
There is only one cluster. I am not copying between clusters.

Say I have a cluster running Apache 0.20.205 with 10 TB of storage capacity
and about 8 TB of data.
How can I migrate that same cluster to CDH3 and keep the same 8 TB
of data?

I can't copy 8 TB of data using distcp because I have only 2 TB of free
space.
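
As a sanity check on that headroom figure:

  hadoop dfsadmin -report   # prints configured capacity, DFS used and DFS remaining per node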



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Nitin Pawar <ni...@gmail.com>.
you can actually look at distcp

http://hadoop.apache.org/common/docs/r0.20.0/distcp.html

but this means that you need two different clusters available to do
the migration
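
A minimal form of that, assuming the second cluster is up (hostnames, ports and
paths below are placeholders, not from this thread):

  # run from the destination cluster; read the old cluster over (read-only) hftp
  # when the versions differ, and write into the new cluster's hdfs
  hadoop distcp hftp://old-nn:50070/user/data hdfs://new-nn:9000/user/data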




-- 
Nitin Pawar

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Austin Chungath <au...@gmail.com>.
Thanks for the suggestions.
My concern is that I can't actually copyToLocal from the DFS because the
data is huge.

If my Hadoop were 0.20 and I were upgrading to 0.20.205 I could do a
namenode upgrade; I wouldn't have to copy data out of DFS.

But here I have Apache Hadoop 0.20.205 and I want to use CDH3 now,
which is based on 0.20.
So it is actually a downgrade: 0.20.205's namenode metadata has to be read
by 0.20's namenode.

Any idea how I can achieve what I am trying to do?

Thanks.


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

Posted by Nitin Pawar <ni...@gmail.com>.
I can think of the following options:

1) write a simple get-and-put job that reads the data out of the old DFS and
loads it into the new DFS
2) see if distcp is compatible between the two versions
3) this is what I had done (and my data was hardly a few hundred GB): a
dfs -copyToLocal on the old grid, and then a copyFromLocal on the new grid (sketch below)
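
Roughly, for option 3 (paths are placeholders, and this assumes enough local
disk to stage the data):

  # on the old grid: pull the dataset down to local disk
  hadoop dfs -copyToLocal /user/data /staging/data
  # on the new grid: push it back into HDFS
  hadoop dfs -copyFromLocal /staging/data /user/data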




-- 
Nitin Pawar