Hank Lin

A new blog

Recovery of EBS-boot EC2 Instances


Recently I got another notification email from AWS saying that, because of a problem on their side, one of my instances was going to be terminated. The email looked roughly like this: "We have noticed that one or more of your instances are running on a host degraded due to hardware failure. i-a20a871a The host needs to undergo maintenance and will be taken down at 12:00 GMT on 2010-06-16. Your instances will be terminated at this point. ......"

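This is the more graceful kind of failure. EC2 instances can of course also just die outright. My impression is that this used to happen more often; I have not run into it in the past year or so. Every instance that died outright on me was an m1.small; the larger instance types never did. In my last class someone asked: why move to EC2 if EC2 instances still go down?! True, the unexpected happens, and EC2 instances do fail. The fact is, anything can fail, even the most expensive high-end server. Keep one thing in mind: when you use cloud computing, you still need both backup and disaster recovery. The difference is that when an EC2 instance dies, I sit in my office, run a few commands, and it is back. With a rented server, all you can do is report the problem and wait anxiously. With colocation, sorry, you get to make a trip to the data center yourself!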

Steps of Recovery

Because I use EBS-boot instances, recovery is simple. Let's go through it step by step:

First, look up the instance's details; we will need them later:

$ ec2-describe-instances i-a20a871a
RESERVATION     r-441f2d2c      107357334611    default
INSTANCE        i-a20a871a      ami-155b3811    ec2-67-202-19-78.compute-1.amazonaws.com       domU-12-11-21-01-D1-F2.compute-1.internal       running mykeypair 0      m1.small 2010-03-01T07:26:57+0000        us-east-1a                              monitoring-disabled     67.202.19.78   10.253.214.4                    ebs
BLOCKDEVICE     /dev/sda1       vol-242e0c1a    2010-03-01T08:04:09.000Z

Note down the volume ID and the AMI ID. Mine are vol-242e0c1a and ami-155b3811. You can also check the old volume; it should be attached to the old instance.

$ ec2-describe-volumes vol-242e0c1a
ATTACHMENT      vol-242e0c1a    i-a20a871a      /dev/sda1       attached        2010-03-01T07:27:09+0000
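If you would rather not copy the IDs by hand, a small sketch like the following can pull them out of the ec2-describe-instances output. The function names ami_id and root_volume_id are my own (hypothetical), and the field positions are assumed from the output shown above, with /dev/sda1 as the root device.

```shell
#!/bin/sh
# Hypothetical helpers: both read ec2-describe-instances output on stdin.
# Field positions are assumed from the sample output in this post.

# Print the AMI ID: the third field of the INSTANCE line.
ami_id() {
    awk '$1 == "INSTANCE" { print $3; exit }'
}

# Print the root volume ID: the third field of the BLOCKDEVICE line
# whose device is /dev/sda1.
root_volume_id() {
    awk '$1 == "BLOCKDEVICE" && $2 == "/dev/sda1" { print $3; exit }'
}

# Possible usage (makes a real API call, so it is left commented out):
#   info=$(ec2-describe-instances i-a20a871a)
#   printf '%s\n' "$info" | ami_id
#   printf '%s\n' "$info" | root_volume_id
```

Saving the output in a variable first means the API is called only once for both IDs.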

Log in to the old instance. To be on the safe side, first confirm that the instance ID matches:

$ curl http://169.254.169.254/latest/meta-data/instance-id && echo
i-a20a871a
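The same check can be turned into a guard that refuses to continue on the wrong machine. This is only a sketch: confirm_instance is a hypothetical name, and it simply compares two strings, so the metadata lookup stays outside the function.

```shell
#!/bin/sh
# Hypothetical guard: compare the instance ID we expect with the one
# the machine reports, and fail loudly when they differ.
# Usage: confirm_instance EXPECTED_ID ACTUAL_ID
confirm_instance() {
    if [ "$1" = "$2" ]; then
        echo "ok: this is $1"
        return 0
    fi
    echo "refusing to continue: expected $1 but this is $2" >&2
    return 1
}

# Possible usage on the instance itself (left commented out):
#   actual=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   confirm_instance i-a20a871a "$actual" || exit 1
```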

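Shut down whatever services need shutting down. If any data you care about lives on instance storage (ephemeral storage), back it up now, because it is gone once the instance is terminated. After that you can stop the instance. Only EBS-boot instances can be stopped, and a stopped instance is not billed EC2 instance-hours (the EBS volume itself is still billed as storage). The upside is that you can start it again and the data on the EBS volume is still there. The downside is that the IP address changes.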

$ ec2-stop-instances i-a20a871a
INSTANCE        i-a20a871a      running stopping

Wait until its state is stopped:

$ ec2-describe-instances i-a20a871a
RESERVATION     r-441f2d2c      107357334611    default
INSTANCE        i-a20a871a      ami-155b3811                    stopped mykeypair 0               m1.small        2010-03-01T07:26:57+0000        us-east-1a              monitoring-disabled                                      ebs
BLOCKDEVICE     /dev/sda1       vol-242e0c1a    2010-03-01T08:04:09.000Z
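Instead of re-running the describe command by hand, the waiting can be scripted. Below is a minimal sketch with two hypothetical helpers: state_of_instance picks the state word out of the INSTANCE line (it scans for it, because the DNS columns are empty once the instance is stopped), and wait_until polls any command until it prints the expected value. The retry count and delay are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: read ec2-describe-instances output on stdin and
# print the first field of the INSTANCE line that looks like a state.
state_of_instance() {
    awk '$1 == "INSTANCE" {
        for (i = 4; i <= NF; i++)
            if ($i ~ /^(pending|running|shutting-down|stopping|stopped|terminated)$/) {
                print $i
                exit
            }
    }'
}

# Hypothetical helper: run COMMAND up to MAX_TRIES times, DELAY seconds
# apart, and succeed as soon as its output equals EXPECTED.
# Usage: wait_until EXPECTED MAX_TRIES DELAY COMMAND [ARGS...]
wait_until() {
    expected=$1
    max_tries=$2
    delay=$3
    shift 3
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        current=$("$@") || current=""
        if [ "$current" = "$expected" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Possible usage (makes real API calls, so it is left commented out):
#   old_state() { ec2-describe-instances i-a20a871a | state_of_instance; }
#   wait_until stopped 60 5 old_state
```

The same wait_until loop would also fit other "wait for this status" steps, given a command that prints the status.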

Then take a snapshot of the old volume. Make the -d "description" easy to understand.

$ ec2-create-snapshot vol-242e0c1a -d "Ubuntu 8.04 20100607"
SNAPSHOT        snap-1a73348a   vol-242e0c1a    pending 2010-06-07T07:55:59+0000                107357334611    20      Ubuntu 8.04 20100607

Wait until the snapshot's status is completed:

$ ec2-describe-snapshots snap-1a73348a
SNAPSHOT        snap-1a73348a   vol-242e0c1a    completed       2010-06-07T07:55:59+0000        100%    107357334611    20      Ubuntu 8.04 20100607

Now register the new snapshot as an AMI. Again, make the -d "description" descriptive, so that later on you can still tell which image this is!

$ ec2-register -s snap-1a73348a -a i386 -n "ebs-ubuntu-8.04-i386-20100607" -d "EBS Ubuntu 8.04 i386 20100607" -b /dev/sda2=ephemeral0 --root-device-name /dev/sda1
IMAGE   ami-c4b65e31

Now launch an instance from the new AMI!

$ ec2-run-instances ami-c4b65e31 -k mykeypair -g default -z us-east-1a -t m1.small --instance-initiated-shutdown-behavior stop --disable-api-termination
RESERVATION     r-4ac31021      107357334611    default
INSTANCE        i-949c91ef      ami-c4b65e31                    pending mykeypair 0               m1.small        2010-06-07T08:01:21+0000        us-east-1a             monitoring-disabled  

Once the new instance's state is running, you can log in and have a look:

$ ec2-describe-instances i-949c91ef
RESERVATION     r-4ac31021      107357334611    default
INSTANCE        i-949c91ef      ami-c4b65e31    ec2-174-129-101-39.compute-1.amazonaws.com      domU-12-31-38-00-40-22.compute-1.internal       running mykeypair 0      m1.small 2010-06-07T08:01:21+0000        us-east-1a                              monitoring-disabled     174.129.101.39  10.252.71.208                   ebs
BLOCKDEVICE     /dev/sda1       vol-a1179d66    2010-06-07T08:01:25.000Z

$ ssh -i mykeypair.pem root@174.129.101.39

Clean Up

Once you have confirmed that the new instance works properly, you can clean up the old instance, volume, and snapshot. First, terminate the old instance.

$ ec2-terminate-instances i-a20a871a
Client.OperationNotPermitted: The instance 'i-a20a871a' may not be terminated. Modify its 'disableApiTermination' instance attribute and try again.

Ha! If, like me, you habitually pass --disable-api-termination when you run instances, or use

$ ec2-modify-instance-attribute --disable-api-termination true $instanceId

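to disable API termination, you are protected from terminating an instance by accident, the kind of slip you regret for a long time! Now that we really do want to terminate it, enable API termination again: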

$ ec2-modify-instance-attribute i-a20a871a --disable-api-termination false
disableApiTermination   i-a20a871a      false

Now it can be terminated.

$ ec2-terminate-instances i-a20a871a
INSTANCE        i-a20a871a      stopped terminated

Next, look up the old instance's AMI and note down its snapshot ID.

$ ec2-describe-images ami-155b3811
IMAGE   ami-155b3811    107357334611/ebs-ubuntu-8.04-32b-20100301  107357334611    available       private         i386    machine                         ebs
BLOCKDEVICEMAPPING      /dev/sda1               snap-e13ab246   20

Deregister the old AMI first:

$ ec2-deregister ami-155b3811
IMAGE   ami-155b3811

Then delete the old snapshot:

$ ec2-delete-snapshot snap-e13ab246
SNAPSHOT        snap-e13ab246

Finally, delete the old volume:

$ ec2-delete-volume vol-242e0c1a
VOLUME  vol-242e0c1a
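The order of the clean-up matters: the AMI is deregistered before its snapshot is deleted, and the volume goes last. As a rough sketch, a dry-run helper can print the commands in that order so they can be reviewed before anything is actually removed. cleanup_plan is a hypothetical name; it only prints text and calls nothing.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the clean-up commands in the order
# used in this post, without executing any of them.
# Usage: cleanup_plan INSTANCE_ID AMI_ID SNAPSHOT_ID VOLUME_ID
cleanup_plan() {
    instance=$1
    ami=$2
    snapshot=$3
    volume=$4
    cat <<EOF
ec2-modify-instance-attribute $instance --disable-api-termination false
ec2-terminate-instances $instance
ec2-deregister $ami
ec2-delete-snapshot $snapshot
ec2-delete-volume $volume
EOF
}

# Example: print the plan for the old resources from this post.
cleanup_plan i-a20a871a ami-155b3811 snap-e13ab246 vol-242e0c1a
```

Once the printed plan looks right, the lines can be run one at a time, checking the output of each step as above.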

Done! And all of it without leaving my own desk!