Today we had an outage caused simply by the unsafe behavior of AWS EKS managed node groups.
The documentation says that if you delete a node group whose nodes use an IAM role, EKS will check whether that role is still in use by other nodes, and if so, it will not remove the role.
True, it does not remove the role, but it performs an equally unsafe action: it modifies the aws-auth ConfigMap in the kube-system namespace and removes the mapRoles entry that grants nodes permission to join the cluster …
My God, why it does that is beyond my comprehension.
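For context, here is a minimal sketch of what a typical aws-auth ConfigMap looks like; the account ID and role name below are placeholders, not our actual values. The mapRoles entry is the block that the node group deletion ripped out:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # This is the entry EKS removed. Without it, every kubelet that
    # assumes this node role can no longer authenticate with the API server.
    - rolearn: arn:aws:iam::111122223333:role/my-eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```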
This left all the other nodes unable to authenticate with the control plane. Every node went into Unknown status, and all pods showed yellow in Lens.
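The fix is to put the mapping back, either by hand with kubectl edit configmap aws-auth -n kube-system or with eksctl. A sketch of the latter, assuming placeholder cluster and role names:

```sh
# Re-create the node role mapping that EKS removed from aws-auth.
# Replace the cluster name, account ID, and role name with your own.
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::111122223333:role/my-eks-node-role \
  --username 'system:node:{{EC2PrivateDNSName}}' \
  --group system:bootstrappers \
  --group system:nodes
```

Once the entry is back, kubelets can authenticate again and the nodes should recover on their own.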
I have given this feedback to the AWS support engineer, though I'm not sure how seriously AWS will take it. Let's wait and see…