Create pod, kuryr-controller persistent CrashLoopBackOff

Bug #1909165 reported by liujinxin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
kuryr-kubernetes
Confirmed
Low
Unassigned

Bug Description

**What happened**:
```
Create pod, kuryr-controller persistent CrashLoopBackOff
```

**What you expected to happen**:
```
```
**How to reproduce it (as minimally and precisely as possible)**:
```
NeutronVIFPool._recover_precreated_ports() does not finish for a long time (too many precreated ports)

I'll fix this by adjusting the value of livenessProbe.initialDelaySeconds for the kuryr-controller container

```

**Environment**:
- Kubernetes version:
`v1.14.3`
- Kuryr-Kubernetes version:
`stable/train`
- Openstack version:
`stable/Rocky`

- Cluster information:
```
NAME STATUS ROLES AGE VERSION
master-0 Ready l3-node,master,monitor,node,openvswitch 47d v1.14.3
master-1 Ready l3-node,master,node,openstack-compute,openvswitch 47d v1.14.3
master-2 Ready l3-node,master,node,openvswitch 47d v1.14.3
slave-0 Ready l3-node,node,openvswitch 23d v1.14.3

kuryr-cni-ds-bngz6 1/1 Running 10 16d 192.168.15.86 master-1
kuryr-cni-ds-c7zp7 1/1 Running 1 16d 192.168.15.93 master-2
kuryr-cni-ds-cww29 1/1 Running 2 16d 192.168.15.91 master-0
kuryr-cni-ds-xz688 1/1 Running 0 22h 192.168.15.100 slave-0
kuryr-controller-5c964f789f-ndpk6 0/1 CrashLoopBackOff 11 20h 192.168.15.91 master-0

```
**Anything else we need to know?**:
```
---
kuryr-controller.log
---

2020-12-24 03:00:37.252 1 DEBUG kuryr_kubernetes.controller.drivers.vif_pool [-] Kuryr-controller not yet ready to handle new pods. request_vif /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py:196
2020-12-24 03:00:37.252 1 DEBUG kuryr_kubernetes.handlers.retry [-] Handler KuryrPortHandler failed (attempt 1; ResourceNotReady: Resource not ready: {'kind': 'Pod', 'apiVersion': 'v1', 'metadata': {'name': 'nginx-5b98f4fcf8-trj99', 'generateName': 'nginx-5b98f4fcf8-', 'namespace': 'default', 'selfLink': '/api/v1/namespaces/default/pods/nginx-5b98f4fcf8-trj
99', 'uid': 'd1108ff5-4586-11eb-9bab-fa163e903454', 'resourceVersion': '10031577', 'creationTimestamp': '2020-12-24T01:24:50Z', 'labels': {'app': 'nginx', 'pod-template-hash': '5b98f4fcf8'}, 'ownerReferences': [{'apiVersion': 'apps/v1', 'kind': 'ReplicaSet', 'name': 'nginx-5b98f4fcf8', 'uid': '03303ec6-44f4-11eb-b14a-fa163e33b9d8', 'controller': True, 'bloc
kOwnerDeletion': True}], 'finalizers': ['kuryr.openstack.org/pod-finalizer']}, 'spec': {'volumes': [{'name': 'default-token-489ph', 'secret': {'secretName': 'default-token-489ph', 'defaultMode': 420}}], 'containers': [{'name': 'nginx', 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.17.5', 'ports': [{'containerPort': 8080, 'prot
ocol': 'TCP'}], 'resources': {}, 'volumeMounts': [{'name': 'default-token-489ph', 'readOnly': True, 'mountPath': '/var/run/secrets/kubernetes.io/serviceaccount'}], 'terminationMessagePath': '/dev/termination-log', 'terminationMessagePolicy': 'File', 'imagePullPolicy': 'IfNotPresent'}], 'restartPolicy': 'Always', 'terminationGracePeriodSeconds': 30, 'dnsPoli
cy': 'ClusterFirst', 'serviceAccountName': 'default', 'serviceAccount': 'default', 'nodeName': 'master-0', 'securityContext': {}, 'schedulerName': 'default-scheduler', 'tolerations': [{'key': 'node.kubernetes.io/not-ready', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}, {'key': 'node.kubernetes.io/unreachable
', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}], 'priority': 0, 'enableServiceLinks': True}, 'status': {'phase': 'Pending', 'conditions': [{'type': 'Initialized', 'status': 'True', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:24:54Z'}, {'type': 'Ready', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime':
 '2020-12-24T01:24:54Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'ContainersReady', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:24:54Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'PodScheduled', 'status': 'True', '
lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:24:51Z'}], 'hostIP': '192.168.15.91', 'startTime': '2020-12-24T01:24:54Z', 'containerStatuses': [{'name': 'nginx', 'state': {'waiting': {'reason': 'ContainerCreating'}}, 'lastState': {}, 'ready': False, 'restartCount': 0, 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.1
7.5', 'imageID': ''}], 'qosClass': 'BestEffort'}}) _sleep /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py:101

...

2020-12-24 01:54:07.668 1 DEBUG kuryr_kubernetes.k8s_client [-] Get /api/v1/nodes/slave-0 get /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py:94
2020-12-24 01:54:07.681 1 DEBUG dogpile.cache.region [-] Cache value generated in 0.013 seconds for key(s): 'kuryr_kubernetes.controller.drivers.vif_pool:_get_node_vif_driver|slave-0' _log_time /usr/local/lib/python3.6/site-packages/dogpile/cache/region.py:790
2020-12-24 01:54:07.682 1 DEBUG dogpile.lock [-] Released creation lock _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:182
2020-12-24 01:54:07.682 1 DEBUG kuryr_kubernetes.controller.drivers.vif_pool [-] Kuryr-controller not yet ready to handle new pods. request_vif /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py:196
2020-12-24 01:54:07.682 1 DEBUG kuryr_kubernetes.handlers.retry [-] Handler KuryrPortHandler failed (attempt 4; ResourceNotReady: Resource not ready: {'kind': 'Pod', 'apiVersion': 'v1', 'metadata': {'name': 'nginx-5b98f4fcf8-l4rtm', 'generateName': 'nginx-5b98f4fcf8-', 'namespace': 'default', 'selfLink': '/api/v1/namespaces/default/pods/nginx-5b98f4fcf8-l4rtm', 'uid': 'd4bb15b8-4586-11eb-9bab-fa163e903454', 'resourceVersion': '10031605', 'creationTimestamp': '2020-12-24T01:24:56Z', 'labels': {'app': 'nginx', 'pod-template-hash': '5b98f4fcf8'}, 'ownerReferences': [{'apiVersion': 'apps/v1', 'kind': 'ReplicaSet', 'name': 'nginx-5b98f4fcf8', 'uid': '03303ec6-44f4-11eb-b14a-fa163e33b9d8', 'controller': True, 'blockOwnerDeletion': True}], 'finalizers': ['kuryr.openstack.org/pod-finalizer']}, 'spec': {'volumes': [{'name': 'default-token-489ph', 'secret': {'secretName': 'default-token-489ph', 'defaultMode': 420}}], 'containers': [{'name': 'nginx', 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.17.5', 'ports': [{'containerPort': 8080, 'protocol': 'TCP'}], 'resources': {}, 'volumeMounts': [{'name': 'default-token-489ph', 'readOnly': True, 'mountPath': '/var/run/secrets/kubernetes.io/serviceaccount'}], 'terminationMessagePath': '/dev/termination-log', 'terminationMessagePolicy': 'File', 'imagePullPolicy': 'IfNotPresent'}], 'restartPolicy': 'Always', 'terminationGracePeriodSeconds': 30, 'dnsPolicy': 'ClusterFirst', 'serviceAccountName': 'default', 'serviceAccount': 'default', 'nodeName': 'slave-0', 'securityContext': {}, 'schedulerName': 'default-scheduler', 'tolerations': [{'key': 'node.kubernetes.io/not-ready', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}, {'key': 'node.kubernetes.io/unreachable', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}], 'priority': 0, 'enableServiceLinks': True}, 'status': {'phase': 'Pending', 'conditions': [{'type': 'Initialized', 'status': 'True', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:09Z'}, {'type': 'Ready', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:09Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'ContainersReady', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:09Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'PodScheduled', 'status': 'True', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:06Z'}], 'hostIP': '192.168.15.100', 'startTime': '2020-12-24T01:25:09Z', 'containerStatuses': [{'name': 'nginx', 'state': {'waiting': {'reason': 'ContainerCreating'}}, 'lastState': {}, 'ready': False, 'restartCount': 0, 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.17.5', 'imageID': ''}], 'qosClass': 'BestEffort'}}) _sleep /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py:101
...

2020-12-24 01:54:08.199 1 DEBUG dogpile.cache.region [-] Cache value generated in 0.299 seconds for key(s): 'kuryr_kubernetes.utils:get_subnet|46c618de-d40e-4b81-a506-23436c9c7305' _log_time /usr/local/lib/python3.6/site-packages/dogpile/cache/region.py:790
2020-12-24 01:54:08.200 1 DEBUG dogpile.lock [-] Released creation lock _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:182
2020-12-24 01:54:08.203 1 DEBUG kuryr_kubernetes.os_vif_util [-] Detected vhostuser_mode=False for port 3caa87a0-548d-460a-b705-0890ce3f2466 neutron_to_osvif_vif_ovs /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/os_vif_util.py:270
2020-12-24 01:54:08.204 1 DEBUG dogpile.cache.region [-] No value present for key: 'kuryr_kubernetes.utils:get_subnet|46c618de-d40e-4b81-a506-23436c9c7305' _is_cache_miss /usr/local/lib/python3.6/site-packages/dogpile/cache/region.py:795
2020-12-24 01:54:08.205 1 DEBUG dogpile.lock [-] NeedRegenerationException _enter /usr/local/lib/python3.6/site-packages/dogpile/lock.py:90
2020-12-24 01:54:08.205 1 DEBUG dogpile.lock [-] no value, waiting for create lock _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:131
2020-12-24 01:54:08.205 1 DEBUG dogpile.lock [-] value creation lock <dogpile.cache.backends.null.NullLock object at 0x7f4e36d02e48> acquired _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:135
2020-12-24 01:54:08.205 1 DEBUG dogpile.cache.region [-] No value present for key: 'kuryr_kubernetes.utils:get_subnet|46c618de-d40e-4b81-a506-23436c9c7305' _is_cache_miss /usr/local/lib/python3.6/site-packages/dogpile/cache/region.py:795
2020-12-24 01:54:08.206 1 DEBUG dogpile.lock [-] Calling creation function for not-yet-present value _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:176
....

2020-12-24 01:54:09.061 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
2020-12-24 01:54:09.062 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods'
2020-12-24 01:54:09.063 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces'
2020-12-24 01:54:09.065 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks'
2020-12-24 01:54:09.065 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports'
2020-12-24 01:54:09.066 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies'
2020-12-24 01:54:09.067 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies'
2020-12-24 01:54:09.067 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting...
2020-12-24 01:54:09.067 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping

2020-12-24 01:54:09.069 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 78, in __call__\n self._handler(event, *args, **kwargs)\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 90, in __call__\n self.on_present(obj)\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 63, in on_present\n if not self.get_vifs(kuryrport_crd):\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 227, in get_vifs\n pod, project_id, subnets, security_groups)\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1198, in request_vif\n pod, project_id, subnets, security_groups)\n', ' File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 197, in request_vif\n raise exceptions.ResourceNotReady(pod)\n', "kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: {'kind': 'Pod', 'apiVersion': 'v1', 'metadata': {'name': 'nginx-5b98f4fcf8-dpqdf', 'generateName': 'nginx-5b98f4fcf8-', 'namespace': 'default', 'selfLink': '/api/v1/namespaces/default/pods/nginx-5b98f4fcf8-dpqdf', 'uid': 'd4babb92-4586-11eb-9bab-fa163e903454', 'resourceVersion': '10031621', 'creationTimestamp': '2020-12-24T01:24:56Z', 'labels': {'app': 'nginx', 'pod-template-hash': '5b98f4fcf8'}, 'ownerReferences': [{'apiVersion': 'apps/v1', 'kind': 'ReplicaSet', 'name': 'nginx-5b98f4fcf8', 'uid': '03303ec6-44f4-11eb-b14a-fa163e33b9d8', 'controller': True, 'blockOwnerDeletion': True}], 'finalizers': ['kuryr.openstack.org/pod-finalizer']}, 'spec': {'volumes': [{'name': 'default-token-489ph', 'secret': {'secretName': 'default-token-489ph', 'defaultMode': 420}}], 'containers': [{'name': 'nginx', 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.17.5', 'ports': [{'containerPort': 8080, 'protocol': 'TCP'}], 'resources': {}, 'volumeMounts': [{'name': 'default-token-489ph', 'readOnly': True, 'mountPath': '/var/run/secrets/kubernetes.io/serviceaccount'}], 'terminationMessagePath': '/dev/termination-log', 'terminationMessagePolicy': 'File', 'imagePullPolicy': 'IfNotPresent'}], 'restartPolicy': 'Always', 'terminationGracePeriodSeconds': 30, 'dnsPolicy': 'ClusterFirst', 'serviceAccountName': 'default', 'serviceAccount': 'default', 'nodeName': 'slave-0', 'securityContext': {}, 'schedulerName': 'default-scheduler', 'tolerations': [{'key': 'node.kubernetes.io/not-ready', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}, {'key': 'node.kubernetes.io/unreachable', 'operator': 'Exists', 'effect': 'NoExecute', 'tolerationSeconds': 300}], 'priority': 0, 'enableServiceLinks': True}, 'status': {'phase': 'Pending', 'conditions': [{'type': 'Initialized', 'status': 'True', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:12Z'}, {'type': 'Ready', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:12Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'ContainersReady', 'status': 'False', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:12Z', 'reason': 'ContainersNotReady', 'message': 'containers with unready status: [nginx]'}, {'type': 'PodScheduled', 'status': 'True', 'lastProbeTime': None, 'lastTransitionTime': '2020-12-24T01:25:06Z'}], 'hostIP': '192.168.15.100', 'startTime': '2020-12-24T01:25:12Z', 'containerStatuses': [{'name': 'nginx', 'state': {'waiting': {'reason': 'ContainerCreating'}}, 'lastState': {}, 'ready': False, 'restartCount': 0, 'image': 'registry-rdcenter-test.cloud.cn/library/common/nginx-amd64:1.17.5', 'imageID': ''}], 'qosClass': 'BestEffort'}}\n"]: greenlet.GreenletExit
...

2020-12-24 01:54:09.085 1 DEBUG kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler stopped processing group 2b89c768-4587-11eb-9ac9-fa163ebaf135 _done /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/asynchronous.py:108
2020-12-24 01:54:09.085 1 DEBUG kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler is idle _done /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/asynchronous.py:117
2020-12-24 01:54:09.086 1 DEBUG dogpile.lock [-] Released creation lock _enter_create /usr/local/lib/python3.6/site-packages/dogpile/lock.py:182
2020-12-24 01:54:09.086 1 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/local/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:266
2020-12-24 01:54:09.086 1 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/local/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:282
2020-12-24 01:54:09.087 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopped
```

liujinxin (scilla)
description: updated
liujinxin (scilla)
description: updated
liujinxin (scilla)
description: updated
liujinxin (scilla)
description: updated
Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Are you using DevStack to deploy Kuryr? Technically it is supposed to only be used in development environments.

But anyway seems looks like with OpenShift production envs we use way higher values [1]. Can you check if those work in your case?

[1] https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/kuryr/006-controller.yaml#L34-L47

Changed in kuryr-kubernetes:
status: New → Confirmed
importance: Undecided → Low
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.