Agent lost: invalid entity name or password (unauthorized access)

Bug #1956975 reported by Laurent Sesquès
This bug affects 6 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

While looking for reproducers for lp:1953077, I found a "lost" agent that did not show the previously-failed-actions issue mentioned there.
Its last logs were several weeks old:

2021-11-18 23:30:52 INFO juju.worker.uniter.operation runhook.go:155 skipped "update-status" hook (missing)
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "migration-minion" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "log-sender" manifold worker returned unexpected error: cannot send log message: tls: use of closed connection
2021-11-18 23:30:53 INFO juju.worker.logger logger.go:136 logger worker stopped
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "logging-config-updater" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "api-address-updater" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "meter-status" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 INFO juju.worker.uniter uniter.go:323 unit "myapp-wsgi/3" shutting down: catacomb 0xc000ee4d80 is dying
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "uniter" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2021-11-18 23:40:13 INFO juju unit_agent.go:253 Starting unit workers for "myapp-wsgi/3"
2021-11-18 23:40:13 INFO juju.agent.setup agentconf.go:128 setting logging config to "<root>=INFO;unit=DEBUG"
2021-11-18 23:40:13 INFO juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: [f2483f] "unit-myapp-wsgi-3" cannot open api: try again (try again)
2021-11-18 23:40:54 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

Restarting jujud didn't generate any new log lines for this unit (it did for all the other units on the same machine).

This is on Juju 2.9.21, on an OpenStack cloud.

John A Meinel (jameinel) wrote :

```
 2021-11-18 23:30:53 ERROR juju.worker.dependency engine.go:676 "meter-status" manifold worker returned unexpected error: watcher has been stopped (stopped)
2021-11-18 23:30:53 INFO juju.worker.uniter uniter.go:323 unit "myapp-wsgi/3" shutting down: catacomb 0xc000ee4d80 is dying
```

Feels like there should be some corresponding logs on the controller side from around this same time period which might give us a clue as to why it is stopping the watcher and closing the connection.

Heather Lanigan (hmlanigan) wrote (last edit ):

From the unit and machine logs, it looks like the connectivity issues started on 11/14 and continued when the model was upgraded from 2.8.7 to 2.9.18 on 11/15. It's unknown when the controller was upgraded.

2021-11-14 23:47:09 ERROR juju.worker.dependency engine.go:671 "leadership-tracker" manifold worker returned unexpected error: error while sso-wsgi/3 waiting for sso-wsgi leadership release: error blocking on leadership release: lease manager stopped
2021-11-14 23:47:09 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: tls: use of closed connection
2021-11-14 23:47:09 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2021-11-14 23:47:09 ERROR juju.worker.uniter agent.go:31 resolver loop error: committing operation "accept leadership" for sso-wsgi/3: writing state: connection is shut down
2021-11-14 23:47:09 ERROR juju.worker.uniter agent.go:34 updating agent status: connection is shut down
2021-11-14 23:47:09 INFO juju.worker.uniter uniter.go:286 unit "sso-wsgi/3" shutting down: committing operation "accept leadership" for sso-wsgi/3: writing state: connection is shut down
2021-11-14 23:47:09 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-sso-wsgi-3/622713258/s: use of closed network connection
2021-11-14 23:47:09 INFO juju.worker.logger logger.go:136 logger worker stopped
2021-11-14 23:47:27 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [f2483f] "unit-sso-wsgi-3" cannot open api: unable to connect to API: read tcp x.x.x.x:36728->x.x.x.x:17070: read: connection reset by peer

According to the agent.conf, the current version of Juju in the model is 2.9.22. The machine was upgraded to 2.9.21 on 12/8/2022.

The unit restarted a few times a day from 11/14 to 11/18, starting after:
2021-11-16 08:50:51 ERROR juju.worker.dependency engine.go:676 "leadership-tracker" manifold worker returned unexpected error: error while sso-wsgi/3 waiting for sso-wsgi leadership release: error blocking on leadership release: lease manager stopped

No further unit logs appear in the provided data after:
2021-11-18 23:40:54 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

The machine agent has been trying to restart the unit ever since; however, the current logs have no mention of what failed.

@laurent, can you please provide /var/log/juju/logsink.log from machine-1 if it exists, as well as controller logs from 11/10 to 11/19?

Changed in juju:
status: New → Triaged
importance: Undecided → High
Laurent Sesquès (sajoupa) wrote :

On the 2 units where I could see this issue, there was no logsink.log.
Controller logs have been shared internally within Canonical.
FWIW, the controller was upgraded from Juju 2.8.7 to 2.9.18 on 2021-11-14.

Heather Lanigan (hmlanigan) wrote (last edit ):

This error:
2021-11-18 23:40:54 ERROR juju.worker.apicaller connect.go:204 Failed to connect to controller: invalid entity name or password (unauthorized access)

This triggered a worker.ErrTerminateAgent, which didn't do what it's supposed to do because the agent wasn't running. Stop and Remove agent are done as one...

I can reproduce with a wrench in the apiserver authorization/login code.

How to get out of this state? What is the root cause?

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Heather Lanigan (hmlanigan) wrote (last edit ):

To get the lost agent running again after seeing the unexpected or unintended error message "Failed to connect to controller: invalid entity name or password (unauthorized access)", do the following (a shell sketch of these steps is included below):

1. juju ssh <lost-unit>
2. Edit the /var/lib/juju/agents/machine-#/agent.conf file and remove the line starting with:
" stopped: <unit>"
3. Then bounce jujud on the machine.
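
A minimal shell sketch of those steps, assuming a systemd-managed 2.9-era machine agent; the machine number N and <unit> are placeholders, and the agent.conf key may appear as "stopped:" or "stopped-units:" depending on the version (a later report in this bug shows the latter):

```
# 1) From the Juju client: get a shell on the machine hosting the lost unit
juju ssh <lost-unit>

# 2) On the machine: back up agent.conf, then delete the line recording the unit
#    as stopped, e.g. "stopped: <unit>" (or "stopped-units: <unit>" on some versions)
sudo cp /var/lib/juju/agents/machine-N/agent.conf /var/lib/juju/agents/machine-N/agent.conf.bak
sudo vim /var/lib/juju/agents/machine-N/agent.conf

# 3) Bounce the machine agent; the systemd service name follows the machine ID,
#    e.g. jujud-machine-2 or jujud-machine-29-lxd-8
sudo systemctl restart jujud-machine-N.service
```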

Verified in two places.

Update: in 2.9.25 additional logging has been added for this failure case. If you try the workaround, please check for error logs on the controller containing "invalid password" and update this bug with the date (one way to search for this is sketched below).
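
One way to run that check, as a sketch assuming the usual /var/log/juju locations on each controller machine (file names can vary per deployment):

```
# On each controller machine: search the controller logs for the new error message
sudo grep -in "invalid password" /var/log/juju/machine-*.log /var/log/juju/logsink.log 2>/dev/null
# Note the timestamps of any hits and add them to this bug
```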

Changed in juju:
status: Triaged → In Progress
Heather Lanigan (hmlanigan) wrote (last edit ):

Juju 2.9.24 will have an updated juju_start_unit to avoid having to do the manual steps in #5.

$ juju ssh <unit>
$ juju_start_unit <unit>

https://github.com/juju/juju/pull/13619

Changed in juju:
status: In Progress → Triaged
assignee: Heather Lanigan (hmlanigan) → nobody
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I tried the workaround described in comment #5 but without success.

logs => https://pastebin.ubuntu.com/p/D2n4NmNmGf/

The controller is on 2.9.18 and has not been upgraded.

Not sure if it's the same bug, but the logs are similar to the issue described here.

Heather Lanigan (hmlanigan) wrote :

@Pedro - do you have the log from the machine that etcd/9 was running on, covering the same time frame as the pastebin you provided, including the period when you ran the workaround?

Guillermo Gonzalez (verterok) wrote (last edit ):

We are hitting a similar issue on the ps5 prodstack beta controller (2.9.27).
The unit is still in this state; we have not applied the workaround.
Agent logs: https://pastebin.canonical.com/p/JrBv2VQQMr/

We can ask IS to get you the controller logs if needed; they checked and couldn't find any mention of `invalid password` there.

summary: - Agent lost
+ Agent lost: invalid entity name or password (unauthorized access)
Heather Lanigan (hmlanigan) wrote :

@verterok, we really need the Juju controller logs, checking for the ERRORs noted in #5, please.

Heather Lanigan (hmlanigan) wrote :

No related errors were found in the controller logs. However, the machine-0 logs were extremely short, so perhaps incomplete.

Przemyslaw Lal (przemeklal) wrote (last edit ):

I ran into this bug on Juju 2.9.32 today and tried the workarounds from #5 and #6 with no success.
Also, after restarting one of the Juju controllers, another unit of the same application was affected by the same bug.

mysql-innodb-cluster/0 unknown lost 27/lxd/5 10.193.0.56 agent lost, see 'juju show-status-log mysql-innodb-cluster/0'
mysql-innodb-cluster/2 unknown lost 29/lxd/8 10.193.0.130 agent lost, see 'juju show-status-log mysql-innodb-cluster/2'

ubuntu@juju-7df8cc-29-lxd-8:KDC1:~$ juju_unit_status
agent: machine-29-lxd-8
units:
  filebeat/435: running
  landscape-client/438: running
  mysql-innodb-cluster/2: stopped
  nrpe-container/293: running
  telegraf/438: running

I tried removing the "stopped-units: mysql-innodb-cluster/2" line from /var/lib/juju/agents/machine-29-lxd-8/agent.conf, but after jujud is restarted the line re-appears.

Logging into the machine and running "juju_start_unit mysql-innodb-cluster/2" doesn't help either.

Unit agent logs: https://pastebin.canonical.com/p/Q9xcfVvjtD/
I couldn't find any messages containing "unauthorized" or anything related to this unit in logsink.log on any of the controllers, but I'll be happy to share controller logs privately.

Przemyslaw Lal (przemeklal) wrote :

I managed to resolve this by resetting passwords on affected units using the steps described in [0].

[0] https://discourse.charmhub.io/t/help-recover-a-broken-juju-controller-upgrade/3455/12

Heather Lanigan (hmlanigan) wrote :
Andre Ruiz (andre-ruiz) wrote :

I was hit hard by this bug over the last couple of days. I needed to reboot the three controller machines, which host a complete OpenStack control plane (about 70 LXD containers across the three of them), in a controlled way (one at a time, check, repeat). The three of them came back with 4 or 5 agents lost, with password problems.

After fixing all of them (lots of work), I upgraded Juju from 2.9.38 to 2.9.42. While doing that, all agents appeared as lost (I think this part is expected), but when they came back, a few more agents were permanently lost with password problems again.

The only way to fix them was to 1) delete the stopped-units entry from the machine agent's agent.conf, 2) change the password to a new one in the unit-specific agent.conf, 3) update the password in the database to match the new one in the agent.conf, and 4) restart the Juju agent.

Steps 2 and 3 were done with the help of the generate-password tool in the juju/scripts source, as pointed out in comment #13 above. Everything worked well in the end, just a lot of work.
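
A rough sketch of the shape of that procedure; the paths, key names, and the machine/unit numbers below are placeholders and assumptions, and the generate-password tool plus the Discourse post linked by Przemyslaw above remain the authoritative references for the password and database steps:

```
# 1) On the machine hosting the lost unit: delete the stopped-units entry
sudo vim /var/lib/juju/agents/machine-N/agent.conf        # remove the "stopped-units: <unit>" line

# 2) Put a new password into the unit's own agent.conf
#    (generated with the generate-password tool from the Juju source tree)
sudo vim /var/lib/juju/agents/unit-<app>-<n>/agent.conf   # update the agent's API password value

# 3) Update the stored password on the controller side so it matches the new value
#    in agent.conf, as the generate-password tool / linked Discourse post describe

# 4) Restart the machine agent so it picks up the changes
sudo systemctl restart jujud-machine-N.service
```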

Erik Lönroth (erik-lonroth) wrote (last edit ):

We hit the same kind of issue. I've written about it here: https://discourse.charmhub.io/t/failed-to-connect-to-controller-invalid-entity-name-or-password-unauthorized-access/8892/2

This is scary since it's a very tedious situation.

What we did prior to this happening on a single LXD host was to reboot the LXD hosts. Most containers/units came back OK, but some lost their access to the controller somehow.

Juju 2.9.38

Haw Loeung (hloeung) wrote :

I've had much success with the procedure above. I've had to do it for units as well as machines.

For a unit:

| ./juju-generate-password 5e6635fe-ec57-45d8-87d6-c713064112d9 launchpad-mailman/0

For machines:

| ./juju-generate-password 4d00e10a-0c0c-430c-8fee-7ece72545b99 2
| ./juju-generate-password 4d00e10a-0c0c-430c-8fee-7ece72545b99 3

The version I'm using doesn't allow me to specify machine-X, just the machine ID.

tags: added: canonical-is canonical-is-upgrades
Giuseppe Petralia (peppepetra) wrote :

I had the same issue with Juju 3.1.2 in an environment suffering disk pressure, after a major restart of many agents.

The procedure with the generate-password utility worked great, thanks.
