leaking agent Pingers when HA enabled

Bug #1731745 reported by John A Meinel
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Tim Penhey

Bug Description

I was investigating some of our issues with Transaction Queues and Leadership, and when you introduce a bad txn it causes all of the leadership workers to bounce.
During this time, something happens that causes the presence table to go crazy.

Specifically, I did:
$ juju enable-ha -n3

and then in mongo shell:
db.leases.update({"_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:application-leadership#ubuntu-lite#"}, {$push: {"txn-queue": "5a03fbfc1741fc14deadbeef_deadbeef"}})

(Pick any valid lease _id)

After which, I let the system sit there for a while (couple of hours?) with something like 100 applications, but only 1 of them broken.

After running this way for a while I did the opposite:
db.leases.update({"_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:application-leadership#ubuntu-lite#"}, {$pull: {"txn-queue": "5a03fbfc1741fc144411b4cc_562c305b"}}

And I was observing whether the system could recover once the bad transaction was removed.
It was, but I observed in mongotop an inordinate amount of time being spent in presence.presence.beings.

So I ran:
juju:PRIMARY> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}])
{ "_id" : "m#1", "count" : 2813 }
{ "_id" : "m#2", "count" : 2801 }
{ "_id" : "m#0", "count" : 2768 }
{ "_id" : "u#ul084/0", "count" : 1 }
{ "_id" : "u#ul047/0", "count" : 1 }
{ "_id" : "u#ul036/0", "count" : 1 }
{ "_id" : "u#ul073/0", "count" : 1 }
{ "_id" : "u#ul017/0", "count" : 1 }
{ "_id" : "u#ul086/0", "count" : 1 }
{ "_id" : "u#ul000/0", "count" : 1 }
...

It seems that, for whatever reason, all of the controllers are writing lots and lots of Presence data for each of the machine agents.

Presumably something like a Pinger that isn't getting killed when the Leadership worker is restarted.
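
For reference, the failing pattern I have in mind looks roughly like the sketch below (illustrative only, not actual Juju code): every restart calls SetAgentPresence again without stopping the previous Pinger, so beings accumulate while the pings documents stay small.

// Illustrative sketch only (not actual Juju code): the kind of pattern that
// would produce exactly this symptom. Every worker restart starts a fresh
// Pinger (a new "being" sequence) without stopping the previous one.
func onWorkerRestart(m *state.Machine) {
    pinger, err := m.SetAgentPresence() // allocates a new presence sequence
    if err != nil {
        return
    }
    _ = pinger // bug pattern: nothing ever calls pinger.Stop() before the next restart
}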

Note that the pings themselves are getting reaped properly:
juju:PRIMARY> db.presence.pings.count()
4

It seems that all of these Pingers are still actively pinging, which also causes "juju status" to thrash: it sees all of these alive pingers, but only one is valid for a given agent. That means the presence logic reads in all of the extra identities and then throws them away because their state is invalid, and the *next* status has to read them all in again.
juju:PRIMARY> db.presence.pings.find().limit(1).pretty()
{
        "_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:1510484520",
        "slot" : NumberLong(1510484520),
        "alive" : {
                "48" : NumberLong("9223372036854775807"),
                "58" : NumberLong("9223372036854775807"),
                "3e" : NumberLong("9223372036854775807"),
                "8" : NumberLong("9223372036854775807"),
                "49" : NumberLong("9223372036854775807"),
                "27" : NumberLong("9223372036854775807"),
                "2a" : NumberLong("9223372036854775807"),
                "6b" : NumberLong("9223372036854775807"),
                "75" : NumberLong("9223372036854775807"),
                "63" : NumberLong("9223372036854775807"),
                "5e" : NumberLong("9223372036854775807"),
                "26" : NumberLong("9223372036854775807"),
                "89" : NumberLong("9223372036854775807"),
                "38" : NumberLong("9223372036854775807"),
                "37" : NumberLong("9223372036854775807"),
                "3a" : NumberLong("9223372036854775807"),
                "20" : NumberLong("9223372036854775807"),
                "8d" : NumberLong(8191),
                "66" : NumberLong("9223372036854775807"),
                "5a" : NumberLong("9223372036854775807"),
                "7e" : NumberLong("9223372036854775807"),
                "7b" : NumberLong("9223372036854775807"),
                "54" : NumberLong("9223372036854775807"),
                "67" : NumberLong("9223372036854775807"),
                "45" : NumberLong("9223372036854775807"),
                "34" : NumberLong("9223372036854775807"),
                "2d" : NumberLong("9223372036854775807"),
                "79" : NumberLong("9223372036854775807"),
                "51" : NumberLong("9223372036854775807"),
                "83" : NumberLong("9223372036854775807"),
                "68" : NumberLong("9223372036854775807"),
                "19" : NumberLong("9223372036854775807"),
                "2f" : NumberLong("9223372036854775807"),
                "65" : NumberLong("9223372036854775807"),
                "74" : NumberLong("9223372036854775807"),
                "87" : NumberLong("9223372036854775807"),
                "33" : NumberLong("9223372036854775807"),
                "80" : NumberLong("9223372036854775807"),
                "8a" : NumberLong("9223372036854775807"),
                "76" : NumberLong("9223372036854775807"),
                "41" : NumberLong("9223372036854775807"),
                "1d" : NumberLong("9223372036854775807"),
                "5c" : NumberLong("9223372036854775807"),
                "c" : NumberLong("9223372036854775807"),
                "25" : NumberLong("9223372036854775807"),
                "71" : NumberLong("9223372036854775807"),
                "b" : NumberLong("9223372036854775807"),
                "7c" : NumberLong("9223372036854775807"),
                "60" : NumberLong("9223372036854775807"),
                "55" : NumberLong("9223372036854775807"),
                "3c" : NumberLong("9223372036854775807"),
                "1a" : NumberLong("9223372036854775807"),
                "64" : NumberLong("9223372036854775807"),
                "61" : NumberLong("9223372036854775807"),
                "85" : NumberLong("9223372036854775807"),
                "7d" : NumberLong("9223372036854775807"),
                "31" : NumberLong("9223372036854775807"),
                "13" : NumberLong("9223372036854775807"),
                "32" : NumberLong("9223372036854775807"),
                "7f" : NumberLong("9223372036854775807"),
                "d" : NumberLong("9223372036854775807"),
                "f" : NumberLong("9223372036854775807"),
                "6a" : NumberLong("9223372036854775807"),
                "23" : NumberLong("9223372036854775807"),
                "1c" : NumberLong("9223372036854775807"),
                "8b" : NumberLong("9223372036854775807"),
                "47" : NumberLong("9223372036854775807"),
                "86" : NumberLong("9223372036854775807"),
                "6e" : NumberLong("9223372036854775807"),
                "72" : NumberLong("9223372036854775807"),
                "3b" : NumberLong("9223372036854775807"),
                "17" : NumberLong("9223372036854775807"),
                "52" : NumberLong("9223372036854775807"),
                "46" : NumberLong("9223372036854775807"),
                "44" : NumberLong("9223372036854775807"),
                "88" : NumberLong("9223372036854775807"),
                "57" : NumberLong("9223372036854775807"),
                "59" : NumberLong("9223372036854775807"),
                "8c" : NumberLong("9223372036854775807"),
                "78" : NumberLong("9223372036854775807"),
                "5" : NumberLong("9223372036854775807"),
                "4e" : NumberLong("9223372036854775807"),
                "73" : NumberLong("9223372036854775807"),
                "4b" : NumberLong("9223372036854775807"),
                "15" : NumberLong("9223372036854775807"),
                "5b" : NumberLong("9223372036854775807"),
                "16" : NumberLong("9223372036854775807"),
                "30" : NumberLong("9223372036854775807"),
                "6c" : NumberLong("9223372036854775807"),
                "2e" : NumberLong("9223372036854775807"),
                "5d" : NumberLong("9223372036854775807"),
                "77" : NumberLong("9223372036854775807"),
                "4a" : NumberLong("9223372036854775807"),
                "82" : NumberLong("9223372036854775807"),
                "70" : NumberLong("9223372036854775807"),
                "10" : NumberLong("9223372036854775807"),
                "3f" : NumberLong("9223372036854775807"),
                "6d" : NumberLong("9223372036854775807"),
                "40" : NumberLong("9223372036854775807"),
                "1f" : NumberLong("9223372036854775807"),
                "24" : NumberLong("9223372036854775807"),
                "29" : NumberLong("9223372036854775807"),
                "12" : NumberLong("9223372036854775807"),
                "7" : NumberLong("9223372036854775807"),
                "e" : NumberLong("9223372036854775807"),
                "81" : NumberLong("9223372036854775807"),
                "18" : NumberLong("9223372036854775807"),
                "62" : NumberLong("9223372036854775807"),
                "3d" : NumberLong("9223372036854775807"),
                "84" : NumberLong("9223372036854775807"),
                "4d" : NumberLong("9223372036854775807"),
                "36" : NumberLong("9223372036854775807"),
                "6" : NumberLong("9223372036854775807"),
                "11" : NumberLong("9223372036854775807"),
                "1b" : NumberLong("9223372036854775807"),
                "7a" : NumberLong("9223372036854775807"),
                "5f" : NumberLong("9223372036854775807"),
                "22" : NumberLong("9223372036854775807"),
                "4f" : NumberLong("9223372036854775807"),
                "50" : NumberLong("9223372036854775807"),
                "69" : NumberLong("9223372036854775807"),
                "53" : NumberLong("9223372036854775807"),
                "2c" : NumberLong("9223372036854775807"),
                "42" : NumberLong("9223372036854775807"),
                "39" : NumberLong("9223372036854775807"),
                "14" : NumberLong("9223372036854775807"),
                "a" : NumberLong("9223372036854775807"),
                "6f" : NumberLong("9223372036854775807"),
                "43" : NumberLong("9223372036854775807"),
                "9" : NumberLong("9223372036854775807"),
                "28" : NumberLong("9223372036854775807"),
                "56" : NumberLong("9223372036854775807"),
                "1e" : NumberLong("9223372036854775807"),
                "21" : NumberLong("9223372036854775807"),
                "2b" : NumberLong("9223372036854775807"),
                "4c" : NumberLong("9223372036854775807"),
                "35" : NumberLong("9223372036854775807"),
                "3" : NumberLong("3481790096963796992"),
                "4" : NumberLong("9223372036854775797"),
                "0" : NumberLong(256)
        }
}
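
For anyone else staring at these documents: as far as I understand the presence encoding (state/presence), each key in "alive" is the hex of sequence/63 and the value is a bitmask with bit sequence%63 set, so a value of 9223372036854775807 (2^63 - 1) means 63 distinct pinger sequences are claiming to be alive in that bucket. A rough decoding sketch, under that assumption:

// Sketch: decode a presence.pings "alive" map into the pinger sequences it
// claims are alive. Assumes the key is hex(seq/63) and the bit is seq%63.
package main

import (
    "fmt"
    "strconv"
)

func decodeAlive(alive map[string]int64) ([]int64, error) {
    var seqs []int64
    for key, mask := range alive {
        bucket, err := strconv.ParseInt(key, 16, 64)
        if err != nil {
            return nil, err
        }
        for bit := int64(0); bit < 63; bit++ {
            if mask&(int64(1)<<uint(bit)) != 0 {
                seqs = append(seqs, bucket*63+bit)
            }
        }
    }
    return seqs, nil
}

func main() {
    // Two buckets taken from the document above: one full, one nearly empty.
    alive := map[string]int64{
        "48": 9223372036854775807, // all 63 bits set
        "0":  256,                 // a single sequence
    }
    seqs, _ := decodeAlive(alive)
    fmt.Printf("%d sequences claim to be alive\n", len(seqs)) // prints 64
}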

Note that statepool report doesn't seem to show anything going on:
# juju-statepool-report
Querying @jujud-machine-0 introspection socket: /statepool/
State Pool Report:

Model count: 1 models
Marked for removal: 0 models

Model: 4051bd02-cbb5-41eb-873c-758286f35f0c
  Marked for removal: false
  Reference count: 3
    [1]
goroutine 7181821 [running]:
runtime/debug.Stack(0x2df9680, 0xc4237b8360, 0xc4203eff4d)
        /snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc4203eff4d, 0x24, 0x0, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc4207531b8, 0xc4203eff4d, 0x24, 0x4cc4560, 0xc4275e97a0, 0xc42760db20, 0x12, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc4207531b8)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00, 0xc42a999ab0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
        /snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc42ab8e1e0, 0x4cc3c20, 0xc423c2da00)
        /snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
        /snap/go/1016/src/net/http/server.go:2720 +0x288

    [2]
goroutine 7182021 [running]:
runtime/debug.Stack(0x2df9680, 0xc42343bb60, 0xc42230ea8d)
        /snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc42230ea8d, 0x24, 0x0, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc420753f60, 0xc42230ea8d, 0x24, 0x4cc4560, 0xc425912520, 0xc4217fe100, 0x12, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc420753f60)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00, 0xc42bf11ab0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
        /snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc42ab8e280, 0x4cc3c20, 0xc427230d00)
        /snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
        /snap/go/1016/src/net/http/server.go:2720 +0x288

    [3]
goroutine 745 [running]:
runtime/debug.Stack(0x2df9680, 0xc420df5080, 0xc420a8294d)
        /snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc420a8294d, 0x24, 0x0, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc42039b3c0, 0xc420a8294d, 0x24, 0x4cc4560, 0xc420b0e160, 0xc4204ee360, 0xf, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc42039b3c0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc42013ee00, 0xc420984f00, 0xc420c55ab0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
        /snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc420da0140, 0x4cc3c20, 0xc420d10840)
        /snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
        /snap/go/1016/src/net/http/server.go:2720 +0x288

Revision history for this message
John A Meinel (jameinel) wrote : Re: bad TXN restarting Leadership worker leaks Pingers

There may be a different issue going on, because I haven't been changing anything, just let it run, and now we're up to:
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}])
{ "_id" : "m#1", "count" : 2952 }
{ "_id" : "m#2", "count" : 2941 }
{ "_id" : "m#0", "count" : 2904 }

{ "_id" : "m#1", "count" : 2966 }
{ "_id" : "m#2", "count" : 2955 }
{ "_id" : "m#0", "count" : 2916 }

2017-11-12T11:10:29.261+0000
db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
i{ "_id" : "m#1", "count" : 2977 }
{ "_id" : "m#2", "count" : 2967 }
{ "_id" : "m#0", "count" : 2928 }

summary: - killing leadership worker leaves *way* too many machine identities in
- presence.beings
+ bad TXN restarting Leadership worker leaks Pingers
Revision history for this message
John A Meinel (jameinel) wrote :

So something is actively leaking Pingers.

Revision history for this message
John A Meinel (jameinel) wrote :

2017-11-12T11:29:49.197
db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#1", "count" : 3195 }
{ "_id" : "m#2", "count" : 3184 }
{ "_id" : "m#0", "count" : 3153 }

So after approx 20 min, we have another 200 pingers. (1 new pinger every 6s or so.)

Revision history for this message
John A Meinel (jameinel) wrote :

After restarting all of the controller agents, the number did eventually drop:
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#0", "count" : 81 }
{ "_id" : "m#2", "count" : 78 }
{ "_id" : "m#1", "count" : 77 }

However, 80 is still *way* too high.

And what's worse is that we still seem to be leaking them:
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#0", "count" : 105 }
{ "_id" : "m#2", "count" : 102 }
{ "_id" : "m#1", "count" : 101 }

Now, I am running a "watch juju status" in one window, and in another to stress the system I did:
while true; do time juju status >/dev/null; done

However, since this is happening on its own and doesn't have anything to do with Leadership TXNs, I'm flagging this as Critical for 2.3.

summary: - bad TXN restarting Leadership worker leaks Pingers
+ leaking agent Pingers
Revision history for this message
John A Meinel (jameinel) wrote : Re: leaking agent Pingers

Whatever is causing us to naturally leak agent Pingers is going to cause problems in production, as it causes the presence.beings and presence.pings collections to get bloated, slowing down 'juju status'.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: High → Critical
milestone: none → 2.3-rc1
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

I can reproduce this on develop tip just by using enable-ha:

$ juju bootstrap lxd --debug
$ juju switch controller
$ juju deploy -m controller cs:~jameinel/ubuntu-lite --to 0 -B
$ juju enable-ha -n3

$ juju ssh -m controller 0
$$ sudo su -
$$ dialmgo() {
    agent=$(cd /var/lib/juju/agents; echo machine-*)
    pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
    /usr/lib/juju/mongo3.2/bin/mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju
}
$$ dialmgo
> use presence
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#2", "count" : 12 }
{ "_id" : "m#1", "count" : 11 }
{ "_id" : "m#0", "count" : 4 }
{ "_id" : "u#ubuntu-lite/0", "count" : 3 }

> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#2", "count" : 24 }
{ "_id" : "m#1", "count" : 23 }
{ "_id" : "m#0", "count" : 16 }
{ "_id" : "u#ubuntu-lite/0", "count" : 3 }

I also don't understand why we would have 3 presence entries for the ubuntu-lite/0 unit agent, but we *definitely* shouldn't be leaking active pingers for the machine agents.

summary: - leaking agent Pingers
+ leaking agent Pingers when HA enabled
Revision history for this message
John A Meinel (jameinel) wrote :

Note that I watched the presence table before I did enable-ha and I only saw 3 m#0 pingers, and the number didn't appear to be growing.

Revision history for this message
John A Meinel (jameinel) wrote :

Adding some logging, I get:
machine-2: 17:06:19 WARNING juju.state.presence NewPinger for: a0b5ea90-5e2f-4043-848b-501f8d66c51c m#1
goroutine 2421 [running]:
runtime/debug.Stack(0xc421c7bcb8, 0x413277, 0xc422256220)
        /snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state/presence.NewPinger(0xc422258750, 0xc4203c7c26, 0x24, 0xc422161c28, 0x3, 0xc422256220, 0x3)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/presence/presence.go:688 +0x74
github.com/juju/juju/state.(*Machine).SetAgentPresence(0xc420620d80, 0x0, 0x0, 0xc4214e65f0)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/machine.go:1008 +0x241
github.com/juju/juju/apiserver.presenceShim.Start(0x7f08b6464140, 0xc420620d80, 0x12, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/admin.go:684 +0x35
github.com/juju/juju/apiserver.(presenceShim).Start-fm(0xc4214ed860, 0x30d0362, 0x12, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/admin.go:704 +0x39
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).maybeStartPinger(0xc420838e10, 0x1, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/facades/agent/presence/pinger.go:153 +0x71
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).loop(0xc420838e10, 0xc4214e6768, 0xc4216a41c0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/facades/agent/presence/pinger.go:141 +0x44
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).(github.com/juju/juju/apiserver/facades/agent/presence.loop)-fm(0xc400000008, 0x31b6020)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/facades/agent/presence/pinger.go:84 +0x2a
github.com/juju/juju/worker/catacomb.runSafely(0xc4222925d0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x63
github.com/juju/juju/worker/catacomb.Invoke.func3(0xc420838e10, 0xc420838e10, 0xc4222925d0, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x81
created by github.com/juju/juju/worker/catacomb.Invoke
        /home/jameinel/dev/go/src/github.com/juju/juju/worker/catacomb/catacomb.go:113 +0x221

It seems whatever is going on in "maybeStartPinger" doesn't realize there is another one already running.

Revision history for this message
John A Meinel (jameinel) wrote :

One clear thing that stands out is:
machine-2: 17:06:02 WARNING juju.state.presence NewPinger for: a0b5ea90-5e2f-4043-848b-501f8d66c51c m#1

Why is machine-2 starting a pinger for machine-1?
What is going on in HA that causes the controllers to launch Pingers for the *other* machines and then never reap them?

Revision history for this message
John A Meinel (jameinel) wrote :

Letting that run for a while, it seems all the extra pingers come from cross-agent connections (machine-1 starting a pinger for machine-2, machine-2 starting one for machine-0, etc.).

I don't know which connections are causing this; it probably goes back to our original bug around agents connecting to each other.

Revision history for this message
John A Meinel (jameinel) wrote :

This appears to be a presence.New call from startPingerIfAgent.

Which is theoretically handled here:
	if !controllerMachineLogin {
		if err := startPingerIfAgent(a.srv.pingClock, a.root, a.root.entity); err != nil {
			return nil, errors.Trace(err)
		}
	}

But apparently controllerMachineLogin isn't what we expect it to be.

I also don't understand why we are creating these over-and-over-and-over again, and never closing them.

Looking here:
func (w *Worker) loop() error {
	var delay time.Duration
	for {
		select {
		case <-w.catacomb.Dying():
			return w.catacomb.ErrDying()
		case <-w.config.Clock.After(delay):
			maybePinger := w.maybeStartPinger()
			w.reportRunning()
			w.waitPinger(maybePinger)
		}
		delay = w.config.RetryDelay
	}
}

We seem to be calling Start() but nothing in there is passing in the Catacomb to the Pinger that we are starting. We are calling w.waitPinger(...) which ultimately has:
	go func() {
		select {
		case <-done:
		case <-w.catacomb.Dying():
			w.logger.Tracef("stopping pinger")
			pinger.Stop()
		}
	}()

So theoretically we're launching a goroutine that should notice when the worker is dying and ask the pinger to stop. However, if we hit done it does nothing, and we should only hit done if pinger.Wait() returns, meaning the pinger itself has already decided to exit.
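
For contrast, the lifetime handling I would have expected looks something like this (hypothetical sketch with made-up names, not the real worker or a proposed patch): stop the pinger on every exit path of an iteration, so a dying catacomb or a worker restart can never strand one.

// Hypothetical sketch (not the real pinger worker): stop the pinger on every
// exit path of an iteration so nothing can strand it.
func (w *Worker) runOnce() error {
	pinger, err := w.config.Start()
	if err != nil {
		return err
	}
	defer pinger.Stop() // runs whether we exit via Dying or via pinger failure

	done := make(chan struct{})
	go func() {
		pinger.Wait() // adapt Wait() into something we can select on
		close(done)
	}()

	select {
	case <-w.catacomb.Dying():
		return w.catacomb.ErrDying()
	case <-done:
		return nil
	}
}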

We have:
  Start: presenceShim{agent}.Start,

Which seems to be the point where we are initializing this thing.

If I look at more of the log I see the agents connecting to each other:
machine-0: 17:57:03 INFO juju.apiserver.connection agent login: machine-2 for a0b5ea90-5e2f-4043-848b-501f8d66c51c
machine-0: 17:57:03 WARNING juju.state.presence NewPinger for: a0b5ea90-5e2f-4043-848b-501f8d66c51c m#2
goroutine 425171 [running]:
runtime/debug.Stack(0xc423bd1cb8, 0x413277, 0xc4242d8d90)
        /snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state/presence.NewPinger(0xc424264e40, 0xc420486c96, 0x24, 0xc424363518, 0x3, 0xc4242d8d90, 0x3)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/presence/presence.go:688 +0x74
github.com/juju/juju/state.(*Machine).SetAgentPresence(0xc422179440, 0x0, 0x0, 0xc423d09df0)
        /home/jameinel/dev/go/src/github.com/juju/juju/state/machine.go:1008 +0x241
github.com/juju/juju/apiserver.presenceShim.Start(0x7f6f623501e8, 0xc422179440, 0x12, 0x0, 0x0, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/admin.go:684 +0x35
github.com/juju/juju/apiserver.(presenceShim).Start-fm(0xc42106c330, 0x30d0362, 0x12, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/admin.go:704 +0x39
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).maybeStartPinger(0xc421d65200, 0x1, 0x0)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/facades/agent/presence/pinger.go:153 +0x71
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).loop(0xc421d65200, 0xc423d09f68, 0xc42454cd80)
        /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/facades/agent/presence/pinger.go:141 +0x44
github.com/juju/juju/apiserver/facades/agent/presence.(*Worker).(github.com/juju/juju/apiserver/facades/agent/presence.loop)-fm(0xc400000008,...


Revision history for this message
John A Meinel (jameinel) wrote :

Another possibility is our "connect to the API server" logic in a worker, which causes us to try to connect to *all* controllers even though we're one of them. But since the local connection succeeds, we then throw away the extra connections (though they still leak their Catacomb).

Revision history for this message
John A Meinel (jameinel) wrote :

AFAICT this was introduced with the changes to allow Anonymous logins. It changed a fair bit of code in this area.
Things like:
	// If above login fails, we may still be a login to a controller
	// machine in the controller model.
	controllerMachineLogin, err := a.handleAuthError(req, machineAgent, err)

that function does:
	if !machineAgent {
		return false, errors.Trace(err)
	}
	if errors.Cause(err) != common.ErrBadCreds {
		return false, err
	}
	// If we are here, we may be logging into a controller machine
	// in the controller model.
	a.root.entity, err = a.checkControllerMachineCreds(req)
	if err != nil {
		return false, errors.Trace(err)
	}
	return true, nil

I'm not satisfied that this code properly rejects agents that give the wrong password. If the Kind is 'machine', then as long as it doesn't get the exact ErrBadCreds object, it will end up logged in as a Controller.

This looks *much* more like we need a whitelist of who gets to pretend to be a Controller, not a blacklist of only the things that don't get ErrBadCreds.

I think we have Controllers that are *accidentally* succeeding at logging into another controller.
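
To make the whitelist idea concrete, here is a rough sketch (hypothetical helper, not the actual admin.go code): only treat a login as a controller-machine login when the credentials positively verify against a machine that is a controller, rather than inferring it from which error came back.

// Hypothetical whitelist-style check (sketch only, not the real admin.go):
// a login counts as a controller-machine login only when the password
// actually validates against a machine that is a controller.
func isControllerMachineLogin(req params.LoginRequest, st *state.State) (bool, error) {
	tag, err := names.ParseMachineTag(req.AuthTag)
	if err != nil {
		return false, nil // not a machine login at all
	}
	machine, err := st.Machine(tag.Id())
	if err != nil {
		return false, errors.Trace(err)
	}
	if !machine.IsManager() {
		return false, nil // ordinary machine agent, never a controller
	}
	// Whitelist: the credentials must match this controller machine.
	return machine.PasswordValid(req.Credentials), nil
}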

Revision history for this message
John A Meinel (jameinel) wrote :

This is the patch I was using to see where the Pingers were getting created:
diff --git a/state/presence/presence.go b/state/presence/presence.go
index 4d02f38..3989f3b 100644
--- a/state/presence/presence.go
+++ b/state/presence/presence.go
@@ -10,6 +10,7 @@ package presence
 import (
 	"fmt"
 	"math/rand"
+	"runtime/debug"
 	"strconv"
 	"sync"
 	"time"
@@ -684,6 +685,7 @@ type PingRecorder interface {
 // NewPinger returns a new Pinger to report that key is alive.
 // It starts reporting after Start is called.
 func NewPinger(base *mgo.Collection, modelTag names.ModelTag, key string, recorderFunc func() PingRecorder) *Pinger {
+	logger.Warningf("NewPinger for: %s %s\n%s", modelTag.Id(), key, string(debug.Stack()))
 	return &Pinger{
 		base: base,
 		pings: pingsC(base),

Revision history for this message
John A Meinel (jameinel) wrote :

This bug doesn't appear to be present w/ Juju 2.2.6:
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#1", "count" : 5 }
{ "_id" : "m#2", "count" : 2 }
{ "_id" : "m#0", "count" : 2 }

That count of 5 is higher than I would expect, so maybe there are some bugs there, but it doesn't appear to be growing over time like it does with 2.3.

Revision history for this message
John A Meinel (jameinel) wrote :

In fact, it actually went back down:
> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
{ "_id" : "m#1", "count" : 2 }
{ "_id" : "m#2", "count" : 2 }
{ "_id" : "m#0", "count" : 2 }

Revision history for this message
John A Meinel (jameinel) wrote :

This is a useful query if you have many models and want to separate out the agents:
> db.presence.beings.aggregate([{$group: {_id: {$concat: [{$substr: ["$_id", 0, 4]}, "-", "$key"]}, count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 10}])

It prefixes each agent with the first 4 bytes of the model-uuid:
{ "_id" : "9ef3-u#ul2/0", "count" : 2 }
{ "_id" : "9ef3-m#0", "count" : 2 }
{ "_id" : "2144-m#2", "count" : 2 }
{ "_id" : "9ef3-m#2", "count" : 2 }
{ "_id" : "2144-m#1", "count" : 2 }
{ "_id" : "2144-m#0", "count" : 2 }
{ "_id" : "9ef3-m#1", "count" : 1 }
{ "_id" : "9ef3-u#ubuntu-lite/0", "count" : 1 }
{ "_id" : "9ef3-u#ul1/0", "count" : 1 }

Tim Penhey (thumper)
Changed in juju:
assignee: John A Meinel (jameinel) → Tim Penhey (thumper)
Revision history for this message
John A Meinel (jameinel) wrote :

Testing with Develop, I can see that we stay steady at 4 pingers per agent.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released