Some data races found using Helgrind
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| EPICS Base | Fix Released | Low | mdavidsaver | |
| 3.14 | Fix Released | Undecided | Unassigned | |
| 3.15 | Fix Released | Undecided | Unassigned | |
| 3.16 | Fix Released | Low | mdavidsaver | |
Bug Description
Hello,
I've been testing my application with Valgrind's Helgrind, a thread debugging tool, and I have six patches for various data races. Most of them appear benign, but as far as I know even the "benign" ones can still theoretically cause problems, for example through compiler optimizations. Short descriptions of the patches and the Helgrind warnings are below. The patches were generated against a modified R3.14.11 but probably apply cleanly to the latest 3.14. I have checked whether each issue also appears in the 3.15 branch; all but the dbCaCtl issue do.
Note that I've seen a few other warnings from Helgrind related to Base code, but I believe they were false positives for all practical purposes. Notable ones were data races where volatile is already used (I think that's fine when used carefully), and many warnings about inconsistent lock ordering related to dbScanLock (where deadlock is not possible due to the use of pthread_
repeaterSubscri
I've added a mutex to protect some variables in repeaterSubscri
It's not nice to keep the lock while printing text/errors, but it's the easiest solution I could find.
This patch stops the Helgrind warning.
==22259== Possible data race during write of size 1 at 0x4DDEB8C by thread #28
==22259== Locks held: none
==22259== at 0x4D5BE0: repeaterSubscri
==22259== by 0x4C5B4F: udpiiu:
==22259== by 0x4C5829: udpiiu:
==22259== by 0x4C77A8: udpRecvThread:
==22259== by 0x4E50F0: epicsThreadCall
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.
==22259== by 0x310BA079D0: start_thread (pthread_
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== This conflicts with a previous read of size 1 by thread #26
==22259== Locks held: none
==22259== at 0x4D5E15: repeaterSubscri
==22259== by 0x4F0232: timerQueue:
==22259== by 0x4EEB1A: timerQueueActiv
==22259== by 0x4E50F0: epicsThreadCall
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.
==22259== by 0x310BA079D0: start_thread (pthread_
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== Address 0x4DDEB8C is 66748 bytes inside a block of size 66920 alloc'd
==22259== at 0x4A0758C: operator new(unsigned long) (vg_replace_
==22259== by 0x4BD7D9: cac::createChan
==22259== by 0x4B9460: oldChannelNotif
==22259== by 0x4AF4AC: ca_create_channel (access.cpp:333)
==22259== - output omitted -
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.
==22259== by 0x310BA079D0: start_thread (pthread_
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
dbCaCtl.patch
I believe dbCaCtl must be volatile; I've seen similar "ctl" cases where volatile is already used.
The Helgrind warning remains, though.
==10904== Possible data race during write of size 4 at 0x756B08 by thread #1
==10904== Locks held: none
==10904== at 0x498C87: dbCaRun (dbCa.c:202)
==10904== by 0x471046: iocRun (iocInit.c:169)
==10904== by 0x471D68: iocInit (iocInit.c:87)
==10904== by 0x4DDE60: iocshBody (iocsh.cpp:765)
==10904== by 0x40E521: main
==10904==
==10904== This conflicts with a previous read of size 4 by thread #36
==10904== Locks held: none
==10904== at 0x499168: dbCaTask (dbCa.c:845)
==10904== by 0x4EA776: start_routine (osdThread.c:385)
==10904== by 0x4A0B819: mythread_wrapper (in /nix/store/
==10904== by 0x310BA079D0: start_thread (pthread_
==10904== by 0x310B6E88FC: clone (clone.S:115)
==10904== Address 0x756b08 is 0 bytes inside data symbol "dbCaCtl"
epicsSingletonM
The assert needs to be moved inside the lock.
This patch stops the Helgrind warning.
==3431== Possible data race during read of size 8 at 0x758F08 by thread #27
==3431== Locks held: 1, at address 0x4DC07D8
==3431== at 0x4D92F6: SingletonUntype
==3431== by 0x4EEFD1: ~reference (epicsSingleton
==3431== by 0x4EEFD1: epicsTimerQueue
==3431== by 0x4BFBB2: cac::cac(
==3431== by 0x4B5D2D: ca_client_
==3431== by 0x4AEF35: ca_context_create (access.cpp:206)
==3431== by 0x41370A: - output omitted -
==3431== by 0x4EA786: start_routine (osdThread.c:385)
==3431== by 0x4A0B819: mythread_wrapper (in /nix/store/
==3431== by 0x310BA079D0: start_thread (pthread_
==3431== by 0x310B6E88FC: clone (clone.S:115)
==3431==
==3431== This conflicts with a previous write of size 8 by thread #1
==3431== Locks held: 1, at address 0x4D7B918
==3431== at 0x4D93E1: SingletonUntype
==3431== by 0x4ED7F0: reference (epicsSingleton
==3431== by 0x4ED7F0: getReference (epicsSingleton
==3431== by 0x4ED7F0: epicsTimerQueue
==3431== by 0x4971DD: callbackInitOnce (callback.c:121)
==3431== by 0x4EB230: epicsThreadOnce (osdThread.c:445)
==3431== by 0x47174D: iocBuild (iocInit.c:115)
==3431== by 0x471D58: iocInit (iocInit.c:87)
==3431== by 0x4DDE70: iocshBody (iocsh.cpp:765)
==3431== by 0x40E521: main
==3431== Address 0x758f08 is 8 bytes inside data symbol "timerQueueMgrE
timerQueueActiv
There seems to be a problem with starting a thread in the constructor, where the new thread calls back into a virtual function of the object before construction is complete.
I've changed the code to finish constructing timerQueueActive before starting the thread. Note, epicsTimerQueue
This patch stops the Helgrind warning, though I'm not sure whether it is formally an issue.
==22259== Possible data race during read of size 8 at 0x4D74D18 by thread #5
==22259== Locks held: none
==22259== at 0x4E50EB: epicsThreadCall
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.
==22259== by 0x310BA079D0: start_thread (pthread_
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== This conflicts with a previous write of size 8 by thread #1
==22259== Locks held: 1, at address 0x4D74C38
==22259== at 0x4ED584: epicsTimerQueue
==22259== by 0x4EF5E8: timerQueueActiv
==22259== by 0x4ED6ED: epicsTimerQueue
==22259== by 0x440882: asynInit (asynManager.c:461)
==22259== by 0x445533: registerPort (asynManager.c:519)
==22259== by 0x4258B1: asynPortDriver:
==22259== - output omitted -
osiLocalAddr.patch
The local address determination seems to be unprotected.
To fix it, I have used the same approach with epicsThreadOnce as is used in osdInterrupt.c.
This patch stops the Helgrind warning.
==14531== Possible data race during read of size 1 at 0x758E60 by thread #29
==14531== Locks held: 2, at addresses 0x4E80D38 0x4DC5C08
==14531== at 0x4EBFAE: osiLocalAddr (osdNetIntf.c:252)
==14531== by 0x4C5EA6: caRepeaterRegis
==14531== by 0x4C5F44: udpiiu:
==14531== by 0x4D5EC9: repeaterSubscri
==14531== by 0x4F0352: timerQueue:
==14531== by 0x4EEC3A: timerQueueActiv
==14531== by 0x4E51E0: epicsThreadCall
==14531== by 0x4EA786: start_routine (osdThread.c:385)
==14531== by 0x4A0B819: mythread_wrapper (in /nix/store/
==14531== by 0x310BA079D0: start_thread (pthread_
==14531== by 0x310B6E88FC: clone (clone.S:115)
==14531==
==14531== This conflicts with a previous write of size 1 by thread #26
==14531== Locks held: 2, at addresses 0x4DC0438 0x4E03B98
==14531== at 0x4EC0C9: osiLocalAddr (osdNetIntf.c:318)
==14531== by 0x4C5EA6: caRepeaterRegis
==14531== by 0x4C5F44: udpiiu:
==14531== by 0x4D5EC9: repeaterSubscri
==14531== by 0x4F0352: timerQueue:
==14531== by 0x4EEC3A: timerQueueActiv
==14531== by 0x4E51E0: epicsThreadCall
==14531== by 0x4EA786: start_routine (osdThread.c:385)
==14531== Address 0x758e60 is 0 bytes inside data symbol "init.4944"
freeListItemsAv
I've added a mutex around reading the number of free items.
This function is used to check whether there are enough buffers when a CA client connects. I believe the check is otherwise okay, in that there is no risk of the buffers no longer being available when they are subsequently allocated, because only a single thread accepts connections.
==25096== Possible data race during read of size 8 at 0x7BBC068 by thread #45
==25096== Locks held: none
==25096== at 0x4DBF80: freeListItemsAvail (freeListLib.c:149)
==25096== by 0x4729DC: create_client (caservertask.
==25096== by 0x472B9B: create_tcp_client (caservertask.
==25096== by 0x4733A2: req_server (caservertask.
==25096== by 0x4EA786: start_routine (osdThread.c:385)
==25096== by 0x4A0B819: mythread_wrapper (in /nix/store/
==25096== by 0x310BA079D0: start_thread (pthread_
==25096== by 0x310B6E88FC: clone (clone.S:115)
==25096==
==25096== This conflicts with a previous write of size 8 by thread #58
==25096== Locks held: 1, at address 0x7BBC0C8
==25096== at 0x4DC01F: freeListFree (freeListLib.c:126)
==25096== by 0x47277E: destroy_client (caservertask.
==25096== by 0x473D89: camsgtask (camsgtask.c:164)
==25096== by 0x4EA786: start_routine (osdThread.c:385)
==25096== by 0x4A0B819: mythread_wrapper (in /nix/store/
==25096== by 0x310BA079D0: start_thread (pthread_
==25096== by 0x310B6E88FC: clone (clone.S:115)
==25096== Address 0x7bbc068 is 24 bytes inside a block of size 40 alloc'd
==25096== at 0x4A08D16: calloc (in /nix/store/
==25096== by 0x4E22DC: callocMustSucceed (cantProceed.c:27)
==25096== by 0x4DC205: freeListInitPvt (freeListLib.c:42)
==25096== by 0x472EE5: rsrv_init (caservertask.
==25096== by 0x471BB5: iocBuild (iocInit.c:151)
==25096== by 0x471D58: iocInit (iocInit.c:87)
==25096== by 0x4DDE70: iocshBody (iocsh.cpp:765)
==25096== by 0x40E521: main
==25096== Block was alloc'd by thread #1
Ah, never mind the dbCaCtl issue, it is already fixed in latest 3.14 :)