Tim / Kleber,
Thanks for your response on this, and I apologize: you are correct that commit 7514c0362ffdd9af953ae94334018e7356b31313 was not the fix for our issue. I had previously tested only the last handful of commits in 5.9.0-rc4 and didn't realize that 7514c0362ffdd9af953ae94334018e7356b31313 was a merge commit, or that the other commits that didn't include the fix had parent commits from before the fix was implemented. I tested more kernels this week and narrowed it down to commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, which appears to fix our issue:
commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
Author: Peter Xu <email address hidden>
Date: Fri Aug 21 19:49:57 2020 -0400
mm/gup: Remove enfornced COW mechanism
With the more strict (but greatly simplified) page reuse logic in
do_wp_page(), we can safely go back to the world where cow is not
enforced with writes.
This essentially reverts commit 17839856fd58 ("gup: document and work
around 'COW can break either way' issue"). There are some context
differences due to some changes later on around it:
2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)
Some lines moved back and forth with those, but this revert patch should
have striped out and covered all the enforced cow bits anyways.
Suggested-by: Linus Torvalds <email address hidden>
Signed-off-by: Peter Xu <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
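For illustration, the merge-commit confusion described above comes down to ancestry: a commit carries the fix only if a308c71 is reachable through its parents, regardless of where it shows up in the log. Below is a minimal sketch of that check on a toy graph. The parent links for 798a6b8, a308c71 and f7ce2c3 are taken from the notes in this comment; the `root` node and every link to it are hypothetical placeholders, not the real kernel history.

```python
# Toy commit graph: child -> list of parents (abbreviated hashes).
# Links for 798a6b8, a308c71 and f7ce2c3 come from the notes in this comment;
# "root" and every link to it are hypothetical placeholders.
PARENTS = {
    "798a6b8": ["a308c71"],
    "a308c71": ["1a0cf26"],
    "1a0cf26": ["root"],
    "f7ce2c3": ["763700f", "eacc9c5"],  # merge commit with two parents
    "763700f": ["root"],
    "eacc9c5": ["root"],
    "root": [],
}

FIX = "a308c71"


def contains_fix(commit, fix=FIX, parents=PARENTS):
    """Return True if `fix` is `commit` itself or one of its ancestors."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == fix:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


if __name__ == "__main__":
    for commit in ("798a6b8", "a308c71", "f7ce2c3", "1a0cf26"):
        print(commit, "FIXED" if contains_fix(commit) else "BROKEN")
```

On a real checkout, `git merge-base --is-ancestor a308c71 <commit>` answers the same question: it exits with status 0 when the first commit is an ancestor of the second.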
To verify that this is the proper fix for the issue we are running into, I built a kernel from the parent of this fix (1a0cf26323c80e2f1c58fc04f15686de61bfab0c) and verified that it exhibited the broken behavior (memory spikes within our container while running gdb, which eventually cause docker to OOM-kill the container when it hits the hard memory limit we have set). I then pulled the 1a0cf26323c80e2f1c58fc04f15686de61bfab0c code, cherry-picked a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, built and ran that kernel, and verified that I could no longer repro the issue.
I also built kernels from cfc905f158eaa099d6258031614d11869e7ef71c, 4facb95b7adaf77e2da73aafb9ba60996fe42a12 and 9e2369c06c8a181478039258a4598c1ddd2cadfa and verified that those exhibited the broken behavior. I then pulled those same commits, cherry-picked the a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a fix onto each of them, and verified that this fixed the behavior we are seeing.
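The cherry-pick verification above could be scripted roughly as follows. This is only a sketch: the function names `cherry_pick_plan` and `run_plan` are hypothetical, and the kernel build and the gdb reproduction are left out because the comment does not describe how they were run.

```python
import subprocess

FIX = "a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a"

# The four broken base commits named in the two paragraphs above.
BROKEN_BASES = [
    "1a0cf26323c80e2f1c58fc04f15686de61bfab0c",
    "cfc905f158eaa099d6258031614d11869e7ef71c",
    "4facb95b7adaf77e2da73aafb9ba60996fe42a12",
    "9e2369c06c8a181478039258a4598c1ddd2cadfa",
]


def cherry_pick_plan(base, fix=FIX):
    """Return the git commands that apply `fix` on top of `base`."""
    return [
        ["git", "checkout", "--detach", base],
        ["git", "cherry-pick", fix],
    ]


def run_plan(plan, repo_dir):
    """Run each command inside `repo_dir`, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_plan() with a kernel checkout to execute them.
    for base in BROKEN_BASES:
        for argv in cherry_pick_plan(base):
            print(" ".join(argv))
```

Keeping the plan separate from its execution makes it easy to review the exact commands before they touch a working tree.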
Here is a list of all the commits that I tested in the past few days to narrow in on this commit as the fix:
9322c47b21b9e05d7f9c037aa2c472e9f0dc7f3b - FIXED
b17164e258e3888d376a7434415013175d637377 - FIXED
1ef6ea0efe8e68d0299dad44c39dc6ad9e5d1f39 - FIXED
c183edff33fdcd639d222a8f473bf44602adc655 - BROKEN - parent commits were based off rc1 branch, prior to a308c71 fix
c70672d8d316ebd46ea447effadfe57ab7a30a50 - FIXED
09274aed9021642cb3e5e0eb0e657a13ee3eafed - FIXED
16bf121b2ddebd4421bd73098eaae1500dd40389 - FIXED
41bef91c8aa351255cd19e7e72608ee86f7f4bab - FIXED
f162626a038ec06da98ac38ce3d6bdbd715e9c5f - FIXED
d824e0809ce3c9e935f3aa37381cda7fd4184f12 - FIXED
8075fc3b113dee1531106aaec3dfa19c8158374d - FIXED
d849ca483dba7546ad176da83bf66d1c013725f6 - FIXED
2fb547911ca54bc9ffa2709c55c9a7638ac50ae4 - FIXED
e2dacf6cd13c1f8d40a59fdda41ecd139c2207df - FIXED
86edf52e7c7201fabfba39ae694a5206d48e77af - FIXED
cf85f5de83b19361c3d575fa0ea05d8194bb0d05 - FIXED
acf69c946233259ab4d64f8869d4037a198c7f06 - FIXED
b25d1dc9474e1f0cefca994885e82beea271acfe - FIXED
f7ce2c3afc938b7d743ee8e5563560c04e17d7be - BROKEN - parents 763700f + eacc9c5 were prior to fix in a308c71
798a6b87ecd72828a6c6b5469aaa2032a57e92b7 - FIXED - parent is a308c71 so has fix
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a - FIXED - parent is 1a0cf26 which is broken so looks like this is the actual code fix
1a0cf26323c80e2f1c58fc04f15686de61bfab0c - BROKEN
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 - BROKEN
cfc905f158eaa099d6258031614d11869e7ef71c - BROKEN
7b81ce7cdcef3a3ae71eb3fb863433c646b4a2f4 - BROKEN
4facb95b7adaf77e2da73aafb9ba60996fe42a12 - BROKEN
d5c678aed5eddb944b8e7ce451b107b39245962d - BROKEN
662a0221893a3d58aa72719671844264306f6e4b - BROKEN
2356bb4b8221d7dc8c7beb810418122ed90254c9 - BROKEN
29aaebbca4abc4cceb38738483051abefafb6950 - BROKEN
2822e582501b65707089b097e773e6fd70774841 - BROKEN
7cad554887f1c5fd77e57e6bf4be38370c2160cb - BROKEN
e52d58d54a321d4fe9d0ecdabe4f8774449f0d6e - BROKEN
26e495f341075c09023ba16dee9a7f37a021e745 - BROKEN
a5f785ce608cafc444cdf047d1791d5ad97943ba - BROKEN
0ffdab6f2dea9e23ec33230de24e492ff0b186d9 - BROKEN
30d24faba0532d6972df79a1bf060601994b5873 - BROKEN
2d33b7d631d9dc81c78bb71368645cf7f0e68cb1 - BROKEN
6e4e9ec65078093165463c13d4eb92b3e8d7b2e8 - BROKEN
365d2a23663711c32e778c9c18b07163f9193925 - BROKEN
9e2369c06c8a181478039258a4598c1ddd2cadfa - BROKEN
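A list in this "hash - STATUS - note" form is easy to tally mechanically. The sketch below parses a few entries copied from the list and reports the last commit marked FIXED. The helper names are hypothetical, and reading the last FIXED entry as the point where the fix landed assumes the list is ordered newest first, which is how it appears to be laid out.

```python
from collections import Counter

# Four entries copied from the list above, in "hash - STATUS - note" form.
SAMPLE = """\
798a6b87ecd72828a6c6b5469aaa2032a57e92b7 - FIXED - parent is a308c71 so has fix
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a - FIXED - parent is 1a0cf26 which is broken so looks like this is the actual code fix
1a0cf26323c80e2f1c58fc04f15686de61bfab0c - BROKEN
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 - BROKEN
"""


def parse_results(text):
    """Parse each non-empty line into a (commit, status, note) tuple."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(" - ", 2)
        if len(parts) < 2:
            raise ValueError(f"unrecognized line: {line!r}")
        note = parts[2].strip() if len(parts) > 2 else ""
        results.append((parts[0].strip(), parts[1].strip(), note))
    return results


def last_fixed(results):
    """Return the last commit marked FIXED, or None if there is none."""
    fixed = [commit for commit, status, _ in results if status == "FIXED"]
    return fixed[-1] if fixed else None


if __name__ == "__main__":
    results = parse_results(SAMPLE)
    print(Counter(status for _, status, _ in results))
    print("last FIXED entry:", last_fixed(results))
```

Taking the last FIXED entry rather than the first FIXED-to-BROKEN transition matters here, because commits whose parents predate the fix show up as BROKEN in between FIXED ones.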
Please let me know if you need any further information from me on this.
Thanks,
Paul