Missing optimization with predecrement in while-loop

Bug #1538629 reported by demiurg_spb
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
GNU Arm Embedded Toolchain
New
Undecided
Unassigned

Bug Description

foo.c:

void delay_cy(volatile unsigned int cy)
{
 do
 {
  ;
 } while (--cy);
}

arm-none-eabi-gcc -S -mcpu=cortex-m1 -mthumb -msoft-float -Os foo.c

.L2:
 ldr r3, [sp, #4]
 subs r3, r3, #1
 str r3, [sp, #4]
 cmp r3, #0 - this operation is dummy because of SUBS allready sets status regiser correctly
 bne .L2

 .ident "GCC: (GNU Tools for ARM Embedded Processors) 5.2.1 20151202 (release) [ARM/embedded-5-branch revision 231848]"

Revision history for this message
David Brown (davidbrown) wrote :

I think this is an effect of the counter being "volatile". The line "while (--cy)" is being handled as two operations - first "cy" is decremented, then it is compared to 0. Since cy is marked volatile, these operations are not combined.

The C standards are not very clear on exactly how such code should be interpreted - I believe it is right for the compiler to be a bit conservative here.

Revision history for this message
demiurg_spb (demiurg-spb-h) wrote :

t is absolutely pointless. As the value in the r3 register does not reload again before the comparison operation with zero.

Revision history for this message
demiurg_spb (demiurg-spb-h) wrote :

just checked on the:
.ident "GCC: (15:7-2018-q2-6) 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907]"
All the same.

Revision history for this message
demiurg_spb (demiurg-spb-h) wrote :

The volatile-qualifier forces the r3 register to reload at the very beginning of the cycle and saves the r3 register at the end of the cycle. This is absolutely right.

Revision history for this message
David Brown (davidbrown) wrote :

Since you are using a volatile variable, the code /is/ right - even with the "extra" compare. "Volatile" tells the compiler to avoid optimisations, and generate bigger and slower code. It doesn't make sense in code like this to say the compiler generated unnecessarily inefficient code - anything other than a "bx lr" instruction is inefficient for an empty loop. You don't use a loop like this for a specific timed delay - you use it merely for a "wait a little bit" delay where the exact timing doesn't matter. If the timing is important, use a timer.

A slightly better (IMHO) way to write such loops is:

void delay_cy(unsigned int cy)
{
 do
 {
  asm volatile("");
 } while (--cy);
}

This avoids stack and memory accesses, and keeps the loop shorter and simpler. It also minimises the volatile accesses, which keeps things more consistent.

Having said all that, there does appear to be an issue with code generation in loops for the M0 and M1. It is more general than just this case, but the loop here makes it clear. When compiled with "-mcpu=cortex-m1", or "m0" or "m0plus", the compiler generates an extra "cmp" instruction. When compiled with "m3", "m4" or "m7", this extra "cmp" is not there.

So while this should not be an issue for such volatile delay code, it /is/ an issue in more general code generation for the M0 and M1. And it happens on a range of loops, and still happens on relatively recent gcc 8.2 version.

Revision history for this message
demiurg_spb (demiurg-spb-h) wrote :

You are absolutely right. This problem has nothing to do with the use of a volatile-qualifier. This is a more common problem. I published a minimal test case that I was able to invent.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.