GNU Arm Embedded Toolchain

Much poorer code generated at -O2 than -Os for accessing an array through a pointer

Bug #1646883 reported by David Brown on 2016-12-02

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	GNU Arm Embedded Toolchain	New	Undecided	Unassigned

Bug Description

These tests have all been done with flags "-mcpu=cortex-m4 -mthumb", in the context of writing to the bit-band region on a microcontroller. The example here simply writes a number of fixed values to fixed addresses, in a non-sequential order.
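For readers unfamiliar with bit-banding, the following is a minimal sketch of the standard Cortex-M peripheral bit-band address mapping, which shows what the fixed addresses in this report refer to. The helper names (bb_alias, bb_byte_addr, bb_bit) are hypothetical and are not part of the original test case; the region bases are the usual Cortex-M3/M4 values and are assumed here, not taken from the report.

```c
#include <stdint.h>

/* Assumed standard Cortex-M3/M4 peripheral bit-band layout: each bit of the
   1 MB region at 0x40000000 has its own 32-bit alias word at 0x42000000. */
#define PERIPH_BASE    0x40000000u
#define PERIPH_BB_BASE 0x42000000u

/* Alias word address for bit `bit` (0..7) of the byte at `byte_addr`.
   Hypothetical helper, for illustration only. */
static inline uint32_t bb_alias(uint32_t byte_addr, unsigned bit) {
    return PERIPH_BB_BASE + (byte_addr - PERIPH_BASE) * 32u + bit * 4u;
}

/* Inverse mapping: the peripheral byte an alias word refers to. */
static inline uint32_t bb_byte_addr(uint32_t alias) {
    return PERIPH_BASE + (alias - PERIPH_BB_BASE) / 32u;
}

/* Inverse mapping: the bit number (0..7) within that byte. */
static inline unsigned bb_bit(uint32_t alias) {
    return ((alias - PERIPH_BB_BASE) % 32u) / 4u;
}
```

Under these assumptions, each 32-bit word written through the alias region sets or clears a single bit of the underlying peripheral byte, so consecutive alias words map to consecutive bits.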

Three versions of the function are:

#include <stdint.h>

#define BB_ADDRESS 0x43fe1800

void test1(void) {
  volatile uint32_t * const p = (uint32_t *) BB_ADDRESS;

  p[3] = 1;
  p[4] = 2;
  p[1] = 3;
  p[7] = 4;
  p[0] = 6;
}

void test2(void) {
  typedef struct {
    uint32_t b[8];
  } bb_t;
  volatile bb_t * const p = (bb_t *) BB_ADDRESS;

  p->b[3] = 1;
  p->b[4] = 2;
  p->b[1] = 3;
  p->b[7] = 4;
  p->b[0] = 6;
}

void test3(void) {
  typedef struct {
    uint32_t b0, b1, b2, b3, b4, b5, b6, b7;
  } bb_t;
  volatile bb_t * const p = (bb_t *) BB_ADDRESS;

  p->b3 = 1;
  p->b4 = 2;
  p->b1 = 3;
  p->b7 = 4;
  p->b0 = 6;
}

The code generated for test2 and test3 was identical in every configuration I tried. Testing for test1 and test2 was done with -Os and -O2 (-O3 gave the same results as -O2), using gcc 4.6 (from https://gcc.godbolt.org), gcc 4.8 (from NXP's Kinetis Design Studio 2, which is from GNU ARM Embedded), and gcc 5.4 (from gcc.godbolt.org again). The different gcc versions gave the same patterns, with minor variations in the code details. The generated assembly here is from gcc 5.4.

// test1, -O2
test1:
        push {r4, r5, r6, r7}
        ldr r2, .L3
        ldr r6, .L3+4
        ldr r4, .L3+8
        ldr r1, .L3+12
        ldr r3, .L3+16
        movs r0, #1
        str r0, [r2]
        movs r7, #2
        movs r5, #3
        movs r0, #4
        movs r2, #6
        str r7, [r6]
        str r5, [r4]
        str r0, [r1]
        pop {r4, r5, r6, r7}
        str r2, [r3]
        bx lr
.L3:
        .word 1140725772
        .word 1140725776
        .word 1140725764
        .word 1140725788
        .word 1140725760

// test2, -O2
test2:
        push {r4, r5}
        ldr r3, .L7
        movs r5, #1
        movs r4, #2
        movs r0, #3
        movs r1, #4
        movs r2, #6
        str r5, [r3, #12]
        str r4, [r3, #16]
        str r0, [r3, #4]
        pop {r4, r5}
        str r1, [r3, #28]
        str r2, [r3]
        bx lr
.L7:
        .word 1140725760

// test1, -Os
test1:
        ldr r3, .L2
        movs r2, #1
        str r2, [r3]
        movs r2, #2
        str r2, [r3, #4]
        movs r2, #3
        str r2, [r3, #-8]
        movs r2, #4
        str r2, [r3, #16]
        movs r2, #6
        str r2, [r3, #-12]
        bx lr
.L2:
        .word 1140725772

// test2, -Os
test2:
        ldr r3, .L5
        movs r2, #1
        str r2, [r3, #12]
        movs r2, #2
        str r2, [r3, #16]
        movs r2, #3
        str r2, [r3, #4]
        movs r2, #4
        str r2, [r3, #28]
        movs r2, #6
        str r2, [r3]
        bx lr
.L5:
        .word 1140725760

The code for test1 and test2 with -Os is the same (barring irrelevant differences in the choice of base address). It is optimal in its use of registers and of the register+offset addressing modes. The interleaving of the stores and other operations is also good, especially for bit-band region access, where stores take several bus controller cycles (the compiler does not know this detail).
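The literal pools in the listings are printed in decimal, which makes them hard to compare with BB_ADDRESS at a glance. As a rough cross-check, the sketch below recomputes the element addresses and the offsets seen in the -Os listing for test1. The helper names elem_addr and offset_from are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

#define BB_ADDRESS 0x43fe1800u

/* Byte address of element i of the uint32_t array at BB_ADDRESS. */
static inline uint32_t elem_addr(unsigned i) {
    return BB_ADDRESS + 4u * i;
}

/* Signed byte offset of element i relative to a chosen base register value.
   Computed in 64 bits so the subtraction cannot wrap. */
static inline int32_t offset_from(uint32_t base, unsigned i) {
    return (int32_t)((int64_t)elem_addr(i) - (int64_t)base);
}
```

With the base set to elem_addr(3), which is the value 1140725772 loaded in the -Os version of test1, the five stores to p[3], p[4], p[1], p[7] and p[0] come out at offsets 0, 4, -8, 16 and -12, matching the listing.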

For -O2, the code for test1 takes a good deal more space, more registers (leading to more pushes and pops), and more time due to the additional instructions and poorer scheduling. Code for test2 is not quite as bad, but still has poorer scheduling and register usage than with -Os.

It is normal to see that -Os sometimes gives faster code than -O2. But there should not be such big differences, nor should there be such differences between the pointer version and the struct version. This is not a bug as such - all generated code is correct. But it is sub-optimal optimisation.

David Brown (davidbrown) wrote on 2018-01-23:

The behaviour is the same with gcc 7.

David Brown (davidbrown) wrote on 2019-01-28:

gcc 8 gives the same results (barring very minor differences).

Testing without the "-mcpu=cortex-m4 -mthumb" gives an interesting view, however, which may give some insight into the problem.

The code produced for test1 and test2 is similar in this case, with one major difference. The test2 code uses a single base address and accesses the rest of the data at offsets 0 to 28 (just like on the M4). For test1, however, the base used is 2047 higher, so the offsets in the code run from -2047 to -2019. This makes sense for 32-bit ARM instructions, where offsets are allowed in the range -4095 to +4095. With Thumb-2, however, the offsets must be in the range -255 to +4095 (or -255 to +255 for some instructions).
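To make the range argument concrete, here is a small sketch that encodes the two immediate-offset ranges quoted above and applies them to the offsets in question. The function names fits_arm, fits_thumb2 and biased_offset are hypothetical, and the ranges are simply the ones stated in this comment, not a complete model of the instruction encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit ARM word load/store: immediate offset in -4095..+4095. */
static inline bool fits_arm(int32_t off) {
    return off >= -4095 && off <= 4095;
}

/* Thumb-2 word load/store: immediate offset in -255..+4095. */
static inline bool fits_thumb2(int32_t off) {
    return off >= -255 && off <= 4095;
}

/* Offset of element i (4 bytes each) from a base register that points
   `bias` bytes above the start of the array. */
static inline int32_t biased_offset(unsigned i, int32_t bias) {
    return (int32_t)(4u * i) - bias;
}
```

With a bias of 2047, elements 0 to 7 land at offsets -2047 to -2019, which pass fits_arm but fail fits_thumb2. With a bias of 12, as in the -Os Thumb-2 code for test1, every offset passes both checks.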

Perhaps with -O2, the compiler is trying to use the same base as for 32-bit ARM, realising the offsets are out of range, and generating multiple bases instead?

David Brown (davidbrown) wrote on 2019-10-31:

A patch has now been added for gcc 10 improving this:

<https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02234.html>
