How many cycles does a function call take on STM32?


Short answer: there’s no single number. The cost of a function call on STM32 depends on the CPU core (M0/M0+, M3/M4/M33, M7), the compiler and optimization level, how many registers must be saved, flash/SRAM wait-states, and whether FPU registers are involved. You can still estimate it, and measure it precisely.

Typical overhead (rule-of-thumb)

These are just the call/return + prologue/epilogue costs for a small C function (no heavy locals), assuming code runs from flash with a few wait-states, stack in SRAM, and no FPU context save:

  • Cortex-M0/M0+: ~12–30 cycles
    (BL alone is 3–4 cycles; PUSH/POP reach only r0–r7 plus LR/PC, so high registers need extra moves)

  • Cortex-M3/M4/M33: ~8–20 cycles
    (BL + pipeline refill + PUSH/POP of LR/r7…, 1 cycle per register + memory waits)

  • Cortex-M7: ~6–15 cycles
    (branch prediction and dual-issue can hide much of the taken-branch cost)

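Add ~1 cycle per register saved/restored (plus memory wait-states).
FPU context stacking is an exception-entry cost rather than a cost of ordinary calls: if an ISR uses FP while the interrupted code has an active FP context (or lazy stacking is disabled), the hardware stacks 18 extra words each way (S0–S15, FPSCR, and one reserved word) → easily +40–80 cycles total depending on bus/waits. An ordinary function pays for FP registers only when it uses callee-saved S16–S31 and has to VPUSH/VPOP them.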
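
As a rough cross-check of the ranges above, the per-instruction timings can be added up by hand. The sketch below assumes the zero-wait-state figures from the Cortex-M3/M4 instruction timing tables (BL = 1 + P, PUSH = 1 + N, POP including PC = 1 + N + P, where P is the 1–3 cycle pipeline refill and N the number of registers in the list). The function name and parameters are made up for illustration, and flash wait-states come on top of the result.

```c
#include <assert.h>

/* Rough call/return overhead for Cortex-M3/M4 at zero wait-states.
 * regs   : registers in the PUSH/POP list, counting LR/PC
 *          (0 = leaf function that returns with BX LR)
 * refill : pipeline refill penalty P per taken branch, 1..3 cycles
 */
unsigned call_overhead_cycles(unsigned regs, unsigned refill)
{
    assert(refill >= 1 && refill <= 3);

    unsigned bl = 1 + refill;                 /* BL: 1 + P */
    if (regs == 0)
        return bl + (1 + refill);             /* BL + BX LR (1 + P) */

    unsigned push   = 1 + regs;               /* PUSH {..., lr}: 1 + N */
    unsigned pop_pc = 1 + regs + refill;      /* POP {..., pc}: 1 + N + P */
    return bl + push + pop_pc;
}
```

For a callee that does PUSH {r4, lr} / POP {r4, pc}, call_overhead_cycles(2, 1) gives 9 and call_overhead_cycles(2, 3) gives 13, which lands inside the ~8–20 range once a few wait-states are added.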

Many tiny functions get inlined at -O2/-O3, which leaves zero call overhead. Tail calls can also eliminate the return cost.
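
To illustrate both effects, here is a small sketch with hypothetical function names. With GCC or Clang at -O2, a static inline helper is normally folded into its caller, and a call in return position is usually emitted as a plain branch (B) instead of BL plus a separate return. Neither is guaranteed, so checking the disassembly is the only way to be sure.

```c
#include <stdint.h>

/* Small helper: typically inlined at -O2, so no BL is emitted for it. */
static inline uint32_t scale(uint32_t x) { return x * 3u; }

/* Kept out of line so there is a real callee to branch to. */
__attribute__((noinline)) uint32_t finish(uint32_t x) { return x + 1u; }

/* The call to finish() is the last action, so the compiler may emit
 * "b finish" (a tail call); finish() then returns straight to our caller. */
uint32_t process(uint32_t x)
{
    return finish(scale(x));
}
```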

Why it varies (components)

  • BL/BLX (branch-with-link): incurs a taken-branch penalty (pipeline refill).

  • Prologue/Epilogue: PUSH {LR,…}, POP {… ,PC} (or BX LR) — cost ≈ 1 cycle per register + bus waits.

  • Arguments/ABI: AAPCS passes first 4 args in r0–r3; spills or many locals force extra stack traffic.

  • Memory system: Flash wait-states, I-cache/D-cache (M7), and stack SRAM speed all change cycle counts.

  • FPU: With FP use (or if lazy FP stacking is off), saving FP regs dominates.
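
The argument-passing point can be made concrete with two hypothetical functions. Under AAPCS, the first four word-sized arguments travel in r0–r3; a fifth and sixth go on the stack, so the caller has to store them and the callee has to load them back.

```c
#include <stdint.h>

/* a..d arrive in r0..r3: no stack traffic is needed for the arguments. */
__attribute__((noinline))
int32_t sum4(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return a + b + c + d;
}

/* a..d arrive in r0..r3; e and f are passed on the stack, so each call
 * adds two stores in the caller and two loads in the callee. */
__attribute__((noinline))
int32_t sum6(int32_t a, int32_t b, int32_t c, int32_t d,
             int32_t e, int32_t f)
{
    return a + b + c + d + e + f;
}
```

On a Cortex-M target, comparing the disassembly around calls to these two functions should show the extra STR/LDR instructions for the stacked arguments.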

Measure on your STM32 (exact cycle count)

Use the DWT cycle counter:

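// Works on M3/M4/M7/M33 (M0/M0+ have no cycle counter).
#include <stdint.h>
#define DEMCR         (*((volatile uint32_t*)0xE000EDFC))
#define DWT_CTRL      (*((volatile uint32_t*)0xE0001000))
#define DWT_CYCCNT    (*((volatile uint32_t*)0xE0001004))
#define DEMCR_TRCENA  (1UL<<24) // enables the DWT unit
#define DWT_CYCCNTENA (1UL<<0)  // starts the cycle counter

__attribute__((noinline)) int tiny(int x){ return x+1; } // avoid inline

volatile int sink = 0; // keeps results live so the calls are not optimized away

void setup() {
  Serial.begin(115200);
  DEMCR |= DEMCR_TRCENA;
  // Some Cortex-M7 parts may also need the DWT lock register (0xE0001FB0)
  // written with 0xC5ACCE55 before CYCCNT counts; check your reference manual.
  DWT_CYCCNT = 0;
  DWT_CTRL |= DWT_CYCCNTENA;

  // Warm up caches/predictors (M7) and branch paths
  for (int i=0;i<1000;i++) sink += tiny(i);
}

void loop() {
  const int N = 1000;

  // Baseline: same loop without the call, to subtract loop + sink overhead
  uint32_t start = DWT_CYCCNT;
  for (int i=0;i<N;i++) sink += i+1;
  uint32_t base = DWT_CYCCNT - start;

  start = DWT_CYCCNT;
  for (int i=0;i<N;i++) sink += tiny(i); // call many times
  uint32_t cycles = DWT_CYCCNT - start;  // unsigned math survives one wraparound

  Serial.print("Avg cycles per call: ");
  Serial.println(((double)cycles - (double)base) / N, 2);

  while(1);
}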

Tips for accurate numbers:

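  • Mark the callee noinline, keep its result live (for example by accumulating into a volatile sink) so the call is not optimized away, and compile with your usual -O2/-O3. If the loop still gets transformed, GCC's optimize("O0") function attribute can pin it down.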

  • Run multiple passes and average; ensure the loop body doesn’t get vectorized/reassociated in ways that remove calls.

  • If your part is M0/M0+, the DWT has no cycle counter (CYCCNT). Then toggle a GPIO around the call and measure with a scope/logic analyzer, or use the 24-bit SysTick down-counter as a timer.
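
For the SysTick route, the main pitfall is that the counter counts down and wraps at its reload value. A minimal sketch of the elapsed-cycle arithmetic follows. The helper name is hypothetical, and it assumes SysTick is clocked from the core clock and that the measured interval is shorter than one reload period. The register addresses are the standard Arm SysTick locations.

```c
#include <stdint.h>

/* Standard SysTick registers (same addresses on Cortex-M0/M0+/M3/M4/M7). */
#define SYST_RVR (*(volatile uint32_t *)(uintptr_t)0xE000E014u) /* reload value  */
#define SYST_CVR (*(volatile uint32_t *)(uintptr_t)0xE000E018u) /* current value */

/* Cycles elapsed between two SYST_CVR readings.
 * SysTick counts DOWN from `reload` to 0 and then reloads, so one full
 * period is reload + 1 ticks. Valid only if less than one period passed. */
uint32_t systick_elapsed(uint32_t start, uint32_t end, uint32_t reload)
{
    if (start >= end)
        return start - end;                 /* no wrap in between */
    return start + (reload + 1u) - end;     /* counter wrapped once */
}
```

On the target this would be used as: read SYST_CVR into t0, make the call, read SYST_CVR into t1, then compute systick_elapsed(t0, t1, SYST_RVR). Reading the counter itself costs a few cycles, so subtracting a back-to-back measurement with nothing in between gives a cleaner figure.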

Practical takeaways

  • On M3/M4/M33, budget ~10–20 cycles per small call; on M7 often <15 cycles; on M0/M0+ ~20+ cycles.

  • If you see hundreds of cycles per call, you’re hitting flash waits, cache misses, or FP context saves.

  • For hot paths, prefer static inline, LTO, or rework to pass by register and avoid stack use.

  • Keep FP math out of ISRs or enable lazy FP stacking to avoid huge saves.

Ampheo Electronic Blog-Chip and component knowledge sharing
