How many cycles does a function call take on STM32?

Short answer: there’s no single number. The cost of a function call on STM32 depends on the CPU core (Cortex-M0/M0+, M3/M4/M33, M7), the compiler and optimization level, how many registers must be saved, flash/SRAM wait-states, and whether FPU registers are involved. But you can estimate it, and measure it precisely.

Typical overhead (rule-of-thumb)

These are just the call/return + prologue/epilogue costs for a small C function (no heavy locals), assuming code runs from flash with a few wait-states, stack in SRAM, and no FPU context save:

Cortex-M0/M0+: ~12–30 cycles
(BL takes 3–4 cycles and there is no branch prediction; Thumb-1 PUSH/POP reach only r0–r7 plus LR/PC, so saving high registers costs extra MOVs)
Cortex-M3/M4/M33: ~8–20 cycles
(BL + pipeline refill + PUSH/POP of LR/r7…, 1 cycle per register + memory waits)
Cortex-M7: ~6–15 cycles
(branch prediction and dual-issue can hide much of the taken-branch cost, though a mispredicted branch costs more on the deeper pipeline)

Add ~1 cycle per register saved/restored (plus memory wait-states).
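FPU registers are a separate cost. In an ordinary call, the callee saves only the callee-saved registers S16–S31 that it actually uses (VPUSH/VPOP, roughly 1 cycle per word plus waits). The 18-word save (S0–S15, FPSCR, and one reserved word) applies to exception entry, i.e. ISRs: if lazy stacking is off, or the ISR itself uses FP, add ~18 words each way → easily +40–80 cycles total depending on bus/waits.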
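
The rule of thumb above can be turned into a rough back-of-the-envelope model. The sketch below is only an illustration: the names `call_model_t` and `estimate_call_cycles` are made up for this example, and the branch cost and wait-state figures are inputs you have to assume for your own core and memory setup, not values taken from ARM's timing tables.

```c
#include <stdint.h>

/* Hypothetical inputs for a rough call-overhead estimate. */
typedef struct {
    uint32_t branch_cycles; /* assumed cost of one taken branch (BL or return) */
    uint32_t regs_saved;    /* registers in PUSH {...}, including LR */
    uint32_t wait_states;   /* assumed extra cycles per stack access */
} call_model_t;

/* Estimate = BL + PUSH + POP-with-return.
 * A PUSH/POP of N registers is modelled as 1 + N cycles, plus the
 * assumed wait-states on each of the N memory accesses.
 * POP {..., PC} also pays a taken-branch cost. */
uint32_t estimate_call_cycles(call_model_t m)
{
    if (m.regs_saved == 0u) {
        /* Leaf function: no PUSH/POP, just BL and BX LR. */
        return 2u * m.branch_cycles;
    }
    uint32_t stack_op = 1u + m.regs_saved * (1u + m.wait_states);
    uint32_t call     = m.branch_cycles;            /* BL */
    uint32_t prologue = stack_op;                   /* PUSH {..., LR} */
    uint32_t epilogue = stack_op + m.branch_cycles; /* POP {..., PC} */
    return call + prologue + epilogue;
}
```

With an assumed 3-cycle taken branch, two saved registers, and zero-wait SRAM, the model gives 3 + 3 + 6 = 12 cycles, which sits inside the M3/M4 range quoted above. It ignores flash fetch stalls and the callee's own work, so treat it as a sanity check rather than a prediction.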

Many tiny functions get inlined at -O2/-O3 → zero call overhead. Tail-calls can also eliminate the return cost.
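
As a small illustration of inlining and tail calls, consider the sketch below. The function names are hypothetical, and whether the compiler really inlines `add_one` or turns the final call into a plain branch depends on the compiler and flags, so it is worth confirming in the disassembly.

```c
#include <stdint.h>

/* Candidate for inlining: at -O2 a call to this usually disappears. */
static inline uint32_t add_one(uint32_t x) { return x + 1u; }

/* Forced out-of-line with the GCC/Clang attribute, so a real call exists. */
__attribute__((noinline)) uint32_t scale(uint32_t x) { return x * 3u; }

/* The call to scale() is the last thing this function does, so an
 * optimizing compiler may emit a plain branch (B scale) instead of
 * BL followed by a return: scale() then returns straight to our caller. */
uint32_t scale_next(uint32_t x)
{
    return scale(add_one(x));
}
```

If the tail call is applied, `scale_next` needs no PUSH/POP of LR at all, which is where the saving comes from.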

Why it varies (components)

BL/BLX (branch-with-link): incurs a taken-branch penalty (pipeline refill).
Prologue/Epilogue: PUSH {…, LR}, POP {…, PC} (or BX LR); cost ≈ 1 + N cycles for N registers, plus bus waits.
Arguments/ABI: AAPCS passes first 4 args in r0–r3; spills or many locals force extra stack traffic.
Memory system: Flash wait-states, I-cache/D-cache (M7), and stack SRAM speed all change cycle counts.
FPU: Ordinary calls save only the callee-saved S16–S31 that the function uses; in ISRs that use FP (or if lazy FP stacking is off), saving the FP context dominates.
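
The argument-passing point can be made concrete with a short sketch. The function and type names are invented for the example. Under the AAPCS the first four word-sized arguments travel in r0–r3 and the rest are written to the stack by the caller; passing one pointer keeps the call itself to a single register, which mainly helps when the data already lives in memory.

```c
#include <stdint.h>

/* Six word-sized arguments: a..d go in r0-r3, while e and f are stored
 * on the stack by the caller and loaded back by the callee. */
uint32_t sum6(uint32_t a, uint32_t b, uint32_t c,
              uint32_t d, uint32_t e, uint32_t f)
{
    return a + b + c + d + e + f;
}

/* Alternative: bundle the values and pass one pointer (in r0). */
typedef struct { uint32_t v[6]; } sum_args_t;

uint32_t sum6_ptr(const sum_args_t *args)
{
    uint32_t total = 0u;
    for (uint32_t i = 0u; i < 6u; i++) {
        total += args->v[i];
    }
    return total;
}
```

Neither form is universally cheaper: the pointer version trades caller-side stack stores for callee-side loads, so measure both if the call is on a hot path.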

Measure on your STM32 (exact cycle count)

Use the DWT cycle counter:

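// Works on M3/M4/M7/M33 (M0/M0+ have no DWT cycle counter).
// Arduino-style sketch: setup(), loop() and Serial come from the Arduino core.
#include <stdint.h>
#define DEMCR         (*((volatile uint32_t*)0xE000EDFC)) // Debug Exception and Monitor Control
#define DWT_CTRL      (*((volatile uint32_t*)0xE0001000))
#define DWT_CYCCNT    (*((volatile uint32_t*)0xE0001004))
#define DEMCR_TRCENA  (1UL << 24) // enables the DWT unit
#define DWT_CYCCNTENA (1UL << 0)  // starts the cycle counter

__attribute__((noinline)) int tiny(int x) { return x + 1; } // keep a real call

void setup() {
  Serial.begin(115200);
  DEMCR |= DEMCR_TRCENA;
  // Note: some Cortex-M7 parts may also need the DWT lock register
  // unlocked before CYCCNT runs; check your reference manual.
  DWT_CYCCNT = 0;
  DWT_CTRL |= DWT_CYCCNTENA;

  // Warm up caches/predictors (M7) and branch paths
  for (volatile int i = 0; i < 1000; i++) tiny(i);
}

void loop() {
  const int N = 1000;
  volatile int sink = 0;

  // Baseline: same loop and volatile traffic, but no call
  uint32_t start = DWT_CYCCNT;
  for (int i = 0; i < N; i++) sink += i + 1;
  uint32_t base = DWT_CYCCNT - start;

  // Measured: identical loop, this time with the call
  start = DWT_CYCCNT;
  for (int i = 0; i < N; i++) sink += tiny(i);
  uint32_t cycles = DWT_CYCCNT - start; // unsigned subtraction survives one wrap

  Serial.print("Avg cycles per call (loop overhead subtracted): ");
  Serial.println((double)((int32_t)(cycles - base)) / N, 2);

  while (1) {} // report once, then halt
}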

Tips for accurate numbers:

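Mark the callee noinline, keep the result live (for example through a volatile sink) so the loop is not optimized away, and compile with your usual -O2/-O3.
Run multiple passes and average; check the disassembly to confirm the loop still contains the BL and has not been unrolled or folded in a way that removes calls.
If your part is M0/M0+, there’s no DWT cycle counter. Toggle a GPIO around the call and measure with a scope/logic analyzer, or use SysTick, a 24-bit down-counter that is cycle-accurate when clocked from the core clock.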
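
For the M0/M0+ case, a SysTick-based measurement might look like the sketch below. The register addresses are the architecturally defined SysTick locations; the helper names are hypothetical. It assumes SysTick is free to reprogram, which is not true if a HAL or RTOS tick already depends on it, and the wrap handling is only correct with the full 24-bit reload value.

```c
#include <stdint.h>

/* SysTick registers (architecturally defined addresses on Cortex-M). */
#define SYST_CSR (*((volatile uint32_t *)0xE000E010UL)) /* control/status */
#define SYST_RVR (*((volatile uint32_t *)0xE000E014UL)) /* reload value   */
#define SYST_CVR (*((volatile uint32_t *)0xE000E018UL)) /* current value  */

/* SysTick counts DOWN and is only 24 bits wide, so the elapsed count is
 * start - end, masked to 24 bits to survive a single wrap through zero.
 * Valid only when the reload value is 0x00FFFFFF. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}

/* Target-only: run SysTick freely from the core clock, no interrupt.
 * Do not call this on a host build; it writes to hardware registers. */
void systick_start_free_running(void)
{
    SYST_RVR = 0x00FFFFFFu;           /* maximum 24-bit reload */
    SYST_CVR = 0u;                    /* any write clears the counter */
    SYST_CSR = (1u << 2) | (1u << 0); /* CLKSOURCE = core clock, ENABLE */
}
```

On the target you would call `systick_start_free_running()` once, read `SYST_CVR` before and after the timed loop, and pass both readings to `systick_elapsed()`. Keep the timed region under 2^24 cycles so the counter wraps at most once.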

Practical takeaways

On M3/M4/M33, budget ~10–20 cycles per small call; on M7 often <15 cycles; on M0/M0+ ~12–30 cycles.
If you see hundreds of cycles per call, you’re hitting flash waits, cache misses, or FP context saves.
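For hot paths, prefer static inline or LTO, or keep to four or fewer word-sized arguments so they stay in r0–r3 and off the stack.
Keep FP math out of ISRs and leave lazy FP stacking enabled; with both, the FP context is reserved on the stack but never actually written.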
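
To check whether lazy FP stacking is active, you can inspect the FPCCR register on cores that have an FPU. The sketch below assumes the architecturally defined FPCCR address and its ASPEN/LSPEN bits; the helper name is hypothetical, and the decoding is split into a pure function so it can be checked off-target.

```c
#include <stdint.h>
#include <stdbool.h>

/* Floating-Point Context Control Register (cores with an FPU only). */
#define FPCCR       (*((volatile uint32_t *)0xE000EF34UL))
#define FPCCR_ASPEN (1UL << 31) /* automatic FP state preservation */
#define FPCCR_LSPEN (1UL << 30) /* lazy FP state preservation */

/* True when FP state is preserved automatically AND lazily: space is
 * reserved on exception entry, but S0-S15/FPSCR are written only if the
 * handler actually executes an FP instruction. */
bool fp_lazy_stacking_enabled(uint32_t fpccr_value)
{
    return (fpccr_value & (FPCCR_ASPEN | FPCCR_LSPEN))
           == (FPCCR_ASPEN | FPCCR_LSPEN);
}
```

On the target this would be called as `fp_lazy_stacking_enabled(FPCCR)`. If it returns false and your ISRs are slow, the FP context save is a likely suspect.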