edge-faas-cpp

Edge FaaS Cold-Start Mitigation via EWMA + CUSUM Prediction

Proposal

Check #1

Check #2

Before Pre — Final Results Heading into the Class Talk

#1 — What We’re Doing

#2 — Completion Status

#3 — Test Journey & Results

#4 — Next Steps

Build & Run

What Changed Since Check #1

Check #2 Checklist

How to Run

Key Findings & Known Issues

What Was Added Since Check #2

Main Result (5 modes × 4 cycles, single trial on 2026-04-20)

CoW Template — 9× Cold-Start Speedup

CUSUM Real-Data Trace — Alarms During the Ramp, Not the Peak

Workload Design — Bursty-Ramp × 4 Cycles

Warmup-Sweep Ablation — Two Failure Modes, No Winner

Known Limitations / Final Report TODOs

Key Design Principles

Architecture

Implemented (Core System — All Running)

Not Yet Done (Gap to Paper-Level Evaluation)

Phase 1 — V1.0: Trivial Workload (Misleading “Perfect” Results)

Phase 2 — Threshold Bug: Everything Classified as COLD

Phase 3 — V2.0: Properly Calibrated Experiment

Language: English 中文版

[!WARNING] Not all source files have been fully polished yet (e.g., comments, code structure, and style). Some cleanup work has been done, but much remains. All code will be brought up to professional engineering quality before the end of the semester. Apologies for the current state.🙏🙏🙏

Prof: Ramesh Govindan

Proposal Report: Proposal.pdf

This project builds a clean-slate, bare-metal FaaS (Function-as-a-Service) worker node from scratch in C++, targeting the cold-start problem in edge serverless environments.

The core research question: Can a lightweight, O(1) prediction heuristic (EWMA + CUSUM + Little’s Law) reduce cold-start penalties in cyclic workloads — while consuming far less overhead than heavyweight time-series models like ARIMA?

No Kubernetes, no Docker. OS-level process primitives + Copy-on-Write as a “proxy sandbox.” The goal is to decouple and precisely measure our prediction algorithm’s overhead without KVM/container noise masking the signal. This is a methodological adjustment made after the advisor raised the Firecracker question: we are measuring the control plane, not reinventing data-plane isolation technology.
Strict Reactor / DispatchPool separation. The epoll event loop never blocks; all UDS I/O is offloaded to a 64-thread DispatchPool. The predictor runs on the same thread as the event loop with zero contention.
Simplicity-first. EWMA + CUSUM run in O(1) time and O(1) memory. We claim this is sufficient for cyclic IoT workloads and intend to prove it against ARIMA on a Pareto (latency vs. memory) curve.

Module	File	Status
C++ Epoll Reactor (ET mode, non-blocking)	`tcp_server.hpp`	✅ Done
64-thread DispatchPool (all blocking I/O here)	`DispatchPool.hpp`	✅ Done
WorkerPool: fork + scale-down scavenger thread	`WorkerPool.hpp`	✅ Done
Predictor: EWMA + CUSUM + Little’s Law	`Predictor.hpp`	✅ Done
Dynamic T feedback (`UpdateServiceTime`)	`web_server.hpp`	✅ Done
Python Worker (simulates T=500ms AI inference)	`worker.py`	✅ Done
Two-cycle comparison test + drain phase	`load_tester.py`	✅ Done

Missing Item	Type	Priority
ARIMA baseline (Python `statsmodels`)	Code + Experiment	🔴 High — explicitly requested by advisor
Reactive baseline (predictor disabled, cold fallback only)	Code + Experiment	🔴 High — needed to show worst case
Static baseline (fixed N workers, no scaling)	Code + Experiment	🟡 Medium — needed for Pareto curve
Memory / CPU overhead measurement	Experiment	🔴 High — core data for Pareto curve
Predictor inference latency measurement	Experiment	🟡 Medium — proves EWMA/CUSUM « ARIMA
CloudLab bare-metal testing	Environment	🟡 Medium — loopback has no network noise
Cyclic bursty workload with `wrk`	Experiment	🟡 Medium — proposal uses wrk, not Python script

Log: test_20260303_013519.log

In V1.0, the Python worker processed a 1×1 pixel image — service time T ≈ 0.001s. With such a tiny T, Little’s Law gives N = ⌈60 × 0.001⌉ + 1 = 2 workers even at peak 60 RPS. The predictor never needed to scale beyond N=2, so cold starts never occurred. Worse, scale-down was not yet implemented: workers forked in Cycle 1 stayed alive forever, so Cycle 2 trivially had zero cold starts — not because of EWMA memory, but because workers were never killed.

Result: P50=1.4ms, P95=1.7ms, P99=2.1ms, 0 cold starts. Looks great, means nothing.

Root cause of the misleading result: T was too small → N was always 2 → predictor was irrelevant. No scale-down meant Cycle 1 “pre-warmed” Cycle 2 for free.

Logs: test_20260303_022336.log, test_20260303_022743.log

After adding time.sleep(0.5) to the worker (T=500ms), the warm path latency became ~502ms. The cold-start threshold in load_tester.py was still set at rtt > 500 — so every warm request (502ms) was classified as COLD. 100% COLD across all phases. The fix was to set the threshold at 700ms, the midpoint between the warm path (~502ms) and the cold fallback (~800ms).

Log: test_20260303_220742.log

With T=500ms, threshold=700ms, idle_timeout=6s, and an 8-second drain phase between cycles to force the scavenger to kill all workers:

Phase	Total	WARM	COLD	P50	P99
C1-Warmup (2 RPS, 10s)	20	18	2	503 ms	806 ms
C1-Spike (30 RPS, 5s)	150	55	95	801 ms	806 ms
C1-Cooldown (2 RPS, 6s)	12	12	0	502 ms	504 ms
[Drain: 8s zero traffic — scavenger kills all workers]
C2-Warmup (2 RPS, 5s)	10	8	2	515 ms	812 ms
C2-Spike (30 RPS, 5s)	150	64	86	801 ms	804 ms

Key comparison — spike phase only:

C1-Spike: 55 warm / 95 cold → 37% warm hit rate
C2-Spike: 64 warm / 86 cold → 43% warm hit rate
Cold-start reduction: −9.5%

What caused the improvement? After C1’s spike, EWMA climbs to ~15 RPS. During cooldown it decays to ~2.9 by C2-Warmup. Little’s Law gives ⌈2.9 × 0.5⌉ + 1 = 3 workers — one more than in C1’s warmup. That extra pre-warmed worker accounts for the +9 warm hits in C2’s spike.

Why is the improvement modest? CUSUM is reactive, not predictive: it fires at second 2 of the spike, but new workers need 0.9s to warm up. Most spike improvement comes from EWMA memory during warmup, not CUSUM during the spike itself. The real bottleneck is that 30 RPS × 0.5s = 15 concurrent workers needed, but we only pre-warm ~3. The system simply doesn’t have enough workers during a steep spike.

To reach paper-quality evaluation, three things are needed in order of priority:

Implement baselines — Reactive (no predictor), Static (fixed N), and ARIMA (Python statsmodels). Without these, we have no Pareto curve and no paper claim.
Measure overhead — CPU and memory footprint of each policy. This is the axis that differentiates us from ARIMA.
CloudLab deployment — Real bare-metal nodes with real network latency. Loopback measurements are clean but don’t represent edge conditions.

# Dependencies
pip install Pillow

# Build
make clean && make

# Run server (port 8080)
./server

# Run two-cycle comparison test
python3 load_tester.py

Log output: logs/test_<TIMESTAMP>.log

check #2 report: checkin2.pdf

Change	Description
CoW Template Process	Added `worker_template.py`: loads Pillow once at startup; all subsequent workers are CoW-forked from it. Per-worker cold start drops from ~800 ms to ~100 ms
Narrative Reframe	Dropped the circular “bypass KVM for clean measurement” argument. New framing: edge inference nodes cannot afford MicroVM overhead — OS process isolation is the appropriate data plane
Reactive Baseline	`./server reactive` — pure scale-on-demand, no prediction; establishes cold-start lower bound
Static Baseline	`./server static 15` — fixed pool of 15 workers kept alive throughout; establishes resource upper bound
ARIMA Baseline	`./server arima` — separate Python process running ARIMA(2,1,2); measures cost of heavyweight forecasting
4-Cycle Workload	`load_tester.py` upgraded to 4 cycles (C1: 8 s warmup; C2–C4: 35 s warmup) to give ARIMA sufficient history
Experiment Automation	`run_experiments.sh` runs all 4 baselines serially and archives logs to `logs/exp_<mode>_<ts>/`
Resource Monitor	`resource_monitor.py` samples RSS, CPU, and worker count at 1 Hz; outputs CSV for Pareto analysis

Completed:

☑ CoW template process (worker_template.py + refactored WorkerPool.hpp)
☑ Reactive / Static / ARIMA baselines
☑ 4-cycle bursty-ramp load tester
☑ Automated experiment runner + log archiving
☑ Resource monitor + Pareto analysis
☑ Visualizations (3 figures: cold-start bar chart, worker time-series, Pareto scatter)

Remaining (Final Deliverable):

☐ CloudLab bare-metal deployment
☐ wrk-based high-frequency load generation
☐ Adaptive CUSUM threshold (fix C2/C4 regression)
☐ Frontend Web UI
☐ Final written report

# Dependencies
pip install Pillow statsmodels

# Build
make clean && make

# Run a specific mode
./server ewma          # default: EWMA+CUSUM prediction
./server reactive      # reactive baseline
./server static 15     # fixed pool of 15 workers
./server arima         # ARIMA prediction

# Run 4-cycle load test
python3 load_tester.py

# Run all 4 baselines (recommended — auto-archives logs)
./run_experiments.sh

# Run a single baseline
./run_experiments.sh ewma

Log output: logs/exp_<mode>_<timestamp>/ (contains test.log, server.log, resource.csv)

Key Findings:

EWMA+CUSUM achieves zero cold starts in C1/C3 (odd cycles) — Reactive never does. Predictive pre-warming works for bursty-ramp traffic
C2/C4 regress to cold=36. Root cause: the first Ramp measurement window straddles the Cooldown/Ramp boundary and captures only ~9 RPS; CUSUM accumulates to 7.83 — just 0.17 below the threshold of 8 — so the alarm fires only after the Spike has begun
ARIMA requires 4+ cycles to converge (C4 cold=10); it underperforms Reactive in C1–C3, confirming the “history cold-start” cost
EWMA+CUSUM has the lowest average RSS (2,021 MB) despite peaking at 23 workers — the scavenger recycles workers every 6 s, keeping each new fork in a clean CoW state

Known Issues / Limitations:

Fixed CUSUM threshold causes every-other-cycle regression; adaptive threshold is the planned fix
All tests run on localhost loopback; CloudLab results with real network latency may differ
RSS figures are summed per-process and double-count CoW shared pages; true physical memory (PSS) is lower

This section captures the consolidated final results compiled ahead of the 2026-04-30 CSCI 599 class presentation. It systematically supplements Check #2 with three additions: CoW cold-start quantification, the Adaptive CUSUM baseline, and the Warmup-Sweep ablation. All figures live in figures/pre/ — see MANIFEST.md for provenance.

Addition	Files / Command	Description
CoW cold-start quantification	`figures/plot_cow.py` → `slide05_cow.png`	Hand-measured from `worker.py` simulated cold start + `server.log` “Worker N ready (CoW fork)” lines: `Naive (exec Python + import) ≈ 900 ms` vs `CoW (fork from warm parent) ≈ 100 ms` — 9× faster, no runtime dependency
CUSUM real-data trace	`figures/plot_rps_cusum.py` → `slide06_rps_cusum.png`	CUSUM accumulator trajectory reconstructed from sweep #3’s real `server.log` (90 predictor ticks, 11 SPIKE DETECTED events, drift=5, h=8) — confirms alarms fire entirely during the ramp phase
Adaptive CUSUM baseline	`./server ewma_adaptive`	Uses EWMSD (running σ) for z-score normalization — the alternative to fixed-drift CUSUM
Workload design figure	`figures/plot_workload.py` → `slide07_workload.png`	`load_tester.py`’s 4-cycle Bursty-Ramp parameters visualized as a timeline; the Ramp band is labeled as the CUSUM detection window
5-mode main result	`figures/plot_main_result.py` → `slide08_main_result.png`	2026-04-20 re-run of all 5 modes × 4 cycles; cold counts parsed from each `load_tester_output.txt` SPIKE COMPARISON table
Warmup-Sweep ablation	`figures/plot_sweep.py` → `slide10_sweep.png`	sweep #1 (W=5, 120 s endpoints) + sweep #3 (W=10, 20, 35, 60 s interior); fixed vs adaptive comparison

Total cold starts = number of spike-phase requests (out of 600 across 4 cycles) classified as cold (RTT > 700 ms).

Key findings:

Predict-based (Fixed CUSUM) clearly beats reactive baselines: 30% fewer cold starts than Reactive, 45% fewer than ARIMA
33 of those 47 come from a single C3 clock-aliasing event: the 2-second measurement window happened to truncate the ramp, leaving CUSUM at 18.24 instead of the expected ~8–10 (server.log at t=1776671064), delaying the first SPIKE DETECTED by 4 seconds. This is a known failure mode of fixed-drift CUSUM, not a bug. Without this event, total ≈ 14 — almost at the Static-15 floor, but without pinning 15 workers
CoW Template drops per-worker spin-up from 900 ms to 100 ms (9× speedup, no runtime dependency)
CUSUM fires during the ramp, not after the peak: a single 200 s 4-cycle run produces 11 SPIKE DETECTED alarms, all of them during the ramp climb — none after the peak

We pre-import Pillow and set up the socket once in a single template Python process. Every new worker is cloned from it via fork(). Linux’s copy-on-write makes the fork itself nearly free — Pillow’s code and import tables are read-only, so almost no pages are duplicated.

→ No image. No snapshot. No registry. Just fork() from a warm parent.

Top panel: blue is measured RPS, dashed orange is the EWMA baseline (α=0.2). The baseline lags by design — that lag is what lets CUSUM observe the gap when RPS pulls away.

Bottom panel: green is the CUSUM accumulator. It crosses the red threshold h = 8 and ★ alarms fire — all 11 alarms land during the ramp climb, none after the peak. This is the empirical proof of “catch the ramp, not the peak”.

Each cycle simulates one “train arrival” pattern: Warmup → Ramp (30 s) → Spike (30 RPS × 5 s) → Cooldown → Drain.

C1 warmup = 8 s: tests the truly cold case — no prior history
C2–C4 warmup = 35 s: long enough for the scavenger to scale workers back down, but the EWMA baseline still remembers the last spike — this tests the predictor’s cross-cycle memory
The orange Ramp band is the CUSUM detection window: every worker we want at peak time has to be forked within those 30 seconds

W (s)	Fixed CUSUM	Adaptive CUSUM
5	48	287
10	45	135
20	0	0
35	0	0
60	0	0
120	32	0

Adaptive fails at short W (τ_σ cliff): the running σ estimator cannot decay between back-to-back bursts, so the z-score never crosses threshold. Measured τ_σ ≈ 6.6 s
Fixed fails at long W (aliasing miss): threshold h=8 was tuned for the typical ramp, and the 2 s measurement window can truncate the ramp at boundary cases
Sweet spot: W = 20 ~ 60 s — both predictors work here. The main result at W=35 sits inside this band, which is why both score 0
Grid totals: fixed=131 < adaptive=423 — raw counts actually favor fixed

⚠ Framing: two failure modes, no winner. Fixed handles tight cadence; adaptive handles loose cadence. The predictor isn’t a choice — it’s a knob. Adaptive’s real contribution is scale invariance + aliasing robustness, not lower cold-start counts.

Static-15 (over-provisioned)

15 workers pinned the entire run — upper-bound reference

Adaptive CUSUM (EWMSD z-score)

W=35 happens to sit in the sweet spot

Fixed CUSUM (ours)

C3 hit a clock-aliasing event

Reactive (scale on backlog)

Reactive baseline

ARIMA (smoothed Target)

Heavyweight time-series forecasting

n = 1 per sweep point: the sweep grid is single-trial due to time budget heading into the class talk → CloudLab multi-trial (n ≥ 5) is deferred to the final report
Current Python load_tester caps near 300 RPS, not enough to model realistic edge bursts (target ≥ 2k RPS, requires wrk or a Rust async generator)
Regime-aware ensemble is the natural research follow-up to the sweep result: fixed + adaptive + meta-controller, picking automatically by workload cadence — closes the W ≤ 10 s gap without giving up adaptive’s scale invariance
All tests still run on localhost loopback (no real network latency)
Reported RSS sums per-process residency and double-counts CoW shared pages; true PSS is lower
Presentation deliverables: docs/pre_how_4.md (13-slide script, EN/ZH bilingual) + figures/pre/ (5 figures)

This site is open source. Improve this page.