Language: English | 中文版 (Chinese version)
> [!WARNING]
> Not all source files have been fully polished yet (e.g., comments, code structure, and style). Some cleanup work has been done, but much remains. All code will be brought up to professional engineering quality before the end of the semester. Apologies for the current state. 🙏🙏🙏
Prof: Ramesh Govindan

Proposal Report: Proposal.pdf

This project builds a clean-slate, bare-metal FaaS (Function-as-a-Service) worker node from scratch in C++, targeting the cold-start problem in edge serverless environments.
The core research question: Can a lightweight, O(1) prediction heuristic (EWMA + CUSUM + Little’s Law) reduce cold-start penalties in cyclic workloads — while consuming far less overhead than heavyweight time-series models like ARIMA?
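To make the heuristic concrete before the module breakdown, here is a minimal sketch of one predictor tick. The constants match values reported in the experiments below (α=0.2, drift=5, h=8, T=0.5 s), but the struct, the method names, and the exact wiring of the three pieces are illustrative, not the real Predictor.hpp:

```cpp
#include <algorithm>
#include <cmath>

// Illustrative sketch of one O(1) predictor tick (not the real Predictor.hpp).
struct PredictorSketch {
    double ewma  = 0.0;   // smoothed arrival rate (RPS)
    double cusum = 0.0;   // one-sided CUSUM accumulator
    double alpha = 0.2;   // EWMA smoothing factor
    double drift = 5.0;   // CUSUM slack: ignore deviations below this
    double h     = 8.0;   // CUSUM alarm threshold
    double T     = 0.5;   // measured service time (s), fed back at runtime

    // Feed the RPS observed over the last window; returns the target worker count.
    int Tick(double observed_rps) {
        ewma  = alpha * observed_rps + (1.0 - alpha) * ewma;
        cusum = std::max(0.0, cusum + (observed_rps - ewma - drift));
        const bool spike  = cusum > h;                 // "SPIKE DETECTED"
        const double rate = spike ? observed_rps : ewma;
        // Little's Law: L = lambda * T concurrent requests; +1 worker of headroom.
        return static_cast<int>(std::ceil(rate * T)) + 1;
    }
};
```

Per tick this is a handful of arithmetic operations over a few doubles of state; the ARIMA baseline, by contrast, runs as a separate Python forecasting process (see Check #2 below).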
All blocking I/O is handled in the 64-thread DispatchPool. The predictor runs on the same thread as the event loop with zero contention.

| Module | File | Status |
|---|---|---|
| C++ Epoll Reactor (ET mode, non-blocking) | tcp_server.hpp | ✅ Done |
| 64-thread DispatchPool (all blocking I/O here) | DispatchPool.hpp | ✅ Done |
| WorkerPool: fork + scale-down scavenger thread | WorkerPool.hpp | ✅ Done |
| Predictor: EWMA + CUSUM + Little’s Law | Predictor.hpp | ✅ Done |
| Dynamic T feedback (UpdateServiceTime) | web_server.hpp | ✅ Done |
| Python Worker (simulates T=500ms AI inference) | worker.py | ✅ Done |
| Two-cycle comparison test + drain phase | load_tester.py | ✅ Done |
| Missing Item | Type | Priority |
|---|---|---|
| ARIMA baseline (Python statsmodels) | Code + Experiment | 🔴 High — explicitly requested by advisor |
| Reactive baseline (predictor disabled, cold fallback only) | Code + Experiment | 🔴 High — needed to show worst case |
| Static baseline (fixed N workers, no scaling) | Code + Experiment | 🟡 Medium — needed for Pareto curve |
| Memory / CPU overhead measurement | Experiment | 🔴 High — core data for Pareto curve |
| Predictor inference latency measurement | Experiment | 🟡 Medium — proves EWMA/CUSUM ≪ ARIMA |
| CloudLab bare-metal testing | Environment | 🟡 Medium — loopback has no network noise |
| Cyclic bursty workload with wrk | Experiment | 🟡 Medium — proposal uses wrk, not Python script |
Log: test_20260303_013519.log
In V1.0, the Python worker processed a 1×1 pixel image — service time T ≈ 0.001s. With such a tiny T, Little’s Law gives N = ⌈60 × 0.001⌉ + 1 = 2 workers even at peak 60 RPS. The predictor never needed to scale beyond N=2, so cold starts never occurred. Worse, scale-down was not yet implemented: workers forked in Cycle 1 stayed alive forever, so Cycle 2 trivially had zero cold starts — not because of EWMA memory, but because workers were never killed.
Result: P50=1.4ms, P95=1.7ms, P99=2.1ms, 0 cold starts. Looks great, means nothing.
Root cause of the misleading result: T was too small → N was always 2 → predictor was irrelevant. No scale-down meant Cycle 1 “pre-warmed” Cycle 2 for free.
Logs: test_20260303_022336.log, test_20260303_022743.log
After adding time.sleep(0.5) to the worker (T=500ms), the warm-path latency became ~502ms. The cold-start threshold in load_tester.py was still set at rtt > 500 — so every warm request (502ms) was classified as COLD: 100% COLD across all phases. The fix was to set the threshold at 700ms, roughly midway between the warm path (~502ms) and the cold fallback (~800ms).
Log: test_20260303_220742.log
With T=500ms, threshold=700ms, idle_timeout=6s, and an 8-second drain phase between cycles to force the scavenger to kill all workers:
| Phase | Total | WARM | COLD | P50 | P99 |
|---|---|---|---|---|---|
| C1-Warmup (2 RPS, 10s) | 20 | 18 | 2 | 503 ms | 806 ms |
| C1-Spike (30 RPS, 5s) | 150 | 55 | 95 | 801 ms | 806 ms |
| C1-Cooldown (2 RPS, 6s) | 12 | 12 | 0 | 502 ms | 504 ms |
| [Drain: 8s zero traffic — scavenger kills all workers] | |||||
| C2-Warmup (2 RPS, 5s) | 10 | 8 | 2 | 515 ms | 812 ms |
| C2-Spike (30 RPS, 5s) | 150 | 64 | 86 | 801 ms | 804 ms |
Key comparison — spike phase only: C1 55/150 WARM vs C2 64/150 WARM (+9).
What caused the improvement? After C1’s spike, EWMA climbs to ~15 RPS. During cooldown it decays to ~2.9 by C2-Warmup. Little’s Law gives ⌈2.9 × 0.5⌉ + 1 = 3 workers — one more than in C1’s warmup. That extra pre-warmed worker accounts for the +9 warm hits in C2’s spike.
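The decay arithmetic checks out. Assuming one predictor tick per second and the α=0.2 used throughout, the EWMA needs roughly a dozen cooldown/drain ticks at 2 RPS to fall from ~15 to ~2.9:

```cpp
#include <cstdio>

int main() {
    // EWMA decay during C1 cooldown + drain: one tick per second at the
    // observed 2 RPS, alpha = 0.2 (illustrative reconstruction).
    double ewma = 15.0;
    for (int t = 1; t <= 12; ++t) {
        ewma = 0.2 * 2.0 + 0.8 * ewma;
        std::printf("tick %2d: ewma = %.2f\n", t, ewma);  // 12.40, 10.32, ... ~2.89
    }
}
```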
Why is the improvement modest? CUSUM is reactive, not predictive: it fires at second 2 of the spike, but new workers need 0.9s to warm up. Most spike improvement comes from EWMA memory during warmup, not CUSUM during the spike itself. The real bottleneck is that 30 RPS × 0.5s = 15 concurrent workers needed, but we only pre-warm ~3. The system simply doesn’t have enough workers during a steep spike.
To reach paper-quality evaluation, three things are needed, in order of priority: the reactive, static, and ARIMA baselines (the last via Python statsmodels). Without these, we have no Pareto curve and no paper claim.

```bash
# Dependencies
pip install Pillow
# Build
make clean && make
# Run server (port 8080)
./server
# Run two-cycle comparison test
python3 load_tester.py
```

Log output: logs/test_<TIMESTAMP>.log

Check #2 Report: checkin2.pdf
| Change | Description |
|---|---|
| CoW Template Process | Added worker_template.py: loads Pillow once at startup; all subsequent workers are CoW-forked from it. Per-worker cold start drops from ~800 ms to ~100 ms |
| Narrative Reframe | Dropped the circular “bypass KVM for clean measurement” argument. New framing: edge inference nodes cannot afford MicroVM overhead — OS process isolation is the appropriate data plane |
| Reactive Baseline | ./server reactive — pure scale-on-demand, no prediction; establishes cold-start lower bound |
| Static Baseline | ./server static 15 — fixed pool of 15 workers kept alive throughout; establishes resource upper bound |
| ARIMA Baseline | ./server arima — separate Python process running ARIMA(2,1,2); measures cost of heavyweight forecasting |
| 4-Cycle Workload | load_tester.py upgraded to 4 cycles (C1: 8 s warmup; C2–C4: 35 s warmup) to give ARIMA sufficient history |
| Experiment Automation | run_experiments.sh runs all 4 baselines serially and archives logs to logs/exp_<mode>_<ts>/ |
| Resource Monitor | resource_monitor.py samples RSS, CPU, and worker count at 1 Hz; outputs CSV for Pareto analysis |
Completed:
- CoW template process (worker_template.py + refactored WorkerPool.hpp)

Remaining (Final Deliverable):
- wrk-based high-frequency load generation

```bash
# Dependencies
pip install Pillow statsmodels
# Build
make clean && make
# Run a specific mode
./server ewma        # default: EWMA+CUSUM prediction
./server reactive    # reactive baseline
./server static 15   # fixed pool of 15 workers
./server arima       # ARIMA prediction
# Run 4-cycle load test
python3 load_tester.py
# Run all 4 baselines (recommended — auto-archives logs)
./run_experiments.sh
# Run a single baseline
./run_experiments.sh ewma
```

Log output: logs/exp_<mode>_<timestamp>/ (contains test.log, server.log, resource.csv)
Key Findings:
Known Issues / Limitations:
This section captures the consolidated final results compiled ahead of the 2026-04-30 CSCI 599 class presentation. It systematically supplements Check #2 with three additions: CoW cold-start quantification, the Adaptive CUSUM baseline, and the Warmup-Sweep ablation. All figures live in figures/pre/ — see MANIFEST.md for provenance.
| Addition | Files / Command | Description |
|---|---|---|
| CoW cold-start quantification | figures/plot_cow.py → slide05_cow.png | Hand-measured from worker.py simulated cold start + server.log “Worker N ready (CoW fork)” lines: Naive (exec Python + import) ≈ 900 ms vs CoW (fork from warm parent) ≈ 100 ms — 9× faster, no runtime dependency |
| CUSUM real-data trace | figures/plot_rps_cusum.py → slide06_rps_cusum.png | CUSUM accumulator trajectory reconstructed from sweep #3’s real server.log (90 predictor ticks, 11 SPIKE DETECTED events, drift=5, h=8) — confirms alarms fire entirely during the ramp phase |
| Adaptive CUSUM baseline | ./server ewma_adaptive | Uses EWMSD (running σ) for z-score normalization — the alternative to fixed-drift CUSUM (see the sketch after this table) |
| Workload design figure | figures/plot_workload.py → slide07_workload.png | load_tester.py’s 4-cycle Bursty-Ramp parameters visualized as a timeline; the Ramp band is labeled as the CUSUM detection window |
| 5-mode main result | figures/plot_main_result.py → slide08_main_result.png | 2026-04-20 re-run of all 5 modes × 4 cycles; cold counts parsed from each load_tester_output.txt SPIKE COMPARISON table |
| Warmup-Sweep ablation | figures/plot_sweep.py → slide10_sweep.png | sweep #1 (W=5, 120 s endpoints) + sweep #3 (W=10, 20, 35, 60 s interior); fixed vs adaptive comparison |
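For reference, here is a minimal sketch of the adaptive variant's core idea: each deviation is divided by a running EWMSD before feeding the CUSUM, so the drift and threshold are expressed in σ units. Only the use of EWMSD z-scores comes from the table above; the k and h values, names, and exact update order are illustrative assumptions, not the tuned ./server ewma_adaptive code:

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the adaptive variant: deviations are normalized by a running
// sigma (EWMSD), making the detector scale-invariant. k and h here are
// illustrative assumptions in sigma units (the fixed-drift runs used
// drift=5, h=8 in absolute RPS).
struct AdaptiveCusumSketch {
    double ewma = 0.0, ewvar = 1.0, cusum = 0.0;
    double alpha = 0.2;       // smoothing for both mean and variance
    double k = 0.5, h = 4.0;  // drift and alarm threshold, in sigma units

    bool Tick(double x) {
        const double dev = x - ewma;
        ewma  += alpha * dev;                                   // EWMA update
        ewvar  = (1.0 - alpha) * (ewvar + alpha * dev * dev);   // EW variance
        const double sigma = std::sqrt(std::max(ewvar, 1e-9));
        const double z = dev / sigma;                           // z-score
        cusum = std::max(0.0, cusum + (z - k));
        return cusum > h;                                       // spike alarm
    }
};
```

Because z is dimensionless, the same k and h apply whether the warmup plateau sits at 2 RPS or 200; that is the scale-invariance claim in the framing note under the sweep table below.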

| Mode | C1 | C2 | C3 | C4 | Total | Notes |
|---|---|---|---|---|---|---|
| Static-15 (over-provisioned) | 0 | 0 | 0 | 0 | 0 | 15 workers pinned the entire run — upper-bound reference |
| Adaptive CUSUM (EWMSD z-score) | 0 | 0 | 0 | 0 | 0 | W=35 happens to sit in the sweet spot |
| Fixed CUSUM (ours) | 0 | 14 | 33 | 0 | 47 | C3 hit a clock-aliasing event |
| Reactive (scale on backlog) | 20 | 15 | 20 | 12 | 67 | Reactive baseline |
| ARIMA (smoothed Target) | 20 | 18 | 31 | 16 | 85 | Heavyweight time-series forecasting |
Total cold starts = number of spike-phase requests (out of 600 across 4 cycles) classified as cold (RTT > 700 ms).
Key findings:
- Fixed CUSUM’s C3 outlier traces to a single clock-aliasing event (visible in server.log at t=1776671064), delaying the first SPIKE DETECTED by 4 seconds. This is a known failure mode of fixed-drift CUSUM, not a bug. Without this event, total ≈ 14 — almost at the Static-15 floor, but without pinning 15 workers.
We pre-import Pillow and set up the socket once in a single template Python process. Every new worker is cloned from it via fork(). Linux’s copy-on-write makes the fork itself nearly free — Pillow’s code and import tables are read-only, so almost no pages are duplicated.
→ No image. No snapshot. No registry. Just fork() from a warm parent.
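The mechanism, as a minimal C++ illustration (the real template is worker_template.py; this only shows why forking a warm parent is cheap):

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Pattern sketch: pay initialization once in a template process, then
// clone workers with fork(). CoW shares the warm, read-only pages, so
// each fork costs ~page-table setup, not a fresh interpreter + imports.
// (Illustrative; the real template is worker_template.py.)
static void expensive_init() {
    sleep(1);  // stand-in for "import PIL": paid exactly once, in the parent
}

int main() {
    expensive_init();                       // warm the parent (the ~900 ms path, once)
    for (int i = 0; i < 4; ++i) {
        const pid_t pid = fork();           // the ~100 ms path, per worker
        if (pid == 0) {                     // child starts already warm
            std::printf("Worker %d ready (CoW fork)\n", i);
            _exit(0);                       // a real worker would serve requests here
        }
    }
    while (wait(nullptr) > 0) {}            // reap children
    return 0;
}
```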

Top panel: blue is measured RPS, dashed orange is the EWMA baseline (α=0.2). The baseline lags by design — that lag is what lets CUSUM observe the gap when RPS pulls away.
Bottom panel: green is the CUSUM accumulator. It crosses the red threshold h = 8 and ★ alarms fire — all 11 alarms land during the ramp climb, none after the peak. This is the empirical proof of “catch the ramp, not the peak”.
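The accumulator mechanics behind the figure can be reproduced in a few lines. The ramp values below are hypothetical and the EWMA baseline is frozen at 2 RPS for clarity; drift=5 and h=8 are the parameters from the real run:

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical ramp toward the 30 RPS peak, frozen baseline of 2 RPS.
// The first alarm fires mid-ramp, well before the peak tick.
int main() {
    const double baseline = 2.0, drift = 5.0, h = 8.0;
    const double rps[] = {2, 5, 10, 15, 20, 25, 30};
    double cusum = 0.0;
    for (int t = 0; t < 7; ++t) {
        cusum = std::max(0.0, cusum + (rps[t] - baseline - drift));
        std::printf("t=%d rps=%2.0f cusum=%5.1f %s\n", t, rps[t], cusum,
                    cusum > h ? "<- SPIKE DETECTED" : "");
    }
}
```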

Each cycle simulates one “train arrival” pattern: Warmup → Ramp (30 s) → Spike (30 RPS × 5 s) → Cooldown → Drain.

| W (s) | Fixed CUSUM | Adaptive CUSUM |
|---|---|---|
| 5 | 48 | 287 |
| 10 | 45 | 135 |
| 20 | 0 | 0 |
| 35 | 0 | 0 |
| 60 | 0 | 0 |
| 120 | 32 | 0 |
⚠ Framing: two failure modes, no winner. Fixed handles tight cadence; adaptive handles loose cadence. The predictor isn’t a choice — it’s a knob. Adaptive’s real contribution is scale invariance + aliasing robustness, not lower cold-start counts.
- Remaining: high-frequency load generation (wrk or a Rust async generator)
- Deliverables: docs/pre_how_4.md (13-slide script, EN/ZH bilingual) + figures/pre/ (5 figures)