Diffstat (limited to 'ci-bench/README.md')
-rw-r--r--  ci-bench/README.md | 78
1 file changed, 59 insertions(+), 19 deletions(-)
diff --git a/ci-bench/README.md b/ci-bench/README.md
index 2f38c04b..14ff49a2 100644
--- a/ci-bench/README.md
+++ b/ci-bench/README.md
@@ -1,14 +1,17 @@
# CI Bench
-This crate is meant for CI benchmarking. It measures CPU instructions using `cachegrind`, outputs
-the results in CSV format and allows comparing results from multiple runs.
+This crate is meant for CI benchmarking. It has two modes of operation:
+
+1. Measure CPU instructions using `cachegrind`.
+2. Measure wall-time (runs each benchmark multiple times, leaving it to the caller to do statistical
+ analysis).
## Usage
You can get detailed usage information through `cargo run --release -- --help`. Below are the most
important bits.
-### Running all benchmarks
+### Running all benchmarks in instruction count mode
_Note: this step requires having `valgrind` in your path._
@@ -37,7 +40,24 @@ are useful to report detailed instruction count differences when comparing two b
subdirectory also contains log information from cachegrind itself (in `.log` files), which can be
used to diagnose unexpected cachegrind crashes.
-### Comparing results
+### Running all benchmarks in wall-time mode
+
+Use `cargo run --release -- walltime --iterations-per-scenario 3` to print the CSV results to stdout
+(we use 3 iterations here for demonstration purposes, but recommend 100 iterations to deal with
+noise). The output should look like the following (one column per iteration):
+
+```csv
+handshake_no_resume_ring_1.2_rsa_aes,6035261,1714158,977368
+handshake_session_id_ring_1.2_rsa_aes,1537632,2445849,1766888
+handshake_tickets_ring_1.2_rsa_aes,1553743,2418286,1636431
+transfer_no_resume_ring_1.2_rsa_aes,10192862,10374258,8988854
+handshake_no_resume_ring_1.3_rsa_aes,1010150,1400602,936029
+...
+... rest omitted for brevity
+...
+```
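Since the statistical analysis is left to the caller, here is one hypothetical way to post-process the CSV above (one scenario per row, one wall-time sample per iteration column). This is not part of the crate; taking the per-scenario median is just one option that keeps a single noisy iteration from skewing the result:

```rust
// Hypothetical post-processing for the wall-time CSV output: takes the
// median sample per scenario. Not the crate's API, just a sketch.
fn median(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn per_scenario_medians(csv: &str) -> Vec<(String, u64)> {
    csv.lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            let mut fields = line.split(',');
            // First column is the scenario name, the rest are samples.
            let name = fields.next().unwrap().to_string();
            let mut samples: Vec<u64> = fields
                .map(|f| f.parse().expect("sample columns must be integers"))
                .collect();
            (name, median(&mut samples))
        })
        .collect()
}
```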
+
+### Comparing results from an instruction count benchmark run
Use `cargo run --release -- compare foo bar`. It will output a report using GitHub-flavored markdown
(used by the CI itself to give feedback about PRs). We currently consider differences of 0.2% to be
@@ -62,17 +82,19 @@ with names like `transfer_no_resume_1.3_rsa_aes_client`.
We have made an effort to heavily document the source code of the benchmarks. In addition to that,
here are some high-level considerations that can help you hack on the crate.
-### Architecture
+### Environment configuration
-An important goal of this benchmarking setup is that it should run with minimum noise on
-standard GitHub Actions runners. We achieve that by measuring CPU instructions using `cachegrind`,
-which runs fine on the cloud (contrary to hardware instruction counters). This is the same
-approach used by the [iai](https://crates.io/crates/iai) benchmarking crate, but we needed more
-flexibility and have therefore rolled our own setup.
+An important goal of this benchmarking setup is that it should run with minimal noise. Measuring CPU
+instructions using `cachegrind` yields excellent results, regardless of your environment. The
+wall-time benchmarks, however, require a more elaborate benchmarking environment: running them on a
+laptop is too noisy, but running them on a carefully configured bare-metal server yields accurate
+measurements (up to 1% resolution, according to our tests).
+
+### Instruction count mode
Using `cachegrind` has some architectural consequences because it operates at the process level
-(i.e. it can count CPU instructions for a whole process, but not for a single function). The
-most important consequences are:
+(i.e. it can count CPU instructions for a whole process, but not for a single function). The most
+important consequences when running in instruction count mode are:
- Since we want to measure server and client instruction counts separately, the benchmark runner
spawns two child processes for each benchmark (one for the client, one for the server) and pipes
@@ -85,18 +107,36 @@ most important consequences are:
subtracted from it. We are currently using this to subtract the handshake instructions from the
data transfer benchmark.
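The reference-benchmark subtraction described above can be sketched roughly as follows (names and types here are illustrative, not the crate's actual API):

```rust
use std::collections::HashMap;

// Illustrative sketch of reference-benchmark subtraction: the data
// transfer benchmark's raw count still includes a full handshake, so the
// matching handshake benchmark's count is subtracted from it.
fn apply_reference_benchmarks(
    raw_counts: &HashMap<String, u64>,
    // Maps a benchmark to the benchmark whose count should be subtracted.
    references: &HashMap<String, String>,
) -> HashMap<String, u64> {
    raw_counts
        .iter()
        .map(|(name, &count)| {
            let adjusted = match references.get(name).and_then(|r| raw_counts.get(r)) {
                Some(&reference) => count.saturating_sub(reference),
                None => count,
            };
            (name.clone(), adjusted)
        })
        .collect()
}
```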
-### Debugging
-
-If you need to debug the crate, here are a few tricks that might help:
+If you need to debug benchmarks in instruction count mode, here are a few tricks that might help:
- For printf debugging, you should use `eprintln!`, because child processes use stdio as the
transport for the TLS connection (i.e. if you print something to stdout, you won't even see it
_and_ the other side of the connection will choke on it).
- When using a proper debugger, remember that each side of the connection runs as a child process.
- If necessary, you can tweak the code to ensure both sides of the connection run on the parent
- process (e.g. by starting each side on its own thread and having them communicate through TCP).
- This should require little effort, because the TLS transport layer is encapsulated and generic
- over `Read` and `Write`.
+
+### Wall-time mode
+
+To increase determinism, it is important that wall-time mode benchmarks run in a single process and
+thread. All IO is done in-memory and there is none of the complex setup required by instruction
+count mode. Because of this, the easiest way to debug the crate is by running the benchmarks in
+wall-time mode.
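To illustrate the in-memory IO described above, a minimal shared buffer that both sides of an in-process connection could use as their transport might look like this (a sketch under assumed types, not the crate's actual implementation):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::io::{Read, Result, Write};
use std::rc::Rc;

// Hypothetical in-memory "pipe": one side writes into the shared buffer,
// the other reads from it, all within a single process and thread.
#[derive(Clone, Default)]
struct SharedBuf(Rc<RefCell<VecDeque<u8>>>);

impl Write for SharedBuf {
    fn write(&mut self, data: &[u8]) -> Result<usize> {
        self.0.borrow_mut().extend(data.iter().copied());
        Ok(data.len())
    }

    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}

impl Read for SharedBuf {
    fn read(&mut self, out: &mut [u8]) -> Result<usize> {
        let mut buf = self.0.borrow_mut();
        let n = out.len().min(buf.len());
        for byte in out.iter_mut().take(n) {
            *byte = buf.pop_front().unwrap();
        }
        Ok(n)
    }
}
```

Because the buffer is a `Clone`-able handle over shared state, the client and server sides can each hold a copy without any sockets or child processes being involved.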
+
+### Code reuse between benchmarking modes
+
+Originally, we only supported the instruction count mode, implemented using blocking IO. Even though
+the code was generic over the `Reader` and `Writer`, it could not be reused for the wall-time mode
+because it was blocking (e.g. if the client side of the connection is waiting for a read, the thread
+is blocked and the server never gets a chance to write).
+
+The solution was to:
+
+1. Rewrite the IO code to use async / await.
+2. Keep using blocking operations under the hood in instruction-count mode, disguised as `Future`s
+ that complete after a single `poll`. This way we avoid using an async runtime, which could
+ introduce non-determinism.
+3. Use non-blocking operations under the hood in wall-time mode, which simulate IO through shared
+ in-memory buffers. The server and client `Future`s are polled in turns, so again we avoid
+ pulling in an async runtime and keep things as deterministic as possible.
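Step 2 above can be sketched as follows: a hypothetical wrapper that performs its (possibly blocking) work inside `poll` and always resolves on the first call, plus a tiny poll-once driver built around a no-op waker, so no async runtime is ever involved. Names are illustrative, not the crate's actual types:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical wrapper: the (potentially blocking) operation runs inside
// `poll`, so the future is always ready after a single poll.
struct BlockingOp<F>(F);

impl<F: FnMut() -> T + Unpin, T> Future for BlockingOp<F> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<T> {
        // The blocking work happens right here; we then resolve immediately.
        Poll::Ready((self.get_mut().0)())
    }
}

// A waker that does nothing: with single-poll futures, nothing ever
// needs to be woken up.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Drives a future for exactly one poll, with no runtime involved.
fn poll_once<T>(mut fut: impl Future<Output = T> + Unpin) -> Option<T> {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    match Pin::new(&mut fut).poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```

A driver like `poll_once` can also alternate between two futures (as in step 3) by calling it on each in turn, keeping the scheduling fully deterministic.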
### Why measure CPU instructions