author    Adolfo Ochagavía <github@adolfo.ochagavia.nl>    2023-11-16 15:19:54 +0100
committer Daniel McCarney <daniel@binaryparadox.net>       2023-11-23 17:09:09 +0000
commit    c005236270fac9632aeb9c874402148b1b473a9b (patch)
tree      2c4d8a289e4bb64c985fa30e51616329240778ec /ci-bench
parent    2463f991559242e2646f3999b9301ce2843059d1 (diff)
ci-bench: add wall-time mode
Diffstat (limited to 'ci-bench')
-rw-r--r--  ci-bench/README.md     78
-rw-r--r--  ci-bench/src/main.rs   77
-rw-r--r--  ci-bench/src/util.rs  273
3 files changed, 408 insertions, 20 deletions
diff --git a/ci-bench/README.md b/ci-bench/README.md
index 2f38c04b..14ff49a2 100644
--- a/ci-bench/README.md
+++ b/ci-bench/README.md
@@ -1,14 +1,17 @@
# CI Bench
-This crate is meant for CI benchmarking. It measures CPU instructions using `cachegrind`, outputs
-the results in CSV format and allows comparing results from multiple runs.
+This crate is meant for CI benchmarking. It has two modes of operation:
+
+1. Measure CPU instructions using `cachegrind`.
+2. Measure wall-time (runs each benchmark multiple times, leaving it to the caller to do statistical
+ analysis).
## Usage
You can get detailed usage information through `cargo run --release -- --help`. Below are the most
important bits.
-### Running all benchmarks
+### Running all benchmarks in instruction count mode
_Note: this step requires having `valgrind` in your path._
@@ -37,7 +40,24 @@ are useful to report detailed instruction count differences when comparing two benchmark runs
subdirectory also contains log information from cachegrind itself (in `.log` files), which can be
used to diagnose unexpected cachegrind crashes.
-### Comparing results
+### Running all benchmarks in wall-time mode
+
+Use `cargo run --release -- walltime --iterations-per-scenario 3` to print the CSV results to stdout
+(we use 3 iterations here for demonstration purposes, but recommend 100 iterations to deal with
+noise). The output should look like the following (one column per iteration):
+
+```csv
+handshake_no_resume_ring_1.2_rsa_aes,6035261,1714158,977368
+handshake_session_id_ring_1.2_rsa_aes,1537632,2445849,1766888
+handshake_tickets_ring_1.2_rsa_aes,1553743,2418286,1636431
+transfer_no_resume_ring_1.2_rsa_aes,10192862,10374258,8988854
+handshake_no_resume_ring_1.3_rsa_aes,1010150,1400602,936029
+...
+... rest omitted for brevity
+...
+```
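+
+The crate itself does no statistical analysis of these timings; that is left to the caller. As a
+purely hypothetical example of what a caller could do (this tool is not part of the crate), the
+sketch below reads the CSV from stdin and prints the median wall-time per scenario:
+
+```rust
+use std::io::BufRead;
+
+fn main() {
+    // Each CSV line holds a scenario name followed by one duration (in ns) per iteration
+    for line in std::io::stdin().lock().lines() {
+        let line = line.unwrap();
+        let mut fields = line.split(',');
+        let name = fields.next().unwrap();
+        let mut timings: Vec<u64> = fields.map(|f| f.parse().unwrap()).collect();
+        timings.sort_unstable();
+        // Midpoint as a simple median; a real analysis would be more thorough
+        println!("{name}: median = {} ns", timings[timings.len() / 2]);
+    }
+}
+```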
+
+### Comparing results from an instruction count benchmark run
Use `cargo run --release -- compare foo bar`. It will output a report using GitHub-flavored markdown
(used by the CI itself to give feedback about PRs). We currently consider differences of 0.2% to be
@@ -62,17 +82,19 @@ with names like `transfer_no_resume_1.3_rsa_aes_client`.
We have made an effort to heavily document the source code of the benchmarks. In addition to that,
here are some high-level considerations that can help you hack on the crate.
-### Architecture
+### Environment configuration
-An important goal of this benchmarking setup is that it should run with minimum noise on
-standard GitHub Actions runners. We achieve that by measuring CPU instructions using `cachegrind`,
-which runs fine on the cloud (contrary to hardware instruction counters). This is the same
-approach used by the [iai](https://crates.io/crates/iai) benchmarking crate, but we needed more
-flexibility and have therefore rolled our own setup.
+An important goal of this benchmarking setup is that it should run with minimal noise. Measuring CPU
+instructions using `cachegrind` yields excellent results, regardless of your environment. The
+wall-time benchmarks, however, require a more elaborate benchmarking environment: running them on a
+laptop is too noisy, but running them on a carefully configured bare-metal server yields accurate
+measurements (up to 1% resolution, according to our tests).
+
+### Instruction count mode
Using `cachegrind` has some architectural consequences because it operates at the process level
-(i.e. it can count CPU instructions for a whole process, but not for a single function). The
-most important consequences are:
+(i.e. it can count CPU instructions for a whole process, but not for a single function). The most
+important consequences when running in instruction count mode are:
- Since we want to measure server and client instruction counts separately, the benchmark runner
spawns two child processes for each benchmark (one for the client, one for the server) and pipes
@@ -85,18 +107,36 @@ most important consequences are:
subtracted from it. We are currently using this to subtract the handshake instructions from the
data transfer benchmark.
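+
+As a hypothetical sketch of the child-process pattern described above (this is not the crate's
+actual code, and the binary names are made up), a runner can cross-connect the stdio of two child
+processes so that each side's stdout becomes the other side's stdin:
+
+```rust
+use std::process::{Command, Stdio};
+
+fn main() -> std::io::Result<()> {
+    // Spawn the server with piped stdio so we can hand its ends to the client
+    let mut server = Command::new("bench-server") // hypothetical binary name
+        .stdin(Stdio::piped())
+        .stdout(Stdio::piped())
+        .spawn()?;
+
+    // The client reads what the server writes, and writes what the server reads
+    let mut client = Command::new("bench-client") // hypothetical binary name
+        .stdin(server.stdout.take().unwrap())
+        .stdout(server.stdin.take().unwrap())
+        .spawn()?;
+
+    client.wait()?;
+    server.wait()?;
+    Ok(())
+}
+```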
-### Debugging
-
-If you need to debug the crate, here are a few tricks that might help:
+If you need to debug benchmarks in instruction count mode, here are a few tricks that might help:
- For printf debugging, you should use `eprintln!`, because child processes use stdio as the
transport for the TLS connection (i.e. if you print something to stdout, you won't even see it
_and_ the other side of the connection will choke on it).
- When using a proper debugger, remember that each side of the connection runs as a child process.
- If necessary, you can tweak the code to ensure both sides of the connection run on the parent
- process (e.g. by starting each side on its own thread and having them communicate through TCP).
- This should require little effort, because the TLS transport layer is encapsulated and generic
- over `Read` and `Write`.
+
+### Wall-time mode
+
+To increase determinism, it is important that wall-time mode benchmarks run in a single process and
+thread. All IO is done in memory and there is no complex setup, unlike in instruction count mode.
+Because of this, the easiest way to debug the crate is by running the benchmarks in
+wall-time mode.
+
+### Code reuse between benchmarking modes
+
+Originally, we only supported the instruction count mode, implemented using blocking IO. Even though
+the code was generic over `Read` and `Write`, it could not be reused for the wall-time mode
+because it was blocking (e.g. if the client side of the connection is waiting for a read, the thread
+is blocked and the server never gets a chance to write).
+
+The solution was to:
+
+1. Rewrite the IO code to use async / await.
+2. Keep using blocking operations under the hood in instruction-count mode, disguised as `Future`s
+   that complete after a single `poll` (see the sketch after this list). This way we avoid using
+   an async runtime, which could introduce non-determinism.
+3. Use non-blocking operations under the hood in wall-time mode, which simulate IO through shared
+   in-memory buffers. The server and client `Future`s are polled in turns, so again we avoid
+ pulling in an async runtime and keep things as deterministic as possible.
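+
+As a hypothetical sketch of step 2 (this is not the crate's actual code), a blocking read disguised
+as a `Future` could look as follows. The blocking call happens inside `poll`, so the future is
+ready after a single poll and no runtime is required:
+
+```rust
+use std::future::Future;
+use std::io::Read;
+use std::pin::Pin;
+use std::task::{Context, Poll};
+
+struct BlockingRead<'a, R: Read> {
+    reader: &'a mut R,
+    buf: &'a mut [u8],
+}
+
+impl<'a, R: Read> Future for BlockingRead<'a, R> {
+    type Output = std::io::Result<usize>;
+
+    fn poll(self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<Self::Output> {
+        // Blocks right here; by the time `poll` returns, the IO is already done
+        let this = self.get_mut();
+        Poll::Ready(this.reader.read(this.buf))
+    }
+}
+```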
### Why measure CPU instructions
diff --git a/ci-bench/src/main.rs b/ci-bench/src/main.rs
index e8ad47ef..2b44c7ee 100644
--- a/ci-bench/src/main.rs
+++ b/ci-bench/src/main.rs
@@ -6,6 +6,7 @@ use std::mem;
use std::os::fd::{AsRawFd, FromRawFd};
use std::path::{Path, PathBuf};
use std::sync::Arc;
+use std::time::Instant;
use anyhow::Context;
use async_trait::async_trait;
@@ -77,6 +78,11 @@ pub enum Command {
},
/// Run a single benchmark at the provided index (used by the bench runner to start each benchmark in its own process)
RunSingle { index: u32, side: Side },
+ /// Run all benchmarks in walltime mode and print the measured timings in CSV format
+ Walltime {
+ #[arg(short, long)]
+ iterations_per_scenario: usize,
+ },
/// Compare the results from two previous benchmark runs and print a user-friendly markdown overview
Compare {
/// Path to the directory with the results of a previous `run-all` execution
@@ -183,6 +189,77 @@ fn main() -> anyhow::Result<()> {
mem::forget(stdin);
mem::forget(stdout);
}
+ Command::Walltime {
+ iterations_per_scenario,
+ } => {
+ let mut timings = vec![Vec::with_capacity(iterations_per_scenario); benchmarks.len()];
+ for _ in 0..iterations_per_scenario {
+ for (i, bench) in benchmarks.iter().enumerate() {
+ let start = Instant::now();
+
+ // The variables below are used to initialize the client and server configs. We
+ // let them go through `black_box` to ensure the optimizer doesn't take
+ // advantage of knowing both the client and the server side of the
+ // configuration.
+ let resumption_kind = black_box(bench.kind.resumption_kind());
+ let params = black_box(&bench.params);
+
+ let (mut client_writer, mut server_reader) = async_io::async_pipe(262144);
+ let (mut server_writer, mut client_reader) = async_io::async_pipe(262144);
+
+ let server_side = async move {
+ let handshake_buf = &mut [0u8; 262144];
+ run_bench(
+ ServerSideStepper {
+ io: StepperIo {
+ reader: &mut server_reader,
+ writer: &mut server_writer,
+ handshake_buf,
+ },
+ config: ServerSideStepper::make_config(params, resumption_kind),
+ },
+ bench.kind,
+ )
+ .await
+ };
+
+ let client_side = async move {
+ let handshake_buf = &mut [0u8; 262144];
+ run_bench(
+ ClientSideStepper {
+ io: StepperIo {
+ reader: &mut client_reader,
+ writer: &mut client_writer,
+ handshake_buf,
+ },
+ resumption_kind,
+ config: ClientSideStepper::make_config(params, resumption_kind),
+ },
+ bench.kind,
+ )
+ .await
+ };
+
+ let (client_result, server_result) =
+ async_io::block_on_concurrent(client_side, server_side);
+ client_result
+ .with_context(|| format!("client side of {} crashed", bench.name()))?;
+ server_result
+ .with_context(|| format!("server side of {} crashed", bench.name()))?;
+
+ timings[i].push(start.elapsed());
+ }
+ }
+
+ // Output the results
+ for (i, bench_timings) in timings.into_iter().enumerate() {
+ print!("{}", benchmarks[i].name());
+ for timing in bench_timings {
+ print!(",{}", timing.as_nanos())
+ }
+ println!();
+ }
+ }
Command::Compare {
baseline_dir,
candidate_dir,
diff --git a/ci-bench/src/util.rs b/ci-bench/src/util.rs
index b12d1a17..56022807 100644
--- a/ci-bench/src/util.rs
+++ b/ci-bench/src/util.rs
@@ -39,9 +39,12 @@ pub mod async_io {
//! Async IO building blocks required for sharing code between the instruction count and
//! wall-time benchmarks
+ use std::cell::{Cell, RefCell};
+ use std::collections::VecDeque;
use std::fs::File;
use std::future::Future;
- use std::pin::pin;
+ use std::pin::{pin, Pin};
+ use std::rc::Rc;
use std::task::{Poll, RawWaker, RawWakerVTable, Waker};
use std::{io, ptr, task};
@@ -70,6 +73,61 @@ pub mod async_io {
}
}
+ /// Block on two futures that are run concurrently and return their results.
+ ///
+ /// Useful when measuring wall-time, because the server and the client side of the connection
+ /// run in a single process _and_ thread to minimize noise. Each side of the connection runs
+ /// inside its own future and they are polled in turns.
+ ///
+ /// Using this together with blocking futures can lead to deadlocks (i.e. when one of the
+ /// futures is blocked while it waits on a message from the other).
+ pub fn block_on_concurrent(
+ x: impl Future<Output = anyhow::Result<()>>,
+ y: impl Future<Output = anyhow::Result<()>>,
+ ) -> (anyhow::Result<()>, anyhow::Result<()>) {
+ let mut x = pin!(x);
+ let mut y = pin!(y);
+
+ // The futures won't complete right away, but since there are only two of them we can poll
+ // them in turns without a more complex waking mechanism.
+ let waker = noop_waker();
+ let mut ctx = task::Context::from_waker(&waker);
+
+ let mut x_output = None;
+ let mut y_output = None;
+
+ // Fuel makes sure we can exit a potential infinite loop if the futures are endlessly
+ // waiting on each other due to a bug (e.g. a read without a corresponding write)
+ let mut fuel = 1_000;
+ loop {
+ let futures_done = x_output.is_some() && y_output.is_some();
+ if futures_done || fuel == 0 {
+ break;
+ }
+
+ fuel -= 1;
+
+ if x_output.is_none() {
+ match x.as_mut().poll(&mut ctx) {
+ Poll::Ready(output) => x_output = Some(output),
+ Poll::Pending => {}
+ }
+ }
+
+ if y_output.is_none() {
+ match y.as_mut().poll(&mut ctx) {
+ Poll::Ready(output) => y_output = Some(output),
+ Poll::Pending => {}
+ }
+ }
+ }
+
+ match (x_output, y_output) {
+ (Some(x_output), Some(y_output)) => (x_output, y_output),
+ _ => panic!("at least one of the futures seems to be stuck"),
+ }
+ }
+
// Copied from Waker::noop, which we cannot use directly because it hasn't been stabilized
fn noop_waker() -> Waker {
const VTABLE: RawWakerVTable = RawWakerVTable::new(|_| RAW, |_| {}, |_| {}, |_| {});
@@ -114,6 +172,219 @@ pub mod async_io {
io::Write::flush(self)
}
}
+
+ /// Creates a unidirectional byte pipe of the given capacity, suitable for async reading and
+ /// writing
+ pub fn async_pipe(capacity: usize) -> (AsyncSender, AsyncReceiver) {
+ let open = Rc::new(Cell::new(true));
+ let buf = Rc::new(RefCell::new(VecDeque::with_capacity(capacity)));
+ (
+ AsyncSender {
+ inner: AsyncPipeSide {
+ open: open.clone(),
+ buf: buf.clone(),
+ },
+ },
+ AsyncReceiver {
+ inner: AsyncPipeSide { open, buf },
+ },
+ )
+ }
+
+ /// The sender end of an asynchronous byte pipe
+ pub struct AsyncSender {
+ inner: AsyncPipeSide,
+ }
+
+ /// The receiver end of an asynchronous byte pipe
+ pub struct AsyncReceiver {
+ inner: AsyncPipeSide,
+ }
+
+ struct AsyncPipeSide {
+ open: Rc<Cell<bool>>,
+ buf: Rc<RefCell<VecDeque<u8>>>,
+ }
+
+ impl Drop for AsyncPipeSide {
+ fn drop(&mut self) {
+ self.open.set(false);
+ }
+ }
+
+ #[async_trait(?Send)]
+ impl AsyncRead for AsyncReceiver {
+ async fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
+ AsyncPipeReadFuture {
+ reader: self,
+ user_buf: buf,
+ }
+ .await
+ }
+
+ async fn read_exact(&mut self, buf: &mut [u8]) -> io::Result<()> {
+ let mut read = 0;
+ while read < buf.len() {
+ read += self.read(&mut buf[read..]).await?;
+ }
+
+ Ok(())
+ }
+ }
+
+ #[async_trait(?Send)]
+ impl AsyncWrite for AsyncSender {
+ async fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
+ AsyncPipeWriteFuture {
+ writer: self,
+ user_buf: buf,
+ }
+ .await
+ }
+
+ async fn flush(&mut self) -> io::Result<()> {
+ Ok(())
+ }
+ }
+
+ struct AsyncPipeReadFuture<'a> {
+ reader: &'a AsyncReceiver,
+ user_buf: &'a mut [u8],
+ }
+
+ impl<'a> Future for AsyncPipeReadFuture<'a> {
+ type Output = io::Result<usize>;
+
+ fn poll(mut self: Pin<&mut Self>, _: &mut task::Context<'_>) -> Poll<Self::Output> {
+ let inner_buf = &mut self.reader.inner.buf.borrow_mut();
+ if inner_buf.is_empty() {
+ return if self.reader.inner.open.get() {
+ // Wait for data to arrive, or EOF
+ Poll::Pending
+ } else {
+ // EOF
+ Poll::Ready(Ok(0))
+ };
+ }
+
+ let bytes_to_write = inner_buf.len().min(self.user_buf.len());
+
+ // This is a convoluted way to copy the bytes from the inner buffer into the user's
+ // buffer
+ let (first_half, second_half) = inner_buf.as_slices();
+ let bytes_to_write_from_first_half = first_half.len().min(bytes_to_write);
+ let bytes_to_write_from_second_half =
+ bytes_to_write.saturating_sub(bytes_to_write_from_first_half);
+ self.user_buf[..bytes_to_write_from_first_half]
+ .copy_from_slice(&first_half[..bytes_to_write_from_first_half]);
+ self.user_buf[bytes_to_write_from_first_half..bytes_to_write]
+ .copy_from_slice(&second_half[..bytes_to_write_from_second_half]);
+
+ inner_buf.drain(..bytes_to_write);
+
+ Poll::Ready(Ok(bytes_to_write))
+ }
+ }
+
+ struct AsyncPipeWriteFuture<'a> {
+ writer: &'a AsyncSender,
+ user_buf: &'a [u8],
+ }
+
+ impl<'a> Future for AsyncPipeWriteFuture<'a> {
+ type Output = io::Result<()>;
+
+ fn poll(mut self: Pin<&mut Self>, _: &mut task::Context<'_>) -> Poll<Self::Output> {
+ if !self.writer.inner.open.get() {
+ return Poll::Ready(Err(io::Error::new(
+ io::ErrorKind::Other,
+ "channel was closed",
+ )));
+ }
+
+ let mut pipe_buf = self.writer.inner.buf.borrow_mut();
+ let capacity_left = pipe_buf.capacity() - pipe_buf.len();
+ let bytes_to_write = self.user_buf.len().min(capacity_left);
+ pipe_buf.extend(&self.user_buf[..bytes_to_write]);
+
+ if self.user_buf.len() > capacity_left {
+ self.user_buf = &self.user_buf[bytes_to_write..];
+
+ // Continue writing later once capacity is available
+ Poll::Pending
+ } else {
+ Poll::Ready(Ok(()))
+ }
+ }
+ }
+
+ #[cfg(test)]
+ mod test {
+ use super::*;
+
+ #[test]
+ fn test_block_on_concurrent_minimal_capacity() {
+ test_block_on_concurrent(1);
+ }
+
+ #[test]
+ fn test_block_on_concurrent_enough_capacity() {
+ test_block_on_concurrent(100);
+ }
+
+ fn test_block_on_concurrent(capacity: usize) {
+ let (mut server_writer, mut client_reader) = async_pipe(capacity);
+ let (mut client_writer, mut server_reader) = async_pipe(capacity);
+
+ let client = async {
+ client_writer
+ .write_all(b"hello")
+ .await
+ .unwrap();
+
+ let mut buf = [0; 2];
+ client_reader
+ .read_exact(&mut buf)
+ .await
+ .unwrap();
+ assert_eq!(&buf, b"42");
+
+ client_writer
+ .write_all(b"bye bye")
+ .await
+ .unwrap();
+
+ Ok(())
+ };
+
+ let server = async {
+ let mut buf = [0; 5];
+ server_reader
+ .read_exact(&mut buf)
+ .await
+ .unwrap();
+ assert_eq!(&buf, b"hello");
+
+ server_writer
+ .write_all(b"42")
+ .await
+ .unwrap();
+
+ let mut buf = [0; 7];
+ server_reader
+ .read_exact(&mut buf)
+ .await
+ .unwrap();
+ assert_eq!(&buf, b"bye bye");
+
+ Ok(())
+ };
+
+ let (client_result, server_result) = block_on_concurrent(client, server);
+ client_result.unwrap();
+ server_result.unwrap();
+ }
+ }
}
pub mod transport {