
WinMMF - Ending the Saga


Weren’t we done?

So the story seemed to be all wrapped up. That is, until I thanked the folks who were there as moral support. Remember my buddy Joel? Well, he noticed some small stuff and had some nits. Most of them were justified, and some were about errors that were less than perfect. He’s right, of course. But it stings when you’re proud of your achievements and someone has an “uhm akshually” moment right after they’ve said it looks good and given the all-clear …

Minor flaws and annoying atomics

I was using fetch_update with some checking logic. Except that could leave locks in a somewhat unusable state, or fail to guarantee the correctness of the value. You know, small stuff that’s easily fixed with a few lines of code. These issues have long since been addressed, more have been opened for testing, and having moved to Codeberg, I still need to migrate the issues there. That said, using compare_exchange semantics ensures we don’t accidentally wipe lock state set by other users of the same MMF. I also added spinning locks that are actually usable as-is, did some cleanup, and then broke creating MMFs over 1024 bytes in size.
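
Before getting to that 1024 byte mess: the shape of the lock fix is worth a quick sketch. Read the current state, validate it, and only write the new state if nothing changed in between, retrying otherwise. The bit layout below is made up for illustration and isn’t WinMMF’s actual lock.

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical layout: the top bit marks a writer, the rest counts readers.
const WRITE_BIT: u32 = 1 << 31;

/// Take a read lock without clobbering state set by other MMF users.
fn try_read_lock(lock: &AtomicU32) -> bool {
  let mut current = lock.load(Ordering::Acquire);
  loop {
    if current & WRITE_BIT != 0 {
      return false; // a writer holds the lock, don't touch anything
    }
    // Swap only if the state is still exactly what we validated.
    match lock.compare_exchange_weak(current, current + 1, Ordering::AcqRel, Ordering::Acquire) {
      Ok(_) => return true,
      Err(observed) => current = observed, // someone beat us to it; retry
    }
  }
}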

As part of the cleanup, I went from manually filling a Vec with enough bytes to zero out the dirty memory to using a vec![0; size]. If you’re familiar with Rust (or any other language with multiple sizes of integers), you can probably see what went wrong here … Trying to fix this led me down a few more rabbit holes, the most important one being Large Pages on Windows. Did you know they never get swapped out? The more surprising part was the connection between this and the 1024 byte limit issue. You see, accidentally zeroing memory with a 4-byte integer type was never an issue for small amounts of data. It became one once the accidental multiplication by 4 happened on more than 1024 bytes. So once that was fixed I started playing around. The large page support doc I linked before clearly lists the GetLargePageMinimum function for finding the minimum size you have to request when using large pages. It also says:

Include the MEM_LARGE_PAGES value when calling the VirtualAlloc function. The size and alignment must be a multiple of the large-page minimum.

Makes sense; can’t map half of a page, regardless of the size. I tried; Windows has all these funny panics and EventLog entries when you try to map one and a half large pages. And then it hit me. Was Windows sneakily not just allocating more than requested, but also straight up mapping more into the view than I asked it to? How I wish I could just say “yes” without a doubt. But I can’t, because I only did limited testing. As far as I can tell though, this is roughly what happens: the allocation is rounded up to a whole number of pages, and the view you get covers all of them, not just the bytes you asked for.
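
For the large-page part, the probing boiled down to something like the sketch below. It uses the windows crate (exact import paths can differ between versions), error handling is elided, and note that actually getting MEM_LARGE_PAGES to succeed also needs the SeLockMemoryPrivilege, per the same doc:

use windows::Win32::System::Memory::{
  GetLargePageMinimum, VirtualAlloc, MEM_COMMIT, MEM_LARGE_PAGES, MEM_RESERVE, PAGE_READWRITE,
};

fn alloc_large(size: usize) -> *mut core::ffi::c_void {
  unsafe {
    // Returns 0 if the processor doesn't support large pages at all.
    let min = GetLargePageMinimum();
    assert_ne!(min, 0, "no large page support");
    // "The size and alignment must be a multiple of the large-page minimum."
    let rounded = size.next_multiple_of(min);
    VirtualAlloc(None, rounded, MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE)
  }
}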

So why is it that the MS docs don’t tell us we unconditionally get access to multiples of the system page size? This behavior is expected if you know how allocations work. You see, they have examples for mapping views over less than the requested space. Which is a good idea if you can actually do the work in chunks. If your goal is copying full camera frames, chunking would just incur many extra allocs and copies for no benefit. You get a blob, and immediately write the blob. Though this rounding does provide a small benefit. Remember the lock written into the first 4 bytes of the MMF itself? I might not need to allocate those 4 extra bytes after all: so long as (size + 4).checked_next_multiple_of(page_size) doesn’t overflow and isn’t greater than size.next_multiple_of(page_size), we don’t allocate anything extra. And for any consumers of the MMF this means they might be getting more space than they explicitly asked for. Which is a great breeding ground for all of the fun little bugs with extra spice, like off-by-one pointer math and forgotten trailing null bytes. The weird part is not the allocations in page sizes; that makes a lot of sense. The weird part is mapping a view over a smaller subsection of it, and there being nothing to prevent you from going out of bounds on the view. I’d expected the returned pointer to only be valid for N bytes, which would be easy if the pointer you got was offset from the start of the memory page. At least for the mapping, since the view needs to be able to move around for chunked work.
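
Concretely, the “do the 4 lock bytes cost us an extra page?” check works out to something like this (a sketch, not WinMMF’s literal internals):

/// The lock lives in the first 4 bytes of the MMF.
const LOCK_BYTES: usize = 4;

/// True when the lock bytes fit in the padding Windows adds anyway.
/// None means the size check overflowed and the request is bogus.
fn lock_is_free(user_size: usize, page_size: usize) -> Option<bool> {
  let with_lock = user_size.checked_add(LOCK_BYTES)?.checked_next_multiple_of(page_size)?;
  Some(with_lock == user_size.next_multiple_of(page_size))
}

fn main() {
  // 100 bytes of user data leaves plenty of padding for the lock ...
  assert_eq!(lock_is_free(100, 4096), Some(true));
  // ... but a full page of user data pushes the lock into a second page.
  assert_eq!(lock_is_free(4096, 4096), Some(false));
}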

Actually fixing things

So after mangling code and breaking the API more than once, I figured I’d list some of the changes and go through the benefits they provide. At first there were many lock improvements in 0.3 and 0.4, most of them to prevent the possibility of a change between checking the lock and writing it. Then I started improving the ordering of things. Ensuring fences were in place, and then fixing bits and pieces I forgot (or messed up). This is also when I made a start on implementing the shared camera capture thing, up until the point where it just crashed. I could observe the crash, but no panic handler seemed to run and there was nothing of use in the Windows Event Log. That last part struck me as odd, until I stumbled across this 2011 SO answer telling me that things running under the Service Control Manager have all pipes closed. Okay, easy enough: just run it in a standalone binary and get an almost useful error then.

Fast forward to me figuring out that it was being killed with 0xc0000005 STATUS_ACCESS_VIOLATION, and me hooking up LLDB in vscode on Windows. Which is not a thing I thought I’d ever publicly say. It even worked well with the msvc target (which I tried in order to rule out ABI issues), and on both targets I got very similar failures in the exact same place. A simple memcpy. But it was also in one of those debugging sessions that I noticed the size being quadruple what I’d created the MMF with. And then the datatypes: Vec<u32>, &[u32], and *const [u32]. Those were supposed to be u8, and I was sure that’s how I had the code originally. So I went to look into local file history and found my issue.

- let mut zeroing = Vec::with_capacity(size);
- zeroing.resize(size, 0);
+ let zeroing = vec![0; size]; // linter recommendation
  unsafe { std::ptr::copy(zeroing.as_ptr(), map_view.Value.cast(), zeroing.len()) };

You see, the old code never errored. Maybe I just didn’t test with >1024, but I think I did? Like, I didn’t have issues until mapping more than 20MB for shits and giggles. And I’m pretty sure I tried some lower number first, though perhaps I’m mistaken. Either way, the new code had to resolve that vec! macro to some concrete type. With nothing to constrain the inference, it went for the default 4-byte integer, so every element was 4 bytes instead of 1. To illustrate why this went unnoticed at small sizes: I claim the first 4 bytes in the MMF for the lock, which is an AtomicU32 made up of [u8; 4], so for that part the alloc math happens to work out perfectly. As does the alignment, because an MMF is always pointer-size aligned, which is more than Rust asks for here. To illustrate, say we have a block of 64 bits, or 8 bytes. This block has 8 valid positions for single-byte alignment, 4 valid 2-byte ones, 2 valid 4-byte ones, and a single valid 8-byte alignment. If you’d like to see this code in action, here’s the Rust Playground link.

use std::marker::Copy;
use std::ops::Add;
use std::fmt::Debug;

fn cast_bytes_to_slice<T, const N: usize>(slice_in: &u64) -> [T; N]
where T: Copy + Add<Output=T> + Debug {
  unsafe {
    // align_to() never panics; anything misaligned lands in the prefix/suffix
    // slices instead. If alignment doesn't work out, the middle slice comes up
    // short and the indexing in from_fn below panics.
    let t = std::slice::from_ref(slice_in).align_to();
    let p_t: &[T] = t.1;
    eprintln!("{t:?}\n{p_t:?}");
    std::array::from_fn::<T, N, _>(|i| p_t[i])
  }
}

// It's a single 64-bit value. Of course it's aligned
const ALIGNMENTS: u64 = 0b00000001000000100000010000001000001000000100000010000000;

fn main() {
  let a: [u32; 2] = cast_bytes_to_slice::<u32, 2>(&ALIGNMENTS);
  let b: [u16; 4] = cast_bytes_to_slice::<u16, 4>(&ALIGNMENTS);
  let c: [u8;  8] = cast_bytes_to_slice::<u8,  8>(&ALIGNMENTS);
  let d: [u64; 1] = cast_bytes_to_slice::<u64, 1>(&ALIGNMENTS);
  println!("{a:?}\n{b:?}\n{c:?}\n{d:?}")
}

Windows giving us a guaranteed alignment of 4 bytes on x86 and 8 bytes on AMD64 means we can do all of this safely using just the pointer from opening the MMF. All that’s left now is to ensure that we don’t do anything to mess this alignment up. So we scoot over the existing *mut u8 by 4 bytes and we only allow reading and writing it as bytes. This means that any data you’re passing through it will be treated as its raw bytes, and you’re responsible for handling alignment coming in or out of the MMF yourself. To give a brief example of what I mean, assume we have an existing MMF and a SpecialStruct that has to align to 4 bytes at all times and has a size of 32 bytes:

unsafe fn read_from_mmf(mmf: &MemoryMappedFile) -> SpecialStruct {
  // A [u8; 32] buffer would only guarantee 1-byte alignment; borrowing the
  // storage from u32s guarantees the 4-byte alignment SpecialStruct needs.
  let mut buff = [0_u32; 8];
  mmf.read_to_raw(buff.as_mut_ptr().cast::<u8>(), 32).unwrap();
  // Assuming SpecialStruct implements Copy, this works. If it doesn't, you'll have to clone it.
  // Or make this take a pointer to write into, and get that pointer from something like a Box::new_uninit()
  *(buff.as_ptr().cast::<SpecialStruct>())
}

Doing FFI things with it

Now that we have a working MMF at any requested size, I could get to work on using it. I figured I’d write an FFI wrapper since I’ll need to use this from other code as well. And while I’m at it, let’s also auto-generate C# bindings because the docs are not at all clear about working with namespaces and whether or not we can even open a globally namespaced MMF from dotnet. That, and a single call to handle the entire dance seems a lot better than doing the entire dance again. Work smarter, be lazy not harder. I did consider doing the evil thing here, and giving myself a way to get the raw inner pointer so I could blindly wrap that in a bitmap every time, but I decided not to. It also seems fun to do some C# pointer shuffling.

With that out of the way, we also still need to actually capture the camera – something that shouldn’t be hard anymore now that MMFs work at >1024 bytes. The implementation is viewable on Codeberg. The details of it are much more fun and pretty high effort, but this post is already entirely too long for that. It’s Windows-only for now, but I am open to changing that in due time.

We start out with Mullvad’s windows-service-rs crate, as that handles a lot of the annoying bits for us. We then move on to the camera itself, captured using nokhwa. I would like to endorse these two crates. They’ve not only been a great help with good examples, they’re actually nice to work with! It’s also really nice of Mullvad to provide explanations and tips on how work offloaded to background threads can be kept alive in a service. Just pretend you’re fully up and running before you really are, and Windows doesn’t bat an eye. Now the code isn’t perfect yet, and it currently doesn’t support all kinds of capture formats, but that’s something to be fixed in a future version if anyone even wants it. The installer and uninstaller binaries are almost verbatim copies of what Mullvad’s examples use. With one difference: mine has help output and env var configuration. Basically, Windows Services are able to take parameters, but if you don’t manually specify them every time, things get finicky. So the windows-service crate lets you specify command line args to start the binary with. I just unify them and let manual args override installer-specified ones. Since it runs as a system service, you will need to run the (un)installer as admin.
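
To make the “pretend you’re up” trick concrete, here’s a trimmed-down sketch of the shape this takes with the windows-service crate. The service name and init_cameras are placeholders, and all error handling is stripped:

use std::ffi::OsString;
use std::time::Duration;
use windows_service::define_windows_service;
use windows_service::service::{
  ServiceControl, ServiceControlAccept, ServiceExitCode, ServiceState, ServiceStatus, ServiceType,
};
use windows_service::service_control_handler::{self, ServiceControlHandlerResult};

define_windows_service!(ffi_service_main, service_main);

fn service_main(_args: Vec<OsString>) {
  // A real service would signal shutdown on Stop; this sketch just acks it.
  let handle = service_control_handler::register("shmemcam", |control| match control {
    ServiceControl::Stop | ServiceControl::Interrogate => ServiceControlHandlerResult::NoError,
    _ => ServiceControlHandlerResult::NotImplemented,
  })
  .unwrap();

  // Tell the SCM we're fully running *before* the slow camera init happens.
  handle
    .set_service_status(ServiceStatus {
      service_type: ServiceType::OWN_PROCESS,
      current_state: ServiceState::Running,
      controls_accepted: ServiceControlAccept::STOP,
      exit_code: ServiceExitCode::Win32(0),
      checkpoint: 0,
      wait_hint: Duration::default(),
      process_id: None,
    })
    .unwrap();

  // The real startup work happens on a background thread afterwards.
  std::thread::spawn(init_cameras);
}

fn init_cameras() { /* open cams, create MMFs, start the capture loop ... */ }

fn main() -> Result<(), windows_service::Error> {
  windows_service::service_dispatcher::start("shmemcam", ffi_service_main)
}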

As for features, we only ship with one: whether or not to write the names of the shared cameras into a file accessible through the %PUBLIC% folder, with a fallback to C:/Users/Public, which should be present by default on w10+ if I understand correctly. The feature is on by default, since this code is mostly written for me (and my employer, but these are still all my views and opinions and all that). As the service starts, it tries to open and bind all cams and writes their names to the shared file. Access permissions on Windows are generally determined by the parent folder, so reading the public camfile should just work as intended right away. And now from any other program, you can tap into the cameras by using the MMF. Open it using "Global\\shmemcam_name_here" and you should be good! Naturally, using WinMMF it’s winmmf::MemoryMappedFile::open(size, shmemcam_name, winmmf::Namespace::GLOBAL, true, None).unwrap().
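
From the consumer side, resolving that camfile could look like the sketch below. The file name is one I made up; only the %PUBLIC% lookup with its C:/Users/Public fallback comes from the actual behavior described above:

use std::path::PathBuf;

/// Resolve the shared camera list. The file name is hypothetical.
fn camfile_path() -> PathBuf {
  std::env::var_os("PUBLIC")
    .map(PathBuf::from)
    .unwrap_or_else(|| PathBuf::from("C:/Users/Public"))
    .join("shared_cams.txt")
}

fn main() {
  // Each name can then go into MemoryMappedFile::open with Namespace::GLOBAL.
  for name in std::fs::read_to_string(camfile_path()).unwrap_or_default().lines() {
    println!("found shared cam: Global\\{name}");
  }
}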

The future for WinMMF

Currently I have no immediate plans to continue work on WinMMF. It seems stable enough, and I do have other things to work on. That said, if it gains any traction and people start using it, I might look into async support on at least the spinning and locking stuff. Currently it wouldn’t be hard to just start it as a task using e.g. tokio::spawn, but I can imagine people preferring to just have an easy await mmf.read_async(...) at some point. After all, it should be very fast to do the actual blocking part of the read. It’s little more than a memcpy and we can do the rest while spinning!
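
Until that exists, something like this gives you an awaitable read today. The MemoryMappedFile here is a stand-in with an assumed blocking read method; the point is that spawn_blocking keeps the spinning off the async executor’s worker threads:

use std::sync::Arc;

// Stand-in type; assume some blocking read that locks, memcpys, and unlocks.
struct MemoryMappedFile;
impl MemoryMappedFile {
  fn read(&self, count: usize) -> Vec<u8> {
    vec![0; count]
  }
}

/// A poor man's read_async: park the blocking read on tokio's blocking pool.
async fn read_async(mmf: Arc<MemoryMappedFile>, count: usize) -> Vec<u8> {
  tokio::task::spawn_blocking(move || mmf.read(count))
    .await
    .expect("the read shouldn't panic")
}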

I’d also like to make it easier to modify bits and pieces like the lock impl at some point. And I really should provide some way of using the standard lock that’s more ergonomic than None::<fn(&dyn MMFLock, _) -> _>, because that’s just ugly to use. Perhaps I can make it something the MMF wrapper provides as a field? It’s not something I’m in a rush to fix though. I expect people much better than me at this kind of thing will either create a better wrapper, or that this just isn’t the way people want their low overhead FFI to work. A damn shame if you ask me, but I can’t force people to stop using the TCP stack.

Perhaps I’ll even self-nominate it for This Week In Rust’s Crate of the Week or Call for Participation in the hopes of attracting some more design and development help. But that’s all in the future. For now, I’ll just make prod dependent on it for the ages to come. The preceding software is by now old enough to drink in any country where alcohol is legal. Now imagine some jank Rust code managing that!

