WinMMF - Returning to a Finished MMF Wrapper
Riven Skaye / November 2025
After making sure it all ran well and stable, I added some more commits in the months that followed. Some polish here, ironing out creases and unergonomics in the locking mechanism there, and looking into Large Page support. Large Pages let you grab big contiguous areas of memory to help speed things up. You get sequential reads and writes, magical alignment promises from the OS, and, due to how memory and caching work, improved performance. The mechanism goes by different names on different OSes: Huge Pages (Linux), Super Pages (BSDs, macOS), Large Pages (Windows, AIX). For more in-depth info on what Large Pages are, there are links to the relevant Win32 docs throughout this article. Alternatively, take a peek at IBM's AIX documentation on them.
The goal, and how I got there
Having the cameras in shared memory was great. I even wrote a Windows service to expose it once I had a proof of concept working. But I figured there had to be a way to prevent the OS from deciding to either kill the service or swap the memory out when memory was getting scarce. I also noticed the service struggling when it was started very late. The Win32 documentation has a list of privilege constants with a very relevant entry: SE_LOCK_MEMORY_NAME, which is listed as a requirement for locking pages in memory. I ended up looking back at CreateFileMapping, figuring I should see what the options were, when I happened across “Creating a File Mapping Using Large Pages”, which mentions requiring SeLockMemoryPrivilege in its very first paragraph. I also noticed the note on that page telling me that my earlier use of the SEC_LARGE_PAGES flag had been ignored altogether, which is why it had worked without privilege elevation. Bummer!
A quick sidestep to the linked page on Large-Page Support told me that the performance gain on Windows comes from Large Pages using only a single translation lookaside buffer (TLB) entry for any access to an area typically three orders of magnitude larger than the native page size. That is huge for memory that we're writing to at least several times a second! The memory is also guaranteed to be contiguous if allocation succeeds, so there are no fragmentation shenanigans to deal with either. Add to that the fact that it's always read-write (readers still write the lock bytes) and non-pageable by default, and you have the perfect method of passing around complete video frames. The current design also doesn't allow for partial (re)mapping of the MMF, nor does it allow sliding the view around. So considering we already need a huge block, this is just free performance for any setup that can actually leverage Large Pages!
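One practical consequence worth noting: a Large Page mapping has to be sized in multiples of the Large Page minimum (the value GetLargePageMinimum reports, commonly 2 MiB on x86-64). A minimal sketch of that rounding, with the minimum passed in as a plain parameter rather than queried from the OS:

```rust
/// Round `size` up to the next multiple of `min`, where `min` is the
/// Large Page minimum (as GetLargePageMinimum would report it).
/// Assumes `min` is a power of two, which it is on every platform I know of.
fn round_to_large_page(size: usize, min: usize) -> usize {
    assert!(min.is_power_of_two());
    (size + min - 1) & !(min - 1)
}

fn main() {
    let min = 2 * 1024 * 1024; // assumed 2 MiB Large Page minimum
    assert_eq!(round_to_large_page(1, min), min); // even 1 byte costs a full page
    assert_eq!(round_to_large_page(min, min), min); // exact multiples stay put
    assert_eq!(round_to_large_page(min + 1, min), 2 * min);
    println!("ok");
}
```

This is also why the "is the Large Page minimum smaller than the required size" check later on makes sense: below that threshold you'd be paying for memory you never use.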
Implementing it sanely
Since I’d already decided I’d run the writer and initializer as a service, that meant I had the relevant account privileges. The page about creating Large Page file mappings explained how to enable the privilege for the current process, in C++. And this was one of the bigger challenges here, because even with
the windows-rs and windows-sys crates, not everything maps 1:1 to Rust. Luckily the examples for windows-sys include privilege mangling, which was surprisingly
easy to clobber into windows-rs code. Shoutout to Kenny Kerr’s solid samples
for being easy to follow. Between that code and the C++ example in the Win32 docs, I ended up with all of 7 lines of code for the actual privilege change.
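For flavor, here is roughly what that privilege change looks like when reconstructed from the C++ example in the Win32 docs. The raw `extern "system"` declarations below stand in for the windows-rs/windows-sys bindings, and the struct and function names are my own reconstruction, not WinMMF's actual code; the FFI module only compiles on Windows.

```rust
/// Encode a &str as a NUL-terminated UTF-16 buffer for Win32 W-functions.
fn wide(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

#[cfg(windows)]
mod privilege {
    use std::ffi::c_void;
    use std::ptr;

    #[repr(C)]
    struct Luid { low: u32, high: i32 }
    #[repr(C)]
    struct LuidAndAttributes { luid: Luid, attributes: u32 }
    #[repr(C)]
    struct TokenPrivileges { count: u32, privileges: [LuidAndAttributes; 1] }

    const SE_PRIVILEGE_ENABLED: u32 = 0x0000_0002;
    const TOKEN_ADJUST_PRIVILEGES: u32 = 0x0020;
    const TOKEN_QUERY: u32 = 0x0008;

    #[link(name = "advapi32")]
    extern "system" {
        fn OpenProcessToken(process: *mut c_void, access: u32, token: *mut *mut c_void) -> i32;
        fn LookupPrivilegeValueW(system: *const u16, name: *const u16, luid: *mut Luid) -> i32;
        fn AdjustTokenPrivileges(
            token: *mut c_void, disable_all: i32, new_state: *const TokenPrivileges,
            len: u32, prev: *mut TokenPrivileges, ret_len: *mut u32,
        ) -> i32;
    }
    #[link(name = "kernel32")]
    extern "system" {
        fn GetCurrentProcess() -> *mut c_void;
        fn CloseHandle(handle: *mut c_void) -> i32;
        fn GetLastError() -> u32;
    }

    /// Enable SeLockMemoryPrivilege on the current process token.
    pub fn enable_lock_memory() -> Result<(), u32> {
        let name = super::wide("SeLockMemoryPrivilege");
        unsafe {
            let mut token = ptr::null_mut();
            if OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &mut token) == 0 {
                return Err(GetLastError());
            }
            let mut luid = Luid { low: 0, high: 0 };
            if LookupPrivilegeValueW(ptr::null(), name.as_ptr(), &mut luid) == 0 {
                CloseHandle(token);
                return Err(GetLastError());
            }
            let state = TokenPrivileges {
                count: 1,
                privileges: [LuidAndAttributes { luid, attributes: SE_PRIVILEGE_ENABLED }],
            };
            let ok = AdjustTokenPrivileges(token, 0, &state, 0, ptr::null_mut(), ptr::null_mut());
            // AdjustTokenPrivileges can return success without assigning
            // anything, so GetLastError() is the real verdict here.
            let err = GetLastError();
            CloseHandle(token);
            if ok == 0 || err != 0 { Err(err) } else { Ok(()) }
        }
    }
}

fn main() {
    #[cfg(windows)]
    match privilege::enable_lock_memory() {
        Ok(()) => println!("SeLockMemoryPrivilege enabled"),
        Err(code) => println!("privilege change failed: Win32 error {code}"),
    }
    #[cfg(not(windows))]
    println!("Windows-only sketch; wide(\"x\") has len {}", wide("x").len());
}
```

The actual change in WinMMF is far shorter because windows-rs provides all of the types and constants above; the open-token, lookup-LUID, adjust-privileges dance is the part that survives in those 7 lines.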
After implementing the change I noticed that starting the service on a very busy system was a pain. Makes sense, considering I had an IDE, two dozen browser tabs, and all kinds of other stuff open, actively using most of the memory on my system at that point. rust-analyzer and cargo tend to use a lot of memory, and frequent rebuilds have a habit of making them work a lot. One quick reboot later, I started by running the tests. That's when I also noticed normal pages being slower to acquire. Some digging and playing with LLDB quickly showed that Large Page support was adding a fair bit of code inside the init function, at which point I decided that only the Large Page path should pay the cold-path penalty at init, keeping both kinds of pages fast in actual use. If the situation doesn't require Large Pages, it shouldn't be slower to run. And if you do require Large Pages, it's a one-time hit.
I then decided to add a parameter to control whether or not Large Pages would be selected by default. Users might have a reason to need them, even if it's just for locking the page. Likewise, users might want to share large chunks of memory while explicitly allowing it to be paged out, for example if the shared memory isn't frequently accessed and would otherwise get in the way of normal system operation. But I really liked the auto-selection behavior as well, not least because it allows users to not care and have WinMMF select it for whatever size they happen to need at runtime. It's nice to have, and if you don't know what you need, it's nice when a library can help you pick the best option. But the explicit off switch is very useful for the case where you know a huge block of memory is going to get requested in a situation where the privilege is unavailable. After all, why try when you know you can't? So I landed on a three-state solution to the problem: Option<bool>. This effectively allows you to short-circuit your way out of the following set of checks:
1. Large Pages aren’t explicitly disabled (Some(false))
2. Two-part check:
   a. the Large Page minimum is equal to or smaller than the required size, OR
   b. the user explicitly requests Large Pages
3. Large Pages are available on this platform
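Assuming a hypothetical helper that takes the Large Page minimum and a platform-support flag as plain parameters (in the real crate these come from the OS and from compile-time checks), the chain above condenses to:

```rust
// Sketch of the three-state selection; `min` stands in for the value
// GetLargePageMinimum would report and `supported` for a platform check.
fn use_large_pages(pref: Option<bool>, size: usize, min: usize, supported: bool) -> bool {
    if pref == Some(false) {
        return false; // 1. explicit opt-out short-circuits everything
    }
    if !(min <= size || pref == Some(true)) {
        return false; // 2a. too small to benefit, and 2b. not explicitly requested
    }
    supported // 3. platform availability is checked last
}

fn main() {
    let min = 2 * 1024 * 1024; // assumed 2 MiB Large Page minimum
    assert!(!use_large_pages(Some(false), 64 * min, min, true)); // opted out
    assert!(use_large_pages(None, 4 * min, min, true));          // big enough, auto
    assert!(use_large_pages(Some(true), 1, min, true));          // small but requested
    assert!(!use_large_pages(None, 1, min, true));               // small, no request
    assert!(!use_large_pages(Some(true), 4 * min, min, false));  // unsupported platform
    println!("ok");
}
```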
The ordering of those checks is a conscious decision. For 2a we already know the user isn’t against them, so we prefer them if the size is applicable and otherwise only use them if requested. The check for availability being at the end is also a well-thought-out choice, most platforms support the feature these days and the check is mainly there to protect users in the event of some code not disabling it when compiled for a platform where it’s not available. This all worked. I ran it, it didn’t crash or error, so I modified the tests and pushed it. And then it did start crashing when I ran them.
System Support vs User Permissions
Yeah, turns out that normal users don’t have the correct privileges to lock memory or request Large Pages. And this is fine, this is the natural order of things. Except sometimes on some work systems you try some stuff. And then you forget that your user does have those magical permissions. So I had set those perms back at some point, and when I reran the tests I got loud errors. There’s a lesson here, kids. One taught by the Rust language as a whole, that I’ve learned well enough to be using it in other languages to prevent dealing with any raised or thrown errors. Allow me to quote relevant literature:
[…] acknowledge the possibility of an error and take some action […]. This requirement makes your program more robust by ensuring that you’ll discover errors and handle them appropriately before deploying your code to production!
— The Rust Book: Error Handling
Fast-forward a bit, and now there's code with some nested checks. If we're using Large Pages and they weren't explicitly requested, we now fall back to normal pages if acquiring SeLockMemoryPrivilege fails. If Large Page support was explicitly requested, we error out instead of letting the program crash. I wager that makes debugging misbehaving and misconfigured programs much easier. That's also about the point where I realized that my service runner will most likely be running with a very different permission set than the client applications reading camera frames. So I did some careful testing, and it turns out you can actually request access to the same MMF without Large Page support and it just works. Which kinda makes sense, considering every process after the creator is effectively doing nothing other than getting a pointer into the memory. It's there, it's allocated, and how the OS handles it internally is hidden from the lib itself. So we still get the performance benefits of Large Pages, even though we don't even know that we're getting them!
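The fallback boils down to a small decision table. In this sketch the privilege check is stubbed out as a boolean (in reality it's the result of the AdjustTokenPrivileges dance), and the names are illustrative rather than WinMMF's API:

```rust
#[derive(Debug, PartialEq)]
enum PageKind {
    Large,
    Normal,
}

// `explicit`: the user passed Some(true); `privilege_ok`: acquiring
// SeLockMemoryPrivilege succeeded.
fn resolve_pages(explicit: bool, privilege_ok: bool) -> Result<PageKind, &'static str> {
    if privilege_ok {
        Ok(PageKind::Large)
    } else if explicit {
        // Explicit request plus missing privilege: error out loudly instead
        // of letting the mapping call fail later in a harder-to-debug way.
        Err("Large Pages requested but SeLockMemoryPrivilege is unavailable")
    } else {
        // Auto-selected: quietly fall back to normal pages.
        Ok(PageKind::Normal)
    }
}

fn main() {
    assert_eq!(resolve_pages(false, true), Ok(PageKind::Large));
    assert_eq!(resolve_pages(false, false), Ok(PageKind::Normal));
    assert!(resolve_pages(true, false).is_err());
    println!("ok");
}
```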
Lock cleanup
I did say at the start that I ironed out some unergonomic creases in the locking code, didn’t I? Well, it’s the closing note for the changes I made to WinMMF itself. The locking code was mostly fine, had a default implementation and … allowed you to select the default impl without ever using it. By design, WinMMF creates the requested type of lock internally based on trait methods. That lock is a thin wrapper around an atomic, and enabling any of the concrete locking code enables all of it. You can choose between just getting the trait, or getting the full lock implementation here. So after another long, hard look at it, I decided to just use the instance methods and formally require the max tries parameter.
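To make the shape concrete, here's a heavily reduced sketch of that kind of lock: a thin wrapper around an atomic where the retry budget (`max_tries`) is a required parameter rather than a hidden default. The names and exact semantics are mine, not WinMMF's.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// A thin wrapper around an atomic, in the spirit of WinMMF's lock.
/// In the real crate this atomic lives inside the shared mapping so it
/// works across processes; here it's just a local for illustration.
struct MmfLock(AtomicU32);

impl MmfLock {
    const fn new() -> Self {
        MmfLock(AtomicU32::new(0))
    }

    /// Try to take the lock, giving up after `max_tries` attempts.
    fn try_lock(&self, max_tries: u32) -> Result<(), &'static str> {
        for _ in 0..max_tries {
            if self.0.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
                return Ok(());
            }
            std::hint::spin_loop();
        }
        Err("lock still held after max_tries attempts")
    }

    fn unlock(&self) {
        self.0.store(0, Ordering::Release);
    }
}

fn main() {
    let lock = MmfLock::new();
    assert!(lock.try_lock(8).is_ok());  // free lock: acquired immediately
    assert!(lock.try_lock(8).is_err()); // already held: budget runs out
    lock.unlock();
    assert!(lock.try_lock(1).is_ok());  // released: one try is enough
    println!("ok");
}
```

Making `max_tries` explicit at every call site is the ergonomic point: the caller has to decide how long a cross-process wait they can tolerate instead of inheriting a default they never chose.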
I will gladly accept any contributions that offer different kinds of locks and acquisition mechanisms. And I'm even open to changing the trait to make it easier for people to use different kinds of locks altogether. If a custom lock implementation requires more arguments, or radically different behavior that can't be expressed by bringing your own traits and structs, I'm all ears. Bikeshedding is fun, but dealing with cross-process locking has been a real challenge for me. This is, after all, my first rodeo with these OS internals and a lot of the other concepts involved in doing multi-process code crimes.
The actual closing note
As for related things I did that weren’t to WinMMF itself, there is an FFI crate for which I ended up making a PR to CySharp/csbindgen. They provide an amazing crate for generating bindings between Rust and C#, which makes it easier to call into Rust from a dotnet project by taking away a lot of the annoying dll import code and extern definitions. There was just one small problem with it. Remember when I mentioned the current situation depending on 32-bit binaries of things, and having to deal with that one way or another? Yeah, you get into the ick that is Rust foreign calling conventions. And also the ick that is dotnet calling conventions. And finally you get to the nightmare of existing code that should not break over a change intended to support people very explicitly making specific choices.
csbindgen did not offer different handling for different extern calling conventions on the Rust side of things, and would instead always generate Cdecl for the annotations and DllImport code. This is all fine and dandy until you do things with 32-bit processes on Windows. x86 Windows is full of choices MS made just to differentiate themselves from the rest (and probably to make it harder to use compilers made by anyone other than them or Borland). You see, the default calling convention on x86 Windows is stdcall. stdcall has the callee cleaning the stack, unlike cdecl (which in Rust really means whatever the compiler defaults to for the current target), which has the caller cleaning the stack. x86 supports cdecl just fine, but I had already taken into account the fact that x86 Windows is special-cased by the system extern definition, and I had already rolled with it for my existing code. So I could either randomly break all x86 consumers of WinMMF (myself) or fix what csbindgen emits.
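To make the stdcall/cdecl distinction concrete, here's the shape of an export under Rust's `system` ABI. On 32-bit x86 Windows `extern "system"` resolves to stdcall; on every other target it is identical to `extern "C"`, so this compiles and behaves the same everywhere. The function name is illustrative, not a real WinMMF export.

```rust
// `extern "system"` means stdcall on 32-bit x86 Windows and plain C ABI
// everywhere else, which is exactly the special-casing mentioned above.
#[no_mangle]
pub extern "system" fn winmmf_example_add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    // Within Rust the ABI is invisible; it only matters across the FFI
    // boundary, where the C# side's DllImport annotation must agree.
    assert_eq!(winmmf_example_add(2, 3), 5);
    println!("ok");
}
```

The bug class this guards against: on x86, a Cdecl DllImport calling an stdcall export (or vice versa) means caller and callee disagree on who cleans the stack, which corrupts it at runtime instead of failing at build time.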
There are other valid extern ABIs to specify on both ends of this interop story, but not all of them are actually supported in dotnet, not all of them are supported
on all hardware by anyone, and some of them are hard to even find any info about. So I forked csbindgen, cloned it, and hacked around to support other use cases
than just cdecl. I then proceeded to carefully lay out the choices I had made, and why I added panics into a build-only dependency.
I opened this PR to add calling conventions beyond cdecl, system, and stdcall because I figured that if I’m
adding the stuff I need, I might as well handle the other cases that dotnet supports and make sure users can’t accidentally mismatch them. The only ABI it doesn’t
handle is vectorcall. But that’s a non-issue anyway, since it’s already covered by the implicit disallows. And I even made sure that all existing users of
csbindgen that don’t do very freaky things should never even notice that these changes were merged in. At worst it adds a second to the compile times of projects that don't get different output. At best, it helps people catch things that should never have been allowed in the first place. And to be honest, it feels good knowing that I made a niche use case a little bit safer just by writing a few lines of Rust.
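The gist of that mapping can be sketched as a lookup from Rust extern ABI strings to the member names of dotnet's System.Runtime.InteropServices.CallingConvention enum. This is my own illustration of the idea, not the code the PR actually adds:

```rust
// Hypothetical mapping from Rust extern ABI strings to dotnet's
// CallingConvention enum member names.
fn dotnet_convention(rust_abi: &str) -> Option<&'static str> {
    match rust_abi {
        // extern "C" / "cdecl": caller cleans the stack.
        "C" | "cdecl" => Some("Cdecl"),
        // extern "stdcall": callee cleans the stack; the x86 Windows default.
        "stdcall" => Some("StdCall"),
        // extern "system": stdcall on 32-bit x86 Windows, cdecl elsewhere;
        // dotnet models the same per-platform behavior as Winapi.
        "system" => Some("Winapi"),
        "thiscall" => Some("ThisCall"),
        "fastcall" => Some("FastCall"),
        // vectorcall (among others) has no DllImport equivalent, so a
        // generator can reject it up front instead of emitting broken bindings.
        _ => None,
    }
}

fn main() {
    assert_eq!(dotnet_convention("system"), Some("Winapi"));
    assert_eq!(dotnet_convention("stdcall"), Some("StdCall"));
    assert_eq!(dotnet_convention("vectorcall"), None);
    println!("ok");
}
```

Returning `None` for unsupported ABIs is where the "users can't accidentally mismatch them" guarantee comes from: an unmappable convention becomes a build-time failure rather than stack corruption at runtime.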
Ramblings of an Alchemist by Riven Skaye is licensed under CC BY-SA 4.0 International