A good point of reference for jog wheels on hardware is the oscilloscope. It's been a standard input there for decades now, both for manipulating values and navigating menus. Some oscilloscope conventions don't make as much sense for the camera (e.g. knob press to go to trigger time), but press to toggle/change selection and press to change scroll speed both seem like natural UI options.
Press+hold is generally unwieldy, so I'm against that, but I could be on board with a long press to open a meta-selection menu, then another long press to exit it. For navigating a video, for example, a long press would bring up a menu showing all the speed selections you can choose from, while a short press would immediately toggle between 2 (or the proposed 4) of them. That would allow you to have some "quick" options, but also choose from all the options with minimal extra hassle.
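To make the short-press/long-press idea concrete, here's a rough Python sketch. Everything in it is made up for illustration: the `SpeedSelector` class, the 0.6 s long-press threshold, and the speed lists are all assumptions, not anything from the camera's actual software. I've also assumed that the long press that exits the menu applies whatever is highlighted, and that short presses do nothing while the menu is open.

```python
LONG_PRESS_SECONDS = 0.6  # assumed threshold, would need tuning on real hardware


class SpeedSelector:
    """Hypothetical model of the knob-press behaviour described above."""

    def __init__(self, all_speeds, quick_speeds):
        self.all_speeds = list(all_speeds)      # everything shown in the menu
        self.quick_speeds = list(quick_speeds)  # the 2 (or 4) short-press options
        self.current = self.quick_speeds[0]
        self.menu_open = False
        self.menu_index = 0

    def press(self, held_seconds):
        """Handle a knob press that was held for `held_seconds`."""
        if held_seconds >= LONG_PRESS_SECONDS:
            if self.menu_open:
                # Long press inside the menu: apply the highlighted speed and exit.
                self.current = self.all_speeds[self.menu_index]
                self.menu_open = False
            else:
                # Long press outside the menu: open it on the current speed.
                self.menu_index = self.all_speeds.index(self.current)
                self.menu_open = True
        elif not self.menu_open:
            # Short press: cycle through the quick options.
            if self.current in self.quick_speeds:
                i = self.quick_speeds.index(self.current)
                self.current = self.quick_speeds[(i + 1) % len(self.quick_speeds)]
            else:
                # Current speed came from the full menu; jump to the first quick one.
                self.current = self.quick_speeds[0]

    def turn(self, clicks):
        """While the menu is open, turning the knob moves the highlight."""
        if self.menu_open:
            last = len(self.all_speeds) - 1
            self.menu_index = min(max(self.menu_index + clicks, 0), last)
```

With `SpeedSelector([1, 2, 5, 10, 50, 100], [1, 10])`, short presses flip between 1 and 10, while long press, two clicks, long press lands you on 5.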
As for the issue of #frames vs %buffer: in StepMania, speed is controlled in two ways that are directly comparable: fractions of the default, and constants/fixed speeds. Rather than offering a fixed set of presets of each type, they let you select the mode type and the number independently, kind of like units. So for the camera, you could be in 1 Frame mode, then change the units to %, and now your knob moves 1% of the buffer each click instead. This means your UI only needs 2 input fields to cover every possible combination.
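Here's a small Python sketch of the "number + units" idea, just to show how little state it needs. The names (`JogStep`, `frames_per_click`, `turn`) are hypothetical, and I'm assuming a percent step rounds to the nearest frame with a minimum of one frame per click, and that the position clamps at the ends of the buffer.

```python
from dataclasses import dataclass


@dataclass
class JogStep:
    amount: int  # the number field, e.g. 1
    unit: str    # the units field: "frames" or "percent"

    def frames_per_click(self, buffer_frames: int) -> int:
        """Convert the two UI fields into a frame count for one knob click."""
        if self.unit == "frames":
            return self.amount
        if self.unit == "percent":
            # Assumption: always move at least one frame per click.
            return max(1, round(buffer_frames * self.amount / 100))
        raise ValueError(f"unknown unit: {self.unit}")


def turn(position: int, clicks: int, step: JogStep, buffer_frames: int) -> int:
    """Return the new frame position after `clicks` knob clicks (negative = back)."""
    new_position = position + clicks * step.frames_per_click(buffer_frames)
    # Clamp to the recorded buffer.
    return min(max(new_position, 0), buffer_frames - 1)
```

For a 2000-frame buffer, `JogStep(1, "frames")` moves 1 frame per click, and switching only the unit to `"percent"` makes the same knob move 20 frames per click.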
I'm definitely not saying we should model everything after oscilloscopes or dance video games. I'm just trying to provide examples of where these sorts of problems also exist, and how their solutions might help inform the design for the camera.