How does that even work technically? macOS doesn't support multiple cursors. On native Cocoa apps you can pass input to a window without raising it via command+click, so possibly they synthesized those events, but fewer and fewer apps support that these days. And AppleScript is basically dead, so they can't be using that either.
I also read they acquired the Sky team (who I think were former Apple employees). No wonder they were able to pull off something so slick.
I remember trying to build something like this 6 years ago[0]. There are some interesting APIs for injecting click/keystroke events directly into Cocoa, and other APIs for reading framebuffers for apps that aren't in the foreground.
In particular, I found some prior art for this in the OpenQwaQ project, a GPLv2 3D virtual world project in Squeak/Smalltalk started by Alan Kay[1] back in 2011.
If I recall correctly, it worked well for native apps, but didn't work well for Chromium/Electron apps because they would use an API for grabbing the global mouse position rather than reading coordinates from events.
Which specific ones, though, allow you to send input to a window without raising it? People have been trying to do "focus follows mouse [without auto raise]" for a long time on mac, and the synthetic event equivalent to command+click is the only discovered method I'm aware of, e.g. used in https://github.com/sbmpost/AutoRaise
There is also this old blog post by Yegge [1] which mentions `AXUIElementPostKeyboardEvent`, but that had plenty of bugs and I haven't seen anyone else build on it. I guess the modern equivalent is `CGEventPostToPSN`/`CGEventPostToPid`. It's a good candidate though; perhaps the Sky team they acquired knows the right private APIs to get this working.
Edit: The thread at [2] also has some interesting tidbits, such as Automator.app having "Watch Me Do" which can also do this, and a CLI tool that claims to use the CGEventPostToPid API [3]. Maybe there are more ways to do it than I realized.
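For the record, the rough shape of what I mean is something like this untested sketch (the pid is just a placeholder, and as noted these calls need the accessibility permission and may simply not be honored by many apps):

    import CoreGraphics

    // Untested sketch: post a cmd+click and a keystroke to a specific process
    // without focusing it. The pid is a placeholder.
    let targetPid: pid_t = 12345

    // cmd+click at a point (global/display coordinates)
    let point = CGPoint(x: 200, y: 300)
    if let mouseDown = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                               mouseCursorPosition: point, mouseButton: .left),
       let mouseUp = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                             mouseCursorPosition: point, mouseButton: .left) {
        mouseDown.flags = .maskCommand
        mouseUp.flags = .maskCommand
        mouseDown.postToPid(targetPid)   // Swift name for CGEventPostToPid
        mouseUp.postToPid(targetPid)
    }

    // a single keystroke ('a', virtual keycode 0)
    if let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true),
       let keyUp = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false) {
        keyDown.postToPid(targetPid)
        keyUp.postToPid(targetPid)
    }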
Could you elaborate on what you mean? My understanding of the Cocoa event loop was that ultimately everything is received as an NSEvent at the application layer (maybe that's wrong though).
Do you mean that you can just call AXUIElementPerformAction once you have a reference to the target element, and the OS will internally synthesize the right type of event, even if the app isn't in the foreground?
Yes, you can do a lot of background UI interaction using the AX APIs. Displaying a second cursor is also simple: just a borderless, transparent window that moves around.
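For the AX part, roughly something like this (off the top of my head and untested; the pid is a placeholder, and real code should check AXIsProcessTrusted() and walk the hierarchy properly):

    import ApplicationServices

    // Rough sketch: press the first AXButton in the first window of a background app.
    let pid: pid_t = 12345
    let app = AXUIElementCreateApplication(pid)

    var windowsRef: CFTypeRef?
    if AXUIElementCopyAttributeValue(app, kAXWindowsAttribute as CFString, &windowsRef) == .success,
       let window = (windowsRef as? [AXUIElement])?.first {
        var childrenRef: CFTypeRef?
        if AXUIElementCopyAttributeValue(window, kAXChildrenAttribute as CFString, &childrenRef) == .success,
           let button = (childrenRef as? [AXUIElement])?.first(where: { (element: AXUIElement) -> Bool in
               var role: CFTypeRef?
               AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &role)
               return (role as? String) == kAXButtonRole
           }) {
            // The target app performs the action itself; it never comes to the
            // foreground and no NSEvent is synthesized.
            AXUIElementPerformAction(button, kAXPressAction as CFString)
        }
    }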
For the few things you cannot achieve with the Accessibility APIs, there are ways to post events directly to an app, even though CGEventPostToPid is mostly broken when used on its own. These require a combination of CGEventPostToPid and CGEventTapCreateForPid. (I have done a lot of this stuff in my BetterTouchTool app.)
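The combination looks roughly like this (from memory, so treat it as a sketch rather than working code; the pid is a placeholder):

    import CoreGraphics

    // Sketch: create a listen-only event tap scoped to the target pid before
    // posting to it; posting on its own often doesn't get delivered.
    let targetPid: pid_t = 12345

    let callback: CGEventTapCallBack = { _, _, event, _ in
        Unmanaged.passUnretained(event)   // observe only, pass events through
    }

    if let tap = CGEvent.tapCreateForPid(pid: targetPid,
                                         place: .tailAppendEventTap,
                                         options: .listenOnly,
                                         eventsOfInterest: CGEventMask(1 << CGEventType.keyDown.rawValue),
                                         callback: callback,
                                         userInfo: nil) {
        let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
        CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
        CGEvent.tapEnable(tap: tap, enable: true)

        // now posting to the pid has a chance of being honored
        if let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true) {
            keyDown.postToPid(targetPid)
        }
    }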
Neat, good to know! And it does seem my mental model of the event loop was broken: accessibility-related interactions don't have any corresponding NSEvent.
They are handled as part of the "conceptual" run loop, but they seem to be dispatched internally by the AXRuntime library from a callback off some mach port. And because of this, the call to nextEventMatchingEventMask in the main -[NSApplication run] loop never even sees any such NSEvent.
-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (in AppKit)
_DPSNextEvent (in AppKit)
_BlockUntilNextEventMatchingListInModeWithFilter (in HIToolbox)
ReceiveNextEventCommon (in HIToolbox)
RunCurrentEventLoopInMode (in HIToolbox)
CFRunLoopRunSpecific (in CoreFoundation)
__CFRunLoopRun (in CoreFoundation)
__CFRunLoopDoSource1 (in CoreFoundation)
__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ (in CoreFoundation)
mshMIGPerform (in HIServices)
_XPerformAction (in HIServices)
_AXXMIGPerformAction (in HIServices)
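An easy way to see this for yourself (small sketch, assuming you set the app's principal class to this subclass): override sendEvent(_:) and then trigger a button from another process via AXUIElementPerformAction. The button's action runs, but nothing is ever logged here, because the request arrives via the AXRuntime mach-port source rather than the event queue.

    import AppKit

    // Sketch: log every NSEvent the app dispatches. AX-triggered actions from
    // another process never pass through here.
    class LoggingApplication: NSApplication {
        override func sendEvent(_ event: NSEvent) {
            NSLog("NSEvent type=\(event.type.rawValue)")
            super.sendEvent(event)
        }
    }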
In some sense this is similar to Apple events, which are also "hidden" from the caller of nextEventMatchingEventMask. From what I can see, those are handled by DPSNextEvent, which sorts based on the raw Carbon EventRef: aevt types have `AEProcessAppleEvent` called on them and the event is then consumed silently, while others get converted to a CGEvent and returned to the caller to handle. But of course accessibility events didn't exist in Classic Mac OS, so they couldn't be handled at this layer and were pushed further down. You can almost see the historical legacy here.
Maybe they used Claude to come up with a good method to do this. /s
But I was also wondering how this even works. The AI agent can have its own cursors and none of its actions interrupt my own workflow at all? Maybe I need to try this.
Also, this sounds like it would be very expensive, since from my understanding each app frame needs to be analysed as an image first, which is pretty token-intensive.