Microsoft has released OmniParser V2, which reportedly achieves the best performance in GUI parsing. Details are in the repository linked below.
It is not a multimodal LLM but a visual parsing model specialized in detecting GUI elements, and it can be combined with any LLM, including Claude.
It would be great to add support for OmniParser V2 in Claude-Debugs-For-You, so that it could debug the visual output of GUIs and web apps!
https://github.com/microsoft/OmniParser/