Monitoring & Controlling / Tool · December 29, 2025

ZORO

ZORO is a zero-shot multimodal framework that enables process mining analysis of robotic behavior by leveraging foundation models to perform activity recognition from visual, auditory, and textual data. The framework includes a fusion module that integrates activities across modalities to produce the final holistic event log. ZORO runs foundation models locally, preserving the privacy of fine-grained multimodal data.

The Framework

ZORO follows a pipeline-based architecture that transforms fine-grained multimodal data into structured event logs. Each modality is processed independently of the others and yields its own modality-specific event log. Depending on the available data and the requirements of the analysis, the system can operate on all supported modalities or on any subset of them. It also does not assume a one-to-one correspondence between modalities and inputs, so multiple inputs of the same modality (for example, more than one video recording) can be processed. This design lets ZORO adapt to heterogeneous sensing configurations without imposing strict assumptions on the number or type of available inputs. Finally, the fusion module integrates the modality-specific event logs into a single multimodal event log.
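To make the fusion step concrete, the following is a minimal sketch and not ZORO's actual implementation. It assumes that each modality-specific event log is a list of events carrying a case identifier, an activity label, a timestamp, and a source tag. The `Event` class, the `fuse` function, and the two-second window are hypothetical names and values chosen for illustration. The strategy shown (chronological merge, then collapsing consecutive events with the same activity inside a short window) is only one possible fusion strategy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """One row of an event log (hypothetical schema, not ZORO's)."""
    case_id: str
    activity: str
    timestamp: datetime
    source: str  # illustrative tag for the input that produced the event


def fuse(logs: Iterable[List[Event]],
         window: timedelta = timedelta(seconds=2)) -> List[Event]:
    """Merge any number of per-input logs into one multimodal log.

    Events are ordered by case and time. An event is dropped when the
    previously kept event has the same case and activity and lies within
    `window`, on the assumption that two inputs saw the same action.
    """
    merged = sorted(
        (event for log in logs for event in log),
        key=lambda e: (e.case_id, e.timestamp),
    )
    fused: List[Event] = []
    for event in merged:
        last = fused[-1] if fused else None
        if (
            last is not None
            and last.case_id == event.case_id
            and last.activity == event.activity
            and event.timestamp - last.timestamp <= window
        ):
            continue  # treated as a duplicate observation of the same action
        fused.append(event)
    return fused


# Illustrative data: two inputs that both observe the "pick" action.
t0 = datetime(2025, 1, 1, 12, 0, 0)
video_log = [
    Event("run1", "pick", t0, "video"),
    Event("run1", "place", t0 + timedelta(seconds=10), "video"),
]
audio_log = [Event("run1", "pick", t0 + timedelta(seconds=1), "audio")]

fused = fuse([video_log, audio_log])
for event in fused:
    print(event.case_id, event.activity, event.timestamp.isoformat(), event.source)
```

Because `fuse` accepts any iterable of logs, the same function covers a single modality, a subset of modalities, or several inputs of one modality, which mirrors the flexibility described above.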

The Implementation

The tool supports two complementary execution modes. An interactive mode is offered through a graphical user interface, which enables exploratory analysis, configuration of modalities, prompts, and fusion strategies, and inspection of final results. In addition, a batch mode is available to support automated analyses and experimentation, allowing the framework to be applied programmatically to collections of robotic data.
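As a rough illustration of what applying the framework programmatically to a collection of recordings could look like, the sketch below walks a directory of recordings, groups the files of each recording by modality, and hands them to an analysis callable. This is not ZORO's actual batch interface: `run_batch`, `group_inputs`, the `MODALITY_BY_SUFFIX` mapping, and the one-subdirectory-per-recording layout are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical mapping from file extension to modality.
MODALITY_BY_SUFFIX = {".mp4": "video", ".wav": "audio", ".txt": "text"}

Inputs = Dict[str, List[Path]]


def group_inputs(recording_dir: Path) -> Inputs:
    """Group the files of one recording by modality.

    Several files may map to the same modality, and a modality may be
    absent entirely; unknown file types are ignored.
    """
    groups: Inputs = {}
    for path in sorted(recording_dir.iterdir()):
        modality = MODALITY_BY_SUFFIX.get(path.suffix.lower())
        if modality is not None:
            groups.setdefault(modality, []).append(path)
    return groups


def run_batch(root: Path, analyze: Callable[[Inputs], object]) -> Dict[str, object]:
    """Apply `analyze` to every recording directory under `root`.

    `analyze` stands in for the full pipeline (per-modality recognition
    followed by fusion); recordings without usable inputs are skipped.
    """
    results: Dict[str, object] = {}
    for recording in sorted(p for p in root.iterdir() if p.is_dir()):
        inputs = group_inputs(recording)
        if inputs:
            results[recording.name] = analyze(inputs)
    return results
```

Passing the analysis as a callable keeps the loop independent of any particular model or fusion strategy, so the same driver can be reused across experiments with different configurations.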

The figure shows an example of the ZORO graphical user interface, where input data have been selected, a fusion strategy has been defined, and the analysis can be executed.