Uncategorized · December 1, 2023

Rainfall

Rainfall is an open-source dataflow programming framework that integrates mixed-methods analysis. It enables rapid prototyping of analysis pipelines through a user-friendly interface and is designed to ease the implementation and execution of data analysis pipelines.

The Rainfall framework consists of two parts: a web application for creating and running analysis pipelines, and a library called Rain, which enables mixed-methods analysis by integrating data analysis libraries.

The Rain library lets users define configurable nodes, each executing one or more functions, and compose them into a pipeline following a declarative approach. A Rain node is either a computational node, which implements an algorithm, or a connector node, which reads data from a source or writes data to a sink. Computational nodes support different data analysis techniques, such as Process Mining, Deep Learning and Machine Learning, using both standard and custom algorithms. Other computational nodes transform input data into formatted outputs that external tools such as PowerBI can visualize more efficiently. Connector nodes provide data access: they read from a variety of data storage solutions, including local directories, databases (e.g., MongoDB), and data warehouses (e.g., Google Storage), and they can write data back to the same systems, ensuring a two-way flow of information.
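
The declarative composition described above can be illustrated in plain Python. This is only a minimal sketch and does not use the Rain API: `Node`, `Pipeline`, and the three example functions are hypothetical names introduced here. It assumes a pipeline is a set of configured nodes plus directed edges, with the execution order derived from those edges.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a configurable node (not the Rain class)."""
    node_id: str
    func: Callable[..., Any]
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Pipeline:
    """Nodes plus directed edges; run order is derived from the edges."""
    nodes: List[Node]
    edges: List[Tuple[str, str]]  # (upstream id, downstream id)

    def run(self) -> Dict[str, Any]:
        by_id = {n.node_id: n for n in self.nodes}
        upstream: Dict[str, List[str]] = {n.node_id: [] for n in self.nodes}
        for src, dst in self.edges:
            upstream[dst].append(src)

        results: Dict[str, Any] = {}
        pending = list(by_id)
        while pending:
            # A node is ready once every upstream node has produced a result.
            ready = [i for i in pending if all(u in results for u in upstream[i])]
            if not ready:
                raise ValueError("pipeline contains a cycle")
            for i in ready:
                inputs = [results[u] for u in upstream[i]]
                results[i] = by_id[i].func(*inputs, **by_id[i].params)
                pending.remove(i)
        return results


def read_rows(rows):
    """Connector-style node: reads from an in-memory 'source'."""
    return list(rows)


def keep_at_least(rows, minimum):
    """Computational-style node: filters the incoming rows."""
    return [r for r in rows if r >= minimum]


def write_rows(rows, sink):
    """Connector-style node: writes to an in-memory 'sink'."""
    sink.extend(rows)
    return len(rows)


sink: List[int] = []
pipeline = Pipeline(
    nodes=[
        Node("source", read_rows, {"rows": [3, 8, 1, 9]}),
        Node("filter", keep_at_least, {"minimum": 5}),
        Node("sink", write_rows, {"sink": sink}),
    ],
    edges=[("source", "filter"), ("filter", "sink")],
)
results = pipeline.run()
```

The pipeline is described only by what each node is and how the nodes are connected. The order of execution is worked out by `run`, which is the sense in which the approach is declarative.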

Apart from the library, the framework also includes a web application that communicates with the library and exposes its functionality through REST APIs. Users interact with Rain through the frontend of the web application, which provides an intuitive UI for designing and executing pipelines. Execution is managed by asynchronous workers that monitor a task queue, which is populated with the requests sent when users launch pipelines. The task queue holds each execution request until a worker is available to run it, acting as an intermediary between the backend of the web application and the workers. The workers free the backend from the responsibility of executing pipelines, which improves the overall scalability of the framework. When a worker takes charge of a pipeline execution, it loads the corresponding metadata from the main database and sends back the logs produced throughout the run, allowing users to monitor the status of the pipeline in real time.
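
The worker and task queue arrangement described above can be sketched with the Python standard library. This is an in-process illustration only: in the framework the workers and the queue run as separate services, and the names `run_requests`, `Request`, and `LogEntry` are hypothetical. Each request is assumed to carry a pipeline id and an ordered list of step names.

```python
import queue
import threading
from typing import List, Tuple

Request = Tuple[str, List[str]]   # (pipeline id, ordered step names)
LogEntry = Tuple[str, str, str]   # (pipeline id, step name, worker name)


def run_requests(requests: List[Request], n_workers: int = 2) -> List[LogEntry]:
    """Queue the requests and let a pool of worker threads consume them."""
    task_queue = queue.Queue()
    logs: List[LogEntry] = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            request = task_queue.get()   # blocks until a request is available
            if request is None:          # sentinel: no more work for this worker
                break
            pipeline_id, steps = request
            for step in steps:
                # Stand-in for running a step and sending a log line back.
                with lock:
                    logs.append((pipeline_id, step, name))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for request in requests:
        task_queue.put(request)          # the backend only enqueues; it never executes
    for _ in threads:
        task_queue.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return logs


logs = run_requests([
    ("p1", ["load", "train"]),
    ("p2", ["load", "export"]),
])
```

Each request is handled by a single worker, so the log entries for one pipeline stay in order, while entries from different pipelines may interleave.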
The web application also offers features that simplify the management of pipeline definitions, which can be persisted in the main database for later reuse and reconfiguration. Once a pipeline is defined, it can be saved in a repository, so users can keep track of multiple versions of the same pipeline that differ in how nodes are configured or how they access data. Repositories can be shared, allowing collaboration between multiple users and organizations.
Moreover, a pipeline can be downloaded as a Python script. This provides both portability and transparency, since users can inspect the actual code executed at runtime.
Another major feature of the web application is the custom node editor. Rain was designed so that the set of available nodes is extensible, making it easy to implement new nodes and use them right away in a pipeline definition. New nodes can be added either through the custom node editor or by extending the library directly, a process that is well documented.
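
One common way to make a set of nodes extensible is a registry that new node functions add themselves to. The sketch below illustrates that general idea only. It is not how Rain or its custom node editor is implemented, and `NODE_REGISTRY`, `register_node`, and `run_node` are hypothetical names.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a node name to the function that implements it.
NODE_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_node(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that makes a function available as a node under `name`."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        if name in NODE_REGISTRY:
            raise ValueError(f"node {name!r} is already registered")
        NODE_REGISTRY[name] = func
        return func
    return decorator


@register_node("uppercase")
def uppercase(values):
    """Example custom node: upper-cases every string it receives."""
    return [v.upper() for v in values]


def run_node(name: str, *args: Any, **kwargs: Any) -> Any:
    """Look a node up by name and execute it."""
    return NODE_REGISTRY[name](*args, **kwargs)


result = run_node("uppercase", ["a", "b"])
```

Because nodes are looked up by name, a newly registered node can be used in a pipeline definition as soon as it has been registered.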

The web application, main database, workers, and task queue each run in a separate Docker container, which improves the scalability and portability of the system. Container scaling rules are set through a container orchestrator (i.e., Kubernetes), which can also handle resource allocation and recovery in case of failure.