Data Extraction Methodology for EVM-based Smart Contracts

The application implements a methodology for extracting data from Ethereum smart contracts, including execution-related data and state changes. To this end, the methodology first collects the transactions of the contract and then extracts the state changes caused by each of them. This is done by replaying transactions inside the Ethereum Virtual Machine (EVM) and using the resulting traces to reconstruct the change history of the smart contract's variables.

The Application

Configuration. This step sets the parameters that identify the contract from which to extract data. The first parameter is the network to use, such as a particular mainnet or testnet. A block range is also required to restrict the interval of transactions to retrieve. Additional filters can be set on gas used, gas price, time interval, set of sender addresses, and set of executed functions. All these parameters are used in the transaction retrieval step to determine and filter the transactions to extract from the specified smart contract.
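
As an illustration, the parameters listed above could be grouped in a small immutable structure. This is only a sketch: the class name ExtractionConfig, its field names, and the use of inclusive (low, high) ranges are assumptions made for the example, not the application's actual configuration format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Inclusive (low, high) bounds, or None when the filter is not set.
Range = Optional[Tuple[int, int]]


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical container for the configuration parameters."""
    network: str                      # e.g. "mainnet" or a testnet name
    contract_address: str
    from_block: int
    to_block: int
    gas_used: Range = None
    gas_price: Range = None
    timestamp: Range = None           # Unix timestamps
    senders: FrozenSet[str] = frozenset()
    functions: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        # Reject inverted ranges early, before any data is fetched.
        if self.from_block > self.to_block:
            raise ValueError("from_block must not exceed to_block")
        for name in ("gas_used", "gas_price", "timestamp"):
            bounds = getattr(self, name)
            if bounds is not None and bounds[0] > bounds[1]:
                raise ValueError(f"{name}: lower bound exceeds upper bound")


config = ExtractionConfig(
    network="mainnet",
    contract_address="0x0000000000000000000000000000000000000000",  # placeholder
    from_block=1_000_000,
    to_block=1_000_100,
    gas_used=(21_000, 500_000),
    functions=frozenset({"transfer"}),
)
```

Leaving a filter at its default means that it does not restrict the selection.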

Get contract code. This step retrieves the source code of the target smart contract for later compilation. To this end, the contract address and the contract name are taken as input. To keep the procedure fully automated, the contract code is acquired directly if it is verified and publicly available; otherwise, the user can upload it manually.

Get contract transactions. This step collects all the transactions referring to the specified smart contract. First, the list of transactions within the defined block interval is retrieved. Then, the transactions are filtered according to the previously defined parameters, so that only those matching all parameters are selected for extraction.
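
The "match all parameters" rule can be sketched as a conjunction of optional predicates. The transaction field names used here (gasUsed, gasPrice, timestamp, from, functionName) and the function filter_transactions are assumptions for the example; the actual record layout depends on the data source.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple

Range = Optional[Tuple[int, int]]  # inclusive bounds, None = filter not set


def _in_range(value: int, bounds: Range) -> bool:
    return bounds is None or bounds[0] <= value <= bounds[1]


def filter_transactions(
    txs: Iterable[Dict[str, Any]],
    *,
    gas_used: Range = None,
    gas_price: Range = None,
    timestamp: Range = None,
    senders: Optional[Set[str]] = None,
    functions: Optional[Set[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep only the transactions that satisfy every filter that is set."""
    # Addresses are compared case-insensitively (checksummed vs. lowercase).
    senders_lc = {s.lower() for s in senders} if senders else None
    selected = []
    for tx in txs:
        if not _in_range(tx["gasUsed"], gas_used):
            continue
        if not _in_range(tx["gasPrice"], gas_price):
            continue
        if not _in_range(tx["timestamp"], timestamp):
            continue
        if senders_lc is not None and tx["from"].lower() not in senders_lc:
            continue
        if functions and tx["functionName"] not in functions:
            continue
        selected.append(tx)
    return selected
```

For example, filter_transactions(txs, gas_used=(0, 100_000), functions={"transfer"}) keeps only the transfer calls that used at most 100,000 gas.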

Compile contract. Once the smart contract source code is obtained, the Solidity compiler is used to get three particular outputs: (i) Application Binary Interface (ABI), (ii) Abstract Syntax Tree (AST), and (iii) storage layout.
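
One way to request these three outputs is the compiler's standard JSON interface, where the ABI and storage layout are selected per contract and the AST per source file. The sketch below only builds the input document and reads the three outputs back from a compiler response. The helper names build_solc_input and pick_outputs are illustrative, and how the application actually invokes the compiler is not specified here.

```python
import json
from typing import Any, Dict


def build_solc_input(file_name: str, source: str) -> Dict[str, Any]:
    """Standard JSON input asking for ABI, AST and storage layout."""
    return {
        "language": "Solidity",
        "sources": {file_name: {"content": source}},
        "settings": {
            "outputSelection": {
                "*": {
                    "*": ["abi", "storageLayout"],  # per-contract outputs
                    "": ["ast"],                    # per-file output
                }
            }
        },
    }


def pick_outputs(output: Dict[str, Any], file_name: str, contract_name: str) -> Dict[str, Any]:
    """Extract the three outputs from a standard JSON compiler response."""
    contract = output["contracts"][file_name][contract_name]
    return {
        "abi": contract["abi"],
        "storageLayout": contract["storageLayout"],
        "ast": output["sources"][file_name]["ast"],
    }


solc_input = json.dumps(build_solc_input("Token.sol", "contract Token { uint256 total; }"))
```

The resulting string can be passed on standard input to solc --standard-json. The file and contract names above are placeholders.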

Extract contract storage and internal transactions. This step captures the contract state changes by extracting the state variables updated during each transaction. For this purpose, each transaction is replayed in a local environment that reproduces the state of the blockchain in which the transaction was originally executed. This is done by cloning the block in which the transaction was included and using it to replay the transaction together with any transactions that precede it in the block. The replay returns the transaction trace, which contains, among other things, the list of executed operations (i.e., opcodes) and the state of the EVM (i.e., memory locations). In particular, the opcodes represent operations on memory, such as the inclusion of a new variable or the calculation of a storage index. The EVM state, instead, contains all the storage slots with their respective keys and values. To reconstruct the state variable changes, this information is matched against the storage layout to identify precisely which state variable was updated, including dynamic ones.
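
The matching between trace and storage layout can be sketched for the simplest case of statically sized variables. The example assumes a trace in the struct-log style (a list of steps, each with an "op" name and a "stack" of hex strings whose last element is the top) and a storage layout in the form emitted by the Solidity compiler (entries with label, slot, offset and type). The function names are illustrative. Dynamic variables (mappings, dynamic arrays) are left out on purpose, because their slots are derived with Keccak-256, which is not available in the Python standard library.

```python
from typing import Any, Dict, List


def sstore_writes(struct_logs: List[Dict[str, Any]]) -> Dict[int, int]:
    """Collect the last value written to each storage slot by SSTORE."""
    writes: Dict[int, int] = {}
    for step in struct_logs:
        if step["op"] == "SSTORE":
            # SSTORE pops the slot key first (top of stack), then the value.
            key = int(step["stack"][-1], 16)
            value = int(step["stack"][-2], 16)
            writes[key] = value
    return writes


def decode_static(writes: Dict[int, int], layout: Dict[str, Any]) -> Dict[str, int]:
    """Map written slots to the statically placed variables stored in them."""
    decoded: Dict[str, int] = {}
    for var in layout["storage"]:
        slot = int(var["slot"])
        if slot not in writes:
            continue
        size = int(layout["types"][var["type"]]["numberOfBytes"])
        # Several small variables can share one 32-byte slot; 'offset' is the
        # byte position of this variable counted from the low-order end.
        mask = (1 << (8 * size)) - 1
        decoded[var["label"]] = (writes[slot] >> (8 * var["offset"])) & mask
    return decoded
```

Two simplifications should be kept in mind: every variable packed in a written slot is reported, even if only one of them actually changed, and writes performed inside a reverted call are not discarded.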

Extract blocks, transactions and events. Once the contract state changes and internal transactions are collected and decoded, the methodology reads the information associated with transactions, blocks, and events. For each transaction, it takes the name of the executed function from the corresponding log, together with its inputs decoded using the ABI. Other attributes are then read, such as hash, sender, timestamp, and gas used. The events emitted by the transactions are also decoded using the ABI and captured together with the names and values of their attributes.
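
To make the decoding of function inputs concrete, the sketch below decodes the call data of a transaction for static argument types only. In call data, the first 4 bytes select the function and each static argument occupies one 32-byte word. The selector table is hard-coded here for the well-known transfer(address,uint256) signature; a real decoder would derive the selectors from the ABI with Keccak-256 and would also handle dynamic types. The names SELECTORS and decode_input are illustrative.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical lookup table: selector -> (function name, [(arg name, arg type)]).
SELECTORS: Dict[str, Tuple[str, List[Tuple[str, str]]]] = {
    "a9059cbb": ("transfer", [("to", "address"), ("amount", "uint256")]),
}


def decode_input(data: str) -> Tuple[str, Dict[str, Any]]:
    """Decode hex call data into a function name and its static arguments."""
    payload = bytes.fromhex(data[2:] if data.startswith("0x") else data)
    name, params = SELECTORS[payload[:4].hex()]
    args: Dict[str, Any] = {}
    for index, (arg_name, arg_type) in enumerate(params):
        word = payload[4 + 32 * index : 4 + 32 * (index + 1)]
        if arg_type == "address":
            args[arg_name] = "0x" + word[-20:].hex()   # right-aligned, 20 bytes
        elif arg_type.startswith("uint"):
            args[arg_name] = int.from_bytes(word, "big")
        else:
            raise NotImplementedError(f"type not covered by this sketch: {arg_type}")
    return name, args
```

Unknown selectors raise a KeyError in this sketch. A production decoder would more likely report the transaction as undecoded instead.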

Generate log. After all the data mentioned above is extracted, this step generates the output log that is provided to the user. The log can be generated in JSON and CSV formats for broader compatibility with modern analysis techniques.
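
A possible shape for the two exports is sketched below with the Python standard library. The record fields are placeholders, and the choice to serialize nested values (such as decoded inputs) as JSON strings inside CSV cells is an assumption of this example, since CSV has no native nesting.

```python
import csv
import io
import json
from typing import Any, Dict, List


def to_json(records: List[Dict[str, Any]]) -> str:
    """Serialize the log as a JSON array, one object per transaction."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records: List[Dict[str, Any]]) -> str:
    """Serialize the log as CSV, one row per transaction."""
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in record.items()
        })
    return buffer.getvalue()


log = [
    {"hash": "0x1", "gasUsed": 21000, "inputs": {"to": "0xabc", "amount": 5}},
    {"hash": "0x2", "gasUsed": 46000},
]
json_text = to_json(log)
csv_text = to_csv(log)
```

Fields missing from a record are written as empty cells in the CSV output.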

Data querying. In addition to the log, the methodology provides a data querying step in which the user can interact with the extracted data. During the previous steps, this data is saved in a local database that the user can query. In this way, the methodology permits faster data retrieval, without the need to replay transactions every time. In addition, the use of a standard DBMS permits the definition of complex queries and aggregations.
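
The kind of aggregation this enables can be illustrated with SQLite, used here only because it ships with Python; the DBMS, table name, and columns are assumptions of the example and not those of the application.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical local store for extracted transactions."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transactions (
               hash          TEXT PRIMARY KEY,
               sender        TEXT NOT NULL,
               function_name TEXT NOT NULL,
               gas_used      INTEGER NOT NULL
           )"""
    )
    return conn


def gas_by_function(conn: sqlite3.Connection) -> List[Tuple[str, int, int]]:
    """Return (function name, call count, total gas used) per function."""
    return conn.execute(
        """SELECT function_name, COUNT(*), SUM(gas_used)
           FROM transactions
           GROUP BY function_name
           ORDER BY function_name"""
    ).fetchall()


conn = open_store()
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("0x1", "0xabc", "transfer", 51000),
        ("0x2", "0xdef", "transfer", 52000),
        ("0x3", "0xabc", "approve", 46000),
    ],
)
conn.commit()
summary = gas_by_function(conn)
```

Because the extracted data is already stored, a query of this kind runs without replaying any transaction.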