About two weeks ago, two fellow researchers and I sat down together to analyze the Ethereum blockchain with machine learning techniques. It immediately became clear that we first had to parse the blockchain into a more convenient format, such as a CSV file or rows in a Mongo database, before we could work with it. While searching for a solution, we came across a project by Alex Miller on GitHub that does exactly that: a Python script that parses the entries of the blockchain and transfers them to a Mongo database.
Unfortunately, the script was already a year old and did not work at first. We therefore had to fork it and introduce some bug fixes before we could run it. To get the data transferred on a macOS or Linux machine, the following steps were necessary:
Attention: Python 3 needs to be available. Please note that the following steps were written for a Linux system. We also tried them on macOS, where some parts need to be adapted.
1. Download and extract the script from GitHub
2. Synchronize the blockchain with Geth
3. Install the required Python packages
pip install pymongo contractmap tqdm requests
4. Set the environment variable for the Mongo database
5. Start the Mongo database
sudo systemctl start mongodb
6. Finally, launch the parser
cd Ethereum_Blockchain_Parser-master/Scripts
python3 preprocess.py
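Before launching the parser, it can save time to check that the Python packages from the pip install step are actually visible to the interpreter you will run it with. The following is only a minimal sketch: the helper names `missing_packages` and `report` are our own, and it assumes that each package is imported under the same name it is installed under.

```python
import importlib.util

# Packages from the pip install step above.
# Assumption: the import name equals the pip package name.
REQUIRED = ["pymongo", "contractmap", "tqdm", "requests"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names):
    """Build a one-line summary of which packages are missing, if any."""
    missing = missing_packages(names)
    if missing:
        return "Missing packages: " + ", ".join(missing)
    return "All required packages found"


print(report(REQUIRED))
```

Run it with the same `python3` you use for the parser. A package listed as missing usually means that pip installed it for a different Python version.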
The whole process took us about three days in total, mainly because synchronizing the blockchain alone already took around two days. The script then sits between Geth and MongoDB: it requests individual blocks from Geth over HTTP and transfers them one by one to the Mongo database. After all the waiting, we had the data ready for analysis. So plan well ahead if you are thinking of using this approach to analyze the blockchain.
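To make the idea of a script sitting between Geth and MongoDB more concrete, here is a minimal sketch of such a fetch-and-store loop. It is not the code of the parser itself. It assumes that Geth's HTTP-RPC interface is enabled on its default port 8545 and uses the standard `eth_getBlockByNumber` JSON-RPC method. The function names, the document layout and the `value_ether` field are our own choices for illustration.

```python
import json
from urllib.request import Request, urlopen

GETH_URL = "http://localhost:8545"  # assumption: Geth's default HTTP-RPC endpoint


def rpc_payload(block_number):
    """Build the JSON-RPC body that asks for one block with full transactions."""
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        "params": [hex(block_number), True],
        "id": block_number,
    }


def http_post(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def to_document(block):
    """Reduce a raw block to a small document (hypothetical layout)."""
    return {
        "number": int(block["number"], 16),
        "timestamp": int(block["timestamp"], 16),
        "transactions": [
            {
                "from": tx["from"],
                "to": tx["to"],  # None for contract creations
                # Wei amounts can exceed MongoDB's 64-bit integers,
                # so the value is stored as a float in ether instead.
                "value_ether": int(tx["value"], 16) / 1e18,
            }
            for tx in block["transactions"]
        ],
    }


def copy_blocks(start, end, collection, post=http_post, url=GETH_URL):
    """Fetch blocks start..end (inclusive) one by one and insert each one."""
    copied = 0
    for number in range(start, end + 1):
        block = post(url, rpc_payload(number))["result"]
        if block is None:  # the node does not have this block (yet)
            break
        collection.insert_one(to_document(block))
        copied += 1
    return copied
```

With pymongo installed, `collection` could be something like `MongoClient()["blockchain"]["blocks"]`, where the database and collection names are placeholders. Because every block costs one HTTP round trip and one insert, a loop of this kind is simple but slow, which matches our experience of the transfer taking a long time.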
I hope you find our notes useful. Please leave any comments or questions below.