About two weeks ago, two fellow research colleagues and I sat together and wanted to analyze the Ethereum blockchain with machine learning techniques. It became immediately clear that we had to parse the Ethereum blockchain into a more convenient format, such as a CSV file or rows in a Mongo database, in order to work with it. Searching for a solution, we came across a script by Alex Miller on GitHub that did exactly that: a Python script that parses the entries from the blockchain and transfers them to a Mongo database.
Unfortunately, the script was already a year old and didn't work at first. We therefore had to fork it and introduce some bug fixes in order to run it. To transfer the data on a macOS or Linux machine, the following steps were necessary:
Attention: Python 3 needs to be available. Please note that the following steps were written for a Linux system. We tried them on macOS as well, though some parts need to be adapted.
1. Download and extract the script from GitHub
wget https://github.com/svenboesiger/Ethereum_Blockchain_Parser/archive/master.zip
unzip master.zip
2. Synchronize the blockchain with Geth
geth --rpc
3. Install the required Python packages
pip install pymongo contractmap tqdm requests
4. Set the environment variable for the Mongo database
export BLOCKCHAIN_MONGO_DATA_DIR=/data/db
5. Start the Mongo database
sudo systemctl start mongodb
6. Finally, launch the parser
cd Ethereum_Blockchain_Parser-master/Scripts
python3 preprocess.py
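Before launching the parser, it is worth checking that Geth has actually finished syncing. A minimal sketch of such a check against Geth's standard JSON-RPC interface follows; the endpoint URL and port are the Geth defaults, so adjust them if your setup differs:

```python
import requests

GETH_RPC = "http://localhost:8545"  # default Geth JSON-RPC endpoint (assumption)

def rpc_call(method, params=None, url=GETH_RPC):
    """POST a single JSON-RPC request to Geth and return its 'result' field."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params or [], "id": 1}
    return requests.post(url, json=payload).json()["result"]

def parse_syncing(result):
    """Turn the raw eth_syncing result into (current, highest) or None."""
    if result is False:  # Geth returns False once the chain is fully synced
        return None
    return int(result["currentBlock"], 16), int(result["highestBlock"], 16)

if __name__ == "__main__":
    progress = parse_syncing(rpc_call("eth_syncing"))
    if progress is None:
        print("Node is synced at block", int(rpc_call("eth_blockNumber"), 16))
    else:
        print("Still syncing: block %d of %d" % progress)
```

Running the parser against a half-synced node just means it will stall once it catches up with the chain head, so this check can save some confusion.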
In total, the process took us about three days, mostly because the blockchain synchronization alone took quite a while (~two days). The script then sits between Geth and MongoDB: it requests individual blocks via HTTP and transfers them one by one to the Mongo database. After all the waiting, though, we had the data ready for analysis. So plan well ahead if you're thinking of using this approach to analyze the blockchain.
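To illustrate the idea behind the script, here is a minimal, hypothetical sketch of the same mechanism: fetch one block at a time from Geth's JSON-RPC interface and insert it into MongoDB. The endpoint, database name, block range, and choice of fields are all assumptions for illustration, not the parser's actual code:

```python
import requests

GETH_RPC = "http://localhost:8545"  # default Geth JSON-RPC endpoint (assumption)

def fetch_block(number, url=GETH_RPC):
    """Ask Geth for one block (with full transaction objects) over HTTP JSON-RPC."""
    payload = {"jsonrpc": "2.0", "method": "eth_getBlockByNumber",
               "params": [hex(number), True], "id": 1}
    return requests.post(url, json=payload).json()["result"]

def normalize_block(raw):
    """Convert a few hex-encoded fields of the raw block into plain integers."""
    return {
        "number": int(raw["number"], 16),
        "timestamp": int(raw["timestamp"], 16),
        "n_transactions": len(raw["transactions"]),
    }

if __name__ == "__main__":
    from pymongo import MongoClient  # imported here; needs a running mongod
    db = MongoClient()["blockchain"]          # database name is an assumption
    for n in range(100):                      # transfer the first 100 blocks
        db.blocks.insert_one(normalize_block(fetch_block(n)))
```

One HTTP round trip per block is what makes the transfer slow; the real script works the same way, which is why the full chain takes so long to move.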
I hope you find this useful. Please leave any comments or questions below.
Thanks for fixing and updating the repository. It is able to do the job, partially. While running Scripts/preprocess.py, I came across a simple module import error. I tried solving it but was unable to. The error is:
```
Traceback (most recent call last):
  File "preprocess.py", line 9, in <module>
    from Crawler import Crawler
  File "./../Preprocessing/Crawler/__init__.py", line 1, in <module>
    from Crawler import Crawler
  File "./../Preprocessing/Crawler/Crawler.py", line 4, in <module>
    import crawler_util
ImportError: No module named 'crawler_util'
```
By the way, stream.py does the job it is supposed to, although it exits with an error:

```
Traceback (most recent call last):
  File "stream.py", line 56, in <module>
    blocks = ParsedBlocks(t)
  File "Analysis/ParsedBlocks.py", line 60, in __init__
    self.contracts = ContractMap(load=True).addresses
  File "Analysis/ContractMap.py", line 57, in __init__
    self.load()
  File "Analysis/ContractMap.py", line 132, in load
    assert os.path.isfile(self.filepath), no_file
```

Please suggest how I should proceed.
I added a line between the following two lines in preprocess.py:

```
sys.path.append("./../Preprocessing")
sys.path.append("./../Preprocessing/Crawler")  # add this line to append the right path
sys.path.append("./../Analysis")
```