Although the Hungarian Public Procurement Authority has made the information on public procurement tenders between 1998 and 2004 publicly available online, the data format is unsuitable for statistical analysis. The information is stored in basic HTML files, which do not provide any interface for sorting or searching the data. In this technical paper we describe the data extraction process we used to turn the HTML-based information into a database by extracting the relevant fields. The Python programming language was used for data cleaning and extraction, which resulted in a database suitable for further analysis. At the end of the paper, basic statistics about the dataset are shown as examples of how it can be used.
We know that this work should not be our job, but the Hungarian state’s.
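To give a rough idea of what turning HTML-based notices into a database can look like in Python, the following is a minimal sketch and not the pipeline described in this paper. It assumes, purely for illustration, that a notice is an HTML table with one label cell and one value cell per row. The sample markup, the `CellCollector` class, the `extract_fields` and `store` helpers, and the `notice_fields` table are all hypothetical names, and only the Python standard library (`html.parser`, `sqlite3`) is used.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical notice markup; the layout of the real pages is an assumption.
SAMPLE_HTML = """
<html><body><table>
<tr><td>Contracting authority</td><td>Example Municipality</td></tr>
<tr><td>Year</td><td>2003</td></tr>
</table></body></html>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_fields(html):
    """Return a {label: value} dict built from two-cell table rows."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


def store(conn, notice_id, fields):
    """Write one notice's fields into a simple long-format table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notice_fields "
        "(notice_id TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO notice_fields VALUES (?, ?, ?)",
        [(notice_id, label, value) for label, value in fields.items()],
    )
    conn.commit()
```

A usage example under the same assumptions: `fields = extract_fields(SAMPLE_HTML)` followed by `store(sqlite3.connect(":memory:"), "n1", fields)` leaves the extracted fields in a table that can be sorted and searched with ordinary SQL, which is the capability the raw HTML files lack.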