Systematic Generation of a Dataset to Study University Dropout in Chile Using Public Data and Open Source Frameworks. 2026. Ferreira, A.; Campos, D.; Lagos, D.; Leal, P.; Reinecke, C.

Abstract:

This paper presents a structured and reproducible methodology for generating a dataset whose quality supports the study of university dropout in Chile along multiple dimensions: academic, socioeconomic, demographic, and temporal. The dropout variable was created during the process, since it was not reported in the original sources. The methodology comprises collecting and integrating multiple sources, each composed of millions of records; cleaning and standardizing categorical and numerical (discrete or continuous) fields; and storing the result locally. The final dataset contains 594,224 records. The approach can be applied to other types of problems that require consolidating different sources into a quality dataset. A further feature is the exclusive use of open source frameworks throughout the process. © 2026 IEEE.
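
The abstract does not state how the dropout variable was derived. As a minimal sketch only, one common way to build such a label from yearly enrollment records is shown below. The record layout, the `label_dropout` helper, and the rule itself (last enrollment before the final observed year, with no graduation record) are assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict

# Hypothetical yearly enrollment records: (student_id, year).
# Identifiers and values are illustrative, not taken from the paper's sources.
enrollments = [
    ("s1", 2019), ("s1", 2020), ("s1", 2021),
    ("s2", 2019),
    ("s3", 2020), ("s3", 2021),
]

# Hypothetical set of students recorded as graduates.
graduates = {"s1"}


def label_dropout(enrollments, graduates, last_observed_year):
    """Return {student_id: bool}, True when the student is flagged as dropout.

    Assumed rule (illustration only): a student is a dropout when their
    last enrollment precedes the last observed year and they do not
    appear in the graduates set.
    """
    # Track the most recent enrollment year seen for each student.
    last_year = defaultdict(int)
    for student_id, year in enrollments:
        last_year[student_id] = max(last_year[student_id], year)

    return {
        student_id: (year < last_observed_year and student_id not in graduates)
        for student_id, year in last_year.items()
    }


labels = label_dropout(enrollments, graduates, last_observed_year=2021)
```

Under this assumed rule only `s2` is flagged, because `s1` and `s3` are still enrolled in the last observed year. A real pipeline would also need to handle transfers between institutions and temporary leaves, which this sketch ignores.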

Keywords: data engineering; data integration; data mining; university dropout

DOI: 
