COVID-19 seroprevalence estimation and forecasting in the USA from ensemble machine learning models using a stacking strategy

Sagastabeitia, Gontzal; Doncel, Josu; Aguilar, Jose; Fernández Anta, Antonio; Ramírez, Juan Marcos

doi:10.1016/j.eswa.2024.124930

dc.contributor.author	Sagastabeitia, Gontzal
dc.contributor.author	Doncel, Josu
dc.contributor.author	Aguilar, Jose
dc.contributor.author	Fernández Anta, Antonio
dc.contributor.author	Ramírez, Juan Marcos
dc.date.accessioned	2024-09-23T16:06:20Z
dc.date.available	2024-09-23T16:06:20Z
dc.date.issued	2024-08-15
dc.identifier.issn	0957-4174	es
dc.identifier.uri	https://hdl.handle.net/20.500.12761/1849
dc.description.abstract	The COVID-19 pandemic exposed the importance of research on the spread of epidemic diseases. In this paper, we apply Artificial Intelligence and statistical techniques to build prediction models that estimate SARS-CoV-2 seroprevalence in the United States, using multiple estimates of COVID-19 prevalence and other explanatory variables. We propose stacking techniques based on multiple model-building techniques (Linear and Beta Regression, Genetic Programming and Neural Networks) to obtain Predictive Ensemble Models. There has been extensive research in this field, but no in-depth research on the application of stacking methods to estimate and forecast seroprevalence in the USA specifically. This paper provides a novel comparison of the behaviour and performance of different building techniques for stacking ensemble models and identifies which methods are better suited to different scenarios. We find that Genetic Programming and Neural Networks are the best models when trained on data from single states; when multiple states are considered, Genetic Programming still outperforms the Regression models, but Neural Networks fail to estimate the seroprevalence accurately. Another novelty of our work is the use of cross-state validation to evaluate the models on new data, as well as temporal forecasting. Depending on how the data is processed, Linear Regression performs very well with cross-state validation and temporal forecasting, and Genetic Programming is very accurate with the former while Neural Networks work better with the latter.	es
dc.language.iso	eng	es
dc.publisher	Elsevier	es
dc.title	COVID-19 seroprevalence estimation and forecasting in the USA from ensemble machine learning models using a stacking strategy	es
dc.type	journal article	es
dc.journal.title	Expert Systems with Applications	es
dc.type.hasVersion	AO	es
dc.rights.accessRights	embargoed access	es
dc.volume.number	258	es
dc.identifier.doi	10.1016/j.eswa.2024.124930	es
dc.page.final	15	es
dc.page.initial	1	es
dc.relation.projectName	SocialProbing	es
dc.subject.keyword	COVID-19; Epidemiology; Stacking ensemble method; Machine learning; Regression modelling; Genetic programming; Neural networks	es
dc.description.refereed	TRUE	es
dc.description.status	pub	es

Files in this item

Name: ArticuloGontzalWCCI2024-17.pdf
Size: 1.769 MB
Format: PDF
Description: Main article

This item appears in the following Collection(s)

IMDEA Networks
