Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services

Vallina, Pelayo; Le Pochat, Victor; Feal, Álvaro; Paraschiv, Marius; Gamba, Julien; Burke, Tim; Hohlfeld, Oliver; Tapiador, Juan; Vallina-Rodriguez, Narseo

dc.contributor.author	Vallina, Pelayo
dc.contributor.author	Le Pochat, Victor
dc.contributor.author	Feal, Álvaro
dc.contributor.author	Paraschiv, Marius
dc.contributor.author	Gamba, Julien
dc.contributor.author	Burke, Tim
dc.contributor.author	Hohlfeld, Oliver
dc.contributor.author	Tapiador, Juan
dc.contributor.author	Vallina-Rodriguez, Narseo
dc.date.accessioned	2021-07-13T09:43:43Z
dc.date.available	2021-07-13T09:43:43Z
dc.date.issued	2020-10-27
dc.identifier.uri	http://hdl.handle.net/20.500.12761/848
dc.description.abstract	Domain classification services have applications in multiple areas, including cybersecurity, content blocking, and targeted advertising. Yet, these services are often a black box in terms of their methodology for classifying domains, which makes it difficult to assess their strengths, aptness for specific applications, and limitations. In this work, we perform a large-scale analysis of 13 popular domain classification services on more than 4.4M hostnames. Our study empirically explores their methodologies, scalability limitations, label constellations, and their suitability for academic research as well as other practical applications such as content filtering. We find that the coverage varies enormously across providers, ranging from over 90% to below 1%. All services deviate from their documented taxonomy, hampering sound usage for research. Further, labels are highly inconsistent across providers, who show little agreement over domains, making it difficult to compare or combine these services. We also show how the dynamics of crowd-sourced efforts may be obstructed by scalability and coverage aspects as well as subjective disagreements among human labelers. Finally, through case studies, we showcase that most services are not fit for detecting specialized content for research or content-blocking purposes. We conclude with actionable recommendations on their usage based on our empirical insights and experience. Particularly, we focus on how users should handle the significant disparities observed across services both in technical solutions and in research.
dc.language.iso	eng
dc.title	Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services	en
dc.type	conference object
dc.conference.date	27-29 October 2020
dc.conference.place	Virtual event
dc.conference.title	The 20th ACM Internet Measurement Conference (ACM IMC 2020)	*
dc.event.type	conference
dc.pres.type	paper
dc.rights.accessRights	open access
dc.description.refereed	TRUE
dc.description.status	pub
dc.eprint.id	http://eprints.networks.imdea.org/id/eprint/2183

Files in this item

Name: paper.pdf
Size: 3.208Mb
Format: PDF

This item appears in the following collection(s)

IMDEA Networks
