• español
    • English
  • Login
  • English 
    • español
    • English
  • Publication Types
    • bookbook partconference objectdoctoral thesisjournal articlemagazinemaster thesispatenttechnical documentationtechnical report
View Item 
  •   IMDEA Networks Home
  • View Item
  •   IMDEA Networks Home
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Identifying Sensitive URLs at Web-Scale

Share
Files
imc20.pdf (491.6Kb)
Identifiers
URI: http://hdl.handle.net/20.500.12761/852
Metadata
Show full item record
Author(s)
Srdjan, Matic; Iordanou, Costas; Georgios, Smaragdakis; Laoutaris, Nikolaos
Date
2020-10-27
Abstract
Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 Billion URLs collected by the Common Crawl project. We identify more than 155 millions sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third-parties set at least one persistent cookie.
Share
Files
imc20.pdf (491.6Kb)
Identifiers
URI: http://hdl.handle.net/20.500.12761/852
Metadata
Show full item record

Browse

All of IMDEA NetworksBy Issue DateAuthorsTitlesKeywordsTypes of content

My Account

Login

Statistics

View Usage Statistics

Dissemination

emailContact person Directory wifi Eduroam rss_feed News
IMDEA initiative About IMDEA Networks Organizational structure Annual reports Transparency
Follow us in:
Community of Madrid

EUROPEAN UNION

European Social Fund

EUROPEAN UNION

European Regional Development Fund

EUROPEAN UNION

European Structural and Investment Fund

© 2021 IMDEA Networks. | Accesibility declaration | Privacy Policy | Disclaimer | Cookie policy - We value your privacy: this site uses no cookies!