Show simple item record

dc.contributor.author: Ahmad, Syed Suleman
dc.contributor.author: Dar, Muhammad Daniyal
dc.contributor.author: Zaffar, Zareed
dc.contributor.author: Vallina-Rodriguez, Narseo
dc.contributor.author: Nithyanand, Rishab
dc.date.accessioned: 2021-07-13T09:40:50Z
dc.date.available: 2021-07-13T09:40:50Z
dc.date.issued: 2020-04-20
dc.identifier.uri: http://hdl.handle.net/20.500.12761/777
dc.description.abstract: Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on various types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the repeatability of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between crawlers and how the choice of crawler can impact our understanding of the web.
dc.language.iso: eng
dc.title: Apophanies or Epiphanies: How Crawlers Can Impact Our Understanding of the Web [en]
dc.type: conference object
dc.conference.date: 20-24 April 2020
dc.conference.place: Taipei, Taiwan
dc.conference.title: The Web Conference 2020 (WWW 2020)
dc.event.type: conference
dc.pres.type: paper
dc.type.hasVersion: VoR
dc.rights.accessRights: open access
dc.description.refereed: TRUE
dc.description.status: pub
dc.eprint.id: http://eprints.networks.imdea.org/id/eprint/2082

