A Web crawler collects and updates Web pages to be added to the webLyzard archive. We currently use the Java-based open-source Apache StormCrawler to perform this task (released under the terms of the Apache License 2.0) – typically not more than twice a week, and with bandwidth limits to minimize the resulting load on third-party servers. The data collection process respects the Web site owner's robots.txt settings (a text file placed in the root directory of a site, which administrators use to restrict access to files and directories on a Web server). Please contact us if you are a site administrator and have questions regarding this process.
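For illustration, a minimal robots.txt file might look as follows; the directory and delay shown here are hypothetical examples, not actual webLyzard or site settings:

    # robots.txt placed at the root of the site, e.g. https://example.com/robots.txt
    User-agent: *          # these rules apply to all crawlers
    Disallow: /private/    # do not crawl this directory
    Crawl-delay: 10        # wait at least 10 seconds between requests

A crawler that honors these settings skips the listed directories entirely and throttles its requests accordingly.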
The majority of data is currently gathered for UNEP Live Web Intelligence, an information exploration system to analyze online media coverage of the Sustainable Development Goals, and for the research projects InDICEs (digital culture), EPOCH (event prediction), and GENTIO (text and impact optimization).
Social Media Content
When collecting content via the official APIs of social networking platforms, our system strictly adheres to these platforms' usage restrictions and only accesses the public portion of the content. Using channel or page names as well as keyword search terms to specify the topics of interest for a project, we gather posts and comments together with basic account details; this may include the account name, number of followers, and public geo annotations.
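As a rough sketch of this step, the following Java fragment shows how posts matching a project's keywords might be retrieved and filtered down to publicly visible content together with basic account details; SocialApiClient and the surrounding types are hypothetical placeholders introduced for illustration, not an actual platform SDK or part of our codebase:

    import java.util.List;

    // Illustrative sketch only: SocialApiClient, Post, and Account are
    // hypothetical placeholders, not an actual platform SDK.
    public class CollectionSketch {

        // Basic public account details, as described above.
        record Account(String name, int followers, String publicGeo) {}

        // A post together with its visibility flag and account details.
        record Post(String text, boolean isPublic, Account account) {}

        interface SocialApiClient {
            // Keyword search restricted to content the platform exposes publicly.
            List<Post> searchPublicPosts(List<String> keywords);
        }

        static List<Post> collect(SocialApiClient client, List<String> keywords) {
            return client.searchPublicPosts(keywords).stream()
                    .filter(Post::isPublic)  // keep only publicly visible posts
                    .toList();
        }
    }

The explicit visibility filter reflects the policy stated above: anything a platform does not expose publicly is never stored.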
The content analysis we perform includes story detection, sentiment analysis, brand perception, and an assessment of a posting's potential reach. It is never used to build user profiles or to infer details about individuals. Users can request the deletion of their content via email.