eWRT – Extensible Web Retrieval Toolkit
Knowledge capture in the age of massive Web data requires robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data. The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API that addresses this requirement. It retrieves data from social media sources such as Twitter, Facebook, Google+ and YouTube. eWRT also includes various helper classes for effective caching and data management.
Available via GitHub, eWRT provides components for (i) content acquisition and caching, (ii) converting DOC, PDF and HTML files into text documents, (iii) natural language processing functions such as language detection and string similarity measures, including Levenshtein and Soundex distances, (iv) comparing and visualizing ontologies, (v) text cleanup and string normalization, and (vi) streamlining Python programming tasks.
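To make the string similarity measures mentioned above more concrete, the following is a minimal, standalone sketch of the Levenshtein distance in plain Python. It is an illustration only and does not use or reproduce eWRT's own API: the function names `levenshtein` and `similarity` are hypothetical, and the normalization into a score between 0 and 1 is an assumption chosen for the example, not something the toolkit is known to do.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b.

    Illustrative sketch only; not taken from eWRT.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to a score in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(similarity("kitten", "sitting"), 3))
```

Keeping only two rows of the dynamic programming table limits memory use to the length of the shorter string, which matters when comparing many strings harvested from Web sources.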
eWRT has been jointly developed by researchers from MODUL University Vienna, webLyzard technology, the University of Applied Sciences Chur, and the Vienna University of Economics and Business. The library is currently being extended as part of the uComp Project, which investigates Embedded Human Computation for Knowledge Extraction and Evaluation.
- Weichselbraun, A., Scharl, A. and Lang, H.-P. (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013). Banff, Canada.