Knowledge capture in the age of massive Web data requires robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data. The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API that addresses this requirement. It retrieves data from social media sources such as Twitter, Facebook, Google+ and YouTube. eWRT also includes various helper classes for effective caching and data management.
Available via GitHub, the eWRT toolkit provides components for (i) content acquisition and caching, (ii) converting DOC, PDF and HTML files into text documents, (iii) natural language processing functions such as language detection and string similarity measures, including Levenshtein distance and Soundex phonetic matching, (iv) comparing and visualizing ontologies, (v) text cleanup and string normalization, and (vi) streamlining Python programming tasks.
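To illustrate the kind of string similarity measure mentioned in (iii), the following is a minimal, standalone sketch of the Levenshtein edit distance in plain Python. It does not use or reproduce eWRT's own API; the function names `levenshtein` and `similarity` are hypothetical and chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a score in [0, 1] (illustrative choice)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("Twitter", "Twiter"))
```

Normalizing by the length of the longer string is only one possible convention; it makes scores comparable across string pairs of different lengths, which can be convenient when matching names gathered from heterogeneous sources.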
eWRT is written in Python, a popular high-level programming language that emphasizes code readability and supports object-oriented, imperative and functional programming styles.
- Weichselbraun, A., Scharl, A. and Lang, H.-P. (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013). Banff, Canada.