eWRT – Extensible Web Retrieval Toolkit

Knowledge capture in the age of massive Web data requires robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data. The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API that addresses this requirement. It retrieves data from social media sources such as Twitter, Facebook, Google+ and YouTube. eWRT also includes various helper classes for effective caching and data management.
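To illustrate the kind of caching helper mentioned above, here is a minimal sketch of a disk-backed cache wrapped around a retrieval function. It is not taken from eWRT: the class name `DiskCache`, its `get` method, and the `fetch` callable are hypothetical names chosen for this example, and it assumes the cached values can be pickled.

```python
import hashlib
import pickle
from pathlib import Path


class DiskCache:
    """Hypothetical helper (not eWRT's API): caches fetch results on disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable that takes a key (e.g. a URL) and returns data

    def _path(self, key: str) -> Path:
        # Hash the key so that arbitrary URLs map to safe file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.pickle"

    def get(self, key: str):
        path = self._path(key)
        if path.exists():
            # Cache hit: load the stored result instead of fetching again.
            with path.open("rb") as fh:
                return pickle.load(fh)
        # Cache miss: fetch, store, and return the result.
        value = self.fetch(key)
        with path.open("wb") as fh:
            pickle.dump(value, fh)
        return value
```

A caller would pass its own retrieval function as `fetch`; repeated `get` calls with the same key then read from disk instead of contacting the remote source again.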

Available via GitHub, the eWRT toolkit provides components for (i) content acquisition and caching, (ii) converting DOC, PDF and HTML files into text documents, (iii) natural language processing functions such as language detection and string similarity measures including Levenshtein and Soundex distances, (iv) comparing and visualizing ontologies, (v) text cleanup and string normalization, and (vi) streamlining Python programming tasks.
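As an illustration of one of the string similarity measures listed above, the following is a minimal sketch of the Levenshtein distance using the standard dynamic-programming formulation. It is a standalone example, not eWRT's implementation; the function names `levenshtein` and `similarity` are assumptions made for this sketch, and the length-based normalization in `similarity` is one common choice among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the shorter string rather than with the product of both lengths.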


eWRT has been jointly developed by researchers from MODUL University Vienna, webLyzard technology, the University of Applied Sciences Chur, and the Vienna University of Economics and Business. The library is currently being extended as part of the uComp Project, which investigates Embedded Human Computation for Knowledge Extraction and Evaluation.

eWRT References

Weichselbraun, A., Scharl, A. and Lang, H.-P. (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013). Banff, Canada.
