Documenting the Now
DocNow Tweet Catalog
Description:

GeoCoV19 is a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. Because geolocation information is essential for many tasks, such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content, and derived their geolocation information at different granularity levels using Nominatim (OpenStreetMap) data. In terms of geographical coverage, the dataset spans 218 countries and 47K cities worldwide. The tweets in the dataset come from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages. We posit that GeoCoV19 supports the development of computational models for better understanding how societies are collectively coping with the unprecedented COVID-19 pandemic. Moreover, the dataset can inform the development of AI-based systems that forecast disease outbreaks and provide surveillance so that authorities can act in a timely manner, learn about knowledge gaps and urgent needs of the general public, identify unanticipated issues, and tackle misinformation and fake news, among others.
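
To illustrate what a gazetteer-based toponym lookup can look like, here is a minimal Python sketch. It is not the pipeline used to build GeoCoV19: the in-memory GAZETTEER table, the extract_toponyms function, and the MAX_NGRAM limit are all hypothetical names introduced for this example, and the tiny table stands in for place-name data that the real dataset derives from Nominatim (OpenStreetMap). The sketch only shows the general idea of matching word n-grams against a list of known places and reporting a granularity level for each match.

```python
import re

# Hypothetical miniature gazetteer (an assumption for illustration only):
# lowercase place name -> (country code, granularity level).
GAZETTEER = {
    "new york": ("US", "city"),
    "paris": ("FR", "city"),
    "lombardy": ("IT", "state"),
    "italy": ("IT", "country"),
}

# Longest place name, in words, that the lookup will try to match.
MAX_NGRAM = 3


def extract_toponyms(text):
    """Return (place, country_code, level) tuples found in text.

    Scans left to right and prefers the longest matching n-gram at each
    position. The ASCII-only tokenizer is a simplification; multilingual
    text would need a Unicode-aware tokenizer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                country, level = GAZETTEER[candidate]
                matches.append((candidate, country, level))
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no place name starts at this token
    return matches


if __name__ == "__main__":
    print(extract_toponyms("Stuck at home in New York, thinking of Italy"))
```

The same function could be applied separately to a user's profile location string and to the tweet text, which mirrors the two sources of toponyms mentioned in the description.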