The University of Texas at San Antonio Student Union. Copyright UTSA.

Overcoming the challenges of classifying COVID-19 Twitter data using self-supervision and few shot learning

Jetstream enables a self-supervised model to classify COVID-19 tweets in order to provide insights into COVID-19 that have not been investigated.

Researchers from The University of Texas at San Antonio (UTSA) propose a method to classify COVID-19 Twitter data using the Jetstream cloud computing system. They set up deep neural models in Jetstream to classify tweets into four COVID-19 categories: related, awareness, infection, and vaccine.

Proposed pipeline of COVID-19 self-supervised learning.

Seasonal influenza and COVID-19 have many similarities, from their symptoms to how the viruses spread. Because of this overlap, the researchers fine-tuned a model to provide insights into COVID-19 that have not been investigated. In their experiments, the proposed model identified COVID-19-related discussion with 86% accuracy on recently collected tweets.

"Public health surveillance and tracking the virus via social media can be a useful digital tool for contact tracing and preventing the spread of the virus," said Brandon Lwowski, Ph.D. student in the Department of Information Technology, AI concentration, at UTSA. Lwowski is also a former participant in the Jetstream Research Experiences for Undergraduates (REU) program. "Large volumes of COVID-19 tweets can quickly be processed in real-time to offer information to researchers."

Brandon Lwowski, Ph.D. student in the Department of Information Technology, AI concentration, at UTSA

Using few-shot learning to fine-tune a self-supervised model lets researchers classify large volumes of public social media posts about influenza and COVID-19 and gain insight into the viruses. A simple search for "flu" and "coronavirus" with the Twitter API returns millions of tweets. Classifying those tweets into smaller subsets, with categories like "Self vs Other" and "Awareness vs Infection," provides a deeper understanding of how influenza and COVID-19 are affecting communities.
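To make the idea of few-shot classification concrete, here is a minimal Python sketch. It is not the researchers' model. It uses nearest-prototype classification, a common few-shot baseline, in place of their fine-tuned self-supervised network, and a simple bag-of-words count in place of a learned encoder. The example tweets, the `SUPPORT` set, and the function names are all hypothetical. Only three of the article's category names are used.

```python
import math
import re
from collections import Counter

# Hypothetical support set: two made-up example tweets per category.
# "Few-shot" means only a handful of labeled examples like these are available.
SUPPORT = {
    "awareness": [
        "wash your hands and wear a mask to stop the spread",
        "stay home and follow the health guidelines",
    ],
    "infection": [
        "i tested positive and have a fever and cough",
        "my brother is sick with covid and lost his sense of smell",
    ],
    "vaccine": [
        "got my first vaccine dose today",
        "the vaccine trial results look promising",
    ],
}


def embed(text):
    """Stand-in encoder (an assumption): bag-of-words token counts.

    A real pipeline would use a pretrained neural encoder here instead.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(support):
    """Build one prototype per label by summing its example vectors.

    Cosine similarity ignores vector length, so summing gives the same
    ranking as averaging.
    """
    prototypes = {}
    for label, examples in support.items():
        total = Counter()
        for text in examples:
            total.update(embed(text))
        prototypes[label] = total
    return prototypes


def classify(text, prototypes):
    """Return the label whose prototype is most similar to the tweet."""
    vec = embed(text)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


PROTOTYPES = build_prototypes(SUPPORT)

if __name__ == "__main__":
    for tweet in ["i have a fever and a bad cough", "when can i get the vaccine"]:
        print(tweet, "->", classify(tweet, PROTOTYPES))
```

Swapping `embed` for a stronger encoder is where a pretrained, self-supervised model would come in. The prototype step itself needs only the few labeled examples per category.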

Paul Rad, National Academy of Inventors Senior Member, Peter T. Flawn Endowed Professor, director of AI and Autonomy Lab, and co-founder of The UTSA Open Cloud Institute

"The major roadblock for using deep learning models on the COVID-19 tweets is the lack of well-annotated and labeled data," said Paul Rad, National Academy of Inventors Senior Member, Peter T. Flawn Endowed Professor, director of AI and Autonomy Lab, and co-founder of The UTSA Open Cloud Institute. "With millions of tweets related to COVID-19 flooding social media, researchers have a difficult time performing supervised learning on the data. We propose a method to attack this problem by transferring knowledge learned in influenza data and integrating it with latent variables obtained from the unlabeled dataset of COVID-19 to perform a deeper understanding through self-supervised learning."

Their paper, "COVID-19 Surveillance through Twitter using Self-Supervised and Few Shot Learning," has been accepted by the 2020 Conference on Empirical Methods in Natural Language Processing.

This work is partly supported by a grant from the Intelligence Community Centers for Academic Excellence (IC CAE) Program and The University of Texas at San Antonio Open Cloud Institute. The authors gratefully acknowledge the use of the services of the National Science Foundation (NSF)-funded Jetstream cloud system (NSF #1445604).