AI Talk: Crossing the data chasm

Sept. 3, 2021 / By V. “Juggy” Jagannathan, PhD

Everyone knows about the appetite deep learning models have for data, but that data needs to be properly labeled and curated before it can train a model. The data also needs to reflect the intended real-world use case. This is a huge problem, and it will only get worse as the demand for machine-learned models and use cases explodes. Given the intense interest in this area, it is only natural that a wide range of approaches have been tried to solve the problem, and I am beginning to see an explosion of research papers focused on this issue. A recent survey found that searches for “data augmentation” have spiked significantly in recent years.

Here I summarize a few popular approaches to addressing this labeled-data deficit, primarily in the text domain. Similar approaches apply to other modalities as well, such as images, audio and video.

Crowdsourcing. If you have the money and time, you can always use a resource such as Amazon Mechanical Turk to get the labeling done. A range of companies also offer specialized resources to help with the annotation of datasets.

Transfer learning. This approach evolved after the remarkable success of the Transformer architecture a few years ago. These models employ unsupervised training on data available on the internet – billions of text documents from Wikipedia and other sources. The language models produced from these texts use very simple training objectives, such as predicting the next word given the left context, or predicting randomly masked words given the whole sentence. These pretrained models can then represent any input text as an embedding (a vector of numbers). Using this as a basis, new models can be fine-tuned on a small labeled dataset to solve a range of problems such as sentiment analysis or classification. GPT-3, one of the largest such pretrained models, has been shown to do extremely well with limited data – so-called zero- or few-shot learning.
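To make this concrete, here is a minimal fine-tuning sketch. The Hugging Face transformers library, the BERT checkpoint and the toy reviews are all illustrative choices of mine, not something this approach prescribes; the point is simply that the pretrained weights do the heavy lifting and only a tiny labeled set sits on top.

```python
# Minimal fine-tuning sketch (illustrative; assumes the Hugging Face
# "transformers" and "torch" packages are installed).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Great product, works as advertised.", "Broke after two days."]
labels = [1, 0]  # 1 = positive sentiment, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class TinyDataset(torch.utils.data.Dataset):
    """Wraps the toy examples in the format Trainer expects."""
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

# One epoch over the tiny labeled set; the pretrained weights are the basis.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=TinyDataset(),
)
trainer.train()
```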

Self-training. The basic idea here is that you train a model with the labeled data you have, then use it on your unlabeled dataset to create pseudo labels. Self-training is also referred to as semi-supervised learning. If you are trying to predict, say, the sentiment of a review, you can use this approach. Once you have a bunch of pseudo labels, you can use them as normal training data. It has been shown that this approach can improve model performance. Here is a good overview of this approach with some code. This approach assumes you have lots of unlabeled data, which is usually the case.
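Here is a minimal sketch of that loop; scikit-learn and the confidence threshold are my own illustrative choices:

```python
# Self-training sketch: train on hand labels, pseudo-label the unlabeled
# pool, keep confident predictions, retrain (illustrative; uses scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts   = ["loved it", "terrible, do not buy", "works great", "awful"]
labels          = [1, 0, 1, 0]  # 1 = positive, 0 = negative
unlabeled_texts = ["really enjoyed this", "completely useless", "not bad"]

vec = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
X   = vec.transform(labeled_texts).toarray()
X_u = vec.transform(unlabeled_texts).toarray()

model = LogisticRegression().fit(X, labels)

# Keep only pseudo labels the model is confident about (threshold assumed).
probs = model.predict_proba(X_u)
confident = probs.max(axis=1) > 0.8
pseudo_labels = probs.argmax(axis=1)[confident]

# Retrain on the union of hand labels and confident pseudo labels.
X_all = np.vstack([X, X_u[confident]])
y_all = np.concatenate([labels, pseudo_labels])
model = LogisticRegression().fit(X_all, y_all)
```

In practice this loop is repeated, growing the pseudo-labeled set each round.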

Active learning. This approach starts out like self-training above: a model trained on a small hand-labeled dataset is used to label the unlabeled dataset. From there, the approach diverges. A small selection of the machine-labeled data is validated by a human-in-the-loop to generate additional hand-labeled data. Determining which data points are queried for human evaluation can be guided by a variety of strategies, with the goal of reducing uncertainty in model performance.
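One common query strategy is uncertainty sampling: send the examples the model is least sure about to the human. A minimal sketch, with toy data of my own:

```python
# Uncertainty-sampling sketch for active learning (illustrative; one of
# several possible query strategies, built on scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled   = ["loved it", "terrible"]
labels    = [1, 0]
unlabeled = ["great value", "meh", "would not recommend", "five stars"]

vec = TfidfVectorizer().fit(labeled + unlabeled)
model = LogisticRegression().fit(vec.transform(labeled), labels)

# Uncertainty = how far the top class probability is from certainty.
probs = model.predict_proba(vec.transform(unlabeled))
uncertainty = 1.0 - probs.max(axis=1)

# Query the k most uncertain examples for human labeling.
k = 2
for i in np.argsort(uncertainty)[-k:]:
    print(f"Please label: {unlabeled[i]!r}")
```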

Data augmentation. This is a fairly broad term that can include all of the previously named approaches to creating additional data. Here is a good blog explaining some popular techniques. Taking the same example of classifying the sentiment of a review, you start with a known positive review. You can then apply a variety of transformations to this review, and to all such reviews: replace random words with their synonyms, shuffle words around, shuffle phrases along grammatical boundaries of the sentence, delete a few words, or add some random words. Of course, if you transform it too much you may lose the positive sentiment you need to maintain, but this approach gets you additional data to feed your training model. And it works to improve model performance.
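A minimal sketch of those word-level transformations (the tiny synonym table is illustrative; real pipelines typically draw on WordNet or similar resources):

```python
# Simple text augmentation: synonym replacement, random deletion,
# random swap (illustrative implementations).
import random

SYNONYMS = {"great": ["excellent", "fantastic"], "movie": ["film"]}

def synonym_replace(words):
    return [random.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def random_delete(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept or words  # never delete everything

def random_swap(words):
    w = words[:]
    i, j = random.sample(range(len(w)), 2)
    w[i], w[j] = w[j], w[i]
    return w

review = "a great movie with a truly great cast".split()
for transform in (synonym_replace, random_delete, random_swap):
    print(" ".join(transform(review)))  # three augmented variants
```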

Weak supervision. Also referred to as “distant supervision,” the idea here is to label your unlabeled dataset using labeling functions (LFs). What are labeling functions? They are programmatic routines that use rules or a knowledge base to label the dataset. For instance, taking the same example of predicting the sentiment of a review, you can scan for words that express positive sentiment and label those reviews as positive. This approach produces data of lower quality than a human-reviewed, hand-labeled dataset. On the plus side, you can write a series of LFs, each taking advantage of some feature in the dataset that can be found programmatically. To increase the quality of labeling, you can look for two or more LFs that assign the same label. Here is a blog about an open source tool called Snorkel that explains how this process works. There are several tools that support this paradigm of augmenting data.
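A minimal sketch of labeling functions in plain Python (Snorkel, mentioned above, provides a full framework, including a model that reconciles conflicting LFs; the keyword lists here are illustrative):

```python
# Weak supervision sketch: rule-based labeling functions that vote,
# keeping a label only when the non-abstaining LFs agree.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_positive_words(review):
    return POSITIVE if any(w in review.lower()
                           for w in ("great", "excellent", "love")) else ABSTAIN

def lf_negative_words(review):
    return NEGATIVE if any(w in review.lower()
                           for w in ("awful", "broken", "refund")) else ABSTAIN

def lf_five_stars(review):
    return POSITIVE if "5 stars" in review.lower() else ABSTAIN

LFS = [lf_positive_words, lf_negative_words, lf_five_stars]

def weak_label(review):
    """Assign a label only when all non-abstaining LFs agree."""
    votes = {lf(review) for lf in LFS} - {ABSTAIN}
    return votes.pop() if len(votes) == 1 else ABSTAIN

print(weak_label("Great phone, 5 stars"))  # -> 1 (two LFs agree)
print(weak_label("Awful, want a refund"))  # -> 0
```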

Synthetic data. Creating totally artificial datasets has become an active area of research. The approach is currently most applicable to structured data and modalities such as images and video. The main drivers for this research are privacy preservation and addressing bias in datasets. Check out this workshop held a few months ago with that focus. Privacy preservation is, of course, one of the main reasons why datasets are hard to come by in the health care domain. Researchers are exploring how to create synthetic datasets that preserve the statistical properties of the original; these datasets can then be used to develop models. For a deep dive, check out this two-hour tutorial on efforts to create such datasets in health care.
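As a toy illustration of the principle (and only the principle; real health care work layers on formal privacy guarantees), one can fit a distribution to real records and sample new ones:

```python
# Toy synthetic-data sketch: fit mean/covariance to "real" records and
# sample synthetic ones with matching statistics (entirely illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "real" records: age and systolic blood pressure.
real = rng.multivariate_normal([50, 120], [[100, 30], [30, 150]], size=500)

# Fit the empirical statistics, then sample synthetic records from them.
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=500)

print("real mean:     ", real.mean(axis=0).round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
```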

Concluding thoughts. It is clear that data augmentation approaches work; it is less clear why they work. Several research efforts are focused on unraveling the theoretical underpinnings of these approaches. A lot of the approaches being explored to augment text data are inspired by computer vision research. Perhaps it is time to come up with new approaches that are inspired by text itself?

I am always looking for feedback, and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.

V. “Juggy” Jagannathan, PhD, is director of research for 3M M*Modal and is an AI Evangelist with four decades of experience in AI and Computer Science research.