Incomplete Supervision:

Text Classification based on a Subset of Labels

Yacun Wang, Luning Yang

Mentor: Jingbo Shang

Report Github Repo

About our project

Many text classification models assume that users can provide a full set of class labels. This is not realistic: users may not be aware of all possible classes in advance, or may not be able to obtain an exhaustive list. These models are also forced to assign every new document to one of the existing classes, even when none of the existing labels is a good fit. Thus, we explore the Incomplete Text Classification (IC-TC) setting: models mine patterns in a small labeled set that contains only existing labels, apply those patterns to predict existing labels in an unlabeled set, and detect out-of-pattern clusters for potential new label discoveries. We experiment with the potential of weakly supervised ML to detect class labels that humans may not recognize, thus facilitating more accurate classification. From the document and class embeddings and the unconfident documents generated, we found that both the baseline and the final model had some capability of detecting unseen classes, and that label generation techniques help produce reasonable new class labels.

Disclaimer

Due to space limitations, please view all cited papers in the report.

Introduction

In recent years, as neural network models have grown in complexity and scale, they have also required more high-quality human-annotated training data to achieve satisfactory performance. Creating these annotations usually requires extensive domain expertise and is extremely time-consuming. Researchers have strived to develop models in the weak supervision setting that gradually alleviate the human burden of annotating documents. In particular, researchers have approached text classification by developing models that require only the class labels and a little extra information for each class label, such as (1) a few representative words (i.e. seed words) or (2) authors, publication dates, etc. (i.e. metadata). These models have been shown to obtain reliable results without full human annotation.

However, the problem settings for these models all depend on one key assumption: users must provide the model with a full set of desired class labels. This is less realistic, as users might not know all possible classes in advance, and they cannot obtain an exhaustive list of class names without carefully reading and analyzing the documents. If some documents fall outside the given list, the models are forced to assign them to one of the existing classes based on normalized probability (e.g. the final softmax layer of a neural network).

For example, an online article database might contain thousands of user-uploaded articles labeled with their domains: news, sports, computer science, etc., and the labels are limited to those of existing articles. When we try to classify new documents, some of them may belong to classes whose labels the database does not provide. For instance, we may have a group of articles in the domain of chemistry while the database does not yet have the label “chemistry”.

In this paper, we explore the Incomplete Text Classification (IC-TC) setting: models mine patterns in a small labeled set that contains only existing labels, apply those patterns to predict existing labels in an unlabeled set, and detect out-of-pattern clusters for potential new label discoveries. We explore the possibility of using machines to detect class labels that humans fail to recognize and to classify documents into more reasonable labels. In particular, we propose a baseline model and an advanced model that both leverage semi-supervised and unsupervised learning methods to extract information from the labeled part of the dataset, learn patterns from the unlabeled part, and generate new labels from documents whose representations have low similarity to the existing class labels. In experiments on a well-balanced dataset, both models perform relatively well at learning high-quality seed words, word embeddings, and class and document embeddings, and at detecting unseen clusters of classes. With the help of the modern large language model ChatGPT, the models are also capable of finding generic labels for the new classes.

Data

DBPedia

560K Wikipedia articles, 14 evenly-distributed classes, average 50 words per document. Originally all labeled.

For incompleteness: remove 5 classes entirely, keep 10% of the remaining 9 classes to form the labeled set.

36K labeled documents, 524K unlabeled documents (including all documents with unseen labels).

Seen Labels: Book, Film, Album, Plant, Animal, Village, River, Athlete, Company; Unseen Labels: Building, Transportation, Politics, Artist, School.
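
The split described above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code: the function name make_incomplete_split, the random seed, and the toy dataset (14 classes of 40 documents each, standing in for DBPedia's 14 classes of 40K) are all assumptions made for the example.

```python
import random
from collections import defaultdict

def make_incomplete_split(labels, unseen, labeled_frac=0.1, seed=0):
    """Return (labeled_idx, unlabeled_idx) for the incomplete setting."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    labeled, unlabeled = [], []
    for label, idxs in by_class.items():
        if label in unseen:
            # Removed classes never appear in the labeled set.
            unlabeled.extend(idxs)
            continue
        rng.shuffle(idxs)
        n_keep = int(round(len(idxs) * labeled_frac))
        labeled.extend(idxs[:n_keep])      # 10% of each seen class
        unlabeled.extend(idxs[n_keep:])    # the rest stays unlabeled
    return sorted(labeled), sorted(unlabeled)

# Toy stand-in for DBPedia: 14 balanced classes, 40 documents each.
classes = [f"class_{i}" for i in range(14)]
labels = [c for c in classes for _ in range(40)]
unseen = set(classes[9:])  # pretend the last 5 classes are removed

labeled_idx, unlabeled_idx = make_incomplete_split(labels, unseen)
print(len(labeled_idx), len(unlabeled_idx))  # 36 524
```

The toy counts mirror the real ones at a 1:1000 scale: 9 seen classes contribute 10% of their documents to the labeled set, and everything else, including every document of the 5 removed classes, is unlabeled.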

Evaluation

New Label Binary

The model decides whether to generate a new label for a document based on the confidence of the weakly supervised document-class representations. The sub-task of predicting whether a document falls outside of the existing classes is a binary classification problem. We evaluate this sub-task using binary precision and recall, with "needs a new label" as the positive class.
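
As a small illustration of this sub-task metric, the sketch below computes binary precision and recall with "needs a new label" as the positive class. The function name and the eight toy documents are hypothetical and only serve the example.

```python
def binary_precision_recall(needs_new_true, needs_new_pred):
    """Precision and recall with 'needs a new label' (True) as the positive class."""
    pairs = list(zip(needs_new_true, needs_new_pred))
    tp = sum(t and p for t, p in pairs)          # correctly flagged as new
    fp = sum((not t) and p for t, p in pairs)    # existing class flagged as new
    fn = sum(t and (not p) for t, p in pairs)    # new class that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 documents truly belong to removed classes, 4 do not.
truth = [True, True, True, True, False, False, False, False]
pred = [True, True, False, False, False, False, False, True]

precision, recall = binary_precision_recall(truth, pred)
print(round(precision, 3), round(recall, 3))  # 0.667 0.5
```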

Existing Performance

The multi-class classification result over all documents whose ground truth is an existing label. We report micro- and macro-F1.
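
For reference, the two averages differ in where the pooling happens: micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1, while macro-F1 averages the per-class F1 scores. The sketch below shows both on a hypothetical six-document example; the function name and the data are assumptions for illustration.

```python
def micro_macro_f1(y_true, y_pred, classes):
    """Micro- and macro-averaged F1 over the given (existing) classes."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    pairs = list(zip(y_true, y_pred))
    per_class = []
    tot_tp = tot_fp = tot_fn = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        per_class.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn

    micro = f1(tot_tp, tot_fp, tot_fn)        # pool counts, then F1
    macro = sum(per_class) / len(per_class)   # F1 per class, then average
    return micro, macro

y_true = ["Book", "Book", "Film", "Film", "Album", "Album"]
y_pred = ["Book", "Film", "Film", "Film", "Album", "Book"]

micro, macro = micro_macro_f1(y_true, y_pred, ["Book", "Film", "Album"])
print(round(micro, 3), round(macro, 3))  # 0.667 0.656
```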

New Label Quality

After new labels are generated, we assess their quality by manual inspection and by plotting word clouds that compare the significant words appearing in the original removed classes with those in the new clustered classes with generated labels.

Method Overview

Module

Seed Words: Top 10 TF-IDF scores per existing class, de-duplicated.

Class Representation: Average embedding of picked seed words.

Document Representation: Average embedding of words in doc.

Confidence: Cosine similarity between class, document representations.

Clustering: Gaussian Mixture Model with 5 classes.
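
The first four modules can be sketched end to end on toy data. This is a minimal sketch under stated assumptions, not the project's implementation: the TF-IDF variant (class-level term count times log inverse document frequency), the tie-breaking, the hand-made 2-dimensional embeddings, and the 0.1 threshold are all hypothetical choices made so the example is self-contained.

```python
import math
from collections import Counter

def seed_words(docs, labels, top_k=10):
    """Top-k TF-IDF words per class, dropping words picked by more than one class."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    picked = {}
    for c in sorted(set(labels)):
        tf = Counter(w for doc, lab in zip(docs, labels) if lab == c for w in doc)
        scores = {w: n * math.log(n_docs / doc_freq[w]) for w, n in tf.items()}
        picked[c] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    times_picked = Counter(w for words in picked.values() for w in words)
    return {c: [w for w in words if times_picked[w] == 1] for c, words in picked.items()}

def average(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def embed(words, emb):
    """Average embedding of the words that have a vector (others are skipped)."""
    return average([emb[w] for w in words if w in emb])

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def confidence_split(doc_vecs, class_vecs, threshold):
    """Split document indices by their maximum cosine similarity to any class."""
    confident, unconfident = [], []
    for i, d in enumerate(doc_vecs):
        best = max(cosine(d, c) for c in class_vecs.values())
        (confident if best >= threshold else unconfident).append(i)
    return confident, unconfident

# Toy labeled set (already tokenized).
docs = [["river", "water", "flows"], ["river", "bank", "water"], ["river", "valley"],
        ["film", "director", "actor"], ["film", "actor", "award"], ["film", "water"]]
labels = ["River", "River", "River", "Film", "Film", "Film"]
seeds = seed_words(docs, labels, top_k=2)

# Hypothetical 2-d word embeddings; real ones would be learned from the corpus.
emb = {"river": [1.0, 0.0], "bank": [0.9, 0.1], "water": [0.8, 0.2],
       "actor": [0.0, 1.0], "film": [0.1, 0.9],
       "school": [-1.0, -1.0], "campus": [-0.9, -1.0]}
class_vecs = {c: embed(words, emb) for c, words in seeds.items()}

unlabeled = [["river", "water"], ["school", "campus"]]
doc_vecs = [embed(doc, emb) for doc in unlabeled]
confident, unconfident = confidence_split(doc_vecs, class_vecs, threshold=0.1)
print(seeds)
print(confident, unconfident)  # [0] [1]
```

In this toy run the document about rivers is close to the River class and stays confident, while the document about schools is far from both classes and lands in the unconfident set that would be passed on to clustering.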

Figure 1: Model Pipeline Illustration

Figure 1 illustrates the model pipeline for both the baseline and the final models. The models for the incomplete setting start from a set of labeled documents and another set of unlabeled documents, and mainly contain 4 modules: (1) learning word embeddings from the documents; (2) using word embeddings to find document and class representations; (3) splitting documents by confidence based on document-class similarity; (4) clustering unconfident documents and generating new labels.

Final Model

The final model fills in the remaining slots of the model pipeline by using:

Word Embeddings: We obtain a contextualized static representation of each word by using pre-trained BERT (Devlin et al., 2019) embeddings and averaging the representations of all occurrences of the word (Wang et al., 2021). We use the pre-trained bert-base-uncased model with its default vector dimension of 768.

Representations: Since cosine similarity tends to perform poorly on high-dimensional vectors, we use PCA to reduce the dimensionality of all class and document embeddings.

New Label Generation: Instead of directly using statistical methods, we use the ChatGPT API, a chatbot fine-tuned using reinforcement learning on OpenAI’s state-of-the-art language model GPT-3 (Brown et al., 2020) that has shown ability in text generation and summarization. We prompt ChatGPT to: (1) generate topics for the top 25 documents by Gaussian probability in each cluster; (2) use the summarized topics to generate a generic class label.
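
The document selection and prompting steps for new label generation might look like the sketch below. Several things here are assumptions rather than the project's code: "Gaussian probability" is read as the mixture's posterior probability (scikit-learn's predict_proba), the prompt wording is invented for illustration, the data are synthetic 2-d blobs, and no API call is made; the prompts are only built as strings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def top_documents_per_cluster(doc_vecs, n_clusters=5, top_k=25, seed=0):
    """Fit a GMM and return, per cluster, the indices of its top_k documents."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(doc_vecs)
    probs = gmm.predict_proba(doc_vecs)  # shape (n_docs, n_clusters)
    return {k: np.argsort(-probs[:, k])[:top_k].tolist() for k in range(n_clusters)}

def topic_prompt(documents):
    """Step 1 (hypothetical wording): ask for a topic per document."""
    listing = "\n".join(f"{i + 1}. {doc}" for i, doc in enumerate(documents))
    return "Give a short topic for each document below.\n" + listing

def label_prompt(topics):
    """Step 2 (hypothetical wording): ask for one generic label from the topics."""
    return ("These topics describe documents from one cluster: "
            + "; ".join(topics)
            + ". Suggest one generic class label for the cluster.")

# Synthetic stand-in for unconfident document vectors: 5 blobs of 40 points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
doc_vecs = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])
texts = [f"document {i}" for i in range(len(doc_vecs))]

top_idx = top_documents_per_cluster(doc_vecs)
prompts = {k: topic_prompt([texts[i] for i in idxs]) for k, idxs in top_idx.items()}
print(prompts[0].splitlines()[:3])
```

The topics returned for each cluster's step 1 prompt would then be passed to label_prompt to request a single generic label. Posterior probabilities can saturate near 1.0 for well-separated clusters, so ranking by per-component density is a reasonable alternative reading.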

Result

Summary of Findings

We report the results of the final model, in the same order as the pipeline:

Seed Words: We present a few example seed words generated from the first supervised TF-IDF module below. The basic TF-IDF scores are able to identify relatively representative seed words.

Similarity Distribution: Figure 2 shows the distribution of the maximum cosine similarity found for all unlabeled documents, and thus provides the criterion for selecting unconfident documents. From the figure, the distribution is roughly normal with a slight right skew, and the values range from -0.2 to almost 1.0. This is the distribution we would hope for: by the definition of cosine similarity, values near 0 indicate that the representations are unrelated, and negative values indicate that the representations point in opposite directions.

Unconfident Documents: Figure 3 shows the 2D unconfident document representations after applying the t-SNE (van der Maaten and Hinton, 2008) dimensionality reduction technique to visualize the high-dimensional data, color coded by their original label, with "Other" indicating any existing label. To generate the 2D representation, we followed the suggestions in the sklearn t-SNE documentation: first use Principal Component Analysis (PCA) to reduce to 50 dimensions, then apply t-SNE with perplexity 30. From the figure, the unconfident documents follow the original removed labels fairly closely, and 4 of the new classes form distinct clusters: Transportation, Politics, School, Building. From this distribution, we could expect the Gaussian Mixture method to detect most of the new classes relatively well, with the exception that the "Transportation" class (in orange) has two clear centers, which might not be well detected by the model. There are also a few noisy documents lying around (0, 0) in the t-SNE plot that confuse clustering and label generation.

Figure 2: Maximum Similarity Distribution for Unlabeled Documents

Figure 3: t-SNE Dimensionality Reduction for Unlabeled Documents
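
The two-stage projection used for Figure 3 (PCA to 50 dimensions, then t-SNE with perplexity 30) can be sketched with scikit-learn as follows. The function name, the random seed, and the synthetic 128-dimensional input are assumptions for illustration; real input would be the unconfident document representations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(doc_vecs, seed=0):
    """Reduce to 50 dimensions with PCA, then to 2 dimensions with t-SNE."""
    reduced = PCA(n_components=50, random_state=seed).fit_transform(doc_vecs)
    return TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(reduced)

# Synthetic stand-in for document representations.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(120, 128))

coords = project_2d(doc_vecs)
print(coords.shape)  # (120, 2)
```

The PCA step mainly reduces noise and computation before t-SNE; note that t-SNE requires the perplexity to be smaller than the number of samples, so very small document sets need a lower value.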

Experiment Results: Table 2 shows the results of experimenting with different threshold cutoffs (0.05, 0.1, 0.15) and 2 PCA dimensions (128, 256). From the existing label predictions, we can see that the supervised + weakly supervised models perform relatively well on classes already known in the dataset and improve on the baseline Word2Vec representations. Also, compared to the ConWea replication, where seed words are all human-chosen, the model’s ability to learn seed words from the existing labeled documents helps its understanding of existing classes. Note that sometimes in the ConWea setting we might not have enough labeled documents to generate the seed words, so human effort is still useful. The new label binary classification shows satisfactory results: in all experiment settings we observe a precision close to or above 0.9, showing that the similarity cutoff picks mostly correct out-of-distribution documents. On the other hand, the recall is less optimal, but it increases drastically if we take more documents. Overall, the new label binary classification and the existing label performance both show satisfactory results. Where the binary classification results are low, it’s no surprise that the generated labels are not close to the truly removed labels, but we can detect a pattern of increasing dominance of popular classes in the newly generated labels. When the threshold is increased, more documents from existing classes, and thus popular classes, become part of the unconfident set, and the new labels detected by TF-IDF come closer to representing popular classes.

The word clouds of each removed class and each clustered class are plotted in Figure 4. From the label comparison and the word clouds, most of the important words appear in the aligned clusters, confirming that the clusters found are relatively close to the original classes. This also lays a good foundation for label generation. In addition, ChatGPT produces reasonable generic labels for the clusters.

Figure 4: Word cloud for removed and generated labels, with manual row alignment. Note they don’t necessarily match.

Future Work

From the discussions above, although the current final model is capable of finding quality representations, performing reasonable similarity-based confidence splits, and generating labels based on clusters, there are several drawbacks that it fails to address:

Confidence Split: In some literature (Shu et al., 2017), the split threshold is automatically learned for each existing class, which requires less manual tuning and could lead to potentially better splits targeted at individual classes.

Clustering: In the current model, we have to specify the number of clusters beforehand, which requires prior assumptions. We can replace Gaussian Mixture with density-based clustering or LDA to automatically detect the potential number of new classes. Hierarchical clustering can also be applied so that classes with multiple centers can be captured as well.

Data: The DBPedia dataset is well-balanced, which is less realistic for unseen classes. In the example of the online database, classes that are not provided beforehand are more likely to be unpopular classes. We need to improve the method heuristics to work for unbalanced and fine-grained datasets.

Extension: Since the model has the ability to detect new labels based on out-of-distribution documents, we can naturally extend it to detect potential mislabels. For example, we can use the confidence score to identify and relabel poor human annotations and to allow multiple labels per document.

Extension: As discussed in the Method section, we chose simple similarity-based techniques so that human effort, which is both error-prone and time-consuming, can be reduced further. We can fully utilize extremely weak supervision techniques that use only class names as supervision and learn class-oriented document representations using attention (Wang et al., 2021).

Contact

Yacun Wang

Email: yaw006@ucsd.edu
Github: colts661

Luning Yang

Email: l4yang@ucsd.edu
Github: Luning-Yang
