Jury Learning: Integrating Dissenting Voices into Machine Learning Models

An Analysis of "Jury Learning: Integrating Dissenting Voices into Machine Learning Models"

Introduction/Motivation

Jury Learning introduces a novel approach in supervised machine learning (ML) to address the challenge of label disagreements in societal contexts, such as online toxicity, misinformation detection, and medical diagnosis. Traditional ML methods typically use majority voting to resolve these disagreements, often overshadowing minority perspectives. The innovative solution proposed is 'jury learning,' which explicitly considers diverse societal opinions. This method involves forming a 'jury' of diverse annotators to determine the classifier's predictions, thereby integrating varied viewpoints and responses to societal disagreements. The paper presents a deep learning architecture that models individual annotators, allowing for dynamic jury compositions and enhanced representation of minority views. This approach not only accommodates diverse opinions but also significantly alters classification outcomes, indicating its potential for creating more inclusive and representative ML models.

An overview of Jury Learning

Biography

Mitchell L. Gordon

Incoming Assistant Professor at MIT
Postdoc at the University of Washington
PhD from Stanford
Citations: 817


Michelle S. Lam

PhD candidate at Stanford
BS and MS from Stanford
Citations: 237


Joon Sung Park

PhD candidate at Stanford
MS from UIUC
Citations: 2845


Kayur Patel

Research Scientist at Apple
PhD from University of Washington
MS from Stanford
Citations: 1811


Jeffrey T. Hancock

Professor at Stanford
Former Professor at Cornell
PhD from Dalhousie University
Citations: 28025


Tatsunori Hashimoto

Assistant Professor at Stanford
Postdoc at Stanford University
PhD from MIT
Citations: 9577


Michael S. Bernstein

Associate Professor at Stanford
MS and PhD from MIT
Co-author of ImageNet
Citations: 67060

Literature Review

Engaging Stakeholders in Algorithm Design: The paper critiques unexamined majoritarianism in the annotation process for ML models, which tends to exclude minority annotations. In response, jury learning is introduced as a means to give weight to the voices of minority annotators. This approach aligns with the need in human-computer interaction and AI fairness [1] for algorithms that balance multiple stakeholders' needs and interests. Past works like WeBuildAI [2] have stakeholders design their own models representing their beliefs, and a larger algorithm then uses each of these models as a single vote when making a decision for the group. The jury learning approach, however, allows practitioners to model each relevant individual or group from existing datasets, enabling them to reason over and specify which individuals or groups their models should reflect. It also provides an orthogonal view of fairness: where current AI fairness algorithms are outcome-centric, jury learning locates fairness at the model input level.

Disagreement in Datasets: Disagreements are very common in the annotation process for ML datasets, especially in subjective tasks like toxicity detection, medical diagnosis, and news misinformation. Some past works have tried handling disagreements by forming a consensus among annotators. But for some tasks, such as those common in social computing contexts, much of the disagreement is likely irreducible [3], stemming from the socially contested nature of the questions. Thus, the authors highlight the importance of understanding not just the existence of disagreement but also its nature and the reasons behind it. They propose annotator-level modeling to better capture the distribution of opinions and provide insight into who disagrees and why. It also provides an opportunity to include the voices and opinions that matter most for a given use case.

Interactive Machine Learning (ML): A prominent area of ML research is integrating human-centered methods into machine learning systems. Interactive machine learning seeks methods to integrate human insight into the creation of ML models. Past works have used humans-in-the-loop to provide more accurate labels [4]. Another line of work has sought to characterize best practices for designers and organizations developing such classifiers [5]. The paper adds to this by providing a new interactive framework whose visualizations increase the explainability of the classifier's results.

Methodology

Model Architecture

A Deep & Cross Network (DCN) is used in the architecture, combining three sets of embeddings: content, annotator, and group. The content embedding enables prediction on previously unseen items by mapping those items into a shared space. The group embeddings make use of the data from all annotators who belong to each group, helping overcome sparsity in the dataset. The annotator embedding ensures that the model learns when each annotator differs from the groups they belong to. The DCN learns to combine these embeddings to predict each individual annotator's reaction to an example: the embeddings are concatenated into an input layer, fed into a cross network containing multiple cross layers that model explicit feature interactions, and then combined with a deep network that models implicit feature interactions. The DCN architecture is modified to jointly train a pre-trained BERT-based model, using its pooler output as the content embeddings.
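To make this concrete, here is a minimal PyTorch sketch of how the three embeddings could feed the cross and deep networks. The class names, layer sizes, and the single group ID per annotator are our simplifying assumptions, not the authors' released implementation (which jointly fine-tunes BERT and handles annotators with multiple group memberships):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One cross layer: x_{l+1} = x0 * (W x_l + b) + x_l (explicit interactions)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, x):
        return x0 * self.linear(x) + x

class JuryDCN(nn.Module):
    """Predicts an individual annotator's rating of a piece of content."""
    def __init__(self, n_annotators, n_groups, content_dim=768, emb_dim=32, n_cross=3):
        super().__init__()
        self.annotator_emb = nn.Embedding(n_annotators, emb_dim)
        self.group_emb = nn.Embedding(n_groups, emb_dim)
        in_dim = content_dim + 2 * emb_dim
        self.cross = nn.ModuleList([CrossLayer(in_dim) for _ in range(n_cross)])
        self.deep = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.head = nn.Linear(in_dim + 128, 1)  # regression head over the rating scale

    def forward(self, content_emb, annotator_ids, group_ids):
        # content_emb: e.g. BERT pooler output, shape (batch, content_dim)
        x0 = torch.cat([content_emb,
                        self.annotator_emb(annotator_ids),
                        self.group_emb(group_ids)], dim=-1)
        x = x0
        for layer in self.cross:      # explicit feature interactions
            x = layer(x0, x)
        deep_out = self.deep(x0)      # implicit feature interactions
        return self.head(torch.cat([x, deep_out], dim=-1)).squeeze(-1)

# Hypothetical usage:
# model = JuryDCN(n_annotators=5000, n_groups=20)
# rating = model(bert_pooler_output, annotator_ids, group_ids)
```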

The authors report performance against individual annotators' test labels for three models: today's standard state-of-the-art aggregate approach (which is annotator-agnostic and makes one prediction per example), a group-specific version of the proposed architecture, and the full version of the proposed architecture. The standard aggregated model's performance varies substantially between groups: it achieves an MAE of 0.83 for Asian annotators and 1.12 for Black annotators, a performance decrease of 35.0%. By comparison, the full model still shows differences between groups, but with far smaller magnitudes: an MAE of 0.62 for Asian annotators and 0.65 for Black annotators, a performance decrease of 4.9%.
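As a worked illustration, this per-group breakdown is just MAE computed against each annotator's own labels, grouped by demographic. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical per-annotator test predictions; the numbers are invented.
df = pd.DataFrame({
    "group":      ["Asian", "Asian", "Black", "Black"],
    "true_label": [2, 0, 3, 1],
    "pred_label": [1.4, 0.3, 2.6, 1.8],
})

# MAE against each individual annotator's test labels, broken down by group.
mae_by_group = (df["pred_label"] - df["true_label"]).abs().groupby(df["group"]).mean()
print(mae_by_group)
```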

User Evaluation

For the user evaluation, the authors recruited 19 content moderators. They explained the jury learning algorithm to the moderators, showed them a sample of toxic comments, and asked them to compose juries. They found that, on average, 13.6% of decisions flipped between a jury learning classifier customized for a community and an off-the-shelf classifier.
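The flip-rate metric itself is straightforward to compute; a tiny sketch with hypothetical verdicts:

```python
import numpy as np

# Hypothetical binary verdicts from the two classifiers on the same comments.
off_the_shelf = np.array([1, 0, 1, 1, 0, 1, 0, 1])
jury_learning = np.array([1, 1, 1, 0, 0, 1, 0, 1])

# Fraction of moderation decisions that differ between the two classifiers.
flip_rate = (off_the_shelf != jury_learning).mean()
print(f"{flip_rate:.1%} of decisions flipped")
```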

System Overview

In the Jury Selection portion of the system, the user can create juror sheets to populate their jury composition and can provide one or more input examples to evaluate. The system then outputs the Jury Learning Results section, where the user can view a summary of the jury verdict based on a median-of-means estimator of jury outcomes. Here, they can view the full distribution of jury outcomes, select individual juries to view trends, and inspect individual jurors on a jury. When a user selects a jury, the Jury Trends section is updated. There, they can group by different fields, like the juror sheet, decision label, or other demographic attributes, to understand patterns in the labels from this jury and contextualize them with respect to the larger population. When a user selects a particular juror, the Juror Details view opens, where they can inspect the juror's predicted label, background, and annotations. Users can also inspect counterfactual juries that would result in the opposite verdict.
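Below is a minimal sketch of how the median-of-means jury verdict could be estimated, assuming the model's per-annotator predictions are already available. The function name, the composition format, and the sampling details are our assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def jury_verdict(predicted_ratings, composition, jurors_by_group,
                 n_juries=1000, jury_size=12):
    """Median-of-means estimate of the jury outcome for one example.

    predicted_ratings: {annotator_id: model-predicted rating for this example}
    composition:       {group_name: number of seats}, summing to jury_size
    jurors_by_group:   {group_name: list of annotator_ids in that group}
    """
    assert sum(composition.values()) == jury_size
    jury_means = []
    for _ in range(n_juries):
        jury = []
        for group, seats in composition.items():
            jury.extend(rng.choice(jurors_by_group[group], size=seats, replace=False))
        jury_means.append(np.mean([predicted_ratings[j] for j in jury]))
    # The median over per-jury means is robust to occasional extreme juries.
    return np.median(jury_means), jury_means

# Hypothetical usage: a 6-seat jury with a fixed demographic composition.
ratings = {aid: rng.uniform(0, 4) for aid in range(300)}
groups = {"white_men": list(range(0, 100)),
          "women_of_color": list(range(100, 200)),
          "lgbtq": list(range(200, 300))}
verdict, _ = jury_verdict(ratings, {"white_men": 2, "women_of_color": 2, "lgbtq": 2},
                          groups, jury_size=6)
print(f"median-of-means verdict: {verdict:.2f}")
```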

Social Impact


Positive Impacts:


  • Fairness: The paper provides an orthogonal view of fairness. AI fairness has so far been output-centered, whereas this paper provides an opportunity to include fairness at the model input level. Jury learning can be thought of as a form of procedural justice: it doesn't guarantee the fairness of outcomes, but it makes claims about the correctness of the process.

  • Diversity: Jury Learning provides a way to include the values and voices of diverse groups of people. We have the option of designing diverse juries and examining how their values and voices affect the outcome. This helps us understand how diversity affects the model's output.

  • Inclusion: Jury Learning ensures that voices from underrepresented groups are heard and reflected in our models. It is especially important when we are designing models for platforms that are extensively used by underrepresented groups, for example, a toxicity detection model for a website used by the LGBTQ+ community.

  • Interpretability: Jury learning provides visualizations and model results that are more interpretable. For example, suppose you are a journalist whose comment was removed from a platform; with jury learning, you would get a better explanation, such as "a jury including 2 white men, 2 women of color, and 2 people of LGBTQ+ identity found this toxic."

  • Better Decision Making: Jury Learning provides visualizations that help ML and HCI Practitioners visualize and analyze the values and voices of different groups of people and even individual annotators. The practitioners can then make decisions about which voices to include on the jury based on their applications.

Negative Impacts:


  • Bias: Jury learning hands a lot of power to the ML/HCI practitioner. The practitioner can select a morally corrupt jury, and since the model emulates the voice of the jury, its judgments will be morally wrong as well. For example, seating racist jurors can make the model racist.

  • Privacy: Jury learning requires demographic and identity information that is usually very personal to annotators. Sometimes a piece of information, or a combination of identities, can lead to the identification of an annotator, hurting their privacy and the promise of anonymization.

  • Abdication of Responsibility: ML models are not 100% accurate in real-world scenarios, so when a model fails, who should be held responsible: the ML practitioner, the jurors, or the organization deploying the model? We would need policies in place to handle these scenarios.

  • Ecological Fallacy: The ecological fallacy occurs when we attribute group-level values to an individual based solely on their membership in that group. In jury learning, we take the median of means across all sampled juries as the final output, which risks this fallacy: we treat the voice of the entire group as the voice of individual jurors.

Industrial Application

Companies can shift the AI models they use to this new paradigm based on jury learning, which offers better interpretability, visualization, and decision making. Suppose you are a journalist whose comment was removed by a toxicity detection model; you would want to know what made it toxic. With jury learning, companies can provide better reasoning, like "a jury of 6 people, including 2 Black women, 2 white women, and 2 Asian men, found this toxic".

Companies also don't have to undergo the tedious and time-consuming process of creating separate ML models for different target populations. They can use the same model with a jury composition that best represents each target population, as sketched below.
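A sketch of this reuse pattern, with hypothetical communities, groups, and predictions (a production system would aggregate over many sampled juries with the median-of-means estimator, as in the System Overview sketch; a single jury draw is shown here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# One trained model's per-annotator predictions for a single comment (hypothetical).
predicted_ratings = {aid: rng.uniform(0, 4) for aid in range(100)}
jurors_by_group = {"black_women": list(range(0, 30)),
                   "white_women": list(range(30, 60)),
                   "asian_men": list(range(60, 100))}

# Only the jury composition changes per community; the model weights do not.
compositions = {
    "news_site":    {"black_women": 2, "white_women": 2, "asian_men": 2},
    "gaming_forum": {"asian_men": 4, "white_women": 2},
}
for community, composition in compositions.items():
    jury = [j for group, seats in composition.items()
            for j in rng.choice(jurors_by_group[group], size=seats, replace=False)]
    score = np.mean([predicted_ratings[j] for j in jury])
    print(f"{community}: mean jury rating {score:.2f}")
```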

Academic Research

Jury learning currently models the values and voices of annotators based on their demographics. An intriguing avenue for expanding this research would be implicitly deriving these values and voices from the data, without requiring explicit demographic information. This would allow models to learn more nuanced values of a person that are not based solely on identity or demographics. Consider a dietitian striving to suggest the optimal food items for a person. Employing a jury learning model with implicit value learning, trained on past data from the target individual and comparable profiles, would help in making food suggestions based on the individual's preferences and values. This nuanced approach would help us build human-centric ML models that are grounded in values and preferences.

Another interesting avenue is to apply this approach to different types of deep learning models and tasks; several papers have already extended this work in this direction.

Peer-Review


Reviewer 1 (Akshat Choube)


Score: 7 (Accept)

Strengths

  • Introduces a new architecture that combines Deep & Cross Networks with pre-trained language models
  • Jointly modelling annotators and the task is unique
  • Introduces a new learning framework that is more grounded in human values
  • Makes the model more explainable and offers better decision-making for ML practitioners

Weaknesses

  • The paper misses an in-depth discussion of the applicability of jury learning in other domains.
  • Jury selection is an important aspect of this paper, and the authors should have discussed a more algorithmic approach to it.
  • Some parts of the paper were repetitive.

Reviewer 2 (Srijha Kalyan)


Score: 7 (Accept)

Strengths

  • Jury learning innovatively addresses the issue of label disagreements in machine learning, particularly in societal contexts like online toxicity and misinformation detection.
  • The model architecture of jury learning allows for dynamic jury compositions, offering a responsive approach to incorporating diverse voices in machine learning models. This adaptability is crucial in societal contexts where opinions and societal norms are constantly evolving.
  • The jury learning approach can lead to more performant and fair classifiers by taking into account the views of a wider range of annotators. This is reflected in the evaluation, which shows changes in classification outcomes due to the increased diversity of jury composition.
  • The approach promotes reflective practices around dataset generation and model creation, encouraging practitioners to think critically about whose voices their models represent and why.

Weaknesses

  • The paper says Jury Learning provides an orthogonal view of fairness but does not use any metrics to evaluate fairness.
  • The authors should have discussed the limitations of this framework in more detail.
  • It would have been interesting if the paper had presented existing approaches that model annotators and compared jury learning with them.

References

[1] Mehrabi, Ninareh, et al. "A survey on bias and fairness in machine learning." ACM Computing Surveys (CSUR) 54.6 (2021): 1-35.

[2] Lee, Min Kyung, et al. "WeBuildAI: Participatory framework for algorithmic governance." Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019): 1-35.

[3] Gordon, Mitchell L., et al. "The disagreement deconvolution: Bringing machine learning performance metrics in line with reality." Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021.

[4] Chang, Joseph Chee, Saleema Amershi, and Ece Kamar. "Revolt: Collaborative crowdsourcing for labeling machine learning datasets." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2017.

[5] Amershi, Saleema, et al. "Guidelines for human-AI interaction." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019.

Team Members

Srijha Kalyan

Akshat Choube

Poster
