Jury learning introduces a novel supervised machine learning (ML) approach to the challenge of label disagreement in societal contexts such as online toxicity, misinformation detection, and medical diagnosis. Traditional ML methods typically resolve these disagreements by majority vote, often drowning out minority perspectives. Jury learning instead makes diverse societal opinions explicit: a 'jury' of diverse annotators determines the classifier's predictions, integrating varied viewpoints on societally contested questions. The paper presents a deep learning architecture that models individual annotators, allowing for dynamic jury compositions and enhanced representation of minority views. This approach not only accommodates diverse opinions but also significantly alters classification outcomes, indicating its potential for creating more inclusive and representative ML models.
Mitchell L. Gordon
Incoming Assistant Professor at MIT
Postdoc from University of Washington
PhD from Stanford
Citations: 817
Michelle S. Lam
PhD candidate at Stanford
BS and MS from Stanford
Citations: 237
Joon Sung Park
PhD candidate at Stanford
MS from UIUC
Citations: 2845
Kayur Patel
Research Scientist at Apple
PhD from University of Washington
MS from Stanford
Citations: 1811
Jeffrey T. Hancock
Professor at Stanford
Former Professor at Cornell
PhD from Dalhousie University
Citations: 28025
Tatsunori Hashimoto
Assistant Professor at Stanford
Postdoc from Stanford University
PhD from MIT
Citations: 9577
Michael S. Bernstein
Associate Professor at Stanford
MS and PhD from MIT
Co-author of ImageNet
Citations: 67060
Engaging Stakeholders in Algorithm Design: The paper critiques the unexamined majoritarianism of typical ML annotation pipelines, which tends to exclude minority annotations. In response, jury learning is introduced as a means to give weight to the voices of minority annotators. This approach aligns with the need in human-computer interaction and AI fairness [1] for algorithms that balance multiple stakeholders' needs and interests. Past works like WeBuildAI [2] have stakeholders design their own models representing their beliefs, and a larger algorithm then uses each of these models as a single vote when making a decision for the group. Jury learning, by contrast, allows practitioners to model each relevant individual or group from existing datasets, enabling them to reason over and specify which individuals or groups their models should reflect. It also offers a view orthogonal to current AI fairness work: where existing fairness algorithms are outcome-centric, jury learning addresses fairness at the level of the model's inputs.
Disagreement in Datasets: Disagreements are very common in the annotation process for ML models, especially in subjective tasks like toxicity detection, medical diagnosis, and news misinformation. Some past works have handled disagreements by having annotators reach a consensus. But for some tasks, such as those common in social computing contexts, much of the disagreement is likely irreducible [3], stemming from the socially contested nature of the questions. The authors therefore highlight the importance of understanding not just the existence of disagreement but also its nature and the reasons behind it. They propose annotator-level modeling to better capture the distribution of opinions and provide insights into who disagrees and why. It also provides an opportunity to include the voices and opinions that matter most for a given use case.
Interactive Machine Learning (ML): A prominent area of ML research is integrating human-centered methods into machine learning systems. Interactive machine learning seeks methods to integrate human insight into the creation of ML models. Past works have used humans in the loop to provide more accurate labels [4]. Another line of work has sought to characterize best practices for designers and organizations developing such classifiers [5]. The paper adds to this by providing a new interactive framework with visualizations that increase the explainability of the classifier's results.
A Deep & Cross Network (DCN) sits at the core of the architecture, combining three sets of embeddings: content, annotator, and group. The content embedding enables prediction on previously unseen items by mapping those items into a shared space. The group embeddings make use of the data from all annotators who belong to each group, helping overcome sparsity in the dataset. The annotator embedding ensures that the model learns when each annotator differs from the groups they belong to. The DCN learns to combine these embeddings to predict each individual annotator's reaction to an example: the embeddings are concatenated into an input layer, then fed into a cross network containing multiple cross layers that model explicit feature interactions, and combined with a deep network that models implicit feature interactions. The DCN architecture is modified to jointly train a pre-trained BERT-based model, using its pooler output as the content embedding.
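The following is a minimal PyTorch sketch of this architecture, assuming DCN-V2-style cross layers; the embedding sizes, layer widths, label count, and averaging of multiple group embeddings per annotator are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CrossLayer(nn.Module):
    """One cross layer modeling explicit feature interactions:
    x_{l+1} = x_0 * (W x_l + b) + x_l (DCN-V2 style)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl

class JuryDCN(nn.Module):
    def __init__(self, n_annotators, n_groups, emb_dim=64, n_cross=3, n_labels=5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Content embedding: BERT pooler output projected into the shared space.
        self.content_proj = nn.Linear(self.bert.config.hidden_size, emb_dim)
        # Annotator embedding: captures how each annotator deviates from their groups.
        self.annotator_emb = nn.Embedding(n_annotators, emb_dim)
        # Group embeddings: pool signal across all annotators in a group (sparsity).
        self.group_emb = nn.Embedding(n_groups, emb_dim)
        in_dim = 3 * emb_dim
        self.cross = nn.ModuleList([CrossLayer(in_dim) for _ in range(n_cross)])
        self.deep = nn.Sequential(  # implicit feature interactions
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(in_dim + 128, n_labels)

    def forward(self, input_ids, attention_mask, annotator_id, group_ids):
        content = self.content_proj(
            self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output)
        # group_ids: (batch, n_groups_per_annotator); average that annotator's groups.
        groups = self.group_emb(group_ids).mean(dim=1)
        x0 = torch.cat([content, self.annotator_emb(annotator_id), groups], dim=-1)
        xl = x0
        for layer in self.cross:
            xl = layer(x0, xl)
        # Per-annotator prediction over the label scale (e.g., toxicity ratings).
        return self.head(torch.cat([xl, self.deep(x0)], dim=-1))
```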
The authors compare performance against individual annotators' test labels for three models: today's standard state-of-the-art aggregate approach (which is annotator-agnostic and makes one prediction per example), a group-specific version of the proposed architecture, and the full version of the proposed architecture. The standard aggregate model's performance varies substantially between groups: it achieves an MAE of 0.83 for Asian annotators and 1.12 for Black annotators, a performance decrease of 35.0%. By comparison, the full model still shows differences between groups, but with far smaller magnitudes: an MAE of 0.62 for Asian annotators and 0.65 for Black annotators, a performance decrease of 4.9%.
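These percentages are simply the relative increase in MAE between the two groups; for the aggregate model:

\[
\frac{1.12 - 0.83}{0.83} \approx 0.350 = 35.0\%
\]

The 4.9% figure for the full model follows from the same computation on the unrounded MAEs (the rounded values above give \((0.65 - 0.62)/0.62 \approx 4.8\%\)).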
For the user evaluation, the authors recruited 19 content moderators. They explained the jury learning algorithm to them, showed them a sample of toxic comments, and asked them to compose juries. They found that, on average, 13.6% of decisions flipped between a jury learning classifier customized for a community and an off-the-shelf classifier.
In the Jury Selection portion of the system, the user can create juror sheets to populate their jury composition and can provide one or more input examples to evaluate. The system then outputs the Jury Learning Results section, where they can view a summary of the jury verdict based on a median-of-means estimator over jury outcomes. Here, they can view the full distribution of jury outcomes, select individual juries to view trends, and inspect individual jurors on a jury. When a user selects a jury, the Jury Trends section is updated. There, they can group by different fields like the juror sheet, decision label, or other demographic attributes to understand patterns in the labels from this jury and contextualize them with respect to the larger population. When a user selects a particular juror, the Juror Details view opens, where they can inspect the juror's predicted label, background, and annotations. Users can also inspect counterfactual juries that would result in the opposite verdict.
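A minimal sketch of the verdict estimation described above, assuming a trained per-annotator model like the JuryDCN sketch earlier; `model.predict(example, juror_id)` is a hypothetical helper returning that juror's predicted score, and the pool structure and jury count are assumptions.

```python
import random
import statistics

def sample_jury(pool_by_group, composition):
    """Draw one jury matching a composition such as {"Black women": 6, "Asian men": 6},
    where pool_by_group maps each group to a list of annotator IDs."""
    jury = []
    for group, n in composition.items():
        jury.extend(random.sample(pool_by_group[group], n))
    return jury

def median_of_means_verdict(model, example, pool_by_group, composition, n_juries=100):
    """Resample many juries matching the composition; each jury's verdict is the
    mean of its jurors' predicted scores, and the final estimate is the median
    of those means, which is robust to occasional extreme juries."""
    verdicts = []
    for _ in range(n_juries):
        jury = sample_jury(pool_by_group, composition)
        verdicts.append(statistics.mean(model.predict(example, j) for j in jury))
    return statistics.median(verdicts)
```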
Companies can shift the AI models they use to this new paradigm based on jury learning, which offers better interpretability, visualization, and decision making. Suppose you are a journalist and your comment was removed by a toxicity detection model; you would want to know what about it was toxic. With jury learning, companies can provide a more concrete rationale, such as "A jury of 6 people, including 2 Black women, 2 white women, and 2 Asian men, found this toxic."
Companies also don't have to undergo the tedious and time-consuming process of creating ML models for different target populations. They can use the same model with a jury composition that best represents each target population.
Jury learning currently models the values and voices of annotators based on their demographics. An intriguing avenue for expanding this research would be implicitly deriving these values and voices from the data, without requiring explicit demographic information. This would allow models to learn more nuanced values of a person that are not based solely on their identity or demographics. Consider a dietitian striving to suggest the optimal food items for a person. A jury learning model with implicit value learning, trained on past data from the target individual and comparable profiles, could make food suggestions based on that individual's preferences and values. This nuanced approach would help us build human-centric ML models that are grounded in values and preferences.
Another interesting avenue is to try this approach on different types of deep learning models and tasks; several follow-up papers have already extended this work in that direction.
Score: 7 (Accept)
Strengths
Weaknesses
Score: 7 (Accept)
Strengths
Weaknesses
[1] Mehrabi, Ninareh, et al. "A survey on bias and fairness in machine learning." ACM Computing Surveys (CSUR) 54.6 (2021): 1-35.
[2] Lee, Min Kyung, et al. "WeBuildAI: Participatory framework for algorithmic governance." Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019): 1-35.
[3] Gordon, Mitchell L., et al. "The disagreement deconvolution: Bringing machine learning performance metrics in line with reality." Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021.
[4] Chang, Joseph Chee, Saleema Amershi, and Ece Kamar. "Revolt: Collaborative crowdsourcing for labeling machine learning datasets." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2017.
[5] Amershi, Saleema, et al. "Guidelines for human-AI interaction." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019.
Srijha Kalyan
Akshat Choube