Is MAUVE the best metric to measure the gap between neural text and human text?
The evaluation of text generation is a crucial step in the development of text generation models, yet measuring the quality of generated text remains a challenge in natural language processing [2]. MAUVE, introduced by Pillutla et al. (2021) [1], is a recent comparison measure for open-ended text generation that won an Outstanding Paper Award at NeurIPS 2021. Unlike metrics such as BLEU and perplexity, MAUVE is a distributional metric: it compares the model's distribution Q against the distribution P of human-written text. The main idea is to measure the divergence in both directions, from P to Q and from Q to P, by tracing a divergence frontier between the two distributions. Because this is computationally intractable in general, the authors propose an approximation that quantizes both distributions into a discrete space. Their experiments show that MAUVE agrees with human judgment better than competing metrics, which is what makes it an interesting metric to study.
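To make the idea concrete, here is a minimal sketch of the quantized divergence-frontier computation, assuming we already have feature embeddings for the human text (P) and the model text (Q), e.g. from a GPT-2 encoder. The function names, bin count, and the scaling constant of 5 are illustrative choices for this sketch, not the authors' reference implementation.

```python
# Minimal sketch: quantize embeddings into shared bins, then integrate the
# divergence curve traced by mixtures of the two histograms (assumed setup).
import numpy as np
from sklearn.cluster import KMeans


def quantize(p_feats, q_feats, n_bins=50, seed=0):
    """Jointly cluster both embedding sets and histogram each over the shared bins."""
    all_feats = np.concatenate([p_feats, q_feats], axis=0)
    labels = KMeans(n_clusters=n_bins, n_init=10, random_state=seed).fit_predict(all_feats)
    p_hist = np.bincount(labels[: len(p_feats)], minlength=n_bins).astype(float)
    q_hist = np.bincount(labels[len(p_feats):], minlength=n_bins).astype(float)
    return p_hist / p_hist.sum(), q_hist / q_hist.sum()


def kl(a, b, eps=1e-12):
    """Discrete KL divergence KL(a || b), smoothed so empty bins do not blow up."""
    a, b = a + eps, b + eps
    return float(np.sum(a * np.log(a / b)))


def mauve_sketch(p_feats, q_feats, n_bins=50, scale=5.0, n_points=100):
    """Area under the divergence curve traced by mixtures R = lam*P + (1-lam)*Q."""
    p, q = quantize(p_feats, q_feats, n_bins)
    xs, ys = [1.0], [0.0]                      # anchor point: R equals Q exactly
    for lam in np.linspace(1e-3, 1 - 1e-3, n_points):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-scale * kl(q, r)))   # closeness of Q to the mixture
        ys.append(np.exp(-scale * kl(p, r)))   # closeness of P to the mixture
    xs.append(0.0)
    ys.append(1.0)                             # anchor point: R equals P exactly
    # x decreases and y increases along the frontier, so reverse before integrating
    return float(np.trapz(ys[::-1], xs[::-1]))
```

The score is large only when the two quantized distributions stay close along the whole mixture path; if the model places mass where humans never write (or misses modes of human text), one of the two KL terms grows, the curve hugs the axes, and the area drops.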
In this project, the main question we want to answer is whether MAUVE should become a new standard metric for evaluating open-ended text generation. First, we will study open-ended text generation and the inner workings of MAUVE. Then, we will reimplement MAUVE, recreate the experiments in the paper, evaluate MAUVE on a dataset of our choice, and compare it with other metrics (a sketch of such a comparison run is shown below). Finally, we would like to explore the limitations of MAUVE and its future prospects.
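As a sanity check against our reimplementation, the authors' pip package can compute the score directly from raw text. The snippet below is a hypothetical comparison run: it assumes the `mauve-text` package exposes `compute_mauve` with these argument names as described in the project README, which should be verified against the current release.

```python
# Hypothetical comparison run against the authors' package (pip install mauve-text).
import mauve

# In practice each list should hold a few hundred samples from the chosen dataset;
# two placeholder strings per side are shown only to keep the example short.
p_text = ["Human-written paragraph one ...", "Human-written paragraph two ..."]
q_text = ["Model-generated paragraph one ...", "Model-generated paragraph two ..."]

out = mauve.compute_mauve(p_text=p_text, q_text=q_text,
                          device_id=0,          # GPU used for feature extraction
                          max_text_length=256,  # truncate long generations
                          verbose=False)
print(out.mauve)  # scalar in (0, 1]; higher means closer to human text
```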
[1] Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., & Harchaoui, Z. (2021). MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems, 34, 4816-4828.
[2] Nguyen, A. (2021). Language model evaluation in open-ended text generation. arXiv preprint arXiv:2108.03578.
Authors' Implementation of MAUVE
Shreyas Prasad (prasad.shre@northeastern.edu)
Gaurav