arxiv:2205.11134

Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status

Published on May 23, 2022

Abstract

Bootstrap confidence intervals are advocated for evaluating and comparing NLP system performances, offering advantages over traditional SOTA and statistical significance testing approaches.

AI-generated summary

This paper argues for the widest possible use of bootstrap confidence intervals for comparing NLP system performances, in place of state-of-the-art (SOTA) status claims and statistical significance testing. Their main benefits are to draw attention to the difference in performance between two systems and to help assess the degree of superiority of one system over another. Two case studies, one comparing several systems and the other based on a K-fold cross-validation procedure, illustrate these benefits. A Python module for obtaining these confidence intervals, together with a second function implementing the Fisher-Pitman test for paired samples, is freely available on PyPI.
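To make the two techniques named in the abstract concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the difference between two systems, and of a Monte Carlo approximation of the Fisher-Pitman permutation test for paired samples. It is not the authors' PyPI module: the function names, parameters, and toy scores are hypothetical. It also assumes the metric is a mean of per-item scores (accuracy, for example), so that resampling the paired differences is equivalent to resampling test items.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Hypothetical helper: test items are resampled with replacement, keeping
    the two systems' scores for each item together (paired design).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot = sorted(
        mean(diffs[rng.randrange(n)] for _ in range(n))
        for _ in range(n_resamples)
    )
    lower = boot[int((alpha / 2) * n_resamples)]
    upper = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided paired permutation test, approximated by random sign flips.

    Under the null hypothesis the two systems are exchangeable on each item,
    so the sign of every paired difference can be flipped at random.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy per-item correctness (1 = correct) for two hypothetical systems.
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

difference, ci_lower, ci_upper = paired_bootstrap_ci(system_a, system_b)
p_value = paired_permutation_pvalue(system_a, system_b)
print(f"difference = {difference:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"permutation p-value = {p_value:.3f}")
```

The interval reports how large the difference between the two systems plausibly is, which is the quantity the paper asks readers to focus on, whereas the p-value only says whether a difference of that size would be surprising if the systems were equivalent. For metrics that are not simple means of per-item scores (corpus-level F1 or BLEU, for instance), the metric would have to be recomputed on each resampled set of items rather than averaged over differences.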
