AI researchers have created a language-model testing tool that discovers major bugs in commercially available cloud AI offerings from Amazon, Google, and Microsoft. Yesterday, a paper detailing the CheckList tool received the Best Paper award from organizers of the Association for Computational Linguistics (ACL) conference. From a report:
NLP models today are often evaluated based on how they perform on a series of individual tasks, such as answering questions using benchmark data sets with leaderboards like GLUE. CheckList instead takes a task-agnostic approach, allowing people to create tests that fill in cells in a spreadsheet-like matrix with capabilities (in rows) and test types (in columns), along with visualizations and other resources. Analysis with CheckList found that about one in four sentiment analysis predictions by Amazon’s Comprehend change when a random shortened URL or Twitter handle is placed in text, and that Google Cloud’s Natural Language and Amazon’s Comprehend make mistakes when the names of people or locations are changed in text. “The [sentiment analysis] failure rate is near 100% for all commercial models when the negation comes at the end of the sentence (e.g. ‘I thought the plane would be awful, but it wasn’t’), or with neutral content between the negation and the sentiment-laden word,” the paper reads.
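To make the Twitter-handle finding concrete, here is a minimal Python sketch of that kind of invariance check: perturb each input in a way that should not affect sentiment, then count how often the prediction changes. It does not use the CheckList library or any vendor API. The function names, the sample sentences, and the deliberately brittle `toy_predict` model are all hypothetical stand-ins chosen for illustration.

```python
import random
from typing import Callable, List


def add_random_handle(text: str, rng: random.Random) -> str:
    """Append a made-up Twitter handle; the sentiment label should not change."""
    handle = "@user" + str(rng.randrange(1000, 10000))
    return f"{text} {handle}"


def invariance_failure_rate(
    predict: Callable[[str], str],
    texts: List[str],
    perturb: Callable[[str, random.Random], str],
    seed: int = 0,
) -> float:
    """Fraction of inputs whose predicted label changes after perturbation."""
    rng = random.Random(seed)
    failures = sum(predict(t) != predict(perturb(t, rng)) for t in texts)
    return failures / len(texts)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in model: keyword lookup, deliberately brittle to '@'."""
    lowered = text.lower()
    if "great" in lowered:
        # Simulated bug: a handle flips positive predictions to neutral.
        return "neutral" if "@" in lowered else "positive"
    if "awful" in lowered:
        return "negative"
    return "neutral"


# Hypothetical sample inputs, not data from the paper.
texts = [
    "The flight was great.",
    "The food was awful.",
    "We landed at noon.",
    "Great crew, great seats.",
]
rate = invariance_failure_rate(toy_predict, texts, add_random_handle)
print(f"Invariance failure rate: {rate:.0%}")
```

In a real evaluation, `toy_predict` would be replaced by a call to the model under test, and a nonzero failure rate on a perturbation like this would point to the sort of bug the report describes.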