Table 3: Category-wise evaluation scores. Likert scores are averaged over both annotators; correlation is measured between individual Likert scores and percent agreement.
Their experiments show that #LLMs can produce reasonable poem descriptions but struggle with more abstract interpretation, highlighting where #NLG currently meets its #limits in #LiteraryInterpretation.
#LiteraryComputing #Evaluation #GenerativeModels