Conference Report: LREC-COLING 2024`\footnote{Published as blog post at \url{}}`{=latex}


At the end of May 2024, I participated in the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING) in Turin. Both COLING and LREC enrich the landscape of competitive conferences to publish in natural language processing and computational linguistics. While ACL, EMNLP, NAACL and EACL tend to focus on accepting high-impact papers, also by keeping the acceptance rate low (~25%), both COLING and LREC are traditionally more inclusive. COLING and LREC recently had acceptance rates around 28% and 65%, respectively. While COLING’s rate has been a bit higher in the past, these numbers are generally pretty typical for these venues.

Plenary Session

LREC and COLING took place together this year. The general chairs explained that this is a one-time event to reschedule LREC to even years and COLING to odd years; so far, both took place in even years. Joining these two conferences was interesting for authors who submitted, because it was not really clear what to expect. The organizers also seem to have been surprised by the number of submitted papers.

Overall, there were 3,471 submissions and 1,554 acceptances. Of those, 275 were presented as talks, 837 as posters, and 442 remotely. The acceptance rate was therefore roughly 45%. It’s difficult to say, but it might be that tracks that were more LREC-like had a higher acceptance rate and more COLING-style tracks had a lower one. I suspect this because the track with the most acceptances is “Corpora and Annotation”. LREC’s idea has always been to optimize for high recall here, given that resources may have an impact for low-resource languages without showing a high impact overall across the community. However, I’d like to note that LREC only started to review full papers in 2020! Until 2018, extended abstracts were submitted and reviewed, and authors of accepted abstracts were invited to submit a full paper, which did not get reviewed again. I am quite happy that this has been changed. The overall quality of papers, posters, and presentations was comparable to other conferences, but up to 2018, I saw a couple of presentations at LREC where a review of the full paper might have improved the quality of the work.
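As a quick sanity check of the numbers above, here is a minimal Python sketch. It only uses the figures reported in the opening session; the variable names are my own choice for illustration.

```python
# Figures as reported in the opening session (taken from the paragraph above).
submissions = 3471
presentation_modes = {"talk": 275, "poster": 837, "remote": 442}

# The three presentation modes should add up to the number of accepted papers.
accepted = sum(presentation_modes.values())
acceptance_rate = accepted / submissions

print(f"accepted papers: {accepted}")
print(f"acceptance rate: {acceptance_rate:.1%}")
```

The three presentation modes sum to the 1,554 accepted papers, so the reported counts are consistent with each other.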

In the opening session, the program chairs also shared information on the countries from which most papers came (ranked list: China, USA, Germany, France, Japan, UK, Rep. of Korea, Spain, Italy, India). While China has had more and more papers in the NLP venues for a couple of years now, I was a bit surprised to see quite many papers from Korea, which I think I had not seen before. Maybe the reason is that COLING 2022 was in Korea and made the conference more popular in this part of the world. It was also quite interesting to see many papers that worked on Korean. There were also some differences between countries in the acceptance rates, but I am not sure if these are just artefacts, so I don’t want to republish the highest acceptance rates (because the numbers of submissions were low in these places). The overall number of submissions is also roughly mirrored in the numbers of participants: China (472), USA (313), Germany (286), Italy (237), France (221), UK (143), Japan (141), Korea (91). I was also very happy to see that there were 89 scholars from Ukraine.

Overall, the conference felt very much like COLING and LREC together - one could clearly see the origin of this joint conference, and I liked this a lot.

Poster Session

As usual in our field, most papers were presented as posters, and LREC-COLING was no exception in that there is no difference in quality between orally presented papers and posters. Posters are often much more interactive than presentations, and it’s great to have discussions. I still like to go to presentations, particularly for topics where I am not an expert. For me, oral presentations are better for learning something new that I don’t know a lot about. I don’t feel comfortable asking a poster presenter for a very basic introduction while they want to talk about their most recent research.

Tutorials

For the first time in my life, I’ve been asked to be a tutorial co-chair. I have acted as a senior area chair a couple of times, but that’s a more guided process. I was very happy to do this together with Naoaki Okazaki, who already had experience as a tutorial chair. Without him, I would not have been able to do this job, and I learned a lot from him.

Thanks to his experience, nearly everything went very well with the tutorials, as far as I can tell. We selected a good set of tutorials that attracted people from various areas. We received 20 submissions, from which we selected 13 to be taught at the conference. Of those, three were introductory (one to an adjacent topic), and the majority presented cutting-edge topics. Unsurprisingly, a popular topic is large language models, which are covered by multiple tutorials with varying perspectives on multimodality, evaluation, knowledge editing and control, hallucination, and bias. Other tutorials cover argument mining, semantic web, dialogue systems, semantic parsing, inclusion in NLP systems, and applications in chemistry. You can find the tutorial summaries at @lrec-2024-2024-joint. I attended two tutorials, one on knowledge editing and one on recognizing and mitigating hallucinations.

Only one thing did not go well: For one tutorial, the presenters did not come on site but presented entirely virtually, something that we did not intend. We believe that we communicated that, for each tutorial, at least one presenter needed to be on site. It’s currently not clear to me what the reason for this presumed misunderstanding is, but for future tutorials, I would suggest having people tick a box already at submission time to confirm that they will come to the conference if the proposal is accepted. Also, it might be a good idea to check with the local organizers early enough whether the presenters have actually registered for the conference.

If you participated in the tutorials as a teacher or participant, let me know if you have any feedback. LREC-COLING will compile a summary document to be handed over to the next organizing team, and I will make sure to pass along any constructive feedback.

Own Contributions from Bamberg and Stuttgart

Stuttgart was very well represented in Turin, as usual, but as this was the first time for me to be at a conference with my Bamberg affiliation, I will focus on mentioning the contributions that came from Bamberg.

We had two papers in which my group was involved:

  • @velutharambath-etal-2024-factual-statements presents our Defabel corpus, in which we asked people to argue for a given statement (“Convince me that camels store water in the hump.”). Depending on their own belief, we labeled the argument as deceptive or not. By doing so, we have a corpus in which deceptive arguments and “honest” arguments were created for the same statements. Our intent was to disentangle fact-checking and deception detection.
  • @wemmer-etal-2024-emoprogress-cumulated describes the corpus creation of customer agent and dream corpora annotated with cumulative emotion labels. Most emotion corpora are annotated either for the whole text, for isolated sentences, or for sentences in context. We compiled a corpus with annotations in which the raters only had access to the prior context, which is the realistic setting in which we also read text or talk to other people – we cannot look into the future!

I was happy to also see another contribution from Bamberg, namely from the group of Andreas Henrich:

  • @fruth-etal-2024-approach discuss, in their paper published at the “Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context”, a reinforcement-learning-based approach to German text simplification. Noteworthy is that they also tackle hallucination to some degree, namely by checking if any named entities are included that have not been in the original non-simplified text.
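To illustrate the idea of such a named-entity check, here is a minimal Python sketch. It is not the authors’ implementation: I assume that the entity mentions have already been extracted by some NER tool and are given as plain lists of strings, and the function name `unsupported_entities` is hypothetical.

```python
def unsupported_entities(source_entities, simplified_entities):
    """Return entities of the simplified text that never occur in the source.

    Both arguments are lists of entity strings (assumed to come from an
    NER tool). The comparison ignores case.
    """
    source = {entity.casefold() for entity in source_entities}
    return sorted(
        entity
        for entity in set(simplified_entities)
        if entity.casefold() not in source
    )


# Hypothetical example: "Stuttgart" appears only in the simplified text.
source = ["Bamberg", "Andreas Henrich"]
simplified = ["bamberg", "Stuttgart"]
print(unsupported_entities(source, simplified))
```

A non-empty result would flag the simplified text as potentially containing hallucinated entities. How such a signal enters the reinforcement learning setup is described in the paper, not in this sketch.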

My Favorite Contributions

I found quite a number of talks and papers very interesting. This only reflects my personal opinion, and that I do not mention a particular paper probably only means that I did not have the time to see its presentation. There are many interesting papers in the proceedings, and I did not go through all of them yet.

Invited Talks

Before I mention my favorite papers, I’d like to say something about the invited talks. There were three of them with quite different foci. I’ll mention the two here that I found most interesting.

Roger Levy talked about mistakes that humans and language models make, in the same way or in different ways. He also gave possible explanations for both humans and models. His talk was full of interesting text completion examples, for instance “The children went outside to…” or “The squirrel stored some nuts in the…” – where in the latter case apparently many people answer “tree”.

Michele Loporcaro talked about differences between the dialects in Italy. I found this inspiring, not only because there was barely anything in the talk that I knew before (not a linguist…), but also because it gave me an interesting example of linguistic research to which I am not exposed often enough yet.


I found the following papers particularly interesting. I selected these papers based on my own interests. Given that you read this here on my blog there is some chance that you share some research interests with me and hopefully find my selection useful. Still, I want to point out that it’s absolutely not a negative opinion statement if I did not include a paper here, despite it being related to my interests. I probably just missed it.

Biomedical and Health NLP:

  • @raithel-etal-2024-dataset-pharmacovigilance create a pharmacovigilance corpus across multiple languages. It is annotated with drug names, changes in medication, and side effects, as well as causal relations. Interestingly, the baseline experiments also include cross-lingual experiments (training on multiple languages and testing in a zero-shot setting on another one). The performance scores are similar to monolingual experiments, sometimes even higher. The paper might have some overlap with our BEAR corpus, but we had a different focus, namely the goal to develop an entity and argumentative claim resource (@wuhrl-klinger-2022-recovering,@wuehrl-etal-2024-makes).
  • @giannouris-etal-2024-plain describe, in the DeTermIt! workshop, an approach to automatically summarize clinical trial reports in plain language. I’ve been interested in biomedical text summarization for laypeople for a while, and our FIBISS project has also been motivated with such challenges in mind. They contribute an interesting and valuable resource on the topic.

Ensembles/Aggregations of annotators:

  • @basile-etal-2024-pyrater-python is a bit of a demo paper. They present a Bayesian approach to annotator aggregation with an integration of the Stan language to specify directed probabilistic models. First of all, I was quite happy to see some probabilistic graphical model work at the conference, and secondly, this really looks like a useful approach. We’ll definitely have a look!
  • Flor Miriam Plaza del Arco presented work in the NLPerspectives workshop in which she and her colleagues showed that the annotation-aggregation method MACE can be used to build ensembles of language models which are better than simpler aggregation methods, like majority vote (@plaza-del-arco-etal-2024-wisdom). That’s interesting because LLMs are not really as diverse as humans are, but the method still works for aggregation. Instruction-tuned LMs show specialization in different tasks. We talked about potential future work in which the components of the ensemble could be conditioned on personas to explicitly make the ensemble more diverse, as humans are in annotation tasks. I am curious to see how this goes!
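For readers who have not worked with annotation aggregation, here is a minimal Python sketch of the simple majority-vote baseline mentioned above. It is only the baseline, not MACE, and the labels and item names are made up for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical predictions of three models (or annotators) per text.
predictions = {
    "text_1": ["joy", "joy", "anger"],
    "text_2": ["fear", "sadness", "fear"],
}

aggregated = {item: majority_vote(votes) for item, votes in predictions.items()}
print(aggregated)
```

Majority vote treats every voter as equally reliable. Methods like MACE instead estimate how trustworthy each annotator is, which is where the gains reported in the paper presumably come from.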


Corpus collection and Analysis:

  • @jin-etal-2024-bragging-online report on their study on bragging in social media. They find that rich males brag more about their leisure time while low-income people focus more on the self. Very interesting analysis. I am wondering if these results could impact general social media happiness analysis.
  • @dick-etal-2024-gil-galad describe their collected corpus of various ways to formulate in a gender-inclusive manner in German. They include comparably well-known cases like “Arbeiter:innen”, but also nominalized participles (Lehrende) and abstract nouns (Lehrkraefte). They collected these latter cases pretty much manually, if I understood their work correctly. I think that their corpus would be an interesting resource to build an automatic system that can find unknown and rare cases of such inclusive language. I sometimes feel a bit challenged to always formulate gender-inclusively, and I’d like to learn from other people how they do that in rarer, less established cases than “Studierende”.
  • @fiorentini-etal-2024-towards-whap create a corpus of Italian WhatsApp messages. An interesting approach: The authors collected their own WhatsApp messages, including voice messages, and asked the interlocutors for consent. The resource seems not to be available yet, but I am super curious. I remember that there was a paper on trust in social media platforms a while ago, and this resource might be an interesting opportunity to study such effects computationally (@pmid33267485).
  • @troiano-vossen-2024-clause-atlas created the CLAUSE-ATLAS corpus. They aim at a full (clause-level) annotation of events and emotions in some books. This can of course not be done manually at reasonable cost. They therefore only annotate the beginnings of chapters manually and the rest automatically, and analyze the agreement between human annotators and a large language model. They find that the agreement is comparable.
  • @maladry-etal-2024-human-system build, to the best of my knowledge, the first irony-labeled corpus in which annotators were asked for their confidence that the text is actually ironic. They formulated the labels as a rating scale. Interestingly, automatic systems are better at predicting irony on the instances on which humans were confident. That result is in line with our findings for emotion analysis a couple of years ago (@troiano-etal-2021-emotion), where we also showed that humans can quite well predict the inter-annotator agreement for the instances they annotated.

Arguments and News:

  • @feger-dietze-2024-taco-twitter build an argument corpus in which the discussion is kept as a tree. They label Arguments as Statements and Reasons and Non-arguments as Notification or None. I think it might be nice to also see persuasiveness labels for the arguments in comparison to each other in each tree.
  • @song-wang-2024-like-make build an automatic system to persuade people, here in a specific context, namely to make donations. Their system is a chatbot that can automatically recognize which persuasion strategy might be most promising. They consider “credibility appeal”, “foot-in-the-door”, and “emotional appeal”. Again, that’s super relevant for our EMCONA project and we’ll consider using this for our work.
  • @pu-etal-2024-scinews-scholarly built a system to automatically generate news reports out of scientific texts. Their idea is similar to our analysis in our recent work in @wuehrl-etal-2024-makes. It’s impressive that they were able to automate such a complex task! I’d be curious to understand if their automatic system makes the same changes to the text and scientific claims that we found (making the articles more sensational, or simplifying correlation reports to causations).
  • @nejadgholi-etal-2024-challenging-negative create counter-stereotype statements. I put this in this category of argument mining, because I’d consider counter-stereotype statements as attempts to convince the dialogue partner to change a stance. The work is nicely grounded in categories of stereotypes, like counter-facts and broadening universals. Here, too, I am wondering if a convincingness study would make sense.
  • @kalashnikova-etal-2024-linguistic-nudges describe a wizard-of-oz study in which they nudge people carefully to change their opinion or emotion. They compare smart assistants, robots, and humans and … human nudges are most successful.

LLM-Specific things:

  • @rao-etal-2024-tricking-llms develop a hierarchy of jailbreak attempts on LLMs. I have not looked much into possibilities to trick LLMs into doing things they are not supposed to do (like leaking training data), and the authors provide a set of possible approaches. It is interesting to see these weaknesses of existing models.
  • @addlesee-etal-2024-clarifying-completions describe a study that showcases possible differences between how LLMs answer requests and how humans would do that. In particular, they put incomplete questions into an LLM and check if the behaviour is human-like. My favorite example from the poster was: “What is the zip code of…” and the LLM answers “of Nevada?”.
  • @pucci-ranaldi-2024-language-matter report on an experiment on the importance of the order of instructions with varying difficulty. Instead of just using answer length or similar proxies to assess the difficulty, they rely on the concept of Bloom’s taxonomy (remembering, understanding, applying, creating/evaluating/analyzing) and show that fine-tuning an LLM in the order of these categories, with increasing difficulty level, leads to better results. This paper is a beautiful example of importing knowledge from psychology and the humanities into machine learning.

Emotion analysis:

  • @li-etal-2024-emstremo-adapting describe a chat system that can help people to regulate emotions. This is the first work I am aware of that builds on top of emotion regulation theories. Their system learns that guilt can be tackled with curiosity, and fear with admiration. Very impressive example of how an analysis of a data-driven system confirms knowledge that we have from other fields, confirming the learning approach.
  • @bhaumik-etal-2024-social-convos describe a corpus and modeling effort of detecting agendas on social media. What do people intend with a particular post? I find this related to our research interest in the EMCONA project, in which we want to understand how people use emotions to persuade people. However, their work is more general and focuses on agendas that are less explicitly formulated in the annotated task.
  • @christop-2024-nemo-dataset described her effort on building a speech corpus in Polish, labeled with emotions. Her intent is to use it later for text-to-speech systems. The data has been created by asking actors to show specific given emotional states.
  • @plaza-del-arco-etal-2024-emotion-analysis nicely complements my recent paper on the current state of research on event-centric emotion analysis (@klinger-2023-event). The coverage by Flor and her colleagues is much broader than mine, and they particularly point out that the subjective nature of emotions needs to be considered more. Further, there is quite a large set of emotion models in psychology that has not been considered yet.
  • @prochnow-etal-2024-idem-idioms describe an (automatically generated) data set of idioms with emotion labels.
  • @cortal-2024-sequence-sequence reports on an emotion labeled corpus of dreams. Interestingly, emotions in this dreambank corpus are mostly expressed quite explicitly. The corpus also contains some semantic role annotations, making it one of the few corpora with structured emotion annotations. We also worked on this for a while, with the REMAN and GoodNewsEveryone corpora (@kim-klinger-2018-feels,@bostan-etal-2020-goodnewseveryone), amongst others. It might be interesting to see how literature and news annotations compare to those in dreams, and if emotion role labeling systems could be transferred between these very different domains.

Other things:


The Best papers of LREC-COLING are:

Venue and Place

The conference took place in Turin, a nice city which is not too touristic. It has an acceptable cycling infrastructure which I used to go from downtown to the conference center every day. The car drivers seem not to be used to bicycles yet and did not check at all if there was a bike when they turned into an intersection, but the infrastructure prevented bad incidents. Definitely not a perfect infrastructure, but much better than in Stuttgart, so I enjoyed cycling in Turin a lot.

The conference center was the old Fiat factory Lingotto which now has, next to the conference center, also a mall and a car museum. I am not a car fan, but the test track on the roof was pretty impressive.

Lingotto Roof Test Track


The conference center itself was pretty nice (and huge!). The poster sessions were in a separate hall with a lot of space. While the venue was not as charming as in Iceland (LREC 2014), Marrakesh (LREC 2008), or Santa Fe (COLING 2018), I enjoyed the venue being close to the city.

Altogether, I was very happy with the whole conference, and I am looking forward to the next COLING in 2025 and the next LREC in 2026.

Organization Team

