
Posts by Thomas Pellard


Inter-university doctoral study day on the languages of East Asia, organised by doctoral students of the CRLAO.

📆Monday 11 May 2026, 8:45-17:40
📍Maison de la recherche de l’Inalco, room L0.01, 2 rue de Lille, 75007 Paris
🔈Hybrid format

➤ Programme:
crlao.cnrs.fr/event/journe...

18 hours ago 2 1 0 0
Andrea Migliano presenting information about next year’s EHBEA conference in Zurich


Andrea introducing the organising committee, including herself, Adrian Jaeggi, Jorg Gross and Charles Efferson


Next year in Zurich 😊 🏔️🇨🇭 #ehbea2027 will be held 29 March to 2 April. Website and Bluesky account are already set up (love the logo ❤️) @ehbea2027.bsky.social

www.ehbea2027.com

3 days ago 26 8 0 1
At the CNRS, an unprecedented budget cut further weakens research: "We are heading for the wall" The chairman and CEO of the CNRS, Antoine Petit, announced on 24 March an additional cut of 20 million euros to the institution's 2026 budget. Researchers and unit directors are alarmed.

3 days ago 58 50 11 8
Cover of the book "Word-Prosodic Systems of Japonic Languages"

💡How diverse can a single language family’s prosody be?

Find out in the latest volume of "Endangered and Lesser-Studied Languages and Dialects".

🔗https://brill.com/display/title/74111

1 week ago 5 4 1 0

🇹🇷 On trial at the #BULAC until 8 May: Brill's Encyclopedia of Turkic Languages and Linguistics Online!

📚 From Old Turkic to the contemporary Turkic languages, discover a reference resource on the Turkic-speaking world.

✍️ Test it, explore it and give us your feedback!

👉 www.bulac.fr/node/3563

1 week ago 1 3 0 0

📣 Attend the study day "Manchu, language of the Qing emperors (1644-1911): new research in history, linguistics and digital humanities"

Thursday 16 April 2026, from 9 am.
At the #EFEO and online.

Find out more: https://swll.to/j2PSBmt

1 week ago 6 6 0 0
Gilt bronze statue of Acala, a Tibetan protector god


New book chapter ... "The Anatomy of a Tradition". Compares notions of tradition, innovation, lineage, fashion in two different fields: Tibetan Buddhism and rural weaving in southeast Asia, discusses why they are similar, and why this is so ... www.researchgate.net/publication/...

1 week ago 2 2 1 0

And that was #EvolangXVI! Thanks to all the organisers and helpers, it was an amazing experience as always! See you all in 2028 in Nijmegen, NL for #EvolangXVII, organised by @limorraviv.bsky.social @asliozyurek.bsky.social and @profsimonfisher.bsky.social!

1 week ago 11 2 0 2
numerique.gouv.fr Digital technology in the service of effective public action

Things are really moving in favour of free software in France. Digital sovereignty: the State is speeding up the reduction of its extra-European dependencies

www.numerique.gouv.fr/sinformer/es...

1 week ago 6 3 0 0
WaST: a formalisation of the Wave model with associated statistical inference and applications
We propose a mathematical formalisation of the "wave model" originally developed in historical linguistics but with further applications in human sciences. This model assumes new traits appear in a population and spread to nearby populations depending on their closeness. It is mostly used to describe joint evolution of closely related populations, for example of several dialects. These situations of permanent contact are not accurately represented by its competitors based on tree structures. We built a fully Bayesian generative model where innovations spread along a fixed graph and disappear according to a death process. We then develop a Metropolis-Hastings within Gibbs sampler to sample from the posterior distribution on the graph. We test our method on simulated datasets as well as on several real datasets.

I was told the Wave model in linguistics had not been formalised. Here is an attempt at it. Note that many things remain to be done, check the discussion for that.
The method/model I propose, WaST, can be used on any cognacy dataset. I'll put the code on git soon.
arxiv.org/abs/2604.08220

1 week ago 7 4 1 0

Interesting: Statistical structure and the evolution of languages

royalsocietypublishing.org/rspb/article...

1 week ago 7 1 0 0
A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment

fdat.uni-tuebingen.de/records/ccfp...

1 week ago 1 1 0 0
Title page for working paper: "The Varieties of Cultural Selection"


I've been thinking a lot about the foundations of cultural evolutionary theory. While there's been a lot of work on transmission mechanisms, there has been far less work on cultural *selection*. Here's a new working paper presenting a taxonomy of cultural selection processes.
osf.io/preprints/so...

2 weeks ago 70 24 6 4

📦 My first #RStats package on CRAN:

{readelan}

A package dedicated to reading all files associated with ELAN: eaf, etf, ecv. Reads annotations, metadata, controlled vocabularies. Relevant for many in #linguistics perhaps?

More info here:
borstell.github.io/misc/readelan/

2 weeks ago 57 21 5 0
Evolang 16 t-shirt


This is my first time attending the Evolang conference, but the very 1st day was just wonderful

2 weeks ago 3 0 0 0
A new Rgyalrong-Kiranti cognate: the etymon for "trace" As we showed in Jacques and Pellard (2021), the cognates shared between the Kiranti and Rgyalrongic groups are very few, and all are shared by other branches of the famil...

A new cognate shared exclusively by Rgyalrongic and Kiranti:
panchr.hypotheses.org/3993

2 weeks ago 4 5 0 0
Family Tree Model The family tree remains the main metaphor for describing the evolutionary history and relationships of languages. Languages, like biological species, evolve following processes such as descent with m...

Our chapter on the tree model (with @thomaspellard.bsky.social and @robinryder.bsky.social ) is at last published online: onlinelibrary.wiley.com/doi/10.1002/...

2 weeks ago 18 11 1 0

This has now appeared in the latest issue of the journal (so also offline):

3 weeks ago 12 4 0 0
Cultural Evolution with Rob Boyd YouTube video by Evolutionary Psychology (The Podcast)

in which Rob Boyd lays into a couple of students of Tooby and Cosmides (in the politest possible way). He mentions my work at around 57 mins in ... but forgets my name. Ah well, you take what you can get ... www.youtube.com/watch?v=GoqB...

3 weeks ago 2 1 0 0
Foundations of Formal Etymological Analysis

This study gives a brief overview of formal aspects of etymological analysis by providing a modified workflow for the classical comparative method in historical language comparison. This workflow is contrasted with the current state of the art in computational historical linguistics, pointing out where computational methods and interactive tools for annotation are lacking and where they are already available.

## 1 Introduction

It has been more than 10 years since my dissertation was published, in which I presented a computational method that was supposed to detect cognates in comparative wordlists, based on an automated analysis of potential sound correspondences (List 2014). Now, ten years later, I know much more about the problem of cognate detection in particular and etymological analysis in general, but I still do not feel that the problem can be considered solved. However, despite the slow progress in my own and my colleagues’ attempts to automate the comparative method, I now see much more clearly the general _workflow_ that one should follow in order to carry out a _formal_ etymological analysis. For this reason, I thought it was time to share my views on this workflow, and to give a detailed account of those parts where we still lack full-fledged automated approaches, while at the same time pointing to those aspects that might become relevant in the future. Some of these problems have already been mentioned in an essay published two years ago (List 2024b), in which I discussed several _open problems_ in the field of comparative linguistics and historical language comparison. The current attempt to summarize the major workflows involved in _formal etymological analysis_ can be seen as a follow-up to this work, contrasting the solutions we have with the solutions we need.
Since the workflow for etymological analysis involves several steps that themselves consist of at times complex sub-tasks, I will restrict myself here to presenting the general workflow for etymological analysis. In later studies, I may zoom in to address particular problems in more detail.

## 2 Background

The central workflow of the comparative method has been discussed in many studies. A very influential workflow can be found in the study of Ross and Durie (1996, pages 6f). Inspired by this workflow, I have proposed a five-step approach (see Figure 1), starting from the (1) proof of relationship, followed by the (2) identification of cognates and the (3) identification of sound correspondences, and culminating in the (4) reconstruction of proto-forms and (5) an internal classification of the language family in question (List 2014, 58).

Figure 1: Workflow of the traditional comparative method based on Ross and Durie (1996) taken from List (2014).

It was based on this workflow that I later tried to automate central parts of the comparative method. A first step in this direction was my work on phonetic alignment and cognate detection (List 2014), which can be assigned to step (2). My work on correspondence patterns (List 2019) can be seen as central for step (3). We have used this work to build initial methods for the supervised reconstruction of proto-forms and for the prediction of reflex forms (Bodt and List 2022). While these methods are all fully automated, I have also tried to establish formal tools that allow these stages of the workflow to be conducted manually, but in a way that would account for a formal (computer-readable) treatment of the individual steps when applying them to a standardized dataset. Here, most ideas went into the EDICTOR tool (https://edictor.org, List 2017), which has since been constantly enhanced (List and van Dam 2024, List et al. 2025).
## 3 Workflow

When working on fully automated workflows for different aspects of the comparative method, I soon realized that none of the methods would actually yield the result that I had originally hoped for. Instead of solving a problem once and for all, each solution would point to additional problems. This has not changed even today, even if I think that we have made great progress in retrospect. Right now, however, I would revise the workflow that I had proposed back in 2014 and expect much smaller steps to be carried out, often also in an iterative fashion.

1. initial data preparation -> convert data to standard phonetic transcriptions
2. internal cognate analysis -> segment words into morphemes, carry out internal reconstruction and allomorphic and allophonic analysis, identify language-internal cognates, gloss individual morphemes for their particular meaning
3. external cognate analysis -> identify cognate morphemes, align them, identify potential exceptions
4. external correspondence analysis -> identify correspondence patterns, refine them, mark exceptions
5. linguistic reconstruction -> assign proto-forms to correspondence patterns
6. sound law induction -> determine the sound laws that lead from a proto-form to the individual language varieties
7. phylogenetic reconstruction -> determine major subgroupings of the languages

I have already discussed most of these steps in past studies, but I have so far never found the time to spell out explicitly how I think these different approaches could or should best be combined to form a full-fledged workflow for historical language comparison. We can arrange the seven steps in the manner shown in Figure 2.

Figure 2: Revised workflow for historical language comparison.
This workflow keeps the major division into two phases that I proposed earlier, with one phase being devoted to the identification of cognates and correspondences and one being devoted to the identification of proto-forms and subgroupings. However, both phases are now extended by an additional step. The identification of cognate sets and correspondence patterns is now extended by what one could best call _internal reconstruction_. The phonological and phylogenetic reconstruction stage is now accompanied by a detailed search for sound laws. All three subtasks must be solved, as before, in an iterative fashion during which researchers or algorithms jump back and forth through different representations of the original data.

## 4 Workflow Examples

In the following, I will try to provide more detailed examples of this revised workflow, pointing both to software solutions and to interface and modeling solutions that can be used to address the individual tasks constituting the two major phases of historical language comparison. Before we discuss the phases of cognate and correspondence detection and phonological and phylogenetic reconstruction, however, I will briefly point to the first stage of data preparation, which is crucial for any kind of linguistic analysis, but all too often, unfortunately, treated with much less care, both by researchers and by computational solutions.

### 4.1 Data Preparation

Data preparation constitutes a crucial but also frequently overlooked aspect of formal etymological analysis. Thus, in order to allow for an automated handling of word forms as _sound sequences_, we must insist that sounds are represented in standardized phonetic transcriptions in which each sound is clearly defined. Transcription systems, however, are rarely clearly defined and rely on ad-hoc extensions by scholars who feel that the transcription lacks the sounds they want to transcribe or who simply ignore the rules.
Since rules are never explicitly imposed upon transcriptions and also rarely checked, this has led to a considerable diversity of transcription practices and also implementations, as can be seen from the variation underlying cross-linguistic databases and data collections (Anderson et al. 2018, 2023). With the _Cross-Linguistic Data Formats_ initiative (CLDF, https://cldf.clld.org), we have tried to lay the foundation of standardized datasets in cross-linguistic applications (Forkel et al. 2018). When developing the recommendations that were later adapted in individual CLDF versions, we relied on practical experience resulting from concrete attempts to handle cross-linguistic data both manually (trying to compile new datasets) and automatically (trying to analyze existing datasets). These recommendations, however, were not necessarily developed with CLDF as the endpoint in mind, but rather reflect practical requirements for cross-linguistic data that we observed in practice. Based on these observations, we would now require that a common cross-linguistic dataset, compiled for etymological analysis, has the form of a comparative _wordlist_ provided in _long table_ format. This means that we work with at least one table that stores the data, with each row reflecting an individual _word form_ (Form) in an individual _language variety_ (Language) that represents the translation of an individual _concept_ (Concept). The word form itself can be represented in three different ways:
In the form of the original _value_ (Value) that one may observe in a dictionary (containing additional non-standard information, like brackets, but also mixed entries in which several forms are provided, separated by slashes or decorated with additional ad-hoc annotations), in the form of the actual _form_ (Form) that one wants to use as the basis of an analysis, and in _segmented form_ (Segments), fully transcribed, with individual sounds being segmented by spaces and morpheme boundaries (including word boundaries) inside the form being marked by a plus symbol (+). While the form may contain idiosyncratic transcriptions, the segmented form in CLDF _must_ account for standard transcription systems, including the B(road) IPA defined by the catalog of Cross-Linguistic Transcription Systems (https://clts.clld.org, Anderson et al. 2018). This final step of analysis, the actual _segmentation_ thus not only segments a given form of idiosyncratic transcriptions into individual speech sounds, it also standardizes speech sounds. This step of converting forms as they are usually provided in the literature into standardized segmented representations is usually done in a semi-automated fashion with the help of _Orthography Profiles_ (Moran and Cysouw 2018, see also Forkel and List 2024), but could in theory also be done manually. Depending on the language family in question, the segmentation of word forms into speech sounds is not enough, and additional segmentations of sound sequences into sequences representing individual meaning-bearing units – morphemes – are required. While this step might fall inside the stage of data preparation, practical experience shows that it is often much easier to combine this step with the _internal cognate analysis_ that will be discussed in more detail in the following section. 
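To make the three representations concrete, here is a minimal sketch of a long-table wordlist row and of a greedy, longest-match segmentation in the spirit of an orthography profile. The profile entries, the example rows, and the function name `segment` are illustrative assumptions; this is not the CLDF tooling or the implementation described by Moran and Cysouw (2018).

```python
import unicodedata

# Toy orthography profile (an assumption for illustration only):
# grapheme as found in the source -> standardized segment.
PROFILE = {
    "sch": "ʃ", "ch": "x",
    "a": "a", "e": "ə", "f": "f", "i": "i", "l": "l", "n": "n", "t": "t",
    "-": "+",  # morpheme boundary, written as a plus symbol in Segments
}

def segment(form, profile):
    """Greedy longest-match segmentation into space-separated segments."""
    form = unicodedata.normalize("NFC", form)
    longest = max(len(grapheme) for grapheme in profile)
    segments, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # Unknown grapheme: flag it for manual checking instead of guessing.
            segments.append("<" + form[i] + ">")
            i += 1
    return " ".join(segments)

# One row per word form, language variety, and concept (long-table format).
rows = [
    {"Language": "German", "Concept": "BOTTLE", "Value": "Flasche (f.)", "Form": "flasche"},
    {"Language": "German", "Concept": "BEDSIDE TABLE", "Value": "Nacht-Tisch", "Form": "nacht-tisch"},
]
for row in rows:
    row["Segments"] = segment(row["Form"], PROFILE)
    print(row["Language"], row["Concept"], row["Segments"])
```

Flagging unknown graphemes rather than passing them through silently keeps the semi-automated character of the step: the profile is refined until no flagged segments remain.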
### 4.2 Cognate and Correspondence Detection

Starting with Swadesh’s work on lexicostatistics (Swadesh 1952, 1955), the compilation of cognate-annotated comparative wordlists has become a common practice in historical linguistics. In a comparative wordlist, words from different languages are arranged according to the concepts they express and later compared for cognacy. An alternative to this annotation practice consists in the collection of etymologically related words regardless of the meanings they express. The most common format for such collections is the _etymological dictionary_, which adds hypothetical proto-forms to all previously identified cognate sets.

Both representations have advantages and disadvantages. Etymological dictionaries suffer from the fact that they only pick _positive evidence_ for genetic language relations. Meanings of individual word forms (_reflexes_ of proto-forms in the classical terminology) are usually insufficiently reported and barely standardized. Often, scholars even strip off suffixes and prefixes from the reflex forms, making it very difficult to verify the origins of the entries that made it into the etymological dictionary. Comparative wordlists have the disadvantage of missing out on details. Semantic shift is a common phenomenon, but rarely reported in comparative wordlists. The annotation of cognate sets that goes beyond the concept slots that constitute their organisational basis can be tedious in practice. Furthermore, due to the original selection of a limited _number_ of basic concepts, serving as the main comparanda, many potentially interesting cognate sets are deliberately ignored. While I understand the arguments of both sides, I see much better chances in enhancing the annotation of comparative wordlists in order to make up for their shortcomings than in trying to address the shortcomings of etymological dictionaries.
As a result, the basic format that I consider relevant for etymological analysis is always the comparative wordlist that consists of a list of concepts that are in turn translated into a list of languages.

In order to identify cognate sets and sound correspondences in such wordlists, three distinct analysis steps are important, which are deeply intertwined with each other. The first step consists in the detailed annotation of individual word forms in individual language varieties, with the goal of segmenting the word forms into individual morphemes, thereby identifying allophones and allomorphs and carrying out what is known as an _internal reconstruction_ of the individual language varieties. The second step consists in the identification of cognate morphemes, which have to be aligned with each other after initial identification. This step is pretty well explored, but quite a few problems remain, especially when carrying out the second step without paying attention to the first step of language-internal analysis. The third step consists in the identification of correspondence patterns. Here, one must cluster all identical _alignment sites_ that have been identified in steps 1 and 2 of the cognate and correspondence detection workflow into partitions that show no conflicts with each other. Conflicts are those cases where one language shows two reflexes for the same _correspondence pattern_.

Step 1 is typically ignored, but it is of crucial importance, since internal reconstruction decreases language-internal variation (Fox 1995, pp. 211-213), and can thus be seen as an important prerequisite for any _external_ comparison of word forms. Thus, when comparing word forms in Germanic languages, it would not make much sense to use two sounds `[ç]` and `[x]` to represent the grapheme `<ch>` in German, since both pronunciations can be seen as allophones of the same phoneme.
Retaining their phonetic pronunciations without prior language-internal analysis would only increase the variation when trying to identify sound correspondences across Germanic languages. The same holds for allomorphic variation that can be traced to language-internal processes. Language-internal variation must be reduced before comparing one language with another, in order to reduce the complexity of the external comparison.

Another important aspect of language-internal comparison that is often disregarded consists in the identification of language-internal cognates or _word families_. Since our basic workflow does not start from previously identified cognate sets, but rather from a comparative wordlist, in which word forms are selected due to the meanings they express, we may encounter many cases in which a morpheme _recurs_ across many word forms. For this reason, a proper internal analysis should not only identify allophones, allomorphs, and morpheme boundaries, but also tag those cases where morphemes recur across word forms, making sure that morphemes are only compared once _across_ languages.

The second step of the workflow, the actual identification of cognate morphemes through external language comparison, is pretty well explored, with many attempts at automation and formalization. For me, the essential part of this stage consists in three individual tasks: the actual _partitioning_ of morphemes into a first set of _partial cognate sets_ (see List et al. 2016 for details on this notion), the identification of cognates recurring _across meaning slots_ in the wordlist (in Wu et al. 2020, we have called this _cross-semantic cognate detection_ at times, but I am hesitant to use the term nowadays), and the consecutive _multiple phonetic alignment_ of these morphemes (see List 2014 on multiple phonetic alignment).
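The tagging of recurring morphemes described above can be illustrated with a small sketch. It assumes a wordlist whose `Segments` column marks morpheme boundaries with a plus symbol, and it treats only string-identical morphemes as members of one word family. Real data would additionally need allomorphs to be linked by hand; the rows and the function name `word_families` are hypothetical.

```python
from collections import defaultdict

# Hypothetical morpheme-segmented rows for a single language variety.
WORDLIST = [
    {"Language": "German", "Concept": "HAND", "Segments": "h a n t"},
    {"Language": "German", "Concept": "GLOVE", "Segments": "h a n t + ʃ uː"},
    {"Language": "German", "Concept": "SHOE", "Segments": "ʃ uː"},
]

def word_families(rows):
    """Map (language, morpheme) to the concepts whose forms contain it,
    keeping only morphemes that recur in more than one word form."""
    families = defaultdict(list)
    for row in rows:
        for morpheme in row["Segments"].split(" + "):
            families[(row["Language"], morpheme)].append(row["Concept"])
    return {key: concepts for key, concepts in families.items() if len(concepts) > 1}

for (language, morpheme), concepts in word_families(WORDLIST).items():
    print(language, morpheme, concepts)
```

Each recurring morpheme then needs to enter the external comparison only once, regardless of how many concept slots it appears in.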
In order to avoid stretching the size of this study too much, I will not comment much further on these steps, but emphasize that I consider them as actually solved for those cases where high-quality comparative wordlists with morpheme-segmented word forms resulting from a thorough internal comparison are provided. Tools like EDICTOR (List et al. 2025; List and van Dam 2024) facilitate this step greatly.

Once morphemes have been assigned to individual cognate sets and consecutively aligned morphologically, we arrive at the third step of the workflow, which requires us to partition the _alignment sites_, that is, the _columns_ of all _alignments_ reflecting all cognate sets, into _correspondence patterns_. While individual alignments reflect true correspondences, provided one trusts that the alignments have been carried out properly, these alignments do not reflect _patterns_, given that they are but _one_ instance of a correspondence. In order to count as a _pattern_, a larger number of alignment sites is needed that confirm the recurrence of the correspondences observed. Although they play a crucial role in the traditional literature, correspondence patterns have been ignored for a long time in computational approaches. In 2019, I presented a first workflow that helps to derive correspondence patterns from aligned cognates automatically (List 2019). While this workflow was an eye-opener for myself, helping us to address several additional tasks, such as _supervised phonological reconstruction_ (List et al. 2022) and _reflex prediction_ (Bodt and List 2022), it has not received much attention beyond the work of my own group. The reason may have been that the workflow involves larger amounts of data that are difficult to handle manually, thus making the actual verification of findings in the interaction with classical linguistics rather difficult.
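As a rough illustration of what partitioning alignment sites into conflict-free correspondence patterns means, the following sketch clusters sites with a simple greedy pass. It is only a stand-in under stated assumptions and not the method of List (2019): the language labels, the toy sites, and the extra requirement that two sites agree in at least one language before being merged are choices made for this example, and the result of a greedy pass depends on the order of the sites.

```python
def compatible(pattern, site):
    """No language may show two different sounds; missing data (None) is ignored.
    As an assumption of this sketch, at least one language must be shared."""
    shared = [lang for lang in pattern
              if pattern[lang] is not None and site.get(lang) is not None]
    return bool(shared) and all(pattern[lang] == site[lang] for lang in shared)

def merge(pattern, site):
    """Fill gaps in the pattern with sounds attested in the site."""
    merged = dict(pattern)
    for lang, sound in site.items():
        if merged.get(lang) is None:
            merged[lang] = sound
    return merged

def cluster_sites(sites):
    """Greedily assign each alignment site to the first compatible pattern."""
    patterns = []
    for index, site in enumerate(sites):
        for entry in patterns:
            if compatible(entry["pattern"], site):
                entry["pattern"] = merge(entry["pattern"], site)
                entry["sites"].append(index)
                break
        else:
            patterns.append({"pattern": dict(site), "sites": [index]})
    return patterns

# Toy alignment sites (columns) for three hypothetical languages A, B, C.
SITES = [
    {"A": "p", "B": "f", "C": "p"},
    {"A": "p", "B": "f", "C": None},
    {"A": "p", "B": "p", "C": "p"},
    {"A": None, "B": "f", "C": "p"},
]
for entry in cluster_sites(SITES):
    print(entry["pattern"], entry["sites"])
```

Sites 0, 1, and 3 end up in one pattern, while site 2 conflicts with it in language B and opens a second pattern that, with a single instance, does not yet count as recurrent.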
With EDICTOR 3, new functionalities have been made available that help scholars to carry out the annotation of correspondence patterns in a computer-readable but manual manner (List and van Dam 2024). Unfortunately, we lack clear-cut examples in which actual data have been analyzed from scratch. In practice, it has also turned out to be very difficult to account for the different layers of annotation that a full-fledged etymological analysis requires. While for me it is pretty clear that a very detailed annotation is needed that accounts for the three major aspects and multiple smaller problems involved in all individual steps, we currently lack full-fledged examples where this annotation has been applied to a larger dataset. I hope to be able to provide such examples in the future. Without such examples, I fear, it will be very difficult to enhance both the interactive annotation tools and the algorithms that could help to automate particular tasks.

### 4.3 Phonological and Phylogenetic Reconstruction

The third and, in the present workflow, final phase consists again of three steps that are deeply intertwined with each other. While it may be clear that the first step of linguistic (or phonological) reconstruction and the second step of sound law induction should go hand in hand, I see a similar need to integrate the final step of phylogenetic reconstruction with the previous steps, given that phylogenies play a crucial role in informing the reconstruction of proto-sounds and the induction of sound laws.

The first step, linguistic (or more precisely phonological) reconstruction, can be seen as an additional _clustering_ or _partitioning_ operation applied this time to the _correspondence patterns_. Correspondence patterns show, in the definition of the term that I have used in my own work, only _one_ reflex sound for each language. This means that conflicts in inferred correspondence patterns must be flagged as _exceptions_.
Exceptions, however, could in theory be resolved at a later stage, especially when they show a high degree of systematicity, as we know from the exceptions in Grimm’s studies (Grimm 1822) that were later resolved, among others, by Verner (1877). While exceptions in correspondence patterns can be considered a specific case, we can also easily think of correspondence patterns that differ strongly enough that we would not merge them (which would mean accepting multiple reflex sounds for the same language variety). This does not, however, mean that these patterns must reflect _different proto-sounds_. We know very well that conditioning context can easily trigger different reflex sounds in sound change. As a result, two or more distinct correspondence patterns can often be expected to correspond to the same sound in the proto-language.

Carrying out a phonological reconstruction analysis thus has two consequences. First, we assign each correspondence pattern in our data to a certain proto-sound which we assume was present in the proto-language. Second, by assigning identical proto-sounds to different correspondence patterns, we set the stage for the identification of conditioning context that explains the differences in the reflexes of individual language varieties. Formally, we can thus say that we cluster or partition correspondence patterns into proto-sounds.

Once this step has been carried out, we are left with distinct proto-forms for each aligned cognate set in our data. We must now identify the conditioning contexts that explain the split of a proto-sound into two or more reflex sounds in the same language variety. I call this step _sound law induction_ (see also List 2024a), since sound laws provide a concrete explanation for the change of a sound based on conditioning context.
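The idea of partitioning correspondence patterns into proto-sounds, and of using that partition to locate the splits that sound law induction has to explain, can be sketched as follows. The patterns, the proto-sound labels, and the function name `find_splits` are hypothetical, and the sketch only lists where a conditioning context must be sought; it does not induce the context itself.

```python
from collections import defaultdict

# Hypothetical correspondence patterns, each already assigned to a proto-sound
# by the analyst. Two patterns share the proto-sound *p.
PATTERNS = [
    {"proto": "*p", "reflexes": {"A": "p", "B": "f", "C": "p"}},
    {"proto": "*p", "reflexes": {"A": "p", "B": "b", "C": "p"}},
    {"proto": "*t", "reflexes": {"A": "t", "B": "t", "C": "t"}},
]

def find_splits(patterns):
    """Return (proto-sound, language) pairs with more than one reflex sound.
    Each such pair is a split that a conditioning context would need to explain."""
    reflexes = defaultdict(set)
    for pattern in patterns:
        for language, sound in pattern["reflexes"].items():
            if sound is not None:
                reflexes[(pattern["proto"], language)].add(sound)
    return {key: sorted(sounds) for key, sounds in reflexes.items() if len(sounds) > 1}

print(find_splits(PATTERNS))
```

In this toy case only language B shows two reflexes for `*p`, so it is the only variety for which a conditioning context has to be identified.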
The basic task of the second step in phonological and phylogenetic reconstruction thus consists in the identification of those contexts (or more broadly speaking those _sound laws_) that explain why the same proto-form splits into different reflexes. Since splits and mergers (mergers are less relevant for us in this context, since they have no consequences for the reconstruction practice) can occur at any stage during language evolution, it would be wrong to assign their occurrence only to the individual development of a language variety after it has split off from its subgroup. Instead, we must assume that splits and mergers can happen throughout the whole evolution of a language family, on each internal node that a phylogenetic reconstruction might propose. As a result, it is crucial to substantiate any findings regarding the proto-phonology and the sound laws that constitute a language family by providing a detailed phylogenetic reconstruction that explains how the proto-language split into branches and individual language varieties over time.

While I have initial ideas of how phonological reconstruction can be carried out formally, based on previously identified correspondence patterns (see List et al. 2022), as well as an initial idea of modeling sound laws computationally (List 2024a), I have neither concrete ideas for the _induction_ of sound laws from comparative data, nor for the consistent integration of phylogenetic reconstruction with the other two steps of the workflow. I am, however, convinced that a successfully implemented model of etymological analysis needs to account for all three steps in this stage of the workflow. Future research will show how well these aspects can be integrated with each other.

## 5 Outlook

This little study has not provided any answers to currently open questions in historical language comparison.
What I wanted to provide instead was an overview of the detailed workflow of the comparative method that I consider fruitful to pursue in the future. I am sure that this overview is by no means the last word on the matter. On the contrary, I expect that the workflow will be enriched by more details with time. It is also quite possible that different language families will require different treatments of certain aspects. In any case, I have the hope that future studies will not only help to improve the current state of the art with respect to the conceptualization of the workflow, but also with respect to the implementation of interactive annotation tools and automated workflows that help to make human annotation more efficient.

## References

Anderson, Cormac, Tiago Tresoldi, Thiago Costa Chacon, Anne-Maria Fehn, Mary Walworth, Robert Forkel, and Johann-Mattis List. 2018. “A Cross-Linguistic Database of Phonetic Transcription Systems.” _Yearbook of the Poznań Linguistic Meeting_ 4 (1): 21–53. https://doi.org/10.2478/yplm-2018-0002.

Anderson, Cormac, Tiago Tresoldi, Simon J. Greenhill, Robert Forkel, Russell D. Gray, and Johann-Mattis List. 2023. “Variation in Phoneme Inventories: Quantifying the Problem and Improving Comparability.” _Journal of Language Evolution_ 8 (2): 149–68. https://doi.org/10.1093/jole/lzad011.

Bodt, Timotheus Adrianus, and Johann-Mattis List. 2022. “Reflex Prediction. A Case Study of Western Kho-Bwa.” _Diachronica_ 39 (1): 1–38. https://doi.org/10.1075/dia.20009.bod.

Forkel, Robert, and Johann-Mattis List. 2024. “A New Python Library for the Manipulation and Annotation of Linguistic Sequences.” _Computer-Assisted Language Comparison in Practice_ 7 (1): 17–23. https://doi.org/10.15475/calcip.2024.1.3.

Forkel, Robert, Johann-Mattis List, Simon J. Greenhill, Christoph Rzymski, Sebastian Bank, Michael Cysouw, Harald Hammarström, Martin Haspelmath, Gereon A. Kaiping, and Russell D. Gray. 2018. “Cross-Linguistic Data Formats, Advancing Data Sharing and Re-Use in Comparative Linguistics.” _Scientific Data_ 5 (180205): 1–10. https://doi.org/10.1038/sdata.2018.205.

Grimm, Jacob. 1822. _Deutsche Grammatik_. 2nd ed. Vol. 1. Göttingen: Dieterichsche Buchhandlung.

List, Johann-Mattis. 2014. _Sequence Comparison in Historical Linguistics_. Düsseldorf: Düsseldorf University Press. https://doi.org/10.1515/9783110720082.

List, Johann-Mattis. 2017. “A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets.” In _Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations_, 9–12. Valencia: Association for Computational Linguistics. https://edictor.org.

List, Johann-Mattis. 2019. “Automatic Inference of Sound Correspondence Patterns Across Multiple Languages.” _Computational Linguistics_ 45 (1): 137–61. https://doi.org/10.1162/coli_a_00344.

List, Johann-Mattis. 2024a. “Modeling Sound Change with Ordered Layers of Simultaneous Sound Laws.” [Preprint, not peer-reviewed]. _Humanities Commons_. https://doi.org/10.17613/4n5z-9y52.

List, Johann-Mattis. 2024b. “Open Problems in Computational Historical Linguistics [Version 2; Peer Review: 5 Approved].” _Open Research Europe_ 3 (201): 1–27. https://doi.org/10.12688/openreseurope.16804.2.

List, Johann-Mattis, and Kellen Parker van Dam. 2024. “Computer-Assisted Language Comparison with EDICTOR 3 [Invited Paper].” In _Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change_, edited by Nina Tahmasebi, Syrielle Montariol, Andrey Kutuzov, David Alfter, Francesco Periti, Pierluigi Cassotti, and Netta Huebscher, 1–11. Bangkok, Thailand: Association for Computational Linguistics. https://aclanthology.org/2024.lchange-1.1.

List, Johann-Mattis, Nathan W. Hill, and Robert Forkel. 2022. “A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns.” In _Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change_, 89–96. Dublin: Association for Computational Linguistics. https://aclanthology.org/2022.lchange-1.9.

List, Johann-Mattis, Philippe Lopez, and Eric Bapteste. 2016. “Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual Wordlists.” In _Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers)_, 599–605. Berlin: Association of Computational Linguistics. https://anthology.aclweb.org/P16-2097.

List, Johann-Mattis, Kellen Parker van Dam, and Frederic Blum. 2025. _EDICTOR 3. An Interactive Tool for Computer-Assisted Language Comparison [Software Tool, Version 3.1]_. Passau: MCL Chair at the University of Passau. https://edictor.org.

Moran, Steven, and Michael Cysouw. 2018. _The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles_. Berlin: Language Science Press. https://langsci-press.org/catalog/book/176.

Ross, Malcolm D., and Mark Durie. 1996. “Introduction.” In _The Comparative Method Reviewed: Regularity and Irregularity in Language Change_, edited by Mark Durie, 3–38. New York: Oxford University Press.

Swadesh, Morris. 1952. “Lexico-Statistic Dating of Prehistoric Ethnic Contacts: With Special Reference to North American Indians and Eskimos.” _Proceedings of the American Philosophical Society_ 96 (4): 452–63.

Swadesh, Morris. 1955. “Towards Greater Accuracy in Lexicostatistic Dating.” _International Journal of American Linguistics_ 21 (2): 121–37. https://www.jstor.org/stable/1263939.

Verner, Karl A. 1877. “Eine Ausnahme Der Ersten Lautverschiebung.” _Zeitschrift Für Vergleichende Sprachforschung Auf Dem Gebiete Der Indogermanischen Sprachen_ 23 (2): 97–130.

Wu, Mei-Shin, Nathanael E. Schweikhard, Timotheus A. Bodt, Nathan W. Hill, and Johann-Mattis List. 2020.
“Computer-Assisted Language Comparison. State of the Art.” _Journal of Open Humanities Data_ 6 (2): 1–14. https://doi.org/10.5334/johd.12. **Cite this article as:** List, Johann-Mattis (2026): “Foundations of Formal Etymological Analysis” in _Computer-Assisted Language Comparison in Practice_ , 9.1: 19-30 [first published on 25/03/2026], URL: https://calc.hypotheses.org/9208, DOI: 10.15475/calcip.2026.1.3. **Download the article as PDF:** calcip-09-1-3.pdf **Copyright information** : This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. **Funding Information** : This project has received funding from the European Research Council (ERC) under the European Union’s Horizon Europe research and innovation programme (Grant agreement No. 101044282). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. * * * The text only may be used under licence Creative Commons Attribution 4.0 International. All other elements (illustrations, imported files) are “All rights reserved”, unless otherwise stated. * * * OpenEdition suggests that you cite this post as follows: Johann-Mattis List (March 25, 2026). Foundations of Formal Etymological Analysis. _Computer-Assisted Language Comparison in Practice_. Retrieved March 26, 2026 from https://doi.org/10.58079/15xvf * * * * * * * *
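The article's point that splits can happen on any internal node of a phylogeny, and not only in individual varieties, can be illustrated with a small sketch. Everything in it is a hypothetical toy and not taken from the article or from any of the cited tools: the tree (`Proto`, `BranchX`, varieties `A`, `B`, `C`), the two sound laws, and the proto-forms are invented for illustration. Each law is attached to the node where it is assumed to have occurred, and a reflex is derived by applying the laws along the path from the root to a variety.

```python
import re

# Hypothetical toy phylogeny, stored as child -> parent.
# "Proto" is the root and therefore has no entry.
PARENT = {"BranchX": "Proto", "A": "BranchX", "B": "BranchX", "C": "Proto"}

# Hypothetical sound laws as (regex pattern, replacement) pairs,
# attached to the node of the tree where they are assumed to apply.
LAWS = {
    "BranchX": [(r"k(?=[ie])", "tʃ")],  # *k > tʃ before front vowels, shared by A and B
    "B": [(r"tʃ", "ʃ")],                # tʃ > ʃ, an individual development of B
}


def path_from_root(node):
    """Return the nodes below the root on the way down to `node`, in order."""
    path = []
    while node in PARENT:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))


def reflex(proto_form, variety):
    """Derive a reflex by applying the laws of every node on the path in order."""
    form = proto_form
    for node in path_from_root(variety):
        for pattern, replacement in LAWS.get(node, []):
            form = re.sub(pattern, replacement, form)
    return form


if __name__ == "__main__":
    for proto_form in ["kita", "kola"]:
        print(proto_form, {v: reflex(proto_form, v) for v in ["A", "B", "C"]})
```

In this toy setup the conditioning context (a following front vowel) explains why `k` splits in `A` and `B` but not in `C`. The law on `BranchX` is shared by both daughters, while the law on `B` applies to that variety alone. Inducing such laws from comparative data, which the article names as an open problem, is not attempted here.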

The March contribution to our CALCiP journal / blog is the possible start of a more detailed series on "Foundations of Formal Etymological Analysis"

https://calc.hypotheses.org/9208

https://doi.org/10.15475/calcip.2026.1.3

3 weeks ago 1 1 0 0
Phylogenetic Comparative Methods

Hi all. I am very excited that after 6 years I finally got my phylogenetic comparative methods book and online exercises online. Feel free to use and share. The book is here: nhcooper123.github.io/pcm-primer/. Note that it is not finished, we had to abandon it before the sunk costs fallacy broke us

3 weeks ago 286 180 9 3
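The major problem currently facing cultural phylogenetics in the Japanese archipelago is the thin description of endangered "dialects" | 白鳥詩織 This text was generated from my head while I was applying for research funding for fiscal year 2026. I am deeply worried about the current state of descriptive linguistics of the Japonic language family. To everyone who is struggling to secure a budget for dialect research: please go ahead and steal this framing of the problem, while keeping in mind institutional effectiveness and reach with readers. For now I am reposting from my own Twitter posts, but in the future, people who do not know the context of mathematical approaches to culture, important people...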

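This text was generated from my head while I was applying for research funding for fiscal year 2026. I am deeply worried about the current state of descriptive linguistics of the Japonic language family. If anyone is struggling to secure a budget for dialect research, please steal this framing of the problem as needed, while keeping in mind institutional effectiveness and reach with readers. At the very least, interdisciplinarity goes up unconditionally.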

note.com/imaginai/n/n...

4 weeks ago 3 1 1 0
Admissibility rankings for the 2026 CNRS recruitment competition

The results of the CNRS competitions for section 36 (Language Sciences) have been published
c3n-cn.fr/2025/12/12/c...

1 month ago 1 2 0 0
Data Organization in Spreadsheets
Karl W. Broman
& Kara H. Woo
Pages 2-10 | Received 01 Jun 2017, Accepted author version posted online: 29 Sep 2017, Published online: 24 Apr 2018

    1. Introduction
    2. Be Consistent
    3. Choose Good Names for Things
    4. Write Dates as YYYY-MM-DD
    5. No Empty Cells
    6. Put Just One Thing in a Cell
    7. Make it a Rectangle
    8. Create a Data Dictionary
    9. No Calculations in the Raw Data Files
    10. Do Not Use Font Color or Highlighting as Data
    11. Make Backups
    12. Use Data Validation to Avoid Errors
    13. Save the Data in Plain Text Files

ABSTRACT

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.
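Several of the principles listed in the abstract can be checked mechanically once the data are saved as plain text. The following is a minimal sketch, assuming the data are available as CSV text. The function name `check_table`, its `date_columns` parameter, and the sample data are hypothetical and do not come from the paper. The sketch covers only three principles: a single rectangle, no empty cells, and dates written as YYYY-MM-DD.

```python
import csv
import io
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_table(text, date_columns=()):
    """Return a list of (row_number, message) problems found in CSV text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return problems
    header, body = rows[0], rows[1:]
    for number, row in enumerate(body, start=2):  # row 1 is the header
        # "Make it a rectangle": every row needs as many cells as the header.
        if len(row) != len(header):
            problems.append((number, "row is not as wide as the header"))
            continue
        for name, cell in zip(header, row):
            if cell.strip() == "":
                # "No empty cells"
                problems.append((number, f"empty cell in column {name!r}"))
            elif name in date_columns:
                # "Write dates as YYYY-MM-DD": check the shape first,
                # then check that it is a real calendar date.
                try:
                    if not ISO_DATE.fullmatch(cell):
                        raise ValueError(cell)
                    date.fromisoformat(cell)
                except ValueError:
                    problems.append((number, f"date not YYYY-MM-DD in column {name!r}"))
    return problems


# Hypothetical sample data with one problem in each of rows 3, 4, and 5.
SAMPLE = """id,date,weight
1,2017-06-01,54.2
2,01/06/2017,51.0
3,2017-06-03,
4,2017-06-04
"""

if __name__ == "__main__":
    for number, message in check_table(SAMPLE, date_columns=("date",)):
        print(f"row {number}: {message}")
```

The check only reports problems and never modifies the file, so the raw data stay untouched.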


Every day is a good day for sharing one of the most useful papers about research data ever written. PLEASE get your people to understand and follow this advice.

www.tandfonline.com/doi/full/10....

1 month ago 1050 402 31 47
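Dialect Research and Practice in the Field: At the Intersection of Theory, Empirical Work, and Outreach | Notion Schedule

Talks and presentations held online on March 19 and 20. "Dialect Research and Practice in the Field: At the Intersection of Theory, Empirical Work, and Outreach". Speakers: 五十嵐陽介, 黒木邦彦, 谷口ジョイ, 松浦年男. Details and registration are available at the link.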
yearman.notion.site/30562e64bc24...

1 month ago 4 2 0 0

My PhD thesis is now available in open access!

1 month ago 14 6 0 0

#OTD 177 years ago, Karl Brugmann (1849–1919) was born 🎂 An expert on Sanskrit and comparative Indo-European linguistics, he was one of the most prominent Neogrammarians, who argued that sound laws operate without exceptions. Brugmann’s Law is named after him.

#LinguisticBirthdays #Histlx

1 month ago 13 7 1 0
Société de Linguistique de Paris - Workshops

Workshop of the Société de Linguistique de Paris:
"Origin and evolution of languages", organized by Guillaume Jacques (CNRS/EPHE) & Thomas Pellard (CNRS) of the CRLAO, Annie Rialland (CNRS) & Michela Russo (Lyon 3)
🗓️ Saturday 28 March 2026, EPHE & online
📑 Programme: www.slp-paris.com/journees.html

1 month ago 2 2 0 0
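Reiwa 7 fiscal year, 2nd joint research meeting "Preservation of Endangered Languages and the Prosody of the Japonic Languages" | 国立国語研究所 (NINJAL) Event information from NINJAL. We hold events for a wide range of audiences, from researchers to the general public and schoolchildren.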

Giving a little talk on the finer points of Tsugaru phonology at NINJAL (国立国語研究所) today:

www.ninjal.ac.jp/events_jp/20...

1 month ago 2 2 0 0

#OTD 135 years ago, Yevgeny D. Polivanov (1891–1938) was born 🎉 An orientalist, translator, and expert in Chinese, Japanese, Uzbek, and Dungan, he helped develop writing systems for previously unwritten languages of the Soviet Union. He was executed by the NKVD in 1938.

#LinguisticBirthdays #Histlx

1 month ago 10 6 0 0