Talking About Muslims in Middle French: The Potential of Word-to-Vector Models for Studying Semantic Relationships in Medieval Languages
by Kimberly Lifton
Medieval vernaculars are notoriously tricky for digital humanists to work with because they lack standardized spelling. Especially when using out-of-the-box libraries and software, most Natural Language Processing (NLP) techniques simply do not work well for medieval languages. However, word-to-vector models have the capacity to handle noise like spelling variants when trained on a significant number of words. As part of my PhD project, which examines the representations of Muslims in textual sources during the rise of the Ottomans in the fifteenth century, I have created custom word-to-vector models using Middle French texts. These models capture the constellations of Muslim representations in Middle French texts at the word level. My methodology considers the exploratory potential of word-to-vector models for shaping research questions in a process that Gabor Mihaly Toth has aptly described as “semantic wanderings.”1
### Introduction
Word-to-vector modeling, or word embedding, is a technique in Natural Language Processing (NLP) in which words are mapped to vectors of real numbers. This allows words to be represented in a continuous vector space that captures their meanings and relationships within a corpus. My process draws on the word-to-vector techniques for historical research developed by Suphan Kirmizialtin and David Joseph Wrisley.2 I have created two word-to-vector models using Middle French sources, one for more traditional narrative manuscript sources and another for inventories. My choice to create separate models for different types of sources allows me to compare how authors and cataloguers used the same terminology for similar or different ends depending on their objectives. The word “sarrasinois,” for instance, might appear in a drastically different context in an inventory than in a travel narrative.
For the model trained on narrative sources, I selected a total of fifty manuscripts from the fifteenth century containing different types of narrative texts that mention Muslims. For a full list, see the table in the Appendix. My selection criteria were largely influenced by what is publicly available to download from archives. I attempted to limit the number of repeated texts to avoid over-representation. However, medieval texts are famously intertextual, which is something I want to capture in my results. Some of the manuscripts include other languages like Latin, Middle English, or Flemish. Since the presence of these languages was relatively minimal, however, they do not appear to have influenced the model. For the inventories model, I used a total of twenty fifteenth-century inventories from France and the Low Countries, all in Middle French but containing Latin phrases. Ultimately, the two sets of training data are relatively small compared to those behind fastText’s pre-trained models for modern languages, but they demonstrate some promising results.
### Workflow
I began the process of creating a custom fastText model by using a layout analysis model and a premade Kraken Handwritten Text Recognition (HTR) model, TRIDIS v2, to extract machine-readable text from digitized manuscripts on the eScriptorium platform. TRIDIS v2 is an HTR model for multilingual Latin script from the eleventh to the sixteenth century trained on semi-diplomatic transcriptions (abbreviations were expanded).3 The training and validation dataset consisted of 1,855 pages with 120,000 lines of text and an additional 420,000 lines from a GAN model imitating medieval handwriting. The model has an accuracy rate of 96.8%, but this of course differs in practice from manuscript to manuscript. For instance, on fol. 6r of Paris, Bibliothèque de l’Arsenal, Ms. 5074 réserve, which I have selected at random from my automated transcriptions, the HTR transcription of the first few lines reads:
fait il Si uous diray en quelle maniere et que uous
et moy auons a faire uray est cõme uous scaues q̃
lempereur a guerre au duc regnault et a nous de
si longue main que zay paour que ia paix ne sen
A manually corrected transcription reads:
fait il Si uous diray en quelle maniere et que uous
et moy auons afaire uray est cõme uous scaues q̃
lempereur a guerre au duc Regnault et a nous de
si longue main que jay paour que ja paix ne sen
In the case of this particular Burgundian Gothic Cursive or bastarda hand, the HTR model confuses the letter j with other letters and does not reliably distinguish case (r vs. R). Please note that, in keeping with common transcription conventions, I do not consider differences between u and v to be errors. While issues with identifying the letter j could impact my results, case sensitivity will not, since capital letters are removed from the training data. The quality of a few HTR transcriptions was also influenced by image quality, as with the transcription for BnF Dupuy 255, which I could only access as a digitized microfilm scan. Here is an example of the HTR transcription from fol. 1r:
de rhodes est assise aulont de lisle en
contre leuant et tremontanc et nest poĩt
des plus grandes ne aussi des mendres.
Et est toute Ronde Reserut que le port de lamer in
A manually corrected transcription reads:
de Rhodes est assise aulont de lisle en
contre leuant et tremontane et nest poĩt
des plus grandes ne aussi des mendres
Et est toute Ronde Reserue que le port de lamer en
While, in this case, the accuracy rate is not a significant inhibitor to my results, the worst transcription of the fifty manuscripts was for BnF Français 1489. The scanned microfilm images seem to have further degraded in quality upon importing them into the eScriptorium interface. Here is an example of the HTR transcription from a randomly selected section of fol. 1r:
Que li rois d'Engletee de say char engendra
du roy amani questoce gouverna
Eun pour lamour de Deu le lor mahonnli a
Et depuis sur paienr si bien il e porta
Compare this to a corrected transcription:
Que li Rois dengleterre de sar char engendra
Et du Roy amauri quescoce gouverna
Qui pour lamour de dieu le loy mahom laissa
Et depuis sur paiens si bien il se porta
Even in the instance of BnF Français 1489, the HTR transcription is still usable. The flexibility of the word-to-vector model also helps with the noise created by minor transcription errors of one or two characters.
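One way to put a number on the differences between an HTR line and its corrected counterpart is a character error rate (CER): the Levenshtein edit distance divided by the length of the reference line. The following is a minimal sketch under stated assumptions, not the evaluation procedure used for TRIDIS v2. The function names and the normalization step (lowercasing and folding v into u, following the conventions described above) are illustrative choices.

```python
def normalize(line: str) -> str:
    """Lowercase and fold v into u so that case and u/v are not counted as errors."""
    return line.lower().replace("v", "u")


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance between the normalized lines, relative to the reference length."""
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not ref:
        raise ValueError("reference line must not be empty")
    return levenshtein(hyp, ref) / len(ref)


# Fourth line of the Arsenal 5074 sample quoted earlier in this post.
htr_line = "si longue main que zay paour que ia paix ne sen"
corrected = "si longue main que jay paour que ja paix ne sen"
cer = character_error_rate(htr_line, corrected)
print(f"CER: {cer:.3f}")
```

On this single line the two mistranscribed instances of j account for the whole distance. A figure computed on a handful of lines says little about a full manuscript, so it is best read as a spot check.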
After using this HTR model in the eScriptorium platform to create machine-readable text files, I cleaned the text files of punctuation (aside from hyphens), capital letters, and line breaks for training data. The resulting model for narrative sources was trained on a total of 2,700,123 words and 12,142,943 characters (excluding punctuation and spaces). I used a Skipgram model for unsupervised training with 100 dimensions and 10 epochs. The number of dimensions depends on the amount of data available for the specific use case. 300 dimensions is a common parameterization for pre-trained fastText models of modern languages. However, there is far less data for Middle French, making 100 dimensions more suitable in this instance (to avoid overfitting on too small a dataset).
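The cleaning and training steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact script used for the project: the function names and the file path are hypothetical, and punctuation is identified through Unicode categories so that combining abbreviation marks (as in q̃) survive the cleaning. The training call uses the `train_unsupervised` function of the `fasttext` Python package with the parameters reported here.

```python
import unicodedata


def clean_text(raw: str) -> str:
    """Remove punctuation except hyphens, lowercase, and flatten line breaks."""
    kept = []
    for char in raw:
        # Unicode categories starting with "P" are punctuation; the hyphen is
        # kept explicitly, and combining marks (category "Mn") are untouched.
        if char == "-" or not unicodedata.category(char).startswith("P"):
            kept.append(char)
    # split() without arguments collapses line breaks and repeated spaces.
    return " ".join("".join(kept).lower().split())


def train_skipgram(corpus_path: str):
    """Train an unsupervised Skipgram model with 100 dimensions and 10 epochs."""
    import fasttext  # third-party package, imported lazily (pip install fasttext)

    return fasttext.train_unsupervised(corpus_path, model="skipgram", dim=100, epoch=10)


sample = "Et est toute Ronde,\nReserue que le port de lamer."
print(clean_text(sample))

# Hypothetical usage once the cleaned corpus has been written to disk:
# model = train_skipgram("narrative_corpus.txt")
```

Keeping the cleaning step separate from training makes it easy to inspect the cleaned text before committing to a training run.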
For the model trained on inventories, I used OCR transcriptions from scanned editions of the texts, which I obtained through Gallica, Archive.org, and HathiTrust. For this model, I also trained an unsupervised Skipgram model with 100 dimensions and 10 epochs, on a total of 2,465,953 words (23,738 unique words). Although the inventory training data was smaller than my narrative source training data, the quality of machine-readable text was higher.
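Figures such as the total and unique word counts reported above can be computed directly from the cleaned text. The sketch below is a hypothetical helper, not the script used for this project. It assumes whitespace tokenization and counts every non-space character, which means retained hyphens are included in the character total.

```python
from collections import Counter


def corpus_stats(cleaned: str) -> dict:
    """Count tokens, distinct tokens, and non-space characters in cleaned text."""
    tokens = cleaned.split()
    vocabulary = Counter(tokens)
    return {
        "words": len(tokens),
        "unique_words": len(vocabulary),
        "characters": sum(len(token) for token in tokens),
    }


stats = corpus_stats("et du roy amauri quescoce gouverna et depuis sur paiens")
print(stats)
```

The ratio of unique words to total words is a rough indicator of how much spelling variation and noise a corpus contains, which is worth checking before comparing two models.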
To evaluate the model, I assessed the quality of its word embeddings through intrinsic evaluation methods, including inspecting its nearest neighbor results. The nearest neighbor results demonstrated the model’s ability to identify spelling variants accurately as well as synonyms and antonyms. For the narrative sources model, here is a list of the top 30 words with the highest cosine similarity to “tartar”:
Words associated with 'tartar':
0.9335034489631653: tartars
0.9177899360656738: tartarie
0.902688205242157: tarta
0.8939720392227173: tartarins
0.741866409778595: tars
0.6958993673324585: tartres
0.6886648535728455: tarse
0.6694226264953613: taitars
0.6531116366386414: tartre
0.6460373997688293: sartazins
0.6442888975143433: tarie
0.6340306401252747: baldach
0.61935955286026: kiaan
0.6164014339447021: sairazins
0.6123291254043579: kan
0.6096377372741699: soldan
0.603230357170105: syrie
0.6014472842216492: sauazins
0.598444938659668: caan
0.5979862809181213: empereour
0.597381591796875: kaan
0.5945258736610413: halappe
0.5937740802764893: mango
0.592740535736084: liaan
0.5844046473503113: seigneurie
0.5833797454833984: gypte
0.581447958946228: mangy
0.5791760683059692: grecs
0.5790480971336365: turquenians
0.5782543420791626: tart
The first nine words are spelling variants of Tartar or Tartary. In many of the instances where “tars” and “tarie” appear as individual words, they have been separated from the first half of the words “tartars” and “tartarie” respectively. An example of this is the line _“iii b uant le souidan de turquie ot assemble sonost de toutes pars et uint et se combati aux tar tars en unlieu qui est nonme cosadathgrant fut 3 la bacaille et asses cnyot de mors dime part et cautre mais.”_ The model’s use of subwords makes it capable of coping with these types of transcription issues.
Subwords for 'tartar': (['tartar', '<ta', '<tar', '<tart', '<tarta', 'tar', 'tart', 'tarta', 'tartar', 'art', 'arta', 'artar', 'artar>', 'rta', 'rtar', 'rtar>', 'tar', 'tar>', 'ar>'], array([ 11178, 624971, 350665, 353353, 792128, 808467, 426671, 552850, 507458, 1840673, 1673096, 1556432, 158876, 160941, 1241347, 720045, 808467, 1528573, 1832099]))
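The subword listing above is consistent with character n-grams of three to six characters taken from the word wrapped in the boundary markers `<` and `>`. The sketch below reproduces that extraction in plain Python to show why a fragment such as “tars” still shares material with “tartars.” The function name is hypothetical, and the n-gram lengths are an assumption inferred from the listing rather than a setting confirmed for these models.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(wrapped):
                ngrams.append(wrapped[start:start + n])
    return ngrams


tartar_ngrams = char_ngrams("tartar")
print(tartar_ngrams)

# N-grams shared between a split fragment and the full word.
shared = set(char_ngrams("tars")) & set(char_ngrams("tartars"))
print(sorted(shared))
```

Because a word's vector draws on the vectors of its n-grams, two strings that share many n-grams end up close together even when one of them is a transcription fragment.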
However, not all spellings of “tars” refer to the Tartars. Some represent variants of the word “tard.” It is difficult to determine to what degree the model picks up on homographs, words that are spelled the same but have different meanings, because “tars” is not a common spelling of “tard” in the corpus.
Additionally, variations of the place name Tarsus may also be muddled with variations for the realm of Tartary because the two have similar or even the same spellings and appear in similar contexts. While not necessarily synonyms, “sartazins,” “sauazins,” and “sairazins,” all variants of Saracen, are close in contextual meaning to Tartar across the training data. Other words like “kiaan,” “kan,” “caan,” and “kaan” (khan) commonly appear in phrases like “grant can” or “mango can” to refer to the ruler of the Mongol khanate. Based on context, “liaan” is a mistranscription of “kaan” as well. While the amount of training data and quality of machine-readable transcriptions are limitations, overall, results suggest that the models are viable for exploratory purposes.
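The nearest neighbor lists used in this evaluation rank every other word by the cosine similarity of its vector to the query vector. The following is a minimal sketch of that ranking. The three-dimensional vectors are invented toy values, not output from either model, and the function names are illustrative. With a trained model, the `fasttext` Python package provides `get_nearest_neighbors(word, k)`, which returns score and word pairs in the same shape.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=3):
    """Rank all other words by cosine similarity to the query word."""
    scores = [(cosine_similarity(vectors[query], vector), word)
              for word, vector in vectors.items() if word != query]
    return sorted(scores, reverse=True)[:k]


# Toy vectors for illustration only; real vectors have 100 dimensions.
toy_vectors = {
    "tartar":  [0.9, 0.1, 0.0],
    "tartars": [0.8, 0.2, 0.1],
    "kaan":    [0.5, 0.5, 0.1],
    "tard":    [0.0, 0.1, 0.9],
}

for score, word in nearest_neighbors("tartar", toy_vectors):
    print(f"{score:.4f}: {word}")
```

Cosine similarity ignores vector length and compares direction only, which is why frequent and rare spellings of the same word can still score highly against each other.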
### Results
After training a custom fastText model on Middle French texts, I ran a series of semantic queries for words relevant to my study and mapped their semantic relationships to other words. Proximity between word vectors can reveal synonyms, antonyms, word analogies, or common contexts. I visualized these relationships using Principal Component Analysis (PCA) graphs, which reduce the high-dimensional word vector data generated by the model to two-dimensional space. Plotting word vectors in two dimensions allows me to capture the constellations of rich semantic structures associated with demonyms for Muslims and territories ruled by Islamic sultanates.
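To make the dimensionality reduction concrete, here is a dependency-free sketch of PCA that projects vectors onto their first two principal components using power iteration. It is meant only to show the mechanics. In practice a library routine such as scikit-learn's `PCA` would normally be used, and nothing here should be read as the plotting code behind the graphs in this post. The labels and three-dimensional vectors are invented toy data.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every vector."""
    dims = len(rows[0])
    means = [sum(row[d] for row in rows) / len(rows) for d in range(dims)]
    return [[row[d] - means[d] for d in range(dims)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered vectors."""
    n, dims = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def multiply(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: repeatedly multiply a start vector and renormalize it."""
    start = [float(d + 1) for d in range(len(matrix))]  # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in start))
    vector = [x / norm for x in start]
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector


def pca_2d(rows):
    """Project vectors onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    value1, pc1 = leading_eigenpair(cov)
    # Deflation removes the variance explained by pc1 so that the next
    # power iteration converges to the second principal component.
    deflated = [[cov[i][j] - value1 * pc1[i] * pc1[j] for j in range(len(cov))]
                for i in range(len(cov))]
    _, pc2 = leading_eigenpair(deflated)
    return [[sum(x * c for x, c in zip(row, pc)) for pc in (pc1, pc2)]
            for row in centered]


# Toy vectors for illustration only.
toy_vectors = {
    "france": [-2.0, -0.5, 0.01],
    "spain":  [-1.0, 0.5, -0.01],
    "cyprus": [0.0, 0.0, 0.0],
    "syria":  [1.0, -0.5, 0.01],
    "persia": [2.0, 0.5, -0.01],
}
coords = pca_2d(list(toy_vectors.values()))
for word, (x, y) in zip(toy_vectors, coords):
    print(f"{word:8s} {x:+.3f} {y:+.3f}")
```

The sign of each axis is arbitrary in PCA, so only the relative positions of the points carry meaning, and distances in the two-dimensional plot only approximate distances in the full vector space.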
PCA graph of “orient” and its spelling variants using the cosine similarity of vectors from the inventories model
While some of my findings are expected—such as the tendency for Saracens to be discussed in relation to Christians whereas Christians are discussed in a broader range of contexts—others are more surprising. For instance, results from the inventories model demonstrate that terms related to _orient_ and _outremer_ tended to refer to natural resources extracted from the environment like gemstones, while _Sarrasinois_ or variations of _sarrazen_ tended to refer broadly to man-made goods from Islamic kingdoms. However, man-made goods more commonly co-occurred with the adjectival forms of demonyms, like Turkish and Moorish, or with locationally specific designators, like from Damascus or from Alexandria. These lexicons suggest that applying the terms orientalist, oriental, Eastern, or Orientalism to the Middle Ages is too totalizing to accurately describe how Latin Christendom perceived Muslim peoples and the geographies they resided in. Similar trends are mirrored in the narrative sources as well.
Taking inspiration from Max M. Louwerse and Nick Benesh’s “Representing Spatial Structure Through Maps and Language: Lord of the Rings Encodes the Spatial Structure of Middle Earth,” I have also used the results of the narrative sources model to create a “cognitive map,” a mental representation of spatial information about places.4 In the article, Louwerse and Benesh demonstrate that spatial mental representations can be mapped from linguistic sources through statistical linguistic frequencies. Their results suggest how the physical distance between locations in Middle Earth can be “estimated by the lexical company they keep, leading to the conclusion that language encodes spatial structure.”5 This inspired me to look at the cosine similarity of location names in narrative sources with the question “How directly will my results correspond to a geographic map of the world, a Ptolemaic map, or a T-O map?” in mind.
T-O Map, New Haven, Beinecke Rare Book and Manuscript Library, Beinecke MS 358, fol. 74v
To display my results, I created two PCA graphs: one that shows vectors for different spelling variants of place names and another that shows the centroid of the spelling variant vectors for each place. The results show that England, Portugal, Spain, Bohemia, France, and specific city-states in Italy are clustered together in a “Europe” cluster. Hungary hovers further off to the side from this cluster. The PCA graph that contains spelling variant vectors shows that this is because some variants of Hungary hover closer to Syria, Mesopotamia, and Trebizond, possibly because of Hungary’s role in the late medieval crusades. On the outskirts of the European cluster is another smaller cluster of “in-between” places: Italy, Cyprus, Africa, and Israel (the traditional center point of the T-O map). Notably, Jerusalem and Judea are not in the center between the “Europe” cluster and the other clusters containing a mix of locations from what is usually labeled “Asia” and “Africa” on a T-O map. This may be the result of pilgrimage narratives that would have clustered location terms for places along the common pilgrimage route (Galilee, Bethlehem, Nazareth, Jericho, etc.). The clusters on the bottom half of the PCA centroid graph containing Persia, Tartary, Armenia, Georgia, Turkistan, Türkiye, and Syria are more perplexing. These clusters likely represent a more general mode of spatial thinking, one less influenced by the standard pilgrimage route from Venice to Jerusalem.
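The centroid of a place name is the component-wise mean of the vectors of its spelling variants. A minimal sketch of that calculation follows. The variant groupings and two-dimensional vectors are invented for illustration and are not taken from the model, and the grouping of spellings under one place is assumed to be done by hand.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Toy vectors and hypothetical variant groupings, for illustration only.
toy_vectors = {
    "hongrie":  [0.2, 0.8],
    "hongerie": [0.4, 0.6],
    "ongrie":   [0.9, 0.1],
    "surie":    [1.0, 0.0],
    "syrie":    [0.8, 0.2],
}
variants = {
    "Hungary": ["hongrie", "hongerie", "ongrie"],
    "Syria": ["surie", "syrie"],
}

centroids = {
    place: centroid([toy_vectors[spelling] for spelling in spellings])
    for place, spellings in variants.items()
}
for place, vector in centroids.items():
    print(place, [round(x, 3) for x in vector])
```

In this toy example one outlying variant pulls the Hungary centroid toward Syria, which mirrors the kind of effect described above: a centroid summarizes a place in one point but hides how widely its variants are spread.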
PCA of Centroids for Each Geographical Term

PCA of Word-to-Vec Cognitive Map
Potentially the most interesting finding from this lower half of the centroid PCA graph is the cluster containing “Orient,” “Occident,” and “Indie” (a term that refers to the _Tribus Indies_, the geographical Orient beyond Muslim-occupied lands). One might expect Occident somewhere amid the “Europe” cluster, but instead it sits right beside Orient, indicating that Occident is used as a complementary term to Orient and rarely independently. As a whole, the vectors do not spatially superimpose onto either the Ptolemaic map, which became popular in Latin Christendom during the fifteenth century, or the T-O map. They represent a more complex web of spatial thinking composed of several different discourses, some on pilgrimage, others on crusade. While ideas about the spatial organization of “Europe” were relatively stable across these discourses, as my dissertation shows, spatial ideas of “Africa” and “Asia” were more fluid. They resist modern East-West binaries in their multiplicity.
The digital humanities methodology that I have presented in this blog post is exploratory, and, while it exposes the complexities of spatial thinking in the fifteenth century, it does not broach the question “why?” For this, I turn to close readings of my sources in a process of oscillated reading that moves between the “distant” reading of digital humanities and traditional “close” reading.6 In zooming in and out on my sources, my dissertation pieces together why the fifteenth-century Middle French image of the world constantly shifted as the borders of the known world expanded.
* * *
### Appendix
List of manuscripts used for narrative sources model training data:
Manuscript | Lines | Words | Characters
---|---|---|---
BnF, Arsenal 5072 | 7556 | 67337 | 298324
BnF, Arsenal 5073 | 9952 | 83375 | 377390
BnF, Arsenal 5074 | 7443 | 62889 | 282219
BnF, Arsenal 5075 | 7251 | 59794 | 270599
Beinecke, MS 1346 | 349 | 2336 | 11130
BnF, Dupuy 255 | 6314 | 54117 | 246337
BnF, Français 810 | 1172 | 6021 | 27633
BnF, Français 92 | 14741 | 139622 | 637493
BnF, Français 358 | 11221 | 52056 | 239067
BnF, Français 764 | 20124 | 170726 | 703439
BnF, Français 852 | 13170 | 111720 | 504041
BnF, Français 947 | 12166 | 58919 | 272488
BnF, Français 1278 | 7111 | 69901 | 315623
BnF, Français 1380 | 7664 | 43483 | 191096
BnF, Français 1489 | 3898 | 38435 | 165996
BnF, Français 1506 | 14309 | 104188 | 475554
BnF, Français 1721 | 6042 | 39319 | 177393
BnF, Français 2200 | 3560 | 21643 | 102770
BnF, Français 2810 | 12471 | 136670 | 597185
BnF, Français 2868 | 8059 | 45240 | 203902
BnF, Français 5063 | 1622 | 20757 | 90911
BnF, Français 5593 | 950 | 4584 | 22204
BnF, Français 5594 | 21691 | 111592 | 503650
BnF, Français 5646 | 1057 | 7263 | 35610
BnF, Français 6440 | 25092 | 110541 | 509820
BnF, Français 6487 | 1837 | 25264 | 118482
BnF, Français 9087 | 6034 | 50648 | 223412
BnF, Français 9201 | 5963 | 58319 | 272152
BnF, Français 9737 | 8384 | 59454 | 288057
BnF, Français 11594 | 6851 | 40018 | 177228
BnF, Français 11610 | 5991 | 44272 | 201785
BnF, Français 12201 | 3126 | 24732 | 117056
BnF, Français 12572 | 3243 | 25721 | 115456
BnF, Français 12574 | 1710 | 12110 | 53664
BnF, Français 13235 | 7641 | 28444 | 133543
BnF, Français 15217 | 3179 | 28973 | 127020
BnF, Français 19170 | 12161 | 65206 | 296045
BnF, Français 20055 | 3893 | 23342 | 113525
BnF, Français 24210 | 10070 | 96762 | 405137
BnF, Français 24371 | 7049 | 56585 | 259434
BnF, Français 25295 | 950 | 4584 | 22204
BnF, Français 25434 | 4484 | 24918 | 114736
Gent, Universiteitsbibliotheek, MS 2749/11 | 573 | 2983 | 13145
Gent, Universiteitsbibliotheek, MS 418 | 769 | 6547 | 28025
Lille, Ms. Godefroy 50 | 14270 | 117514 | 524452
Metz, Bibliothèque-médiathèque, MS 1562 | 9544 | 99361 | 439803
BnF, NAF 10057 | 11550 | 97338 | 446316
BnF, NAF 20960 | 4089 | 45236 | 209187
Princeton, Garrett 168 | 740 | 3353 | 16694
Newberry Library, Case, MS 54.5 | 3849 | 35911 | 164511
* * *
Featured Image: Jean le Long d’Ypres, Livre de l’Etat du grand Khan, BnF Français 2810, fol. 136v.
Cite this article as: Kimberly Lifton. Talking About Muslims in Middle French: The Potential of Word-to-Vector Models for Studying Semantic Relationships in Medieval Languages. DH Lab (Blog). https://dhlab.hypotheses.org/?p=4713.
* * *
1. Gabor Mihaly Toth, “Women in Early Modern Handwritten News: Random Walks and Semantic Wandering in the Medici Archive,” _Journal of Digital History_ 3.2 (2024). https://journalofdigitalhistory.org/en/article/jnkqqTTKW8km [↩]
2. Suphan Kirmizialtin and David Joseph Wrisley, “Exploring Gulf Manumission Documents with Word Vectors,” _Journal of Digital Islamicate Research_ 2 (2024), 1-29. Also see Avery Blankenship, Sarah Connell, and Quinn Dombrowski, “Understanding and Creating Word Embeddings,” _Programming Historian_ (2024). https://doi.org/10.46430/phen0116; Anton Ehrmanntraut, Thora Hagen, Leonard Konle, and Fotis Jannidis, “Type- and Token-based Word Embeddings in the Digital Humanities,” _Computational Humanities Research_ (2021), 16-38. Word-to-vector has also been used to trace change over time. For instance, see Jaap Verheul, Hannu Salmi, Martin Riedl, Asko Nivala, Lorella Viola, Jana Keck, and Emily Bell, “Using Word Vector Models to Trace Conceptual Change Over Time and Space in Historical Newspapers, 1840-1914,” _Digital Humanities Quarterly_ 16.2 (2022); Melvin Wevers and Marijn Koolen, “Digital Begriffsgeschichte: Tracing Semantic Change Using Word Embeddings,” _Historical Methods: Journal of Quantitative and Interdisciplinary History_ 53.4 (2020), 226-243. [↩]
3. Sergio Torres Aguilar, “TRIDIS v2 : HTR model for multilingual medieval and early modern documentary manuscripts (11th-16th),” Zenodo, September 30, 2024. https://doi.org/10.5281/zenodo.13862096. [↩]
4. Max M. Louwerse and Nick Benesh, “Representing Spatial Structure Through Maps and Language: Lord of the Rings Encodes the Spatial Structure of Middle Earth,” _Cognitive Science: A Multidisciplinary Journal_ 36.8 (2012): 1556-1569, https://doi.org/10.1111/cogs.12000. [↩]
5. Ibid., 1564. [↩]
6. For an argument on the benefits of this method, see Inge van de Ven, “Too Much to Read? Negotiating (Il)legibility between Close and Distant Reading,” in _Legibility in the Age of Signs and Machines_ (Leiden: Brill, 2018), 180-196. [↩]
* * *
OpenEdition suggests that you cite this post as follows:
DH Lab (May 30, 2025). Talking About Muslims in Middle French: The Potential of Word-to-Vector Models for Studying Semantic Relationships in Medieval Languages. _DH Lab_. Retrieved June 29, 2025, from https://doi.org/10.58079/1416d