Cluster 7 – Text interoperability and analysis

Jean-Baptiste CAMPS
PhD in medieval studies. Associate professor of computational philology
Bruno BON
Research engineer. Head of the Lexicography and Semantics section
Deputy director of Sources Chrétiennes. Coordinator of the BIBLINDEX project

Through text analysis and text mining, we will work on the interoperability of corpora at an extremely fine granularity, which will allow us to go beyond what is done elsewhere.

1/ The Distributed Text Services (DTS) text sharing protocol

The Distributed Text Services (DTS) project aims to define a protocol for sharing texts and their passages via specific APIs. DTS is to texts what IIIF is to images. It is supported by the CJM and HiSoMA, in collaboration with the MRSH of Caen.

The specifications of this protocol are currently open to external contributions in order to ensure its compatibility with the widest possible range of projects. It is based on text sharing in TEI, a REST architecture and a catalogue expressed in JSON-LD.
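
As a rough illustration of what a REST architecture with a JSON-LD catalogue implies for a client, the sketch below builds a request URL and reads the members of a catalogue entry. The server address, the endpoint name, the "id" parameter and the sample JSON-LD document are all hypothetical and are not taken from the DTS specification.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address; a real server publishes its own endpoints.
BASE_URL = "https://example.org/api/dts"


def build_url(endpoint: str, **params: str) -> str:
    """Build a GET URL for an endpoint, with optional query parameters."""
    query = urlencode(params)
    if query:
        return f"{BASE_URL}/{endpoint}?{query}"
    return f"{BASE_URL}/{endpoint}"


# Invented JSON-LD catalogue entry, for illustration only.
SAMPLE_RESPONSE = """
{
  "@context": {"dc": "http://purl.org/dc/terms/"},
  "@id": "urn:example:collection:1",
  "@type": "Collection",
  "title": "Sample collection",
  "member": [
    {"@id": "urn:example:text:1", "@type": "Resource", "title": "Sample text"}
  ]
}
"""


def list_members(jsonld_text: str) -> list[tuple[str, str]]:
    """Return (identifier, title) pairs for the members of a catalogue entry."""
    doc = json.loads(jsonld_text)
    return [(m["@id"], m["title"]) for m in doc.get("member", [])]


url = build_url("collection", id="urn:example:collection:1")
members = list_members(SAMPLE_RESPONSE)
```

No network call is made here: the point is only that a client needs nothing beyond an HTTP URL and a JSON parser to browse such a catalogue.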

The implementation of DTS in TEIPublisher, which is incomplete to date, will be finalised with the help of HiSoMA in conjunction with the community. The DTS API will also have to be implemented within the editing tools of the MRSH of Caen. The success of the API requires a massive effort in user training, but also in the development of client and server software suites (CapiTainS, TEIPublisher, etc.).

2/ Lemmatisation and translation assistance for ancient texts

HiSoMA has created the Biblindex lemmatisation prototype (based on a set of 70,000 form/lemma pairs from biblical and patristic Greek texts). For the Jonas database of medieval French texts (IRHT), the aim is to lemmatise incipits and explicits in order to get around the orthographic variation of Old French.
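
To illustrate why lemmatising incipits helps with orthographic variation, here is a minimal sketch of a lookup built from form/lemma pairs. The pairs and the two incipits are invented for the example and do not come from Biblindex or Jonas; a real lemmatiser also needs context to disambiguate forms, which a plain lookup table cannot do.

```python
# Invented form/lemma pairs standing in for a much larger table.
FORM_TO_LEMMA = {
    "seignor": "seignor", "seignur": "seignor", "segnor": "seignor",
    "oiez": "oir", "oez": "oir", "oyez": "oir",
    "chançon": "chanson", "cançun": "chanson", "chanson": "chanson",
}


def lemmatise(text: str) -> list[str]:
    """Map each whitespace-separated token to its lemma.

    Unknown forms are kept unchanged (lowercased).
    """
    tokens = text.lower().split()
    return [FORM_TO_LEMMA.get(token, token) for token in tokens]


# Two spellings of the same (invented) incipit.
first = lemmatise("Seignor oiez chançon")
second = lemmatise("Segnor oez cançun")
same_incipit = first == second
```

Once both incipits are reduced to the same lemma sequence, they can be matched by simple equality even though no surface form is shared.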

Collatinus and Eulexis, software tools that assist in the translation of Latin and Greek, will be enriched with new content and new features. Development of Collatinus will continue on medieval Latin and prosopographical dictionaries. Eulexis will be enriched and will integrate Koine Greek. Biblissima+ aims to transpose the structure of these tools to other languages.

The CJM's lemmatiser (Pie) and its post-correction application (Pyrrha) are now operational for classical Latin and Old French. These tools will be linked to automatic handwritten text recognition (HTR) within e-Scripta, in connection with the work of cluster 3.

The lemmatised lexical corpus of 50 million words of medieval Latin (period 800-1200), created by the IRHT with the support of the ANR Velum, could be doubled in size by enlarging its scope (Iberian or Italian texts of the 8th century; Germanic or Slavic texts of the 13th century) so that it takes into account both Merovingian Latin and Scholastic Latin.

3/ Textometry, stylometry and alignment

The CJM wants to create a Computational Resource Centre for languages with graphic variation which will focus on:

  • the issue of linguistic annotation (see above)
  • processing tasks: answering questions of dating and localisation; aligning and collating different versions; detecting named entities.

The challenge is the automatic processing of historical languages with high graphic variation and the provision of tools (web interfaces, APIs, algorithms) and models (mainly for Gallo-Romance languages and Latin). In the longer term, dialectometric services (a heat-map system) and stylometric services are envisaged.

For stylometry, we would like features for automatically matching texts by style or content, and even for detecting paraphrases, particularly across languages.
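
One simple way to match texts by style can be sketched as follows, assuming character trigram profiles compared by cosine similarity. This is a common stylometric feature choice, not a method attributed to the CJM, and the function names are hypothetical. It does not address cross-language paraphrase detection.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def closest(query: str, candidates: dict[str, str]) -> str:
    """Return the name of the candidate text most similar to the query."""
    profile = char_ngrams(query)
    return max(candidates, key=lambda name: cosine(profile, char_ngrams(candidates[name])))


# Invented snippets, for illustration only.
candidates = {
    "A": "li rois est venus a la cort",
    "B": "arma virumque cano troiae",
}
best = closest("li rois est venuz", candidates)
```

Character n-grams tolerate spelling variation better than whole-word counts, which is why they are a plausible starting point for languages with high graphic variation.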

The Biblindex project led by HiSoMA aims to develop tools for detecting intertextuality based on lemmatisation and textometry, and to make them generic enough to apply to the search for any citation phenomenon in ancient texts.

At the crossroads of editing, lemmatisation and computational exploitation of textual data, the CJM and the CIHAM propose the creation of a tool capable of reproducing in an automated way the complete process of establishing a text, from the macroscopic level (alignment by paragraph or other textual structure) to the microscopic level (word-by-word alignment and then construction of a typed apparatus offering an analysis of the similarity of variants).
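
The microscopic step (word-by-word alignment followed by an apparatus) can be sketched with Python's standard difflib module. This is only an assumption about how such a step might look, not the tool proposed by the CJM and the CIHAM; the two witnesses are invented, and the edit operation (replace, insert, delete) is used here as a very crude variant type.

```python
from difflib import SequenceMatcher


def collate(base: str, witness: str) -> list[tuple[str, str, str]]:
    """Align two witnesses word by word and list the places where they differ.

    Each entry is (operation, reading in base, reading in witness).
    """
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    entries = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            entries.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return entries


# Invented readings, for illustration only.
apparatus = collate("li rois vint a la cort", "li reis vint en la grant cort")
# apparatus == [("replace", "rois", "reis"),
#               ("replace", "a", "en"),
#               ("insert", "", "grant")]
```

The same function can be applied first to paragraphs and then to the words inside each aligned paragraph, which mirrors the macroscopic-to-microscopic progression described above.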

Automatic processing will thus make it possible to classify variants as graphical, grammatical or semantic, or even to classify them more finely using context-dependent semantic representations.
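
A rule-based version of this three-way classification can be sketched as follows, assuming each aligned token already carries a lemma and a morphological tag. The Token fields, the tag strings and the decision rules are hypothetical simplifications: any change of lemma is counted as semantic, and the finer context-based classification is not attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # spelling as found in the witness
    lemma: str  # dictionary headword
    morph: str  # hypothetical morphological tag, e.g. "NOM.SG"


def classify_variant(a: Token, b: Token) -> str:
    """Classify the difference between two aligned tokens."""
    if a.form == b.form:
        return "identical"
    if a.lemma != b.lemma:
        return "semantic"      # different word altogether
    if a.morph != b.morph:
        return "grammatical"   # same word, different inflection
    return "graphical"         # same word and inflection, different spelling


# Invented examples, for illustration only.
graphical = classify_variant(Token("rois", "roi", "NOM.SG"), Token("reis", "roi", "NOM.SG"))
grammatical = classify_variant(Token("rois", "roi", "NOM.SG"), Token("roi", "roi", "OBL.SG"))
semantic = classify_variant(Token("rois", "roi", "NOM.SG"), Token("cuens", "comte", "NOM.SG"))
```

The order of the tests matters: the lemma is checked before the morphology so that a substitution of one word for another is never reported as a mere inflectional difference.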

From an ecdotic point of view, critical reconstructions of archetypes and visualisations of the relationships between the extant witnesses of a text will be generated. The contribution of Biblissima+ would make it possible to improve the code and develop a web application.