Cluster 3 – Artificial intelligence, pattern recognition and handwriting recognition

The cluster will focus on the recognition of shapes, characters and handwriting for the study of ancient handwritten and printed books, sigillography, numismatics and heraldry. In particular, it will aim to bring together the complementary developments of Kraken and Arkindex.

1/ IRHT research on Latin and French texts

For several years, the IRHT has been developing computer-based handwriting recognition and analysis programmes (classification and dating) and has published several datasets for training artificial intelligence models. Its main partner in this field, the Teklia company, has developed the Arkindex platform, which uses artificial intelligence and computer vision to support historical and visual analysis. Document analysis starts with low-level elements (layout segmentation, spotting of miniatures and decoration elements, text transcription) and feeds into a higher-level analysis process that still needs to be developed. On the text side, textual analysis supports the identification of minimal semantic units, such as named entities or textual quotations and reuse. On the visual side, the identification of iconographic elements (miniatures, historiated initials, coats of arms) can be complemented by iconographic indexing or the measurement of stylistic similarities. Arkindex already provides an architecture designed for mass processing, production monitoring (quality indicators, supervision) and user-oriented services, with linguistic functionalities that can be used for the mass processing of BVMM resources.

2/ Kraken, an HTR module for all scripts (e-Scripta, EPHE-PSL)

Kraken, which is entirely open source (Apache 2 licence), is the first element of a suite of tools designed by EPHE-PSL researchers within Scripta-PSL to study manuscript corpora written in any language and any writing system. The tool is flexible: users define their own transcription principles, up to full transliteration. They can publish their models in an open archive hosted on Zenodo or download existing ones, so that other users can take a model and re-train it for a different but similar script.

3/ Pattern analysis: recognition of watermarks, decorative elements, heraldry and numismatics (ENC-PSL, IRHT, EPHE-PSL, CRAHAM)

The tool built by the ENC-PSL, IRHT, INRIA and École des Ponts in the "Filigranes pour tous" project, which provides access to the Briquet repertoire and identifies watermarks photographed with a smartphone, must be improved and its database enriched with metadata. In coordination with the Arkindex developments, the docExtractor application (INRIA-IRHT) will also be developed further, so that all decorative elements in a given manuscript can be isolated and compared from one manuscript to another.
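
The comparison of decorative elements from one manuscript to another can be illustrated with a minimal sketch. It assumes that each isolated element has already been reduced to a numeric feature vector by some upstream model; the vectors, the element identifiers and the function names below are hypothetical and do not describe how docExtractor or Arkindex actually work. Cosine similarity is used here only as one common way of ranking candidates.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(query, candidates):
    """Return (element_id, score) pairs sorted from most to least similar."""
    scored = [(eid, cosine_similarity(query, vec)) for eid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors for elements isolated in a second manuscript.
manuscript_b = {
    "initial_f12r": [0.9, 0.1, 0.3],
    "border_f03v": [0.1, 0.8, 0.2],
}

# Hypothetical feature vector for an initial isolated in a first manuscript.
query = [0.8, 0.2, 0.3]

ranking = rank_matches(query, manuscript_b)
for element_id, score in ranking:
    print(f"{element_id}: {score:.3f}")
```

In such a setup, the quality of the ranking depends almost entirely on how the feature vectors are produced; the similarity measure itself is interchangeable.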

Through the BaTyr database, the BVH seeks to enable automatic recognition of shapes and fonts for the identification of printers. The IRHT intends to develop, with INRIA, similar processes adapted to handwritten scripts and their variability.

In the field of heraldry, we will further develop the module already built in e-Signa (SAPRAT, EPHE-PSL), which enables heraldic information to be formalised and rendered as a standardised drawing and description. The heraldic search interface will make it possible to query the portal's resources by image or text using a simple graphic reconstruction tool.

A very similar methodology is being developed at AOROC for coins, in collaboration with Mines ParisTech. AOROC will continue to acquire data on Celtic and Italic (pre-Roman) coins, captured with a high-precision 3D scanner, and make them available online through open-access tools (Chronocarto, Sketchfab, Nakala/Huma-Num). The CRAHAM will participate in the effort to reference and index imprints representing coins or monetary types.