The potential of Transkribus for textual research, the opinion of Daniele Fusi

This is the third and last part of the interview with philologist Daniele Fusi on the Transkribus platform and the frontiers of electronic text analysis. In the first part, Dr. Fusi described his experience as a digital philologist and some of the programs used in his research field; in the second, he explained the difference between digital text and digitized text.


Dr. Fusi, what are the most useful tools that the Transkribus platform offers for in-depth textual research?

Transkribus is an excellent, exemplary framework: it combines advanced technologies with a wide range of ways of exposing their functionality, together with a philosophy of “crowd-sourcing” in which content (and especially templates) is enriched and refined by the users themselves, within their own projects or simply on a voluntary basis. However, I have no significant direct experience of using it, as my field of operation usually lies downstream of the transcription process.

In the case of your Gramsci project, however, the first step, which is no small task, has already been done with automatic transcription, although the result will necessarily need correcting. If anything, the keyword spotting function should not be confused with an independent, targeted text indexing and search system, which you will need to implement separately for your project. One of the most immediate aims of this function in Transkribus is to improve overall recognition, since it makes it possible to run a text search directly from within the transcription system; in other words, alongside recognition, a first, basic indexing of “words” is carried out, which allows text-based searches. Since at that stage the recognition of the text cannot always be trusted, the search has to cast a wide net: hence the choice of regular expressions, which is optimal here, and of a confidence threshold that can be raised or lowered depending on the number of results obtained. In this way we can, for example, search for all the occurrences of what has been recognized as a certain “word”, which we may have seen misrecognized in some cases, examine them, and where possible apply a systematic correction to all the instances traced back to that word, thus drastically reducing the time needed for manual correction.

Given the nature of the problem, this is certainly a more powerful kind of search than a simple “literal” full-text search (or at most one with a few wildcards); but it is designed to operate within the transcription process, not to replace the indexing and search systems that belong downstream of it, in an autonomous digital edition.
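The correction workflow described above (a broad regular-expression search, a confidence threshold that is raised or lowered according to the number of results, then one systematic correction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Transkribus interface: the `Word` record, the `spot` and `correct_all` helpers, the sample words and their confidence scores are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record for one recognized word (not a Transkribus type)."""
    page: int
    text: str          # the word as the recognizer read it
    confidence: float  # recognizer confidence, from 0.0 to 1.0


def spot(words, pattern, threshold):
    """Return the words whose whole text matches `pattern` and whose
    confidence is at least `threshold`."""
    rx = re.compile(pattern)
    return [w for w in words if w.confidence >= threshold and rx.fullmatch(w.text)]


def correct_all(matches, replacement):
    """Apply one systematic correction to every reviewed match."""
    for w in matches:
        w.text = replacement
    return len(matches)


# Invented sample data: three readings of "egemonia" plus an unrelated word.
WORDS = [
    Word(1, "egemonia", 0.93),
    Word(2, "egernonia", 0.61),
    Word(3, "egemonla", 0.48),
    Word(3, "economia", 0.90),
]

# A deliberately loose pattern, so that misread variants are caught too.
PATTERN = r"eg\w{2,4}on\w{1,2}a"

strict = spot(WORDS, PATTERN, threshold=0.5)  # fewer, more reliable hits
broad = spot(WORDS, PATTERN, threshold=0.4)   # lower threshold, more hits
corrected = correct_all(broad, "egemonia")    # after checking the hits by eye
```

Lowering the threshold from 0.5 to 0.4 brings in the third, poorly recognized variant, while "economia" never matches the pattern and is left untouched.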

Your edition, moreover, must not only offer the usual text search modes (literal search, operators, wildcards, regular expressions, fuzzy matching, etc.), but also connect the text to a set of important metadata covering all the semantic aspects that you extract from it.
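To make the difference between these search modes concrete, here is a minimal sketch using only the Python standard library. The word list and the function names are assumptions made for the example; a real edition would rely on a proper search engine rather than on scanning a list.

```python
import difflib
import fnmatch
import re

# Invented toy vocabulary standing in for the indexed words of an edition.
VOCAB = ["egemonia", "egemonico", "economia", "storia", "storico"]


def literal(term):
    """Exact match only."""
    return [w for w in VOCAB if w == term]


def wildcard(pattern):
    """Shell-style wildcards: * for any run of characters, ? for one."""
    return fnmatch.filter(VOCAB, pattern)


def regex(pattern):
    """Regular expression that must match the whole word."""
    return [w for w in VOCAB if re.fullmatch(pattern, w)]


def fuzzy(term, cutoff=0.8):
    """Approximate match: words whose similarity ratio to `term` reaches `cutoff`."""
    return difflib.get_close_matches(term, VOCAB, n=5, cutoff=cutoff)
```

For instance, `wildcard("egemon*")` finds both "egemonia" and "egemonico", while `fuzzy("egemonja")` still reaches "egemonia" despite the misspelling, which a literal search would miss.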

This follows essentially from the open nature of such editions, whose usefulness is multiplied precisely by the number and variety of their users: for example, a linguist might be interested in searches aimed at studying the morphology or semantics of certain words, while a philosopher or historian might prefer to start from concepts, even ones with no specific linguistic expression, and be led by them to all the relevant passages, as already happens for example in the faceted search of the site. In some cases it might also be useful to combine the two kinds of search, looking for all the words connected to a certain concept, or vice versa.
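The two entry points mentioned here, words and concepts, and their combination can be illustrated with a small sketch. Everything in it is hypothetical: the `Passage` record, the concept tags and the sample passages are invented, and the code makes no claim about how the edition or its faceted search is actually built.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """Hypothetical passage carrying its words and its editor-assigned concepts."""
    id: str
    words: tuple
    concepts: tuple


# Invented sample: p2 is tagged "hegemony" although the word never occurs in it.
PASSAGES = [
    Passage("p1", ("egemonia", "classe", "stato"), ("hegemony", "state")),
    Passage("p2", ("intellettuali", "direzione"), ("hegemony", "intellectuals")),
    Passage("p3", ("stato", "partito"), ("state",)),
]


def build_index(passages, attr):
    """Map each value of `attr` ("words" or "concepts") to the ids of its passages."""
    index = defaultdict(set)
    for p in passages:
        for key in getattr(p, attr):
            index[key].add(p.id)
    return index


BY_WORD = build_index(PASSAGES, "words")
BY_CONCEPT = build_index(PASSAGES, "concepts")


def words_for_concept(concept):
    """All words occurring in passages tagged with `concept`."""
    ids = BY_CONCEPT.get(concept, set())
    return sorted({w for p in PASSAGES if p.id in ids for w in p.words})


def concepts_for_word(word):
    """All concepts attached to passages that contain `word`."""
    ids = BY_WORD.get(word, set())
    return sorted({c for p in PASSAGES if p.id in ids for c in p.concepts})
```

A search for the word "egemonia" reaches only p1, whereas the concept "hegemony" also leads to p2: this is the case of a concept with no specific linguistic expression in the passage.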

One of the most interesting features of digital editions lies in their open nature, which makes them a tool that can be used for searches their creator never even imagined; in this sense they are a starting point rather than a point of arrival, precisely because of the Heraclitean flux in which their digital nature places them.
