Hello and welcome to the course "Natural Language Processing for Digital Humanities", Module 1: "Paths into the Digital World". In this module we would like to present three main topics. First of all, what do we mean by "Digital Humanities"? Next, we would like to introduce the Text+Berg Korpus (Swiss Alpine Text Corpus). Finally, we would like to explain the steps that are necessary to go from a text image to digital text, i.e. the digitisation of modern and historical texts.

Let's try to define the term "Digital Humanities", looking at it from a very broad perspective: How can cultural heritage be stored in digital form? How can we access historical texts in order to find, read and analyse them? And finally, how can new methods be applied to historical texts?

Here is an example of a geographic search. This is a text that has been geo-referenced, which means that all geographical entities have been marked and annotated with geographic coordinates. Two more examples of new methods applied to old texts: here you see a collocation network extracted from the Text+Berg Korpus (Swiss Alpine Text Corpus), which shows which words are semantically related; these related words are visualised in the form of a complex network. We also created a bilingual search engine, which allows us to automatically identify the occurrence and frequency of different translation options in a text. Here is an example for the German word "Sturm" (storm): in our corpus it has been translated 202 times as "tempête" (storm) and 14 times as "vent" (wind) in French. We get these usage examples together with the frequencies, and we see that in our text collection it can also be translated as "tourmente" or "orage" in French.

There are different challenges in corpus acquisition, depending on how we collect the data: whether we gather electronic documents or digitise printed works. In the first case, where we collect documents from the web or from companies or institutions, we need to ensure that these documents, which may come in different formats (PDF, Word, HTML, for instance), are converted into uniform digital text. The digitisation of printed works, which is nowadays done on a large scale by libraries or by Google Books, is a question of scanning, converting the scans into digital text, and reducing the number of OCR (optical character recognition) errors as far as possible. That is what we want to show you today.

But first, a short overview of the Text+Berg Korpus (Swiss Alpine Text Corpus). This corpus contains texts from the Schweizer Alpen-Club (Swiss Alpine Club) spanning the 150 years since the club was established. The first edition of its yearbook was published in 1864; in 1925, the yearbook was renamed "Die Alpen, Les Alpes, Le Alpi" (in German, French and Italian). The corpus comprises more than 100,000 pages.

Why did we create such a corpus? It contains topic-specific texts over a time span of 150 years; the yearbook was published regularly, and in only two years was no volume published at all. Moreover, the texts are available in several languages: German, French, Italian, Romansh (Rhaeto-Romanic), Swiss German and English. These are the languages the corpus texts are written in, and partially also translated into. From 1957 on, the texts have been translated from German into French (and vice versa), and from 2012 on also into Italian.

Here are a couple of examples of texts from this corpus: first, a mountaineering report from 1904,
"Sechs Tage in den Alpen von Cogne" ("Six Days in the Alps of Cogne") a typical example from this epoch. Then, a French example regarding the topic "Change and Transformation of the glacial landscape in Switzerland" (1955) An an Italian example, a mountaineering report from Peru (1983). Why do we use this corpus in our course? It serves as an illustration of the steps involved in the corpus acquisition, and shows how linguistic annotations can be made in a multilingual corpus, how the semantic indexing and the alignment of translated texts works and how corpuslinguistic analyses can be done. How did we proceed with the digitisation? As a fist step, we collected all books. Members of the Alpen-Club donated the books in several copies. We were then able to cut the books open and used a standard scanner with automated paper feed for the digitisation and archiving. The OCR system marks both image and text blocks. The image blocks are here marked in red. For the recognition of the text blocks, the system marks those parts where it is not sure about the content. It is visible, that those are partially correct recognitions nevertheless, some OCR errors have crept in. The ornamental "D" can not be recognised at the beginning of the text. In the second line, the word "Rekord" (record) has been recognized as "Eekord" because the letters "R" and "E" have been mixed up. In the very last line, the "Alpine Journal" was incorrectly recognised as "Alpino Journal". Those are some examples for typical OCR errors. After the OCR step, we can export the digital text in order to do some further processing. The text is initially cleaned from additional information included in the HTML format. Next, the text is verticalised and tokenised, i.e. the punctuation is split. The steps involved in the process of tokenisation will be explained later more in detail. We will also come back to the steps required for the automatic part-of-speech tagging and the lemmatisation that have been performed on this example sentence. The entire information has been saved as XML. After the attribution of linguistic information the automatic recognition of personal names, and temporal expressions, that will be marked and annotated in our documents. The Text+Berg corpus contains texts with up to 23 million words in German, 22 million words in French according to the tokenization, i.e. counting also punctuation marks. 1 million words for Italian and a smaller amount for Romansh, Swiss German and English. How do we get from text image to digital text? How does an OCR system work? The OCR system collects the image points and creates potential characters, the character sequences are then grouped to possible words. Finally, the system compares the word suggestions to the dictionary entries in the OCR system. If the word is found, the systems return this word as recognised. If the word cannot be found, the most likely character sequence is returned. That's why in such cases the OCR quality is not really reliable. The quality depends very much on the fonts, i. e. the letter types. The first texts from the very beginning in 1864 on, were in the font Antigua. Antigua is a modern font and allows a good OCR quality while you get a medium OCR quality for Fraktur fonts. Here an example from a reading book (1917). The problem even increases with the automatic recognition of handwritings causing a lower OCR quality. We will not go into that in this course. Instead, we will focus on the OCR quality when dealing with modern fonts. 
What kinds of challenges does the OCR system face? On the one hand, spelling variants are problematic, in particular spelling variants over the course of time:
words such as "Nachteil" (disadvantage), "passieren" (happen) and "successive" in their 19th-century spellings are not included in the OCR dictionary and are therefore not recognised as well as the modern variants of the 20th century. Spelling variants can also be caused by dialects: in our corpus we find several quotations in Swiss German, with spelling variants that cannot be found in the OCR dictionary.

What can we do to improve OCR quality? Typically, you add words to the OCR system's dictionary; examples have already been mentioned, such as old spellings and dialectal variants, but also personal names, toponyms and names of mountains can be added. Our experience is that extending the dictionary improves the recognition of exactly these words, while the recognition elsewhere gets worse, so no great advantage can be achieved this way.

That is why we need to focus on the subsequent automatic correction of OCR errors, and we tried out several approaches. First of all, we worked with two different OCR systems and compared their output. If the two systems return different suggestions, we choose the word that occurs more frequently in the corpus. In this example, system 1 suggests "Wunseh" and system 2 "Wunsch" (wish). We know that the German word "Wunsch" is much more frequent in our corpus than the first variant and are thus able to choose correctly. Our experience with this approach is the following: it is quite time-consuming to run two different OCR systems and to align their results precisely, position by position. Nevertheless, a small but observable improvement in text quality can be achieved.

A different approach is automatic spell checking without a final human judgement. We find incorrect words in the OCR output by checking them against a further dictionary, and we then compute the word in the corpus that is most similar to the erroneous word. An example: the German word "Erstüberquerung" (first crossing) is recognised correctly, but "Männergesaugverein" is an incorrect recognition of "Männergesangsverein" (male choral society). We therefore look for a very similar-looking word, i.e. a word that differs in only a few letters and exists in our corpus, namely "Männergesangsverein", and replace the word at this point. In our experience this method works fairly well, but only for words with a minimum length of approximately 15 letters (a small sketch of both correction ideas follows below).

Another method that we tried out is web-based crowd correction. In this approach, as many people as possible are asked to correct the remaining errors in the recognised texts via an online application. We built a system named "Kokos" and loaded 21,000 pages of the 19th-century part of our corpus into it. In this way we obtained 250,000 corrections from several people within six months. This experience shows the usefulness of crowd correction, which eliminated the OCR errors almost completely. However, finding the right people for such a task and managing their corrections is quite laborious: you have to keep this target group informed on a regular basis.

In the following slides, I will mention a couple of decisions regarding the design of our system. It displays the recognised text on the left-hand side and the corresponding text image on the right-hand side, and these texts can be accessed in different ways.
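As announced above, here is a minimal sketch of the two automatic correction ideas: choosing between the outputs of two OCR systems by corpus frequency, and replacing long unknown words with the most similar corpus word. It is an illustration only, with invented function names, frequencies and vocabulary; difflib's string similarity stands in here for the edit-distance comparison, and none of this is the actual Text+Berg code.

```python
# Illustrative sketch of the two OCR post-correction ideas described above.
from difflib import get_close_matches

def choose_variant(candidate_a, candidate_b, corpus_freq):
    """Merging two OCR systems: where they disagree, pick the variant
    that occurs more frequently in the corpus, e.g. 'Wunsch' over 'Wunseh'."""
    return max((candidate_a, candidate_b), key=lambda w: corpus_freq.get(w, 0))

def correct_long_word(word, corpus_vocab, min_len=15):
    """Similarity-based correction: long words that are not in the
    vocabulary are replaced with the most similar corpus word."""
    if len(word) < min_len or word in corpus_vocab:
        return word
    match = get_close_matches(word, corpus_vocab, n=1, cutoff=0.9)
    return match[0] if match else word

# Invented frequencies and a tiny vocabulary, just for the demonstration.
freq = {"Wunsch": 1500, "Wunseh": 2}
vocab = ["Erstüberquerung", "Männergesangsverein", "Wunsch"]

print(choose_variant("Wunseh", "Wunsch", freq))        # Wunsch
print(correct_long_word("Männergesaugverein", vocab))  # Männergesangsverein
print(correct_long_word("Erstüberquerung", vocab))     # unchanged, already known
```

Back now to the design of the crowd-correction interface.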
In our view, it is important to integrate the OCR correction into the reading experience and to allow access via different articles, years and search terms. In this way the proofreaders read the texts with pleasure and, in addition, correct the texts they are currently reading. This needs to be organised as pleasantly as possible.

A few examples: we provide an edit window in which the most important special characters can be clicked and quickly inserted into the text if they are needed for the correction of a certain word. In addition, the selected word is highlighted in the text image, so the user can switch quickly between the recognised text and the text image.

The search function also needs to be as convenient as possible: we decided to deliver the image of each search hit as well, so that one can quickly decide whether this is an occurrence that needs to be corrected or whether it should stay as it is. You can see in this example that we searched for "Kinder" (children) and, of course, we only want to correct those instances where the text actually reads "Rinder" (cattle) instead of "Kinder".

The motivation of the proofreaders must be maintained through constant feedback. We therefore added a ranking list in which each proofreader can see the total number of corrections they have made compared with the other proofreaders, and where they stand in the ranking. We also want to direct the correctors to those pages that have not yet been fully corrected, so we provide an overview for every book showing which pages are marked as already corrected and which pages still need correction. In this way we hope to point out the remaining pages and to keep the proofreaders informed about the progress of the entire project, so that all remaining pages are corrected step by step and quickly.

Let's summarise: the starting point for the construction of a corpus is a set of texts in different media and formats, in our case distributed over a long time span and several languages. The aim is a homogeneous digital representation of the texts with as few OCR errors as possible, as shown before, together with XML annotations for structural information such as paragraphs, sentences and words, and for linguistic information such as part-of-speech classes, proper names or syntactic functions. All of this will be demonstrated and explained in the following modules.

We would like to thank you for your attention and look forward to seeing you in the next modules. Thank you very much!