Welcome to the second part of Module 1, titled "Challenges of Corpus Acquisition". In this part, I will show you how to build a corpus on your own. There are two approaches: you can collect electronic texts, i.e. texts that are already digitised and can be found online, or you can digitise texts that exist on paper. In this lesson I will explain what has to be taken into account in each case.

Let's start with the already digitised texts. These are available online in different formats: websites, PDF files and documents in other formats. So they need to be converted. Either way, our desired target format is XML. The texts also need to be cleaned, so that we obtain structured and annotated data at the end of the processing pipeline.

To collect data online, we use a specific tool: a web crawler. The crawler works as follows: it starts with a given web address (URL) and loads the corresponding website. This website is rated, filtered and stored. Then all other addresses found on the same page are collected, and the same process starts over with these new URLs until a certain stop criterion is reached. The data processing step includes, for instance, the conversion from the website format HTML to the markup format XML. In this step the texts are cleaned: navigation elements from the HTML pages are removed, and redundant blank spaces, newlines and similar special characters are stripped out (a short code sketch of such a crawl-and-clean loop follows below).

If we are dealing not with HTML documents but with PDF documents, which is quite frequent, there is a wide range of software that can convert PDF documents to XML. The PDF format typically stores layout information, which can lead to problems. Many of the conversion tools we tested had trouble with letters sticking together: the blank spaces separating individual words were missing. Here are two examples of missing blanks: "imAngesicht" (in the face of), "wennsie" (if she).

Ligatures also have to be considered, typically the "fi" and "fl" ligatures, which do not represent individual letters but a single glyph, used in typography to make text more consistent and elegant. During conversion this causes problems, because the ligature comes out as one single character instead of two separate ones. When converting to XML, this has to be handled separately.

Now I would like to give you an intuitive understanding of XML and explain the difference between layout XML and structural XML. Here is an example: a short newspaper text in layout XML looks like this, for instance. We know where the article starts and ends, indicated by the article tags at the beginning and the end of the document. But for the other elements in this text, we only have information about the lines, i.e. the graphical representation of characters and words. Typically this information includes the font size: in this case, font size 14 for the header at the top, then font size 12, font size 8 for the author names, and font size 10 for the body text. The exact sizes are not important. What matters is that we have no logical information about the text, only information about its appearance. What we actually want is structured, structural XML.
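To make the crawl-and-clean loop described above concrete, here is a minimal Python sketch. It assumes the third-party requests and BeautifulSoup libraries; the function names, the whitespace handling and the page-limit stop criterion are illustrative choices, not the specific pipeline used in the course.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import re
import unicodedata

import requests                    # assumed available: pip install requests
from bs4 import BeautifulSoup      # assumed available: pip install beautifulsoup4


def clean_text(raw: str) -> str:
    """Resolve fi/fl ligatures into separate characters and collapse redundant whitespace."""
    text = unicodedata.normalize("NFKC", raw)   # NFKC maps e.g. U+FB01 'fi' to the two letters 'f' + 'i'
    return re.sub(r"\s+", " ", text).strip()


def crawl(start_url: str, max_pages: int = 50) -> dict:
    """Breadth-first crawl: load a page, store its cleaned text, queue the links found on it."""
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:          # stop criterion: a simple page limit
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["nav", "header", "footer", "script", "style"]):
            tag.decompose()                          # drop navigation and other non-text elements
        pages[url] = clean_text(soup.get_text(" "))  # "rate, filter and store" reduced to storing
        for link in soup.find_all("a", href=True):   # collect all further addresses on the page
            new_url = urljoin(url, link["href"])
            if urlparse(new_url).scheme in ("http", "https") and new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return pages
```

In practice the rating and filtering step would typically be more elaborate (language detection, boilerplate removal, deduplication), but the overall loop is the same as described above.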
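The slides themselves are not reproduced in this transcript, so the following small reconstruction is only meant to illustrate the contrast between the two kinds of XML; the tag and attribute names are invented for this example and do not reflect any particular schema.

```xml
<!-- Layout XML: only graphical information, no logical structure -->
<article>
  <line font-size="14">Headline of the article</line>
  <line font-size="12">Further headline line</line>
  <line font-size="8">Author names</line>
  <line font-size="10">Body text of the article ...</line>
</article>

<!-- Structural XML: the logical units of the text are marked as such -->
<article>
  <TITLE>Headline of the article</TITLE>
  <AUTHOR>Author names</AUTHOR>
  <LEAD>Lead paragraph ...</LEAD>
  <TEXT>Body text of the article ...</TEXT>
</article>
```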
In structural XML, the same text then looks like this: we know that the header is the text tagged as <TITLE> and that the author is given in a tag of its own. The LEAD text follows, and after that the actual content of the article. In other words, the logical units of the text are marked as such.

But let's move on now: how do we proceed if the texts for our corpus are on paper, i.e. in printed form? Here we have to distinguish between bound books and loose sheets. We prefer loose sheets, because they can be scanned using an automatic paper feed; with bound books, you may have to turn the pages by hand while scanning. The next step is scanning, i.e. photographing the individual pages. Afterwards the document is converted into raw text using OCR (Optical Character Recognition), as we saw at the beginning of the module. A good OCR program may directly produce XML as output, namely layout XML. This layout XML then needs to be converted into structural XML, similar to the conversion of PDF documents. Our goal is again the structuring and annotation of the texts at hand.

So, when bound books need to be scanned, we must ask ourselves whether they can be cut open. If so, the pages can be scanned with the automatic paper feed of a conventional scanner. Problems may occur with folded pages, fold-out maps or sticky pages, but in this way books can be scanned quickly and efficiently.

If we want to scan books without cutting them open, because they are too valuable or need to be preserved for some other reason, and this is to be done on a large scale, then it is a case for special scanners. A scanning service or a library may have such scanners, which work with manual page turning or with a page-turning robot. Here are two pictures to visualise this. On the right, you see a scanner with manual page turning, a high-performance scanner used at the Central Library in Zurich, in which a book does not need to be opened to 180° but only to 120° or 90°, a much gentler technique for valuable books. To keep the pages flat, however, a glass funnel is briefly pressed onto them; it is operated with the foot. The pages are photographed with a camera, and the operator can then turn to the next page and continue scanning in this way.

Even more elegant is the page-turning robot, which turns the pages automatically. There are different approaches: in the picture we see a page-turning robot that uses air pressure and a blower to turn the pages. Others apply suction to the pages or use mechanical arms, allowing a higher scan throughput and a faster scanning process. We have learned, however, that these page-turning robots also need supervision; it is not the case that a book can be scanned entirely automatically.

A couple of tips for those of you who want to scan: the resolution for text recognition should be at least 300 dpi (dots per inch), and you should choose between black-and-white, greyscale and colour scanning depending on the desired colour fidelity and the amount of data. Black-and-white scanning produces relatively little data. Better OCR results are usually achieved with the intermediate setting, greyscale scanning. And if the page images are to be reused for other purposes, you should scan in colour, which produces relatively large amounts of data.
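As an aside on the OCR step mentioned above: the course does not name a specific OCR program, but as an illustrative sketch, the open-source Tesseract engine (via the Python wrapper pytesseract) can produce both raw text and hOCR, an XML/HTML-based format that still carries layout information (lines, words, bounding boxes) and therefore corresponds to layout XML rather than structural XML. The file names and the German language model are assumptions made for this example.

```python
from PIL import Image     # assumed available: pip install pillow
import pytesseract        # assumed available: pip install pytesseract, plus the Tesseract binary

# A scanned page, ideally at 300 dpi or more as recommended above (file name is hypothetical).
page = Image.open("scan_page_001.png")

# Plain recognised text (assumes the German language model 'deu' is installed).
raw_text = pytesseract.image_to_string(page, lang="deu")

# hOCR output: an XML-like format with layout information (bounding boxes per line and word).
# Converting this "layout XML" into structural XML is a separate, later processing step.
hocr_bytes = pytesseract.image_to_pdf_or_hocr(page, lang="deu", extension="hocr")
with open("scan_page_001.hocr", "wb") as f:
    f.write(hocr_bytes)
```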
In addition, historical texts often cannot be converted into raw text using automatic text recognition. Such texts need to be copied by hand; this is called manual transcription. These historical texts may be manuscripts or old prints. The target format is again the same: we want a structured and annotated XML document as the result. Here you see a picture where you can easily imagine that the text cannot be converted automatically into machine-readable form: a 16th-century manuscript from the Central Library in Zurich. It is obvious that such texts have to be copied, i.e. transcribed, by someone with expert knowledge.

Before you start building a corpus yourself, you should familiarise yourself with the major digitisation initiatives that have been launched over the past years. A very good source of texts for your own corpus is Project Gutenberg, run by volunteers who have digitised 46,000 books and made them available on the Internet. These books are all free of copyright; you can copy them and recombine them for your own purposes. A European initiative is Europeana, a network of digital cultural assets that does not act as an actual repository but rather as a network of links and information about other sources. By now, Europeana contains 24 million full-text pages. Here too you can find texts, images, sound recordings and videos, texts from books, newspapers, diaries and archive documents, and use the documents for your own purposes.

Google's digitisation initiatives are even better known; Google Books is the most famous one. More than 30 million books have been digitised by Google. It is estimated that there are only 130 million different books worldwide, which means that a large proportion has already been scanned by Google Books and made accessible with the help of OCR. You can consult these books via online search, but they are not always fully available, and the texts cannot be copied, so they are not really suitable for building your own corpus. It is similar with HathiTrust, an initiative of North American universities, where more than 10 million books have already been digitised.

On the current slide, you see a whole range of other text collections, from scientific articles to newspaper archives to linguistically annotated corpora. I recommend having a look at these collections to see whether they might be useful for your corpus project.

Let's summarise Module 1, part B: corpus acquisition starts with different types of source documents. On the one hand, there are electronic texts that we collect on the web; on the other hand, there are texts on paper that still need to be digitised. The goal is always the same: we want structured texts in XML as the result of the process. These XML documents can then be further processed and analysed. Thank you for your attention, and I recommend the next module, where we deal with the automatic segmentation of texts and introduce you to the topic of XML. Thank you very much!