
6 Interaction and dialogue with large text data: corpus analysis with "polmineR"

Andreas Blätte

1 Political science on the way to large text data

Digitization makes large amounts of digital text accessible, in different data formats and of varying data quality. This challenges the humanities and social sciences in two ways. First, they face the task of building sustainable and systematic collections of digital text data. On the one hand, this means catching up on existing holdings, which may be available digitally but not in a data format suitable for machine processing. On the other hand, the social sciences must develop the capacity to continuously turn relevant excerpts of current digital events into material for their analyses, insofar as they want to remain in a position to contribute analytically to the self-observation of a society shaped by digitization. Science should distrust the formula "the net never forgets": data on servers is often deleted; only the unpleasant is persistent. The primary objective of private data corporations, from Twitter, Facebook and others, is not to keep public archives available. The optimism that they will guarantee access to data for scientific purposes stands on feet of clay. The social sciences themselves must worry about, and take care of, the sustainable availability of digital data.

Archives and volatile digital material are to be processed into corpora. Corpora are large collections of digital texts compiled according to systematic criteria. They enable analysis methods that combine quantifying and qualitative analysis steps. By adopting methods from other disciplines with a long history of working with corpora, including corpus and computational linguistics as well as computer science, the methodological radius of social science content analysis is expanded. Exploratory methods for recognizing speech patterns are gaining in importance.
Hypothesis-testing analysis steps remain relevant. But to realize new analytical possibilities, one must first get a technical grip on the digital material. If data sets exhaust main memory even on well-equipped computers under conventional handling, or simply cannot be processed quickly and efficiently, interaction and dialogue with the data remain an unfulfilled promise. Large amounts of material remain worthless if they cannot be used productively in research. Technologies are needed that allow more than a small circle of people to work with new digital holdings. There are plenty of open-source tools available for the daring with good to very good programming skills. In this respect, access to advanced methods is open to everyone, and yet in fact limited. Demanding skilled programming from everyone is a barrier to scientific progress and easily lets technical ability take precedence over substantive questions. Things have to become more user-friendly.

For non-hackers, there is now an established tradition of developing software for text and corpus analysis (cf. Kreuz / Römer 2013). Two generations of applications can be distinguished (Hardie 2012). A first generation is made for installation on end devices, i.e. on a single workstation computer (laptop, etc.). Mature software products of this kind offer the user convenience, but they are closed applications that can only be extended by the respective software provider or developer. Above all, these products (e.g. Lexico3, WordSmith) are not designed to process larger corpora efficiently. A second generation of software relieves the users' computers by relying on a server installation. Efficient data management systems can be installed on a server, equipped with graphical user interfaces, and made available to users via their browser. These include CQPweb, the Leipzig Corpus Miner (LCM, cf.
Wiedemann / Niekler 2016) or a number of analysis tools available through the CLARIN (Common Language Resources and Technology Infrastructure) research network. Such a centralized infrastructure offers excellent possibilities for a large number of users: they can enjoy the comfort of graphical user interfaces and need not worry about software installation or hardware requirements. But this scenario also has an obvious price. The graphical user interface limits what is methodologically possible, and server administrators have to restrict infrastructure utilization by individual users in the interest of other users.

Against the background of these considerations, the "polmineR" R package is presented here: a portable software architecture and analysis environment for large annotated text data that can be installed both on a server and locally.[1] The functionality of the package keeps the programming requirements for common analysis steps low, but is open to statistically complex operations. The package offers efficiency and flexibility. It keeps the way "back to the text" open at all times and thereby supports text-based research that does not lose sight of the claim to validity. If the formula "code is theory" (Schaal / Kath 2014) applies, the code of the package reflects the social science intuition that, when working with text data, quantitative access without the support of qualitative validation potentially violates standards of validity. This is exactly where polmineR differs from popular R packages such as tm, quanteda or text2vec. However, the point here is not the political science and philosophical impregnation of the code; rather, fundamental technical design decisions are in the foreground. In this respect, this chapter is less an introduction to the polmineR package; that is provided by the package's "vignette".
Rather, it is about formulating guidelines and passing on experience for the development of software with which the social sciences prepare for the digital age. Software developments can lead to the canonization of certain methods; precisely for this reason, their basic decisions should be transparent and open to criticism.

2 polmineR: Objectives and basic decisions

Anyone who starts scientific programming experiences that many problems that previously appeared disproportionately complex can now be solved efficiently by programming. At the same time, scientific access opens up new problems. Anyone who begins to program, in particular, does not have to reinvent the wheel: powerful programming languages are freely available; Perl, Python, Java and now also R are relevant for working with text data. For each of these programming languages there are libraries or extension packages that extend the functionality for specialized tasks. The development of libraries drives progress, because not everyone has to do everything from scratch. Such a process of developing specialized libraries also ensures progress in the area of corpus analysis and text mining. An established toolset is available for Python with the NLTK package (Bird et al. 2009; Perkins 2010); in R, the tm package serves as a Swiss Army knife. So why an additional package for corpus analysis? The objectives of the polmineR package can be outlined with the terms interactivity, performance, economy, flexibility, openness, portability, user-friendliness and documentation. With the help of these key words, the basic decisions are now briefly explained, making clear why the development of a new package was started.

[1] The R package is available through the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=polmineR. The latest development version is freely accessible via GitHub (https://github.com/PolMine/polmineR).
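Since the package is distributed via both CRAN and GitHub, installation takes two commands at most. A minimal sketch; the use of the `remotes` helper package for the GitHub route is an assumption of this example, not prescribed by the text:

```r
# Install the released version from CRAN
install.packages("polmineR")

# Alternatively, install the latest development version from GitHub
# (the 'remotes' package used here is an assumption of this sketch)
remotes::install_github("PolMine/polmineR")

# Load the package for an interactive session
library(polmineR)
```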
To anticipate in outline: the polmineR package is implemented in R and builds on a combination of R with a system specialized in the data management of large annotated corpora, the Corpus Workbench (CWB).

Interactivity

Corpus analysis combines data-driven, quantifying analysis steps with qualitative inspection and interpretation of (excerpts of) the original text. Looking into the text is necessary if analysis results are to have any claim to validity. Ideally, corpus analysis rests on a successful "dialogue with the data". For the choice of a programming language, this means that it should allow commands to be called interactively from a command line within a work session. This is standard with R and Python (there: the interactive shell). Decisive for the choice of R was the widespread use of this "statistical language" among social scientists, an excellent spectrum of packages for working with text data, and its recognized excellent visualization options. The possibilities of working productively with text data in R are beyond question (Gries 2013, 2016). Finally, RStudio is a freely available development environment that embeds a work session with R in a user-friendly environment.

Performance

In a scientific context, corpora range in size from several million to several billion words. Working with corpora should remain efficient even as the amount of text grows.[2] The performance benchmark for the software design proposed here are the corpora prepared in the PolMine project (www.polmine.de), which have a volume of 300 million words and more (cf. 2013a, b). Corpus analysis procedures will establish themselves when scientific users have high-performance software for data of this size (or beyond) that allows interaction and dialogue with the data.
If extensive computation and text retrieval led to permanent breaks in this conversation, a foreseeable practical consequence for research would be to pour a hypothesis into program code, start the calculation, look at the results a day later, and then read some meaning into them, because there is no time to pursue the questions raised by the patterns that quantification has revealed. The performance of the systems used is a prerequisite for good research with text data.

Economy

In R, very well developed packages for natural language processing (NLP) and for the analysis of texts are available.[3] For basic functions, think of the tm or quanteda package, or of the lda and topicmodels packages for the now popular topic modeling procedures. Beyond text analysis, all conceivable statistical methods are available in R (Baayen 2008). Procedures implemented in other programming languages are by no means out of reach: R offers good options for accessing Perl, Python, Java and C. A package oriented toward the principle of economy therefore tries to do as much as necessary, but as little as possible, itself. The polmineR package offers high-performance basic functionality, but otherwise sees itself as a hinge for extracting information from corpora, which is then passed on to other specialized program libraries. An example: with the polmineR package, two commands (partitionBundle, then as.TermDocumentMatrix) suffice to create a term-document matrix of the TermDocumentMatrix class defined in the tm package.

[2] What a large corpus is, is a relative question. The largest German-language corpus, with a volume of 6.1 billion words, is the German Reference Corpus (DeReKo) (see http://www1.ids-mannheim.de/kl/projekte/korpora/).
[3] See the "CRAN Task View" on Natural Language Processing (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html).
This object can then serve as the basis for calculating a topic model with the topicmodels package.

Flexibility

Users who primarily want standard procedures should be able to access them with as little effort as possible. The polmineR package offers, among other things, the uncomplicated generation of subcorpora according to variable criteria, term frequency counts, the display of concordances (in content analysis often called keyword-in-context analysis) and co-occurrence analyses. At the same time, the package should be open to extension. This is achieved through a consistently object-oriented implementation of the package[4], i.e. through the definition of appropriately documented classes and methods. For example, the method for full-text display can be adapted to special document types (e.g. newspaper articles); the package defines an extensible core. The polmineR base package provides the read method for displaying the full text of any partition. Specialized methods can be introduced on top of this: a read method for partitions of a class specifically defined for plenary protocols (plprPartition) extends it so that interjections are visually set off in the display by corresponding typesetting (indentation and italics).

Openness

Scientific users do not just want any result; in case of doubt, they want to be able to understand in detail how it comes about. Quality assurance and possible functional extensions benefit from the openness of the source code. The polmineR package is therefore available in the way now common for R packages, both via CRAN and via the social coding portal GitHub. It is thus also open to further development by third parties.

[4] Technically, this is implemented with the S4 system of object-oriented programming in R.
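The basic workflow outlined in the preceding sections, from subcorpus generation through counts, concordances and co-occurrences to a term-document matrix handed over to the topicmodels package, can be sketched as follows. The corpus ID "MYCORPUS", the metadata attributes "year" and "date" and the query term are hypothetical placeholders, and argument names follow the camelCase conventions of package versions from the period of this chapter, so details may differ in current releases. An actual CWB-indexed corpus must be installed for this to run:

```r
library(polmineR)

# Generate a subcorpus (partition) by a metadata criterion
p <- partition("MYCORPUS", year = "2010")

count(p, "integration")          # term frequency count
kwic(p, "integration")           # concordances / keyword-in-context
cooccurrences(p, "integration")  # co-occurrence analysis
read(p)                          # back to the text: full-text display

# Two commands lead to a term-document matrix of the tm class ...
pb  <- partitionBundle("MYCORPUS", sAttribute = "date")
tdm <- as.TermDocumentMatrix(pb, col = "count")

# ... which can feed a topic model (transposed to a document-term matrix)
lda <- topicmodels::LDA(t(tdm), k = 10)
```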
The principle of open source also applies to all components used by the package; they are available under general public licenses. Scientific exchange and, not least, use in teaching benefit from open-source software.

Portability

Corpus analysis is often computationally intensive. This speaks for performing memory- and compute-intensive operations on a more powerful, stationary device (workstation, server). Still, installation on mobile devices should not be ruled out: the entire architecture should be portable, scalable, and run on different systems. The polmineR package can be installed on all common operating systems (Windows, Linux, macOS). The package vignette contains a detailed description of the installation on the various operating systems.

Ease of use

Corpus analysis is technically non-trivial in view of large data. Methodically reflective research with corpora requires understanding how the corpora under study came about, along with their properties and peculiarities. At the same time, corpus analysis should involve as little "voodoo" as possible; entry barriers should be low and a fluid workflow should be feasible. A command-line tool sets limits to user comfort, but it is usable if the repertoire of commands is clear and sufficiently intuitive. This, too, is achieved through an object-oriented implementation: the so-called generic functions of R (e.g. length, summary, names) can be applied to the objects of the classes defined in polmineR. Above all, the package exploits possibilities that arise when using the RStudio development environment: tables can be output in the viewer pane, where convenient sorting and filtering functions are available.
Concordances and full texts are displayed in an integrated browser window.[5]

Documentation

Poorly documented code prevents reuse and leads to new code being written again and again for specific research problems, although code developed earlier could have been reused. In line with the ideal of cumulative scientific progress, code should be documented so that it can be shared and reused. Results that can no longer be traced because the code remains closed to all but the original author contradict the ideal of reproducible research. In the polmineR package, all methods and classes are documented in the usual way for R packages. The documentation also contains references to the scientific works that were used when implementing a statistical procedure.

* * *

The requirements listed above shape the design decisions of the R package polmineR. Fundamental conceptual considerations regarding the functionality of the package (the design of a system of classes and methods) are independent of the specific programming language chosen. They are based on assessments of, and experience with, what the scientific workflow requires. The implementation in R is one possibility among others: porting to another programming language (e.g. Python) would be feasible. For R, however, an interface for access to the Corpus Workbench (CWB), a corpus analysis system established for many years, has existed since 2013 with the rcqp package (Desgraupes / Loiseau 2016). The use of the CWB as an efficient system for data storage is an essential guarantee of the speed of the analysis environment.

[5] Since version 0.7.0, a graphical user interface can be launched in polmineR with the polmineR() function call. It restricts the range of functions to the basic functionality and has so far been experimental. The implementation is based on a "shiny" application.
3 Efficiency of data storage: the CWB as backend

A key to efficient storage and analysis of text data is so-called indexing (cf. Heyer 2008): indexing means that the words in the source text are converted into numbers. The numerical values act as keys that allow words to be stored and identified uniquely, but with less memory and faster access. Indexing is common practice in large text mining systems, especially in larger search engines; Lucene, for example, with its Solr extension, is widespread. A development geared specifically to the requirements of corpus and computational linguistics is the CWB (http://cwb.sourceforge.net), which polmineR uses as a backend.

Technically, the conception of the polmineR package follows the idea of a three-layer architecture (Bruegge / Dutoit 2009): the CWB handles data management (data layer); the statistics package R is used for text-statistical analyses and method development (application layer); graphical user interfaces can in turn be built on top (presentation layer). Figure 1 shows how these components relate to one another.

Figure 1: Architecture of the polmineR package

The CWB requires the import of a corpus; indexing is carried out during the import (Evert et al. 2016b). The effort of a CWB import is only worthwhile if the text data is actually extensive and is provided with metadata and linguistic annotation. The latter includes at least an enrichment of the original text with a part-of-speech annotation (assignment of word classes, e.g. noun / adjective / verb) and, in the course of lemmatization, a reduction of word forms to the uninflected base form ("Herren" becomes "Herr", "gone" becomes "go", etc.). For smaller, linguistically unannotated corpora and purely experimental analyses, working with the tm package, for example, may be preferable.
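The indexing principle introduced at the beginning of this section, replacing word forms with integer keys, can be illustrated in a few lines of base R. This is a toy illustration of the idea only, not of how the CWB organizes its data structures internally:

```r
tokens <- c("der", "Hund", "sieht", "den", "Hund")

# The lexicon assigns each distinct word form a unique integer id
lexicon <- unique(tokens)       # "der" "Hund" "sieht" "den"
ids <- match(tokens, lexicon)   # 1 2 3 4 2

# The integer stream is compact and fast to compare; the original
# token stream can be restored from lexicon and ids at any time
lexicon[ids]
```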
Insofar as texts come with metadata, well-formed XML (Extensible Markup Language) is required as the import format; in addition to the linguistic annotation, a tokenization of the data is required, i.e. a segmentation of the running text into individual words. To support the CWB import, an R package for data preparation, the corpus toolkit (ctk), was developed in the PolMine project. To give an impression of the data format, Table 1 shows the beginning of a plenary protocol in a CWB import format.

Table 1: XML data format for the CWB import
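A schematic impression of such an import format: structural XML elements carry the metadata, while each line of the element body holds one token with tab-separated part-of-speech and lemma annotations. All element names, attribute names and values below are invented for illustration and are not taken from Table 1:

```xml
<text date="2010-03-05" speaker="Max Mustermann" party="XY">
Die	ART	die
Sitzung	NN	Sitzung
ist	VAFIN	sein
eröffnet	VVPP	eröffnen
.	$.	.
</text>
```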