The emergence of computational methods of text processing has created new paradigms of research in literary studies in recent years (Jockers & Underwood, 2016), for instance distant reading to find patterns and regularities (Moretti, 2005). Network analysis and extraction of information about relations between characters from literary texts is an example for distant reading methods. Such information can not only be helpful for better understanding of character interactions but can also facilitate the comparison of thereof in different texts.
Existing tools of text analysis and network visualization such as Voyant 1 or Gephi 2 are either missing modules for character network analysis or require preliminary steps on data preprocessing from the user and therefore are not easy-to-use for some humanities scholars who lack programming skills. Interactive tools in addition often lack features to ensure reproducibility of results.
We present our ongoing effort on closing this gap by developing a literary analysis reporting tool rCAT 3 , whose primary purpose is to provide an easy-to-use, stable, and reusable solution for automatic extraction of relational information from text and to characterize these relationships automatically to provide the user with deeper qualitative insight. We opt for implementation as a web-based reporting tool instead of an interactive tool for two reasons: (1) automatically generated reports in PDF format can serve as a stable foundation for discussion and can be reused in publications and visualizations easily, and (2) the results are clearly connected to the chosen input parameters such that reproducibility of results is ensured.
As a use-case study, we apply rCAT to Johann Wolfgang von Goethe's epistolary novel Die Leiden des jungen Werthers. On the basis of this epistolary novel, we show that not only the network can be generated, but also the characteristic triangular relationship of the protagonists is easily identified. The goal is to automatically determine this triad in the original text and in the adaptations that have been published since the publication of Werther in 1774.
Previous research on social networks in literary fiction generally fall into one of the two categories: (1) works that explore methods for extracting and formalizing character networks ( cf., Elson et al. (2010), Agarwal et al. (2012, 2013), Park et al. (2012)), and (2) works that primarily focus on qualitative implications of network analysis ( cf., Rydberg-Cox (2011), Moretti (2011), Nalisnick & Baird (2013), Jayannavar et al. (2015)). It is common to address both tasks at the same time, as in Beveridge & Shan (2016), who introduce a number of formal measures for analyzing the centrality of the characters in Game of Thrones books, which results in both expected and surprising findings.
Building on graph theory extensively elaborated in the past fifty years (e.g., Bondy and Murty, 1976 or West, 2001), our work is similar to Beveridge & Shan (2016), in particular, in terms of the weighted degree measure, and to Park et al. (2012), in terms of distance measure for detecting closely related characters in a text.
In the following, we explain the different components in rCAT, which are available for text analysis. After that, we discuss the results based on a use-case study.
To detect character mentions in the text we use a fundamental named-entity recognition approach based on dictionaries. This approach is suitable for scholars who analyze texts they already know. Consequently, we opt for a transparent and simple character recognition procedure: The user provides a list of character names to be included in the analysis specifying a canonical name form and all variations thereof she would like to take into account ( e.g., “Lotte” is the canonical name and “Lotten”, “Lottens”, “Lottgen”, “Lottchen”, “Charlotten S.”. are its variants).
We define the closeness of relationship between two characters using a distance measure dist X (p,q), where p and q are the strings corresponding to these characters and X is the number of tokens between them (Park et al., 2012). In addition, we introduce the context measure cont Y (p,q), where p and q are the strings corresponding to these characters and Y is the number of tokens before the character p and after the character q. While the former measure allows for detecting those characters that are closely related to each other, the latter one enables a contextual analysis of their relationship.
We visualize the network of characters with an undirected graph G=(V,E), where V are the vertices, each vertex corresponding to one character, and each edge E=(V i, ,V j ) corresponding to relations between pairs of characters. We output the following measures for each character node: degree, edge weight, weighted degree and density. The degree is the number of edges occurring with a given vertex. The edge weight, w i,j ≥ 0, is defined as the number of interactions between the vertices V i and V j. The weighted degree is the sum of weights of the edges occurring with a vertex i. Density is the ratio of occurring edges between two vertices and all possible vertex pairs.
Word clouds are an approach to visualize the vocabulary of a text. The size of one word corresponds to its frequency. We use two different kinds of word clouds: For each character in the character list, we show word clouds based on the context of a window size n. For each pair of characters occurring in the network, we present a word cloud based on the words between them as well as on the words found in the context. Both types of word clouds can be filtered to the specific word fields (words from specific domains) which is helpful in gaining a focused insight into the characters relations.
We plot the timeline of multiple predefined world fields (specified by word lists) in the text. This feature is helpful in representing how certain fields ( e.g., concepts, emotions) develop throughout the narrative (Kim et al., 2017).
The tool was developed using Python v.3.6 and the Flask 4 web development framework. The tool outputs a single PDF report. The resulting document contains information from the analysis modules described in the previous section. Network graphs included in the report are generated with graphviz. Additionally, the tool can generate a CSV file that can be used as input to Gephi.
For a use-case analysis, we apply rCAT to Die Leiden des jungen Werther by Johann Wolfgang Goethe with the following parameters: X=8, Y=5, stop words removed.
In Goethe's epistolary novel, the protagonist Werther describes his unhappy love for Lotte, who is engaged to Albert. The characteristic triangular relationship in the novel arises from this constellation (protagonist ⎼ beloved woman ⎼ antagonist). With rCAT we expect to identify and characterize this relationship. Figures 1 and 2 show a sample network analysis output (tables are shown only partly).
The protagonist Werther shows a degree of 21, which is the number of characters with whom he interacts. The closest relationship measured by edge weight (Figure 2) is observed between Werther and Lotte (81 interactions). The antagonist Albert has a low degree of 3. However, his weighted degree is 36 (third highest after Werther and Lotte), which confirms his important role in the triangular relationship.
Highlighted in red is the typical triangular relationship in Goethe’s novel, which corresponds to the three highest weighted degrees. In further steps, we will use rCAT to analyze the adaptations of Goethe's novel with a focus on this triad.
To better characterize the edges, the tool outputs top- n word clouds sorted by edge weight ( n is specified by the user) for character pairs and by degree for single characters. Figure 4 and 5 show examples of the word clouds for character pairs filtered to the words from the emotion domain.
The word clouds enable first conclusions about the relationships of the characters. Werther and Lotte's word cloud characterizes their ambivalent relationship. The key words "Leidenschaft" and "Freude" reflect Werther's love, whereas the mentions of "sterben" and "Verblendung" are characteristic of the unrequited love, which leads Werther into his "disease unto death". As Werther and Albert’s word cloud reveals, their relationship is dominated by the "Unruhe" that Werther feels through his adversary.
Additionally, the tool plots the development of the narrative (not bound to specific characters) based on the word fields, an example of which is shown on Figure 6. In this case we used words from the emotion domain (with emotion dictionaries by Klinger et al. (2016)).
The word field development can highlight the prevalence of individual emotion domains across the text. The accumulation of the negative emotion words (Wut,Trauer, Furcht) towards the end suggests, for example, that Goethe’s novel has no “happy ending”. The striking rash on “Freude”, however, captures the last happy hours Werther spends with Lotte in the second part of the narration before he kills himself.