The Use of ‘Text as Data’ in Social Science Research

The conventional way to get information from text is (of course) to read it. However, researchers in the social sciences are also using quantitative approaches that allow certain kinds of analysis to be applied systematically to much larger bodies of text. Félix Krawatzek and Andrew Eggers have made a series of podcasts to discuss the possibilities and limitations of several of these techniques. In each episode, Félix and Andy review and critique a set of selected readings that highlight key methods and some of their applications.
Episode 1: Use of text analysis in Social Science
The first podcast introduces some of the basic assumptions behind the use of text in the social sciences. It introduces a distinction between using text as a proxy for otherwise hard-to-measure concepts and studying text for the sake of understanding discourse. The podcast discusses the difference between discourse and content analysis.

Félix describes an example of discourse analysis – his research on the rules applied in Russian society to the use of the term ‘youth’ in public discussion of change in the pre- and post-‘perestroika’ periods. Andy provides an account of the differences between supervised and unsupervised machine learning in analysing large collections of text, and of topic modelling. He and Félix suggest that these newer classification methods are complementary to other research methods and can support more conventional approaches to text analysis.
Episode 2: The structure of text
The second podcast covers discourse analysis and corpus analytics, which are methods for gaining a better understanding of the structure of text. Félix and Andy discuss key concepts such as collocations, concordances and keywords, and techniques such as correspondence factor analysis.

The application of these techniques allows them, for example, to talk about the different uses of the term “Europe” in presidential speeches in the French Fifth Republic and also to revisit undervalued early quantitative text analysis undertaken in France. Lastly, they analyse the use of corpus analysis and the combination of critical discourse analysis and corpus linguistics techniques in research on the ways the British media has represented migrants and refugees in news stories.
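
To make two of the concepts mentioned above more concrete, here is a minimal Python sketch of a concordance (each occurrence of a keyword shown in its context) and a simple collocate count (the words that appear near the keyword). It is an illustration only, not the tooling used in the readings: the function names, the window sizes and the sample sentence are all assumptions made for this example, and real corpus software adds statistical association measures on top of raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical sample text, invented for illustration.
sample = "Europe needs unity. A strong Europe needs France. France needs Europe."
sample_tokens = tokenize(sample)

for left, kw, right in concordance(sample_tokens, "europe", window=2):
    print(f"{left:>15} [{kw}] {right}")

print(collocates(sample_tokens, "europe", window=1).most_common(2))
```

Reading down the bracketed column gives a quick impression of how a term is used, which is the kind of evidence a corpus-based study of a word such as “Europe” starts from.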
Episode 3: Scaling and tone
In the third episode, Félix and Andy explore scaling methods that seek to measure sentiment, ideology, and other latent concepts in text. They start off by exploring the Lexicoder Sentiment Dictionary (“LSD”), which can be used to assess and compare the tone of different texts. Andy and Félix discuss an application of this method to study how the US media report on economic news.
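
The basic idea of dictionary-based tone measurement can be sketched in a few lines of Python. The word lists below are tiny, invented placeholders and are not entries taken from the LSD itself. The scoring rule (positive matches minus negative matches, divided by the number of tokens) is one common convention, assumed here for illustration.

```python
import re

# Hypothetical toy dictionary; a real sentiment dictionary has thousands of entries.
POSITIVE = {"growth", "gain", "recovery", "strong"}
NEGATIVE = {"recession", "loss", "crisis", "weak"}


def tone(text):
    """Net tone: (positive matches - negative matches) / number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


# Invented headlines, used only to show the comparison between texts.
print(tone("Strong growth follows the recovery"))
print(tone("The crisis deepened the recession"))
```

Because the score is normalised by text length, it can be compared across texts of different sizes. A simple count like this ignores negation and context, which is one of the limitations such methods have to address.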

The second part of this episode is devoted to scaling methods that aim to place actors on a one-dimensional scale, such as ideology (left-right) or attitudes towards environmental regulation (pro-anti). Together they explore potential problems in the use of these methods and ways to circumvent them.
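
As a rough illustration of how text can be placed on a one-dimensional scale, the sketch below follows the logic of the Wordscores approach: words are scored from reference texts whose positions are assumed to be known, and a new text is placed at the average score of its words. This is offered as one well-known example of a scaling method, not necessarily the one discussed in the episode. The reference texts, their positions (-1 for left, +1 for right) and all function names are invented for this example.

```python
from collections import Counter


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_scores(reference_texts):
    """Score each word from (tokens, known position) reference pairs.

    A word's score is the average of the reference positions, weighted by
    the probability of being in each reference text given that word.
    """
    freqs = [(relative_freqs(tokens), pos) for tokens, pos in reference_texts]
    vocabulary = set().union(*(f.keys() for f, _ in freqs))
    scores = {}
    for word in vocabulary:
        total = sum(f.get(word, 0.0) for f, _ in freqs)
        scores[word] = sum(f.get(word, 0.0) / total * pos for f, pos in freqs)
    return scores


def score_text(tokens, scores):
    """Place a new text at the mean score of its scored words."""
    scored = [token for token in tokens if token in scores]
    if not scored:
        return None  # no overlap with the reference vocabulary
    return sum(scores[token] for token in scored) / len(scored)


# Hypothetical reference texts with assumed positions.
references = [
    ("tax the rich fund welfare".split(), -1.0),
    ("cut tax cut welfare".split(), 1.0),
]
scores = word_scores(references)
print(score_text("cut tax fund welfare".split(), scores))
```

The sketch also hints at a typical problem with such methods: words shared by both reference texts, such as “tax”, receive scores near the middle, so estimated positions tend to be pulled towards the centre of the scale.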
Episode 4: Topic modelling
The fourth episode is devoted to topic models. This methodology seeks to uncover ‘what is being said’ and in some cases ‘who is saying what’. Félix and Andy elaborate on the way topic modelling has been applied and review whether the high expectations of this approach have been confirmed. They also discuss more recent developments in the field of unsupervised clustering in general.

The podcast series provides a useful background for students who will be attending the “Computerised Text Analysis” class at the Oxford Spring School. Félix and Andy hope that it may also be enlightening for others who are interested in this fast-moving area of research.

The podcasts are also available here.

