Introduction to Big Data and Automated Text Analysis for Social Scientists

Friday June 7, 2019
10:00 AM - 05:00 PM
ANSO Building Room 2107

Registration for the workshop is now FULL.

To be placed on the waiting list in case spaces open up, please use the registration form.

This two-day workshop offers a practical introduction to fundamentals and recent developments in the collection and analysis of big data with an emphasis on automated text analysis. The workshop is designed with social scientists in mind, but participants from other fields are also welcome. I assume that participants have little to no prior experience with methods of automated text analysis. 

This workshop makes extensive use of the Python programming language. Although some knowledge of Python is an asset, it is not necessary. I will provide all participants with fully executable code for all topics covered in the workshop. Participants will be encouraged to modify the code to suit their specific interests; doing so takes only minimal programming knowledge and is optional.

The Python introduction class will first cover how to set up and manage a Python environment, then go through Python's basic data structures, control flow, and functions, and finally the specifics of pandas and NumPy, two important packages commonly used for data science and machine learning.

Day 1: An Introduction to Automated Text Analysis – June 7

The first day will begin with a general introduction to the promises and pitfalls of big data and automated text analysis in the social sciences, followed by an overview of the applications of supervised and unsupervised machine learning. This first part will conclude with a comparison of two approaches to collecting text data from the web: Application Programming Interfaces (APIs) and web scraping.

The second part of the first day will focus on the essential first steps of an automated text analysis. Topics covered will include (1) natural language processing tasks such as tokenizing text, normalizing text, part-of-speech tagging, and named entity recognition; and (2) methods for constructing document-term matrices, which are required for the use of machine learning methods. 
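
As a rough illustration of what tokenizing, normalizing, and building a document-term matrix involve, here is a minimal sketch using only the Python standard library. It is not the workshop's own code: the function names `tokenize` and `document_term_matrix` and the two example sentences are made up for illustration, and the normalization shown (lowercasing and keeping only alphabetic tokens) is just one simple choice among many.

```python
import re
from collections import Counter


def tokenize(text):
    """Normalize text to lowercase and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the count of
    vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token across all documents, sorted.
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


# Hypothetical example documents (not workshop data).
docs = ["The senate passed the bill.", "The bill failed in the house."]
vocab, dtm = document_term_matrix(docs)

print(vocab)
for row in dtm:
    print(row)
```

Each row of the matrix describes one document and each column one term, which is the numeric form that machine learning methods typically expect as input.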


Day 2: Analyzing Unstructured Text Data – June 8

The second day picks up where the first day left off. We will begin with applications of unsupervised learning to discover latent themes and topics in text. We will focus on three different approaches: (1) the vector space model, text similarity, and cluster analysis; (2) topic modelling; and (3) semantic network analysis. In the afternoon, we will focus on the use of supervised learning to scale up traditional content analysis.
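
To give a sense of the vector space model and text similarity, the sketch below computes the cosine similarity between two documents represented as term-count vectors. It is an illustrative example, not the workshop's material: the function name `cosine_similarity` and the two count vectors are assumptions chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A document with no counted terms is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term counts for two documents over a shared 7-term vocabulary.
doc_a = [1, 0, 0, 0, 1, 1, 2]
doc_b = [1, 1, 1, 1, 0, 0, 2]

similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

For count vectors the score ranges from 0 (no shared terms) to 1 (identical term proportions). Pairwise scores like this one are a common input to cluster analysis.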

Time permitting, we may also cover (1) methods for sentiment analysis and classifying text by political ideology, and (2) approaches to integrating unsupervised and supervised machine learning. 

The Instructor 

John McLevey is an Associate Professor in the Department of Knowledge Integration and the Department of Sociology & Legal Studies at the University of Waterloo. He primarily works in the areas of computational social science and social network analysis, with substantive interests in science and evidence-based policymaking, environmental politics and governance, social movements, and cognitive social science. 

As a computational social scientist, Dr. McLevey’s most general research goal is to advance our knowledge of how social networks and institutions affect cognition and behaviour — including the formation and diffusion of knowledge, beliefs, biases, and behaviours — and the social and political consequences of those complex transmission processes. His work is funded by research grants from SSHRC and an Early Researcher Award from the Ontario Ministry of Research and Innovation. Among other things, Dr. McLevey is currently writing a methods book on computational social science for Sage. You can learn more about his work at and

Lab Assistant: Yufan Zhuang

Yufan Zhuang is a Graduate Research Assistant in the Department of Computer Science and a Master of Science candidate in the Data Science Institute at Columbia University. He is also a graduate research intern at IBM Research in the summer of 2019. He primarily works in natural language processing and probabilistic programming, and sometimes wanders into other areas, including computational sociology, psychology, and computer security.

Yufan’s current primary research goal is to develop a robust, generalizable deep learning framework for sequential classification and generation. He has also done work in probabilistic topic modelling with applications in sociology, and in exploring bias in machine learning systems.