Introduction to Big Data and Automated Text Analysis for Social Scientists

DATE
Friday June 7, 2019 - Saturday June 8, 2019
LOCATION
ANSO Building Room 2107


This two-day workshop offers a practical introduction to fundamentals and recent developments in the collection and analysis of big data with an emphasis on automated text analysis. The workshop is designed with social scientists in mind, but participants from other fields are also welcome. I assume that participants have little to no prior experience with methods of automated text analysis. 

This workshop makes extensive use of the Python programming language. Some knowledge of Python is an asset, but it is not necessary. I will provide all participants with fully executable code for all topics covered in the workshop. Participants will be encouraged to modify the code to suit their specific interests; doing so is optional and requires only minimal programming knowledge.

 

Day 1: An Introduction to Automated Text Analysis – June 7

The first day will begin with a general introduction to the promises and pitfalls of big data and automated text analysis in the social sciences, followed by an overview of the applications of supervised and unsupervised machine learning. This first part will conclude with a comparison of two approaches to collecting text data from the web: Application Programming Interfaces (APIs) and web scraping.
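To give a feel for the difference between the two collection approaches, here is a minimal sketch that uses only the Python standard library. It is not workshop material: the JSON payload, the HTML page, and the ParagraphExtractor class are hypothetical and hard-coded so the example runs without a network connection. An API typically returns text that is already structured, while a scraper has to extract the same text from page markup.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the text arrives already structured as JSON.
api_response = '{"posts": [{"id": 1, "text": "Climate policy debate continues."}]}'
api_texts = [post["text"] for post in json.loads(api_response)["posts"]]

# Hypothetical web page: the same text is embedded in HTML markup.
html_page = (
    "<html><body><h1>News</h1>"
    '<p class="post">Climate policy debate continues.</p>'
    "</body></html>"
)


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Keep only text that appears while a <p> element is open.
        if self.in_paragraph:
            self.paragraphs[-1] += data


parser = ParagraphExtractor()
parser.feed(html_page)
scraped_texts = parser.paragraphs

print(api_texts)
print(scraped_texts)
```

Both routes end with the same list of texts, but the scraping route depends on the page layout staying the same, which is one of the usual trade-offs between the two.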

The second part of the first day will focus on the essential first steps of an automated text analysis. Topics covered will include (1) natural language processing tasks such as tokenizing text, normalizing text, part-of-speech tagging, and named entity recognition; and (2) methods for constructing document-term matrices, which are required for the use of machine learning methods. 
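As a rough illustration of these first steps, the sketch below tokenizes and normalizes two short documents and builds a small document-term matrix with the Python standard library. The toy corpus and the function names tokenize and document_term_matrix are assumptions made for this example, not the code distributed in the workshop, and the tokenizer is deliberately simplistic (lowercasing plus a regular expression).

```python
import re
from collections import Counter


def tokenize(text):
    """Normalize to lowercase and split into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


# Hypothetical toy corpus, for illustration only.
docs = [
    "Networks shape beliefs.",
    "Beliefs spread through networks and institutions.",
]
vocabulary, dtm = document_term_matrix(docs)

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of the matrix describes one document and each column one term in the vocabulary, which is the input format that many machine learning methods expect.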

 

Day 2: Analyzing Unstructured Text Data – June 8

The second day picks up where the first day left off. We will begin with applications of unsupervised learning to discover latent themes and topics in text. We will focus on three different approaches: (1) the vector space model, text similarity, and cluster analysis; (2) topic modelling; and (3) semantic network analysis. In the afternoon, we will focus on the use of supervised learning to scale up traditional content analysis.
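The idea behind the vector space model and text similarity can be sketched in a few lines: each document becomes a vector of term counts, and the cosine of the angle between two vectors measures how similar the documents are. The example below is a minimal standard-library sketch with a hypothetical toy corpus and a hypothetical helper named cosine_similarity; it uses raw term counts without any weighting and is not the workshop's own code.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Represent a document as a sparse vector of lowercase term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, for illustration only.
doc_a = term_counts("Social movements use social media.")
doc_b = term_counts("Movements organize on social media.")
doc_c = term_counts("Glaciers are melting quickly.")

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Documents that share vocabulary score closer to 1 and documents with no terms in common score 0. Cluster analysis can then group documents using pairwise similarities of this kind.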

Time permitting, we may also cover (1) methods for sentiment analysis and classifying text by political ideology, and (2) approaches to integrating unsupervised and supervised machine learning. 

The Instructor 

John McLevey is an Associate Professor in the Department of Knowledge Integration and the Department of Sociology & Legal Studies at the University of Waterloo. He primarily works in the areas of computational social science and social network analysis, with substantive interests in science and evidence-based policymaking, environmental politics and governance, social movements, and cognitive social science. 

As a computational social scientist, Dr. McLevey’s most general research goal is to advance our knowledge of how social networks and institutions affect cognition and behaviour (including the formation and diffusion of knowledge, beliefs, biases, and behaviours) and the social and political consequences of those complex transmission processes. His work is funded by research grants from SSHRC and an Early Researcher Award from the Ontario Ministry of Research and Innovation. Among other things, Dr. McLevey is currently writing a methods book on computational social science for Sage. You can learn more about his work at johnmclevey.com and networkslab.org.

Lab Assistant: Yufan Zhuang

Yufan Zhuang is a Graduate Research Assistant in the Department of Computer Science and a Master of Science candidate in the Data Science Institute at Columbia University. He is also a graduate research intern at IBM Research in the summer of 2019. He works primarily in natural language processing and probabilistic programming, and sometimes wanders into other areas, including computational sociology, psychology, and computer security.

Yufan’s current primary research goal is to develop a robust, generalizable deep learning framework for sequential classification and generation. He has also done work on probabilistic topic modelling with applications in sociology and on exploring bias in machine learning systems.