# Computing and Statistics Workshops

Workshops marked with the icon have downloadable materials available; click on the title for details.

## Upcoming Workshops

This workshop introduces character string manipulation in R. String data is often unstructured, and regular expressions provide a concise mechanism to describe text patterns that may be contained within string data. It may take a little while to get accustomed to using regular expressions, but they are extremely useful. Stringr is an R package for string manipulation. It includes all of the common string operations one might need, including pattern matching. Although stringr is not part of the tidyverse core, it is built with similar goals in mind – consistency, simplicity and producing output that can easily be used as input.
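As a rough illustration of how a regular expression describes a text pattern, here is a minimal sketch in Python rather than R, using the standard `re` module. The sample records and the deliberately simple e-mail pattern are assumptions made for this example and are not taken from the workshop materials. The two list comprehensions play roles loosely similar to stringr's `str_detect()` and `str_extract_all()`.

```python
import re

# Hypothetical free-text records, invented for this sketch.
records = [
    "Contact: jane.doe@example.edu (office 217)",
    "no email on file",
    "Backup: j.smith@example.org, room 300",
]

# A deliberately simple pattern: word characters, dots or hyphens,
# then "@", then a domain made of at least two dot-separated parts.
EMAIL = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return every substring of text that matches the EMAIL pattern."""
    return EMAIL.findall(text)


# Does each record contain a match? (loosely like str_detect)
detected = [bool(EMAIL.search(r)) for r in records]

# Which substrings match in each record? (loosely like str_extract_all)
extracted = [extract_emails(r) for r in records]

print(detected)
print(extracted)
```

A real e-mail validator needs a far more careful pattern; the point here is only that one short expression can locate structured pieces inside unstructured text.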

Lunch will be provided following the workshop. **To register**, send an email to oprws@princeton.edu from your princeton.edu email address with the subject "stringr".

**To register**, send an email to oprws@princeton.edu from your princeton.edu email address with the subject "stm".

## May 2018

This workshop introduces two modern R packages, both written by Hadley Wickham and part of R's "tidyverse," that provide intuitive tools for handling common data management tasks. The first package, tidyr, provides functions that reshape data so it conforms to a specific “tidy” structure where each variable is saved in its own column, each observation is saved in its own row, and each type of observational unit is stored in a separate table. The second package, dplyr, provides a set of functions (referred to as “verbs”) that allow you to easily subset observations, re-order observations, select specific variables, add new variables, group observations, and summarize groups of observations.
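The "tidy" structure and the verbs described above can be sketched in plain Python (not R, and not the tidyr or dplyr API). The table, the column names, and the helper `to_long` are hypothetical, chosen only to show the shape of the operations.

```python
# Hypothetical wide table: one row per country, one column per year.
wide = [
    {"country": "A", "2016": 10, "2017": 12},
    {"country": "B", "2016": 7, "2017": 9},
]


def to_long(rows, id_key, name_key, value_key):
    """Reshape wide rows so each (id, column) pair becomes its own row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append(
                    {id_key: row[id_key], name_key: key, value_key: value}
                )
    return long_rows


# Tidy form: each variable is a column, each observation is a row.
tidy = to_long(wide, "country", "year", "cases")

# Subset observations (keep only the 2017 rows).
recent = [r for r in tidy if r["year"] == "2017"]

# Re-order observations (largest case count first).
ordered = sorted(tidy, key=lambda r: r["cases"], reverse=True)

# Group observations by country and summarize each group with a sum.
totals = {}
for r in tidy:
    totals[r["country"]] = totals.get(r["country"], 0) + r["cases"]

print(tidy)
print(totals)
```

Once the data are in the long form, every later step works on the same three columns, which is the practical appeal of a tidy layout.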

## January 2018

Working with real data almost always means dealing with missing data, but having some measures missing from an observation doesn’t mean that observation should be excluded from the analysis. This workshop will cover the use of the R package Amelia to impute missing data and obtain more robust estimates.

This workshop provides an introduction to the R graphics package ggplot2. Time is spent describing the main concepts of the grammar that define the graphical building blocks, and exploring many examples that show how to use ggplot2's layered approach to create basic and more complex graphs. The workshop covers ggplot2 version 2.2.1.

## May 2017

This workshop will show how the Stata commands margins and marginsplot can be used for model interpretation and visualization, and will present ways to compute adjusted predictions and marginal effects, as well as ways to compare predictions for levels of a factor variable.

The workshop will also demonstrate how the user-written command coefplot, by Ben Jann, can be applied to any estimation results to graphically display regression coefficients or other statistics of interest. One nice feature of coefplot is that it can be used to very easily display results from several models on one graph. Another nice feature is that most of the options available for other twoway plot types can also be used with the coefplot command.

This workshop provides an introduction to OpenScholar, a website building and content management tool for hosting professional profiles. The workshop will go over the basics of creating a website, adding content, and creating an appealing site layout. Attendees are encouraged to bring content such as a CV and images, so that they are ready to upload content to their website. A valid Princeton University netID is needed to log in to OpenScholar.

This workshop provides a discussion of issues to consider when designing statistical graphs. Topics include:

- tables vs graphs
- audience and setting
- representing data accurately
- highlighting comparisons of interest
- simplicity and clarity
- color

Several sets of graphs that attempt to "tell the same story" are examined, and discussion will center on why one display may be preferable to another. While the workshop goes over the pros and cons of using bar graphs, dot plots, line graphs, box plots, violin plots and several other graph types, the discussion is independent of any implementation tool, and is intended to be useful for those building graphs using R Base Graphics, ggplot2, and Stata Graphics, as well as many other tools.

## January 2017

If you want to increase the quality and impact of your work, you should consider doing open and reproducible research. In this workshop, I will begin by providing a working definition for what it means for your research to be open and reproducible. Then, I will describe the ways that you can overcome the obstacles that may be preventing you from being open and reproducible. The workshop will be illustrated by some of my own struggles with these issues during my career. Because there are many complicated technical, legal, professional, and ethical issues involved, there will be lots of time for questions and discussion.

This workshop introduces concepts, software tools and best practices for making research reproducible. Topics include version control (Git/GitHub), managing file dependencies (make), and tools for creating dynamic documents – most tools presented are in R through RStudio (Sweave, R Markdown, knitr, R Notebook); two Stata commands (Weaver, Stata markdown) will be presented towards the end.

This workshop shows how you can access Princeton's high performance computing resources. Discussion includes an overview of the Linux systems that are available at Princeton and how to: obtain accounts, connect and transfer files, run R and Stata programs on these systems, and submit jobs using a job scheduler called SLURM. In addition, time will be spent showing Linux commands for managing files, and explaining how to write Linux shell scripts to automate repetitive tasks.

## September 2016

This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.

## May 2016

The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both. The stm software package implements the estimation algorithms for the model and also includes tools for every stage of a standard workflow, from reading in and processing raw text through making publication-quality figures. The workshop will provide a hands-on introduction to using the stm package, which currently includes functionality to:

- ingest and manipulate text data
- estimate Structural Topic Models
- calculate covariate effects on latent topics with uncertainty
- estimate a graph of topic correlations
- compute model diagnostics and summary measures
- create the plots used in various papers about stm

This workshop introduces two modern R packages, both written by Hadley Wickham, that provide intuitive tools for handling common data management tasks. The first package, tidyr, provides functions that reshape data so it conforms to a specific “tidy” structure where each variable is saved in its own column, each observation is saved in its own row, and each type of observational unit is stored in a separate table. The second package, dplyr, provides a set of functions (referred to as “verbs”) that allow you to easily subset observations, re-order observations, select specific variables, add new variables, group observations, and summarize groups of observations.

## January 2016

If you want to increase the quality and impact of your work, you should consider doing open and reproducible research. In this workshop, I will begin by providing a working definition for what it means for your research to be open and reproducible. Then, I will describe the ways that you can overcome the obstacles that may be preventing you from being open and reproducible. The workshop will be illustrated by some of my own struggles with these issues during my career. Because there are many complicated technical, legal, professional, and ethical issues involved, there will be lots of time for questions and discussion.

The concept of "tidy data" offers a powerful framework for structuring data to ease manipulation, modeling and visualization. However, most R functions, both those built-in and those found in third-party packages, produce output that is not tidy, and that is therefore difficult to reshape, recombine, and otherwise manipulate. This workshop introduces the broom package, which turns the output of model objects into tidy data frames that are well-suited to further analysis, manipulation, and visualization with input-tidy tools such as ggplot2 and dplyr.

This workshop provides an introduction to the R graphics package ggplot2. Because ggplot2 is based on Wilkinson's Grammar of Graphics (2005), time is spent both (1) describing the main concepts of the grammar that define the graphical building blocks and (2) exploring many examples that show how to use ggplot2's layered approach to create basic and more complex graphs.

This workshop provides a discussion of issues to consider when designing statistical graphs. Topics include:

- tables vs graphs
- audience and setting
- representing data accurately
- highlighting comparisons of interest
- simplicity and clarity
- color

## September 2015

## May 2015

5/08/2015 from 9:30 AM to 4:00 PM ~ 217 Wallace Hall

5/11/2015 from 9:30 AM to 4:00 PM ~ 217 Wallace Hall

Python is a very popular, general-purpose, multi-paradigm, open-source scripting language. It is designed to emphasize code readability and has a clean syntax with high-level data types. It is well-suited for interactive work and quick prototyping, yet it is powerful enough for writing large applications. In this full-day workshop, attendees are introduced to basic Python syntax and to its ecosystem. See the workshop syllabus for objectives.

## January 2015

1/09/2015 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall

## September 2014

This workshop provides a brief introduction to Stata. Attendance is limited to first-year OPR graduate students and post-docs.

## May 2014

5/09/2014 from 9:30 AM to 12:00 PM ~ 217 Wallace Hall

5/09/2014 from 1:30 PM to 4:00 PM ~ 217 Wallace Hall

## January 2014

## September 2013

## May 2013

5/22/2013 from 9:00 AM to 5:00 PM ~ 300 Wallace Hall

Topics include: non-parametric identification by adjustment; d-separation; the difference between overcontrol bias, confounding bias, and selection bias; what variables to control for and what variables not to control for in observational research; effect heterogeneity; structural assumptions in instrumental variables identification; and recent work on causal mediation analysis.

Please note that this course focuses on spotting and understanding causal opportunities and causal problems. It is not a course on statistical methods (no software component). Students will discuss numerous exercises in class and solve a short homework assignment for the second day.