Stats and data quality resources

  • Statistical Test Selector: UCLA’s guide to choosing the right statistical test, with example code in Stata, R, SPSS, and SAS
  • Bad data and how to fix them: Encyclopedia of all the things that can and do go wrong with data, with suggestions on how to fix them.
  • DataBasic: Suite of web tools for beginners to work with data

Stata resources

  • If you’re new to Stata, check out UCLA’s Stata Page. It has a wealth of resources to get you started.

R resources

Markdown and GitHub resources

Useful Stata packages

Stata has a large collection of user-written commands that are distributed through RePEc and housed at the Boston College Statistical Software Components (SSC) archive. As long as you are connected to the internet, you can download and install a package by typing ssc install followed by the package name in the Stata command window, for example ssc install estout. Once the package is installed, type help estout to view its help file. To view trending packages from SSC, type ssc whatshot, n(25) in Stata. This returns the top 25 packages at SSC.

Data Wrangling/Munging

  • egenmore: Stata’s egen command can execute tons of useful data munging operations. If egen is not enough for you, check out Nicholas Cox’s egenmore. The package includes various egen extensions.
  • reclink: probabilistically match records
  • jarowinkler: calculate the Jaro-Winkler distance between strings
  • carryforward: carry forward/backward previous observations
  • valuesof: display and return in r(values) the values of a variable joined together in a single string
  • use13: load datasets created with Stata 13 in Stata 10-12.
  • usespss: load SPSS files
  • usesas: load SAS files
  • insheetjson: import tabular data from JSON sources on the web
  • shp2dta: converts shape boundary files (shapefiles) to Stata datasets
  • winsor2: winsorize a varlist
  • trimmean: trimmed means as descriptive or inferential statistics
  • nearmrg: provide nearest-match merging of datasets
  • tscollap: collapse time-series data to a lower frequency
  • mdesc: tabulate prevalence of missing values
  • mkcorr: generate correlation table formatted for easy inclusion in articles
  • sxpose: transpose string variable dataset
  • fs: show names of files in a compact form
  • confirmdir: confirm if a directory exists
  • extremes: list extreme values of a variable
  • nsplit: split numeric variables into components
  • kountry: standardize country names across datasets

Useful R packages

R has a long list of libraries that extend the functionality of base R and make it easier to use. Here’s a running list of packages that we find particularly helpful, broken down by category. Core libraries are marked with an asterisk and are particularly recommended for all users.

To install any of the packages, use install.packages("<package name>"), as in: install.packages("ggplot2"). All packages can be found on CRAN, R’s package repository.
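
If you only want to install the packages you don’t already have, something like the following sketch may help. It uses only base R functions. Note that missing_pkgs and install_missing are hypothetical helper names introduced here for illustration; they are not part of any package.

```r
# Hypothetical helper (not from any package): return the names in `pkgs`
# that are not currently installed in any library on .libPaths().
missing_pkgs <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: install only the missing packages from CRAN.
# Needs an internet connection; invisibly returns the names it tried to install.
install_missing <- function(pkgs) {
  to_install <- missing_pkgs(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

For example, install_missing(c("dplyr", "tidyr", "ggplot2")) would install whichever of those three are absent and skip the rest.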

A lazy way to install and load all these packages

  • Laura is in the process of creating an R library called llamar to make it easier to load the most useful libraries at once (and also to create some custom plotting themes and functions). It’s under development now, so apologies for any lack of documentation and/or anything that breaks in the future.
  • To load the packages listed here, copy this code into R:
      install.packages("devtools")
      library(devtools)
      devtools::install_github("flaneuse/llamar")
      library(llamar)
      loadPkgs()
  • If you have any comments, feel free to email us

Data Wrangling

  • *dplyr: filter, create new variables, summarise, … Basically, anything you can think to do to a dataset
  • *tidyr: reshape datasets between wide and long formats, and split or combine columns
  • data.table: similar to dplyr but good for large datasets; some extra functionality
  • stringr: string manipulation
  • lubridate: better way to work with dates
  • zoo: running averages, amongst other things

Visualization and Interactive plots

  • **ggplot2: Hadley Wickham’s incredibly powerful plotting library built off of the Grammar of Graphics. So useful and well-designed it gets two asterisks.
  • ggplot2 extension packages: Running list of extensions to ggplot2
  • ggrepel: extends ggplot2 to avoid overlapping text
  • ggvis: data visualization package that enables interactive graphics
  • d3heatmap: creates D3-based heatmaps in R
  • htmlwidgets: framework for bringing JavaScript visualization libraries into R
  • metricsgraphics: creates interactive plots based on the MetricsGraphics.js / D3 chart library
  • rCharts: creates interactive plots based on several JavaScript charting libraries
  • DiagrammeR: creates graph diagrams using a Markdown-like syntax
  • packcircles: creates non-overlapping packed circles
  • waffle: creates isotype graphs (a single object repeated N times)

Geospatial analysis and mapping

  • ggmap: geocoding and geospatial library
  • leaflet: R wrapper to embed dynamic maps using leaflet.js
  • choroplethr, choroplethrAdmin1: easy way to create choropleths (heatmaps for a map) at the Admin 0- (country) and Admin 1-level (states/provinces)
  • RgoogleMaps: overlays plots on a Google map

Interactivity

  • shiny: easy way to create custom, interactive web applications in R
  • shinydashboard: uses Shiny to create customized dashboards
  • shinythemes: customize appearance of Shiny apps

Reporting, publication, and custom appearance

Importing files

  • haven: imports files from Stata, SAS, and SPSS
  • foreign: an alternative to haven for importing from Stata, SAS, and SPSS. Doesn’t support Stata 14 (yet?)
  • readr: a faster alternative to the base read.csv function with some added functionality
  • readxl: imports Excel files, including workbooks with multiple sheets
  • googlesheets: connects to Google Drive spreadsheets.
  • rvest: scrapes websites
  • pdftools: extracts text and metadata from .pdf files
  • jsonlite: converts between JSON objects and R ones

Developer libraries

  • *devtools: makes writing and releasing R packages a breeze. For casual users, it lets you install packages directly from GitHub using install_github
  • roxygen2: generates documentation for functions and packages from specially formatted comments
  • testthat: unit testing functions for package development
  • microbenchmark: timing function to profile how long functions take to execute
  • profvis: allows visual profiling of function timing to optimize performance

Fitting libraries

  • *broom: cleans up results from any fitted model into something neat and organized
  • MASS
  • sandwich
  • lmtest
  • plm
  • ggalt
  • coefplot
  • cluster
  • GWmodel

Misc.

  • swirl: A package to learn R within R.