Winsorization in R

A common step in preparing data for analysis is to Winsorize variables, that is, to replace extreme values with less extreme ones. In practice this usually means replacing values of a variable, say return on assets (ROA), that lie above the 99th percentile with the 99th percentile value, and values below the 1st percentile with the 1st percentile value. It is a way of limiting the influence of outliers. See the Wikipedia article.

In R, this takes only a few lines for a single variable, using the quantile function:

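wins <- function(x){
  ## Winsorize a vector at the 1st and 99th percentiles
  ## (used as a helper by wins.df)
  ## x: a numeric vector; NAs are ignored when computing the
  ##    cutoffs and stay NA in the result
  cutoffs <- quantile(x, probs=c(0.01, 0.99), na.rm=TRUE, names=FALSE)
  pLOWER <- cutoffs[1]
  pUPPER <- cutoffs[2]
  ## Pull values below the lower cutoff up, and values above the upper cutoff down
  x.w <- ifelse(x < pLOWER, pLOWER, x)
  x.w <- ifelse(x > pUPPER, pUPPER, x.w)
  return(x.w)
}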
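
To see the effect on actual numbers, here is a small self-contained sketch. It restates the same clamping idea with the cutoffs exposed as arguments and applies it to a toy vector containing one obvious outlier and one missing value. The name wins_q and the lower/upper parameters are my own illustration, not part of the function above.

```r
# Hypothetical variant of the clamping logic: same idea, but the
# percentile cutoffs are arguments instead of being fixed at 1% / 99%.
wins_q <- function(x, lower = 0.01, upper = 0.99) {
  # Compute only the two cutoffs needed; names = FALSE drops the "1%" labels
  cutoffs <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Clamp from below, then from above; NA values stay NA
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

# Toy data: 1..100 plus one extreme value and one missing value
roa <- c(1:100, 1000, NA)
roa_w <- wins_q(roa)

range(roa, na.rm = TRUE)    # original range includes the outlier
range(roa_w, na.rm = TRUE)  # winsorized range is bounded by the cutoffs
sum(is.na(roa_w))           # missing values are preserved
```

Using pmin and pmax rather than two ifelse calls is only a stylistic choice here; both approaches clamp the values to the same cutoffs.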

In a pooled cross-section, we usually want to Winsorize year by year, so that the cutoffs are computed separately for each year. Using the wins function above as a helper, this is a straightforward extension:

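wins.df <- function(X, var, yearid = "year"){
  ## X: a data frame
  ## var: name of the column to Winsorize
  ## yearid: name of the column identifying the year
  ## Returns a copy of X with var Winsorized within each year
  years <- unique(X[, yearid])
  Y <- X
  for(year in years){
    rows <- which(X[, yearid] == year)
    ## Overwrite var with its Winsorized values for this year
    Y[rows, var] <- wins(X[rows, var])
  }
  return(Y)
}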
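
As a rough illustration of what year-by-year Winsorizing does, the sketch below builds a made-up two-year panel and clamps within each year using base R's ave(), which applies a function by group. This is an alternative to the loop above, not a replacement for it. The names clamp and panel, the column names, and the data are all invented for the example, and the 10%/90% cutoffs are chosen only so the effect shows up in such a small dataset.

```r
# Illustrative clamp helper (hypothetical name); cutoffs are widened
# to 10% / 90% so that a tiny example shows a visible effect.
clamp <- function(x, lower = 0.10, upper = 0.90) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Made-up panel: two years, each with one extreme value
panel <- data.frame(
  year = rep(c(2001, 2002), each = 11),
  roa  = c(0:9, 500, 100:109, -500)
)

# ave() splits roa by year, applies clamp() to each group,
# and returns the results in the original row order
panel$roa_w <- ave(panel$roa, panel$year, FUN = clamp)

# Per-year ranges before and after Winsorizing
tapply(panel$roa,   panel$year, range)
tapply(panel$roa_w, panel$year, range)
```

Because the cutoffs are computed within each year, the extreme value in one year does not affect how the other year's values are clamped.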

A cleaned-up and more general version of this code is here.
