
How to remove outliers from a dataset

Category: Q&A | Posted: 2011-01-25 05:23:51 | Comments: 8 | Views: 239133 | Likes: 78

I've got some multivariate data of beauty vs. age. The ages range from 20 to 40 at intervals of 2 (20, 22, 24, ..., 40), and each record has an age and a beauty rating from 1 to 5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.

I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.

Answers (8)

  • #2
  • OK, you should apply something like this to your dataset. Do not overwrite the original and save, or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:

    remove_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)] <- NA
      y[x > (qnt[2] + H)] <- NA
      y
    }
    

    To see it in action:

    set.seed(1)
    x <- rnorm(100)
    x <- c(-10, x, 10)
    y <- remove_outliers(x)
    ## png()
    par(mfrow = c(1, 2))
    boxplot(x)
    boxplot(y)
    ## dev.off()
    

    And once again, you should never do this on your own; outliers are just meant to be! =)

    EDIT: I added na.rm = TRUE as default.

    EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
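
    For reference, a minimal sketch of getting exactly the points that boxplot() draws beyond the whiskers. As far as I know, boxplot() delegates this to boxplot.stats(), which works from the hinges (fivenum()) rather than quantile()'s default, so it can disagree slightly with the function above on small samples. The names flagged and y are only for illustration:

    ```r
    # Toy data: 100 standard-normal draws plus two extreme values
    set.seed(1)
    x <- c(-10, rnorm(100), 10)

    # boxplot.stats() reports the values lying beyond the whiskers in $out
    flagged <- boxplot.stats(x)$out

    # Blank out those values in a copy, leaving the original vector untouched
    y <- x
    y[x %in% flagged] <- NA

    sum(is.na(y))  # how many values were flagged
    ```

    Working on a copy keeps the original observations available, in line with the warning above.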


  • #3
  • Use outline = FALSE as an option when you do the boxplot (read the help!).

    m <- c(rnorm(10), 5, 10)
    bp <- boxplot(m, outline = FALSE)
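
    Note that outline = FALSE only hides the points in the plot. A small sketch, assuming a fixed seed so the example is reproducible, showing that the returned object still lists them in its out component:

    ```r
    set.seed(1)
    m <- c(rnorm(10), 5, 10)

    # The points are not drawn, but the statistics are still computed
    bp <- boxplot(m, outline = FALSE)

    # Values that would have been drawn beyond the whiskers
    bp$out
    ```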
    


  • #4
  • The boxplot function returns the values used to do the plotting (the plotting itself is then done by bxp()):

    bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray") 
    #need to "waste" this plot
    bstats$out <- NULL
    bstats$group <- NULL
    bxp(bstats)  # this will plot without any outlier points
    

    I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them is a systematic and unjustified mangling of the observational record.
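
    If the throwaway plot is a nuisance, here is a sketch of one way around it, relying on the plot argument of boxplot() and the outline argument of bxp(). It is a variation on the code above, not a different method:

    ```r
    # Compute the boxplot statistics without drawing anything
    bstats <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

    # Draw the boxes from those statistics, leaving out the outlier points
    bxp(bstats, outline = FALSE)
    ```

    The outlying values stay in bstats$out, so nothing is removed from the record; they are only left off the plot.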

  • #5
  • x <- quantile(data$attribute, c(0.01, 0.99))
    data_clean <- data[data$attribute >= x[1] & data$attribute <= x[2], ]
    

    I find this a very easy way to remove outliers. In the above example I am just keeping the attribute values between the 1st and 99th percentiles.

  • #6
  • I looked for packages related to removing outliers and found this one (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
    If you go through it, you will see different ways of removing outliers. Among them, I found rm.outlier the most convenient to use. As the manual linked above says: "If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by sample mean or median". Here is the usage section from the same source:
    "Usage

    rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
    

    Arguments
    x         a dataset, most frequently a vector. If argument is a dataframe, then outlier is removed from each column by sapply. The same behavior is applied by apply when the matrix is given.
    fill      If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the outlier(s) is/are simply removed.
    median    If set to TRUE, median is used instead of mean in outlier replacement.
    opposite  if set to TRUE, gives opposite value (if largest value has maximum difference from the mean, it gives smallest and vice versa) "

  • #7
  • Adding to @sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:

    newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) ) 
    

    This will remove the points below the 1st percentile and above the 99th percentile. As aL3Xa said, take care when discarding outliers: they should be removed only to get an alternative, more conservative view of the data.
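
    A slightly tidier sketch of the same idea, computing the two cut-offs once instead of calling quantile() twice. The toy data frame and the name cutoffs are assumptions for illustration; mydata and var stand in for your own data:

    ```r
    # Toy data: 98 standard-normal draws plus two extreme values
    set.seed(42)
    mydata <- data.frame(var = c(rnorm(98), -50, 50))

    # Compute the 1st and 99th percentiles once
    cutoffs <- quantile(mydata$var, probs = c(0.01, 0.99), na.rm = TRUE)

    # Keep only the rows inside the cut-offs; mydata itself is not modified
    newdata <- subset(mydata, var >= cutoffs[1] & var <= cutoffs[2])

    nrow(newdata)
    ```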

  • #8
  • Wouldn't:

    z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) & 
            df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
    

    accomplish this task quite easily?
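
    Two caveats for data like the question's, with a sketch. The fences should probably be computed within each age group, since that is how the boxplots are drawn. And with ratings on a 1-5 scale a group's IQR can be zero, in which case strict < and > would drop every row of that group, so the sketch uses inclusive comparisons. The data frame and the helper within_fences are made up for illustration:

    ```r
    # Hypothetical data shaped like the question's: ages 20-40 in steps of 2, ratings 1-5
    set.seed(123)
    df <- data.frame(
      age    = rep(seq(20, 40, by = 2), each = 30),
      beauty = sample(1:5, 330, replace = TRUE,
                      prob = c(0.02, 0.08, 0.60, 0.25, 0.05))
    )

    # TRUE when a value lies within 1.5 * IQR of the quartiles of its own group
    within_fences <- function(v) {
      q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      h <- 1.5 * (q[2] - q[1])
      v >= q[1] - h & v <= q[2] + h
    }

    # ave() applies the rule separately within each age group
    keep <- as.logical(ave(df$beauty, df$age, FUN = within_fences))

    # Filtered copy; df itself is left untouched
    z <- df[keep, ]
    ```

    Because quantile() and boxplot() compute the box edges slightly differently, the rows dropped here may not match the plotted outlier points exactly.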
