API/Reference

BiweightStatsModule
BiweightStats

A module for robust statistics based on the biweight transform.[1]

Biweight Statistics

The basis of the biweight transform is robust analysis, that is, statistics which are resilient to outliers while still efficiently representing a variety of underlying distributions. The biweight transform is based off the median and the median absolute deviation (MAD). The median is a robust estimator of location, and the MAD is a robust estimator of scale

\[\mathrm{MAD}(X) = \mathrm{median}\left|X_i - \bar{X}\right|\]

where $\bar{X}$ is the median.

The biweight transform improves upon these estimates by filtering out data beyond a critical cutoff. The analogy is doing a sigma-filter, but using these robust statistics instead of the standard deviation and mean.

\[u_i = \frac{X_i - \bar{X}}{c \cdot \mathrm{MAD}}\]

\[\forall i \quad\mathrm{where}\quad u_i^2 \le 1\]

The cutoff factor, $c$, can be directly related to a Gaussian standard-deviation by multiplying by 1.4826[2]. So a typical value of $c=9$ means outliers further than $13.3\sigma$ are clipped (for residuals which are truly Gaussian-distributed). In addition, in BiweightStats, we also skip NaNs and Infs (but not missing or nothing).

References

Methods

source
BiweightStats.locationFunction
location(X; c=9, M=nothing)
location(X::AbstractArray; dims=:, kwargs...)

Calculate the biweight location, a robust measure of location.

\[\hat{y} = \frac{\sum_{u_i^2 \le 1}{y_i(1 - u_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)^2}}\]

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;


julia> location(X)
49.98008021468018

Iterative refinement

You can iteratively refine the location estimate by manually passing the median, like so-

X = 10 .* randn(rng, 1000) .+ 50
let ystar, ystar_old
    ystar = ystar_old = location(X)
    tol = 1e-6
    maxiter = 10
    for _ = 1:maxiter
        ystar = location(X; M=ystar_old)
        isapprox(ystar_old, ystar; atol=tol) && break
        ystar_old = ystar
    end
    ystar
end

# output

49.991666155308145

References

  1. NIST: biweight location
source
BiweightStats.scaleFunction
scale(X; c=9, M=nothing)
scale(X::AbstractArray; dims=:, kwargs...)

Compute the biweight scale of the variable. This is the same as the square-root of the midvariance.

\[\hat{\sigma} = \frac{\sqrt{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}}{\sum^n_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}}\]

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;


julia> scale(X)
10.045813567765071

References

  1. NIST: biweight scale

See Also

midcor, midvar, midcov

source
BiweightStats.midvarFunction
midvar(X; c=9, M=nothing)
midvar(X::AbstractArray; dims=:, kwargs...)

Compute the biweight midvariance of the variable.

\[\hat{\sigma}^2 = \frac{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}{\left[\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\right]^2}\]

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;


julia> midvar(X)
100.9183702382928

References

  1. NIST: biweight midvariance

See Also

scale, midcor, midcov

source
BiweightStats.midcovFunction
midcov(X, [Y]; c=9)

Computes biweight midcovariance between the two vectors. If only one vector is provided the biweight midvariance will be calculated.

\[\hat{\sigma}_{xy} = \frac{\sum_{u_i^2 \le 1,v_i^2 \le 1}{(x_i - \bar{x})(1 - u_i^2)^2(y_i - \bar{y})(1 - v_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\sum_{v_i^2 \le 1}{(1 - v_i^2)(1 - 5v_i^2)}}\]

Warning

NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.

Examples

julia> X = 10 .* randn(rng, 1000, 2) .+ 50;


julia> midcov(X[:, 1], X[:, 2])
-1.058463590812247

julia> midcov(X[:, 1]) ≈ midvar(X[:, 1])
true

julia> X[3, 2] = NaN;


julia> midcov(X[:, 1], X[:, 2])
NaN

References

  1. NIST: biweight midcovariance

See Also

scale, midvar, midcor

source
midcov(X::AbstractMatrix; dims=1, c=9)

Computes the variance-covariance matrix using the biweight midcovariance. By default, each column is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) covariance matrix. If dims=2, though, each row will become a variable, leading to an (M, M) covariance matrix.

Warning

NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.

Examples

julia> X = 10 .* randn(rng, 1000, 3) .+ 50;


julia> C = midcov(X)
3×3 Matrix{Float64}:
 100.918    -1.05846    -2.88515
  -1.05846  94.702      -0.490742
  -2.88515  -0.490742  100.699

julia> size(midcov(X; dims=2))
(1000, 1000)

References

  1. NIST: biweight midcovariance

See Also

scale, midvar, midcor

source
BiweightStats.midcorFunction
midcor(X, Y; c=9)

Compute the correlation between two variables using the midvariance and midcovariances.

\[\frac{s_{xy}}{\sqrt{s_{xx} \cdot s_{yy}}}\]

where $s_{xx},s_{yy}$ are the midvariances of each vector, and $s_{xy}$ is the midcovariance of the two vectors.

Examples

julia> X = 10 .* randn(rng, 1000, 2) .+ 50;


julia> midcor(X[:, 1], X[:, 2])
-0.010827077678217934

References

  1. Wikipedia
  2. NIST: Biweight midcorrelation

See Also

midvar, midcov, scale

source
midcor(X::AbstractMatrix; dims=1, c=9)

Computes the correlation matrix using the biweight midcorrealtion. By default, each column of the matrix is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) correlation matrix. If dims=2, though, each row will become a variable, leading to an (M, M) correlation matrix. The diagonal will always be one.

Examples

julia> X = 10 .* randn(rng, 1000, 3) .+ 50;


julia> C = midcor(X)
3×3 Matrix{Float64}:
  1.0        -0.0108271  -0.0286201
 -0.0108271   1.0        -0.0050253
 -0.0286201  -0.0050253   1.0

julia> size(midcor(X; dims=2))
(1000, 1000)

References

  1. Wikipedia
  2. NIST: Biweight midcorrelation

See Also

midvar, midcov, scale

source
BiweightStats.BiweightTransformType
BiweightTransform(X; c=9, M=nothing)

Creates an iterator based on the biweight transform.[1] This iterator will first filter all input data so that only finite values remain. Then, the iteration will progress using a custom state, which includes a flag to indicate whether the value is within the cutoff, which is c times the median-absolute-deviation (MAD). The MAD is based on the deviation from M, which will default to the median of X if M is nothing.

Advanced usage

This transform iterator is used for the internal calculations in BiweightStats.jl, which is why it has a somewhat complicated iterator implementation.

Examples

julia> X = randn(rng, 100);


julia> X[10] = 1e4 # add clear outlier
10000.0

julia> X[13] = NaN # add NaN
NaN

julia> X[25] = Inf # add Inf
Inf

julia> bt = BiweightTransform(X);

Lets confirm all the entries are finite. The iteration interface is divided into

(d, u2, flag), state = iterate(bt, [state])

where d is the data value minus M, u2 is (d / (c * MAD))^2, and flag is whether the value is within the transformed dataset.

julia> all(d -> isfinite(d[1]), bt)
true

and let's see how iteration differs between a normal sample and an outlier sample, which we manually inserted at index 10-

julia> (d, u2, flag), _ = iterate(bt, 9)
((-0.17093842061187192, 0.0009098761083851183, true), 10)

julia> (d, u2, flag), _ = iterate(bt, 10)
((0.0, 0.0, false), 11)

References

source