API/Reference
BiweightStats
BiweightStats.BiweightTransform
BiweightStats.location
BiweightStats.midcor
BiweightStats.midcov
BiweightStats.midvar
BiweightStats.scale
BiweightStats
— ModuleBiweightStats
A module for robust statistics based on the biweight transform.[1]
Biweight Statistics
The basis of the biweight transform is robust analysis, that is, statistics which are resilient to outliers while still efficiently representing a variety of underlying distributions. The biweight transform is based off the median and the median absolute deviation (MAD). The median is a robust estimator of location, and the MAD is a robust estimator of scale
\[\mathrm{MAD}(X) = \mathrm{median}\left|X_i - \bar{X}\right|\]
where $\bar{X}$ is the median.
The biweight transform improves upon these estimates by filtering out data beyond a critical cutoff. The analogy is doing a sigma-filter, but using these robust statistics instead of the standard deviation and mean.
\[u_i = \frac{X_i - \bar{X}}{c \cdot \mathrm{MAD}}\]
\[\forall i \quad\mathrm{where}\quad u_i^2 \le 1\]
The cutoff factor, $c$, can be directly related to a Gaussian standard-deviation by multiplying by 1.4826[2]. So a typical value of $c=9$ means outliers further than $13.3\sigma$ are clipped (for residuals which are truly Gaussian-distributed). In addition, in BiweightStats
, we also skip NaN
s and Inf
s (but not missing
or nothing
).
References
Methods
BiweightStats.location
— Functionlocation(X; c=9, M=nothing)
location(X::AbstractArray; dims=:, kwargs...)
Calculate the biweight location, a robust measure of location.
\[\hat{y} = \frac{\sum_{u_i^2 \le 1}{y_i(1 - u_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)^2}}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> location(X)
49.98008021468018
Iterative refinement
You can iteratively refine the location estimate by manually passing the median, like so-
X = 10 .* randn(rng, 1000) .+ 50
let ystar, ystar_old
ystar = ystar_old = location(X)
tol = 1e-6
maxiter = 10
for _ = 1:maxiter
ystar = location(X; M=ystar_old)
isapprox(ystar_old, ystar; atol=tol) && break
ystar_old = ystar
end
ystar
end
# output
49.991666155308145
References
BiweightStats.scale
— Functionscale(X; c=9, M=nothing)
scale(X::AbstractArray; dims=:, kwargs...)
Compute the biweight scale of the variable. This is the same as the square-root of the midvariance.
\[\hat{\sigma} = \frac{\sqrt{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}}{\sum^n_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> scale(X)
10.045813567765071
References
See Also
BiweightStats.midvar
— Functionmidvar(X; c=9, M=nothing)
midvar(X::AbstractArray; dims=:, kwargs...)
Compute the biweight midvariance of the variable.
\[\hat{\sigma}^2 = \frac{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}{\left[\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\right]^2}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> midvar(X)
100.9183702382928
References
See Also
BiweightStats.midcov
— Functionmidcov(X, [Y]; c=9)
Computes biweight midcovariance between the two vectors. If only one vector is provided the biweight midvariance will be calculated.
\[\hat{\sigma}_{xy} = \frac{\sum_{u_i^2 \le 1,v_i^2 \le 1}{(x_i - \bar{x})(1 - u_i^2)^2(y_i - \bar{y})(1 - v_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\sum_{v_i^2 \le 1}{(1 - v_i^2)(1 - 5v_i^2)}}\]
NaN
and Inf
cannot be removed in the covariance calculation, so if they are present the returned value will be NaN
. To prevent this, consider imputing values for the non-finite data.
Examples
julia> X = 10 .* randn(rng, 1000, 2) .+ 50;
julia> midcov(X[:, 1], X[:, 2])
-1.058463590812247
julia> midcov(X[:, 1]) ≈ midvar(X[:, 1])
true
julia> X[3, 2] = NaN;
julia> midcov(X[:, 1], X[:, 2])
NaN
References
See Also
midcov(X::AbstractMatrix; dims=1, c=9)
Computes the variance-covariance matrix using the biweight midcovariance. By default, each column is a separate variable, so an (M, N)
matrix with dims=1
will create an (N, N)
covariance matrix. If dims=2
, though, each row will become a variable, leading to an (M, M)
covariance matrix.
NaN
and Inf
cannot be removed in the covariance calculation, so if they are present the returned value will be NaN
. To prevent this, consider imputing values for the non-finite data.
Examples
julia> X = 10 .* randn(rng, 1000, 3) .+ 50;
julia> C = midcov(X)
3×3 Matrix{Float64}:
100.918 -1.05846 -2.88515
-1.05846 94.702 -0.490742
-2.88515 -0.490742 100.699
julia> size(midcov(X; dims=2))
(1000, 1000)
References
See Also
BiweightStats.midcor
— Functionmidcor(X, Y; c=9)
Compute the correlation between two variables using the midvariance and midcovariances.
\[\frac{s_{xy}}{\sqrt{s_{xx} \cdot s_{yy}}}\]
where $s_{xx},s_{yy}$ are the midvariances of each vector, and $s_{xy}$ is the midcovariance of the two vectors.
Examples
julia> X = 10 .* randn(rng, 1000, 2) .+ 50;
julia> midcor(X[:, 1], X[:, 2])
-0.010827077678217934
References
See Also
midcor(X::AbstractMatrix; dims=1, c=9)
Computes the correlation matrix using the biweight midcorrealtion. By default, each column of the matrix is a separate variable, so an (M, N)
matrix with dims=1
will create an (N, N)
correlation matrix. If dims=2
, though, each row will become a variable, leading to an (M, M)
correlation matrix. The diagonal will always be one.
Examples
julia> X = 10 .* randn(rng, 1000, 3) .+ 50;
julia> C = midcor(X)
3×3 Matrix{Float64}:
1.0 -0.0108271 -0.0286201
-0.0108271 1.0 -0.0050253
-0.0286201 -0.0050253 1.0
julia> size(midcor(X; dims=2))
(1000, 1000)
References
See Also
BiweightStats.BiweightTransform
— TypeBiweightTransform(X; c=9, M=nothing)
Creates an iterator based on the biweight transform.[1] This iterator will first filter all input data so that only finite values remain. Then, the iteration will progress using a custom state, which includes a flag to indicate whether the value is within the cutoff, which is c
times the median-absolute-deviation (MAD). The MAD is based on the deviation from M
, which will default to the median of X
if M
is nothing
.
This transform iterator is used for the internal calculations in BiweightStats.jl
, which is why it has a somewhat complicated iterator implementation.
Examples
julia> X = randn(rng, 100);
julia> X[10] = 1e4 # add clear outlier
10000.0
julia> X[13] = NaN # add NaN
NaN
julia> X[25] = Inf # add Inf
Inf
julia> bt = BiweightTransform(X);
Lets confirm all the entries are finite. The iteration interface is divided into
(d, u2, flag), state = iterate(bt, [state])
where d
is the data value minus M
, u2
is (d / (c * MAD))^2
, and flag
is whether the value is within the transformed dataset.
julia> all(d -> isfinite(d[1]), bt)
true
and let's see how iteration differs between a normal sample and an outlier sample, which we manually inserted at index 10
-
julia> (d, u2, flag), _ = iterate(bt, 9)
((-0.17093842061187192, 0.0009098761083851183, true), 10)
julia> (d, u2, flag), _ = iterate(bt, 10)
((0.0, 0.0, false), 11)
References