API/Reference
BiweightStatsBiweightStats.BiweightTransformBiweightStats.locationBiweightStats.midcorBiweightStats.midcovBiweightStats.midvarBiweightStats.scale
BiweightStats — ModuleBiweightStatsA module for robust statistics based on the biweight transform.[1]
Biweight Statistics
The basis of the biweight transform is robust analysis, that is, statistics which are resilient to outliers while still efficiently representing a variety of underlying distributions. The biweight transform is based off the median and the median absolute deviation (MAD). The median is a robust estimator of location, and the MAD is a robust estimator of scale
\[\mathrm{MAD}(X) = \mathrm{median}\left|X_i - \bar{X}\right|\]
where $\bar{X}$ is the median.
The biweight transform improves upon these estimates by filtering out data beyond a critical cutoff. The analogy is doing a sigma-filter, but using these robust statistics instead of the standard deviation and mean.
\[u_i = \frac{X_i - \bar{X}}{c \cdot \mathrm{MAD}}\]
\[\forall i \quad\mathrm{where}\quad u_i^2 \le 1\]
The cutoff factor, $c$, can be directly related to a Gaussian standard-deviation by multiplying by 1.4826[2]. So a typical value of $c=9$ means outliers further than $13.3\sigma$ are clipped (for residuals which are truly Gaussian-distributed). In addition, in BiweightStats, we also skip NaNs and Infs (but not missing or nothing).
References
Methods
BiweightStats.location — Functionlocation(X; c=9, M=nothing)
location(X::AbstractArray; dims=:, kwargs...)Calculate the biweight location, a robust measure of location.
\[\hat{y} = \frac{\sum_{u_i^2 \le 1}{y_i(1 - u_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)^2}}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> location(X)
49.98008021468018Iterative refinement
You can iteratively refine the location estimate by manually passing the median, like so-
X = 10 .* randn(rng, 1000) .+ 50
let ystar, ystar_old
ystar = ystar_old = location(X)
tol = 1e-6
maxiter = 10
for _ = 1:maxiter
ystar = location(X; M=ystar_old)
isapprox(ystar_old, ystar; atol=tol) && break
ystar_old = ystar
end
ystar
end
# output
49.991666155308145References
BiweightStats.scale — Functionscale(X; c=9, M=nothing)
scale(X::AbstractArray; dims=:, kwargs...)Compute the biweight scale of the variable. This is the same as the square-root of the midvariance.
\[\hat{\sigma} = \frac{\sqrt{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}}{\sum^n_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> scale(X)
10.045813567765071References
See Also
BiweightStats.midvar — Functionmidvar(X; c=9, M=nothing)
midvar(X::AbstractArray; dims=:, kwargs...)Compute the biweight midvariance of the variable.
\[\hat{\sigma}^2 = \frac{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}{\left[\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\right]^2}\]
Examples
julia> X = 10 .* randn(rng, 1000) .+ 50;
julia> midvar(X)
100.9183702382928References
See Also
BiweightStats.midcov — Functionmidcov(X, [Y]; c=9)Computes biweight midcovariance between the two vectors. If only one vector is provided the biweight midvariance will be calculated.
\[\hat{\sigma}_{xy} = \frac{\sum_{u_i^2 \le 1,v_i^2 \le 1}{(x_i - \bar{x})(1 - u_i^2)^2(y_i - \bar{y})(1 - v_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\sum_{v_i^2 \le 1}{(1 - v_i^2)(1 - 5v_i^2)}}\]
NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.
Examples
julia> X = 10 .* randn(rng, 1000, 2) .+ 50;
julia> midcov(X[:, 1], X[:, 2])
-1.058463590812247
julia> midcov(X[:, 1]) ≈ midvar(X[:, 1])
true
julia> X[3, 2] = NaN;
julia> midcov(X[:, 1], X[:, 2])
NaNReferences
See Also
midcov(X::AbstractMatrix; dims=1, c=9)Computes the variance-covariance matrix using the biweight midcovariance. By default, each column is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) covariance matrix. If dims=2, though, each row will become a variable, leading to an (M, M) covariance matrix.
NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.
Examples
julia> X = 10 .* randn(rng, 1000, 3) .+ 50;
julia> C = midcov(X)
3×3 Matrix{Float64}:
100.918 -1.05846 -2.88515
-1.05846 94.702 -0.490742
-2.88515 -0.490742 100.699
julia> size(midcov(X; dims=2))
(1000, 1000)References
See Also
BiweightStats.midcor — Functionmidcor(X, Y; c=9)Compute the correlation between two variables using the midvariance and midcovariances.
\[\frac{s_{xy}}{\sqrt{s_{xx} \cdot s_{yy}}}\]
where $s_{xx},s_{yy}$ are the midvariances of each vector, and $s_{xy}$ is the midcovariance of the two vectors.
Examples
julia> X = 10 .* randn(rng, 1000, 2) .+ 50;
julia> midcor(X[:, 1], X[:, 2])
-0.010827077678217934References
See Also
midcor(X::AbstractMatrix; dims=1, c=9)Computes the correlation matrix using the biweight midcorrealtion. By default, each column of the matrix is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) correlation matrix. If dims=2, though, each row will become a variable, leading to an (M, M) correlation matrix. The diagonal will always be one.
Examples
julia> X = 10 .* randn(rng, 1000, 3) .+ 50;
julia> C = midcor(X)
3×3 Matrix{Float64}:
1.0 -0.0108271 -0.0286201
-0.0108271 1.0 -0.0050253
-0.0286201 -0.0050253 1.0
julia> size(midcor(X; dims=2))
(1000, 1000)References
See Also
BiweightStats.BiweightTransform — TypeBiweightTransform(X; c=9, M=nothing)Creates an iterator based on the biweight transform.[1] This iterator will first filter all input data so that only finite values remain. Then, the iteration will progress using a custom state, which includes a flag to indicate whether the value is within the cutoff, which is c times the median-absolute-deviation (MAD). The MAD is based on the deviation from M, which will default to the median of X if M is nothing.
This transform iterator is used for the internal calculations in BiweightStats.jl, which is why it has a somewhat complicated iterator implementation.
Examples
julia> X = randn(rng, 100);
julia> X[10] = 1e4 # add clear outlier
10000.0
julia> X[13] = NaN # add NaN
NaN
julia> X[25] = Inf # add Inf
Inf
julia> bt = BiweightTransform(X);
Lets confirm all the entries are finite. The iteration interface is divided into
(d, u2, flag), state = iterate(bt, [state])where d is the data value minus M, u2 is (d / (c * MAD))^2, and flag is whether the value is within the transformed dataset.
julia> all(d -> isfinite(d[1]), bt)
trueand let's see how iteration differs between a normal sample and an outlier sample, which we manually inserted at index 10-
julia> (d, u2, flag), _ = iterate(bt, 9)
((-0.17093842061187192, 0.0009098761083851183, true), 10)
julia> (d, u2, flag), _ = iterate(bt, 10)
((0.0, 0.0, false), 11)References