# Question:A serious problem with Statistics:-Sample()

## Question:A serious problem with Statistics:-Sample()

Maple

Hi,
The choice method=default (the implicit choice) gives dramatically false results as demonstrated here for the sampling of a normal r.v.
It's likely to be the case for any continuous distribution.
In a few words the tails of the distribution are not correctly sampled (far from the mean by 4 standard deviations for a normal r.v., not an extremely rare event in practical applications).

I think the attached file is sufficiently documented for you to understand the problem.

 > restart:

The problem presented here also exists with Maple 2018 and Maple 2019

 > interface(version) (1)
 > with(Statistics):
 > X := RandomVariable(Normal(0, 1)):

Generate a sample of X of size 10^6

Here is the R code for those who would like to verify the results and the performances

N <- 10^6

R <- 20

Q <- c(0:5)

M <- matrix(nrow=R, ncol=length(Q))

for (r in c(1:20)) {

S <- rnorm(N, mean=0, sd=1)

for (q in Q) {

M[r, q+1] <-length(S[S>q])

}

}

Expected.Number.of.Outliers <- floor(N - pnorm(Q, mean=0, sd=1) * N)

Observed.Number.of.Outliers <-  round(colMeans(M))

Expected.Number.of.Outliers

Observed.Number.of.Outliers

Absolute.Differences <- M - kronecker(t(Expected.Number.of.Outliers), rep(1, R))

boxplot(differences)

Remark the loop below takes about 30 seconds to run (to be compared to the 20 minutes
it would take am I used the build-in procedure Select, and to  the 3 seconds R demands).

 > N := 10^6: R := 20: Q  := [\$0..5]: nQ := numelems(Q): M  := Matrix(R, nQ, 0): for r from 1 to R do   S := Sample(X, N):   for q in Q do     M[r, q+1] := add(Heaviside~(S -~ q)):   end do: end do:
 > Expected_Number_Of_Outliers := Vector[row](nQ, i -> Probability(X > Q[i], numeric)*N); (2)
 > Observed_Number_Of_Outliers := Mean(M); (3)

While the values of  the Observed_Number_of_Outliers seem to be reasonably correct for q =0, 1, 2 and 3
there are highly suspicious for q = 4 and likely for q = 5 (but N is too small to be categorical)

So let's examine more closely rhe results for q = 4

 > `q=4`:= convert(M[..,5], list);  min(`q=4`)  (4)

This means we obtained 20 estimations out of 20 of the probability that X exceeds q=4.
There are only about on chance out of 1 million to get such a result (roughly (1/2)^20 ).

Here is the result R returned for the same quantile q=4

M[,5]

 24 37 23 38 30 43 23 30 25 39 25 29 29 36 39 41 35 37 36 34

(9 values out of 20 are less than the expected value of 31.67)

A very simple test based on the Binomial distribution will prove, without any doubt, that the
distribution of the 20 numbers of outliers for q = 4 implies the sampling algorithm is of
is of very poor quality.

Is it possible to obtain correct results with Maple ?

(simulations done below for the case q=4 only)

 > N := 10^6: R := 20: Menv := Vector[column](R, 0): for r from 1 to R do   S := Sample(X, N, method=envelope):   Menv[r] := add(Heaviside~(S -~ 4)): end do:
 > `q=4`:= convert(Menv, list); Mean(Menv); Variance(Menv);   (5)

Conclusion, the "default" sampling strategy suffers strong flaw.

Remark: it's well known that one of the best method to sample normal random variables
is Box-Mueller's: why it has not been programmed by default ?

 > 