Probabilities without replacement
Some quick comments on the problem of selection without replacement. When the text computes the probability that two letters chosen from either a flat or a standard English distribution of characters form a pair (two copies of the same letter), it samples with replacement. When it approximates that probability using pairs of letters from ciphertext, it samples without replacement. It is instructive to look at the probabilities one gets from the flat and standard English distributions when sampling without replacement.
First we want the probabilities for each of the letters, both with a typical distribution and with a flat distribution.
>
typprobs := linalg[vector]
([.08167, .01492, .02782, .04253, .12702, .02228, .02015, .06094,
.06966, .00153, .00772, .04025, .02406, .06749, .07507, .01929,
.00095, .05987, .06327, .09056, .02758, .00978, .02360, .00150,
.01974, .00074]);
flatprobs := linalg[vector](26,i->evalf(1/26));
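For reference, the with-replacement probability of a pair is the sum of the squared letter probabilities. The lines below are a small sketch of that computation using Maple's add command; they are not part of the original worksheet. The results should come out near 0.065 for the typical distribution and near 0.038 for the flat one.
>
# Sketch only: with-replacement pair probabilities, sum of p[i]^2.
add(typprobs[i]^2, i=1..26);
add(flatprobs[i]^2, i=1..26);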
Next we want a set of functions that can be used to get the probability that two randomly chosen letters form a pair.
>
pairs := x -> x*max(x-1,0):
# Number of ordered pairs that can be drawn from x elements; the max
# keeps the count at 0 when an expected count x is below 1.
expect := (vec,n) -> linalg[scalarmul](vec,n):
# Expected number of occurrences of each letter,
# given a probability vector and a total number of letters.
pairsvec := vec -> map(pairs,vec):
sumvec := vec -> add(vec[i],i=1..26):
# Together these compute the number of pairs for a given distribution of letters.
probn := (numpairs, numchars) -> numpairs/(pairs(numchars)):
# Probability of a pair, given the number of pairs and the total number of characters.
nreplaceprob := (vec,n) -> probn(sumvec(pairsvec(expect(vec,n))),n):
# Combines all the earlier functions into a single command.
We are ready to look at the probability that two letters chosen without replacement form a pair, both with a typical probability distribution and with a flat distribution.
We will start with a sample of size 200.
>
nreplaceprob(typprobs,200);
nreplaceprob(flatprobs,200);
Notice that both probabilities are lower than the probabilities given in the book.
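For the flat distribution the drop can be checked by hand. Each letter is expected n/26 times, so the formula reduces to 26*(n/26)*(n/26 - 1)/(n*(n-1)) = (n-26)/(26*(n-1)) for n >= 26, which is always below 1/26. A quick check of this closed form at n = 200, offered as a sketch rather than as part of the original worksheet:
>
# Closed form for the flat case, valid for n >= 26.
# It should agree with nreplaceprob(flatprobs,200) up to floating-point rounding.
evalf((200-26)/(26*(200-1)));
nreplaceprob(flatprobs,200);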
Compare what happens if we use a sample of size 100,000.
>
nreplaceprob(typprobs,100000);
nreplaceprob(flatprobs,100000);
This is closer to the values that the book uses.
Let us look at a variety of sample sizes to compare the probabilities of a pair in both a typical and a flat distribution.
>
print(`n`,`typ pairs`,`prob typical`,`flat pairs`,`flat prob`,`total pairs`);
for k from 1 to 20 do
  print(k*50,round(nreplaceprob(typprobs,k*50)*pairs(k*50)),
        nreplaceprob(typprobs,k*50),
        round(nreplaceprob(flatprobs,k*50)*k*50*(k*50-1)),
        nreplaceprob(flatprobs,k*50),
        k*50*(k*50-1)):
od:
Notice how the probabilities move toward the values listed in the book, but there is still a noticeable difference at n = 1000. Consider what values we get when the values of n are powers of 10.
>
print(`n`, `typ prob`, `flat prob`);
for k from 1 to 15 do
  print(`10^`||k,nreplaceprob(typprobs,10^k), nreplaceprob(flatprobs,10^k)):
od:
The values are within 1% of the with-replacement values when the sample is at least 10,000 characters.
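One way to make the 1% comparison concrete is to compute the relative gap between the without-replacement probability and the with-replacement value, the sum of the squared letter probabilities. The helper relgap below is a name introduced here for illustration and is not part of the original worksheet. A result below 0.01 means the two values agree to within 1%.
>
# relgap is a hypothetical helper: relative shortfall of the
# without-replacement probability against the sum of p[i]^2.
relgap := (vec,n) -> evalf(1 - nreplaceprob(vec,n)/add(vec[i]^2,i=1..26)):
relgap(typprobs,1000); relgap(flatprobs,1000);
relgap(typprobs,10000); relgap(flatprobs,10000);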
If we want to get fancy, we notice that a passage cannot have half of a z in it. We want to modify our distributions to allow only integer values. For the flat distribution we put the extras in the first slots available. For the typical distribution we will use rounding.
>
flatdis := proc(n)
  local flatletdis, count, extras;
  flatletdis := linalg[vector](26):
  extras := n mod 26;
  for count from 1 to extras do
    flatletdis[count] := ceil(n/26):
  od:
  for count from (extras + 1) to 26 do
    flatletdis[count] := floor(n/26):
  od:
  flatletdis;
end:
Consider the distributions obtained for 491 letters.
>
flat2 := flatdis(491):
print(flat2);
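As a check on flatdis: 491 = 18*26 + 23, so the first 23 letters should get ceil(491/26) = 19 and the last 3 should get floor(491/26) = 18, and 23*19 + 3*18 = 491. The same check can be sketched in Maple with the worksheet's own sumvec:
>
# Sketch: the whole-number flat counts should add back up to the sample size.
sumvec(flatdis(491));   # should return 491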
>
wholeletdis := (vec, n) -> map(round,expect(vec,n)):
>
letcount := wholeletdis(typprobs,491);
# Check the total: rounding each count separately can make the sum differ from 491.
sum(letcount[i],i=1..26);
>
pairprobfromdist := (dist,n) ->
evalf(probn(sumvec(pairsvec(dist)),n)):
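Because rounding can leave the whole-number counts summing to something other than n, one possible variant divides by the number of pairs available from the actual total instead of from n. This is an assumption about what one might want to compare, not something the original worksheet does, and pairprobactual is a hypothetical name.
>
# pairprobactual is a hypothetical variant of pairprobfromdist that uses
# the actual total of the counts, sumvec(dist), in place of n.
pairprobactual := dist -> evalf(probn(sumvec(pairsvec(dist)),sumvec(dist))):
pairprobactual(wholeletdis(typprobs,491));
pairprobactual(flatdis(491));
For the flat distribution the counts always total n, so this variant should match pairprobfromdist there. Any difference would show up only in the rounded typical distribution.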
We will find that insisting on whole-number letter counts decreases the expected probability of a pair for the standard distribution and increases it for the flat distribution.
>
pairprobfromdist(wholeletdis(typprobs,491),491);
nreplaceprob(typprobs,491);
>
pairprobfromdist(flatdis(491),491);
nreplaceprob(flatprobs,491);
>
print(`n`,`typ prob - whole`,`prob typical`,`flat prob - whole`,`flat prob`,`total pairs`);
for k from 1 to 20 do
  print(k*50,pairprobfromdist(wholeletdis(typprobs,k*50),k*50),
        nreplaceprob(typprobs,k*50),
        pairprobfromdist(flatdis(k*50),k*50),
        nreplaceprob(flatprobs,k*50),
        k*50*(k*50-1)):
od:
>