Statistics module

Added a statistics module for calculating various facets of
share survival statistics.
Shawn Willden 2009-01-13 20:12:35 -07:00
parent b3513f3ac6
commit 21832280da
4 changed files with 1021 additions and 1 deletion

docs/lossmodel.lyx Normal file

@@ -0,0 +1,688 @@
#LyX 1.6.1 created this file. For more info see http://www.lyx.org/
\lyxformat 345
\begin_document
\begin_header
\textclass amsart
\use_default_options true
\begin_modules
theorems-ams
theorems-ams-extended
\end_modules
\language english
\inputencoding auto
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100
\graphics default
\paperfontsize default
\spacing single
\use_hyperref false
\papersize default
\use_geometry false
\use_amsmath 1
\use_esint 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\author ""
\author ""
\end_header
\begin_body
\begin_layout Title
Tahoe Distributed Filesharing System Loss Model
\end_layout
\begin_layout Author
Shawn Willden
\end_layout
\begin_layout Email
shawn@willden.org
\end_layout
\begin_layout Abstract
The abstract goes here
\end_layout
\begin_layout Section
Problem Statement
\end_layout
\begin_layout Standard
The allmydata Tahoe distributed file system uses Reed-Solomon erasure coding
to split files into
\begin_inset Formula $N$
\end_inset
shares, each of which is then delivered to a randomly-selected peer in
a distributed network.
The file can later be reassembled from any
\begin_inset Formula $k\leq N$
\end_inset
of the shares, if they are available.
\end_layout
\begin_layout Standard
Over time shares are lost for a variety of reasons.
Storage servers may crash, be destroyed or simply be removed from the network.
To mitigate such losses, Tahoe network clients employ a repair agent which
scans the peers once per time period
\begin_inset Formula $A$
\end_inset
and determines how many of the shares remain.
If fewer than
\begin_inset Formula $R$
\end_inset
(
\begin_inset Formula $k\leq R\leq N$
\end_inset
) shares remain, then the repairer reconstructs the file shares and redistributes
the missing ones, bringing the availability back up to full.
\end_layout
\begin_layout Standard
The question we're trying to answer is "What's the probability that we'll
be able to reassemble the file at some later time
\begin_inset Formula $T$
\end_inset
?".
We'd also like to be able to determine what values we should choose for
\begin_inset Formula $k$
\end_inset
,
\begin_inset Formula $N$
\end_inset
,
\begin_inset Formula $A$
\end_inset
, and
\begin_inset Formula $R$
\end_inset
in order to ensure
\begin_inset Formula $Pr[loss]\leq t$
\end_inset
for some threshold probability
\begin_inset Formula $t$
\end_inset
.
This is an optimization problem because although we could obtain very low
\begin_inset Formula $Pr[loss]$
\end_inset
by choosing small
\begin_inset Formula $k,$
\end_inset
large
\begin_inset Formula $N$
\end_inset
, small
\begin_inset Formula $A$
\end_inset
, and setting
\begin_inset Formula $R=N$
\end_inset
, these choices have costs.
The peer storage and bandwidth consumed by the share distribution process
are approximately
\begin_inset Formula $\nicefrac{N}{k}$
\end_inset
times the size of the original file, so we would like to reduce this ratio
as far as possible consistent with
\begin_inset Formula $Pr[loss]\leq t$
\end_inset
.
Likewise, a frequent and aggressive repair process can be used to ensure
that the number of shares available at any time is very close to
\begin_inset Formula $N,$
\end_inset
but at a cost in bandwidth.
\end_layout
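\begin_layout Standard
For example, with
\begin_inset Formula $k=3$
\end_inset
and
\begin_inset Formula $N=10$
\end_inset
, the stored shares and the distribution bandwidth amount to roughly
\begin_inset Formula $\nicefrac{10}{3}\approx3.3$
\end_inset
times the size of the original file.
\end_layout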
\begin_layout Section
Reliability
\end_layout
\begin_layout Standard
The probability that the file becomes unrecoverable is dependent upon the
probability that the peers to whom we send shares are able to return those
copies on demand.
Shares that are returned in corrupted form can be detected and discarded,
so there is no need to distinguish between corruption and loss.
\end_layout
\begin_layout Standard
There are a large number of factors that affect share availability.
Availability can be temporarily interrupted by peer unavailability, due
to network outages, power failures or administrative shutdown, among other
reasons.
Availability can be permanently lost due to failure or corruption of storage
media, catastrophic damage to the peer system, administrative error, withdrawal
from the network, malicious corruption, etc.
\end_layout
\begin_layout Standard
The existence of intermittent failure modes motivates the introduction of
a distinction between
\noun on
availability
\noun default
and
\noun on
reliability
\noun default
.
Reliability is the probability that a share is retrievable assuming intermittent
failures can be waited out, so reliability considers only permanent failures.
Availability considers all failures, and is focused on the probability
of retrieval within some defined time frame.
\end_layout
\begin_layout Standard
Another consideration is that some failures affect multiple shares.
If multiple shares of a file are stored on a single hard drive, for example,
failure of that drive may lose them all.
Catastrophic damage to a data center may destroy all shares on all peers
in that data center.
\end_layout
\begin_layout Standard
While the types of failures that may occur are pretty consistent across
even very different peers, their probabilities differ dramatically.
A professionally-administered blade server with redundant storage, power
and Internet located in a carefully-monitored data center with automatic
fire suppression systems is much less likely to become either temporarily
or permanently unavailable than the typical virus and malware-ridden home
computer on a single cable modem connection.
A variety of situations in between exist as well, such as the case of the
author's home file server, which is administered by an IT professional
and uses RAID level 6 redundant storage, but runs on old, cobbled-together
equipment, and has a consumer-grade Internet connection.
\end_layout
\begin_layout Standard
To begin with, let's use a simple definition of reliability:
\end_layout
\begin_layout Definition
\noun on
Reliability
\noun default
is the probability
\begin_inset Formula $p_{i}$
\end_inset
that a share
\begin_inset Formula $s_{i}$
\end_inset
will survive to (be retrievable at) time
\begin_inset Formula $T=A$
\end_inset
, ignoring intermittent failures.
That is, the probability that the share will be retrievable at the end
of the current repair cycle, and therefore usable by the repairer to regenerate
any lost shares.
\end_layout
\begin_layout Definition
Reliability is clearly dependent on
\begin_inset Formula $A$
\end_inset
.
Short repair cycles offer less time for shares to
\begin_inset Quotes eld
\end_inset
decay
\begin_inset Quotes erd
\end_inset
into unavailability.
\end_layout
\begin_layout Subsection
Fixed Reliability
\end_layout
\begin_layout Standard
In the simplest case, the peers holding the file shares all have the same
reliability
\begin_inset Formula $p$
\end_inset
, and are all independent from one another.
Let
\begin_inset Formula $K$
\end_inset
be a random variable that represents the number of shares that survive
\begin_inset Formula $A$
\end_inset
.
Each share's survival can be viewed as an independent Bernoulli trial with
a success probability of
\begin_inset Formula $p$
\end_inset
, which means that
\begin_inset Formula $K$
\end_inset
follows the binomial distribution with parameters
\begin_inset Formula $N$
\end_inset
and
\begin_inset Formula $p$
\end_inset
(
\begin_inset Formula $K\sim B(N,p)$
\end_inset
).
The probability mass function (PMF) of the binomial distribution is:
\begin_inset Formula \begin{equation}
Pr(K=i)=f(i;N,p)=\binom{N}{i}p^{i}(1-p)^{N-i}\label{eq:binomial-pdf}\end{equation}
\end_inset
\end_layout
\begin_layout Standard
A file survives if at least
\begin_inset Formula $k$
\end_inset
of the
\begin_inset Formula $N$
\end_inset
shares survive.
Equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:binomial-pdf"
\end_inset
gives the probability that exactly
\begin_inset Formula $i$
\end_inset
shares survive, so the probability that fewer than
\begin_inset Formula $k$
\end_inset
survive is the sum of the probabilities that
\begin_inset Formula $0,1,2,\ldots,k-1$
\end_inset
shares survive.
That is:
\end_layout
\begin_layout Standard
\begin_inset Formula \begin{equation}
Pr[failure]=\sum_{i=0}^{k-1}\binom{N}{i}p^{i}(1-p)^{N-i}\label{eq:simple-failure}\end{equation}
\end_inset
\end_layout
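\begin_layout Standard
As a concrete illustration (the same value is checked by the unit tests
accompanying the statistics module), consider
\begin_inset Formula $N=10$
\end_inset
shares that each survive with probability
\begin_inset Formula $p=0.5$
\end_inset
and a recovery threshold of
\begin_inset Formula $k=3$
\end_inset
.
Equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:simple-failure"
\end_inset
gives
\begin_inset Formula \[
Pr[failure]=\sum_{i=0}^{2}\binom{10}{i}0.5^{10}=\frac{1+10+45}{1024}=0.0546875\]
\end_inset
.
\end_layout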
\begin_layout Subsection
Independent Reliability
\end_layout
\begin_layout Standard
Equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:simple-failure"
\end_inset
assumes that each share has the same probability of survival, but as explained
above, this is not typically true.
A more accurate model allows each share
\begin_inset Formula $s_{i}$
\end_inset
an independent probability of survival
\begin_inset Formula $p_{i}$
\end_inset
.
Each share's survival can still be treated as an independent Bernoulli
trial, but with success probability
\begin_inset Formula $p_{i}$
\end_inset
.
Under this assumption,
\begin_inset Formula $K$
\end_inset
follows a generalized distribution with parameters
\begin_inset Formula $N$
\end_inset
and
\begin_inset Formula $p_{i},1\leq i\leq N$
\end_inset
.
\end_layout
\begin_layout Standard
The PMF for this generalized
\begin_inset Formula $K$
\end_inset
does not have a simple closed-form representation.
However, the PMFs for random variables representing individual share survival
do.
Let
\begin_inset Formula $S_{i}$
\end_inset
be a random variable such that:
\end_layout
\begin_layout Standard
\begin_inset Formula \[
S_{i}=\begin{cases}
1 & \textnormal{if }s_{i}\textnormal{ survives}\\
0 & \textnormal{if }s_{i}\textnormal{ fails}\end{cases}\]
\end_inset
\end_layout
\begin_layout Standard
The PMF for
\begin_inset Formula $S_{i}$
\end_inset
is very simple,
\begin_inset Formula $Pr(S_{i}=1)=p_{i}$
\end_inset
and
\begin_inset Formula $Pr(S_{i}=0)=1-p_{i}$
\end_inset
.
\end_layout
\begin_layout Standard
Observe that
\begin_inset Formula $\sum_{i=1}^{N}S_{i}=K$
\end_inset
.
Effectively,
\begin_inset Formula $K$
\end_inset
has just been separated into the series of Bernoulli trials that make it
up.
\end_layout
\begin_layout Standard
The discrete convolution theorem states that given random variables
\begin_inset Formula $X$
\end_inset
and
\begin_inset Formula $Y$
\end_inset
and their sum
\begin_inset Formula $Z=X+Y$
\end_inset
, if
\begin_inset Formula $Pr[X=x]=f(x)$
\end_inset
and
\begin_inset Formula $Pr[Y=y]=g(y)$
\end_inset
then
\begin_inset Formula $Pr[Z=z]=(f\star g)(z)$
\end_inset
where
\begin_inset Formula $\star$
\end_inset
denotes the convolution operation.
Stated in English, the probability mass function of the sum of two random
variables is the convolution of the probability mass functions of the two
random variables.
\end_layout
\begin_layout Standard
Discrete convolution is defined as
\end_layout
\begin_layout Standard
\begin_inset Formula \[
(f\star g)(n)=\sum_{m=-\infty}^{\infty}f(m)\cdot g(n-m)\]
\end_inset
\end_layout
\begin_layout Standard
The infinite summation is no problem because the probability mass functions
we need to convolve are zero outside of a small range.
\end_layout
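\begin_layout Standard
For example, consider two independent shares with survival probabilities
\begin_inset Formula $p_{1}=0.9$
\end_inset
and
\begin_inset Formula $p_{2}=0.8$
\end_inset
, so that
\begin_inset Formula $g_{1}=(0.1,0.9)$
\end_inset
and
\begin_inset Formula $g_{2}=(0.2,0.8)$
\end_inset
.
Their convolution is
\begin_inset Formula \[
(g_{1}\star g_{2})=(0.1\cdot0.2,\,0.1\cdot0.8+0.9\cdot0.2,\,0.9\cdot0.8)=(0.02,0.26,0.72)\]
\end_inset
, which is the PMF of the number of surviving shares out of the two.
\end_layout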
\begin_layout Standard
According to the discrete convolution theorem, then, if
\begin_inset Formula $Pr[K=i]=f(i)$
\end_inset
and
\begin_inset Formula $Pr[S_{i}=j]=g_{i}(j)$
\end_inset
, then
\begin_inset Formula $f=g_{1}\star g_{2}\star g_{3}\star\ldots\star g_{N}$
\end_inset
.
Since convolution is associative, this can also be written as
\begin_inset Formula \begin{equation}
f=(\ldots((g_{1}\star g_{2})\star g_{3})\star\ldots)\star g_{N}\label{eq:convolution}\end{equation}
\end_inset
which enables
\begin_inset Formula $f$
\end_inset
to be implemented as a sequence of convolution operations on the simple
PMFs of the random variables
\begin_inset Formula $S_{i}$
\end_inset
.
In fact, as values of
\begin_inset Formula $N$
\end_inset
get large, equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:convolution"
\end_inset
turns out to be a more effective means of computing the PMF of
\begin_inset Formula $K$
\end_inset
even in the case of the standard binomial distribution, primarily because
the binomial calculation in equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:binomial-pdf"
\end_inset
produces very large values that overflow unless arbitrary precision numeric
representations are used, or unless the binomial calculation is very cleverly
interleaved with the powers of
\begin_inset Formula $p$
\end_inset
and
\begin_inset Formula $1-p$
\end_inset
to keep the values manageable.
\end_layout
\begin_layout Standard
Note also that it is not necessary to have very simple PMFs like those of
the
\begin_inset Formula $S_{i}$
\end_inset
.
Any share or set of shares that has a known PMF can be combined with any
other set with a known PMF by convolution, as long as the two share sets
are independent.
Since PMFs are easily represented as simple lists of probabilities, where
the
\begin_inset Formula $i$
\end_inset
th element in the list corresponds to
\begin_inset Formula $Pr[K=i]$
\end_inset
, these functions are easily managed in software, and computing the convolution
is both simple and efficient.
\end_layout
\begin_layout Subsection
Multiple Failure Modes
\end_layout
\begin_layout Standard
In modeling share survival probabilities, it's useful to be able to analyze
separately each of the various failure modes.
If reliable statistics for disk failure can be obtained, then a probability
mass function for that form of failure can be generated.
Similarly, statistics on other hardware failures, administrative errors,
network losses, etc., can all be estimated independently.
If those estimates can be combined into a single PMF for the server, then
it can be used to predict that server's failures.
\end_layout
\begin_layout Standard
In the case of independent failure modes for a single server, this is very
simple to do.
If
\begin_inset Formula $p_{i,j}$
\end_inset
is the probability of survival of the
\begin_inset Formula $j$
\end_inset
th failure mode of server
\begin_inset Formula $i$
\end_inset
, and there are
\begin_inset Formula $m$
\end_inset
failure modes then
\begin_inset Formula \[
p_{i}=\prod_{j=1}^{m}p_{i,j}\]
\end_inset
is the probability of server
\begin_inset Formula $i$
\end_inset
's survival and
\begin_inset Formula \[
Pr[S_{i}=k]=f(k)=\begin{cases}
1-p_{i} & k=0\\
p_{i} & k=1\end{cases}\]
\end_inset
is the full survival PMF.
\end_layout
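\begin_layout Standard
For example (with purely illustrative numbers), if a server survives media
failure with probability
\begin_inset Formula $0.98$
\end_inset
, operator error with probability
\begin_inset Formula $0.99$
\end_inset
, and catastrophic damage or theft with probability
\begin_inset Formula $0.995$
\end_inset
over one repair cycle, then
\begin_inset Formula $p_{i}=0.98\cdot0.99\cdot0.995\approx0.965$
\end_inset
.
\end_layout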
\begin_layout Standard
\end_layout
\end_body
\end_document


@@ -1,7 +1,7 @@
def foo(): pass # keep the line number constant
import os, time
import os, time, random
from twisted.trial import unittest
from twisted.internet import defer, reactor
from twisted.python import failure
@@ -9,6 +9,7 @@ from twisted.python import failure
from allmydata.util import base32, idlib, humanreadable, mathutil, hashutil
from allmydata.util import assertutil, fileutil, deferredutil, abbreviate
from allmydata.util import limiter, time_format, pollmixin, cachedir
from allmydata.util import statistics
class Base32(unittest.TestCase):
def test_b2a_matches_Pythons(self):
@@ -163,6 +164,108 @@ class Math(unittest.TestCase):
self.failUnlessEqual(f([0,0,0,4]), 1)
self.failUnlessAlmostEqual(f([0.0, 1.0, 1.0]), .666666666666)
def test_round_sigfigs(self):
f = mathutil.round_sigfigs
self.failUnlessEqual(f(22.0/3, 4), 7.3330000000000002)
class Statistics(unittest.TestCase):
def should_assert(self, msg, func, *args, **kwargs):
try:
func(*args, **kwargs)
self.fail(msg)
except AssertionError, e:
pass
def failUnlessListEqual(self, a, b, msg = None):
self.failUnlessEqual(len(a), len(b))
for i in range(len(a)):
self.failUnlessEqual(a[i], b[i], msg)
def failUnlessListAlmostEqual(self, a, b, places = 7, msg = None):
self.failUnlessEqual(len(a), len(b))
for i in range(len(a)):
self.failUnlessAlmostEqual(a[i], b[i], places, msg)
def test_binomial_coeff(self):
f = statistics.binomial_coeff
self.failUnlessEqual(f(20, 0), 1)
self.failUnlessEqual(f(20, 1), 20)
self.failUnlessEqual(f(20, 2), 190)
self.failUnlessEqual(f(20, 8), f(20, 12))
self.should_assert("Should assert if n < k", f, 2, 3)
def test_binomial_distribution_pmf(self):
f = statistics.binomial_distribution_pmf
pmf_comp = f(2, .1)
pmf_stat = [0.81, 0.18, 0.01]
self.failUnlessListAlmostEqual(pmf_comp, pmf_stat)
# Summing across a PMF should give the total probability 1
self.failUnlessAlmostEqual(sum(pmf_comp), 1)
self.should_assert("Should assert if not 0<=p<=1", f, 1, -1)
self.should_assert("Should assert if n < 1", f, 0, .1)
def test_survival_pmf(self):
f = statistics.survival_pmf
# Cross-check binomial-distribution method against convolution
# method.
p_list = [.9999] * 100 + [.99] * 50 + [.8] * 20
pmf1 = statistics.survival_pmf_via_conv(p_list)
pmf2 = statistics.survival_pmf_via_bd(p_list)
self.failUnlessListAlmostEqual(pmf1, pmf2)
self.failUnlessTrue(statistics.valid_pmf(pmf1))
self.should_assert("Should assert if p_i > 1", f, [1.1]);
self.should_assert("Should assert if p_i < 0", f, [-.1]);
def test_convolve(self):
f = statistics.convolve
v1 = [ 1, 2, 3 ]
v2 = [ 4, 5, 6 ]
v3 = [ 7, 8 ]
v1v2result = [ 4, 13, 28, 27, 18 ]
# Convolution is commutative
r1 = f(v1, v2)
r2 = f(v2, v1)
self.failUnlessListEqual(r1, r2, "Convolution should be commutative")
self.failUnlessListEqual(r1, v1v2result, "Didn't match known result")
# Convolution is associative
r1 = f(f(v1, v2), v3)
r2 = f(v1, f(v2, v3))
self.failUnlessListEqual(r1, r2, "Convolution should be associative")
# Convolution is distributive
r1 = f(v3, [ a + b for a, b in zip(v1, v2) ])
tmp1 = f(v3, v1)
tmp2 = f(v3, v2)
r2 = [ a + b for a, b in zip(tmp1, tmp2) ]
self.failUnlessListEqual(r1, r2, "Convolution should be distributive")
# Convolution is scalar multiplication associative
tmp1 = f(v1, v2)
r1 = [ a * 4 for a in tmp1 ]
tmp2 = [ a * 4 for a in v1 ]
r2 = f(tmp2, v2)
self.failUnlessListEqual(r1, r2, "Convolution should be scalar multiplication associative")
def test_find_k(self):
f = statistics.find_k
g = statistics.pr_file_loss
plist = [.9] * 10 + [.8] * 10
t = .0001
k = f(plist, t)
self.failUnlessEqual(k, 10)
self.failUnless(g(plist, k) < t)
def test_pr_file_loss(self):
f = statistics.pr_file_loss
plist = [.5] * 10
self.failUnlessEqual(f(plist, 3), .0546875)
def test_pr_backup_file_loss(self):
f = statistics.pr_backup_file_loss
plist = [.5] * 10
self.failUnlessEqual(f(plist, .5, 3), .02734375)
class Asserts(unittest.TestCase):
def should_assert(self, func, *args, **kwargs):


@@ -73,3 +73,7 @@ def log_floor(n, b):
p *= b
k += 1
return k - 1
def round_sigfigs(f, n):
fmt = "%." + str(n-1) + "e"
return float(fmt % f)
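# Illustrative example (the same value is checked by the accompanying unit
# tests): round_sigfigs(22.0/3, 4) rounds to four significant figures,
# returning 7.333 (stored as the nearest float, 7.3330000000000002).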


@@ -0,0 +1,225 @@
# Copyright (c) 2009 Shawn Willden
# mailto:shawn@willden.org
from __future__ import division
from mathutil import round_sigfigs
import math
import array
def pr_file_loss(p_list, k):
"""
Probability of single-file loss for shares with reliabilities in
p_list.
Computes the probability that a single file will become
unrecoverable, based on the individual share survival
probabilities and k (number of shares needed for recovery).
Example: pr_file_loss([.9] * 5 + [.99] * 5, 3) returns the
probability that a file with k=3, N=10 and stored on five servers
with reliability .9 and five servers with reliability .99 is lost.
See survival_pmf docstring for important statistical assumptions.
"""
assert 0 < k <= len(p_list)
assert valid_probability_list(p_list)
# Sum elements 0 through k-1 of the share set PMF to get the
# probability that less than k shares survived.
return sum(survival_pmf(p_list)[0:k])
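# Illustrative example (value checked by the accompanying unit tests):
# pr_file_loss([.5] * 10, 3) == 0.0546875, the probability that fewer than
# 3 of 10 shares survive when each survives with probability .5 (56/1024).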
def survival_pmf(p_list):
"""
Return the collective PMF of share survival count for a set of
shares with the individual survival probabilities in p_list.
Example: survival_pmf([.99] * 10 + [.8] * 6) returns the
probability mass function for the number of shares that will
survive from an initial set of 16, 10 with p=0.99 and 6 with
p=0.8. The ith element of the resulting list is the probability
that exactly i shares will survive.
This calculation makes the following assumptions:
1. p_list[i] is the probability that any individual share
will survive during the time period in question (whatever that may
be).
2. The share failures are "independent", in the statistical
sense. Note that if a group of shares are stored on the same
machine or even in the same data center, they are NOT independent
and this calculation is therefore wrong.
"""
assert valid_probability_list(p_list)
pmf = survival_pmf_via_conv(p_list)
assert valid_pmf(pmf)
return pmf
def survival_pmf_via_bd(p_list):
"""
Compute share survival PMF using the binomial distribution PMF as
much as possible.
This is more efficient than the convolution method below, but
doesn't work for large numbers of shares because the
binomial_coeff calculation blows up. Since the efficiency gains
only matter in the case of large numbers of shares, it's pretty
much useless except for testing the convolution method.
Note that this function does little to no error checking and is
intended for internal use and testing only.
"""
pmf_list = [ binomial_distribution_pmf(p_list.count(p), p)
for p in set(p_list) ]
return reduce(convolve, pmf_list)
def survival_pmf_via_conv(p_list):
"""
Compute share survival PMF using iterated convolution of trivial
PMFs.
Note that this function does little to no error checking and is
intended for internal use and testing only.
"""
pmf_list = [ [1 - p, p] for p in p_list ];
return reduce(convolve, pmf_list)
def print_pmf(pmf, n):
"""
Print a PMF in a readable form, with values rounded to n
significant digits.
"""
for k, p in enumerate(pmf):
print "i=" + str(k) + ":", round_sigfigs(p, n)
def pr_backup_file_loss(p_list, backup_p, k):
"""
Probability of single-file loss in a backup context
Same as pr_file_loss, except it factors in the probability of
survival of the original source, specified as backup_p. Because
that's a precondition to caring about the availability of the
backup, it's an independent event.
"""
assert valid_probability_list(p_list)
assert 0 < backup_p <= 1
assert 0 < k <= len(p_list)
return pr_file_loss(p_list, k) * (1 - backup_p)
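# Illustrative example (value checked by the accompanying unit tests):
# pr_backup_file_loss([.5] * 10, .5, 3) == 0.02734375, i.e.
# pr_file_loss([.5] * 10, 3) * (1 - .5), since the loss of the backup only
# matters if the original source is lost as well.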
def find_k(p_list, target_loss_prob):
"""
Find the highest k value that achieves the targeted loss
probability, given the share reliabilities given in p_list.
"""
assert valid_probability_list(p_list)
assert 0 < target_loss_prob < 1
pmf = survival_pmf(p_list)
return find_k_from_pmf(pmf, target_loss_prob)
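# Illustrative example (scenario from the accompanying unit tests): with
# ten shares at p=.9 and ten at p=.8, find_k(p_list, .0001) returns 10,
# the largest k for which pr_file_loss(p_list, k) does not exceed .0001.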
def find_k_from_pmf(pmf, target_loss_prob):
"""
Find the highest k value that achieves the targeted loss
probability, given the share survival PMF given in pmf.
"""
assert valid_pmf(pmf)
assert 0 < target_loss_prob < 1
loss_prob = 0.0
for k, p_k in enumerate(pmf):
loss_prob += p_k
if loss_prob > target_loss_prob:
return k
k = len(pmf) - 1
return k
def valid_pmf(pmf):
"""
Validate that pmf looks like a proper discrete probability mass
function in list form.
Returns true if the elements of pmf sum to 1.
"""
return round(sum(pmf),5) == 1.0
def valid_probability_list(p_list):
"""
Validate that p_list is a list of probabilities.
"""
for p in p_list:
if p < 0 or p > 1:
return False
return True
def convolve(list_a, list_b):
"""
Returns the discrete convolution of two lists.
Given two random variables X and Y, the convolution of their
probability mass functions Pr(X) and Pr(Y) is equal to the
Pr(X+Y).
"""
n = len(list_a)
m = len(list_b)
result = []
for i in range(n + m - 1):
sum = 0.0
lower = max(0, i - n + 1)
upper = min(m - 1, i)
for j in range(lower, upper+1):
sum += list_a[i-j] * list_b[j]
result.append(sum)
return result
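# Illustrative examples: convolve([1, 2, 3], [4, 5, 6]) yields
# [4, 13, 28, 27, 18] (checked by the accompanying unit tests).  For PMFs,
# convolve([.1, .9], [.2, .8]) yields approximately [0.02, 0.26, 0.72], the
# survival-count PMF of two independent shares with reliabilities .9 and .8.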
def binomial_distribution_pmf(n, p):
"""
Returns Pr(K), where K ~ B(n,p), as a list of values.
Returns the full probability mass function of a B(n, p) as a list
of values, where the kth element is Pr(K=k), or, in the Tahoe
context, the probability that exactly k copies of a file share
survive, when placed on n independent servers with survival
probability p.
"""
assert p >= 0 and p <= 1, 'p=%s must be in the range [0,1]'%p
assert n > 0
result = []
for k in range(n+1):
result.append(math.pow(p , k ) *
math.pow(1 - p, n - k) *
binomial_coeff(n, k))
assert valid_pmf(result)
return result;
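# Illustrative example (checked, to 7 places, by the accompanying unit
# tests): binomial_distribution_pmf(2, .1) is approximately
# [0.81, 0.18, 0.01], i.e. with two independent shares each surviving with
# probability .1, the chances that 0, 1, or 2 shares survive are 81%, 18%,
# and 1% respectively.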
def binomial_coeff(n, k):
"""
Returns the number of ways that k items can be chosen from a set
of n.
"""
assert n >= k
if k > n:
return 0
if k > n/2:
k = n - k
accum = 1.0
for i in range(1, k+1):
accum = accum * (n - k + i) // i;
return int(accum + 0.5)
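# Illustrative examples (checked by the accompanying unit tests):
# binomial_coeff(20, 2) == 190, and binomial_coeff(20, 8) ==
# binomial_coeff(20, 12) by the symmetry of Pascal's triangle.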