(This post is part of the Understanding GHOSTDAG series)

Truly understanding blockChain and blockDAGs, to the extent you could make your own contributions, requires quite a lot of math. This appendix will surely not teach you all of it, but it is a good starting point. Here we have more of a collection of the math that came up while writing the rest of this book (appropriately, it will keep updating is as writing progresses).

This is not a math book, so the presentation will not be completely formal and I will allow myself to handwave as I please. I will do my best to refer to more formal resources that I personally recommend. However, unlike these resourcecs that naturally draw motivations from many walks of life, I tried to draw inspiration and examples from the world of consensus and proof-of-work as much as possible. So while this appendix is a bit erratic in its difficulty curve, I think it is worth a read, even if to be only partially absorbed.

Exponentials and Logarithms

Exponentials and logarithms describe quantities that increase either very fast or very slow, respectively.

We say that the quantity $f(x)$ is exponential with respect to quantity $x$ if whenever we add to $x$ , $f(x)$ multiplies. For example, $f(n)$ could describe the number natof possible results of flipping $n$ coins. Clearly $f(1)=2$ , as if we flip one coin there are two possible outcomes. We also note that adding one flip will double the number of possible outcomes. Each possible outcome of flipping $n$ coins “splits” into two possible outcomes of flipping $n+1$ coins: one for if the additional coin is head and other for if it is tails. In other words, we get that $f(n+1) = 2\cdot f(n)$ and it is not hard to work out from here that $f(n) = 2^n$ .

Typically, we will not talk about exponential growth but about exponential decay. This is simply doing a “one over” to exponential growth. If $f(n)$ grows exponentially fast then $1/f(n)$ decays exponentially fast. We usually see this “one over” in the context of probability, where the number of options grow exponentially, making the chance a particular option decay exponentially.

For example, say we flip $n$ coins, what is the probability that they are all head? We already know that there are $2^n$ possible outcomes, so the probability we get this one particular result we want (assuming the coins are fair) is $1/2^n$ , which we will usually denote $2^{-n}$ .

Logarithms are the opposite. They grow extremely slow. Whenever a value is multiplied, the value of its logarithm increases additively. For example, we can reverse the previous question and ask: how many coins do we need to flip so that there could be at least $N$ different outcomes? If $N=2$ we know the answer is that one coin is enough. With some more consideration it is easy to realize that for $N=2^n$ we need $n$ coins. The way mathematicians notate this observation is $\log_2(2^n)=n$ .

More generally, $\log_a(b)$ is the power of $a$ you need to take to obtain $b$ . It is the unique number satisfying $a^{\log_a(b)} = b$ . The number $a$ is called the base of the logarithm.

Since we deal in asymptotics, it happens that we don’t particularly care about the base, so I will just write $\log(x)$ . If that’s too abstract for you, feel free to imagine we are in base two. A word of warning though: the implied base in the notation $\log(x)$ varies from textbook to textbook. Generally speaking, in computer-science books the logarithms are in base two, in math books the logarithms are in base $e$ (and will be usually denoted as $\ln$ ), in physics books they don’t care about the basis of the logarithm, and everywhere else the logarithms are in base 10.

A nice way to think about logarithms is as the “number of digits”. In base 10, we have that $10^7$ has $7$ digits and it just so happens that $\log_{10}(10^7)=7$ . We can think of $\log_{10}$ as a function that increases from $n$ to $n+1$ as we go over all numbers with $n$ digits.

We can work out some more properties of logarithms, but we really do not need them. Still, if you want to practice logarithms you can try proving the following (using only the definition of the logarithm given above):

$\log(2^x) = x$
$\log(xy) = \log x + \log y$
$\log(x^a) = a\cdot \log(x)$

We will typically care about how quantities increase as thing get smaller. This will come up when we talk about the definition of security for blockChains. We ask ourselves questions like “how long must a receiver wait before he knows the probability a transaction is reverted is smaller than $\varepsilon$ ?”

For example, if you already read the discussion about probabilistic vs. deterministic confirmations, you should already expect that this time goes to infinity as $\varepsilon$ goes to $0$ , but we can still ask ourselves how fast this happens. We say this happens logarithmically slow if whenever we halve $\varepsilon$ , the time only increases by a constant amount. One way to express it is to say that the time it takes looks like $\log(1/\varepsilon)$ , note that if we make $\varepsilon$ two times smaller it follows from the logarithm rules that $\log\left(\frac{1}{\varepsilon/2}\right)=\log\left(2\cdot\frac{1}{\varepsilon}\right)=\log\left(\frac{1}{\varepsilon}\right)+1$ which is just what we wanted.

Big O Notation

The above is littered with vague statements like “grows as” or “looks like”. The purpose of the big O notation is to give this a formal definition. Honestly though, understanding the definition is not as important as getting used to the notation.

I will state the formal definition, try not to let it scare you, and then we will break it down. The definition will look a bit laden with details, but that’s exactly the point: we pack all these details into the definition so that we will not have to worry about them.

Formally, given two functions $f$ and $g$ we say that $f=O(g)$ if there are values $n_0,C,D>0$ such that for any $n>n_0$ it holds that $f(n)\le C\cdot g(n) + D\text{.}$ Intuitively, that means that $f$ is smaller than something that “looks like $g$ ” when $n$ is ”sufficiently large”.

The need for this kind of notation is deep rooted in computer science. Say you want to understand how long it takes your algorithm to run. This times depends on many things that might not particularly interests you right now, and are not even well defined. For example, the basic operations might take different times on different CPUs. Saying that the runtime of your algorithm is $O(n)$ (where $n$ is the size of the input) is saying something like “look, I don’t know exactly how long this would take, it depends on all sorts of practical considerations that are other people’s job to work out, what I can tell you is that if you double the length of the input the runtime should pretty much double, at least for inputs that aren’t too small”.

Using this notation, instead of saying “the runtime $T$ increases logarithmically fast is $\varepsilon$ goes to $0$ ” we can just succinctly write $T(\varepsilon) = O(\log(1/\varepsilon))$ .

The “Big O” part is called the growth asymptotics while the values $C,D$ are called the growth constants.

When do constants matter? When we make a concrete design. A computer scientist researching algorithms for multiplying matrices usually mostly cares how the algorithm behaves as the sizes of the matrices increases. However, an ASIC designer trying to build a circuit for multiplying matrices of given size and other attributes mostly cares about the actual constants.

A nice example from the world of blockChains is safety versus confirmation times. In both cases we look at how the probability a transaction is reverted decays with time. Safety is a statement about the asymptotics of this decay, whereas confirmation times deal with the constants.

Taylor Series of the Exponential function*

Don’t worry, I have no intention to develop the theory of Taylor sequences, just state this one useful fact: $e^x = \sum_{i=0}^\infty \frac{x^n}{n!}\text{.}$

What is this $e$ you ask? Just some number, approximately $2.718$ . It is a very fascinating number, which you can read all about in this fascinating book. We use this number because it makes the above formula simple. If we want to approximate an exponent with another base $a$ we just need to note that $a^x = e^{x\cdot\ln a}\text{.}$

We will use this series for a small value of $x$ , in this case we can note that the summand $x^n/n!$ actually vanishes very fast, to the effect that even for $n=2$ it is small enough to neglect. In other words, it holds that $e^x = 1+x+O(x^2)\text{,}$ and for $|x|\ll 1$ we can say that $e^x \approx 1+x\text{.}$

Probability Theory

(As a general resource for probability theory, I highly recommend Sheldon Ross’ First Course in Probability)

A probability function is something that assigns likelihood to events. This likelihood as expressed as a real number between $0$ and $1$ . We usually use capitalized latin letters from the beginning of the alphabet, such as $A,B$ , to denote events. The probability of the event is denoted as $\mathbb{P}[A]$ .

For any event $A$ , the complement of $A$ is denoted $A^c$ and includes all outcomes outside of $A$ . For example, if $A$ is the event that the result of a die roll is even, then $A^c$ is the event that the result is odd (assuming we only model outcomes where the die toss had a concise result, and disregard events such as the die shattering on impact or remaining suspended mid-air).

Two events are disjoint if there is no scenario included in both. For example, the event “the result is odd” is disjoint of the event “the result is four”. Note that an event is disjoint of $A$ if and only if it is contained in $A^c$ .

The probability function has some useful yet hardly surprising properties, I will not list them all. Two important properties are:

The probability of the complement: $\mathbb{P}[A^c]=1-\mathbb{P}[A]\text{.}$
Additivity of disjoint events: $\mathbb{P}[A\text{ or }B] = \mathbb{P}[A] + \mathbb{P}[B]\text{.}$

(Exercise: prove that the first property is a special case of the second property)

Random Variables

Sometimes, we want to consider processes whose outcome doesn’t have a clear numerical value. For example, the result of a die toss has a “natural” numerical value (the value written on the top side), while the result of flipping a coin does not.

Random variables are assignments of numerical values to outcomes. For example, if we want to model a one dollar bet that a fair coin flip will turn out head, we can define a random variable that assigns $1$ to the result head and $-1$ to the result tails.

There are many interesting random variables. For example, we could define a variable that takes a sequence of coin tosses of any length and outputs the number of heads, or the length of the longest contiguous subsequence of identical tosses. Random variables are often denoted by capitalized letters from the end of the English alphabet, so let us denote the last variable we described as $X$ . Random variables are a very expressive tool that allows us to zoom in on the property that interests us, and abstract away the rest. For example, if we indeed only care about the length of the longest subsequence of identical tosses, then we don't really care that HHTTT and THHHTTHTTH are completely different sequences, only that they satisfy $X=3$ .

Given a random variable $X$ we can ask what is the probability that it will give some value $x$ . This quantity is denoted by $\mathbb{P}[X=x]$ . For example, the random variable describing the fair coin above satisfies that $\mathbb{P}[X=1]=\mathbb{P}[X=-1] = 1/2$ .

Expectation (and Variance)

Given a random variable, we can ask what would be its value on average. That is, what should we expect to get from it. This quantity is called the expectation, it is just an average of the possible values, weighted by their probability. In other words, if $X$ can get the values $a_1,\ldots,a_n$ then the expectation of $X$ is $\mathbb{E}[X] = \mathbb{P}[X=a_1]\cdot a_1 + \ldots + \mathbb{P}[X=a_n]\cdot a_n\text{.}$

So for example, if $X$ is our fair coin, we get that $\begin{aligned}\mathbb{E}[X] & =\mathbb{P}[X=1]\cdot1+\mathbb{P}[X=-1]\cdot\left(-1\right)\\& =\frac{1}{2}\cdot1+\frac{1}{2}\cdot(-1)\\& =0\text{,}\end{aligned}$

meaning that when we bet on a fair coin, we don’t have any advantage or disadvantage.

What if the coin wasn’t fair though? Say it has probability $p$ to land on heads? Then the probability to land on tails is $1-p$ , and we get that the expectation is $\begin{aligned}\mathbb{E}[X] & =\mathbb{P}[X=1]\cdot p+\mathbb{P}[X=-1]\cdot\left(1-p\right)\\& =p\cdot1+\left(1-p\right)\cdot\left(-1\right)\\& =2p-1=2\left(p-\frac{1}{2}\right)\text{.}\end{aligned}$

Not surprisingly, we find that the game is in our favour when the coin is ( $p>1/2$ ).

Laws of Large Numbers

Laws of large numbers clarify the practical meaning of the expectation. Intuitively, we expect that if we flip a fair coin many times, about half of the flips will be heads. We can accept that two out of three coins were heads, but at some level, we all know that if we flip a fair coin three million times, the number of heads should be much closer to a million and a half than it is to two millions. We have an innate understanding of how repetition reduces noise, the law of large numbers is the simplest way to formalize this understanding.

A law of large numbers looks like this: if $X_1,X_2,…$ are independent and identical random variables (the prototypical example being that $X_i$ describe the $i$ th flip of the coin), then $\lim_{n\to\infty}\frac{X_{1}+\ldots+X_{n}}{n}=\mathbb{E}\left[X\right]\text{.}$

The expression above is not well-defined, we never explained what this $\lim$ notation means. It clearly talks about how the average of the first $n$ instances behaves as $n$ grows, but that description is vague. The truth is there are many ways to define what $\lim$ means for random variables. These different definitions are called modes of convergence, and each one will produce a different law of large numbers.

We de not need to formally state any law of large numbers, the intuitive principle is more than enough for us. The main purpose of this digression was to provide an example of the math you will have to learn to understand ongoing research.

To illustrate the principle, take a look at this graph:

It was generated by simulating ten thousands coin flips and tracking their average. I repeated the experiment five times, and plotted each repetition with a different color. Notice that while eventually the process converges to some expected point, there are many ways to get there. The “speed” at which a random variable approaches its expectation is often quantified using something called variance. I will not formally define the variance, but intuitively it represents how “scattered” a random variable is around its expectation.

Exercise: say I run this simulation again, what is the probability I will get a result identical to one of the lines above?

The Distribution of a Random Variable

Let $X$ be the result of throwing a normal six-sided die. Now consider a “weird” six-sided die, it is still fair in the sense that all sides are equally likely, but for some reason it has the numbers two through seven written on the sides. Let $Y$ be the result of throwing the weird die and subtracting one from the result.

Formally, $X$ and $Y$ are different, but they distribute the same, in the sense that each value is equally likely to appear as a result of $X$ and of $Y$ . We denote this by $Y\sim X$ . Often, we don’t really care about the details of defining a random variable, but only about how it distributes. The same distribution can describe many different phenomena, so it is valuable to consider distributions on the abstract.

The Math of Block Creation: the Poisson and Exponential Distributions

We now describe two distributions that are by far the important distributions in the world of proof-of-work.

Imagine you are in a bus station, waiting for a bus that comes once every ten minutes. The busses aren’t regularly spaced, but instead every microsecond someone flips a coin to decide whether to send out a bus, and the probability that a bus is sent out is such that, on average, a bus goes out every ten minuts.

The Poisson distribution describes the number of busses that one would see in a span of ten minutes, while the exponential distribution describes the time that passes between two consecutive busses.

Granted busses are a weird example here, but there are other examples where these distributions fit more naturally, such as observing alpha particles emitted from a decaying radioactive matter that’s supposed to emit one particle per ten minutes, considering the distribution of typos in a book, or considering the block production of a blockchain.

As a rule of thumb, a continuous process is a Poisson Process if the probability that something happens at a given moment is independent of the probability it happens in another moment. The decaying particles do not care how long it has been since the last alpha particle, and the block production process does not care how long you have been mining so far.

We select some unit of time, say minutes, and let $\kappa$ denote the number of times we expect the event to happen within this time unit. For example, to describe a block once per ten minutes we set $\kappa=0.1$ . We call $\kappa$ the parameter and denote these distributions as $Poi(\kappa)$ and $Exp(\kappa)$ .

So how do they distribute? The key observation is that these are the only distributions that satisfy the independence property I described above. I will not do this here, but from this property alone one can compute that $\mathbb{P}\left[Poi(\kappa)=n\right]=\frac{\kappa^{n}\cdot e^{-\kappa}}{n!}\text{.}$

It is comfortable to generalize this a bit. Let $\lambda$ be the block delay (in minutes), how does the number of blocks created within $D$ minutes distribute? To do so, we “change units”. Instead of describing the block delay as $\lambda$ minutes, we describe it as $\lambda/D$ periods of length $D$ minutes. In other words, in every period of $D$ minutes we expect $D/\lambda$ blocks, so the number of blocks we will see distributes as $Poi(D/\lambda)$ . Combining this all, we get this handy formula: the probability to see $n$ blocks within $D$ minutes if the block delay is $\lambda$ is given by: $\mathbb{P}\left[Poi\left(\frac{D}{\lambda}\right)=n\right]=\frac{\left(D/\lambda\right)^{n}\cdot e^{-D/\lambda}}{n!}\text{.}$

So for example, to work out the probability the Bitcoin network produce six blocks in the next hour we set $n=6$ , $\lambda=10$ and $D=60$ to get the answer $\mathbb{P}\left[Poi\left(\frac{60}{10}\right)=6\right]=\frac{6^{6}\cdot e^{-6}}{6!}\approx16\%\text{.}$

This is already quite surprising! We know that the network should prepare six blocks an hour, yet the probability this actually happens is eerily low. This is an example of just how noisy the block creation process are. To see more striking examples, we want to talk about how the delay between consecutive blocks distributes. This is described by the exponential distribution.

The exponential distribution is a bit harder to describe, because it is continuous. The probability that the delay between two blocks is exactly some value is 0. The best we can do is to estimate what is the probability that it falls in some range. I will again skip the calculation, but for any $0\le s<t \le \infty$ the probability is $\mathbb{P}\left[s\le Exp(\kappa)\le t\right]=e^{-s\kappa}-e^{-t\kappa}=\text{.}$

In particular, if $\lambda$ is the block delay (how many minutes we expect to wait between two consecutive blocks), then $\kappa$ (how many blocks we expect to see every minute) is $1/\lambda$ . For example, the probability that the delay between two consecutive blocks is between 8 and 12 minutes is $\mathbb{P}\left[8\le Exp(1/10)\le12\right]=e^{-8/10}-e^{-12/10}\approx14.8\%\text{.}$

For people who have not seen this before, this should be at least a bit surprising. Despite an average delay of 10 minutes, we see that the probability of being within two minutes of this duration is actually quite small. The probability that the deviation from ten minutes is at most six and a half minutes is about one half.

Indeed, the block creation process is noisy, and that is one of the reasons we want to wait a reasonable amount of blocks before feeling secure, because the probability that our chain reverts by pure chance alone is not-negligible at all when there are just a few blocks. Taking the average over a large amount of blocks reduces this high variance.

As a final exercise, let us find the $t$ for which the probability that the block delay is at most $t$ is exactly $1/2$ . Do you have any guess what that $t$ might be? The probability that the block delay is at most $t$ is given by $\mathbb{P}\left[0\le Exp(1/10)\le t\right]=e^{-0/10}-e^{-t/10}=1-e^{-t/10}\text{.}$ so we want to solve the equation $1-e^{-t/10}=\frac{1}{2}$ for $t$ . I will leave the details to you but the result is $t=10\cdot\ln2\approx10\cdot0.7$ .

This is remarkable! Even though the average delay between two consecutive blocks is 10 minutes, for any particular pair of consecutive blocks, it is more likely that the time between them is less than eight minutes than it is more than eight minutes! The intuition is that these high likelihood pairs of blocks with short delays between them are offset by rare blocks with very long delays between them. For example, the probability that two consecutive blocks have a delay of more than an hour between them is given by $\mathbb{P}\left[60\le Exp(1/10)\le\infty\right]=e^{-60/10}-e^{-\infty/10}=e^{-6}\approx0.2\%\text{.}$

This might seem small but it really isn’t. This happens once every 500 blocks, and given that 144 blocks are created on an average day, this means that, on average, we should see consecutive blocks more than an hour apart about twice a week.

Finally, we can get a nicer expression by using a multiplicative approach. We can ask ourselves what is the probability that the block delay is at most twice smaller to twice larger than the expected block delay. A bit more generally, for a given $a>1$ we have that $\mathbb{P}\left[\frac{10}{a}\le Exp\left(1/10\right)\le a\cdot10\right]=e^{-1/a}-e^{-a}\text{.}$

In particular, for $a=2$ this comes out about 47%, showing that most block delays are outside this interval! This is the plot of the probability as a function of $a$ :

We are not surprised to see the probability starting at $0$ for $a=1$ and tapering off to $1$ as $a$ goes to infinity. However, what should surprise you is how slowly this happens (note that the xx axis is logarithmically scaled.) Even at $a=10$ , we have that the probability is around 90%. Close to 10% of the block delays are shorter than 1 minute or longer than 100 minutes! Even at $a=100$ the probability is just around 99%, telling us that in an average day we should expect to see at least one block delay shorter than 6 seconds or longer than 16 hours!

So why don’t we ever see 16 hours block delay in the wild? Is our math wrong. No, it isn’t, and the answer is actually right in front of you! But I’ll let the reader figure that one out for themselves.

Enjoyed reading this article?

Comments

No comments yet!

Mathematical Background for Consensus (Understanding GHOSTDAG, appendix)