Academic Word List

A Project for Teaching and Learning Academic English Vocabulary

Friday, 12 06th

Last updateFri, 05 Sep 2014 10am

Mathematical Foundations of Computer Networking


Mathematical Foundations of Computer Networking


Graduate students, researchers, and practitioners in the field of computer networking often require a firm conceptual understanding of one or more of its theoretical foundations. Knowledge of optimization, information theory, game theory, control theory, and queueing theory is assumed by research papers in the field. Yet these subjects are not taught in a typical computer science undergraduate curriculum. This leaves only two alternatives: either to study these topics on one’s own from standard texts or take a remedial course. Neither alternative is attractive. Standard texts pay little attention to computer networking in their choice of problem areas, making it a challenge to map from the text to the problem at hand. And it is inefficient to require students to take an entire course when all that is needed is an introduction to the topic.


This book addresses these problems by providing a single source to learn about the mathematical foundations of computer networking. Assuming only a rudimentary grasp of calculus, it provides an intuitive yet rigorous introduction to a wide range of mathematical topics. The topics are covered in sufficient detail so that the book will usually serve as both the first and ultimate reference. Note that the topics are selected to be complementary to those found in a typical undergraduate computer science curriculum. The book, therefore, does not cover network foundations such as discrete mathematics, combinatorics, or graph theory. Each concept in the book is described in four ways: intuitively; using precise mathematical notation; with a carefully chosen numerical example; and with a numerical exercise to be done by the reader. This progression is designed to gradually deepen understanding. Nevertheless, the depth of coverage provided here is not a substitute for that found in standard textbooks. Rather, I hope to provide enough intuition to allow a student to grasp the essence of a research paper that uses these theoretical foundations.




The chapters in this book fall into two broad categories: foundations and theories. The first five foundational chapters cover probability, statistics, linear algebra, optimization, and signals, systems and transforms. These chapters provide the basis for the four theories covered in the latter half of the book: queueing theory, game theory, control theory, and information theory. Each chapter is written to be as self-contained as possible. Nevertheless, some dependencies do exist, as shown in Figure 1, where light arrows show weak dependencies and bold arrows show strong dependencies.

Chapter organization


Using this book


The material in this book can be completely covered in a sequence of two graduate courses, with the first course focusing on the first five chapters and the second course on the latter four. For a single-semester course, some possible alternatives are to cover:


Probability, statistics, queueing theory, and information theory


Linear algebra, signals, systems and transforms, control theory and game theory


Linear algebra, signals, systems and transforms, control theory, selected portions of probability, and information theory


Linear algebra, optimization, probability, queueing theory, and information theory


This book is designed for self-study. Each chapter has numerous solved examples and exercises to reinforce concepts. My aim is to ensure that every topic in the book should be accessible to the perservering reader.




I have benefitted immensely from the comments of dedicated reviewers on drafts of this book. Two in particular who stand out are Alan Kaplan, whose careful and copious comments improved every aspect of the book, and Prof. Johnny Wong, who not only reviewed multiple drafts of the chapters on probability on statistics, but also used a draft to teach two graduate courses at the University of Waterloo.


I would also like to acknowledge the support I received from experts who reviewed individual chapters: Augustin Chaintreau, Columbia (probability and queueing theory), Tom Coleman, Waterloo (optimization), George Labahn, Waterloo (linear algebra), Kate Larson, Waterloo (game theory), Abraham Matta, Boston University (statistics, signals, systems, transforms, and control theory), Sriram Narasimhan, Waterloo (control theory), and David Tse, UC Berkeley (information theory). I received many corrections from my students at the University of Waterloo who took two courses based on book drafts in Fall 2008 and Fall 2011. These are: Andrew Arnold, Nasser Barjesteh, Omar Beg, Abhirup Chakraborty, Betty Chang, Leila Chenaei, Francisco Claude, Andy Curtis, Hossein Falaki, Leong Fong, Bo Hu, Tian Jiang, Milad Khalki, Robin Kothari, Alexander Laplante, Constantine Murenin, Earl Oliver, Sukanta Pramanik, Ali Rajabi, Aaditeshwar Seth, Jakub Schmidtke, Kanwaljit Singh, Kellen Steffen, Chan Tang, Alan Tsang, Navid Vafei, and Yuke Yang. Last but not the least, I would never have completed this book were it not for the unstinting support and encouragement from every member of my family for the last four years. Thank you.


S. Keshav


Waterloo, October 2011.



CHAPTER 1 Probability

1.1 Introduction


The concept of probability pervades every aspect of our life. Weather forecasts are couched in probabilistic terms, as are economic predictions and even outcomes of our own personal decisions. Designers and operators of computer networks need to often think probabilistically, for instance, when anticipating future traffic workloads or computing cache hit rates. From a mathematical standpoint, a good grasp of probability is a necessary foundation to understanding statistics, game theory, and information theory. For these reasons, the first step in our excursion into the mathematical foundations of computer networking is to study the concepts and theorems of probability. This chapter is a self-contained introduction to the theory of probability. We begin by introducing the elementary concepts of outcomes, events, and sample spaces, which allows us to precisely define the conjunctions and disjunctions of events. We then discuss concepts of conditional probability and Bayes’ rule. This is followed by a description of discrete and continuous random variables, expectations and other moments of a random variable, and the Moment Generating Function. We discuss some standard discrete and continuous distributions and conclude with some useful theorems of probability and a description of Bayesian networks.

Note that in this chapter, as in the rest of the book, the solved examples are an essential part of the text. They provide a concrete grounding for otherwise abstract concepts and are necessary to understand the material that follows.


1.1.1 Outcomes


The mathematical theory of probability uses terms such as ‘outcome’ and ‘event’ with meanings that differ from those in common practice. Therefore, we first introduce a standard set of terms to precisely discuss probabilistic processes. These terms are shown in bold font below. We will use the same convention to introduce other mathematical terms in the rest of the book. Probability measures the degree of uncertainty about the potential outcomes of a process. Given a set of distinct and mutually exclusive outcomes of a process, denoted , called the sample space S, the probability of any outcome,denoted P(oi), is a real number between 0 and 1, where 1 means that the outcome will surely occur, 0 means that it surely will not occur, and intermediate values reflect the degree to which one is confident that the outcome will or will not occur1. We assume that it is certain that some element in S occurs. Hence, the elements of S describe all possible outcomes and the sum of probability of all the elements of S is always 1.

1. Strictly speaking, S must be a measurable field. {o1, o2, …} σ


DRAFT - Version 3 - Introduction




Imagine rolling a six-faced die numbered 1 through 6. The process is that of rolling a die and an outcome is the number shown on the upper horizontal face when the die comes to rest. Note that the outcomes are distinct and mutually exclusive because there can be only one upper horizontal face corresponding to each throw. The sample space is S = {1, 2, 3, 4, 5, 6} which has a size . If the die is fair, each outcome is equally likely and the probability of each outcome is .




Imagine throwing a dart at random on to a dartboard of unit radius. The process is that of throwing a dart and the outcome is the point where the dart penetrates the dartboard. We will assume that this is point is vanishingly small, so that it can be thought of as a point on a two-dimensional real plane. Then, the outcomes are distinct and mutually exclusive.


The sample space S is the infinite set of points that lie within a unit circle in the real plane. If the dart is thrown truly randomly, every outcome is equally likely; because there are an infinity of outcomes, every outcome has a probability of zero. We need special care in dealing with such outcomes. It turns out that, in some cases, it is necessary to interpret the probability of the occurrence of such an event as being vanishingly small rather than exactly zero. We consider this situation in greater detail in Section 1.1.5 on page 4. Note that although the probability of any particular outcome is zero, the probability associated with any subset of the unit circle with area a is given by , which tends to zero as a tends to zero.


1.1.2 Events


The definition of probability naturally extends to any subset of elements of S, which we call an event, denoted E. If the sample space is discrete, then every event E is an element of the power set of S, which is the set of all possible subsets of S. The probability associated with an event, denoted , is a real number and is the sum of the probabilities associated with the outcomes in the event.




Continuing with Example 1, we can define the event “the roll of a die results in a odd-numbered outcome.” This corresponds to the set of outcomes {1,3,5}, which has a probability of . We write P({1,3,5}) = 0.5.


1.1.3 Disjunctions and conjunctions of events


Consider an event E that is considered to have occurred if either or both of two other events or occur, where both events are defined in the same sample space. Then, E is said to be the disjunction or logical OR of the two events denoted and read “ or .”



DRAFT - Version 3 -Axioms of probability

Continuing with Example 1, we define the events = “the roll of a die results in an odd-numbered outcome” and = “the roll of a die results in an outcome numbered less than 3.” Then, and and In contrast, consider event E that is considered to have occurred only if both of two other events or occur, where both are in the same sample space. Then, E is said to be the conjunction or logical AND of the two events denoted and read “ and .”. When the context is clear, we abbreviate this to .




Continuing with Example 4, Two events Ei and Ej in S are mutually exclusive if only one of the two may occur simultaneously. Because the events have no outcomes in common, = 0. Note that outcomes are always mutually exclusive but events need not be so.


1.1.4 Axioms of probability


One of the breakthroughs in modern mathematics was the realization that the theory of probability can be derived from just a handful of intuitively obvious axioms. Several variants of the axioms of probability are known. We present the axioms as stated by Kolmogorov to emphasize the simplicity and elegance that lie at the heart of probability theory:


1. , that is, the probability of an event lies between 0 and 1.


2. P(S) = 1, that is, it is certain that at least some event in S will occur.


3. Given a potentially infinite set of mutually exclusive events E1, E2,...


(EQ 1) That is, the probability that any one of the events in the set of mutually exclusive events occurs is the sum of their individual probabilities. For any finite set of n mutually exclusive events, we can state the axiom equivalently as:


(EQ 2) An alternative form of Axiom 3 is:


(EQ 3) This alternative form applies to non-mutually exclusive events.




Continuing with Example 1, we define the mutually exclusive events {1,2} and {3,4} which both have a probability of 1/3.




Continuing with Example 1, we define the non-mutually exclusive events {1,2} and {2,3} which both have a probability of 1/3. Then,


1.1.5 Subjective and objective probability


The axiomatic approach is indifferent as to how the probability of an event is determined. It turns out that there are two distinct ways in which to determine the probability of an event. In some cases, the probability of an event can be derived from counting arguments. For instance, given the roll of a fair die, we know that there are only six possible outcomes, and that all outcomes are equally likely, so that the probability of rolling, say, a 1, is 1/6. This is called its objective probability. Another way of computing objective probabilities is to define the probability of an event as being the limit of a counting process, as the next example shows.




Consider a measurement device that measures the packet header types of every packet that crosses a link. Suppose that during the course of a day the device samples 1,000,000 packets and of these 450,000 packets are UDP packets, 500,000 packets are TCP packets, and the rest are from other transport protocols. Given the large number of underlying observations, to a first approximation, we can consider that the probability that a randomly selected packet uses the UDP protocol to be 450,000/1,000,000 = 0.45. More precisely, we state: where UDPCount(t) is the number of UDP packets seen during a measurement interval of duration t, and TotalPacket- Count(t) is the total number of packets seen during the same measurement interval. Similarly P(TCP) = 0.5.


Note that in reality the mathematical limit cannot be achieved because no packet trace is infinite. Worse, over the course of a week or a month the underlying workload could change, so that the limit may not even exist. Therefore, in practice, we are forced to choose ‘sufficiently large’ packet counts and hope that the ratio thus computed corresponds to a probability. This approach is also called the frequentist approach to probability. In contrast to an objective assessment of probability, we can also use probabilities to characterize events subjectively.




Consider a horse race where a favoured horse is likely to win, but this is by no means assured. We can associate a subjective probability with the event, say 0.8. Similarly, a doctor may look at a patient’s symptoms and associate them with a 0.25 probability of a particular disease. Intuitively, this measures the degree of confidence that an event will occur, based on expert knowledge of the situation that is not (or cannot be) formally stated.How is subjective probability to be determined? A common approach is to measure the odds that a knowledgeable person would bet on that event. Continuing with the example, if a bettor really thought that the favourite would win with a probability of 0.8, then the bettor should be willing to bet module under the terms: if the horse wins, the bettor gets module.25, and if the horse loses, the bettor gets {module [168]}. With this bet, the bettor expects to not lose money, and if the reward is greater than module.25, the bettor will expect to make money. So, we can elicit the implicit subjective probability by offering a high reward, and then lowering it until the bettor is just about to walk away, which would be at the module.25 mark.


The subjective and frequentist approaches interpret zero-probability events differently. Consider an infinite sequence of successive events. Any event that occurs only a finite number of times in this infinite sequence will have a frequency that can be made arbitrarily small. In number theory, we do not and cannot differentiate between a number that can be made arbitrarily small and zero. So, from this perspective, such an event can be considered to have a probability of occurrence of zero even though it may occur a finite number of times in the sequence. From a subjective perspective, a zero-probability event is defined as an event E such that a rational person would be willing to bet an arbitrarily large but finite amount that E will not occur. More concretely, suppose this person were to receive a reward of module if E did not occur but would have to forfeit a sum of $F if E occurred. Then, the bet would be taken for any finite value of F.


1.2 Joint and conditional probability


Thus far, we have defined the terms used in studying probability and considered single events in isolation. Having set this foundation, we now turn our attention to the interesting issues that arise when studying sequences of events. In doing so, it is very important to keep track of the sample space in which the events are defined: a common mistake is to ignore the fact that two events in a sequence may be defined on different sample spaces.


1.2.1 Joint probability


Consider two processes with sample spaces and that occur one after the other. The two processes can be viewed as a single joint process whose outcomes are the tuples chosen from the product space . We refer to the subsets of the product space as joint events. Just as before, we can associate probabilities with outcomes and events in the product space. To keep things straight, in this section, we denote the sample space associated with a probability as a subscript, so that denotes the probability of event E defined over sample space and is an event defined over the product space .




Consider sample space and sample space . Then, the product space is given by . If these events are equiprobable, then the probability of


each tuple is . Let be an event in and be an event in . Then the event EF is given by the tuples {(1, b), (2, b)} and has probability .


We will return to the topic of joint processes in Section 1.8 on page 31. We now turn our attention to the concept of conditional probability.


1.2.2 Conditional probability


Common experience tells us that if a sky is sunny, there is no chance of rain in the immediate future, but that if it is cloudy, it may or may not rain soon. Knowing that the sky is cloudy, therefore, increases the chance that it may rain soon, compared to the situation when it is sunny. How can we formalize this intuition? To keep things simple, first consider the case when two events E and F share a common sample space and occur one after the other. Suppose that the probability of E is and the probability of F is . Now, suppose that we are informed that event E actually occurred. By definition, the conditional probability of the event F conditioned on the occurrence of event E is denoted (read “the probability of F given E”) and computed as: (EQ 4)


If knowing that E occurred does not affect the probability of F, E and F are said to be independent and EXAMPLE 11: CONDITIONAL PROBABILITY OF EVENTS DRAWN FROM THE SAME SAMPLE SPACE Consider sample space and events and. Let and . Clearly, the space . The joint event . Suppose . Then, . We interpret this to mean that if event E occurred, then the probability that event F occurs is 0.6. This is higher than the probability of F occurring on its own (which is 0.25). Hence, the fact the E occurred improves the chances of F occurring, so the two events are not independent. This is also clear from the fact that.


The notional of conditional probability generalizes to the case where events are defined on more than one sample space. Consider a sequence of two processes with sample spaces and that occur one after the other (this could be the condition of the sky now, for instance, and whether or not it rains after two hours). Let event E be a subset of and let event F be a subset of . Suppose that the probability of E is and the probability of F is . Now, suppose that we are informed that event E actually occurred. We define the probability as the conditional probability of the event F conditional on the occurrence of E as: (EQ 5) If knowing that E occurred does not affect the probability of F, E and F are said to be independent and (EQ 6)




Consider sample space and sample space with product space . Let be an event in and be an event in . Also, let = 0.5 and let = 0.05. If E and F are independent then: Otherwise, It is important not to confuse P(F|E) and P(F). The conditional probability is defined in the product space and the unconditional probability in the space . Explicitly keeping track of the underlying sample space can help avoid apparent contradictions such as the one discussed in Example 14.




Consider a device that samples packets on a link, as in Example 8. Suppose that measurements show that 20% of the UDP packets have a packet size of 52 bytes. Let P(UDP) denote the probability that the packet is of type UDP and let P(52) denote the probability that the packet is of length 52 bytes. Then, P(52|UDP) = 0.2. In Example 8, we computed that P(UDP) = 0.45. Therefore, P(UDP AND 52) = P(52|UDP) * P(UDP) = 0.2 * 0.45 = 0.09. That is, if we were to pick a packet at random from the sample, there is a 9% chance that is a UDP packet of length 52 bytes (but it has a 20% chance of being of length 52 bytes if we know already that it is a UDP packet).



Consider a television show (loosely modelled on a similar show hosted by Monty Hall) where three identical doors hide two goats and a luxury car. You, the contestant, can pick any door and obtain the prize behind it. Assume that you prefer the car to the goat. If you did not have any further information, your chance of picking the winning door is clearly 1/3. Now, suppose that after you pick one of the doors, say Door 1, the host opens one of the other doors, say Door 2, and reveals a goat behind it. Should you switch your choice to Door 3 or stay with Door 1?


We can view the Monty Hall problem as a sequence of three processes. The first process is the placement of a car behind one of the doors. The second is the selection of a door by the contestant and the third process is the revelation of what lies behind one of the other doors. The sample space for the first process is {Door 1, Door 2, Door 3} abbreviated {1, 2, 3}, as are the sample spaces for the second and third processes. So, the product space is {(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1),..., (3, 3, 3)}.


Without loss of generality, assume that you pick Door 1. The game show host’s hand is now forced: he has to pick either Door 2 or Door 3. Without loss of generality, suppose that the host picks Door 2, so that the set of possible outcomes that constitutes the reduced sample space is {(1, 1, 2), (2, 1, 2), (3, 1, 2)}. However, we know that the game show host will never open a door with a car behind it - only a goat. Therefore, the outcome (2, 1, 2) is not possible. So, the reduced sample space is just the set {(1, 1, 2), (3, 1, 2)}. What are the associated probabilities?


To determine this, note that the initial probability space is {1, 2, 3} with equiprobable outcomes. Therefore, the outcomes


{(1, 1, 2), (2, 1, 2), (3, 1, 2)} are also equiprobable. When the game show host makes his move to open Door 2, he reveals private information that the outcome (2, 1, 2) is impossible, so the probability associated with this outcome is 0. The show host’s forced move cannot affect the probability of the outcome (1, 1, 2) because the host never had the choice of opening


Door 1 once you selected it. Therefore, its probability in the reduced sample space continues to be 1/3. This means that P({(3, 1, 2)} = 2/3, so it doubles your chances for you to switch doors.


One way to understand this somewhat counter intuitive result is to realize that the game show host’s actions reveal private information, that is, the location of the car. Two-thirds of the time, the prize is behind the door you did not choose. The host always opens a door that does not have a prize behind it. Therefore, the residual probability (2/3) must all be assigned to Door 3. Another way to think of it is that if you repeat a large number of experiments with two contestants, one who never switches doors and the other who always switches doors, then the latter would win twice as often.


1.2.3 Bayes’ rule


One of the most widely used rules in the theory of probability is due to an English country minister Thomas Bayes. Its significance is that it allows us to infer ‘backwards’ from effects to causes, rather than from causes to effects. The derivation of his rule is straightforward, though its implications are profound. We begin with the definition of conditional probability (Equation 4): If the underlying sample spaces can be assumed to be implicitly known, we can rewrite this as:


(EQ 7) We interpret this to mean that the probability that both E and F occur is the product of the probabilities of two events, first, that E occurs, and second, that conditional on E, F occurs. Recall that P(F|E) is defined in terms of the event F following event E. Now, consider the converse: F is known to have occurred–what is the probability that E occurred? This is similar to the problem: if there is fire, there is smoke, but if we see smoke, what is the probability that it was due to a fire? The probability we want is P(E|F). Using the definition of conditional probability it is given by:


(EQ 8) Substituting for P(F) from Equation 7, we get


(EQ 9) which is Bayes’ rule. One way of interpreting this is that it allows us to compute the degree to which some effect or posterior


F can be attributed to some cause or prior E.


We can generalize Bayes’ rule when a posterior can be attributed to more than one prior. Consider a posterior F that is due to some set of n priors Ei such that the priors are mutually exclusive and exhaustive (that is, at least one of them occurs and only one of them can occur). This implies that. Then, (EQ 10)


This is also called the Law of Total Probability.




Continuing with Example 13, let us compute P(52), that is, the probability that a packet sampled at random has a length of 52 bytes. To compute this, we need to know the packet sizes for all other traffic types. For instance, if P(52| TCP) = 0.9 and all other packets were known to be of length other than 52 bytes, then P(52) = P(52|UDP) * P(UDP) + P(52|TCP) * P(TCP) + P(52|other) * P(other) = 0.2 * 0.45 + 0.9 * 0.5 + 0 = 0.54. The law of total probability allows one further generalization of Bayes’ rule to obtain Bayes’ Theorem. From the definition of conditional probability, we have: From Equation 7, Substituting Equation 10, we get (EQ 11) This is called the generalized Bayes’ rule or Bayes’ Theorem. It allows us to compute the probability of any one of the priors Ei, conditional on the occurrence of the posterior F. This is often interpreted as follows: we have some set of mutually exclusive and exhaustive hypotheses Ei. We conduct an experiment, whose outcome is F. We can then use Bayes’ formula to compute the revised estimate for each hypothesis.


Random variables


From Bayes’ rule: Thus, if we see a packet that is not 52 bytes long, it is quite likely that it is a UDP packet. Intuitively, this must be true because most TCP packets are 52 bytes long, and there aren’t very many non-UDP and non-TCP packets.


1.3 Random variables


So far, we have restricted our consideration to studying events, which are collections of outcomes of experiments or observations.


However, we are often interested in abstract quantities or outcomes of experiments that are derived from events and observations, but are not themselves events or observations. For example, if we throw a fair die, we may want to compute the probability that the square of the face value is smaller than 10. This is random and can be associated with a probability, and, moreover, depends on some underlying random events. Yet, it is neither an event nor an observation: it is a random variable. Intuitively, a random variable is a quantity that can assume any one of a set of values (called its domain D) and whose value can only be stated probabilistically. In this section, we will study random variables and their distributions. More formally, a real random variable (the one most commonly encountered in applications having to do with computer networking) is a mapping from events in a sample space S to the domain of real numbers. The probability associated with each value assumed by a real random variable2 is the probability of the underlying event in the sample space as illustrated in


Figure 1.


Figure 1. The random variable X takes on values from the domain D. Each value taken on by the random variable is associated with a probability corresponding to an event E, which is a subset of outcomes in the sample space S.


A random variable is discrete if the set of values it can assume is finite and countable. The elements of D should be mutually exclusive (that is, the random variable cannot simultaneously take on more than one value) and exhaustive (the random variablecannot assume a value that is not an element of D).


DRAFT - Version 3 -Distribution


11 Consider a random variable I defined as the size of an IP packet rounded up to closest kilobyte. Then, I assumes values from the domain D = {1,2,3,..., 64}. This set is both mutually exclusive and exhaustive. The underlying sample space S is the set of potential packet sizes and is therefore identical to D. The probability associated with each value of I is the probability of seeing an IP packet of that size in some collection of IP packets, such as a measurement trace. A random variable is continuous if the values it can take on are a subset of the real line.




Consider a random variable T defined as the time between consecutive packet arrivals at a port of a switch (also called the packet interarrival time). Although each packet’s arrival time is quantized by the receiver’s clock, so that the set of interarrival times are finite and countable, given the high clock speeds of modern systems, modelling T as a continuous random variable is a good approximation of reality. The underlying sample space S is the subset of the real line that spans the smallest and largest possible packet interarrival times. As in the previous example, the sample space is identical to the domain of T.


1.3.1 Distribution


In many cases, we are not interested in the actual value taken by a random variable, but in the probabilities associated with each such value that it can assume. To make this more precise, consider a discrete random variable that assumes distinct values D = {x1, x2,..., xn}. We define the value p(xi) to be the probability of the event that results in assuming the value xi. The function p( ), which characterizes the probability that will take on each value in its domain is called the probability mass function of 3. It is also sometimes called the distribution of .




Consider a random variable H defined as 0 if fewer than 100 packets are received at a router’s port in a particular time interval T and 1 otherwise. The sample space of outcomes consists of all possible numbers of packets that could arrive at the router’s port during T, which is simply the set where M is the maximum number of packets that can be received in time T. Assuming M > 99, we define two events and . Given the probability of each outcome in S, we can compute the probability of each event, and . By definition, and . The set is the probability mass function of H.


Notice how the probability mass function is closely tied to events in the underlying sample space.


Unlike a discrete random variable, which has non-zero probability of taking on any particular value in its domain, the probability that a continuous real random variable will take on any specific value in its domain is 0. Nevertheless, in nearly all cases of interest in the field of computer networking, we will be able to assume that we can define the density function f(x) of as follows: the probability that takes on a value between two reals x1 and x2, is given by the integral. Of course, we need to ensure that . Alternatively, we can think of f(x) being implicitly 3. Note the subtlety in this standard notation. Recall that P(E) is the probability of an event E. In contrast, p(X) refers to the distribution of a random variable X, and refers to the probability that random variable X takes on the value xi.




Suppose we know that packet interarrival times are distributed uniformly in the range [0.5s, 2.5s]. The corresponding density function is a constant c over the domain. It is easy to see that c = 0.5 because we require . The probability that the interarrival time is in the interval is therefore.


1.3.2 Cumulative density function


The domain of a discrete real random variable is totally ordered (that is, for any two values x1 and x2 in the domain, either or ). We define the cumulative density function F( ) by:

(EQ 12) Note the difference between F( ), which denotes the cumulative distribution of random variable , and F(x), which is the value of the cumulative distribution for the value = x. Similarly, the cumulative density function of a continuous random variable , denoted F( ) is given by:


(EQ 13) By definition of probability, in both cases ,




Consider a discrete random variable D that can take on values {1, 2, 3, 4, 5} with probabilities {0.2, 0.1, 0.2, 0.2, 0.3} respectively. The latter set is also the probability mass function of D. Because the domain of D is totally ordered, we compute the cumulative density function F(D) as F(1) = 0.2, F(2) = 0.3, F(3) = 0.5, F(4) = 0.7, F(5) = 1.0.


Now, consider a continuous random variable C defined by the density function f(x) = 1 in the range [0,1]. The cumulative density function F(C) = . We see that, although, for example, f(0.1) = 1, this does not mean that the value 0.1 is certain!


Note that, by definition of cumulative density function, it is necessary that it achieve a value of 1 at right extreme value of the domain.


1.3.3 Generating values from an arbitrary distribution


The cumulative density function F(X), where X is either discrete or continuous, can be used to generate values drawn from the underlying discrete or continuous distribution p(Xd) or f(Xc) as illustrated in Figure 2.


Figure 2. Generating values from an arbitrary (a) discrete or (b) continuous distribution. Consider a discrete random variable that takes on values with probabilities . By definition,


. Moreover, always lies in the range [0,1]. Therefore, if we were to generate a random number u with uniform probability in the range [0,1], the probability that u lies in the range is .


Moreover, . Therefore, the procedure to generate values from the discrete distribution p(Xd) is as follows: first, generate a random variable u uniformly in the range [0,1]; second, compute .


We can use a similar approach to generate values from a continuous random variable with associated density function f(Xc). By definition, for very small values of . Moreover, always lies in the range [0,1].


Therefore, if we were to generate a random number u with uniform probability in the range [0,1], the probability that u lies in the range is , which means that is distributed according to the desired density function


f(Xc). Therefore, the procedure to generate values from the continuous distribution f(Xc) is as follows: first, generate a random variable u uniformly in the range [0,1]; second, compute .


1.3.4 Expectation of a random variable


The expectation, mean or expected value E[ ] of a discrete random variable that can take on n values xi with probability p(xi) is given by:


(EQ 14) Similarly, the expectation E[ ] of a continuous random variable with density function f(x) is given by


(EQ 15) Intuitively, the expected value of a random variable is the value we expect it to take, knowing nothing else about it. For instance, if you knew the distribution of the random variable corresponding to the time it takes for you to travel from your home to work, then, on a typical day, you expect your commute time to be the expected value of this random variable.




Continuing with the random variables C and D defined in Example 22, we find E[D] = 1*0.2 + 2*0.1 + 3*0.2 + 4*0.2 + 5*0.3 = 0.2 + 0.2 + 0.6 + 0.8 + 1.5 = 3.3. Note that the expected value of D is actually a value it cannot assume! This is often true of discrete random variables. One way to interpret this is that D will take on values ‘close’ to its expected value, in this case, 3 or 4.

Similarly, C is the uniform distribution and its expected value is the midpoint of the domain, i.e. 0.5. The expectation of a random variable gives us a reasonable idea of how it behaves in the long run. It is important to remember, however, that two random variables with the same expectation can have rather different behaviours. We now state, without proof, some useful properties of expectations.


1. For constants a and b: E[aX + b] = aE[X] + b (EQ 16)


2. E[X+Y] = E[X] + E[Y], or, more generally, for any set of random variables : (EQ 17)


3. For a discrete random variable with probability mass function p(xi) and any function g(.), E[g( )] = (EQ 18)


4. For a continuous random variable with density function f(x), and any function g(.), E[g(C)] = (EQ 19)


Note that, in general, E[g( )] is not the same as g(E[X]), that is, a function cannot be ‘taken out’ of the expectation.




Consider a discrete random variable D that can take on values {1, 2, 3, 4, 5} with probabilities {0.2, 0.1, 0.2, 0.2, 0.3} respectively. Then, =


Let X be a random variable that has equal probability of lying anywhere in the interval [0,1]. Then, .


1.3.5 Variance of a random variable


The variance of a random variable is defined by V(X) = E[(X-E[X])2]. Intuitively, it shows how ‘far away’ the values taken on by a random variable would be from its expected value. We can express the variance of a random variable in terms of two expectations as V(X) = E[X2] - In practical terms, the distribution of a random variable over its domain D (this domain is also called the population) is not usually known. Instead, the best that we can do is to sample the values it takes on by observing its behaviour over some period of time. We can estimate the variance of the random variable from the array of sample values by keeping running counters for and . Then, , where this approximation improves with n, the size of the sample as a consequence of the law of large numbers, discussed in Section 1.7.4 on page 29.


The following properties of the variance of a random variable can be easily shown for both discrete and continuous random variables.


1. For constant a, V[X+a] = V[X] (EQ 20)


2. For constant a, V[aX] = a2V[X] (EQ 21)


3. If X and Y are independent random variables, V[X+Y] = V[X] + V[Y] (EQ 22)


1.4 Moments and moment generating functions


We have focused thus far on elementary concepts of probability. To get to the next level of understanding, it is necessary to dive into the somewhat complex topic of moment generating functions. The moments of a distribution generalize its mean and variance. In this section we will see how we can use a moment generating function (abbreviated MGF) to compactly represent all the moments of a distribution. The moment generating function is interesting not only because it allows us to prove some useful results, such as the Central Limit Theorem, but also because it is similar in form to the Fourier and Laplace transforms that are discussed in Chapter 5.


1.4.1 Moments


The moments of a distribution are a set of parameters that summarize it. Given a random variable X, its first moment about the origin denoted is defined to be E[X]. Its second moment about the origin, denoted is defined as the expectedvalue of the random variable X2, i.e., E[X2]. In general, the rth moment of X about the origin, denoted , is defined as . We can similarly define the rth moment about the mean, denoted , by E[(X-μ)r]. Note that the variance of the distribution, denoted by σ2 or V[X] is the same as . The third moment about the mean is used to construct a measure of skewness (which describes whether the probability mass is more to the left or the right of the mean, compared to a normal distribution) and the fourth moment about the mean is used to construct a measure of peakedness or kurtosis, which measures the ‘width’ of a distribution. The two definitions of a moment are related. For example, we have already seen that the variance of X, denoted V[X], can be computed as V[X] = E[X2] - (E[X])2. Therefore, . Similar relationships can be found between the higher moments by writing out the terms of the binomial expansion of (X-μ)r.


1.4.2 Moment generating functions


Except under some pathological conditions, a distribution can be thought to be uniquely represented by its moments. That is, if two distributions have the same moments, then, except under some rather unusual circumstances, they will be identical. Therefore, it is convenient to have an expression (or ‘fingerprint’) that compactly represents all the moments of a distribution. Such an expression should have terms corresponding to μr ’ for all values of r. We can get a hint regarding a suitable representation from the expansion of ex:


(EQ 23) We see that there is one term for each power of x. This motivates the definition of the moment generating function (MGF) of a random variable X as the expected value of etX, where t is an auxiliary variable:


(EQ 24) To see how this represents the moments of a distribution, we expand M(t) as


(EQ 25) Thus, the MGF represents all the moments of the random variable X in a single compact expression. Note that the MGF of a distribution is undefined if one or more of its moments are infinite. We can extract all the moments of the distribution from the MGF as follows: if we differentiate M(t) once, the only term that is not multiplied by t or a power of t is . So, . Similarly, . Generalizing, it is easy to show that to get the rth moment of a random variable X about the origin, we only need to differentiate its MGF r times with respect to t and then set t to 0. It is important to remember that the ‘true’ form of the MGF is the series expansion in Equation 25. The exponential is merely a convenient representation that has the property that operations on the series (as a whole) result in corresponding operations being carried out in the compact form. For example, it can be shown that the series resulting from the product of and is. This simplifies the computation of operations on the series. However, it is sometimes necessary to revert to the series representation for certain operations. In particular, if the compact notation of M(t) is not differentiable at t = 0, then we must revert to the series to evaluate M(0), as shown next.




Let X be a uniform random variable defined in the interval [0,1]. This is also called a standard uniform distribution. We would like to find all its moments. We find that M(t) = E[etX] = . However, this function is not defined—and therefore not differentiable—at t = 0. Instead, we revert to the series: which is differentiable term by term. Differentiating r times and setting t to 0, we find that = 1/(r+1). So, = μ = 1/


(1+1) = 1/2 is the mean, and = 1/(1+2) = 1/3 = E(X2). Note that we found the expression for M(t) using the compact notation, but reverted to the series for differentiating it. The justification is that the integral of the compact form is identical to the summation of the integrals of the individual terms.


1.4.3 Properties of moment generating functions We now prove some useful properties of MGFs.




The MGF of a standard uniform random variable X is , so, the MGF of (X-μ) is given by . To find the variance of a standard uniform random variable, we need to differentiate twice with respect to t and then set t to 0. Given the t in the denominator, it is convenient to rewrite the expression as , where the ellipses refer to terms with third and higher powers of t, which will reduce to 0 when t is set to 0. In this product, we need only consider the coefficient of t2 (why?), which is . Differentiating the expression twice results in multiplying the coefficient by 2, and when we set t to zero, we obtain E[(X-μ)2] = V[X] = 1/12.


These two properties allow us to compute the MGF of a complex random variable that can be decomposed into the linear combination of simpler variables. In particular, it allows us to compute the MGF of independent, identically distributed (i.i.d) random variables, a situation that arises frequently in practice.


1.5 Standard discrete distributions We now present some discrete distributions that frequently arise when studying networking problems.


1.5.1 Bernoulli distribution


A discrete random variable X is called a Bernoulli random variable if it can take only two values, 0 or 1, and its probability mass function is defined as p(0) = 1-p and p(1) = p. We can think of X as representing the result of some experiment, with X=1 being ‘success,’ with probability p. The expected value of a Bernoulli random variable is p and variance is p(1-p).


1.5.2 Binomial distribution


Consider a series of n Bernoulli experiments where the result of each experiment is independent of the others. We would naturally like to know the number of successes in these n trials. This can be represented by a discrete random variable X with


(EQ 28) If we set q = 1-p, then these are just the terms of the expansion (p+q)n. The expected value of a variable hat is binomially distributed with parameters (n,p) is np.




Consider a local area network with 10 stations. Assume that, at a given moment, each node can be active with probability p= 0.1. What is the probability that: a) one station is active, b) five stations are active, c) all 10 stations are active?




Assuming that the stations are independent, the number of active stations can be modelled by a binomial distribution with parameters (10, 0.1). From the formula for p(i) above, we get


a) p(1) =


b) p(5) =


c) p(10) =


This is shown in Figure 3.



1.5.3 Geometric distribution


Consider a sequence of independent Bernoulli experiments, as before, each of which succeeds with probability p. Unlike earlier, where we wanted to count the number of successes, we want to compute the probability mass function of a random variable X that represents the number of trials before the first success. Such a variable is called a geometric random variable and has a probability mass function:


(EQ 29)


The expected value of a geometrically distributed variable with parameter p is 1/p.




Consider a link that has a loss probability of 10% and that packet losses are independent (although this is rarely true in practice).


Suppose that when a packet gets lost this is detected and the packet is retransmitted until it is correctly received. What is the probability that it would be transmitted exactly one, two, and three times?




Assuming that the packet transmissions are independent events, we note that the probability of success = p = 0.9. Therefore, p(1) = 0.10* 0.9 = 0.9; p(2) = 0.11* 0.9 = 0.09; p(3) = 0.12* 0.9 = 0.009. Note the rapid decrease in the probability of more than two transmissions, even with a fairly high packet loss rate of 10%. Indeed, the expected number of transmissions is only


1/0.9 = .



About AWL

The AWL is the most recent and widely referred word list for teaching and learning academic vocabulary. The AWL was developed by Averil Coxhead from the Victoria University of Wellington,


Contact us

  • Add: Prof. Amit Hiray
    Assistant Professor
    MIT Academy of Engineering,
    Alandi (D) Pune – 412 105
  • Tel: +91-9923083701