Kernel functions implicitly define a mapping $\phi(\cdot)$ that transforms an input instance $\textbf{x}\in\mathbb{R}^{d}$ into a high-dimensional space $Q$ by specifying the form of the dot product in $Q$:

$$ K\left(\textbf{x}_{i},\textbf{x}_{j}\right)\equiv\left\langle\phi\left(\textbf{x}_{i}\right),\phi\left(\textbf{x}_{j}\right)\right\rangle $$

(a) Prove that the kernel is symmetric, i.e. $K\left(\textbf{x}_{i}, \textbf{x}_{j}\right)=K\left(\textbf{x}_{j}, \textbf{x}_{i}\right)$.

**Solution.**

Since the inner product is symmetric, we have

$$ K(\textbf{x}_i,\textbf{x}_j)=\left\langle\phi(\textbf{x}_i),\phi(\textbf{x}_j)\right\rangle =\left\langle\phi(\textbf{x}_j),\phi(\textbf{x}_i)\right\rangle=K(\textbf{x}_j,\textbf{x}_i). $$

(b) Assume we use the radial basis kernel function $K\left(\textbf{x}_{i},\textbf{x}_{j}\right)=\exp\left(-\frac{1}{2}\left\|\textbf{x}_{i}-\textbf{x}_{j}\right\|^{2}\right)$.

Thus there is some implicit, unknown mapping function $\phi(\textbf{x})$. Prove that for any two input instances $\textbf{x}_{i}$ and $\textbf{x}_{j}$, the squared Euclidean distance between their images in the feature space $Q$ is at most 2, i.e. prove that $\left\|\phi\left(\textbf{x}_{i}\right)-\phi\left(\textbf{x}_{j}\right)\right\|^{2} \leq 2$.

**Solution.**

We have that

$$ \begin{aligned} ||\phi(\textbf{x}_i)-\phi(\textbf{x}_j)||^2&=\left\langle\phi(\textbf{x}_i),\phi(\textbf{x}_i)\right\rangle+ \left\langle\phi(\textbf{x}_j),\phi(\textbf{x}_j)\right\rangle-2\cdot\left\langle\phi(\textbf{x}_i),\phi(\textbf{x}_j)\right\rangle \\&=K(\textbf{x}_i,\textbf{x}_i)+K(\textbf{x}_j,\textbf{x}_j)-2\cdot K(\textbf{x}_i,\textbf{x}_j) \\&=1+1-2\exp\left(-\frac{1}{2}\left\|\textbf{x}_{i}-\textbf{x}_{j}\right\|^{2}\right)\leq2. \end{aligned} $$
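As a quick numerical sanity check (not part of the proof), the feature-space distance can be evaluated purely through kernel calls; the helper names `rbf` and `feature_dist_sq` below are ours:

```python
import numpy as np

def rbf(x, y):
    # Radial basis kernel K(x, y) = exp(-||x - y||^2 / 2)
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def feature_dist_sq(x, y):
    # ||phi(x) - phi(y)||^2 = K(x, x) + K(y, y) - 2 K(x, y)
    return rbf(x, x) + rbf(y, y) - 2.0 * rbf(x, y)

# The bound ||phi(x) - phi(y)||^2 <= 2 holds for arbitrary pairs.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(1000)]
assert all(feature_dist_sq(x, y) <= 2.0 for x, y in pairs)
```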

(c) With the help of a kernel function, SVM attempts to construct a hyper-plane in the feature space $Q$ that maximizes the margin between two classes. The classification decision of any $\textbf{x}$ is made on the basis of the sign of

$$ \langle\hat{\textbf{w}},\phi(\textbf{x})\rangle+\hat{w}_0=\sum_{i\in SV}y_i\alpha_i K\left(\textbf{x}_i,\textbf{x}\right)+\hat{w}_0=f\left(\textbf{x};\alpha,\hat{w}_0\right), $$

where $\hat{\textbf{w}}$ and $\hat{w}_0$ are the parameters of the classification hyperplane in the feature space $Q$, $SV$ is the set of support vectors, and $\alpha_i$ is the coefficient for the $i$-th support vector. Again we use the radial basis kernel function. Assume that the training instances are linearly separable in the feature space $Q$, and that the SVM finds a margin that perfectly separates the points.

If we choose a test point $\textbf{x}_{\text{far}}$ which is far away from any training instance $\textbf{x}_i$ (distance here is measured in the original space $\mathbb{R}^d$), prove that $f(\textbf{x}_{\text{far}};\alpha, \hat{w}_0)\approx\hat{w}_0$.

**Solution.**

Since the test point $\textbf{x}_{\text{far}}$ is far away from every training instance $\textbf{x}_i$, we have

$$ \forall i\in SV,\ ||\textbf{x}_{\text{far}}-\textbf{x}_i||\gg 0; $$

Therefore

$$ \begin{aligned} &\quad\forall i\in SV, K(\textbf{x}_{\text{far}},\textbf{x}_i)\approx 0 \\&\Rightarrow \sum_{i\in SV}y_i\alpha_i K(\textbf{x}_{\text{far}},\textbf{x}_i)\approx 0 \\&\Rightarrow f(\textbf{x}_{\text{far}};\alpha, \hat{w}_0)\approx\hat{w}_0. \end{aligned} $$
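To see the effect concretely, here is a small sketch with made-up support vectors, labels, coefficients $\alpha_i$, and bias $\hat{w}_0$ (all hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical support vectors, labels, coefficients, and bias.
X_sv = rng.normal(size=(10, 2))
y = rng.choice([-1.0, 1.0], size=10)
alpha = rng.uniform(0.1, 1.0, size=10)
w0 = 0.7

def f(x):
    # f(x) = sum_i y_i alpha_i K(x_i, x) + w0 with the RBF kernel
    k = np.exp(-0.5 * np.sum((X_sv - x) ** 2, axis=1))
    return np.sum(y * alpha * k) + w0

x_far = np.full(2, 100.0)          # far from every training instance
assert abs(f(x_far) - w0) < 1e-9   # the kernel sum vanishes, so f ≈ w0
```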

The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. For example, in networking, the number of packets to arrive in a given time window is often assumed to follow a Poisson distribution. If $X$ is Poisson distributed, i.e. $X \sim \operatorname{Poisson}(\lambda)$, its probability mass function takes the following form:

$$ P(X=x\mid\lambda)=\frac{\lambda^x e^{-\lambda}}{x!},\quad x=0,1,2,\ldots $$

It can be shown that $\mathbb{E}(X)=\lambda$. Assume now we have $n$ i.i.d. data points from $\operatorname{Poisson}(\lambda)$: $\mathcal{D}=\left\{X_1,\ldots,X_n\right\}$. (For the purpose of this problem, you may only use the facts about the Poisson and Gamma distributions provided here.)

(a) Show that the sample mean $\hat{\lambda}=\frac{1}{n}\sum_{i=1}^n X_i$ is the maximum likelihood estimate (MLE) of $\lambda$ and it is unbiased $(\mathbb{E}(\hat{\lambda})=\lambda)$.

**Solution.**

**Part A.** First we prove that $\hat{\lambda}$ is the maximum likelihood estimate of $\lambda$:

Since the $n$ data points are i.i.d., we have the likelihood function to be

$$ \mathcal{L}=P(\mathcal{D}\mid\lambda)=\prod_{i=1}^n\frac{e^{-\lambda}\lambda^{X_i}}{X_i!} =\frac{\prod_{i=1}^ne^{-\lambda}\cdot\prod_{i=1}^n\lambda^{X_i}}{\prod_{i=1}^n X_i!} =\frac{e^{-n\lambda}\cdot\lambda^{\sum_{i=1}^nX_i}}{\prod_{i=1}^n X_i!} $$

To find $\mathop{\text{argmax}}\limits_{\lambda}P(\mathcal{D}\mid\lambda)$, it is equivalent to find $\mathop{\text{argmax}}\limits_{\lambda}\ln\big[P(\mathcal{D}\mid\lambda)\big]$, where

$$ \begin{aligned}\ln\mathcal{L}&=\sum_{i=1}^n\ln\left(\frac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right) =\sum_{i=1}^n\left(-\lambda+X_i\ln\lambda-\ln X_i!\right)\\&=-n\lambda+\ln\lambda\sum_{i=1}^nX_i-\sum_{i=1}^n\ln X_i!\end{aligned} $$

The maximum is at where the first derivative is zero, i.e.

$$ \frac{\partial\ln\mathcal{L}}{\partial\lambda}=-n+\frac{1}{\lambda}\sum_{i=1}^n X_i=0\Rightarrow \lambda_{\text{max}}=\frac{1}{n}\sum_{i=1}^nX_i=\hat{\lambda} $$

To verify that this is a maximum (rather than a minimum), we take the second derivative:

$$ \frac{\partial^2\ln\mathcal{L}}{\partial\lambda^2}=\frac{\partial}{\partial\lambda}\left(-n+\frac{1}{\lambda}\sum_{i=1}^nX_i\right)=-\frac{1}{\lambda^2}\sum_{i=1}^nX_i<0 $$

which is negative whenever $\sum_{i=1}^nX_i>0$, so $\hat{\lambda}$ is the unique maximum.

**Part B.** Now we prove that $\hat{\lambda}$ is unbiased:

First we calculate $E[X]$ as follows

$$ \begin{aligned}E[X]&=\sum_{x\geq 0}xP(X=x)=\sum_{x\geq 0}\frac{x\lambda^{x}e^{-\lambda}}{x!} \\&=\lambda e^{-\lambda}\sum_{x\geq 1}\frac{\lambda^{x-1}}{(x-1)!}=\lambda e^{-\lambda}\sum_{k\geq 0}\frac{\lambda^k}{k!}=\lambda e^{-\lambda}e^{\lambda}=\lambda; \end{aligned} $$

Therefore we have that

$$ E[\hat{\lambda}]=E\left[\frac{1}{n}\sum_{i=1}^nX_i\right]=\frac{1}{n}\sum_{i=1}^nE[X_i]=\frac{1}{n}\sum_{i=1}^n\lambda=\lambda. $$

Hence $\hat{\lambda}$ is unbiased.
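A Monte Carlo check of unbiasedness (with an arbitrary $\lambda$ and sample size chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 3.5  # arbitrary true rate, for illustration

# Each row is one experiment of n = 50 observations;
# the row mean is the MLE. Averaging the MLE over many
# repeated experiments should recover lambda (unbiasedness).
estimates = rng.poisson(lam, size=(10000, 50)).mean(axis=1)
assert abs(estimates.mean() - lam) < 0.05
```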

(b) Now let's be Bayesian and put a prior distribution over $\lambda$. Assuming that $\lambda$ follows a Gamma distribution with parameters $(\alpha,\beta)$, its probability density function is

$$ p(\lambda\mid\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1} e^{-\beta\lambda} $$

where $\Gamma(\alpha)=(\alpha-1)!$ (here we assume $\alpha$ is a positive integer). Compute the posterior distribution over $\lambda$.

**Solution.**

We calculate the posterior distribution over $\lambda$ as follows:

$$ \begin{aligned}P(\lambda\mid \mathcal{D})&\propto P(\lambda\mid\alpha,\beta)\,P(\mathcal{D}\mid\lambda)=\frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\cdot \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^nX_i}}{\prod_{i=1}^nX_i!} \\&\propto\lambda^{\alpha-1+\sum_{i=1}^nX_i}\cdot e^{-(\beta+n)\lambda},\end{aligned} $$

which is the kernel of a Gamma density, so the posterior is $\lambda\mid\mathcal{D}\sim\operatorname{Gamma}(\hat{\alpha},\hat{\beta})$, where $\hat{\alpha}=\alpha+\sum_{i}X_i$ and $\hat{\beta}=\beta+n$.

(c) Derive an analytic expression for the maximum a posterior (MAP) of $\lambda$ under Gamma $(\alpha,\beta)$ prior.

**Solution.**

Since

$$ \alpha\geq 1\Rightarrow\hat{\alpha}=\alpha+\sum_{i=1}^nX_i\geq 1, $$

The MAP estimate of $\lambda$, i.e. the mode of $\operatorname{Gamma}(\hat{\alpha},\hat{\beta})$, is

$$ \frac{\hat{\alpha}-1}{\hat{\beta}}=\frac{\alpha+\sum_{i=1}^nX_i-1}{\beta+n}. $$
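The conjugate update and the MAP formula are easy to sanity-check numerically; the function names and toy data below are ours:

```python
def gamma_posterior(alpha, beta, data):
    # Conjugate update: a Gamma(alpha, beta) prior with a Poisson
    # likelihood yields a Gamma(alpha + sum(data), beta + len(data)) posterior.
    return alpha + sum(data), beta + len(data)

def poisson_map(alpha, beta, data):
    # MAP estimate = mode of the Gamma posterior, (alpha_hat - 1) / beta_hat,
    # valid when alpha_hat >= 1.
    a_hat, b_hat = gamma_posterior(alpha, beta, data)
    return (a_hat - 1) / b_hat

data = [2, 4, 3, 5, 1]                        # toy observations
assert gamma_posterior(2, 1, data) == (17, 6)
assert abs(poisson_map(2, 1, data) - 16 / 6) < 1e-12
```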

Use d-separation on Figure 1 to answer the following questions.

(a) Are A and B conditionally independent, given D and F?

**Solution.**

A and B **are not** conditionally independent given D and F: conditioning on D and F leaves an active (unblocked) path between A and B in the graph, so d-separation does not hold and independence is not guaranteed.

(b) $P(D\mid CEG) =?\ P(D\mid C)$

**Solution.**

**Not equal.** The equality holds iff D is conditionally independent of E given C, and D is conditionally independent of G given C.

From the figure we see that D and E are conditionally independent given C; however, D and G are **not** conditionally independent given C. Since the equality requires both conditions, we conclude $P(D\mid CEG) \neq P(D\mid C)$.

Given the input variables $X\in\mathbb{R}^p$ and output variable $Y\in\mathbb{R}$, the Expected Prediction Error (EPE) is defined by

$$ \text{EPE}(\hat{f})=\mathbb{E}\left[L(Y,f(X))\right], $$

where $\mathbb{E}(\cdot)$ denotes the expectation over the joint distribution $\text{Pr}(X,Y)$, and $L(Y,f(X))$ is a loss function measuring the difference between the estimated $f(X)$ and observed $Y$. We have shown in our course that for the squared error loss $L(Y,f(X))=(Y-f(X))^2$, the regression function $f(x)=\mathbb{E}(Y|X=x)$ is the optimal solution of $\min_f\text{EPE}(f)$ in the pointwise manner.

(a) In Least Squares, a linear model $X^T\beta$ is used to approximate $f(X)$ according to

$$ \mathop{\min}\limits_{\beta}\mathbb{E}\left[(Y-X^T\beta)^2\right]. $$

Please derive the optimal solution of the model parameters $\beta$.

**Solution.**

First we calculate the first derivative as follows

$$ \begin{aligned}\nabla_\beta\,\mathbb{E}\left[(Y-X^T\beta)^2\right]&=\mathbb{E}\left[\nabla_\beta(Y-X^T\beta)^2\right]\\ &=\mathbb{E}\left[2(Y-X^T\beta)\cdot(-X)\right]\\&=-2\,\mathbb{E}\left[X(Y-X^T\beta)\right]\end{aligned} $$

Setting the gradient to zero at the optimal $\beta=\hat{\beta}$, we have

$$ \mathbb{E}[X(Y-X^T\hat{\beta})]=0\Leftrightarrow \mathbb{E}[XY]=\mathbb{E}[XX^T]\hat{\beta}\Leftrightarrow\hat{\beta}=\mathbb{E}[XX^T]^{-1}\mathbb{E}[XY]. $$
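Replacing the population expectations with sample averages turns the optimality condition into the normal equations $(\mathbf{X}^T\mathbf{X})\hat{\beta}=\mathbf{X}^T\mathbf{y}$. The sketch below (with made-up data) solves them and checks the result against NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])        # illustrative coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)  # linear model plus noise

# Normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Should agree with NumPy's built-in least-squares solver.
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```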

(b) Please explain how nearest neighbors and least squares approximate the regression function, and discuss their differences.

**Solution.**

**Nearest neighbors** regression approximates $f(x)=\mathbb{E}(Y\mid X=x)$ directly: it predicts the output at a new input by averaging the outputs of the nearest training inputs, relying on the assumption that the regression function is roughly constant in a local neighborhood. **Least squares** approximates $f(x)$ globally by a linear function $x^T\beta$, choosing the parameters that best fit the training outputs.

The difference is that nearest neighbors is non-parametric: it makes almost no assumptions about the regression function, while least squares assumes the function is (approximately) linear. Nearest neighbors is therefore more flexible and can capture non-linear patterns, but it needs more data, suffers in high dimensions, and prediction is slower; least squares is more stable and efficient, but less flexible and biased when the true function is far from linear.

(c) Given absolute error loss $L(Y,f(X))=\left|Y-f(X)\right|$, please prove that $f(x)=\text{median}(Y\mid X=x)$ minimizes $\text{EPE}(f)\text{ w.r.t. }f$.

**Solution.**

First, note that for a constant $c$ and observations $Y_1,\ldots,Y_N$,

$$ \frac{\partial}{\partial c}\sum_{i=1}^N\left|Y_i-c\right|=-\sum_{i=1}^N\text{sgn}(Y_i-c), $$

which equals zero exactly when the number of $Y_i$ above $c$ equals the number below $c$, i.e. when $c=\text{median}(Y)$. Thus $\displaystyle\sum_{i=1}^N\left|Y_i-c\right|$ is minimized by $c=\text{median}(Y)$.

Since $\text{EPE}(f)=\mathbb{E}_X\,\mathbb{E}_{Y\mid X}\left[\left|Y-f(X)\right|\mid X\right]$ can be minimized pointwise in $x$, applying the argument above to the conditional distribution of $Y$ given $X=x$ shows that $f(x)=\text{median}(Y\mid X=x)$ minimizes $\text{EPE}(f)$ for $L(Y,f(X))=\left|Y-f(X)\right|$.

**The proof is complete.**
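A brute-force numerical check that the sample median minimizes total absolute loss (grid search over candidate constants, on toy data):

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 7.0, 11.0])  # toy observations

def abs_loss(c):
    # Total absolute loss of predicting the constant c
    return np.sum(np.abs(y - c))

# Scan candidate constants; the minimizer should be the sample median.
grid = np.linspace(0, 12, 2401)
best = grid[np.argmin([abs_loss(c) for c in grid])]
assert abs(best - np.median(y)) < 0.01
```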

(a) Ridge regression can be considered as an unconstrained optimization problem

$$ \mathop{\min}\limits_{\mathbf{w}}||\mathbf{y-Xw}||_2^2+\lambda||\mathbf{w}||_2^2. $$

where $\mathbf{X}\in\mathbb{R}^{n\times d}$ is a data matrix, and $\mathbf{y}\in\mathbb{R}^n$ is the target vector. Consider the following augmented target vector $\hat{\mathbf{y}}$ and data matrix $\hat{\mathbf{X}}$

$$ \hat{\mathbf{y}}=\begin{bmatrix}\mathbf{y}\\\mathbf{0}_d\end{bmatrix}\ \hat{\mathbf{X}}=\begin{bmatrix}\mathbf{X}\\\sqrt{\lambda}\mathbf{I}_d\end{bmatrix} $$

where $\mathbf{0}_d$ is the zero vector in $\mathbb{R}^d$ and $\mathbf{I}_d\in\mathbb{R}^{d\times d}$ is the identity matrix. Please derive the optimal solution of the optimization problem $\min_{\mathbf{w}}||\hat{\mathbf{y}}-\hat{\mathbf{X}}\mathbf{w}||_2^2$ using only $\mathbf{X}$ and $\mathbf{y}$.

**Solution.**

For any $\mathbf{w}$, the augmented objective decomposes as

$$ ||\hat{\mathbf{y}}-\hat{\mathbf{X}}\mathbf{w}||_2^2 =\left|\left|\begin{bmatrix}\mathbf{y}\\\mathbf{0}_d\end{bmatrix}-\begin{bmatrix}\mathbf{X}\\\sqrt{\lambda}\mathbf{I}_d\end{bmatrix}\mathbf{w}\right|\right|_2^2 =||\mathbf{y}-\mathbf{X}\mathbf{w}||_2^2+\lambda||\mathbf{w}||_2^2, $$

so the augmented least-squares problem is exactly the ridge problem. Applying the ordinary least-squares solution to the augmented data, and using $\hat{\mathbf{X}}^T\hat{\mathbf{X}}=\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_d$ and $\hat{\mathbf{X}}^T\hat{\mathbf{y}}=\mathbf{X}^T\mathbf{y}$, gives

$$ \hat{\mathbf{w}}=\mathop{\text{argmin}}\limits_{\mathbf{w}}||\hat{\mathbf{y}}-\hat{\mathbf{X}}\mathbf{w}||_2^2 =(\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^T\hat{\mathbf{y}}=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_d)^{-1}\mathbf{X}^T\mathbf{y}. $$
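The equivalence can be verified numerically: plain least squares on the augmented data reproduces the closed-form ridge solution (random data and an arbitrary $\lambda$ below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 2.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Stack sqrt(lambda) * I under X and zeros under y,
# then run plain least squares on the augmented problem.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Closed-form ridge solution (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(w_aug, w_ridge)
```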

(b) Let's consider another situation by constructing an augmented matrix in the following way

$$ \hat{\mathbf{X}}=\begin{bmatrix}\mathbf{X}&\alpha\mathbf{I}_n\end{bmatrix} $$

where $\alpha$ is a scalar multiplier. Then consider the following problem

$$ \mathop{\min}\limits_{\beta}||\beta||_2^2\quad\text{s.t.}\quad\hat{\mathbf{X}}\beta=\mathbf{y} $$

If $\beta^*$ is the optimal solution of (4), show that the first $d$ coordinates of $\beta^*$ form the optimal solution of (3) for a specific $\alpha$, and find that $\alpha$. What do the final $n$ coordinates of $\beta^*$ represent?

**Solution.**

Denote $\beta=\begin{bmatrix}\beta_d\\\beta_n\end{bmatrix}$. To express $\beta_n$ in terms of $\beta_d$, consider the constraint

$$ \hat{\mathbf{X}}\beta=\mathbf{y}\Leftrightarrow\begin{bmatrix}\mathbf{X}&\alpha\mathbf{I}_n\end{bmatrix}\begin{bmatrix}\beta_d\\\beta_n\end{bmatrix} =\mathbf{y}\Rightarrow\beta_n=\frac{\mathbf{y}-\mathbf{X}\beta_d}{\alpha} $$

Then for the problem in (4), first we calculate that

$$ \begin{aligned}||\beta||_2^2&=||\beta_d||_2^2+\frac{1}{\alpha^2}||\mathbf{y}-\mathbf{X}\beta_d||_2^2 \\&=\beta_d^T\left(\mathbf{I}_d+\frac{\mathbf{X}^T\mathbf{X}}{\alpha^2}\right)\beta_d-\beta_d^T\cdot\frac{2\mathbf{X}^T\mathbf{y}}{\alpha^2}+\frac{\mathbf{y}^T\mathbf{y}}{\alpha^2} \end{aligned} $$

Taking the first derivative with respect to $\beta_d$, we have

$$ \frac{\mathbf{d}||\beta||_2^2}{\mathbf{d}\beta_d}=2\left(\mathbf{I}_d+\frac{\mathbf{X}^T\mathbf{X}}{\alpha^2}\right)\beta_d-\frac{2\mathbf{X}^T\mathbf{y}}{\alpha^2} $$

Therefore for the optimal solution $\beta^*$, we have that

$$ 2\left(\mathbf{I}_d+\frac{\mathbf{X}^T\mathbf{X}}{\alpha^2}\right)\beta_d-\frac{2\mathbf{X}^T\mathbf{y}}{\alpha^2}=0\Leftrightarrow \beta_d=(\alpha^2\mathbf{I}_d+\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$

Matching this with the ridge solution of (3) requires $\alpha^2=\lambda$, i.e.

$$ \beta_d=\hat{\beta}^{\text{ridge}}=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_d)^{-1}\mathbf{X}^T\mathbf{y}\Leftrightarrow \alpha=\sqrt{\lambda}. $$

The final $n$ coordinates of $\beta^*$ are $\beta_n=(\mathbf{y}-\mathbf{X}\beta_d)/\alpha$, i.e. the residuals of the ridge fit scaled by $1/\alpha$.

(c) As we all know, the standard ridge-regression formula is the optimal solution of (3). Suppose the SVD of $\mathbf{X}$ is $\mathbf{X}=\mathbf{U\Sigma V}^T$. We can change coordinates in the feature space so that $\mathbf{V}$ becomes the identity, via $\mathbf{X}'=\mathbf{XV}$ and $\mathbf{w}'=\mathbf{V}^T\mathbf{w}$; denote by $\hat{\mathbf{w}}'$ the solution of the ridge regression in the new coordinates. Please write down the $i$-th coordinate of $\hat{\mathbf{w}}'$.

**Solution.**

We calculate the whole vector $\hat{\mathbf{w}}'$ first:

$$ \begin{aligned}\hat{\mathbf{w}}'&=\mathbf{V}^T\hat{\mathbf{w}}\\&=\mathbf{V}^T(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_d)^{-1}\mathbf{X}^T\mathbf{y} \\&=\mathbf{V}^T\left(\mathbf{V}(\mathbf{\Sigma}^T\mathbf{\Sigma}+\lambda\mathbf{I}_d)\mathbf{V}^T\right)^{-1}\mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T\mathbf{y} \\&=(\mathbf{\Sigma}^T\mathbf{\Sigma}+\lambda\mathbf{I}_d)^{-1}\mathbf{\Sigma}^T\mathbf{U}^T\mathbf{y} =\text{diag}\left[\frac{\sigma_i}{\sigma_i^2+\lambda}\right]\mathbf{U}^T\mathbf{y} \end{aligned} $$

Therefore the $i$-th coordinate of $\hat{\mathbf{w}}'$ is represented as

$$ \hat{\mathbf{w}}'_i=\frac{\sigma_i}{\sigma_i^2+\lambda}\left\langle\mathbf{U}_i,\mathbf{y}\right\rangle $$

where $\mathbf{U}_i$ is the $i$-th column of $\mathbf{U}$.
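A numerical check of the coordinate formula via NumPy's SVD (random data and an arbitrary $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 4, 1.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# i-th coordinate in the rotated basis: sigma_i / (sigma_i^2 + lam) * <u_i, y>
w_prime = (s / (s ** 2 + lam)) * (U.T @ y)

# Verify w' = V^T w_ridge coordinate by coordinate.
assert np.allclose(Vt @ w_ridge, w_prime)
```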

A random variable $\mathbf{X}$ has unknown mean and variance: $\mu,\sigma^2$. $n$ iid realizations $\mathbf{X}_1=\mathbf{x}_1,\mathbf{X}_2=\mathbf{x}_2,\cdots,\mathbf{X}_n=\mathbf{x}_n$ from the random variable $\mathbf{X}$ are used to estimate the mean of $\mathbf{X}$. We will call our estimate of $\mu$ the random variable $\hat{\mathbf{X}}$, which has mean $\hat{\mu}$. There are two possible ways to estimate $\mu$ with the realizations of $n$ samples:

- Average the $n$ samples: $\displaystyle\frac{\mathbf{x}_1+\mathbf{x}_2+\cdots+\mathbf{x}_n}{n}$
- Average the $n$ samples and $n_0$ samples of 0: $\displaystyle\frac{\mathbf{x}_1+\mathbf{x}_2+\cdots+\mathbf{x}_n}{n+n_0}$

The bias is defined as $\mathbb{E}[\hat{\mathbf{X}}-\mu]$ and the variance as $\text{Var}[\hat{\mathbf{X}}]$.

(a) What are the bias and the variance of each of two estimators above?

**Solution.**

For the first estimator, we have that

$$ \begin{aligned}\mathbb{E}[\hat{\mathbf{X}}_1-\mu]&=\mathbb{E}\left[\frac{\sum_{i=1}^n\mathbf{x}_i}{n}\right]-\mu =\sum_{i=1}^n\frac{\mathbb{E}[\mathbf{x}_i]-\mu}{n}=0;\\ \text{Var}[\hat{\mathbf{X}}_1]&=\mathbb{E}\left[\left(\frac{\sum_{i=1}^n\mathbf{x}_i}{n}-\mu\right)^2\right] =\mathbb{E}\left[\left(\frac{\sum_{i=1}^n(\mathbf{x}_i-\mu)}{n}\right)^2\right] \\&=\frac{1}{n^2}\sum_{i=1}^n\mathbb{E}\left[(\mathbf{x}_i-\mu)^2\right]\qquad\text{(cross terms vanish by independence)}\\&=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}. \end{aligned} $$

For the second estimator, we have that

$$ \begin{aligned}\mathbb{E}[\hat{\mathbf{X}}_2-\mu]&=\mathbb{E}\left[\frac{\sum_{i=1}^n\mathbf{x}_i}{n+n_0}\right]-\mu=\frac{n\mu}{n+n_0}-\mu=-\frac{n_0\mu}{n+n_0}; \\\text{Var}[\hat{\mathbf{X}}_2]&=\mathbb{E}\left[\left(\frac{\sum_{i=1}^n\mathbf{x}_i}{n+n_0}-\frac{n\mu}{n+n_0}\right)^2\right] =\mathbb{E}\left[\left(\frac{\sum_{i=1}^n(\mathbf{x}_i-\mu)}{n+n_0}\right)^2\right] \\&=\frac{n^2}{(n+n_0)^2}\,\text{Var}[\hat{\mathbf{X}}_1]=\frac{n\sigma^2}{(n+n_0)^2}. \end{aligned} $$

(b) Now we denote a new independent sample of $\mathbf{X}$ as $\mathbf{X}'$, in order to test how well $\hat{\mathbf{X}}$ estimates a new sample of $\mathbf{X}$. Please derive an expression for $\mathbb{E}[(\hat{\mathbf{X}}-\mu)^2]$ and $\mathbb{E}[(\hat{\mathbf{X}}-\mathbf{X}')^2]$, and then make some comments on the difference between them. (Hints: Using the Bias-Variance Tradeoff)

**Solution.**

We have that

$$ \begin{aligned} \mathbb{E}\left[(\hat{\mathbf{X}}-\mu)^2\right]&=\mathbb{E}\left[(\hat{\mathbf{X}}-\hat{\mu}+\hat{\mu}-\mu)^2\right] =\mathbb{E}[(\hat{\mathbf{X}}-\hat{\mu})^2]+(\hat{\mu}-\mu)^2 \\&=\text{Var}[\hat{\mathbf{X}}]+\text{Bias}[\hat{\mathbf{X}}]^2; \\\mathbb{E}\left[(\hat{\mathbf{X}}-\mathbf{X}')^2\right]&=\mathbb{E}\left[(\hat{\mathbf{X}}-\hat{\mu}+\hat{\mu}-\mu+\mu-\mathbf{X}')^2\right] \\&=\mathbb{E}[(\hat{\mathbf{X}}-\hat{\mu})^2]+(\hat{\mu}-\mu)^2+\mathbb{E}[(\mu-\mathbf{X}')^2]\qquad\text{(cross terms vanish by independence)} \\&=\text{Var}[\hat{\mathbf{X}}]+\text{Bias}[\hat{\mathbf{X}}]^2+\sigma^2. \end{aligned} $$

The two quantities differ exactly by $\sigma^2$: the error against a new sample $\mathbf{X}'$ additionally contains the irreducible variance of $\mathbf{X}$ itself, which no estimator of $\mu$ can remove.

(c) Compute $\mathbb{E}[(\hat{\mathbf{X}}-\mu)^2]$ for each of the estimators above.

**Solution.**

We have that

$$ \begin{aligned} \mathbb{E}[(\hat{\mathbf{X}}_1-\mu)^2]&=\text{Var}[\hat{\mathbf{X}}_1]=\frac{\sigma^2}{n}; \\\mathbb{E}[(\hat{\mathbf{X}}_2-\mu)^2]&=\text{Var}[\hat{\mathbf{X}}_2]+\text{Bias}[\hat{\mathbf{X}}_2]^2=\frac{n\sigma^2}{(n+n_0)^2}+\left(\frac{n_0\mu}{n+n_0}\right)^2. \end{aligned} $$
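A Monte Carlo check of the bias, variance, and mean-squared error of the second (shrunk) estimator, with illustrative values of $\mu$, $\sigma$, $n$, $n_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n0 = 2.0, 1.0, 20, 5
samples = rng.normal(mu, sigma, size=(200000, n))

# Second estimator: average the n samples together with n0 zeros.
est2 = samples.sum(axis=1) / (n + n0)

bias = est2.mean() - mu
var = est2.var()
mse = ((est2 - mu) ** 2).mean()

assert abs(bias - (-n0 * mu / (n + n0))) < 0.01   # bias = -n0 mu / (n + n0)
assert abs(var - n * sigma**2 / (n + n0)**2) < 0.01  # var = n sigma^2 / (n+n0)^2
assert abs(mse - (var + bias**2)) < 0.01          # MSE = variance + bias^2
```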

Finance is the study of how individuals, real companies, and financial institutions allocate **scarce** resources over time.

Real company: a company that produces actual products.

- Costs and benefits are distributed over time
- The actual timing and size of future cash flows are often known only **probabilistically**.

Three topics : Financial institutions and markets; Return; Risk

When implementing decisions, people make use of the Financial System defined as the set of markets and other institutions used for financial contracting and exchange of rights and obligations (权利和义务).

Every contract specifies considerations under different conditions:

Example : debt financing contract (借贷关系)

- Obligations and rights (under conditions) for creditors (债权人)
- Obligations and rights (under conditions) for debtors (债务人)

The contract defines the relevant time periods (when the creditor lends the money and when the debtor must pay it back) and how to handle contingencies (e.g. the debtor cannot pay on time).

**Financial theory** consists of :

- The set of concepts that help to organize one's thinking about how to allocate resources over time
- The set of quantitative models that can explain stylized facts
- Implications of such quantitative models help evaluate alternatives, make decisions, and implement them.

Basic tenet of finance : the existence of economic organizations (e.g. firms and governments) facilitate the **satisfaction of people's consumption preferences** (maximization of social welfare).

- Individuals' consumption; Firms' profits; Economic productivity and growth (and innovation)

Manage your personal resources;

Deal with the world of business;

Pursue interesting and rewarding career opportunities;

Make informed public choices as a citizen.

- Matching 匹配 : Market making
- Credit enhancing 增信 : Trust providing/guarantee

- Consumption and saving decisions
- Investment and Financing decisions
- Risk-management decisions

Assets:

- Cash, current deposit, fixed term deposit
- Physical assets
- *Human capital (人力资源)*

Liabilities or Debts:

- Short or long-term loan
- Cutoff between short term and long term?

$< 1$ year: short term; $\geq 1$ year: long term

$$ \text{Net Worth}=\text{Assets}-\text{Liabilities} $$

**The capital budgeting process 资本预算**

How to acquire human capital?

- Recruiting
- Develop or buy a business created by a team
- Good working environment

**The financing process 融资**

Once a new set of approved projects has been identified, it must be financed with retained earnings (盈余公积), stock (股票), and bonds (债券).

Retained earnings are internal financing; stocks and bonds are external financing.

Capital structure is the amount of the firm's market value allocated to each category of issued securities. It determines ownership and risk level of the firm's future cash flows.

**Fixed-income securities (debts) 固定收益证券/债券证券**

- Promise a specified stream of income.
- Claims on *guaranteed* cash flows.

**Common stocks and preferred stocks (equity) 股权证券**

- Represent an ownership share in the corporation
- Claims on the *possibility* of what the firm may earn in the future (uncertainty)

**Derivative securities 衍生证券**

- Provide payoffs that are determined by the prices of other assets.
- Claims on cash flows that are themselves generated by other cash flows.

Capital structure's unit of analysis is the firm as a whole

Cash in-/out- flows could be cancelled out or amplified among projects.

Capital structure also determines who **controls** the firm under different contingencies:

- Common stock holders usually determine the membership of the board of directors.
- Bondholder covenants restrict decisions that could adversely affect bond values.

**Working capital 营运资本 or liquidity 流动性**

A company's current assets minus its current liabilities.

Possible damages if *not managed well*:

- Sub-optimal temporary financing
- Delays in investment schedules
- Unscheduled sale of firm assets
- Loss of investor and creditor confidence

**Sole Proprietorship 个人独资制**

A firm owned by an individual or family

Assets and liabilities are the personal assets and liabilities of the proprietor

Unlimited liability (无限责任制)

Low administrative costs

**Partnership 合伙制**

A firm with $\geq 2$ owners sharing the equity

May own property, borrow, sue, be sued, and enter into legal contracts

Changes in ownership involve dissolving the old partnership and forming a new one

*Usually* unlimited liabilities.

Current examples : Law firms, Architectural firms, Accounting firms

**Corporation 有限责任制**

A legal entity, distinct from its ownership

May own property, borrow, sue, be sued, and enter into legal contracts

**Not dissolved** when shares are transferred

Shareholders elect directors, who appoint management

Pays corporate taxes, resulting in double taxation of owners

**Limited liability** (*corporate veil may be lifted*)

**Owners 股东，董事长**

Contribute money and receive dividends (分红)

Make decisions

**Managers 总裁，首席执行官，首席财务官**

Execute decisions

Provide professional suggestions

Pros : Specialization : capital + professional skills ; capital and skills are more flexible

Cons : Agency problem ; costs of information gathering

The right production plan;

The right recruiting plan;

The right development strategy;

Could be simplified as "*profit maximization*"

- New data and information may affect our uncertainties
- Conditional probability: how to **update** our beliefs?
- All probabilities are **conditional**! (the condition works as explicit/implicit background information or assumptions)
- Conditioning is a powerful problem-solving strategy, which can reduce a complicated probability problem to a bunch of simpler conditional probability problems
- **First-step analysis** (首步分析法): obtain recursive solutions to multi-stage problems (applicable to problems with multiple stages that often have some recursive structure)

- "Conditioning is the soul of statistics"

If $A$ and $B$ are events with $P(B)>0$, then the *conditional probability* of $A$ **given** $B$, denoted by $P(A\mid B)$, is defined as

$$ P(A\mid B)=\frac{P(A\cap B)}{P(B)} $$

where $P(A)$ is the **prior probability** of $A$ (先验概率), $P(A\mid B)$ is the **posterior probability** (后验概率) of $A$.

Event $B$, which occurs after $A$, changes our belief about $A$.

For any events $A_1,A_2,\cdots,A_n$ with positive probabilities,

$$ P(A_1,A_2,\cdots,A_n)=P(A_1)P(A_2\mid A_1)P(A_3\mid A_1,A_2)\cdots P(A_n\mid A_1,\cdots,A_{n-1}) $$

Here $P(A_1,A_2,\cdots A_n)\triangleq P(A_1\cap A_2\cap\cdots\cap A_n),\mathrm{etc.}$

For any events $A$ and $B$ with positive probabilities,

$$ P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)} $$

**Proof**

$$ \begin{aligned} P(B\mid A)&=\frac{P(B\cap A)}{P(A)}\Rightarrow P(B\cap A)=P(A)P(B\mid A)\\ \Rightarrow P(A\mid B)&=\frac{P(A\cap B)}{P(B)}=\frac{P(A)P(B\mid A)}{P(B)}. \end{aligned} $$

Let $A_1,A_2,\cdots, A_n$ be a *partition* of the sample space $S$ ($\mathrm{i.e.}$ the $A_i$ are disjoint events and their union is $S$), with $P(A_i)>0$ for all $i$. Then

$$ P(B)=\sum_{i=1}^n P(B\mid A_i)P(A_i) $$

**Proof**

Since $B=\bigcup_{i=1}^n(B\cap A_i)$, where the $B\cap A_i$ are disjoint sets, by the additivity axiom of probability we have

$$ P(B)=P\left(\bigcup_{i=1}^n(B\cap A_i)\right)=\sum_{i=1}^nP(B\cap A_i)=\sum_{i=1}^nP(B\mid A_i)P(A_i). $$

Let $A_1,\cdots, A_n$ be a *partition* of the sample space $S$ ($\mathrm{i.e.}$ the $A_i$ are disjoint events and their union is $S$), with $P(A_i)>0$ for all $i$. Then for any event $B$ such that $P(B)>0$, we have

$$ P(A_i\mid B)=\frac{P(A_i)P(B\mid A_i)}{P(A_1)P(B\mid A_1)+\cdots+P(A_n)P(B\mid A_n)}. $$

**Proof**

$$ P(A_i\mid B)\xlongequal{\text{Bayes' Rule}}\frac{P(A_i)P(B\mid A_i)}{P(B)}\xlongequal{\text{LOTP}}\frac{P(A_i)P(B\mid A_i)}{\sum_{j=1}^nP(A_j)P(B\mid A_j)}. $$

**Example : Bayes Spam Filter**

A spam filter is designed by looking at commonly occurring phrases in spam. Suppose that 80% of the email is spam. In 10% of the spam emails, the phrase "free money" is used, whereas this phrase is only used in 1% of non-spam emails. A new email has just arrived, which does mention "free money". What is the probability that it is spam?

Event $S$ = "An email is a spam"; $F$ = "An email has the phrase 'free money'"

$P(S)=0.8, P(S^c)=0.2$;

$$ \begin{aligned} P(S\mid F)&=\frac{P(F\mid S)P(S)}{P(F)}\\&=\frac{P(F\mid S)P(S)}{P(F\mid S)P(S)+P(F\mid S^c)P(S^c)}\\&=\frac{0.1\cdot0.8}{0.1\cdot0.8+0.01\cdot0.2}\\&=\frac{80}{82}\\&\approx0.9756 \end{aligned} $$
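The arithmetic above can be reproduced exactly with a short Python sketch (the helper name `posterior_spam` is ours, not part of the problem):

```python
from fractions import Fraction

def posterior_spam(p_spam, p_phrase_given_spam, p_phrase_given_ham):
    """Bayes' rule with LOTP in the denominator: returns P(S | F)."""
    p_ham = 1 - p_spam
    numerator = p_phrase_given_spam * p_spam
    denominator = numerator + p_phrase_given_ham * p_ham
    return numerator / denominator

# P(S) = 0.8, P(F|S) = 0.1, P(F|S^c) = 0.01
p = posterior_spam(Fraction(8, 10), Fraction(1, 10), Fraction(1, 100))
print(p)  # 40/41, i.e. 80/82 ≈ 0.9756
```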

**Example : Random Coin**

You have one fair coin, and one biased coin which lands Heads with probability $\displaystyle\frac{3}{4}$. You pick one of the coins at random and flip it three times. It lands Heads all three times. Given this information, what is the probability that the coin you picked is the fair one?

Event $A$ = "The chosen coin lands Heads three times"; $F$ = "We picked the fair coin".

$$ \begin{aligned} P(F\mid A)&=\frac{P(A\mid F)P(F)}{P(A\mid F)P(F)+P(A\mid F^c)P(F^c)} \\&=\frac{\displaystyle(\frac{1}{2})^3\cdot\frac{1}{2}}{(\displaystyle\frac{1}{2})^3\cdot\frac{1}{2}+(\frac{3}{4})^3\cdot\frac{1}{2}}\\&=\frac{8}{35}. \end{aligned} $$

**Example : Communication Channel**

Suppose that Alice sends only one bit (a $0$ or $1$) to Bob, with equal probabilities. If she sends a $0$, there is a $5\%$ chance of an error occurring, resulting in Bob receiving a $1$; If she sends a $1$, there is a $5\%$ chance of an error occurring, resulting in Bob receiving a $0$. Given that Bob receives a $1$, what is the probability that Alice actually sent a $1$?

Event $A_1$ = "Alice sent a $1$"; $B_1$ = "Bob received a $1$"

$$ \begin{aligned} P(A_1\mid B_1)&=\frac{P(B_1\mid A_1)P(A_1)}{P(B_1\mid A_1)P(A_1)+P(B_1\mid A_1^c)P(A_1^c)}\\ &=\frac{0.95\cdot0.5}{0.95\cdot0.5+0.05\cdot0.5}\\&=0.95. \end{aligned} $$

Conditioning on an event $E$, we update our beliefs to be consistent with this knowledge.

$P(\cdot\mid E)$ is also a probability function with sample space $S$:

- $0\leq P(\cdot\mid E)\leq 1$ with $P(S\mid E)=1$ and $P(\varnothing\mid E)=0$.
- If events $A_1,A_2,\cdots$ are disjoint, then

$$ P\left(\bigcup_{j=1}^{\infty}A_j\mid E\right)=\sum_{j=1}^{\infty}P(A_j\mid E). $$

- $P(A^c\mid E)=1-P(A\mid E)$.
- Inclusion - Exclusion : $P(A\cup B\mid E)=P(A\mid E)+P(B\mid E)-P(A\cap B\mid E)$.

Provided that $P(A\cap E)>0$ and $P(B\cap E)>0$, we have

$$ P(A\mid B,E)=\frac{P(B\mid A,E)P(A\mid E)}{P(B\mid E)} $$

**Proof**

- First we have Bayes' Law $\displaystyle P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}$;
- Let $\hat{P}(\cdot)\triangleq P(\cdot\mid E)$;
- Bayes' Law also applies to $\hat{P}(\cdot)$ that

$$ \hat{P}(A\mid B)=\frac{\hat{P}(B\mid A)\hat{P}(A)}{\hat{P}(B)}=\frac{P(B\mid A,E)P(A\mid E)}{P(B\mid E)}. $$

Let $A_1,\cdots, A_n$ be a *partition* of the sample space $S$ ($\mathrm{i.e.}$ the $A_i$ are disjoint events and their union is $S$), with $P(A_i\cap E)>0$ for all $i$. Then

$$ P(B\mid E)=\sum_{i=1}^nP(B\mid A_i,E)P(A_i\mid E). $$

**Proof**

- First we have LOTP $P(B)=\displaystyle\sum_{i=1}^nP(B\mid A_i)P(A_i)$;
- Let $\hat{P}(\cdot)=P(\cdot\mid E)$;
- LOTP also applies to $\hat{P}(\cdot)$ that

$$ \hat{P}(B)=\sum_{i=1}^n\hat{P}(B\mid A_i)\hat{P}(A_i)=\sum_{i=1}^nP(B\mid A_i,E)P(A_i\mid E). $$

**Example : Random Coin**

You have one fair coin, and one biased coin which lands Heads with probability $\displaystyle\frac{3}{4}$. You pick one of the coins at random and flip it three times. It lands heads all three times. If we toss the coin a fourth time, what is the probability that it will land Heads once more?

Let Event $A$ : "the chosen coin lands heads all three times", $F$ : "we picked the fair coin", $H$ : "the chosen coin lands Heads on the fourth time", then

$$ \begin{aligned} P(H\mid A)&=P(H\mid F,A)P(F\mid A)+P(H\mid F^c,A)P(F^c\mid A)\\ &=\frac{1}{2}\times\frac{8}{35}+\frac{3}{4}\times\left(1-\frac{8}{35}\right)\\&=\frac{97}{140}\left(>\frac{1}{2}\right). \end{aligned} $$

($P(F\mid A)$ *has been calculated earlier*)

- Think of $B,C$ as the single event $B\cap C$.
- Use Bayes' rule with *extra conditioning* on $C$.
- Use Bayes' rule with *extra conditioning* on $B$.

**Definition**

Events $A$ and $B$ are *independent* if

$$ P(A\cap B)=P(A)P(B) $$

If $P(A)>0$ and $P(B)>0$, then this is equivalent to

$$ P(A\mid B)=P(A);\quad P(B\mid A)=P(B). $$

That is, the occurrence of either of $A,B$ provides no information about the other.

**Theorem**

If $A$ and $B$ are independent, then $A$ and $B^c$ are independent, $A^c$ and $B$ are independent, and $A^c$ and $B^c$ are independent.

**Proof**

$P(A\cap B^c)=P(A)-P(A\cap B)=P(A)-P(A)P(B)=P(A)P(B^c)$, so $A$ and $B^c$ are independent. The remaining two claims follow by symmetry (swap the roles of $A$ and $B$, then apply the first result again).

**Definition**

Events $A,B$ and $C$ are independent if all of the following equations hold:

$$ \begin{aligned} P(A\cap B)&=P(A)P(B)\\ P(A\cap C)&=P(A)P(C)\\ P(B\cap C)&=P(B)P(C)\\ P(A\cap B\cap C)&=P(A)P(B)P(C). \end{aligned} $$

When all four equations hold, we say that $A,B,C$ are **mutually independent**; when only the first three hold, they are merely **pairwise independent**. Despite the similar names, the two notions are not equivalent.

Consider two fair, independent coin tosses.

- Event $A$ : the event that the first is Heads.
- Event $B$ : the event that the second is Heads.
- Event $C$ : the event that both tosses have the same result.

First we have that

$$ \begin{aligned} P(A)&=P(B)=\frac{1}{2};\\P(C)&=P[(A\cap B)\cup(A^c\cap B^c)] \\&=P(A\cap B)+P(A^c \cap B^c) \\&=P(A)\cdot P(B)+P(A^c)\cdot P(B^c)\\&=\frac{1}{4}+\frac{1}{4}=\frac{1}{2}. \end{aligned} $$

We verify that the first three equations hold, meaning that $A,B,C$ are **pairwise independent**;

However, $P(A\cap B\cap C)=P(A\cap B)=\displaystyle\frac{1}{4}\neq\frac{1}{8}=P(A)P(B)P(C)$.

This is because $C$ happens automatically whenever both $A$ and $B$ happen: $C$ is independent of $A$ and of $B$ separately, but not of $A$ and $B$ jointly.

**Definition**

Events $A$ and $B$ are said to be *conditionally independent* given $E$ if:

$$ P(A\cap B\mid E)=P(A\mid E)P(B\mid E). $$

**Example : Conditional Independence $\nRightarrow$ Independence**

We choose either a fair coin or a biased coin ($\mathrm{w.p.}$ $3/4$ of landing Heads).

But we do not know which one we have chosen and we flip it twice.

- Event $F$ : "chosen the fair coin"
- Event $A_1$ : "the first coin toss lands Heads"
- Event $A_2$ : "the second coin toss lands Heads"

Events $A_1$ and $A_2$ are conditionally independent given $F$, since

$$ P(A_1\cap A_2\mid F)=P(A_1\mid F)P(A_2\mid F)\left(\frac{1}{4}=\frac{1}{2}\times\frac{1}{2}\right). $$

However,

$$ \begin{aligned} P(A_i)_{i=1,2}&\xlongequal{\text{LOTP}}P(A_i\mid F)P(F)+P(A_i\mid F^c)P(F^c)\\ &=\frac{1}{2}\times\frac{1}{2}+\frac{3}{4}\times\frac{1}{2}\\&=\frac{5}{8} \end{aligned} $$

And

$$ \begin{aligned} P(A_1\cap A_2)&\xlongequal{\text{LOTP}}P(A_1\cap A_2\mid F)P(F)+P(A_1\cap A_2\mid F^c)P(F^c)\\ &=P(A_1\mid F)P(A_2\mid F)P(F)+P(A_1\mid F^c)P(A_2\mid F^c)P(F^c)\\ &=\frac{1}{2}\times\frac{1}{2}\times\frac{1}{2}+\frac{3}{4}\times\frac{3}{4}\times\frac{1}{2}\\&=\frac{13}{32} \end{aligned} $$

So that $P(A_1)P(A_2)=\displaystyle\frac{25}{64}\neq P(A_1\cap A_2)$, which means that $A_1$ and $A_2$ are not independent.
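These exact values can be double-checked by enumerating the coin choice and the two tosses with rational arithmetic (a minimal sketch; the helper name `prob` is ours):

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# P(Heads) for each coin: fair = 1/2, biased = 3/4; each coin picked w.p. 1/2
coins = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

def prob(event):
    """Sum P(coin, toss1, toss2) over outcomes satisfying `event`."""
    total = Fraction(0)
    for name, ph in coins.items():
        for t1, t2 in product([True, False], repeat=2):
            p = half                      # picking this coin
            p *= ph if t1 else 1 - ph     # first toss
            p *= ph if t2 else 1 - ph     # second toss
            if event(name, t1, t2):
                total += p
    return total

p_a1 = prob(lambda c, t1, t2: t1)           # P(A1) = 5/8
p_a2 = prob(lambda c, t1, t2: t2)           # P(A2) = 5/8
p_both = prob(lambda c, t1, t2: t1 and t2)  # P(A1 ∩ A2) = 13/32
print(p_a1, p_a2, p_both, p_a1 * p_a2)
```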

There are $k+1$ coins in a box. When flipped, the $i^{th}$ coin will turn up heads with probability $i/k$, $i=0,1,\cdots,k$. A coin is randomly selected from the box and is then repeatedly flipped. If the first $n$ flips all result in Heads, what is the conditional probability that the $(n+1)^{st}$ flip will do likewise?
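Conditioning on which coin was selected (LOTP and Bayes' rule with extra conditioning, as above) gives $P(H_{n+1}\mid H_1\cdots H_n)=\sum_i(i/k)^{n+1}\big/\sum_i(i/k)^n$. A minimal sketch computing this (the function name `p_next_heads` is ours):

```python
from fractions import Fraction

def p_next_heads(k, n):
    """P(flip n+1 lands Heads | first n flips all Heads), coins i/k, i=0..k."""
    num = sum(Fraction(i, k) ** (n + 1) for i in range(k + 1))
    den = sum(Fraction(i, k) ** n for i in range(k + 1))
    return num / den

# For large k the sums approximate the integrals of x^{n+1} and x^n on [0,1],
# so the answer tends to (n+1)/(n+2).
print(float(p_next_heads(1000, 10)))  # close to 11/12
```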

Let Event $A$ : "a successful surgery", $B$ : "Nick is the surgeon", $C$ : "the surgery is a heart surgery", then we have

$$ \begin{aligned} P(A\mid B,C)=\frac{1}{5}&<P(A\mid B^c,C)=\frac{7}{9};\\ P(A\mid B,C^c)=\frac{9}{10}&<P(A\mid B^c,C^c)=1;\\ \text{But }P(A\mid B)=0.83&>P(A\mid B^c)=0.8. \end{aligned} $$

That is, even if $P(A\mid B,C)<P(A\mid B^c,C)$ and $P(A\mid B,C^c)<P(A\mid B^c,C^c)$, it is still possible that $P(A\mid B)>P(A\mid B^c)$: the inequalities can reverse upon aggregation, a phenomenon known as **Simpson's Paradox**.

- Label three doors 1,2,3; We assume the contestant picks door 1
- By conditioning on the location of the car, Event $C_i$ = "the car is behind the door $i$", $i=1,2,3$; $P(C_i)=\displaystyle\frac{1}{3}$.
- By LOTP, we have

$$ \begin{aligned} P(\text{get a car})&=P(\text{win})\\&=P(\text{win}\mid C_1)P(C_1)+P(\text{win}\mid C_2)P(C_2)+P(\text{win}\mid C_3)P(C_3) \end{aligned} $$

From which we have

- Stay(Not switching) : $P^{\text{stay}}(\text{win})=\displaystyle 1\times\frac{1}{3}+0\times\frac{1}{3}+0\times\frac{1}{3}=\frac{1}{3}$;
- Switching : $P^{\text{switch}}(\text{win})=\displaystyle0\times\frac{1}{3}+1\times\frac{1}{3}+1\times\frac{1}{3}=\frac{2}{3}$.

This means that we have a better chance of winning if we choose to switch.
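The stay/switch comparison can also be checked by simulation (a minimal sketch; when the car is behind the picked door, Monty's choice between the two goat doors does not affect the win probability, so we open the first available one):

```python
import random

def monty_hall(trials, switch, seed=0):
    """Estimate the win probability of the stay or switch strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)  # event C_i: the car is behind door i
        pick = 0                # the contestant picks door 1 (index 0)
        # Monty opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(100_000, switch=False))  # near 1/3
print(monty_hall(100_000, switch=True))   # near 2/3
```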

(**Story Proof**) Define $\begin{Bmatrix}n\\k\end{Bmatrix}$ as the number of ways to partition $\{1,2,\cdots,n\}$ into $k$ non-empty subsets, or the number of ways to have $n$ students split up into $k$ groups such that each group has at least one student.

Prove the following identities:

**(a)**

$$ \begin{Bmatrix}n+1\\k\end{Bmatrix}=\begin{Bmatrix}n\\k-1\end{Bmatrix}+k\cdot\begin{Bmatrix}n\\k\end{Bmatrix} $$

Suppose that there are $n+1$ people, and one of them is named *Alice*. Now we want to divide them into $k$ groups.

One method is to do it directly, and we have $\begin{Bmatrix}n+1\\k\end{Bmatrix}$ possible ways to do so.

Or we can focus on *Alice* first. If she is afraid of social interaction and wants to be in a group by herself, then we only need to divide the remaining $n$ people into $k-1$ groups, which can be done in $\begin{Bmatrix}n\\k-1\end{Bmatrix}$ ways; otherwise, the remaining $n$ people are divided into $k$ groups as usual, and *Alice* has $k$ choices of group to join. In total we have $\begin{Bmatrix}n\\k-1\end{Bmatrix}+k\cdot\begin{Bmatrix}n\\k\end{Bmatrix}$ ways.

The two methods achieve the same result, so the number of ways should be the same, which means that

$$ \begin{Bmatrix}n+1\\k\end{Bmatrix}=\begin{Bmatrix}n\\k-1\end{Bmatrix}+k\cdot\begin{Bmatrix}n\\k\end{Bmatrix}. $$

**The proof is complete.**

**(b)**

$$ \sum_{j=k}^n\begin{pmatrix}n\\j\end{pmatrix}\cdot\begin{Bmatrix}j\\k\end{Bmatrix}=\begin{Bmatrix}n+1\\k+1\end{Bmatrix} $$

Suppose that there are $n+1$ people, and one of them is named *Alice*; Now we want to divide them into $k+1$ groups.

One method is to do it directly, and we have $\begin{Bmatrix}n+1\\k+1\end{Bmatrix}$ possible ways to do so.

However, since *Alice* is the class monitor, she can actually choose her teammates, $\mathrm{i.e.}$ decide how many people are going to be in her group. Suppose the number of people **not** in *Alice's* group is $j$. Since there has to be at least one person in each group, we have $k\leq j\leq n$.

For each $j$, we first pick out the $j$ people not in *Alice's* group, which can be done in $\begin{pmatrix}n\\j\end{pmatrix}$ ways;

Then we divide them into $k$ groups (the rest are already in *Alice's* group, the ${(k+1)}^{\text{th}}$ one). Thus in total we have $\sum_{j=k}^n\begin{pmatrix}n\\j\end{pmatrix}\cdot\begin{Bmatrix}j\\k\end{Bmatrix}$ ways;

The two methods achieve the same result, so the number of ways should be the same, which means that

$$ \sum_{j=k}^n\begin{pmatrix}n\\j\end{pmatrix}\cdot\begin{Bmatrix}j\\k\end{Bmatrix}=\begin{Bmatrix}n+1\\k+1\end{Bmatrix} $$

**The proof is complete.**
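Both identities can be verified numerically by computing $\begin{Bmatrix}n\\k\end{Bmatrix}$ from the recurrence in (a) (a minimal sketch; the function name `stirling2` is ours):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind, via the recurrence in (a)."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

# Identity (b): sum_{j=k}^{n} C(n, j) * S(j, k) == S(n+1, k+1)
n, k = 10, 4
lhs = sum(comb(n, j) * stirling2(j, k) for j in range(k, n + 1))
print(lhs == stirling2(n + 1, k + 1))  # True
```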

A *norepeatword* is a sequence of at least one (and possibly all) of the usual 26 letters a, b, c, . . . , z, with repetitions not allowed. For example, “course” is a norepeatword, but “statistics” is not. Order matters, e.g., “course” is not the same as “source”. A norepeatword is chosen randomly, with all norepeatwords equally likely. Show that the probability that it uses all 26 letters is very close to 1/$e$.

The number of $k$-letter *norepeatwords* is a $k$-permutation count, since order matters and there is no replacement: we first choose $k$ letters out of the $26$, in $\begin{pmatrix}26\\k\end{pmatrix}$ ways, and then arrange them in $k!$ orders.

Let sample space $S$ denote all possible *norepeatword* and event $A$ be those with 26 letters. Thus

$$ \begin{aligned} P(A)&=\frac{26!}{\sum_{i=1}^{26}\begin{pmatrix}26\\i\end{pmatrix}\cdot i!}= \frac{26!}{\sum_{i=1}^{26}\displaystyle\frac{26!}{i!(26-i)!}\cdot i!} \\&=\frac{26!}{\displaystyle\frac{26!}{0!}+\frac{26!}{1!}+\cdots+\frac{26!}{25!}} \\&=\frac{1}{\displaystyle\frac{1}{0!}+\frac{1}{1!}+\cdots+\frac{1}{25!}} \\&\approx\bigg(\sum_{i=0}^{\infty}\frac{1}{i!}\bigg)^{-1}=\bigg(e^x\Big|_{x=1}\bigg)^{-1} \\&=\frac{1}{e} \end{aligned} $$

**The proof is complete.**
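The approximation is extremely tight (the truncated terms of the series are of order $1/26!$), as a direct computation shows:

```python
from math import comb, factorial, e

# Total number of norepeatwords of each length i, summed over i = 1..26,
# and the probability that a random norepeatword uses all 26 letters
total = sum(comb(26, i) * factorial(i) for i in range(1, 27))
p_all = factorial(26) / total
print(p_all, 1 / e)  # both about 0.3679
```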

Given $n\geq 2$ numbers $a_1,a_2,\cdots,a_n$ with no repetitions, a bootstrap sample is a sequence $x_1,x_2,\cdots,x_n$ formed from the $a_j$'s by sampling with replacement with equal probabilities. Bootstrap samples arise in a widely used statistical method known as the bootstrap. For example, if $n=2$ and $(a_1,a_2)=(3,1)$, then the possible bootstrap samples are $(3,3),(3,1),(1,3)$ and $(1,1)$.

**(a)** How many possible bootstrap samples are there for $(a_1,a_2,\cdots,a_n)$?

For any $x_i$ with $i=1,2,\cdots,n$, it has $n$ possible choices, $\mathrm{i.e.}$ $a_1,a_2,\cdots,a_n$.

Therefore there are $n^n$ possible bootstrap samples.

**(b)** How many bootstrap samples are there for $(a_1,a_2,\cdots,a_n)$, if order does not matter (in the sense that it only matters how many times each $a_j$ was chosen, not the order in which they were chosen)?

Suppose that among the $x_i$, each $a_i$ is chosen $b_i$ times, $i=1,2,\cdots,n$. Since the order does not matter, the number of such possible bootstrap samples is exactly the number of **nonnegative** integer solutions to the equation

$$ b_1+b_2+\cdots+b_n=n $$

From the **Theorem** for Bose-Einstein Counting we have that number to be $\begin{pmatrix}2n-1\\n-1\end{pmatrix}$.

**(c)** One random bootstrap sample is chosen (by sampling from $a_1,a_2,\cdots,a_n$ with replacement, as described above). Show that not all unordered bootstrap samples (in the sense of (b)) are equally likely. Find an unordered bootstrap sample $\textbf{b}_1$ that is as likely as possible, and an unordered bootstrap sample $\textbf{b}_2$ that is as unlikely as possible. Let $p_1$ be the probability of getting $\textbf{b}_1$ and $p_2$ be the probability of getting $\textbf{b}_2$ (so $p_i$ is the probability of getting the specific unordered bootstrap sample $\textbf{b}_i$). What is $p_1/p_2$? What is the ratio of the probability of getting an unordered bootstrap sample whose probability is $p_1$ to the probability of getting an unordered sample whose probability is $p_2$?

Not all unordered bootstrap samples are equally likely: the more **repetition** of a certain $a_i$ a sample has, the lower its probability;

The **most** likely sample $\textbf{b}_1=(a_1,a_2,a_3,\cdots,a_n)$ has the least repetition as every element is different, and the **least** likely sample $\textbf{b}_2=(a_1,a_1,a_1,\cdots,a_1)$ has the most repetition as every element is the same.

We have

$$ \begin{cases} p_1=\displaystyle\frac{n!}{n^n}\\p_2=\displaystyle\frac{1}{n^n}\end{cases}\Rightarrow\frac{p_1}{p_2}=n! $$

There is exactly one unordered sample with probability $p_1$ (every $a_i$ chosen exactly once, reachable by $n!$ ordered sequences), while there are $n$ unordered samples with probability $p_2$:

$(a_1,a_1,\cdots,a_1),(a_2,a_2,\cdots,a_2),\cdots,(a_n,a_n,\cdots,a_n).$ So we have

$$ \frac{P(\text{getting an unordered bootstrap sample whose probability is }p_1)}{P(\text{getting an unordered bootstrap sample whose probability is }p_2)} =\frac{n!}{n}=(n-1)!\ . $$
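For a small $n$ these probabilities can be verified by brute-force enumeration of all $n^n$ ordered samples, grouping them by the multiset drawn (a minimal sketch under that setup):

```python
from itertools import product
from collections import Counter
from math import factorial

n = 4
samples = Counter()
# Enumerate all n^n ordered bootstrap samples; key each by its sorted multiset.
for draw in product(range(n), repeat=n):
    samples[tuple(sorted(draw))] += 1

counts = samples.values()
p1 = max(counts) / n ** n   # all elements distinct: n!/n^n
p2 = min(counts) / n ** n   # all elements equal:    1/n^n
print(p1 / p2)  # n! = 24
```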

**(Geometric Probability)** You get a stick and break it randomly into three pieces. What is the probability that you can make a triangle using such three pieces?

Since the length of the stick is irrelevant here, assume it to be $1$, and let the three pieces have lengths $x,y,1-x-y$. For them to form a triangle, no piece may be longer than $\displaystyle\frac{1}{2}$; otherwise it would exceed the sum of the other two. Therefore we have

$$ \text{Sample Space }S=\left\{\begin{array}{l}0\leq x,y\leq 1\\0\leq x+y\leq 1\end{array}\right.\quad \text{Event(Triangle Exists) }A=\left\{\begin{array}{l}0\leq x,y\leq\displaystyle\frac{1}{2}\\\displaystyle\frac{1}{2} \leq x+y\leq 1\end{array}\right. $$

Drawing the two regions in the $xy$-plane and comparing their areas, we have

$$ P(A)=\frac{S_A}{S_S}=\frac{\displaystyle\frac{1}{2}\times\frac{1}{2}\times\frac{1}{2}}{\displaystyle 1\times1\times\frac{1}{2}}=\frac{1}{4}. $$
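The answer $1/4$ can be checked by simulation, modeling the "random break" as two independent uniform break points on the stick (which is equivalent to the uniform $(x,y)$ region above):

```python
import random

def triangle_prob(trials, seed=0):
    """Break a unit stick at two uniform points; count triangle successes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u, v = sorted((rng.random(), rng.random()))
        a, b, c = u, v - u, 1 - v   # lengths of the three pieces
        if max(a, b, c) < 0.5:      # no piece may exceed half the stick
            hits += 1
    return hits / trials

print(triangle_prob(200_000))  # near 1/4
```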

In the birthday problem, we assumed that all 365 days of the year are equally likely (and excluded February 29). In reality, some days are slightly more likely as birthdays than others. For example, scientists have long struggled to understand why more babies are born 9 months after a holiday. Let $\textbf{p}=(p_1,p_2,\cdots,p_{365})$ be the vector of birthday probabilities, with $p_j$ the probability of being born on the $j^{\text{th}}$ day of the year (February 29 is still excluded, with no offense intended to Leap Dayers). The $k$th elementary symmetric polynomial in the variables $x_1,\cdots,x_n$ is defined by

$$ e_k(x_1,\cdots,x_n)=\sum_{1\leq j_1< j_2<\cdots<j_k\leq n}x_{j_1}\cdots x_{j_k}. $$

This just says to add up all of the $\begin{pmatrix}n\\k\end{pmatrix}$ terms we can get by choosing and multiplying $k$ of the variables. For example, $e_1(x_1,x_2,x_3)=x_1+x_2+x_3,e_2(x_1,x_2,x_3)=x_1x_2+x_1x_3+x_2x_3$,

and $e_3(x_1,x_2,x_3)=x_1x_2x_3$. Now let $k\geq2$ be the number of people.

**(a)** Find a simple expression for the probability that there is at least one birthday match, in terms of **p** and an elementary symmetric polynomial.

From the definition, $e_k(\textbf{p})$ sums, over every choice of $k$ distinct days, the probability that the $k$ people are born on those days in one particular assignment; since there are $k!$ ways to assign the $k$ chosen days to the $k$ people, the probability that there is **no** birthday match is $k!\cdot e_k(\textbf{p})$;

So that the probability that there is at least one birthday match is

$$ 1-k!\cdot e_k(\textbf{p}). $$

**(b)** Explain intuitively why it makes sense that $P$(at least one birth match) is minimized when $p_j=\displaystyle \frac{1}{365}$ for all $j$, by considering simple and extreme cases.

In the simple case where $k=2$, we have

$$ \begin{aligned} P(\text{at least one birth match})&=1-2\cdot e_2(\textbf{p})=(p_1+p_2+\cdots+p_{365})^2-2\cdot e_2(\textbf{p})\\&=p_1^2+p_2^2+\cdots+p_{365}^2\\&\geq 365 \cdot\bigg(\frac{p_1+p_2+\cdots+p_{365}}{365}\bigg)^2=\frac{1}{365} \end{aligned} $$

The minimum is attained $\mathrm{iff.}$ $p_1=p_2=\cdots=p_{365}$, $\mathrm{i.e.}$ $p_j=\displaystyle\frac{1}{365}$ for all $j$.

**The proof is complete.**

**(c)** The famous arithmetic mean-geometric mean inequality says that for $x,y\geq 0$

$$ \frac{x+y}{2}\geq\sqrt{xy}. $$

This inequality follows from adding $4xy$ to both sides of $x^2-2xy+y^2=(x-y)^2\geq 0$. Define **r**=$(r_1,r_2,\cdots,r_{365})$ by $r_1=r_2=(p_1+p_2)/2,r_j=p_j$ for $3\leq j\leq 365$. Using the arithmetic mean-geometric mean bound and the fact, which you should verify, that

$$ e_k(x_1,\cdots,x_n)=x_1x_2e_{k-2}(x_3,\cdots,x_n)+(x_1+x_2)e_{k-1}(x_3,\cdots,x_n)+e_k(x_3,\cdots,x_n) $$

show that

$$ P(\text{at least one birthday match}\mid\textbf{p})\geq P(\text{at least one birthday match}\mid\textbf{r}) $$

with strict inequality if $\textbf{p}\neq \textbf{r}$, where the given **r** notation means that the birthday probabilities are given by **r**. Using this, show that the value of **p** that minimizes the probability of at least one birthday match is given by $p_j=\displaystyle\frac{1}{365}$ for all $j$.

**Part A.** First we verify the fact that

$$ e_k(x_1,\cdots,x_n)=x_1x_2e_{k-2}(x_3,\cdots,x_n)+(x_1+x_2)e_{k-1}(x_3,\cdots,x_n)+e_k(x_3,\cdots,x_n) $$

We can split the terms of $e_k(x_1,\cdots,x_n)$ into four groups: (1) terms containing both $x_1$ and $x_2$; (2) terms containing $x_1$ only; (3) terms containing $x_2$ only; (4) terms containing neither.

Thus we have

$$ \begin{aligned} \text{Total }&=\text{ Part (1) }+\text{ Part (2) }+\text{ Part (3) }+\text{ Part (4) }\\ e_k(x_1,\cdots,x_n)&=x_1x_2e_{k-2}(x_3,\cdots,x_n)+x_1\cdot e_{k-1}(x_3,\cdots,x_n)\\&+x_2\cdot e_{k-1}(x_3,\cdots,x_n)+e_k(x_3,\cdots,x_n) \\&=x_1x_2e_{k-2}(x_3,\cdots,x_n)+(x_1+x_2)e_{k-1}(x_3,\cdots,x_n)+e_k(x_3,\cdots,x_n) \end{aligned} $$

**Part B.** Next we show that

$$ P(\text{at least one birthday match}\mid\textbf{p})\geq P(\text{at least one birthday match}\mid\textbf{r}) $$

$$ \begin{aligned} &\quad P(\text{at least one birthday match}\mid\textbf{p})\\&=1-k!\cdot e_k(\textbf{p}) \\&=1-k!\cdot\big(p_1p_2e_{k-2}(p_3,\cdots,p_n)+(p_1+p_2)e_{k-1}(p_3,\cdots,p_n)+e_k(p_3,\cdots,p_n)\big) \\&\geq 1-k!\cdot\bigg(\displaystyle\frac{(p_1+p_2)^2}{4}\cdot e_{k-2}(p_3,\cdots,p_n)\\&+(p_1+p_2)e_{k-1}(p_3,\cdots,p_n)+e_k(p_3,\cdots,p_n)\bigg) \\&=1-k!\cdot\big(r_1r_2e_{k-2}(r_3,\cdots,r_n)+(r_1+r_2)e_{k-1}(r_3,\cdots,r_n)+e_k(r_3,\cdots,r_n)\big) \\&=1-k!\cdot e_k(\textbf{r}) \\&=P(\text{at least one birthday match}\mid\textbf{r}) \end{aligned} $$

**Part C.**

Finally we prove that the value of **p** that minimizes the probability of at least one birthday match is given by $p_j=\displaystyle\frac{1}{365}$ for all $j$.

Suppose, for contradiction, that the probability of at least one birthday match is minimized by some vector $\textbf{p}'\neq\textbf{p}$, where $\textbf{p}=(\displaystyle \frac{1}{365},\frac{1}{365},\cdots,\frac{1}{365})$; then there exist indices $i\neq j$ with $p_i\neq p_j$ in $\textbf{p}'$.

Then since $p_i p_j<\displaystyle\frac{(p_i+p_j)^2}{4}$ for $p_i\neq p_j$,

define $\textbf{r}'=(r_1,r_2,\cdots,r_{365})$ by $r_i=r_j=(p_i+p_j)/2$ and $r_k=p_k$ for all other $k$ in $[1,365]$.

Then by the same argument as in **Part A** and **Part B**, with strict inequality since $p_ip_j<r_ir_j$, we have

$$ P(\text{at least one birthday match}\mid\textbf{p}')> P(\text{at least one birthday match}\mid\textbf{r}') $$

So $\textbf{r}'$ achieves a strictly smaller probability than $\textbf{p}'$, contradicting the minimality of $\textbf{p}'$.

Thus the probability is minimized $\mathrm{iff.}$ $p_j=\displaystyle\frac{1}{365}$ for all $j$.

**The proof is complete.**

(**Coupon Collection**) Each box of a brand of crispy instant noodles contains a coupon, and there are 108 different types of coupons. Given $n\geq 200$, what is the probability that buying $n$ boxes collects all 108 types of coupons? Plot a figure showing how this probability changes as $n$ increases. What is the minimum $n$ for which the probability is no less than 95%?

Collecting all 108 types of coupons in $n$ boxes is equivalent to partitioning the $n$ boxes into 108 nonempty groups (in $\begin{Bmatrix}n\\108\end{Bmatrix}$ ways) and then assigning the 108 coupon types to the groups (in $108!$ ways);

So that the probability of achieving this is

$$ P=\frac{\begin{Bmatrix}n\\108\end{Bmatrix}\cdot 108!}{108^n}=\sum_{i=0}^{108}(-1)^i\cdot\begin{pmatrix}108\\i\end{pmatrix}\cdot \bigg(\frac{108-i}{108}\bigg)^n $$

where $n$ takes a minimum number of 823 when $P\geq 0.95$.
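The inclusion-exclusion formula above can be evaluated directly to find the threshold (a minimal sketch; the function name `p_all_collected` is ours, and plotting is omitted):

```python
from math import comb

def p_all_collected(n, types=108):
    """P(all coupon types appear in n boxes), by inclusion-exclusion."""
    return sum((-1) ** i * comb(types, i) * ((types - i) / types) ** n
               for i in range(types + 1))

# sanity check with 2 coupon types: P = 1 - 2 * (1/2)^n
assert abs(p_all_collected(3, types=2) - 0.75) < 1e-12

n = 200
while p_all_collected(n) < 0.95:
    n += 1
print(n)  # the minimum n with P >= 0.95
```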

A **Set** is a collection of objects. Given two sets $A,B$, key concepts include

- empty set : $\varnothing$
- $A$ is a subset of $B$ : $A\subseteq B$
- union of $A$ and $B$ : $A\cup B$
- intersection of $A$ and $B$ : $A\cap B$
- complement of $A$ : $A^c$

**De Morgan's Laws**:

$$ (A\cup B)^c=A^c\cap B^c;\quad (A\cap B)^c=A^c\cup B^c. $$

The **Sample Space** of an experiment : the set of **all possible outcomes** of the experiment.

An **Event** $A$ is a subset of the sample space $S$.

$A$ *occurred* if the actual outcome belongs to $A$.

**Example : Coin Flips**

A coin is flipped $10$ times; Writing Heads as $1$ and Tails as $0$, then

- An outcome is a sequence $(s_1,s_2,\cdots,s_{10})$ with $s_j\in\{0,1\}$.
- The sample space : the set of all such sequences, $2^{10}$ elements.
- $A_j$ : The event that the $j$th flip is Heads.
- $B$ : The event that at least one flip was Heads. $(B=\bigcup_{j=1}^{10}A_j)$
- $C$ : The event that all the flips were Heads. $(C=\bigcap_{j=1}^{10}A_j)$
- $D$ : The event that there were at least two consecutive Heads. $(D=\bigcup_{j=1}^9(A_j\cap A_{j+1}))$

**Translation between English and Sets**

Assumption $1$ : Finite sample space;

Assumption $2$ : All outcomes are equally likely.

**Definition** (Classical Probability Model)

Let $A$ be an event for an experiment with a finite sample space $S$. The naive probability of $A$ is

$$ P_{\text{naive}}(A)=\frac{\left|A\right|}{\left|S\right|}=\frac{\text{number of outcomes favorable to }A}{\text{total number of outcomes in }S}. $$

**Sampling** : sampling from a set means choosing an element (draw a sample) or multiple elements (draw samples) from that set.

**With/Without Replacement** : put each element (object) **back** or not after each draw. Also known as "repetition is allowed or not"

**Ordered/Unordered** : ordering matters or not.

**Theorem** for order matters with replacement

Consider $n$ objects in a set and making $k$ choices from them, one at a time *with replacement* ($\mathrm{i.e.}$ choosing a certain object does not preclude it from being chosen again). Then there are $n^k$ possible outcomes.

**Theorem** for order matters without replacement

Consider $n$ objects in a set and making $k$ choices from them, one at a time *without replacement* ($\mathrm{i.e.}$ choosing a certain object precludes it from being chosen again). Then there are

$$ n(n-1)\cdots(n-k+1) $$

possible outcomes for $k\leq n$ (and $0$ possible outcomes for $k>n$). For the special case $k=n$, there are $n!$ possible outcomes; each outcome is called a **permutation** of the $n$ objects.

**Example : Birthday Problem**

There are $k$ people in a room. Assume each person's birthday is equally likely to be any of the $365$ days of the year (excluding February $29^{\text{th}}$), and that people's birthdays are independent (we assume there are no twins in the room). What is the probability that two or more people in the room have the same birthday?

Event $A$ : There exists $2$ people whose birthdays are the same

$$ \begin{align} P(A^c)&=\frac{\text{Number of ways to assign birthdays to }k\text{ people without repetition}}{\text{Number of ways to assign birthdays to }k\text{ people}} \\&=\frac{365(365-1)(365-2)\cdots(365-k+1)}{365^k};\\ P(A)&=1-P(A^c) \end{align} $$
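This formula is easy to evaluate numerically; famously, just $k=23$ people already push the match probability past $50\%$ (a minimal sketch; the function name `p_birthday_match` is ours):

```python
def p_birthday_match(k, days=365):
    """P(at least two of k people share a birthday), all days equally likely."""
    p_distinct = 1.0
    for j in range(k):
        p_distinct *= (days - j) / days  # j-th person avoids the first j birthdays
    return 1 - p_distinct

print(p_birthday_match(23))  # about 0.507
```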

**Generalized Birthday Problem**

Each of $k$ people has a random number ($\mathrm{e.g.}$ a "birthday") drawn from $n$ values ($\mathrm{e.g.}$ "days").

If the probability that at least two people have the same number is $50\%$, then

$$ k\approx 1.18\sqrt{n}. $$

The result is reached based on the fact that

$$ \text{when }\left|x\right|\ll 1,~e^{-x}\approx 1-x $$

as follows :

$$ \begin{align} P(A^c)&=(1-\frac{1}{n})\cdots(1-\frac{k-1}{n}) \\&\approx e^{-\frac{1}{n}}\cdot e^{-\frac{2}{n}}\cdots e^{-\frac{k-1}{n}} \\&=e^{-\frac{k(k-1)}{2n}}\approx e^{-\frac{k^2}{2n}}; \\P(A^c)=0.5\Rightarrow P(A^c)&=0.5=e^{-\frac{k^2}{2n}}\Rightarrow k=\sqrt{n\cdot 2\ln 2}\approx 1.18\sqrt{n}. \end{align} $$

**Application : Hash Table**

A commonly used data structure for fast information retrieval

Example : Store people's names; For each person $x$, a hash function $h$ is computed; $h(x)$ is the location that will be used to store $x$'s name.

**Hash Collision**

Collision : $x\neq y$, but $h(x)=h(y)$

Given $k$ people (with different names) and $n$ locations, what is the probability of occurrence of hash collision?

$$ P(\text{Hash Collision})=1-\frac{n(n-1)\cdots(n-k+1)}{n^k}\approx 1-e^{-\frac{k^2}{2n}}. $$

**Cryptographic Hash Function**

A good cryptographic hash function $f$ has two properties:

- Given the hash $f(M)$ of a message string $M$, it's computationally **infeasible** to recover $M$.
- It's computationally **infeasible** to find a "collision", meaning a pair of distinct messages $M_1\neq M_2$ such that $f(M_1)=f(M_2)$.

**Application** : A map which "scrambles" long strings (message) into $m$-bit hash (digest).

- Example 1 : MD(Message-Digest Algorithm)5(Bittorrent) with $m=128$.
- Example 2 : SHA(Secure Hash Algorithm)-1(SSL,PGP) with $m=160$.

**The Birthday Attack**

Suppose we try to "break" a hash function by finding a collision (e.g. forged digital signature)

One method : take a huge number of messages $M$, hash them all and hope to find two with the same hash value.

Now how many messages would you have to try before there was at least a $50\%$ chance of finding two with the same hash?

There are $n=2^m$ possible hash values, so $k\approx\sqrt{n}=2^{m/2}$ messages suffice. For SHA-1, $k\approx 2^{80}$.

Choosing without replacement when order doesn't matter is also called a **combination**.

$k$-combination : choose a $k$-element subset of a set with $n$ elements (since that the order of elements does not matter in a set)

**Binomial Coefficient**

**Definition**

For any nonnegative integers $k$ and $n$, the binomial coefficient $\begin{pmatrix}n\\k\end{pmatrix}$, read as "$n$ choose $k$", is the number of subsets of size $k$ for a set of size $n$.

**Theorem**

For $k\leq n$, we have

$$ \begin{pmatrix}n\\k\end{pmatrix}=\frac{n(n-1)\cdots(n-k+1)}{k!}=\frac{n!}{k!(n-k)!} $$

**Binomial Theorem**

$$ (x+y)^n=\sum_{k=0}^n\begin{pmatrix}n\\k\end{pmatrix}x^ky^{n-k}. $$

**Multinomial Theorem** : extends the number of terms from $2$ to $r$

$$ \begin{gather}(x_1+x_2+\cdots+x_r)^n=\sum_{n_1,n_2,\cdots,n_r\geq0}\frac{n!}{n_1!n_2!\cdots n_r!}x_1^{n_1}x_2^{n_2}\cdots x_r^{n_r},\\ \text{ where }n_1+n_2+\cdots+n_r=n. \end{gather} $$

**Story Proof : The Team Captain**

For any positive integers $n$ and $k$ with $k\leq n$, we have

$$ n\begin{pmatrix}n-1\\k-1\end{pmatrix}=k\begin{pmatrix}n\\k\end{pmatrix} $$

There is a group of $n$ people; From which we choose a team of $k$ people, and one of the team members will be the captain.

If we choose the captain first and then the remaining team members, the number of choices is $n\begin{pmatrix}n-1\\k-1\end{pmatrix}$; if we choose the team members first and then the captain among them, the number of choices is $k\begin{pmatrix}n\\k\end{pmatrix}$. The two counts describe the same set of choices, so the identity is proved.

**Story Proof : Vandermonde's Identity**

A famous relationship between binomial coefficients, called *Vandermonde's Identity*, says that

$$ \begin{pmatrix}m+n\\k\end{pmatrix}=\sum_{j=0}^k\begin{pmatrix}m\\j\end{pmatrix}\begin{pmatrix}n\\k-j\end{pmatrix} $$

There are $m$ men and $n$ women, and we want to choose $k$ people from them.

If we choose all $k$ people at once, the number of choices is $\begin{pmatrix}m+n\\k\end{pmatrix}$; if we first choose some specific number $j$ of men and then the remaining $k-j$ women, the number of choices is $\begin{pmatrix}m\\j\end{pmatrix}\begin{pmatrix}n\\k-j\end{pmatrix}$. Summing over all possible $j$ from $0$ to $k$ proves the identity.
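Vandermonde's identity is easy to spot-check numerically (note that `math.comb(n, r)` returns $0$ whenever $r>n$, so the out-of-range terms vanish automatically):

```python
from math import comb

# Check Vandermonde's identity for one choice of m, n, k
m, n, k = 7, 5, 6
lhs = comb(m + n, k)
rhs = sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))
print(lhs, rhs, lhs == rhs)
```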

How many ways are there to choose $k$ times from a set of $n$ objects with replacement, if order doesn't matter (we only care about how many times each object was chosen, not the order in which they were chosen)?

The question is also called "Bose-Einstein Counting"

**Equivalent Problem**

The number of solutions for the indeterminate equation $x_1+x_2+\cdots+x_n=r$ where all terms are nonnegative integers.

**Theorem**

There are $\begin{pmatrix}r-1\\n-1\end{pmatrix}$ distinct **positive** integer-valued vectors $(x_1,x_2,\cdots,x_n)$ satisfying the equation

$$ x_1+x_2+\cdots+x_n=r,x_i>0,i=1,2,\cdots,n. $$

**Theorem** Bose-Einstein Counting($>0\rightarrow \geq 0$)

There are $\begin{pmatrix}n+k-1\\n-1\end{pmatrix}$ distinct **nonnegative** integer-valued vectors $(x_1,x_2,\cdots,x_n)$ satisfying the equation

$$ x_1+x_2+\cdots+x_n=k,x_i\geq0,i=1,2,\cdots,n. $$

Adding $1$ to each $x_i$ increases the total sum by $n$, turning the problem into counting the **positive** integer solutions of an equation with $\mathrm{RHS}=n+k$; by the previous theorem, this number is $\begin{pmatrix}n+k-1\\n-1\end{pmatrix}$.

Choosing $k$ objects out of $n$ objects, the numbers of possible ways are:

$$ \begin{matrix} &\text{Order Matters}&\text{Order Not Matter}\\\text{with replacement}&n^k&\begin{pmatrix}n+k-1\\k\end{pmatrix}\\\text{without replacement}&n!/(n-k)!&\begin{pmatrix}n\\k\end{pmatrix} \end{matrix} $$
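All four entries of the table can be checked against direct enumeration for small $n$ and $k$, since Python's `itertools` provides exactly these four sampling schemes:

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, factorial

n, k = 5, 3
# ordered, with replacement: n^k
assert len(list(product(range(n), repeat=k))) == n ** k
# ordered, without replacement: n!/(n-k)!
assert len(list(permutations(range(n), k))) == factorial(n) // factorial(n - k)
# unordered, without replacement: C(n, k)
assert len(list(combinations(range(n), k))) == comb(n, k)
# unordered, with replacement (Bose-Einstein): C(n+k-1, k)
assert len(list(combinations_with_replacement(range(n), k))) == comb(n + k - 1, k)
print("all four counts check out")
```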

Geometric probability is a tool to deal with the problem of infinite outcomes by measuring the number of outcomes geometrically, in terms of geometric measure such as *length, area,* or *volume*.

Equally likely means the probability of falling into some geometric region is proportional to the measure of such region including *length, area,* or *volume*.

Given a sample space $S$, the probability of event $A$ occurring is

$$ P(A)=\frac{M(A)}{M(S)}, $$

where $M(\cdot)$ is the measure of geometric region.
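A quick Monte Carlo check of this length-ratio definition, assuming a uniform draw on $[1,2)$ and the hypothetical event $A=\{x\in[1.2,1.5)\}$ with $M(A)/M(S)=0.3$:

```python
import random

random.seed(0)
N = 100_000
# Count how often a uniform sample on [1, 2) lands in [1.2, 1.5)
hits = sum(1 for _ in range(N) if 1.2 <= random.uniform(1.0, 2.0) < 1.5)
estimate = hits / N
# The estimate should be close to the length ratio 0.3 / 1.0
assert abs(estimate - 0.3) < 0.01
```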

**Example : 1-dimensional Geometric Probability**

$x$ is a real number that $x\in[1,2)$. What is the probability that $x=1.5$?

Divide $[1,2)$ into $2N+1$ equal length intervals:

$[1,\displaystyle1+\frac{1}{2N+1}),[\displaystyle1+\frac{1}{2N+1},\displaystyle1+\frac{2}{2N+1}),\cdots,[\displaystyle1+\frac{2N}{2N+1},2)$.

The probability for choosing any of the intervals is $P=\displaystyle \frac{1}{2N+1}$.

Since we have that $1.5\in A_n=[\displaystyle1+\frac{N}{2N+1},\displaystyle1+\frac{N+1}{2N+1})$, we have

$$ 0\leq P(x=1.5)\leq P(A_n)=\frac{1}{2N+1}\xrightarrow{N\rightarrow\infty}0\Rightarrow P(x=1.5)=0. $$

An event whose probability is $0$ may still happen, as the example above shows; events that cannot happen under any circumstances are called **impossible events**.

The *frequentist* view: probability represents a long-run frequency over a large number of repetitions of an experiment.

If we say a coin has probability $\displaystyle\frac{1}{2}$ of Heads, that means the coin would land Head $50\%$ of the time if we tossed it over and over.

However, the frequency may not exist in general.

This long-run-frequency view is also the intuition behind the Monte Carlo method.

The *Bayesian* view: probability represents a degree of belief about the event in question.

So we can assign probabilities to hypotheses like "candidate $A$ will win the election" or "the defendant is guilty" even if it is not possible to repeat the same election or the same crime over and over again (thus no statistical data).

It is related to Logic, Philosophy and Psychology.

**Definition**

Given a sample space $S$, the class of subsets that constitute the set of events satisfies the following axioms:

- The sample space itself $S$ is an event.
- For every event $A$, the complement $A^c$ is an event.
- For every sequence of events $A_1,A_2,\cdots$, the union $\bigcup_{j=1}^{\infty}A_j$ is an event.

**Definition**

A *probability space* consists of a *sample space* $S$ and a *probability function* $P$ which takes an event $A\subseteq S$ as input and returns $P(A)$, a real number between $0$ and $1$, as output. That function $P$ must satisfy the following axioms:

- $P(\varnothing)=0,\ P(S)=1$.
- If $A_1,A_2,\cdots$ are **disjoint** events, then

$$ P\big(\bigcup_{j=1}^{\infty}A_j\big)=\sum_{j=1}^{\infty}P(A_j). $$

Saying that these events are disjoint means that they are **mutually exclusive**: $A_i\cap A_j=\varnothing$ for $i\neq j$.

Probability has the following properties, for any events $A$ and $B$:

- $1.$ $P(A^c)=1-P(A)$.

Since that

$$ P(A\cup A^c)=P(S)=1\Leftrightarrow P(A)+P(A^c)=1\Leftrightarrow P(A^c)=1-P(A). $$

- $2.$ If $A\subseteq B$, then $P(A)\leq P(B)$.

Since that $A\subseteq B$, we have

$$ B=A\cup(B\cap A^c)\Rightarrow P(B)=P(A)+P(B\cap A^c)\geq P(A). $$

- $3.$ $P(A\cup B)=P(A)+P(B)-P(A\cap B)$ (the inclusion-exclusion principle)

$$ P(A\cup B)=P(A)+P(A^c\cap B)=P(A)+\big(P(B)-P(A\cap B)\big). $$

**Theorem**

For any $n$ events $A_1,A_2,\cdots,A_n$, we have

$$ P(A_1\cap A_2\cap\cdots\cap A_n)\geq P(A_1)+P(A_2)+\cdots+P(A_n)-(n-1). $$

$$ \begin{align} \text{Proof : }&\Leftrightarrow 1-P(A_1\cap A_2\cap\cdots\cap A_n)\leq1-P(A_1)-\cdots-P(A_n)+n-1 \\&\Leftrightarrow P(\overline{A_1\cap A_2\cap\cdots\cap A_n})\leq[1-P(A_1)]+[1-P(A_2)]+\cdots+[1-P(A_n)]\\ &\Leftrightarrow P(\overline{A_1}\cup\cdots\cup\overline{A_n})\leq P(\overline{A_1})+\cdots+P(\overline{A_n})\\ &\Leftrightarrow\text{Proved by Property 1.} \end{align} $$

**Theorem**

For any events $A_1,A_2,\cdots$ we have

$$ P(\bigcup_{i=1}^{\infty}A_i)\leq\sum_{i=1}^{\infty}P(A_i). $$

First we have that

$$ \begin{align} &A_1\cup A_2=A_1\cup(A_2\cap\overline{A_1});\\ &A_1\cup A_2\cup A_3=A_1\cup A_2\cup(A_3\cap\overline{A_1\cup A_2})=A_1\cup(A_2\cap\overline{A_1})\cup(A_3\cap\overline{A_1\cup A_2});\\ &\cdots \\&A_1\cup\cdots\cup A_n=A_1\cup(A_2\cap\overline{A_1})\cup(A_3\cap\overline{A_1\cup A_2})\cup\cdots\cup(A_n\cap\overline{A_1\cup\cdots\cup A_{n-1}}). \end{align} $$

Thus

$$ \begin{align} P(\bigcup_{i=1}^{\infty}A_i)&=P\big(A_1\cup(\bigcup_{i=2}^{\infty}A_i\cap\overline{\bigcup_{j=1}^{i-1}A_j})\big) \\&=P(A_1)+\sum_{i=2}^{\infty}P\big(A_i\cap\overline{\bigcup_{j=1}^{i-1}A_j}\big)\\ &\leq P(A_1)+\sum_{i=2}^{\infty}P(A_i) \\&=\sum_{i=1}^{\infty}P(A_i).\qquad\qquad\square \end{align} $$

For any events $A_1,A_2,\cdots, A_n$ (the $n$-event inclusion-exclusion principle):

$$ \begin{align} P(\bigcup_{i=1}^{n}A_i)&=\sum_iP(A_i)-\sum_{i<j}P(A_i\cap A_j)+\sum_{i<j<k}P(A_i\cap A_j\cap A_k)\\&+\cdots+ (-1)^{n+1}\cdot P(A_1\cap A_2\cap\cdots\cap A_n). \end{align} $$
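The $n$-event formula can be verified on a finite uniform sample space, where $P(A)=|A|/|S|$; a sketch with three arbitrarily chosen events:

```python
from itertools import combinations
from functools import reduce

def union_prob_by_inclusion_exclusion(events, p):
    """Evaluate P(A_1 ∪ ... ∪ A_n) term by term from the inclusion-exclusion formula."""
    n = len(events)
    total = 0.0
    for r in range(1, n + 1):
        sign = (-1) ** (r + 1)  # + for odd-sized intersections, - for even-sized
        for subset in combinations(events, r):
            total += sign * p(reduce(set.intersection, subset))
    return total

# Finite uniform sample space {0, ..., 9}
S = set(range(10))
p = lambda A: len(A) / len(S)
A1, A2, A3 = {0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7}
lhs = p(A1 | A2 | A3)                                      # direct probability of the union
rhs = union_prob_by_inclusion_exclusion([A1, A2, A3], p)   # formula
assert abs(lhs - rhs) < 1e-12
```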

Signals describe a wide range of physical phenomena; **Mathematically**, signals are represented as functions of one or more variables. For example:

- Sound: represents acoustic pressure as a function of time $f(t)$
- Picture: represents brightness as a function of two spatial variables $f(x,y)$
- Video: consists of a sequence of images, called frames, and is a function of three variables: two spatial coordinates and time.

A signal may have one or more independent variables; generally we focus on signals that involve a **single** independent variable, **time** (although it may not represent physical time in specific applications).

**Continuous-Time signals**: the independent variable is continuous, and signals are defined for a **continuum** of values.

Represented as $x(t)$ where $t$ denotes the independent variable.

**Discrete-Time signals**: defined only at discrete times; the independent variable takes only a discrete set of values. (e.g. the closing price of a certain stock over one week)

Represented as $x[n]$, where $n$ denotes the independent variable.

For $x[n]$, it might be inherently discrete in nature, or it could be a **sampling** of a continuous-time signal; it is defined only for **integer values** of $n$.

Introduction: Let $v(t)$ and $i(t)$ be voltage and current across a resistor $R$, the instantaneous power is

$$ p(t)=v(t)i(t)=\frac{1}{R}v^2(t) $$

So that the total energy over the time interval $t_1\leq t\leq t_2$ is

$$ E_{R}=\int_{t_1}^{t_2}p(t)\mathrm{d}t=\int_{t_1}^{t_2}\frac{1}{R}v^2(t)\mathrm{d}t $$

And the average power over the interval $t_1\leq t\leq t_2$ is

$$ P_R=\frac{1}{t_2-t_1}\int_{t_1}^{t_2}p(t)\mathrm{d}t=\frac{1}{t_2-t_1}\int_{t_1}^{t_2}\frac{1}{R}v^2(t)\mathrm{d}t $$

Similarly, for any signal $x(t)$ or $x[n]$, the **total energy** is defined as

$$ \begin{gather} E=\int_{t_1}^{t_2}\left|x(t)\right|^2\mathrm{d}t,\quad t\in\left[t_1,t_2\right]\\ E=\sum_{n=n_1}^{n_2}\left|x[n]\right|^2,\quad n\in\left[n_1,n_2\right]. \end{gather} $$

The **average power** is defined as

$\displaystyle P=\frac{E}{t_2-t_1}$ for continuous-time signal and $\displaystyle P=\frac{E}{n_2-n_1+1}$ for discrete-time signal.

Over **infinite** time interval $-\infty\leq t\leq \infty$ or $-\infty \leq n\leq \infty$, we have

$$ \left\{ \begin{array}{} \displaystyle E_{\infty}\triangleq\lim_{T\rightarrow \infty}\int_{-T}^{T}\left|x(t)\right|^2\mathrm{d}t=\int_{-\infty}^{\infty}\left|x(t)\right|^2\mathrm{d}t\quad\text{Continuous} \\ \displaystyle E_{\infty}\triangleq\lim_{N\rightarrow\infty}\sum_{n=-N}^{N}\left|x[n]\right|^2=\sum_{n=-\infty}^{\infty}\left|x[n]\right|^2\quad\text{Discrete} \end{array} \right. $$

and

$$ \left\{\begin{array}{} \displaystyle P_{\infty}\triangleq\lim_{T\rightarrow\infty}\frac{1}{2T}\int_{-T}^{T}\left|x(t)\right|^2\mathrm{d}t\quad\text{Continuous}\\ \displaystyle P_{\infty}\triangleq\lim_{N\rightarrow\infty}\frac{1}{2N+1}\sum_{n=-N}^{N}\left|x[n]\right|^2\quad\text{Discrete} \end{array}\right. $$

For any finite-energy signal $E_{\infty}<\infty$, its average power

$$ P_{\infty}=\lim_{T\rightarrow\infty}\frac{E_{\infty}}{2T}=0\quad\text{and}\quad P_{\infty}=\lim_{N\rightarrow\infty}\frac{E_{\infty}}{2N+1}=0. $$

Finite-power signal (but not zero power): $P_{\infty}<\infty,\ E_{\infty}=\infty$

Infinite energy & power signal: $P_{\infty}\rightarrow\infty,E_{\infty}\rightarrow\infty$.
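The unit step $u[n]$ is a standard example of a finite-power, infinite-energy signal: over $[-N,N]$ its energy is $N+1$, while its average power $(N+1)/(2N+1)\rightarrow1/2$. A numeric check:

```python
def partial_energy_and_power(N):
    """Energy and average power of u[n] over the window n = -N..N."""
    E = sum(1 for n in range(-N, N + 1) if n >= 0)  # |u[n]|^2 = u[n]
    return E, E / (2 * N + 1)

E, P = partial_energy_and_power(10_000)
assert E == 10_001              # energy grows without bound as N increases
assert abs(P - 0.5) < 1e-4      # average power converges to 1/2
```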

$$ x(t)\rightarrow y(t)=x(t-t_0) $$

Relationship:

$$ y(t)\bigg|_{t=t_1}=x(t-t_0)\bigg|_{t=t_1}=x(t_1-t_0)=x(t)\bigg|_{t=t_1-t_0}. $$

For discrete-time signal

$$ x[n]\rightarrow x[n-n_0]. $$

Flip along the axis where the independent variable $=0$.

$$ x(t)\rightarrow x(-t);\ x[n]\rightarrow x[-n] $$

$$ \text{Compressed as }x(t)\rightarrow x(2t);\quad\text{Stretched as }x(t)\rightarrow x(t/2). $$

Combine all of the above as

$$ x(t)\rightarrow x(\alpha t+\beta) $$

$\left|\alpha\right|>1$, compressed; $\left|\alpha\right|<1$, stretched; $\alpha<0$, reversed; $\beta\neq 0$, shifted.

A continuous-time signal $x(t)$ is a periodic signal $\mathrm{iff.}$ there exists some $T$ $\mathrm{s.t.}$

$$ \forall t,\ x(t)=x(t+T). $$

The smallest **positive** value of $T$ is called the **fundamental period** (基波周期) of the signal.

Similarly for discrete-time signals we have $x[n]=x[n+N]$ for all $n$.

**Even Signal**:

$$ x(t)=x(-t)\quad x[n]=x[-n] $$

**Odd Signal**:

$$ x(t)=-x(-t)\quad x[n]=-x[-n] $$

**Any** Signal can be broken into a sum of two signals, one even and one odd, in the following method:

$$ x(t)=x_e(t)+x_o(t),\text{ where }\begin{cases} x_e(t)=E_v\{x(t)\}=\displaystyle\frac{1}{2}[x(t)+x(-t)]\\ x_o(t)=O_d\{x(t)\}=\displaystyle\frac{1}{2}[x(t)-x(-t)] \end{cases}. $$

For discrete-time signals, it is similar that

$$ x[n]=x_e[n]+x_o[n],\text{ where }\begin{cases} x_e[n]=\displaystyle\frac{x[n]+x[-n]}{2}\\ x_o[n]=\displaystyle\frac{x[n]-x[-n]}{2} \end{cases}. $$
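The decomposition is easy to verify numerically; the sketch below (using NumPy, with a signal defined on $n=-N,\dots,N$ so that reversing the array corresponds to $x[-n]$) checks evenness, oddness, and reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 7
x = rng.normal(size=2 * N + 1)      # x[n] for n = -N..N, stored at index n + N
x_rev = x[::-1]                      # x[-n]
x_e = (x + x_rev) / 2                # even part
x_o = (x - x_rev) / 2                # odd part

assert np.allclose(x_e, x_e[::-1])   # even:  x_e[n] ==  x_e[-n]
assert np.allclose(x_o, -x_o[::-1])  # odd:   x_o[n] == -x_o[-n]
assert np.allclose(x, x_e + x_o)     # reconstruction
```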

In the general case we have

$$ x(t)=Ce^{at} $$

where $C$ and $a$ are complex numbers.

When $C$ and $a$ are real, it is a real exponential signal where

$$ \begin{cases} a>0\Leftrightarrow x(t)\uparrow\text{ as }t\uparrow\\ a<0\Leftrightarrow x(t)\downarrow\text{ as }t\uparrow\\ a=0\Leftrightarrow x(t)\text{ is constant.} \end{cases} $$

**Periodic exponential signals**

In such case, $C$ is real, specifically $C=1$; $a$ is purely imaginary, $a=j\omega_0$:

$$ x(t)=e^{j\omega_0t} $$

To find its Fundamental period $T_0$, we have

$$ \begin{align} x(t)&=e^{j\omega_0t}=e^{j\omega_0(t+T)}=e^{j\omega_ 0t}\cdot e^{j\omega_0T}\Rightarrow e^{j\omega_0T}=1 \\&\Rightarrow \omega_0T=2k\pi,k\in\mathbb{Z}\setminus\{0\} \\&\Rightarrow T=\frac{2k\pi}{\omega_0} \\&\Rightarrow T_0=\frac{2\pi}{\left|\omega_0\right|} \end{align} $$

Notice: $T_0$ is undefined for $\omega_0=0$ as $x(t)$ is constant and any $T$ is its period.

**Sinusoidal Signals**

$$ x(t)=A\cos(\omega_0t+\phi) $$

Which is closely related to complex exponential signals (Euler's Formula)

$$ \begin{align} &\quad e^{j(\omega_0t+\phi)}=\cos(\omega_0t+\phi)+j\sin(\omega_0t+\phi) \\&\Leftrightarrow\begin{cases} A\cos(\omega_0t+\phi)=A\cdot Re\{e^{j(\omega_0t+\phi)}\} \\A\sin(\omega_0t+\phi)=A\cdot Im\{e^{j(\omega_0t+\phi)}\} \end{cases} \end{align} $$

Its **fundamental frequency** (基波频率) is $\omega_0$.

Both $e^{j\omega_0t}$ and $A\cos(\omega_0t+\phi)$ have infinite total energy but finite average power:

$$ \begin{align} \text{For }e^{j\omega_0t}&: \begin{cases} \displaystyle E_{period}=\int_0^{T_0}\left|e^{j\omega_0t}\right|^2\mathrm{d}t=\int_0^{T_0}1\mathrm{d}t=T_0 \\ P_{period}=\displaystyle\frac{1}{T_0}E_{period}=1 \end{cases}\\ \end{align} $$

$$ \begin{align} \text{For }A\cos(\omega_0t+\phi):E_{period}&=\int_0^{T_0}\left|A\cos(\omega_0t+\phi)\right|^2\mathrm{d}t \\&=A^2\int_0^{T_0}\cos^2(\omega_0t+\phi)\mathrm{d}t \\&=\frac{A^2}{2}\int_0^{T_0}\big(1+\cos(2\omega_0t+2\phi)\big)\mathrm{d}t \\&=\frac{A^2}{2}T_0+\frac{A^2}{2}\int_0^{T_0}\cos(2\omega_0t+2\phi)\mathrm{d}t \\&=\frac{A^2}{2}T_0+\frac{A^2}{2}\cdot\frac{1}{2\omega_0}\cdot\sin(2\omega_0t+2\phi)\bigg|_{0}^{T_0} \\&=\frac{A^2}{2}T_0; \\P_{period}&=\frac{1}{T_0}E_{period}=\frac{A^2}{2}. \end{align} $$
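The result $P_{period}=A^2/2$ can be confirmed by averaging $|x(t)|^2$ over uniformly spaced samples of one period (the values of $A$, $\omega_0$, and $\phi$ below are arbitrary):

```python
import numpy as np

A, w0, phi = 2.0, 3.0, 0.7
T0 = 2 * np.pi / w0
# Sample one full period; endpoint=False avoids double-counting t = T0
t = np.linspace(0.0, T0, 200_000, endpoint=False)
x = A * np.cos(w0 * t + phi)
P = np.mean(x**2)                    # average power over one period

assert abs(P - A**2 / 2) < 1e-9      # matches the closed form A^2 / 2
```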

**Harmonically related (谐波) complex exponentials**

Sets of periodic exponentials (**with different frequencies**), all of which are periodic with a common period $T_0$

To have the common period $T_0$, they must satisfy that

$$ e^{j\omega t}=e^{j\omega(t+T_0)}=e^{j\omega t}\cdot e^{j\omega T_0}\Rightarrow \omega T_0=2k\pi\Rightarrow\omega=\frac{2k\pi}{T_0}=k\omega_0\Rightarrow \omega_0=\frac{2\pi}{T_0} $$

We call

$$ \phi_k(t)=e^{jk\omega_0t},k\in\mathbb{Z}\setminus\{0\} $$

a **harmonically related set**.

For any $k\neq 0$, they have fundamental frequency $\left|k\right|\omega_0$ and fundamental period $\displaystyle\frac{2\pi}{\left|k\right|\omega_0}=\frac{T_0}{\left|k\right|}$.

**Back to the general case**:

$$ x(t)=Ce^{at} $$

Since that complex numbers $C,a$ can be written in the form

$$ C=\left|C\right|e^{j\theta},\ a=r+j\omega_0 $$

We have

$$ \begin{align} Ce^{at}&=\left|C\right|e^{j\theta}e^{(r+j\omega_0)t}=\left|C\right|e^{rt}e^{j(\omega_0t+\theta)} \\&=\left|C\right|e^{rt}\cos(\omega_0t+\theta)+j\left|C\right|e^{rt}\sin(\omega_0t+\theta) \end{align} $$

In the general case

$$ x[n]=C\alpha^n $$

where $C$ and $\alpha$ are complex numbers;

$x[n]$ can be expressed in exponential form as $x[n]=Ce^{\beta n}$ if we let $\alpha=e^{\beta}$ using Euler's Formula, but for discrete-time signals it is easier to handle in the former format.

**Sinusoidal signals**

In such case $C$ is real, specifically $C=1$; $\beta$ is purely imaginary, $\beta=j\omega_0$:

$$ x[n]=e^{j\omega_0n} $$

Which is closely related to $A\cos(\omega_0n+\phi)$ since that

$$ \begin{align} e^{j\omega_0n}&=\cos\omega_0n+j\sin\omega_0n\\ A\cos(\omega_0n+\phi)&=\frac{A}{2}\cdot e^{j\phi}e^{j\omega_0n}+\frac{A}{2}\cdot e^{-j\phi}e^{-j\omega_0n} \end{align} $$

Similarly, it has infinite total energy but finite average power since

$$ \left|e^{j\omega_0n}\right|^2=1. $$

**General Signals**

$$ \begin{align} x[n]&=C\alpha^n\text{ where }C=\left|C\right|e^{j\theta},\alpha=\left|\alpha\right|e^{j\omega_0} \\\Rightarrow x[n]&=\left|C\right|\left|\alpha\right|^n\cos(\omega_0n+\theta)+j\left|C\right|\left|\alpha\right|^n\sin(\omega_0n+\theta) \end{align} $$

**Periodicity properties**

Focusing on $\omega_0$ of $x[n]=e^{j\omega_0n}$:

It is obvious that $e^{j\omega_0n}$ has the same value at $\omega_0$ and $\omega_0+2k\pi$;

Only consider interval $0\leq\omega_0\leq 2\pi$:

Notice that here we are considering different frequencies $\omega_0$ rather than one period of a specific signal.

From $0$ to $\pi$: when $\omega_0$ increases, the **oscillation rate** (振荡率) of $e^{j\omega_0n}$ **increases**;

From $\pi$ to $2\pi$: when $\omega_0$ increases, the oscillation rate of $e^{j\omega_0n}$ **decreases**.

The maximum oscillation rate is taken at $\omega_0=\pi$ where

$$ e^{j\pi n}=(e^{j\pi})^n=(-1)^n $$

Then, we focus on $n$ of $x[n]=e^{j\omega_0n}$:

In order for $e^{j\omega_0n}$ to be periodic with $N>0$, it must satisfy that

$$ e^{j\omega_0(n+N)}=e^{j\omega_0N}\cdot e^{j\omega_0n}=e^{j\omega_0n}\Rightarrow\omega_0N=2m\pi\Rightarrow\frac{\omega_0}{2\pi}=\frac{m}{N} $$

So that $\displaystyle\frac{\omega_0}{2\pi}$ must be a rational number; In such case,

Fundamental frequency: $\displaystyle\frac{2\pi}{N}=\frac{\omega_0}{m}$; Fundamental period: $N=\displaystyle\frac{2m\pi}{\omega_0}$ (taking $m$ and $N$ with no common factors).
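Given $\omega_0=2\pi m/n$, the fundamental period is the denominator of $\omega_0/2\pi$ in lowest terms; a small helper (the function name is illustrative):

```python
from fractions import Fraction

def fundamental_period(m, n):
    """Fundamental period N of e^{j*w0*k} for w0 = 2*pi*m/n (a rational multiple of 2*pi)."""
    f = Fraction(m, n)        # w0 / (2*pi) reduced to lowest terms
    return f.denominator      # N, from w0/(2*pi) = m/N with gcd(m, N) = 1

# w0 = 3*pi/4  ->  w0/(2*pi) = 3/8         ->  N = 8
assert fundamental_period(3, 8) == 8
# w0 = 2*pi/3  ->  w0/(2*pi) = 1/3         ->  N = 3
assert fundamental_period(1, 3) == 3
# w0 = 4*pi/6  ->  w0/(2*pi) = 2/6 = 1/3   ->  N = 3
assert fundamental_period(2, 6) == 3
```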

**Unit Impulse** (unit sample, 单位冲击函数) is defined as

$$ \delta[n]=\begin{cases}0&n\neq0\\1&n=0\end{cases} $$

**Unit Step** (单位阶跃函数) is defined as

$$ u[n]=\begin{cases}0&n<0\\1&n\geq0\end{cases} $$

The impulse is the first difference (差分) of the step:

$$ \delta[n]=u[n]-u[n-1] $$

Conversely, the step is the *running* sum of the impulse:

$$ u[n]=\sum_{m=-\infty}^n\delta[m]\rightarrow\begin{cases}n<0&\sum=0\\n\geq0&\sum=1\end{cases} $$

Or let $m=n-k$, we have

$$ u[n]=\sum_{k=0}^{\infty}\delta[n-k]. $$

**Sampling Property**

$$ \begin{align} &x[n]\delta[n]=x[0]\delta[n],\ \text{since }\delta[n]\text{ has value only when }n=0;\\ &\text{More Generally, }x[n]\delta[n-n_0]=x[n_0]\delta[n-n_0]. \end{align} $$

Unit step is defined as

$$ u(t)=\begin{cases}0&t<0\\1&t>0\end{cases} $$

As we can see, according to the definition, the function is discontinuous at $t=0$.

Still, the continuous unit step $u(t)$ is the **running integral** of unit impulse $\delta(t)$:

$$ u(t)=\int_{-\infty}^t\delta(\tau)\mathrm{d}\tau $$

And $\delta(t)$ is the first derivative of $u(t)$:

$$ \delta(t)=\frac{\mathrm{d}u(t)}{\mathrm{d}t} $$

- Since $u(t)$ is discontinuous at $t=0$, How can we get $\delta(t)$?

We can consider $u(t)$ as the limit of some approximating function $u_{\Delta}(t)$ whose value changes from $0$ to $1$ over a very short time $\Delta$.

$$ u(t)=\lim_{\Delta\rightarrow0}u_{\Delta}(t)\Rightarrow\delta_{\Delta}(t)=\frac{\mathrm{d}u_{\Delta}(t)}{\mathrm{d}t}\Rightarrow\delta(t)=\lim_{\Delta\rightarrow0}\delta_{\Delta}(t) $$

Also shown in the figure below:

The area of the impulse remains $1$, which is the change in value of $u_{\Delta}(t)$; when $\Delta\rightarrow0$, the area is concentrated at $t=0$. From this interpretation, $\delta(t)$ can informally be written as

$$ \delta(t)=\begin{cases}1&t=0\\0&t\neq0\end{cases} $$

which parallels $\delta[n]$ for discrete-time signals (with the understanding that what characterizes $\delta(t)$ is its unit *area* at $t=0$, not its value).

Or let $\sigma=t-\tau$, and we have

$$ u(t)=\int_0^{\infty}\delta(t-\sigma)\mathrm{d}\sigma. $$

**Sampling Property**

Similarly we have

$$ x_1(t)=x(t)\delta_{\Delta}(t)\approx x(0)\delta_{\Delta}(t) $$

The approximation holds since $\Delta$ is so small that we can treat $x(t)$ as the constant $x(0)$ over the interval; so we have the following result

$$ \begin{gather} x(t)\delta(t)=\lim_{\Delta\rightarrow0}x(t)\delta_{\Delta}(t)=x(0)\lim_{\Delta\rightarrow0}\delta_{\Delta}(t)=x(0)\delta(t) \\ \text{More Generally,}\ x(t)\delta(t-t_0)=x(t_0)\delta(t-t_0). \end{gather} $$

Here the property is actually very similar to that of discrete-time signals; however, the approach is different, since $\delta(t)$ has a different definition here, $\mathrm{i.e.}$ $\delta(t)=\displaystyle\frac{\mathrm{d}u(t)}{\mathrm{d}t}$.

For continuous-time systems, the Input & Output are continuous:

$$ x(t)\rightsquigarrow\text{System}\rightsquigarrow y(t) $$

And for discrete-time systems, the Input & Output are discrete:

$$ x[n]\rightsquigarrow\text{System}\rightsquigarrow y[n] $$

**RC Circuit**

For an RC circuit, we have

$$ \begin{cases} i(t)=\displaystyle\frac{v_s(t)-v_c(t)}{R}&\text{current in circuit}\\ i(t)=C\cdot\displaystyle\frac{\mathrm{d}v_c(t)}{\mathrm{d}t}&\text{current through }C \end{cases}\Rightarrow\frac{\mathrm{d}v_c(t)}{\mathrm{d}t}+\frac{1}{RC}v_c(t)=\frac{1}{RC}v_s(t). $$
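The differential equation can be integrated numerically; a forward-Euler sketch with hypothetical component values and a unit-step source, checking that $v_c$ settles at $v_s$ after many time constants:

```python
# Forward-Euler integration of dv_c/dt + v_c/(RC) = v_s/(RC) with a unit-step source.
R, C = 1.0e3, 1.0e-3        # hypothetical values: 1 kOhm, 1 mF  ->  RC = 1 s
dt, T = 1e-4, 10.0          # step size and total simulated time (10 time constants)
v_c = 0.0                   # capacitor initially discharged
for _ in range(int(T / dt)):
    v_s = 1.0               # unit-step input
    v_c += dt * (v_s - v_c) / (R * C)

# After 10 time constants the output has essentially settled at v_s = 1.
assert abs(v_c - 1.0) < 1e-3
```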

The latter equation is the relationship between input signal $v_s(t)$ and output signal $v_c(t)$.

**Balance in a bank account**

For balance in a bank account as an example of discrete-time system, we have

$$ y[n]=1.01y[n-1]+x[n] $$

where $y[n]$ is the balance at the end of the $n$th month, $x[n]$ is the net deposit during the $n$th month, and the monthly interest rate is $1\%$.

The relationship between input $x[n]$ and output $y[n]$ is

$$ y[n]-1.01y[n-1]=x[n]. $$
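The recursion is straightforward to iterate; a sketch (the function name and the deposit sequence are illustrative):

```python
def balance(deposits, rate=0.01, y0=0.0):
    """Iterate y[n] = (1 + rate) * y[n-1] + x[n], starting from balance y0."""
    y = y0
    history = []
    for x in deposits:
        y = (1 + rate) * y + x
        history.append(y)
    return history

# Deposit 100 each month at 1% monthly interest:
h = balance([100.0] * 3)
assert abs(h[0] - 100.0) < 1e-9
assert abs(h[1] - 201.0) < 1e-9      # 1.01 * 100 + 100
assert abs(h[2] - 303.01) < 1e-9     # 1.01 * 201 + 100
```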

**Interconnections of Systems**

**System without Memory** : Output is dependent **only** on the current input.

$$ \mathrm{e.g.}\ \begin{cases}y[n]=\big(2x[n]-x^2[n]\big)^2\\y(t)=Rx(t)\end{cases}\ ,\text{where the only variable is the current input }n/t. $$

**System with Memory** : Output is dependent on the current **and** previous inputs.

$$ \mathrm{e.g.}\ \begin{cases}y[n]=\displaystyle\sum_{k=-\infty}^nx[k]\\y[n]=x[n-1]\rightarrow\text{a delay unit that stores the input and outputs it at the next time step}\\y(t)=\displaystyle\frac{1}{C}\int_{-\infty}^tx(\tau)\mathrm{d}\tau\end{cases} $$

**Memory** : retaining or storing information about input values at times

Specifically, for *physical systems*, memory is associated with the storage of energy.

**Invertible** : Distinct inputs lead to distinct outputs. (injective; $f^{-1}$ exists)

For example, the Accumulator $\displaystyle y[n]=\sum_{k=-\infty}^n x[k]$ is invertible: the difference between two successive outputs recovers the input, $y[n]-y[n-1]=x[n]$.

**Noninvertible** : $f^{-1}$ does not exist, examples: $y[n]=0,\ y(t)=x^2(t).$

**Causal** : The output at any time depends only on the inputs at the **present time** and in the **past**. (But not on the future; in other words, the system has no ability to predict.)

**Stable** : Informally, small inputs lead to responses that do not diverge.

**Formally**, bounded input leads to bounded output.

Here a function $f(t)$ is bounded if there exists some finite $B$ such that $\left|f(t)\right|<B$ for all $t$.

Examples :

$$ \begin{align} 1.\quad y[n]&=\frac{1}{2M+1}\sum_{k=-M}^{+M}x[n-k]\text{ which is the Average on }n\in[-M,M]\\ &\Rightarrow\text{Bounded}\Rightarrow\text{Stable}; \\ 2.\quad y[n]&=\sum_{k=-\infty}^nu[k]=(n+1)u[n]\rightarrow\infty\Rightarrow\text{Unstable}. \end{align} $$

**Time Invariant** : A time shift in the input signal results in an identical time shift in the output signal.

$$ \begin{align} x[n]\rightsquigarrow y[n]&\Leftrightarrow x[n-n_0]\rightsquigarrow y[n-n_0];\\ x(t)\rightsquigarrow y(t)&\Leftrightarrow x(t-t_0)\rightsquigarrow y(t-t_0). \end{align} $$

Method for proving time invariance:

$$ \begin{cases} x_2(t)=x_1(t-t_0)\\y_2(t)=f\{x_2(t)\}\\y_2'(t)=y_1(t-t_0) \end{cases}\Rightarrow y_2(t)=y_2'(t)? $$

Examples :

$$ \begin{align} &1.\quad y(t)=\sin[x(t)]\\ &\text{If }x_2(t)=x_1(t-t_0),y_2(t)=\sin[x_1(t-t_0)]\\ &y_2'(t)=y_1(t-t_0)=\sin[x_1(t-t_0)]=y_2(t)\\ &\text{Time Invariant.}\\\\ &2.\quad y[n]=nx[n]\\ &\text{If }x_2[n]=x_1[n-n_0],y_2[n]=f\{x_2[n]\}=n\cdot x_1[n-n_0];\\ &y_2'[n]=y_1[n-n_0]=(n-n_0)x_1[n-n_0]\neq y_2[n].\\ &\text{Time Variant.}\\\\ &3.\quad y(t)=x(2t)\\ &\text{If }x_2(t)=x_1(t-t_0),y_2(t)=f\{x_2(t)\}=x_2(2t)=x_1(2t-t_0);\\ &y_2'(t)=y_1(t-t_0)=x_1(2t-2t_0)\neq y_2(t).\\ &\text{Time Variant.}\\ \end{align} $$
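The shift-then-transform vs. transform-then-shift test can be automated for finite-length discrete signals (zero-padding outside the support, which is adequate here since both example systems map $0$ to $0$):

```python
import numpy as np

def shift(x, n0):
    """x[n - n0] for a finite-length signal, zero-padded outside its support."""
    y = np.zeros_like(x)
    if n0 >= 0:
        y[n0:] = x[:len(x) - n0]
    else:
        y[:n0] = x[-n0:]
    return y

def is_time_invariant(system, x, n0=2):
    # Shift then transform vs. transform then shift:
    return np.allclose(system(shift(x, n0)), shift(system(x), n0))

n = np.arange(8)
x = np.array([0., 1., 2., 3., 0., 0., 0., 0.])

assert is_time_invariant(lambda x: np.sin(x), x)    # y = sin(x[n]): time invariant
assert not is_time_invariant(lambda x: n * x, x)    # y[n] = n*x[n]: time variant
```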

That is, transforming first ($y(t)=f\{x(t)\}$) and then shifting ($t\rightarrow t-t_0$) must give exactly the same result as shifting first and then transforming; otherwise the system is not time invariant.

A system is linear if and only if it has the superposition property, $\mathrm{i.e.}$ additivity and homogeneity.

$$ \begin{cases} x_1(t)\rightsquigarrow y_1(t)\\ x_2(t)\rightsquigarrow y_2(t) \end{cases}\Rightarrow ax_1(t)+bx_2(t)\rightsquigarrow ay_1(t)+by_2(t). $$

Method for proving linearity:

$$ \begin{cases} x_3(t)=ax_1(t)+bx_2(t)\\ y_3(t)=f\{x_3(t)\}\\ y_3'(t)=ay_1(t)+by_2(t) \end{cases}\Rightarrow y_3(t)=y_3'(t)? $$

Examples :

$$ \begin{align} &1.\quad y(t)=tx(t)\\ &\text{If }x_3(t)=ax_1(t)+bx_2(t),y_3(t)=f\{x_3(t)\}=t[ax_1(t)+bx_2(t)];\\ &y_3'(t)=ay_1(t)+by_2(t)=atx_1(t)+btx_2(t)=y_3(t).\\ &\text{Linear.}\\\\ &2.\quad y(t)=x^2(t)\\ &\text{If }x_3(t)=ax_1(t)+bx_2(t),y_3(t)=f\{x_3(t)\}=[ax_1(t)+bx_2(t)]^2;\\ &y_3'(t)=ay_1(t)+by_2(t)=ax_1^2(t)+bx_2^2(t)\neq y_3(t).\\ &\text{Not Linear.}\\\\ &3.\quad y[n]=\mathrm{Re}\{x[n]\}\\ &\text{If }x_3[n]=ax_1[n]+bx_2[n],y_3[n]=f\{x_3[n]\}=\mathrm{Re}\{ax_1[n]+bx_2[n]\};\\ &y_3'[n]=ay_1[n]+by_2[n]=a\cdot\mathrm{Re}\{x_1[n]\}+b\cdot\mathrm{Re}\{x_2[n]\}\\ &y_3[n]\neq y_3'[n]\text{ if }a\text{ and }b\text{ are Complex numbers.}\\ &\text{Not Linear.} \end{align} $$
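The same mechanical test works for linearity: compare $f\{ax_1+bx_2\}$ with $af\{x_1\}+bf\{x_2\}$ for arbitrary coefficients. A sketch covering examples 1 and 2:

```python
import numpy as np

def is_linear(system, x1, x2, a=2.0, b=-3.0):
    """Check superposition for one pair of inputs and coefficients."""
    lhs = system(a * x1 + b * x2)
    rhs = a * system(x1) + b * system(x2)
    return np.allclose(lhs, rhs)

t = np.linspace(0, 1, 50)
x1 = np.sin(2 * np.pi * t)
x2 = np.cos(6 * np.pi * t)

assert is_linear(lambda x: t * x, x1, x2)       # y(t) = t*x(t): linear
assert not is_linear(lambda x: x**2, x1, x2)    # y(t) = x^2(t): not linear
```

Note that a single passing pair does not prove linearity in general, but a single failing pair does disprove it.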

- Creation, manipulation, and storage of **geometric objects** (modelling) and their images (rendering).
- Display of those images on screens or hardcopy devices.
- The overall methodology depends heavily on the underlying sciences of geometry, optics, physics and perception.

Geometry, Modeling, Simulation/Animation, Image/Video, Rendering, Visualization, Interaction/VR, Fabrication (manufacturing, e.g. 3D printing), Sound Graphics, etc.

- Movie Industry, Game Industry;
- CAD (Computer Aided Design):

Used in many fields: Mechanical, Electronic, Architecture...

Drives the high end of the hardware market

Integration of computing and display resources

Reduced design cycles $\Rightarrow$ faster systems

- Metaverse
- Digital Twin
- Others: Medical Imaging and Scientific Visualization, Fabrication(3D Printing), Industrial application, Service Industry, Entertainment, etc.

**Signals**: functions containing information about the behavior or nature of some phenomenon.

**Systems**: respond to particular signals by producing other signals or some desired behavior.

**Application**: Communication systems, Medical imaging, Geophysics, Signal processing, Optical computing

Cartesian notation: $z=Re\{z\}+j\cdot Im\{z\}$

Polar notation: $z=\left|z\right|e^{j\theta}$, where $\displaystyle\theta=\arctan\frac{Im\{z\}}{Re\{z\}}$.

Complex conjugation (共轭): $z^{*}=Re\{z\}-j\cdot Im\{z\}$

Euler's Formula:

$$ e^{j\theta}=\cos\theta+j\sin\theta\ ;\quad\begin{cases}\cos\theta=\displaystyle\frac{e^{j\theta}+e^{-j\theta}}{2}\\\sin\theta=\displaystyle\frac{e^{j\theta}-e^{-j\theta}}{2}\end{cases}\ . $$

For some (possibly complex) number $z_0$, we have

$$ \sum_{n=0}^{\infty}z_0^n=\frac{1}{1-z_0}\quad\mathrm{iff.}\quad \left|z_0\right|<1. $$

For some (complex) number $a$, find zeros of $z^N-a=0$

We have that

$$ z^N=a=\left|a\right|e^{j\cdot(2k\pi+\theta)}\Rightarrow z_k=\left|a\right|^{\frac{1}{N}}\cdot e^{j\cdot\frac{2k\pi+\theta}{N}}\ \text{for }k=0,1,\cdots,N-1. $$
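The root formula is directly computable; a sketch using `cmath` (checking, e.g., the cube roots of $-8$):

```python
import cmath

def nth_roots(a, N):
    """All N solutions of z^N = a, via z_k = |a|^(1/N) * exp(j*(2*k*pi + theta)/N)."""
    r, theta = abs(a), cmath.phase(a)
    return [r ** (1 / N) * cmath.exp(1j * (2 * k * cmath.pi + theta) / N)
            for k in range(N)]

a = -8.0 + 0j
roots = nth_roots(a, 3)
for z in roots:
    assert abs(z ** 3 - a) < 1e-9                     # each root satisfies z^3 = -8
assert abs(roots[0] - (1 + 1j * 3 ** 0.5)) < 1e-9     # k = 0 gives 2*e^{j*pi/3} = 1 + j*sqrt(3)
```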
