Vectors
It is always a bit tricky to decide whether to introduce vectors or matrices first when teaching linear algebra, as vectors can be seen as a special case of matrices, but matrices can also be seen as a collection of vectors. Most commonly, vectors are introduced first as they are easier to visualize and more intuitive.
So what is a vector? Depending on the field you are working in, a vector can mean many different things. In computer science, a vector is just a list of numbers, i.e. a 1D array or n-tuple such as [1, 2, 4].
The elements of a vector are usually referred to as components in the maths world.
In maths, a vector can be thought of as an arrow in space that has a starting and an ending point, also referred to as the tail and the head of the vector. Most commonly, vectors are denoted by a lowercase letter, either in bold or with an arrow above it, e.g. \(\vec{v}\) or \(\mathbf{v}\). If the vector is defined by two coordinate points, \(A=(2,1)\) and \(B=(4,5)\), then it is denoted as \(\vec{AB}\). This type of vector is called a position vector as it shows how to get to the position of \(B\) from the position of \(A\).
Usually the starting point of the vector is at the origin, \((0, 0)\) in 2D space, \((0, 0, 0)\) in 3D space, etc., making the head of the vector the point \((x, y)\) or \((x, y, z)\); the vector is then said to be in standard position. However, it is important to note that a vector is independent of its starting point: the vector is the same no matter where it starts, only its direction and length matter.

Vectors are easily visualized in 2D and 3D space, but can be extended to any number of dimensions. If we define that all vectors have the same starting point, they can be uniquely defined by their ending point, which is the same as giving their direction and length. The length of a vector is also called its magnitude. Vectors in maths can, however, also be seen as movements in space, so they do not need to be in a specific position but can be anywhere in space.
This is also in line with vectors in physics, where vectors are used to represent physical quantities that have both magnitude and direction such as velocity, force, acceleration, etc. For example, a force vector indicates both the magnitude of the force and the direction in which it is applied. Unlike position vectors, these vectors do not necessarily have a fixed starting point at the origin. Instead, they can be applied at any point in space, and their effects are determined by their magnitude and direction. The length of the vector represents the magnitude of the physical quantity, and the direction of the arrow indicates the direction in which the quantity acts.

If we have a vector \(\mathbf{v}\) in 2D space defined by the coordinates \((3, 4)\), then we can write it as:
\[\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]We can also define the so-called position vector \(\vec{AB}\) from the point \(A=(0, 0)\) to the point \(B=(3, 4)\) as:
\[\vec{AB} = \begin{bmatrix} 3 - 0 \\ 4 - 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]The vector representing the movement from \(A\) to \(B\) is the same as the vector \(\mathbf{v}\), i.e. \(\vec{AB} = \mathbf{v}\). The vector that represents the origin, i.e. the point with coordinates \((0, 0)\), is called the zero vector and is denoted as \(\mathbf{o}\):
\[\mathbf{o} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
Vector Addition
To add two vectors together, we simply add the corresponding components of the vectors together i.e. we just add element-wise. This also means that the two vectors must have the same number of components. This is equivalent to matrix addition. So we can add two vectors \(\mathbf{x} \in \mathbb{R}^n\) and \(\mathbf{y} \in \mathbb{R}^n\) as follows:
\[\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix} \]We can also visualize vector addition nicely in 2D and 3D space with position vectors. Geometrically, we can think of vector addition as chaining the two movements together: we first move along the first vector and then along the second vector. This amounts to moving the tail of the second vector to the head of the first vector. The resulting vector is the vector that starts at the tail of the first vector and ends at the head of the second vector. However, we can also change the order of the vectors along which we move, i.e. we can first move along the second vector and then along the first vector. This can be clearly seen in the image below where the two different orders form a parallelogram, showing that the order of vector addition does not matter, i.e. the addition of vectors is commutative, \(\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}\).
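As a quick check in code, here is a minimal NumPy sketch of component-wise addition and its commutativity (the example vectors are arbitrary values of my own choosing):

```python
import numpy as np

x = np.array([1, 2, 4])
y = np.array([3, 0, -1])

# Adding vectors adds the corresponding components.
print(x + y)                         # [4 2 3]

# Commutativity: the order of the addition does not matter.
print(np.array_equal(x + y, y + x))  # True
```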

This idea can also be extended to adding multiple vectors together. Vector addition is also associative, i.e. it does not matter how we group the additions; together with commutativity this means we can add multiple vectors together in any order:
\[\mathbf{a} + \mathbf{b} + \mathbf{c} = (\mathbf{a} + \mathbf{b}) + \mathbf{c} = \mathbf{a} + (\mathbf{b} + \mathbf{c}) \]Visually this results in lots of different routes that all lead to the same point, the result. The number of routes is equal to the number of ways we can order the vectors which for \(n\) vectors is \(n!\).

Scalar Multiplication
We can multiply a vector by a scalar, i.e. a number, by multiplying each component of the vector by the scalar. So if we have a vector \(\mathbf{x}\) and a scalar \(s\), then we can multiply them together just like when we multiply a matrix by a scalar:
\[s\mathbf{x} = s \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} sx_1 \\ sx_2 \\ \vdots \\ sx_n \end{bmatrix} \]We can also visualize scalar multiplication nicely in 2D and 3D space. Geometrically we can think of scalar multiplication as stretching or shrinking the vector by the scalar. This is why scalar multiplication is also called vector scaling and the number is called the scalar. If the scalar is negative, then the vector will be flipped around, i.e. it will point in the opposite direction.
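A small NumPy sketch of scaling a vector by different scalars (again with made-up example values):

```python
import numpy as np

x = np.array([3, 4])

print(2 * x)     # [6 8]      stretched to twice the length
print(0.5 * x)   # [1.5 2. ]  shrunk to half the length
print(-1 * x)    # [-3 -4]    flipped to the opposite direction
print(0 * x)     # [0 0]      collapsed to the zero vector
```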

If we multiply a vector with the scalar 0, then the resulting vector is the zero vector, i.e. the vector with all components equal to 0. This can be thought of as the vector being shrunk to a point at the origin or collapsing it to a single point.
\[0\mathbf{x} = 0 \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} 0 \cdot x_1 \\ 0 \cdot x_2 \\ \vdots \\ 0 \cdot x_n \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \mathbf{o} \]
Subtraction
Subtracting two vectors is the same as adding the first vector to the negative of the second vector, i.e. multiplying the second vector by \(-1\). So if we have two vectors we can subtract them as follows:
\[\mathbf{x} - \mathbf{y} = \mathbf{x} + (-\mathbf{y}) \]When visualizing the subtraction of two vectors, we can think of it as adding the negative of the second vector to the first vector, i.e. flipping the second vector and then moving its tail to the head of the first vector.

Another geometric interpretation of vector subtraction is that the resulting vector is the vector that points from the head of the second vector to the head of the first vector, after moving the tail of the second vector to the tail of the first vector. From this interpretation, we can clearly see that the subtraction of two vectors is not commutative, i.e. the order in which we subtract the vectors matters, as the resulting vector will point in the opposite direction, just like in normal subtraction, \(1 - 2 = -1\) and \(2 - 1 = 1\). If you think of \(\mathbf{b} - \mathbf{a}\) as the vector \(\mathbf{c}\), then you can also see visually that \(\mathbf{a} + \mathbf{c} = \mathbf{b}\), which after rearranging gives \(\mathbf{c} = \mathbf{b} - \mathbf{a}\).
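The relation \(\mathbf{a} + (\mathbf{b} - \mathbf{a}) = \mathbf{b}\) is easy to check numerically; a minimal sketch with made-up vectors:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([4, 3])

c = b - a                        # points from the head of a to the head of b
print(c)                         # [3 1]
print(np.array_equal(a + c, b))  # True: a + (b - a) = b
print(a - b)                     # [-3 -1], the opposite direction
```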

Linear Combinations
If we combine the concepts of vector addition and scalar multiplication, we get the concept of a linear combination. So if we have two vectors \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^m\) and two scalars \(s, t \in \mathbb{R}\), then we can define \(\mathbf{z}\) as the linear combination of the two vectors as follows:
\[\mathbf{z} = s\mathbf{x} + t\mathbf{y} = s\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} + t\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} sx_1 + ty_1 \\ sx_2 + ty_2 \\ \vdots \\ sx_m + ty_m \end{bmatrix} \]This idea can be extended to more than two vectors and scalars. So if we have a set of vectors \(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n \in \mathbb{R}^m\) and a set of scalars \(s_1, s_2, \dots, s_n \in \mathbb{R}\), then we can combine them as follows:
\[\mathbf{z} = s_1 \begin{bmatrix} v_{11} \\ v_{12} \\ \vdots \\ v_{1m} \end{bmatrix} + s_2 \begin{bmatrix} v_{21} \\ v_{22} \\ \vdots \\ v_{2m} \end{bmatrix} + \dots + s_n \begin{bmatrix} v_{n1} \\ v_{n2} \\ \vdots \\ v_{nm} \end{bmatrix} = s_1\mathbf{v}_1 + s_2\mathbf{v}_2 + \dots + s_n\mathbf{v}_n = \sum_{i=1}^n s_i\mathbf{v}_i \]The scalars \(s_1, s_2, \dots, s_n\) are also often called weights as they determine how much each vector contributes to the resulting vector. Linear combinations are literally the basis (you will get this joke later on) of linear algebra and are used to define many more complex concepts such as linear independence, vector spaces, and linear transformations.
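A linear combination is just a weighted sum, which in code amounts to stacking the vectors as columns of a matrix and multiplying by the vector of weights. A minimal NumPy sketch with arbitrary example vectors and weights:

```python
import numpy as np

# Columns of V are the vectors v_1, v_2, v_3; s holds the weights s_1, s_2, s_3.
V = np.array([[1.0, 3.0, 0.0],
              [0.0, 1.0, 2.0],
              [2.0, -1.0, 1.0]])
s = np.array([2.0, -1.0, 0.5])

# The combination s_1*v_1 + s_2*v_2 + s_3*v_3 is the matrix-vector product V @ s.
z = V @ s
print(z)  # [-1.   0.   5.5]
```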
If \(\mathbf{v}\) and \(\mathbf{w}\) are defined as:
\[\mathbf{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{w} = \begin{bmatrix} 3 \\ -1 \end{bmatrix} \]We can combine them as follows:
\[2\mathbf{v} + (-1)\mathbf{w} = 2\begin{bmatrix} 2 \\ 3 \end{bmatrix} + (-1)\begin{bmatrix} 3 \\ -1 \end{bmatrix} = \begin{bmatrix} 4 \\ 6 \end{bmatrix} + \begin{bmatrix} -3 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \]We can show that all vectors are linear combinations of other vectors. For example, we can create all vectors with two components \(\mathbf{u} \in \mathbb{R}^2\) by combining the following two vectors:
\[\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{and} \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \]This becomes pretty clear if we think of the two vectors as the x and y axes in 2D space. Any point in 2D space can be defined by its x and y coordinates, i.e. as a linear combination of the two vectors. These vectors are called the standard basis vectors and we will go into more detail about them later. However, these are not the only vectors that can be used to create all vectors in 2D space, we could also use the following two vectors:
\[\mathbf{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{w} = \begin{bmatrix} 3 \\ -1 \end{bmatrix} \]We can show that any vector \(\mathbf{u} \in \mathbb{R}^2\) can be created by combining the two vectors by setting up a system of equations and seeing if we can solve it.
\[\begin{align*} \mu\mathbf{v} + \lambda\mathbf{w} &= \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \\ \mu\begin{bmatrix} 2 \\ 3 \end{bmatrix} + \lambda\begin{bmatrix} 3 \\ -1 \end{bmatrix} &= \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \\ \begin{vmatrix} \mu \cdot 2 + \lambda \cdot 3 = u_1 \\ \mu \cdot 3 + \lambda \cdot -1 = u_2 \end{vmatrix} \end{align*} \]To solve this we can first eliminate \(\lambda\) from the first equation by multiplying the second equation by 3 and adding it to the first equation. We can then solve for \(\mu\) and substitute it back into the second equation to solve for \(\lambda\). This will give us formulas to calculate the scalars \(\mu\) and \(\lambda\) that create any vector \(\mathbf{u}\):
\[\begin{align*} \mu \cdot 2 + \lambda \cdot 3 &= u_1 \\ 3(\mu \cdot 3 + \lambda \cdot -1) + (\mu \cdot 2 + \lambda \cdot 3) &= 3u_2 + u_1 \\ \mu \cdot 2 + \lambda \cdot 3 + 9\mu - 3\lambda &= 3u_2 + u_1 \\ (2 + 9)\mu + (3 - 3)\lambda &= 3u_2 + u_1 \\ 11\mu &= 3u_2 + u_1 \\ \mu &= \frac{3u_2 + u_1}{11} \end{align*} \]Now we can substitute \(\mu\) back into the second equation to solve for \(\lambda\):
\[\begin{align*} \mu \cdot 3 + \lambda \cdot -1 &= u_2 \\ 3\left(\frac{3u_2 + u_1}{11}\right) + \lambda \cdot -1 &= u_2 \\ \frac{9u_2 + 3u_1}{11} - \lambda &= u_2 \\ \frac{9u_2 + 3u_1}{11} - u_2 &= \lambda \\ \lambda &= \frac{9u_2 + 3u_1 - 11u_2}{11} \\ \lambda &= \frac{-2u_2 + 3u_1}{11} \end{align*} \]So for example to construct the vector \(\mathbf{u} = \begin{bmatrix} 11 \\ 11 \end{bmatrix}\) we can use the following scalars:
\[\begin{align*} \mu = \frac{3 \cdot 11 + 11}{11} = \frac{33 + 11}{11} = \frac{44}{11} = 4 \\ \lambda = \frac{-2 \cdot 11 + 3 \cdot 11}{11} = \frac{-22 + 33}{11} = \frac{11}{11} = 1 \end{align*} \]So we can create the vector \(\mathbf{u}\) as follows:
\[\begin{align*} \mathbf{u} &= 4\mathbf{v} + 1\mathbf{w} \\ &= 4\begin{bmatrix} 2 \\ 3 \end{bmatrix} + 1\begin{bmatrix} 3 \\ -1 \end{bmatrix} \\ &= \begin{bmatrix} 8 \\ 12 \end{bmatrix} + \begin{bmatrix} 3 \\ -1 \end{bmatrix} \\ &= \begin{bmatrix} 8 + 3 \\ 12 - 1 \end{bmatrix} \\ &= \begin{bmatrix} 11 \\ 11 \end{bmatrix} \end{align*} \]We can also understand this geometrically, with the two vectors \(\mathbf{v}\) and \(\mathbf{w}\) acting as the axes of a skewed coordinate system in 2D space, so that any point in the plane can be reached by moving along the two vectors. We make shifted copies of both axes such that they cross at \(\mathbf{u}\). This is possible because we can create any vector in 2D space by combining the two vectors with the right scalars. This is called the column view/picture of the two vectors, where the two vectors define a plane in 2D space and any point in that plane can be reached by moving along the two vectors.
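We can also double-check the scalars \(\mu = 4\) and \(\lambda = 1\) numerically. A small sketch using NumPy's linear solver on the same vectors:

```python
import numpy as np

v = np.array([2.0, 3.0])
w = np.array([3.0, -1.0])
u = np.array([11.0, 11.0])

# Solve mu*v + lam*w = u by putting v and w into the columns of a matrix.
A = np.column_stack([v, w])
mu, lam = np.linalg.solve(A, u)

print(mu, lam)           # 4.0 1.0 (up to floating point), matching the formulas above
print(mu * v + lam * w)  # [11. 11.]
```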

Another way is the so-called row view/picture of the two vectors, where each of the two equations defines a line in the \(\mu\)-\(\lambda\) plane and the scalars are given by the point where the two lines intersect. This is the same problem as in the column view but seen in a different coordinate system, i.e. the scalars are the coordinates of the intersection point in the \(\mu\)-\(\lambda\) plane.

Rather than looking at two specific vectors, we can also look at the general case of two vectors \(\mathbf{v}\) and \(\mathbf{w}\) in 2D space and see if we can construct the vector \(\mathbf{u}\) with the scalars \(\mu\) and \(\lambda\). So we want to solve the following system of equations:
\[\begin{vmatrix} \mu \cdot v_1 + \lambda \cdot w_1 = u_1 \\ \mu \cdot v_2 + \lambda \cdot w_2 = u_2 \end{vmatrix} \]We assume that \(\mathbf{v} \neq \mathbf{o}\), since if \(\mathbf{v}\) were the zero vector the system would only be solvable when \(\mathbf{u}\) is a multiple of \(\mathbf{w}\), which would mean that we could only create vectors that lie on the line defined by \(\mathbf{w}\).
So let’s first solve the first equation for \(\mu\):
\[\mu = \frac{u_1 - \lambda \cdot w_1}{v_1} \]This is okay as long as \(v_1 \neq 0\) (otherwise we take the other equation and solve it for \(\mu\), as either \(v_1 \neq 0\) or \(v_2 \neq 0\)). Now we can substitute this into the second equation and solve for \(\lambda\):
\[\begin{align*} v_2\left(\frac{u_1 - \lambda \cdot w_1}{v_1}\right) + \lambda \cdot w_2 &= u_2 \\ \frac{v_2 \cdot u_1}{v_1} - \frac{v_2 \cdot \lambda \cdot w_1}{v_1} + \lambda \cdot w_2 &= u_2 \\ \frac{v_2 \cdot u_1}{v_1} + \lambda\left(w_2 - \frac{v_2 \cdot w_1}{v_1}\right) &= u_2 \\ \lambda\left(w_2 - \frac{v_2 \cdot w_1}{v_1}\right) &= u_2 - \frac{v_2 \cdot u_1}{v_1} \\ \lambda &= \frac{u_2 - \frac{v_2 \cdot u_1}{v_1}}{w_2 - \frac{v_2 \cdot w_1}{v_1}} \end{align*} \]However, notice that for this to hold the denominator must not be zero, i.e. \(w_2 - \frac{v_2 \cdot w_1}{v_1} \neq 0\). Only then can we find \(\lambda\) and \(\mu\) such that we can create the vector \(\mathbf{u}\) with the two vectors \(\mathbf{v}\) and \(\mathbf{w}\). Let’s see when the denominator is zero to build a condition for when the two vectors can create all vectors in 2D space. If we set the denominator to zero, we get:
\[w_2 - \frac{v_2 \cdot w_1}{v_1} = 0 \implies w_2 = \frac{v_2 \cdot w_1}{v_1} \text{ and } w_1 = \frac{v_1 \cdot w_2}{v_2} \]So this means that the ratios of the components of the two vectors must be equal, i.e. \(\frac{w_1}{v_1} = \frac{w_2}{v_2}\), which means that there exists some \(k\) such that \(w_1 = k \cdot v_1\) and \(w_2 = k \cdot v_2\). This means that the two vectors are linearly dependent, i.e. they are multiples of each other.
So in other words we can create all vectors in 2D space with two vectors \(\mathbf{v}\) and \(\mathbf{w}\) as long as they are not multiples of each other, i.e. they are not linearly dependent. This means that the two vectors are independent and span the whole space, so we can create any vector in 2D space by combining them with the right scalars.
So for example if we try and create every vector in 2D space with the following two vectors we notice that we cannot create all vectors in 2D space:
\[\mathbf{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{w} = \begin{bmatrix} 4 \\ 6 \end{bmatrix} \]This is because the second vector is just a multiple of the first vector, i.e. \(\mathbf{w} = 2\mathbf{v}\). This means that the two vectors are not independent and do not span the whole space. In fact, they only span a line in 2D space, so we can only create vectors that lie on that line. This is also why we cannot create all vectors in 2D space with just one vector, as it would only span a line in 2D space. For two vectors to span the entire space they need to form a basis for the space (this is discussed in more detail in other sections).
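The condition derived above (the denominator being non-zero) is equivalent, after multiplying through by \(v_1\), to \(v_1 w_2 - v_2 w_1 \neq 0\). A small sketch of checking it in code (the function name is just an illustrative choice):

```python
import numpy as np

def spans_the_plane(v, w):
    # The denominator condition from above, multiplied through by v_1:
    # v and w fail to span 2D space exactly when v1*w2 - v2*w1 == 0.
    return not np.isclose(v[0] * w[1] - v[1] * w[0], 0.0)

print(spans_the_plane(np.array([2.0, 3.0]), np.array([3.0, -1.0])))  # True
print(spans_the_plane(np.array([2.0, 3.0]), np.array([4.0, 6.0])))   # False, since w = 2v
```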
Special Combinations
Depending on the scalars we use in the combination, we can create some special types of combinations:
- Linear: We have already seen this case where the scalars can be any real number, so \(s_1, s_2, \dots, s_n \in \mathbb{R}\).
- Affine: This is a linear combination where the sum of the scalars is equal to one, i.e. \(\sum_{i=1}^n s_i = 1\). The affine combination is used to create a point that lies on a line or plane defined by the vectors. This is because the combination can be rewritten as follows: \(s_1\mathbf{v}_1 + s_2\mathbf{v}_2= s_1\mathbf{v}_1 + (1 - s_1)\mathbf{v}_2= \mathbf{v}_1 + (1 - s_1)(\mathbf{v}_2 - \mathbf{v}_1)\).
- Conic: This is a linear combination where the scalars are non-negative, i.e. \(s_1, s_2, \dots, s_n \geq 0\). The conic combination is used to create a point that lies inside the convex cone spanned by the vectors, i.e. anywhere we can reach by stretching the vectors by non-negative amounts and adding them.
- Convex: This is a mix between the affine and conic combinations, i.e. the scalars are non-negative and their sum is equal to one, i.e. \(\sum_{i=1}^n s_i = 1\) and \(s_1, s_2, \dots, s_n \geq 0\). The convex combination is used to create a point that lies inside the convex hull of the vectors, i.e. the smallest convex set that contains all the vectors; since it is also an affine combination, this point lies on the line or plane defined by the vectors (see the sketch after this list).
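As a small sketch, here is one way to classify a list of weights according to these definitions (the function name and the tolerance are illustrative choices of mine):

```python
import numpy as np

def combination_type(s, tol=1e-9):
    # Classify the weights of a combination as convex, affine, conic or just linear.
    s = np.asarray(s, dtype=float)
    affine = np.isclose(s.sum(), 1.0, atol=tol)   # weights sum to one
    conic = np.all(s >= -tol)                     # all weights non-negative
    if affine and conic:
        return "convex"
    if affine:
        return "affine"
    if conic:
        return "conic"
    return "linear"

print(combination_type([0.3, 0.7]))   # convex
print(combination_type([2.0, -1.0]))  # affine (sums to 1, but has a negative weight)
print(combination_type([2.0, 3.0]))   # conic
print(combination_type([2.0, -3.0]))  # linear
```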

Linear Independence
Two vectors are linearly independent if neither of them can be written as a linear combination of the other. In other words, two vectors are linearly independent if they are not scalar multiples of each other. It is however easier to define and check for linear dependence. The vectors \(\mathbf{a}\) and \(\mathbf{b}\) are linearly dependent if they are scalar multiples of each other, i.e. if there exists a scalar \(c\) such that:
\[\mathbf{a} = c\mathbf{b} \quad \text{for some } c \in \mathbb{R} \]this can also be written as:
\[\mathbf{a} - c\mathbf{b} = \mathbf{0} \]where \(\mathbf{0}\) is the zero vector. This means that the vectors \(\mathbf{a}\) and \(\mathbf{b}\) are linearly dependent if they are collinear, i.e. they lie on the same line. The two equations above can also be used to define linear independence, we just replace the equal sign with a not equal sign. So the vectors \(\mathbf{a}\) and \(\mathbf{b}\) are linearly independent if:
\[\mathbf{a} - c\mathbf{b} \neq \mathbf{0} \quad \text{for all } c \in \mathbb{R} \]
If \(\mathbf{a}\) and \(\mathbf{b}\) are defined as:
\[\mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{b} = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} \]then \(\mathbf{a}\) and \(\mathbf{b}\) are linearly dependent because:
\[\mathbf{b} = 2\mathbf{a} \]However, if \(\mathbf{a}\) and \(\mathbf{b}\) are defined as:
\[\mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad \text{and} \quad \mathbf{b} = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} \]then \(\mathbf{a}\) and \(\mathbf{b}\) are linearly independent because no scalar multiple of \(\mathbf{b}\) can be equal to \(\mathbf{a}\).
Let \(\mathbf{a}, \mathbf{b}, \mathbf{c}\) be three linearly independent vectors; then show that the following vectors are also linearly independent (a sketch of the argument follows the list):
- \(\mathbf{w}_1 = \mathbf{a} + \mathbf{b}\)
- \(\mathbf{w}_2 = -\mathbf{a} + \mathbf{b}\)
- \(\mathbf{w}_3 = \mathbf{a} + \mathbf{b} + \mathbf{c}\)
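One way to see this, using the criterion (formalized further below) that for independent vectors only the trivial combination gives the zero vector: assume some combination of the new vectors equals the zero vector and regroup it in terms of \(\mathbf{a}, \mathbf{b}, \mathbf{c}\):
\[\begin{align*} c_1\mathbf{w}_1 + c_2\mathbf{w}_2 + c_3\mathbf{w}_3 &= \mathbf{0} \\ c_1(\mathbf{a} + \mathbf{b}) + c_2(-\mathbf{a} + \mathbf{b}) + c_3(\mathbf{a} + \mathbf{b} + \mathbf{c}) &= \mathbf{0} \\ (c_1 - c_2 + c_3)\mathbf{a} + (c_1 + c_2 + c_3)\mathbf{b} + c_3\mathbf{c} &= \mathbf{0} \end{align*} \]Because \(\mathbf{a}, \mathbf{b}, \mathbf{c}\) are linearly independent, all three coefficients must be zero. The coefficient of \(\mathbf{c}\) gives \(c_3 = 0\), and then \(c_1 - c_2 = 0\) together with \(c_1 + c_2 = 0\) forces \(c_1 = c_2 = 0\). Since only the trivial combination gives the zero vector, \(\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3\) are linearly independent.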
This idea can then be extended to a whole set of vectors (we look at sets rather than sequences, since a sequence could contain the same vector twice, which would trivially be linearly dependent).
A set of vectors is linearly dependent if any of the following equivalent conditions holds:
- One of the vectors in the set can be written as a linear combination of the others.
- At least one of the vectors in the set can be written as a linear combination of the previous vectors.
- There is a non-trivial solution that makes a linear combination of the vectors equal to the zero vector.
More formally, a set of vectors \(\{\mathbf{v_1}, \mathbf{v_2}, \ldots, \mathbf{v_n}\}\) is linearly dependent if there exist scalars \(c_1, c_2, \ldots, c_n\), not all zero, such that:
\[c_1\mathbf{v_1} + c_2\mathbf{v_2} + \ldots + c_n\mathbf{v_n} = \mathbf{0} \]If the only solution to this equation is the trivial solution where all scalars are zero, then the set of vectors is linearly independent. From this it follows that any set containing the zero vector, even the set consisting of the zero vector on its own, is always linearly dependent, since multiplying the zero vector by any non-zero scalar already gives a non-trivial solution.
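In code, a common way to check this is to stack the vectors as columns of a matrix and compare its rank to the number of vectors; a minimal NumPy sketch:

```python
import numpy as np

def is_linearly_independent(vectors):
    # Stack the vectors as columns; they are independent exactly when the
    # rank of the resulting matrix equals the number of vectors.
    A = np.column_stack(vectors)
    return np.linalg.matrix_rank(A) == len(vectors)

a = np.array([1, 2, 3])
print(is_linearly_independent([a, np.array([2, 4, 6])]))  # False, the second vector is 2a
print(is_linearly_independent([a, np.array([2, 3, 4])]))  # True
print(is_linearly_independent([np.zeros(3)]))             # False, the zero vector on its own
```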
Dot Product
The dot product, also called the inner product, is the most common type of vector multiplication. It is defined just like matrix multiplication, which means the two vectors must have the same number of components, and for the matrix product to be defined the first vector is transposed. The dot product is often denoted as \(\mathbf{x} \cdot \mathbf{y}\), but sometimes also as \(\langle \mathbf{x}, \mathbf{y} \rangle\) rather than \(\mathbf{x}^T\mathbf{y}\) to avoid confusion with matrix multiplication or scalar multiplication. So if we have two vectors \(\mathbf{x} \in \mathbb{R}^{n \times 1}\) and \(\mathbf{y} \in \mathbb{R}^{n \times 1}\), then we can multiply them together as follows:
\[\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T\mathbf{y} = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = x_1y_1 + x_2y_2 + \dots + x_ny_n = \sum_{i=1}^n x_iy_i \]From the dimensions we can also clearly see that the dot product results in a scalar which is why it is also called the scalar product, not to be confused with scalar multiplication!
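A quick NumPy sketch of the dot product, computed in a few equivalent ways (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(np.dot(x, y))   # 32 = 1*4 + 2*5 + 3*6
print(x @ y)          # 32, the same thing written with the @ operator
print(np.sum(x * y))  # 32, the element-wise products summed up
```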
Unlike general matrix multiplication, the dot product is commutative: the order in which we multiply the two vectors does not matter (as long as the vector written first is the transposed one). This is because the dot product is the sum of the products of the corresponding components of the two vectors, and these pairs of components are the same in both cases.
\[\begin{align*} \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = x_1y_1 + x_2y_2 + \dots + x_ny_n\\ \begin{bmatrix} y_1 & y_2 & \dots & y_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = y_1x_1 + y_2x_2 + \dots + y_nx_n \end{align*} \]This can of course also be seen from the sum operator as the order of the summation does not matter:
\[\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^n x_iy_i = \sum_{i=1}^n y_ix_i = \mathbf{y} \cdot \mathbf{x} \]There are also other properties of the dot product that are important to know:
- Scalars can be factored out of the dot product: \((s\mathbf{x}) \cdot \mathbf{y} = \mathbf{x} \cdot (s\mathbf{y}) = s(\mathbf{x} \cdot \mathbf{y})\).
- The dot product is distributive over vector addition: \(\mathbf{x} \cdot (\mathbf{y} + \mathbf{z}) = \mathbf{x} \cdot \mathbf{y} + \mathbf{x} \cdot \mathbf{z}\).
To prove this we can use some rules from the sum operator:
\[\begin{align*} \mathbf{x} \cdot (\mathbf{y} + \mathbf{z}) &= \sum_{i=1}^m x_i(y_i + z_i) \\ &= \sum_{i=1}^m (x_iy_i + x_iz_i) \\ &= \sum_{i=1}^m x_iy_i + \sum_{i=1}^m x_iz_i \\ &= \mathbf{x} \cdot \mathbf{y} + \mathbf{x} \cdot \mathbf{z} \end{align*} \]We can also show that the dot product with itself is always positive or zero:
\[\mathbf{x} \cdot \mathbf{x} \geq 0 \quad \text{because} \quad \mathbf{x} \cdot \mathbf{x} = \sum_{i=1}^n x_i^2 \]
Norms
A norm is a function denoted by \(\|\cdot\|\) that maps vectors to real values and satisfies the following properties:
- It is positive definite, meaning it assigns a non-negative real number, i.e. a length or size, to each vector.
- If the norm of a vector is zero, then the vector is the zero vector, i.e. \(\|\mathbf{x}\| = 0 \iff \mathbf{x} = \mathbf{0}\).
- The norm of a vector scaled by a scalar is equal to the absolute value of the scalar times the norm of the vector, i.e. \(\|s\mathbf{x}\| = |s|\|\mathbf{x}\|\) where \(s \in \mathbb{R}\) and \(\mathbf{x} \in \mathbb{R}^n\).
- The triangle inequality holds which we will see later.
In simple terms the norm of a vector is the length of the vector. There are many different types of norms, but the most common ones are the \(L_1\) and \(L_2\) norms, also known as the Manhattan and Euclidean norms respectively. The \(L_p\) norm is a generalization of the \(L_1\) and \(L_2\) norms. We denote a vector’s norm by writing it in between two vertical bars, e.g. \(\|\mathbf{x}\|\), and the subscript denotes the type of norm, e.g. \(\|\mathbf{x}\|_1\) or \(\|\mathbf{x}\|_2\) etc. If the subscript is omitted, then the \(L_2\) norm is assumed.
Manhattan Norm
The Manhattan norm or \(L_1\) norm is defined as the sum of the absolute values of the vector's components. It is called the Manhattan norm because it can be thought of as the distance between two points when we can only move along the axes of a rectangular grid, like the streets of Manhattan or any other city with a grid-like street plan. As long as we only ever move towards our destination, the distance travelled along the roads of Manhattan between two points is always the same.
\[\|\mathbf{x}\|_1 = |x_1| + |x_2| + \dots + |x_n| = \sum_{i=1}^n |x_i| \]
If \(\mathbf{x}\) is defined as:
\[\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \]then the \(L_1\) norm of \(\mathbf{x}\) is:
\[\|\mathbf{x}\|_1 = |1| + |2| + |3| = 6 \]
Euclidean Norm
As the name suggests, the Euclidean norm or \(L_2\) norm measures distance in Euclidean space, i.e. the straight-line distance between the tail and the head of the vector. In the 2D case, the Euclidean norm is just the Pythagorean theorem, i.e. the length of the hypotenuse of a right-angled triangle.
\[\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} = \sqrt{\sum_{i=1}^n x_i^2} \]
From the definition above we can actually see that the Euclidean norm is the square root of the dot product of the vector with itself.
\[\|\mathbf{x}\|_2 = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle} \]This makes sense as we said that the norm is always a non-negative real number and the dot product of a vector with itself is always non-negative. So the square root of the dot product is also non-negative.
We say a vector is a unit vector if its norm is one, i.e. \(\|\mathbf{x}\|_2 = 1\). The set of all unit vectors visualized in 2D space therefore forms a circle with a radius of one around the origin, the so called unit circle.
P-Norm
The idea of the \(L_p\) norm is to generalize the \(L_1\) and \(L_2\) norms. The \(L_p\) norm is defined as:
\[\|\mathbf{x}\|_p = \left(|x_1|^p + |x_2|^p + \dots + |x_n|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}} \]An arbitrary norm is rarely used in practice, most commonly the \(L_1\) and \(L_2\) norms are used. For some use-cases the \(L_\infty\) norm is used, which is defined as:
\[\|\mathbf{x}\|_\infty = \max_i |x_i| \]In other words, the \(L_\infty\) norm is the largest absolute value among the vector's components. We can visualize what these different norms look like in 2D space by looking at their "unit circles", i.e. the set of all vectors with a norm of one. The \(L_1\) norm forms a diamond shape, the \(L_2\) norm forms a circle and the \(L_\infty\) norm forms a square.
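A small NumPy sketch comparing the different norms of the same vector (np.linalg.norm takes the order of the norm as its second argument):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

print(np.linalg.norm(x, 1))       # 6.0, the Manhattan norm
print(np.linalg.norm(x, 2))       # 3.741..., the Euclidean norm sqrt(14)
print(np.linalg.norm(x, 4))       # 3.146..., the L_4 norm
print(np.linalg.norm(x, np.inf))  # 3.0, the largest absolute component
print(np.sqrt(x @ x))             # 3.741..., the dot product route to the L_2 norm
```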

In machine learning we use these norms for regularization, i.e. we limit the size of the weights to prevent overfitting. Commonly the \(L_1\) and \(L_2\) norms are used for this purpose: \(L_1\) regularization is also called Lasso regularization and \(L_2\) regularization is called Ridge regularization.
If \(\mathbf{x}\) is defined as:
\[\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \]then the \(L_4\) norm of \(\mathbf{x}\) is:
\[\|\mathbf{x}\|_4 = \left(|1|^4 + |2|^4 + |3|^4\right)^{\frac{1}{4}} = \left(1 + 16 + 81\right)^{\frac{1}{4}} = 98^{\frac{1}{4}} \approx 3.15 \]and the \(L_\infty\) norm of \(\mathbf{x}\) is:
\[\|\mathbf{x}\|_\infty = \max_i |x_i| = \max\{1, 2, 3\} = 3 \]
Normalization
Normalizing means to bring something into some sort of normal or standard state. In the case of vectors, normalizing means scaling the vector in such a way that its length is equal to one. Often we denote a normalized vector by adding a hat to it, e.g. \(\hat{\mathbf{x}}\) is the normalized vector of \(\mathbf{x}\). So we can say that if \(\|\mathbf{x}\| = 1\), then \(\mathbf{x}\) is normalized. From this definition we can see that to normalize a vector, we simply divide the vector by its length, i.e. we divide the vector by a scalar. So if we have a vector \(\mathbf{x}\), then we can normalize it as follows:
\[\hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2} \]This normalized vector has the same direction as the original vector, but its length is equal to one. By removing the length information, we can identify a vector purely by its direction. This is useful because we can now compare vectors based on their direction, without having to worry about their length. All these normalized vectors are also called unit vectors and if they are placed at the origin in 2D they span the unit circle.
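A minimal sketch of normalizing a vector and checking that the result has length one:

```python
import numpy as np

x = np.array([3.0, 4.0])
x_hat = x / np.linalg.norm(x)

print(x_hat)                  # [0.6 0.8], same direction as x
print(np.linalg.norm(x_hat))  # 1.0
```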

Angle between Vectors
The question now is: What does the dot product actually measure? It turns out that the dot product is not just a formula—it encodes geometric information about how two vectors point relative to each other. In particular, the dot product is a measure of the similarity of the directions of two vectors. To really understand this, let’s derive the relationship between the dot product and the angle between two vectors.
The angle between two vectors does not change if we rescale them or rotate both of them simultaneously, so we can restrict ourselves to the case where \(\mathbf{v} = \mathbf{e}_1\) and the other vector \(\mathbf{w}\) is some other unit vector at an angle \(\theta\) to the first vector.
By construction both vectors have length 1, so \(\|\mathbf{v}\| = \|\mathbf{w}\| = 1\). From trigonometry, and as shown in the image below, the components of \(\mathbf{w}\) are \(\cos(\theta)\) and \(\sin(\theta)\) (for an obtuse angle, \(\cos(\theta)\) is simply negative). In both cases, the red triangle with hypotenuse \(\|\mathbf{v} - \mathbf{w}\|\) has the legs \(\sin(\theta)\) and \(1 - \cos(\theta)\). By the Pythagorean theorem, we can calculate the length of the hypotenuse as follows:
\[\begin{align*} \|\mathbf{v} - \mathbf{w}\| &= \sqrt{\sin^2(\theta) + (1 - \cos(\theta))^2} \\ \|\mathbf{v} - \mathbf{w}\|^2 &= \sin^2(\theta) + (1 - \cos(\theta))^2 \\ \|\mathbf{v} - \mathbf{w}\|^2&= \sin^2(\theta) + (1 - 2\cos(\theta) + \cos^2(\theta)) \\ \|\mathbf{v} - \mathbf{w}\|^2&= 1 - 2\cos(\theta) + \sin^2(\theta) + \cos^2(\theta) \\ \|\mathbf{v} - \mathbf{w}\|^2&= 1 - 2\cos(\theta) + 1 \\ \|\mathbf{v} - \mathbf{w}\|^2&= 2 - 2\cos(\theta) \\ (\mathbf{v} - \mathbf{w}) \cdot (\mathbf{v} - \mathbf{w}) &= 2 - 2\cos(\theta) \\ \mathbf{v} \cdot \mathbf{v} - 2\mathbf{v} \cdot \mathbf{w} + \mathbf{w} \cdot \mathbf{w} &= 2 - 2\cos(\theta) \\ 2 - 2\mathbf{v} \cdot \mathbf{w} &= 2 - 2\cos(\theta) \\ \mathbf{v} \cdot \mathbf{w} &= \cos(\theta) \end{align*} \]So the dot product of two unit vectors \(\mathbf{v}\) and \(\mathbf{w}\) if we place the tails at the same point is equal to the cosine of the angle between them.

This means that the dot product can be used to calculate the angle between any two vectors. Because the angle does not depend on the lengths of the vectors, we can simply normalize them to unit vectors and then compute the angle between them. And because a vector only describes a direction and a length, the tails do not need to sit at the same point; we can just imagine both vectors placed at the origin. This gives us the following formula for two vectors \(\mathbf{x}\) and \(\mathbf{y}\):
\[\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \]rewriting the equation above gives us the following alternative form of the dot product:
\[\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta) \]where \(\theta\) is the angle between the two vectors. We can also calculate the angle between the two vectors by rewriting the equation above as follows:
\[\theta = \cos^{-1}\left(\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}\right) \]Where \(\cos^{-1}\) is the inverse cosine function, also called the arccosine function \(\arccos\).
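Putting the formula into code, a small sketch of an angle_between helper (the function name, the example vectors and the clipping against floating point noise are my own choices):

```python
import numpy as np

def angle_between(x, y):
    # cos(theta) = (x . y) / (||x|| * ||y||); the clip only guards against
    # values falling just outside [-1, 1] due to floating point rounding.
    cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(angle_between(np.array([1.0, 0.0]), np.array([1.0, 1.0])))            # 45.0
print(angle_between(np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])))  # about 12.9
```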
If \(\mathbf{x}\) and \(\mathbf{y}\) are defined as:
\[\mathbf{x} = \begin{bmatrix} 3 \\ -2 \end{bmatrix} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \]then the angle between \(\mathbf{x}\) and \(\mathbf{y}\) is:
\[\begin{align*} \theta &= \cos^{-1}\left(\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}\right) \\ &= \cos^{-1}\left(\frac{3 \cdot 1 + (-2) \cdot 7}{\sqrt{3^2 + (-2)^2} \sqrt{1^2 + 7^2}}\right) \\ &= \cos^{-1}\left(\frac{3 - 14}{\sqrt{9 + 4} \sqrt{1 + 49}}\right) \\ &= \cos^{-1}\left(\frac{-11}{\sqrt{13} \sqrt{50}}\right) \\ &\approx 115.6^\circ \end{align*} \]
Orthogonal Vectors
We call two vectors orthogonal if the angle between them is 90 degrees, i.e. they are perpendicular to each other. If two vectors are orthogonal, then their dot product is zero, because \(\cos(90^\circ) = 0\). So if we have two vectors \(\mathbf{x}\) and \(\mathbf{y}\), then we can check whether they are orthogonal as follows:
\[\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos(90^\circ) = 0 \]For example:
\[\begin{bmatrix} 4 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} -1 \\ 2 \end{bmatrix} = 4 \cdot (-1) + 2 \cdot 2 = -4 + 4 = 0 \]
Cauchy-Schwarz Inequality
The Cauchy-Schwarz inequality states that the absolute value of the dot product of two vectors is always less than or equal to the product of the two vectors' Euclidean norms.
\[|\mathbf{x} \cdot \mathbf{y}| \leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \]There is no single obvious geometric interpretation of this inequality, but it is a very useful one, especially for establishing bounds.
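As a small numerical sanity check, we can draw a few random vector pairs and verify that the inequality holds for each of them (the seed and the dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw a few random vector pairs and check |x . y| <= ||x|| * ||y|| for each.
for _ in range(5):
    x = rng.normal(size=4)
    y = rng.normal(size=4)
    lhs = abs(np.dot(x, y))
    rhs = np.linalg.norm(x) * np.linalg.norm(y)
    print(lhs <= rhs)  # True every time
```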
We want to prove that for any two vectors \(\mathbf{x}\) and \(\mathbf{y}\), the inequality holds.
Case 1: When one of the vectors is the zero vector
If one of the vectors is the zero vector, then the inequality holds because the dot product is zero and the product of the norms is also zero. So then the inequality becomes:
\[0 \leq 0 \]Case 2: If both vectors are unit vectors
If both vectors are unit vectors, then the inequality becomes the following:
\[|\mathbf{x} \cdot \mathbf{y}| \leq 1 \]We can then rewrite the dot product as the cosine of the angle between the two vectors; because the norms are one, this simplifies to:
\[\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos(\theta) = \cos(\theta) \]The cosine of the angle between two vectors is always between -1 and 1, and since the inequality takes the absolute value of the dot product, it holds:
\[|\mathbf{x} \cdot \mathbf{y}| = |\cos(\theta)| \leq 1 \]Case 3: Any two vectors
If the vectors are not unit vectors, then we can scale them to be unit vectors. We do not need to worry about dividing by zero, as we have already shown that the inequality holds trivially if either vector is the zero vector.
\[\mathbf{u} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2} \quad \text{and} \quad \mathbf{v} = \frac{\mathbf{y}}{\|\mathbf{y}\|_2} \]From above we know that \(|\mathbf{u} \cdot \mathbf{v}| \leq 1\), so we can write the following:
\[\begin{align*} \mathbf{x} \cdot \mathbf{y} &= \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 (\mathbf{u} \cdot \mathbf{v}) \\ |\mathbf{x} \cdot \mathbf{y}| &= \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 |\mathbf{u} \cdot \mathbf{v}| \\ |\mathbf{x} \cdot \mathbf{y}| &\leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \end{align*} \]Using the Cauchy-Schwarz inequality we can also show that the cosine of the angle is always between -1 and 1. We know that the dot product is equal to the cosine of the angle between the two vectors times the product of the norms of the two vectors:
\[\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2} \]combining this with the Cauchy-Schwarz inequality gives us:
\[\begin{align*} |\mathbf{x} \cdot \mathbf{y}| &\leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \\ \left|\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}\right| &\leq 1 \\ |\cos(\theta)| &\leq 1 \\ -1 &\leq \cos(\theta) \leq 1 \end{align*} \]
Triangle Inequality
The triangle inequality states that the norm of the sum of two vectors is less than or equal to the sum of the norms of the two vectors.
\[\|\mathbf{x} + \mathbf{y}\|_2 \leq \|\mathbf{x}\|_2 + \|\mathbf{y}\|_2 \]This can also be seen visually in the 2D case, where the direct path from one point to another is never longer than a path that goes through a third point. Put differently, any side of a triangle is never longer than the sum of the other two sides.
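A quick numerical check of the triangle inequality with two made-up vectors:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-1.0, 2.0])

print(np.linalg.norm(x + y))                  # 6.32..., the norm of the sum
print(np.linalg.norm(x) + np.linalg.norm(y))  # 7.23..., the sum of the norms
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))  # True
```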

Because both sides of the inequality are positive, we can look at the squares to make the proof easier.
\[\begin{align*} \|\mathbf{x} + \mathbf{y}\|_2^2 &= (\mathbf{x} + \mathbf{y}) \cdot (\mathbf{x} + \mathbf{y}) \\ &= \mathbf{x} \cdot \mathbf{x} + \mathbf{x} \cdot \mathbf{y} + \mathbf{y} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} \\ &= \|\mathbf{x}\|_2^2 + 2\mathbf{x} \cdot \mathbf{y} + \|\mathbf{y}\|_2^2 \end{align*} \]Now we can use the Cauchy-Schwarz inequality on the middle term and get:
\[2\mathbf{x} \cdot \mathbf{y} \leq 2\|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \]So we can rewrite the norm of the sum of two vectors squared and take the square root to get the triangle inequality.
\[\begin{align*} \|\mathbf{x} + \mathbf{y}\|_2^2 &\leq \|\mathbf{x}\|_2^2 + 2\|\mathbf{x}\|_2 \|\mathbf{y}\|_2 + \|\mathbf{y}\|_2^2 \\ &= (\|\mathbf{x}\|_2 + \|\mathbf{y}\|_2)^2 \\ \|\mathbf{x} + \mathbf{y}\|_2 &\leq \|\mathbf{x}\|_2 + \|\mathbf{y}\|_2 \end{align*} \]We can also prove the Cauchy-Schwarz inequality using the triangle inequality.
Orthonormal Vectors
We can now combine the idea of orthogonal vectors and normalized vectors to get orthonormal vectors. Orthonormal vectors are vectors that are orthogonal to each other and have a length of one.

Standard Unit Vectors
An example of orthonormal vectors are the standard unit vectors. The standard unit vectors can be thought of as the vectors that correspond to the axes of a coordinate system. Later on we will see that these vectors can be used to span any vector space and form the standard basis of the vector space. The standard unit vectors are denoted as \(\mathbf{e}_i\), where \(i\) is the index of the component that is one, while all other components are zero. The dimensionality of the vector is inferred from the context, so depending on the space we are working in, \(\mathbf{e}_1\) might be a 2D vector, a 3D vector, and so on.
\[\mathbf{e}_i = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \]It is quite easy to see that the standard unit vectors are orthonormal, because they are orthogonal to each other and have a length of one. It is also easy to see that any vector can be written as a linear combination of the standard unit vectors, which is why they are so useful and will become an important concept later on when discussing vector spaces and bases.
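In code, the standard unit vectors are simply the rows (or columns) of the identity matrix; a small sketch:

```python
import numpy as np

# The rows of the identity matrix are the standard unit vectors e_1, e_2, e_3.
e1, e2, e3 = np.eye(3)

print(np.dot(e1, e2))      # 0.0, they are orthogonal to each other
print(np.linalg.norm(e1))  # 1.0, they have length one

# Any vector is a linear combination of them, with its components as the weights.
x = np.array([2.0, -1.0, 3.0])
print(np.array_equal(x, 2 * e1 - 1 * e2 + 3 * e3))  # True
```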
