Now that we have described the concept of dimension reduction and some applications of SVD and principal component analysis, we turn to the mathematics behind them. We start with projections. A projection is a linear algebra concept that helps us understand many of the mathematical operations we perform on high-dimensional data. For more details, you can review projections in a linear algebra book. Here we provide a quick review followed by some data analysis examples.
As a review, remember that a projection maps a point to the closest point in a subspace; that is, it minimizes the distance between the point and the subspace.
In the figure above, the point on top is pointing to a point in space. In this particular cartoon, the space is two dimensional, but we should be thinking abstractly. The space is represented by the Cartesian plane, and the line on which the little person stands is a subspace of points. The projection onto this subspace is the place in the subspace that is closest to the original point. Geometry tells us that we can find this closest point by dropping a perpendicular line (dotted line) from the point to the subspace. The little person is standing on the projection. The distance this person had to walk from the origin to the projected point is referred to as the coordinate.
For the explanation of projections, we will use the standard matrix algebra notation for points: $\vec{y} \in \mathbb{R}^N$ is a point in $N$-dimensional space and $L \subset \mathbb{R}^N$ is a smaller subspace.
Simple example with N=2
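If we let $\vec{y} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$, we can plot it like this (the point is labeled Y in the plot):

    library(rafalib)  # provides mypar()
    mypar(1,1)
    plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
    arrows(0,0,2,3,lwd=3)
    text(2,3," Y",pos=4,cex=3)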
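We can immediately define a coordinate system by projecting this vector onto the spaces defined by $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ (the x-axis) and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ (the y-axis). The projections of $\vec{y}$ onto the subspaces defined by these points are 2 and 3 respectively:

$$
\vec{y} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 3 \begin{pmatrix} 0 \\ 1 \end{pmatrix}
$$

We say that $2$ and $3$ are the coordinates and that $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ are the basis vectors.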
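As a small numerical illustration of coordinates and basis vectors, the sketch below reads off the coordinate of the point $(2,3)$ along each axis with a dot product and then rebuilds the point from those coordinates. The variable names (`e1`, `e2`, `coord1`, `coord2`) are introduced here for illustration only.

```r
# The point from the example and the two standard basis vectors
y  <- c(2, 3)
e1 <- c(1, 0)   # spans the x-axis
e2 <- c(0, 1)   # spans the y-axis

# Because e1 and e2 have length 1, the coordinate along each
# axis is just the dot product of y with that basis vector
coord1 <- sum(y * e1)
coord2 <- sum(y * e2)

# Rebuild y as coordinates times basis vectors
y_rebuilt <- coord1 * e1 + coord2 * e2

print(c(coord1, coord2))
print(all.equal(y_rebuilt, y))
```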
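Now let's define a new subspace. The red line in the plot below is the subset $L$ defined by points satisfying $c\vec{v}$ with $\vec{v} = \begin{pmatrix} 2 & 1 \end{pmatrix}^\top$. The projection of $\vec{y}$ onto $L$ is the closest point on $L$ to $\vec{y}$. So we need to find the $c$ that minimizes the distance between $\vec{y}$ and $c\vec{v} = (2c, c)^\top$. In linear algebra, we learn that the difference between these points is orthogonal to the space, so:

$$
(\vec{y} - \hat{c}\vec{v}) \cdot \vec{v} = 0
$$

This implies that:

$$
\vec{y} \cdot \vec{v} - \hat{c}\, \vec{v} \cdot \vec{v} = 0
$$

and:

$$
\hat{c} = \frac{\vec{y} \cdot \vec{v}}{\vec{v} \cdot \vec{v}}
$$

Here the dot $\cdot$ represents the dot product: $\vec{x} \cdot \vec{y} = x_1 y_1 + \dots + x_n y_n$.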
The following R code confirms this equation works:
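
    library(rafalib)  # provides mypar()
    mypar(1,1)
    plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
    arrows(0,0,2,3,lwd=3)
    abline(0,0.5,col="red",lwd=3) #if x=2c and y=c then slope is 0.5 (y=0.5x)
    text(2,3," Y",pos=4,cex=3)
    y=c(2,3)  # the point to project
    x=c(2,1)  # the vector that spans the red line
    cc = crossprod(x,y)/crossprod(x)  # projection coefficient
    # dashed line from the projection (x[1]*cc, x[2]*cc) to the point (y[1], y[2])
    segments(x[1]*cc,x[2]*cc,y[1],y[2],lty=2)
    text(x[1]*cc,x[2]*cc,expression(hat(Y)),pos=4,cex=3)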
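The same check can be done without a plot. The sketch below recomputes the projection coefficient for the point $(2,3)$ and the direction $(2,1)$, verifies that the residual is orthogonal to the line, and compares the closed-form coefficient with a numerical minimization of the distance. The helper `dist_to_line` and the search interval are assumptions made for this illustration.

```r
y <- c(2, 3)   # the point being projected
v <- c(2, 1)   # vector spanning the red line (called x in the plotting code)

c_hat <- sum(y * v) / sum(v * v)   # (y . v) / (v . v)
y_hat <- c_hat * v                 # projection of y onto the line
resid <- y - y_hat                 # difference between the point and its projection

# Distance from y to the point k * v on the line, as a function of k
dist_to_line <- function(k) sqrt(sum((y - k * v)^2))

# Numerical minimizer over an arbitrary interval that contains the answer
c_num <- optimize(dist_to_line, interval = c(-10, 10))$minimum

print(c_hat)            # closed-form coefficient
print(y_hat)            # projected point
print(sum(resid * v))   # dot product of residual and v, zero up to rounding
print(c_num)            # should agree with c_hat up to the optimizer's tolerance
```

Both routes should give a coefficient of about 1.4, that is, a projection near $(2.8, 1.4)$.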
Note that if $\vec{v}$ were scaled so that $\vec{v} \cdot \vec{v} = 1$, then $\hat{c}$ would simply be $\vec{y} \cdot \vec{v}$, and the space $L$ would not change. This simplification is one reason we like orthogonal matrices.
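To see this simplification numerically, the sketch below rescales the direction vector $(2,1)$ to length 1 and checks that the projected point is unchanged while the coefficient no longer needs a denominator. The name `u` for the rescaled vector is introduced here for illustration.

```r
y <- c(2, 3)
v <- c(2, 1)
u <- v / sqrt(sum(v * v))   # same direction as v, rescaled so that u . u = 1

c_v <- sum(y * v) / sum(v * v)   # coefficient with respect to v
c_u <- sum(y * u)                # coefficient with respect to u: no denominator

proj_v <- c_v * v
proj_u <- c_u * u

print(proj_v)
print(proj_u)
print(all.equal(proj_v, proj_u))   # same point on the same line
```

The coefficients `c_v` and `c_u` differ because they are measured against vectors of different lengths, but both describe the same projected point.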
Example: The sample mean is a projection
Let $\vec{y} \in \mathbb{R}^N$ and let $L \subset \mathbb{R}^N$ be the space spanned by:

$$
\vec{v} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}; \quad L = \{ c\vec{v};\; c \in \mathbb{R} \}
$$

In this space, all components of the vectors are the same number, so we can think of this space as representing the constants: in the projection each dimension will be the same value. So what $c$ minimizes the distance between $c\vec{v}$ and $\vec{y}$?
When talking about problems like this, we sometimes use two-dimensional figures such as the one above. We simply abstract and think of $\vec{y}$ as a point in $\mathbb{R}^N$ and $L$ as a subspace defined by a smaller number of values, in this case just one: $c$.
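Getting back to our question, we know that the projection is $\hat{c}\vec{v}$ with:

$$
\hat{c} = \frac{\vec{v} \cdot \vec{y}}{\vec{v} \cdot \vec{v}}
$$

which in this case is the average:

$$
\hat{c} = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N 1} = \bar{Y}
$$

Here, it also would have been just as easy to use calculus:

$$
\frac{\partial}{\partial c} \sum_{i=1}^N (Y_i - c)^2 = 0 \implies -2 \sum_{i=1}^N (Y_i - \hat{c}) = 0 \implies \hat{c} = \frac{1}{N} \sum_{i=1}^N Y_i
$$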
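A quick numerical check of this result: the sketch below projects a simulated data vector onto the vector of ones and compares the coefficient with `mean()`, and also with a numerical minimization of the sum of squares. The data are simulated with an arbitrary seed and sample size, chosen only for illustration.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
N <- 10          # arbitrary sample size
y <- rnorm(N)    # simulated data, purely illustrative
v <- rep(1, N)   # vector of ones spanning the constants

# Projection coefficient (v . y) / (v . v)
c_hat <- sum(v * y) / sum(v * v)

# The calculus route, done numerically: minimize the sum of squares in k
sum_sq <- function(k) sum((y - k)^2)
c_num <- optimize(sum_sq, interval = range(y))$minimum

print(c_hat)
print(mean(y))   # should match c_hat
print(c_num)     # should match up to the optimizer's tolerance
```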
Example: Regression is also a projection
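Let us give a slightly more complicated example. Simple linear regression can also be explained with projections. Our data $\mathbf{Y}$ (we are no longer going to use the $\vec{y}$ notation) is again an $N \times 1$ vector, and our model predicts $Y_i$ with a line $\beta_0 + \beta_1 X_i$. We want to find the $\beta_0$ and $\beta_1$ that minimize the distance between $\mathbf{Y}$ and the space defined by:

$$
L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1;\; \vec{\beta} = (\beta_0, \beta_1)^\top \in \mathbb{R}^2 \}
$$

with:

$$
\vec{v}_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad \mbox{and} \quad \vec{v}_1 = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{pmatrix}
$$

Our $N \times 2$ matrix $\mathbf{X}$ is $[\vec{v}_0 \;\; \vec{v}_1]$, and any point in $L$ can be written as $\mathbf{X}\vec{\beta}$.
The equation for the multidimensional version of orthogonal projection is:

$$
\mathbf{X}^\top (\mathbf{Y} - \mathbf{X}\hat{\beta}) = 0
$$

which we have seen before and gives us:

$$
\mathbf{X}^\top \mathbf{X} \hat{\beta} = \mathbf{X}^\top \mathbf{Y}
$$

$$
\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}
$$

And the projection to $L$ is therefore:

$$
\mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}
$$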
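The regression projection can be checked against R's built-in `lm` function. The sketch below builds the two-column design matrix (a column of ones and the predictor), solves the orthogonality equations directly, and compares the result with `lm`. The data are simulated; the seed, sample size, and true coefficients are arbitrary choices made for this illustration.

```r
set.seed(1)                           # arbitrary seed
N <- 20                               # arbitrary sample size
x <- runif(N)                         # simulated predictor
Y <- 1 + 2 * x + rnorm(N, sd = 0.5)   # simulated response, arbitrary coefficients

X <- cbind(1, x)                      # N x 2 matrix: column of ones and x

# Solve (X'X) beta = X'Y rather than forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, Y))
Y_hat <- X %*% beta_hat               # projection of Y onto the column space of X

fit <- lm(Y ~ x)
print(all.equal(as.vector(beta_hat), unname(coef(fit))))
print(all.equal(as.vector(Y_hat), unname(fitted(fit))))

# The residual should be orthogonal to both columns of X (zero up to rounding)
print(max(abs(crossprod(X, Y - Y_hat))))
```

Using `solve(A, b)` on the system is a design choice for numerical stability. `lm` itself relies on a QR decomposition, so agreement is expected only up to floating-point error.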