Shortfalls of Perceptron
Although Perceptron can learn most of the 16 binary logical operations, it cannot learn two of them: XOR and IF-AND-ONLY-IF. The total input activation of the output node in Perceptron is computed as $y=w_1x_1+w_2x_2$, so the decision boundary $w_1x_1+w_2x_2=\theta$ is a line in a two-dimensional space. The logical judgment thus becomes a geometric problem. As shown in the figure below, we can find a linear boundary that separates the two sets of outcomes for AND and for OR. However, no linear boundary exists for the XOR operation. Therefore, Perceptron cannot solve a problem that is not linearly separable.
from IPython.display import Image
Image(filename="and_or.png", width=400, height=400)
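As a quick sanity check on this claim, the short sketch below (not from the original text; the candidate weight grid and the helper function separable are illustrative choices) brute-force searches for a line $w_1x_1+w_2x_2>\theta$ that reproduces each truth table. It finds one for AND and OR but none for XOR.
import itertools
import numpy as np
# Truth tables: inputs and targets for AND, OR, and XOR
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
targets = {'AND': np.array([1, 0, 0, 0]),
           'OR':  np.array([1, 1, 1, 0]),
           'XOR': np.array([0, 1, 1, 0])}
def separable(X, T, grid=np.linspace(-2, 2, 21)):
    # Search over a grid of weights w1, w2 and threshold theta for a rule
    # "output 1 if w1*x1 + w2*x2 > theta" that reproduces T exactly
    for w1, w2, theta in itertools.product(grid, grid, grid):
        if np.array_equal((X.dot([w1, w2]) > theta).astype(int), T):
            return True
    return False
for name, T in targets.items():
    print(name, 'linearly separable:', separable(X, T))
# Expected output: AND True, OR True, XOR False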
Researchers have attempted several ways to make Perceptron able to solve the XOR problem. One solution is to transform the stimulus space into a new vector space in which a linear boundary can separate the two sets of XOR outcomes. That is, we map the stimuli from the original space to a higher-dimensional space.
from IPython.display import Image
Image(filename="xor.png", width=400, height=400)
For instance, in this new space, we can add one more dimension to represent the product of the two feature values, so each stimulus becomes $(x_1, x_2, x_1x_2)$.
\begin{bmatrix} 1 & 1 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0\\ \end{bmatrix}
Thus, in this 3-D space, the XOR operation becomes linearly separable. See the script below. We set up a Perceptron with iteration = 30 and learning rate = 0.01; the third column of the stimulus matrix represents the product of the two features of the original space, and the fourth column is a constant bias term. However, this is not a general solution: the number of such product features grows rapidly as the number of original features increases. Thus, we need a more general solution.
import numpy as np
from numpy import random
A=np.array([[1,1,1,1],
            [1,0,0,1],
            [0,1,0,1],
            [0,0,0,1]])
T=np.array([0,1,1,0])
W=np.zeros(4)
iter=30
Error=np.zeros(iter)
# Alternative: random initialization of the 4 weights
#W=random.uniform(size=4)
#W=W-0.5
def perceptron(A,T,W,eta,iter):
    # Variable declaring
    H=np.zeros(len(A))
    for i in range(0,iter):
        # Feed forward: output activation for all input stimuli
        Act=A.dot(W.T)
        # Threshold the activations to obtain binary outputs
        for j in range(0,len(Act)):
            if(Act[j]<=0):
                H[j]=0
            else:
                H[j]=1
        # Compute error
        E=T-H
        Error[i]=sum(np.square(E))/2
        # Weight updating (delta rule)
        deltaW=eta*(E.dot(A))
        W=W+deltaW
    return Act,H,Error,W
perceptron(A,T,W,0.01,iter)
(array([-0.02, 0.01, 0.01, 0. ]), array([0., 1., 1., 0.]), array([1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 0.5, 0.5, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]), array([ 0.01, 0.01, -0.04, 0. ]))
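The learned weights produce the correct outputs H = [0, 1, 1, 0]. For intuition about why the augmented space works, here is a hand-picked separating plane (the weight vector below is our own illustrative choice, not the one learned above): output 1 whenever $x_1+x_2-2x_1x_2-0.5>0$.
import numpy as np
# Augmented XOR stimuli: columns are x1, x2, x1*x2, bias
A = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 1]])
T = np.array([0, 1, 1, 0])
# Hand-picked weights for x1, x2, x1*x2, and the bias term
w = np.array([1, 1, -2, -0.5])
pred = (A.dot(w) > 0).astype(int)
print(pred)                        # [0 1 1 0]
print(np.array_equal(pred, T))     # True: a plane separates the augmented points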
Researchers proposed that a multilayered neural network can solve the XOR problem. The most basic form of the multilayered neural network simply adds one hidden layer between the input and output layers. Thus, we have two weight matrices to adjust: $W$ between the input and hidden layers and $Z$ between the hidden and output layers. However, if the activation of the hidden nodes is just a linear function of the input activation, a multilayered neural network functionally reduces to a two-layered neural network. Suppose the activation of the hidden nodes is the product of the weight matrix and the input activation, $A^{hid}=WA^{in}$, and the hidden activations are then linearly transformed, for example multiplied by a constant $c$. The activation of the output nodes is $A^{out}=ZH^{hid}=cZA^{hid}=cZWA^{in}$. Since $Z$ and $W$ are matrices and the product $cZW$ is just another matrix, say $M$, we get $A^{out}=MA^{in}$: the network collapses back to a two-layered Perceptron, as the short sketch below demonstrates.
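The following minimal numerical sketch (not part of the original model code; the matrix sizes and the constant c are arbitrary illustrative choices) shows that a linear hidden layer can always be absorbed into a single weight matrix $M=cZW$.
import numpy as np
from numpy import random
# Arbitrary input vector and random weight matrices (sizes chosen for illustration)
A_in = random.uniform(size=2)      # input activation
W = random.uniform(size=(3, 2))    # input -> hidden weights
Z = random.uniform(size=(1, 3))    # hidden -> output weights
c = 2.0                            # any linear "transformation" of the hidden layer
# Two-step computation with a linear hidden layer
A_hid = W.dot(A_in)
H_hid = c * A_hid
A_out = Z.dot(H_hid)
# One-step computation with the collapsed matrix M = c*Z*W
M = c * Z.dot(W)
print(np.allclose(A_out, M.dot(A_in)))   # True: the linear hidden layer adds nothing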
Obviously, a linear transformation in the hidden layer gains us nothing. Therefore, a nonlinear transformation function is needed for the hidden nodes, such as the sigmoid function
\begin{equation} y=\frac{1}{1+e^{-x}}, \end{equation} where $-\infty < x < \infty$ and $0 < y < 1$ with $y=0.5$ as the middle point. For a multilayered neural network to work, at least one of the hidden and output layers should apply a nonlinear transformation. Before we start composing our first multilayered neural network, we need to do some calculus.
Backpropagation algorithm
Suppose that $A^{in}$ is the input activation vector, $W$ is the weight matrix between the input and hidden layers, and $Z$ is the weight matrix between the hidden and output layers. The hidden nodes use the sigmoid transformation for their activations.
\begin{align*} A^{hid}_j&=\sum_i W_{ji} A^{in}_i\\ H^{hid}_j&=sig(A^{hid}_j)\\ A^{out}_k&=\sum_j Z_{kj} H^{hid}_j. \end{align*}
The error function for all output nodes is computed as
\begin{equation} E=\frac{1}{2}\sum_k(T_k-A^{out}_k)^2. \end{equation}
We want to know how to adjust the weights in order to make the whole system learn the targets. Thus, we take partial derivatives with respect to the weights $W$ and $Z$. Since $\Delta Z \propto -\frac{\partial E}{\partial Z}$, we set $\Delta Z = -\eta \frac{\partial E}{\partial Z}$. What is $\frac{\partial E}{\partial Z}$? We apply the chain rule to find it.
\begin{align*} \frac{\partial E}{\partial Z_{kj}}&=\frac{\partial E}{\partial A^{out}_k}\frac{\partial A^{out}_k}{\partial Z_{kj}}\\ &=-(T_k-A^{out}_k)H^{hid}_j. \end{align*} Thus,
\begin{equation} \Delta Z_{kj}=\eta (T_k-A^{out}_k)H^{hid}_j. \end{equation}
This is exactly the Widrow-Hoff learning rule. Similarly, the adjustment for $W$ satisfies $\Delta W\propto - \frac{\partial E}{\partial W}$. We apply the chain rule to find $\frac{\partial E}{\partial W_{ji}}$. That is,
\begin{align*} \frac{\partial E}{\partial W_{ji}}=\sum_k\frac{\partial E}{\partial A^{out}_k}\frac{\partial A^{out}_k}{\partial H^{hid}_j}\frac{\partial H^{hid}_j}{\partial A^{hid}_j}\frac{\partial A^{hid}_j}{\partial W_{ji}}. \end{align*}
Let's separate this equation. First,
\begin{align*} \sum_k\frac{\partial E}{\partial A^{out}_k}\frac{\partial A^{out}_k}{\partial H^{hid}_j}&=-\sum_k (T_k-A^{out}_k)Z_{kj}.\\ \end{align*}
Second,
\begin{align*} \frac{\partial H^{hid}_j}{\partial A^{hid}_j}&=(1-H^{hid}_j)H^{hid}_j.\\ \end{align*}
Third,
\begin{align*} \frac{\partial A^{hid}_j}{\partial W_{ji}}&=A^{in}_i.\\ \end{align*}
Now we assemble these parts together, as
\begin{align*} \frac{\partial E}{\partial W_{ji}}=-\sum_k (T_k-A^{out}_k)Z_{kj}H^{hid}_j(1-H^{hid}_j)A^{in}_i. \end{align*}
Therefore, $\Delta W_{ji}=\eta \{\sum_k (T_k-A^{out}_k)Z_{kj}\}H^{hid}_j(1-H^{hid}_j)A^{in}_i$. Alternatively, you can view the error on each output node as an error signal that is weighted by $Z_{kj}$ and passed backward to each hidden node. Now let's put these formulas to work in a 3-layered fully-connected neural network.
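Before building the full network, it can help to sanity-check the derived gradients numerically. The sketch below is illustrative rather than part of the original text: the layer sizes, the random seed, and the finite-difference step eps are arbitrary choices. It compares the analytic $\frac{\partial E}{\partial W}$ with a central-difference estimate; the analogous check for $\frac{\partial E}{\partial Z}$ works the same way.
import numpy as np
def sig(x):
    return 1/(1+np.exp(-x))
def loss(W, Z, A_in, T):
    # E = 0.5 * sum_k (T_k - A_out_k)^2 for a single input pattern
    H = sig(W.dot(A_in))
    A_out = Z.dot(H)
    return 0.5*np.sum((T - A_out)**2)
rng = np.random.default_rng(0)
A_in = rng.uniform(size=2)               # one input pattern
T = rng.uniform(size=1)                  # one target value
W = rng.uniform(-0.5, 0.5, size=(3, 2))  # input -> hidden weights
Z = rng.uniform(-0.5, 0.5, size=(1, 3))  # hidden -> output weights
# Analytic gradients from the formulas derived above
H = sig(W.dot(A_in))
A_out = Z.dot(H)
E = T - A_out                                   # (T_k - A_out_k)
dZ = -np.outer(E, H)                            # dE/dZ_kj = -(T_k - A_out_k) H_j
dW = -np.outer(E.dot(Z)*H*(1 - H), A_in)        # dE/dW_ji from the chain rule
# Central-difference check of dW (eps is an arbitrary small step)
eps = 1e-6
num_dW = np.zeros_like(W)
for j in range(W.shape[0]):
    for i in range(W.shape[1]):
        Wp = W.copy(); Wp[j, i] += eps
        Wm = W.copy(); Wm[j, i] -= eps
        num_dW[j, i] = (loss(Wp, Z, A_in, T) - loss(Wm, Z, A_in, T))/(2*eps)
print(np.allclose(dW, num_dW, atol=1e-6))       # True if the derivation is correct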
import numpy as np
from numpy import random
A=np.array([[1,1],
            [1,0],
            [0,1],
            [0,0]])
T=np.array([0,1,1,0])
# A 3x2 weight matrix between input and hidden layers
W=random.uniform(size=(3,2))
W=W-0.5
# A 1x3 weight matrix between hidden and output layers
Z=random.uniform(size=(1,3))
Z=Z-0.5
iter=2500
Error=np.zeros(iter)
eta=0.03
def sig(x):
    return(1/(1+np.exp(-x)))
def ml(A,T,Z,W,eta,iter):
    for i in range(0,iter):
        # Feed forward: activation of hidden nodes for all input stimuli
        Act=W.dot(A.T)
        # Sigmoid transformation
        H=sig(Act)
        Out=Z.dot(H)
        # Compute error
        E=T-Out
        Error[i]=sum(np.square(E.sum(axis=0)))/2
        # Backpropagation
        # Adjusted amount of Z
        deltaZ=eta*(E.dot(H.T))
        # Adjusted amount of W: the error signal is passed back through Z
        # and scaled by the sigmoid derivative H*(1-H)
        Err_sig=E.T.dot(Z)
        temp=H*(1-H)
        # (Note: eta is not applied to deltaW here; multiplying by eta would
        # match the derivation exactly but slow the learning of W.)
        deltaW=(Err_sig.T*temp).dot(A)
        # Weight updating
        Z=Z+deltaZ
        W=W+deltaW
    # Generate predictions after learning
    Act=W.dot(A.T)
    H=sig(Act)
    Out=Z.dot(H)
    result={'Error':Error,'Z':Z,'W':W,'Out':Out}
    return result
Result=ml(A,T,Z,W,eta,iter)
We can now plot the error curve. The error gradually drops over the 2500 iterations, and the model's output after learning shows that it successfully learns the XOR judgment. The shortfall of this 3-layered neural network is that learning is slow.
import matplotlib.pyplot as plt
y=Result['Error']
x=range(1,2501)
plt.plot(x,y,'r')
plt.xlabel('Iteration')
plt.ylabel('Error')
#plt.show()
Result['Out']
array([[0.08296654, 0.85371464, 0.9071505 , 0.27069332]])
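Thresholding these continuous outputs at 0.5 (a conventional cutoff, not something specified above) recovers the XOR truth table.
# Binarize the network's continuous outputs at 0.5
pred=(Result['Out']>0.5).astype(int)
print(pred)   # expected: [[0 1 1 0]], matching the targets T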