What Do We Need Autograd For?
A machine learning model is a function, with inputs and outputs. For this discussion, we’ll treat the input as an i-dimensional vector \vec{x}, with elements x_i. We can then express the model, M, as a vector-valued function of the input: \vec{y} = \vec{M}(\vec{x}). (We treat the value of M’s output as a vector because, in general, a model may have any number of outputs.)
Since we’ll mostly be discussing autograd in the context of training, our output of interest will be the model’s loss. The loss function L(\vec{y}) = L(\vec{M}(\vec{x})) is a single-valued scalar function of the model’s output. This function expresses how far off our model’s prediction was from a particular input’s ideal output. Note: After this point, we will often omit the vector sign where it should be contextually clear - e.g., y instead of \vec{y}.
In training a model, we want to minimize the loss. In the idealized case of a perfect model, that means adjusting its learning weights - that is, the adjustable parameters of the function - such that loss is zero for all inputs. In the real world, it means an iterative process of nudging the learning weights until we see that we get a tolerable loss for a wide variety of inputs.
How do we decide how far and in which direction to nudge the weights? We want to minimize the loss, which means making its first derivative with respect to the input equal to 0: \frac{\partial L}{\partial x} = 0.
In particular, the gradients over the learning weights are of interest to us - they tell us what direction to change each weight to get the loss function closer to zero.
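As a concrete sketch of that nudging, here is a single hand-rolled gradient-descent step on a one-parameter model; the toy data and the learning rate lr are arbitrary choices for illustration:

import torch

# Toy one-parameter "model": y = w * x, with a squared-error loss.
w = torch.tensor(1.0, requires_grad=True)  # the learning weight
x = torch.tensor(2.0)                      # one training input
target = torch.tensor(6.0)                 # its ideal output

loss = (w * x - target) ** 2  # L = (w*x - target)^2
loss.backward()               # autograd fills in w.grad = dL/dw = -16 here

lr = 0.1                      # arbitrary learning rate
with torch.no_grad():
    w -= lr * w.grad          # nudge w against the gradient: w becomes 2.6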
The number of such local derivatives (each corresponding to a separate path through the model’s computation graph) tends to grow exponentially with the depth of a neural network, and so does the complexity of computing them. This is where autograd comes in: it tracks the history of every computation. Every computed tensor in your PyTorch model carries a history of its input tensors and the function used to create it. Combined with the fact that PyTorch functions meant to act on tensors each have a built-in implementation for computing their own derivatives, this greatly speeds the computation of the local derivatives needed for learning.
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
print(a)
Out:
tensor([0.0000, 0.2618, 0.5236, 0.7854, 1.0472, 1.3090, 1.5708, 1.8326, 2.0944, 2.3562, 2.6180, 2.8798, 3.1416, 3.4034, 3.6652, 3.9270, 4.1888, 4.4506, 4.7124, 4.9742, 5.2360, 5.4978, 5.7596, 6.0214, 6.2832], requires_grad=True)
I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray: while both objects are used to store n-dimensional matrices (aka "Tensors"), a torch.tensor has an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.
However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.
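A minimal sketch of that chain-rule machinery in action (the function here is arbitrary):

x = torch.tensor(2.0, requires_grad=True)
loss = (x ** 2 + 1).sqrt()  # a chained computation; each step is recorded in the graph
loss.backward()             # apply the chain rule backward through the recorded graph
print(x.grad)               # d(loss)/dx = x / sqrt(x**2 + 1) = 2 / sqrt(5) ≈ 0.8944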
As mentioned before, an np.ndarray object does not have this extra "computational graph" layer; therefore, when converting a torch.tensor to an np.ndarray, you must explicitly remove the computational graph of the tensor using the detach() method.
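For instance (a sketch; the exact error text may vary across PyTorch versions):

t = torch.rand(3, requires_grad=True)
# t.numpy()               # raises RuntimeError: can't call numpy() on a tensor that requires grad
arr = t.detach().numpy()  # works: detach() strips the computational graph first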
Note that if you wish, for some reason, to use PyTorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.
import numpy as np

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # matrix product in torch
    np.dot(x_t.numpy(), y_np)     # the same matrix product in numpy
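The grad_fn values shown next presumably came from a chain of operations on a; the code below is a reconstruction based on the later references to b = sin(a), c = 2 * b, and d = c + 1 (the final sum is an assumption suggested by SumBackward0):

b = torch.sin(a)
c = 2 * b
d = c + 1
out = d.sum()  # assumed scalar output, consistent with SumBackward0
print(b)
print(c)
print(d)
print(out)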
Out (abridged to each tensor’s grad_fn):
b: grad_fn=<SinBackward0>
c: grad_fn=<MulBackward0>
d: grad_fn=<AddBackward0>
out: grad_fn=<SumBackward0>
The grad_fn on b gives us a hint that when we execute the backpropagation step and compute gradients, we’ll need to compute the derivative of sin(x) for all of this tensor’s inputs.
Each grad_fn stored with our tensors allows you to walk the computation all the way back to its inputs with its next_functions property. We can see below that drilling down on this property on d shows us the gradient functions for all the prior tensors. Note that a.grad_fn is reported as None, indicating that this was an input to the function with no history of its own.
print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\nc:')
print(c.grad_fn)
print('\nb:')
print(b.grad_fn)
print('\na:')
print(a.grad_fn)
Out:
d:
<AddBackward0 object at 0x7ff3d80a3518>
((<MulBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<SinBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<AccumulateGrad object at 0x7ff3d80a3518>, 0),)
()

c:
<MulBackward0 object at 0x7ff3d80a35f8>

b:
<SinBackward0 object at 0x7ff3d80a3518>

a:
None
Adding a constant, as we did to compute d, does not change the derivative. That leaves c = 2 * b = 2 * sin(a), the derivative of which should be 2 * cos(a). Looking at the computation history above, that’s just what we see.
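We can check this numerically by running the backward pass on the reconstructed out from above (a sketch):

out.backward()           # populates a.grad with d(out)/da
print(a.grad)            # should match the analytic derivative...
print(2 * torch.cos(a))  # ...which is 2 * cos(a)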
Be aware that only leaf nodes of the computation graph have their gradients computed. If you tried, for example, print(c.grad), you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
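Continuing the reconstructed example from above (a sketch):

print(a.grad)  # populated: a is a leaf tensor with requires_grad=True
print(c.grad)  # None: c is an intermediate, non-leaf tensor (recent PyTorch versions also warn here)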
One important thing about the process: after calling optimizer.step(), you need to call optimizer.zero_grad(), or else every time you run loss.backward(), the gradients on the learning weights will accumulate.
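A minimal sketch of that bookkeeping in a training loop; model, loss_fn, and the data iterable are hypothetical stand-ins for whatever you have set up elsewhere:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is arbitrary here

for inputs, targets in data:
    optimizer.zero_grad()  # clear the gradients left over from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()        # accumulate fresh gradients into each weight's .grad
    optimizer.step()       # nudge the weights using those gradients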
If you only need autograd turned off temporarily, a better way is to use torch.no_grad().
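For example (a sketch; the tensor shapes are arbitrary):

x = torch.ones(2, 3, requires_grad=True)
y1 = 2 * x          # tracked: y1 gets a grad_fn
with torch.no_grad():
    y2 = 2 * x      # not tracked: no graph is recorded inside the block
print(y1.requires_grad, y2.requires_grad)  # True False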
There’s a corresponding context manager, torch.enable_grad(), for turning autograd on when it isn’t already. It may also be used as a decorator.
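For example (a sketch; the function name is illustrative):

@torch.enable_grad()
def tracked_double(t):
    return t * 2  # runs with autograd on, even when called under no_grad

x = torch.rand(3, requires_grad=True)
with torch.no_grad():
    y = tracked_double(x)
print(y.requires_grad)  # True: the decorator re-enabled graph tracking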
Finally, you may have a tensor that requires gradient tracking, but you want a copy that does not. For this we have the Tensor object’s detach() method - it creates a copy of the tensor that is detached from the computation history:
x = torch.rand(5, requires_grad=True)
y = x.detach()
print(x)
print(y)
Out:
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892], requires_grad=True)
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892])
We did this above when we wanted to graph some of our tensors. This is because matplotlib expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with requires_grad=True. Making a detached copy lets us move forward.
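A sketch of that graphing pattern, assuming matplotlib is installed:

import matplotlib.pyplot as plt

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
b = torch.sin(a)
plt.plot(a.detach().numpy(), b.detach().numpy())  # detach() before converting to NumPy
plt.show()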
Jacobian
PyTorch also offers a functional API for derivatives of whole functions: torch.autograd.functional.jacobian() computes the full Jacobian of a function at a given set of inputs:
>>> inputs = (torch.rand(3), torch.rand(3)) # arguments for the function
>>> print(inputs)
(tensor([0.7074, 0.9178, 0.3003]), tensor([0.9081, 0.2903, 0.7643]))
>>> def exp_adder(x, y):
... return 2 * x.exp() + 3 * y
...
>>> torch.autograd.functional.jacobian(exp_adder, inputs)
(tensor([[4.0574, 0.0000, 0.0000],
[0.0000, 5.0077, 0.0000],
[0.0000, 0.0000, 2.7005]]), tensor([[3., 0., 0.],
[0., 3., 0.],
[0., 0., 3.]]))
It can compute gradients!
Hessian
torch.autograd.functional also provides a hessian() helper - part of the same functional, higher-level API for autograd.
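A minimal sketch of torch.autograd.functional.hessian(), which takes a scalar-valued function and the point at which to evaluate it (the function below is an arbitrary example):

>>> def pow_reducer(x):
...     return x.pow(3).sum()  # scalar-valued: the sum of cubes
...
>>> torch.autograd.functional.hessian(pow_reducer, torch.tensor([1., 2.]))
tensor([[ 6.,  0.],
        [ 0., 12.]])

The diagonal holds the second derivatives 6 * x, as expected for x ** 3.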