THE FUNDAMENTALS OF AUTOGRAD

What Do We Need Autograd For?

A machine learning model is a function, with inputs and outputs. For this discussion, we’ll treat the input as an i-dimensional vector \vec{x}, with elements x_i. We can then express the model, M, as a vector-valued function of the input: \vec{y} = \vec{M}(\vec{x}). (We treat the value of M’s output as a vector because, in general, a model may have any number of outputs.)

Since we’ll mostly be discussing autograd in the context of training, our output of interest will be the model’s loss. The loss function L(\vec{y}) = L(\vec{M}(\vec{x})) is a single-valued scalar function of the model’s output. This function expresses how far off our model’s prediction was from a particular input’s ideal output. Note: After this point, we will often omit the vector sign where it should be contextually clear - e.g., y instead of \vec{y}.

In training a model, we want to minimize the loss. In the idealized case of a perfect model, that means adjusting its learning weights - that is, the adjustable parameters of the function - such that loss is zero for all inputs. In the real world, it means an iterative process of nudging the learning weights until we see that we get a tolerable loss for a wide variety of inputs.

How do we decide how far and in which direction to nudge the weights? We want to minimize the loss, which means making its first derivative with respect to the input equal to 0: \frac{\partial L}{\partial x} = 0.

In particular, the gradients over the learning weights are of interest to us - they tell us what direction to change each weight to get the loss function closer to zero.

The number of such local derivatives (each corresponding to a separate path through the model’s computation graph) tends to grow exponentially with the depth of a neural network, and so does the complexity of computing them. This is where autograd comes in: it tracks the history of every computation. Every computed tensor in your PyTorch model carries a history of its input tensors and the function used to create it. Combined with the fact that PyTorch functions meant to act on tensors each have a built-in implementation for computing their own derivatives, this greatly speeds the computation of the local derivatives needed for learning. To see this tracking in action, we start by creating an input tensor with requires_grad=True, which tells autograd to record every operation performed on it:

import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
print(a)

Out:

tensor([0.0000, 0.2618, 0.5236, 0.7854, 1.0472, 1.3090, 1.5708, 1.8326, 2.0944,
        2.3562, 2.6180, 2.8798, 3.1416, 3.4034, 3.6652, 3.9270, 4.1888, 4.4506,
        4.7124, 4.9742, 5.2360, 5.4978, 5.7596, 6.0214, 6.2832],
       requires_grad=True)

I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray: while both objects are used to store n-dimensional matrices (aka "tensors"), torch.tensors have an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.

So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.

However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but also (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.
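As a minimal sketch of this mechanism (the tensor values here are illustrative, not from the original text): we build a tiny scalar loss from a tracked tensor, then let autograd apply the chain rule backward through the recorded graph.

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()   # forward pass; the graph x -> pow -> sum is recorded
loss.backward()         # chain rule applied backward through the graph
print(x.grad)           # tensor([4., 6.]) == d(loss)/dx, since d(x^2)/dx = 2x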

As mentioned before, an np.ndarray object does not have this extra "computational graph" layer, and therefore, when converting a torch.tensor to an np.ndarray, you must explicitly remove the computational graph of the tensor using the detach() method.
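For instance (a minimal illustration, not taken from the original text):

t = torch.ones(3, requires_grad=True)
# t.numpy() would raise a RuntimeError, because t has requires_grad=True
n = t.detach().numpy()  # detach first, then convert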

Note that if you wish, for some reason, to use PyTorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably:

import numpy as np
import torch

with torch.no_grad():
  x_t = torch.rand(3, 4)
  y_np = np.ones((4, 2), dtype=np.float32)
  out_t = x_t @ torch.from_numpy(y_np)  # matrix product in torch
  out_np = np.dot(x_t.numpy(), y_np)    # the same matrix product in numpy
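
With gradient tracking left on, every operation leaves such a trace. The code that produced the output below is missing from this excerpt; the following reconstruction is consistent with the grad_fn values shown and with the discussion further down (it reuses the tensor a created earlier):

b = torch.sin(a)   # recorded as SinBackward0
c = 2 * b          # recorded as MulBackward0
d = c + 1          # recorded as AddBackward0
out = d.sum()      # recorded as SumBackward0
print(b)
print(c)
print(d)
print(out)

Each printed tensor reports the function that created it: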
Out (abridged to each tensor’s grad_fn):

grad_fn=<SinBackward0>
grad_fn=<MulBackward0>
grad_fn=<AddBackward0>
grad_fn=<SumBackward0>

This grad_fn gives us a hint that when we execute the backpropagation step and compute gradients, we’ll need to compute the derivative of sin(x) for all this tensor’s inputs.

Each grad_fn stored with our tensors allows you to walk the computation all the way back to its inputs with its next_functions property. We can see below that drilling down on this property on d shows us the gradient functions for all the prior tensors. Note that a.grad_fn is reported as None, indicating that this was an input to the function with no history of its own.

print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\nc:')
print(c.grad_fn)
print('\nb:')
print(b.grad_fn)
print('\na:')
print(a.grad_fn)

Out:

d:
<AddBackward0 object at 0x7ff3d80a3518>
((<MulBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<SinBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<AccumulateGrad object at 0x7ff3d80a3518>, 0),)
()

c:
<MulBackward0 object at 0x7ff3d80a35f8>

b:
<SinBackward0 object at 0x7ff3d80a3518>

a:
None

Adding a constant, as we did to compute d, does not change the derivative. That leaves c = 2 * b = 2 * sin(a), the derivative of which should be 2 * cos(a). Looking at the graph above, that’s just what we see.
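We can verify this numerically (a minimal check, assuming out = d.sum() as reconstructed above; the sum gives us the scalar output that backward() requires):

out.backward()  # walks the graph back to the leaf tensor a and fills a.grad
print(torch.allclose(a.grad, 2 * torch.cos(a)))  # True: d(out)/da = 2 * cos(a)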

Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, print(c.grad), you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
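If you do need the gradient of an intermediate tensor, you can ask autograd to keep it before calling backward(). A brief sketch using Tensor.retain_grad(), which goes beyond what the text above covers:

b2 = torch.sin(a)
c2 = 2 * b2
c2.retain_grad()      # opt in: keep the gradient on this non-leaf tensor
c2.sum().backward()
print(c2.grad)        # populated now, instead of None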

One important thing about the process: after calling optimizer.step(), you need to call optimizer.zero_grad(), or else every time you run loss.backward(), the gradients on the learning weights will accumulate.
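In a typical loop, that ordering looks something like this (a generic sketch; the model, data, and learning rate are placeholders, not taken from the original text):

import torch

model = torch.nn.Linear(4, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for inputs, targets in [(torch.rand(8, 4), torch.rand(8, 1))]:  # stand-in data
    optimizer.zero_grad()  # clear the gradients accumulated last iteration
    loss = loss_fn(model(inputs), targets)
    loss.backward()        # populate .grad on the learning weights
    optimizer.step()       # nudge the weights using those gradients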

If you only need autograd turned off temporarily, a better way is to use the torch.no_grad() context manager:
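For example (a small illustration, not from the original text):

x = torch.ones(2, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False - no graph was recorded inside the block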

There’s a corresponding context manager, torch.enable_grad(), for turning autograd on when it isn’t already. It may also be used as a decorator.
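A quick sketch of both uses (illustrative code, not from the original text):

@torch.enable_grad()
def tracked_fn(x):
    return x * 2  # tracked even when the caller is inside no_grad()

x = torch.ones(2, requires_grad=True)
with torch.no_grad():
    with torch.enable_grad():
        y = x * 2
    z = tracked_fn(x)
print(y.requires_grad, z.requires_grad)  # True True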

Finally, you may have a tensor that requires gradient tracking, but you want a copy that does not. For this we have the Tensor object’s detach() method - it creates a copy of the tensor that is detached from the computation history:

x = torch.rand(5, requires_grad=True)
y = x.detach()

print(x)
print(y)

Out:

tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892], requires_grad=True)
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892])

We did this above when we wanted to graph some of our tensors. This is because matplotlib expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with requires_grad=True. Making a detached copy lets us move forward.
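Concretely, the plotting case looks something like this (an illustrative sketch, assuming matplotlib is available; this code is not part of the excerpt):

import math

import matplotlib.pyplot as plt
import torch

x = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
y = torch.sin(x)
# plt.plot(x, y) would fail: matplotlib converts its inputs to NumPy arrays
plt.plot(x.detach(), y.detach())  # detached copies convert cleanly
plt.show()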

Jacobian

If you have a function with an n-dimensional input and an m-dimensional output, the first derivatives of every output with respect to every input form an m x n matrix, the Jacobian. torch.autograd.functional.jacobian() computes it in one call:

>>> inputs = (torch.rand(3), torch.rand(3)) # arguments for the function
>>> print(inputs)
(tensor([0.7074, 0.9178, 0.3003]), tensor([0.9081, 0.2903, 0.7643]))
>>> def exp_adder(x, y):
...     return 2 * x.exp() + 3 * y
...
>>> torch.autograd.functional.jacobian(exp_adder, inputs)
(tensor([[4.0574, 0.0000, 0.0000],
        [0.0000, 5.0077, 0.0000],
        [0.0000, 0.0000, 2.7005]]), tensor([[3., 0., 0.],
        [0., 3., 0.],
        [0., 0., 3.]]))

It can compute gradients!

Hessian
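
There is a matching helper for second derivatives; a minimal sketch using torch.autograd.functional.hessian() (the function and input here are illustrative):

>>> def pow_reducer(x):
...     return x.pow(3).sum()
...
>>> x = torch.tensor([1., 2.])
>>> torch.autograd.functional.hessian(pow_reducer, x)
tensor([[ 6.,  0.],
        [ 0., 12.]])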

These helpers are part of the functional higher-level API for autograd; see "Automatic differentiation package - torch.autograd" in the PyTorch documentation: https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api
