PyTorch Official Tutorials
Typed along while following the official PyTorch tutorials. The code is slightly modified for my own understanding and annotated with comments; the IDE is Jupyter Notebook.
AUTOMATIC DIFFERENTIATION WITH TORCH.AUTOGRAD
When training neural networks, the most frequently used algorithm is back propagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.
To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.
Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:
import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weight parameters to optimize
b = torch.randn(3, requires_grad=True)     # bias parameters to optimize
z = torch.matmul(x, w) + b                 # forward pass: the logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
Tensors, Functions and Computational graph
This code defines the following computational graph:
In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.
You can set the value of requires_grad when creating a tensor, or later by using the x.requires_grad_(True) method.
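A minimal sketch of both ways (the names t1 and t2 and the shapes are arbitrary, just for illustration):
import torch

# set requires_grad when the tensor is created
t1 = torch.randn(3, requires_grad=True)

# ...or enable it afterwards, in place
t2 = torch.randn(3)
t2.requires_grad_(True)

print(t1.requires_grad, t2.requires_grad)  # True True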
A function that we apply to tensors to construct the computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the grad_fn property of a tensor. You can find more information about Function in the documentation.
print("Gradient function for z=",z.grad_fn)
print("Gradient function for loss=",loss.grad_fn)
Gradient function for z= <AddBackward0 object at 0x0000021D3BB0B5C8>
Gradient function for loss= <BinaryCrossEntropyWithLogitsBackward object at 0x0000021D3BB0B588>
Computing Gradients
To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to the parameters, namely, we need $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$ under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad:
loss.backward()
print(w.grad)
print(b.grad)
tensor([[0.0675, 0.1460, 0.0066],
[0.0675, 0.1460, 0.0066],
[0.0675, 0.1460, 0.0066],
[0.0675, 0.1460, 0.0066],
[0.0675, 0.1460, 0.0066]])
tensor([0.0675, 0.1460, 0.0066])
We can only obtain the grad properties for the leaf nodes of the computational graph, which have the requires_grad property set to True. For all other nodes in our graph, gradients will not be available.
We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.
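A minimal sketch of this, reusing the x, w, b and y tensors defined above; the second backward call only works because retain_graph=True was passed on the first:
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward(retain_graph=True)  # keep the graph alive for another backward pass
loss.backward()                   # allowed only because of retain_graph=True above; gradients accumulate in w.grad and b.grad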
Disabling Gradient Tracking
By default, all tensors with requires_grad=True are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with a torch.no_grad() block:
z = torch.matmul(x, w) + b
print(z.requires_grad)  # True, because w and b have requires_grad=True

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)  # False
True
False
Another way to achieve the same result is to use the detach() method on the tensor:
z = torch.matmul(x, w) + b
z_det = z.detach()  # z_det shares the same data as z, but has requires_grad=False
print(z_det.requires_grad)
False
There are reasons you might want to disable gradient tracking:
- To mark some parameters in your neural network as frozen parameters. This is a very common scenario for finetuning a pretrained network (see the sketch after this list).
- To speed up computations when you are only doing the forward pass, because computations on tensors that do not track gradients are more efficient.
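A minimal sketch of the freezing scenario, assuming a made-up two-layer model (not the network from this tutorial); only the unfrozen layer receives gradients:
import torch
from torch import nn

# hypothetical model: freeze the first layer, finetune only the last one
model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 3))

for param in model[0].parameters():
    param.requires_grad_(False)  # frozen: no gradients will be computed for this layer

out = model(torch.ones(5))
out.sum().backward()
print(model[0].weight.grad)  # None      (frozen layer)
print(model[2].weight.grad)  # populated (still trainable)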
More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.
In a forward pass, autograd does two things simultaneously:
- run the requested operation to compute a resulting tensor
- maintain the operation's gradient function in the DAG.
The backward pass kicks off when .backward() is called on the DAG root. autograd then:
- computes the gradients from each .grad_fn,
- accumulates them in the respective tensor's .grad attribute,
- using the chain rule, propagates all the way to the leaf tensors.
DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
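A minimal sketch of this dynamic behavior, using a made-up function f whose recorded operations depend on its input; a fresh graph is built on every forward pass:
import torch

def f(t):
    # the operations recorded in the DAG differ from call to call
    if t.sum() > 0:
        return (t * 2).sum()
    return (t ** 3).sum()

for step in range(2):
    t = torch.randn(4, requires_grad=True)
    result = f(t)
    result.backward()  # runs on the graph that was just built during this forward pass
    print(result.grad_fn, t.grad)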
Optional Reading: Tensor Gradients and Jacobian Products
In many cases, we have a scalar loss function and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute a so-called Jacobian product, and not the actual gradient.
For a vector function $\vec{y}=f(\vec{x})$, where $\vec{x}=\langle x_1,\dots,x_n\rangle$ and $\vec{y}=\langle y_1,\dots,y_m\rangle$, the gradient of $\vec{y}$ with respect to $\vec{x}$ is given by the Jacobian matrix
$$J=\begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$
Instead of computing the Jacobian matrix itself, PyTorch lets you compute the Jacobian product $v^{T}\cdot J$ for a given input vector $v=(v_1,\dots,v_m)$. This is achieved by calling backward with $v$ as an argument:
# torch.eye(n, m=None, out=None) creates a 2-D tensor with ones on the diagonal and zeros elsewhere
inp = torch.eye(5, requires_grad=True)
print("inp=", inp)
out = (inp + 1).pow(2)
print("\nout=", out)

out.backward(torch.ones_like(inp), retain_graph=True)
print("\nFirst call\n", inp.grad)
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nSecond call\n", inp.grad)

# zero out the accumulated gradients
inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)
inp= tensor([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]], requires_grad=True)
out= tensor([[4., 1., 1., 1., 1.],
[1., 4., 1., 1., 1.],
[1., 1., 4., 1., 1.],
[1., 1., 1., 4., 1.],
[1., 1., 1., 1., 4.]], grad_fn=<PowBackward0>)
First call
tensor([[4., 2., 2., 2., 2.],
[2., 4., 2., 2., 2.],
[2., 2., 4., 2., 2.],
[2., 2., 2., 4., 2.],
[2., 2., 2., 2., 4.]])
Second call
tensor([[8., 4., 4., 4., 4.],
[4., 8., 4., 4., 4.],
[4., 4., 8., 4., 4.],
[4., 4., 4., 8., 4.],
[4., 4., 4., 4., 8.]])
Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
[2., 4., 2., 2., 2.],
[2., 2., 4., 2., 2.],
[2., 2., 2., 4., 2.],
[2., 2., 2., 2., 4.]])
Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because when doing backward propagation, PyTorch accumulates the gradients, i.e. the value of the computed gradients is added to the grad property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the grad property beforehand. In real-life training an optimizer helps us to do this.
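A minimal sketch of how a training loop usually handles this, reusing the tensors from above; the choice of torch.optim.SGD and the learning rate are arbitrary assumptions for illustration:
optimizer = torch.optim.SGD([w, b], lr=0.1)

for step in range(3):
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

    optimizer.zero_grad()  # clear gradients accumulated by previous iterations
    loss.backward()        # populate w.grad and b.grad
    optimizer.step()       # update w and b using the gradients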
Previously we were calling the backward() function without parameters. This is essentially equivalent to calling backward(torch.tensor(1.0)), which is a useful way to compute the gradients in the case of a scalar-valued function, such as the loss during neural network training.
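A minimal sketch of the equivalence, reusing the tensors defined above:
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# for a scalar output, these two calls are equivalent:
# loss.backward()
loss.backward(torch.tensor(1.0))
print(w.grad.shape, b.grad.shape)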
Translating is exhausting... but if I don't translate, I can't remember it from the English alone. This is hard.