Blog-detail

由一个问题引发的 goroutine 相关源码的探究

编辑时间：2019-09-06 09:59:56 作者：lmc 浏览量：6146

在 Go 语言中文网微信群有人问了这么一个问题，如下代码：

const N = 26

func main() {
 const GOMAXPROCS = 1
 runtime.GOMAXPROCS(GOMAXPROCS)
 var wg sync.WaitGroup
 wg.Add(2 * N)
 for i := 0; i < N; i++ {
  go func(i int) {
   defer wg.Done()
   fmt.Printf("%c", 'a'+i)
  }(i)

  go func(i int) {
   defer wg.Done()
   fmt.Printf("%c", 'A'+i)
  }(i)
 }
 wg.Wait()
}

问题：

为什么输出是 ZaAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz，而不是 aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
如果想要输出 aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ，怎么做？

一看，这是个涉及到 goroutine 调度的问题。这里郑重警示：实际中，不要依赖任何调度器的调度顺序！当然，作为学习，我们可以来研究一下该问题。

解决方案

根据现象，我们可以容易的想到如下很“奇葩”的解决方案，即在 wg.Wait() 之前开一个空 goroutine：

go func(){}()

除了这种方案，还有一种方案，那就是在 wg.Wait() 之前 Sleep，时间长短无所谓：

time.Sleep(1e9)

为什么会这样？

该问题和解决方案都着实让人困惑。下面就让我们一步步来揭示其中的原因。

简单的例子

我们通过一个简单的示例来猜测创建 goroutine 的顺序和执行 goroutine 的顺序的关系。也就是说，维护的是一个栈还是队列，亦或是其他的？

func main() {
 // 为了验证问题，该句是必须的
 runtime.GOMAXPROCS(1)

 var wg sync.WaitGroup

 wg.Add(2)

 // goroutine 1
 go func() {
  defer wg.Done()
  fmt.Println("first goroutine")
 }()

 // goroutine 2
 go func() {
  defer wg.Done()
  fmt.Println("second goroutine")
 }()

 wg.Wait()
}

可运行源码地址：https://play.studygolang.com/p/9UvaxWtMNNY

执行以上代码，输出：

second goroutine
first goroutine

如果调整两个 goroutine 的顺序，再执行，发现输出是：

first goroutine
second goroutine

由此似乎可以得出结论：goroutine 队列是后进先出。

三个 goroutine 呢？

我们尝试再加一个 goroutine。

func main() {
 // 为了验证问题，该句是必须的
 runtime.GOMAXPROCS(1)

 var wg sync.WaitGroup

 wg.Add(3)

 // goroutine 1
 go func() {
  defer wg.Done()
  fmt.Println("first goroutine")
 }()

 // goroutine 2
 go func() {
  defer wg.Done()
  fmt.Println("second goroutine")
 }()
 
 // goroutine 3
 go func() {
  defer wg.Done()
  fmt.Println("third goroutine")
 }()

 wg.Wait()
}

可运行源码地址：https://play.studygolang.com/p/XFCy9AbzKum

输出：

third goroutine
first goroutine
second goroutine

这又表明，并非是后进先出，顶多是：最后进的先出；其他的按先进先出的顺序。这也是本文开始问题的第 1 种解决方案。

time.Sleep 怎么结果又变了？

回到 2 个 goroutine 的情况，我们改为如下方式：

func main() {
 // 为了验证问题，该句是必须的
 runtime.GOMAXPROCS(1)

 // goroutine 1
 go func() {
  fmt.Println("first goroutine")
 }()

 // goroutine 2
 go func() {
  fmt.Println("second goroutine")
 }()

 time.Sleep(1)
}

可运行源码地址：https://play.studygolang.com/p/3HY4wyHVV5f

输出：

first goroutine
second goroutine

什么鬼？怎么最后进的也不是先出了？

这也是为什么加上 time.Sleep() 本文开始的问题就解决了。

到此，似乎类似的问题知道怎么解决了，但为什么是这样的？

GMP 中调度相关源码

因为涉及到 goroutine 的调度，这里不得不搬出 Go 相关源码，但我会尽可能不让大家烧脑，用简单易懂的方式阐述。

你应该听过 Go 语言的 GMP 模型，这个必须有所了解（更多信息可以查阅本文最后提供的深入学习资料）。这里你只需要知道，P 中会有一个可运行（runnable）的 goroutine 队列。在 P 的结构体中，由 runq 字段保存，也就是 P 的本地可运行的 goroutine 队列。

type p struct {
 ...
 runq     [256]guintptr
  ...
}

创建一个 goroutine，对应是 runtime 包中的 newproc 函数，忽略一些细节，我们找到了 runqput 函数的调用：

runqput(_p_, newg, true)

该函数的签名如下：

// runqput tries to put g on the local runnable queue.
// If next is false, runqput adds g to the tail of the runnable queue.
// If next is true, runqput puts g in the _p_.runnext slot.
// If the run queue is full, runnext puts g on the global queue.
// Executed only by the owner P.
func runqput(_p_ *p, gp *g, next bool)

runqput 尝试把 G 放到本地执行队列中。这里的 next 很关键，next 参数如果是 false 的话，runqput 会将 G 放到运行队列的尾部。从上面代码看出，调用 runqput 时，next 传递的是 true，这时候会怎么处理呢？

我们要回到 P 结构体，其中有一个字段（runq 下一个字段）：

// runnext, if non-nil, is a runnable G that was ready'd by
// the current G and should be run next instead of what's in
// runq if there's time remaining in the running G's time
// slice. It will inherit the time left in the current time
// slice. If a set of goroutines is locked in a
// communicate-and-wait pattern, this schedules that set as a
// unit and eliminates the (potentially large) scheduling
// latency that otherwise arises from adding the ready'd
// goroutines to the end of the run queue.
runnext guintptr

runnext 非空时，代表的是一个 runnable 状态的 G，这个 G 是被当前 G 修改为 ready 状态的，并且相比在 runq 中的 G 有更高的优先级。如果当前 G 还有剩余的可用时间，那么就应该运行这个 G。运行之后，该 G 会继承当前 G 的剩余时间。

这段话很重要。什么是当前 G ？比如在 main 函数中，我们创建一个 goroutine，那么 main 函数所在 goroutine（也就是 main goroutine）就是当前 G，而创建的 G 就是这句话所说的 runnable 状态的 G。

接着说 runqput 函数，当 next 为 true 时，当前 G 最后创建的 G 放在 runnext 中，但上一次创建的放在 runnext 中的怎么办呢？答案是移到 P 本地 G 队列的队尾。

到这里，你的疑惑应该差不多解开了。我们梳理下开始的问题：

因为 main goroutine 还有剩余可用时间，最后创建的 goroutine 优先执行，也就是打印 Z 的先执行，其他部分按预期输出
对于通过额外创建一个 goroutine 的方式来输出正常的结果，这里不需要额外解释了；

但为什么加上 time.Sleep 也可以呢？

跟踪 time.Sleep 的源码，我们发现了 addtimerLocked 这个方法，留意其中如下代码：

if tb.rescheduling {
  tb.rescheduling = false
  goready(tb.gp, 0)
}
// 如果没有启动 timer 管理定时器，则启动。timerproc 只会启动一次，即全局 timer 管理器
if !tb.created {
  tb.created = true
  go timerproc(tb)
}

如果代码中第 1 次使用 time.Sleep，则会执行 go timerproc(tb)，这相当于又创建了一个 goroutine，它成为了 runnext。所以，main goroutine 之后，应该是先执行这个 goroutine，它负责管理全局 timer 定时器。（timerproc 函数不会返回）

更进一步，既然 timerproc 只会启动一次，那么我们在 main.main 一开始就 time.Sleep 一次，这样就不会因为 timerproc 这里的 G 占用 runnext 了。

实际上，这时进入了 addtimerLocked 方法中的 goready，它会把上面管理全局 timer 定时器的 G 放入 runnext。

可见，只要有 time.Sleep 后，runnext 中的 G 都是 timerproc 对应的 G，让它有尽可能多的机会检查 timer，而 timerproc 中有各种机制导致调度器重新调度，因此，runq 中的 G 得以调度执行。

总结

我们永远不应该依赖调度器的调度顺序。然而，我们应该对调度器有所了解，知道它大致的调度策略，遇到问题，查阅资料、阅读源码，这个过程会对自己有很大的提升。

我希望你看此文时，能尝试着看看相关源码。不懂的地方肯定很多，我们只需抓住关心的点。欢迎一起交流学习！

调度器相关参考文献

Go语言源码 http://docs.studygolang.com/src/runtime/proc.go
详尽干货！从源码角度看 Golang 的调度 https://studygolang.com/articles/20651
雨痕《Go语言学习笔记》调度器相关源码剖析
Xargin 调度器源码阅读 https://github.com/cch123/golang-notes/blob/master/scheduler.md
调度器源码中文注释 https://github.com/zboya/golang_runtime_reading/blob/master/src/runtime/proc.go
调度器剖析三部曲 https://studygolang.com/articles/14264 https://studygolang.com/articles/15316 https://studygolang.com/articles/17014
Go work-stealing 调度器 https://studygolang.com/articles/12328
Goroutine调度实例简要分析 https://studygolang.com/articles/11720
也谈goroutine调度器 https://studygolang.com/articles/10116

欢迎补充其他资源！

补充说明

为什么最近的 goroutine 要给优先执行权，官方给了解释：https://go-review.googlesource.com/c/go/+/9289，目的是更好的保证公平。

注意，Go 1.14.x 调度器有改变，time.Sleep 的方案已经不生效了。这也证明我们写代码不能依赖调度器的调度顺序。

来说两句吧