Add documentation for our new GHA infrastructure
PiperOrigin-RevId: 508680514pull/11906/head
parent
76573bfef5
commit
6aefc477d7
|
@ -0,0 +1,196 @@
|
|||
This directory contains all of our automatically triggered workflows.
|
||||
|
||||
# Test runner
|
||||
|
||||
Our top level `test_runner.yml` is responsible for kicking off all tests, which
|
||||
are represented as reusable workflows. This is carefully constructed to satisfy
|
||||
the design laid out in go/protobuf-gha-protected-resources (see below), and
|
||||
duplicating it across every workflow file would be difficult to maintain. As an
|
||||
added bonus, we can manually dispatch our full test suite with a single button
|
||||
and monitor the progress of all of them simultaneously in GitHub's actions UI.
|
||||
|
||||
There are five ways our test suite can be triggered:
|
||||
|
||||
- **Post-submit tests** (`push`): These are run over newly submitted code
|
||||
that we can assume has been thoroughly reviewed. There are no additional
|
||||
security concerns here and these jobs can be given highly privileged access to
|
||||
our internal resources and caches.
|
||||
|
||||
- **Pre-submit tests from a branch** (`push_request`): These are run over
|
||||
every PR as changes are made. Since they are coming from branches in our
|
||||
repository, they have secret access by default and can also be given highly
|
||||
privileged access. However, we expect *many* of these events per change,
|
||||
and likely many from abandoned/exploratory changes. Given the much higher
|
||||
frequency, we restrict the ability to *write* to our more expensive caches.
|
||||
|
||||
- **Pre-submit tests from a fork** (`push_request_target`): These are run
|
||||
over every PR from a forked repository as changes are made. These have much
|
||||
more restricted access, since they could be coming from anywhere. To protect
|
||||
our secret keys and our resources, tests will not run until a commit has been
|
||||
labeled `safe to submit`. Further commits will require further approvals to
|
||||
run our test suite. Once marked as safe, we will provide read-only access to
|
||||
our caches and Docker images, but will generally disallow any writes to shared
|
||||
resources.
|
||||
|
||||
- **Continuous tests** (`schedule`): These are run on a fixed schedule. We
|
||||
currently have them set up to run daily, and can help identify non-hermetic
|
||||
issues in tests that don't get run often (such as due to test caching) or during
|
||||
slow periods like weekends and holidays. Similar to post-submit tests, these
|
||||
are run over submitted code and are highly privileged in the resources they
|
||||
can use.
|
||||
|
||||
- **Manual testing** (`workflow_dispatch`): Our test runner can be triggered
|
||||
manually over any branch. This is treated similarly to pre-submit tests,
|
||||
which should be highly privileged because they can only be triggered by the
|
||||
protobuf team.
|
||||
|
||||
# Staleness handling
|
||||
|
||||
While Bazel handles code generation seamlessly, we do support build systems that
|
||||
don't. There are a handful of cases where we need to check in generated files
|
||||
that can become stale over time. In order to provide a good developer
|
||||
experience, we've implemented a system to make this more manageable.
|
||||
|
||||
- Stale files should have a corresponding `staleness_test` Bazel target. This
|
||||
should be marked `manual` to avoid getting picked up in CI, but will fail if
|
||||
files become stale. It also provides a `--fix` flag to update the stale files.
|
||||
|
||||
- Bazel tests will never depend on the checked-in versions, and will generate
|
||||
new ones on-the-fly during build.
|
||||
|
||||
- Non-Bazel tests will always regenerate necessary files before starting. This
|
||||
is done using our `bash` and `docker` actions, which should be used for any
|
||||
non-Bazel tests. This way, no tests will fail due to stale files.
|
||||
|
||||
- A post-submit job will immediately regenerate any stale files and commit them
|
||||
if they've changed.
|
||||
|
||||
- A scheduled job will run late at night every day to make sure the post-submit
|
||||
is working as expected (that is, it will run all the staleness tests).
|
||||
|
||||
The `regenerate_stale_files.sh` script is the central script responsible for all
|
||||
the re-generation of stale files.
|
||||
|
||||
# Caches
|
||||
|
||||
We have a number of different caching strategies to help speed up tests. These
|
||||
live either in GCP buckets or in our GitHub repository cache. The former has
|
||||
a lot of resources available and we don't have to worry as much about bloat.
|
||||
On the other hand, the GitHub repository cache is limited to 10GB, and will
|
||||
start pruning old caches when it exceeds that threshold. Therefore, we need
|
||||
to be very careful about the size and quantity of our caches in order to
|
||||
maximize the gains.
|
||||
|
||||
## Bazel remote cache
|
||||
|
||||
As described in https://bazel.build/remote/caching, remote caching allows us to
|
||||
offload a lot of our build steps to a remote server that holds a cache of
|
||||
previous builds. We use our GCP project for this storage, and configure
|
||||
*every* Bazel call to use it. This provides substantial performance
|
||||
improvements at minimal cost.
|
||||
|
||||
We do not allow forked PRs to upload updates to our Bazel caches, but they
|
||||
do use them. Every other event is given read/write access to the caches.
|
||||
Because Bazel behaves poorly under certain environment changes (such as
|
||||
toolchain, operating system), we try to use finely-grained caches. Each job
|
||||
should typically have its own cache to avoid cross-pollution.
|
||||
|
||||
## Bazel repository cache
|
||||
|
||||
When Bazel starts up, it downloads all the external dependencies for a given
|
||||
build and stores them in the repository cache. This cache is *separate* from
|
||||
the remote cache, and only exists locally. Because we have so many Bazel
|
||||
dependencies, this can be a source of frequent flakes due to network issues.
|
||||
|
||||
To avoid this, we keep a cached version of the repository cache in GitHub's
|
||||
action cache. Our full set of repository dependencies ends up being ~300MB,
|
||||
which is fairly expensive given our 10GB maximum. The most expensive ones seem
|
||||
to come from Java, which has some very large downstream dependencies.
|
||||
|
||||
Given the cost, we take a more conservative approach for this cache. Only push
|
||||
events will ever write to this cache, but all events can read from them.
|
||||
Additionally, we only store three caches for any given commit, one per platform.
|
||||
This means that multiple jobs are trying to update the same cache, leading to a
|
||||
race. GitHub rejects all but one of these updates, so we designed the system so
|
||||
that caches are only updated if they've actually changed. That way, over time
|
||||
(and multiple pushes) the repository caches will incrementally grow to encompass
|
||||
all of our dependencies. A scheduled job will run monthly to clear these caches
|
||||
to prevent unbounded growth as our dependencies evolve.
|
||||
|
||||
## ccache
|
||||
|
||||
In order to speed up non-Bazel builds to be on par with Bazel, we make use of
|
||||
[ccache](https://ccache.dev/). This intercepts all calls to the compiler, and
|
||||
caches the result. Subsequent calls with a cache-hit will very quickly
|
||||
short-circuit and return the already computed result. This has minimal affect
|
||||
on any *single* job, since we typically only run a single build. However, by
|
||||
caching the ccache results in GitHub's action cache we can substantially
|
||||
decrease the build time of subsequent runs.
|
||||
|
||||
One useful feature of ccache is that you can set a maximum cache size, and it
|
||||
will automatically prune older results to keep below that limit. On Linux and
|
||||
Mac cmake builds, we generally get 30MB caches and set a 100MB cache limit. On
|
||||
Windows, with debug symbol stripping we get ~70MB and set a 200MB cache limit.
|
||||
|
||||
Because CMake build tend to be our slowest, bottlenecking the entire CI process,
|
||||
we use a fairly expensive strategy with ccache. All events will cache their
|
||||
ccache directory, keyed by the commit and the branch. This means that each
|
||||
PR and each branch will write its own set of caches. When looking up which
|
||||
cache to use initially, each job will first look for a recent cache in its
|
||||
current branch. If it can't find one, it will accept a cache from the base
|
||||
branch (for example, PRs will initially use the latest cache from their target
|
||||
branch).
|
||||
|
||||
While the ccache caches quickly over-run our GitHub action cache, they also
|
||||
quickly become useless. Since GitHub prunes caches based on the time they were
|
||||
last used, this just means that we'll see quicker turnover.
|
||||
|
||||
## Bazelisk
|
||||
|
||||
Bazelisk will automatically download a pinned version of Bazel on first use.
|
||||
This can lead to flakes, and to avoid that we cache the result keyed on the
|
||||
Bazel version. Only push events will write to this cache, but it's unlikely
|
||||
to change very often.
|
||||
|
||||
## Docker images
|
||||
|
||||
Instead of downloading a fresh Docker image for every test run, we can save it
|
||||
as a tar and cache it using `docker image save` and later restore using
|
||||
`docker image load`. This can decrease download times and also reduce flakes.
|
||||
Note, Docker's load can actually be significantly slower than a pull in certain
|
||||
situations. Therefore, we should reserve this strategy for only Docker images
|
||||
that are causing noticeable flakes.
|
||||
|
||||
## Pip dependencies
|
||||
|
||||
The actions/setup-python action we use for Python supports automated caching
|
||||
of pip dependencies. We enable this to avoid having to download these
|
||||
dependencies on every run, which can lead to flakes.
|
||||
|
||||
# Custom actions
|
||||
|
||||
We've defined a number of custom actions to abstract out shared pieces of our
|
||||
workflows.
|
||||
|
||||
- **Bazel** use this for running all Bazel tests. It can take either a single
|
||||
Bazel command or a more general bash command. In the latter case, it provides
|
||||
environment variables for running Bazel with all our standardized settings.
|
||||
|
||||
- **Bazel-Docker** nearly identical to the **Bazel** action, this additionally
|
||||
runs everything in a specified Docker image.
|
||||
|
||||
- **Bash** use this for running non-Bazel tests. It takes a bash command and
|
||||
runs it verbatim. It also handles the regeneration of stale files (which does
|
||||
use Bazel), which non-Bazel tests might depend on.
|
||||
|
||||
- **Docker** nearly identical to the **Bash** action, this additionally runs
|
||||
everything in a specified Docker image.
|
||||
|
||||
- **ccache** this sets up a ccache environment, and initializes some
|
||||
environment variables for standardized usage of ccache.
|
||||
|
||||
- **Cross-compile protoc** this abstracts out the compilation of protoc using
|
||||
our cross-compilation infrastructure. It will set a `PROTOC` environment
|
||||
variable that gets automatically picked up by a lot of our infrastructure.
|
||||
This is most useful in conjunction with the **Bash** action with non-Bazel
|
||||
tests.
|
|
@ -0,0 +1,17 @@
|
|||
This directory contains CI-specific tooling.
|
||||
|
||||
# Clang wrappers
|
||||
|
||||
CMake allows for compiler wrappers to be injected such as ccache, which
|
||||
intercepts compiler calls and short-circuits on cache-hits. This can be done
|
||||
by specifying `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER`
|
||||
during CMake's configure step. Unfortunately, X-Code doesn't provide anything
|
||||
like this, so we use basic wrapper scripts to invoke ccache + clang.
|
||||
|
||||
# Bazelrc files
|
||||
|
||||
In order to allow platform-specific `.bazelrc` flags during testing, we keep
|
||||
3 different versions here along with a shared `common.bazelrc` that they all
|
||||
include. Our GHA infrastructure will select the appropriate file for any test
|
||||
and overwrite the default `.bazelrc` in our workspace, which is intended for
|
||||
development only.
|
Loading…
Reference in New Issue