author: petrcermak <petrcermak@chromium.org> 2016-03-18 08:52:29 -0700
committer: Commit bot <commit-bot@chromium.org> 2016-03-18 15:54:51 +0000
commit: bff3b1b348e980f6b3c52fc57d6df44ec25b0d02
tree: b0a5523cc37a2389816128745bc36b8382a71820 /tools/perf
parent: e9b221f0cf73fdbd4c06ed968fe347c58b865d15
Improve perf bot sheriffing documentation
This patch:

1. Aligns the documentation with the Markdown style guide
   (https://github.com/google/styleguide/blob/gh-pages/docguide/style.md).
2. Fixes some minor mistakes.
3. Adds more useful information (how to disable a test on a particular
   Android device and some extra links).

Review URL: https://codereview.chromium.org/1815173002

Cr-Commit-Position: refs/heads/master@{#381981}
Diffstat (limited to 'tools/perf')
-rw-r--r-- tools/perf/docs/perf_bot_sheriffing.md | 204
1 files changed, 106 insertions, 98 deletions
diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md
index 3f368ee..645d3c7 100644
--- a/tools/perf/docs/perf_bot_sheriffing.md
+++ b/tools/perf/docs/perf_bot_sheriffing.md
@@ -5,9 +5,9 @@ waterfall up and running, and triaging performance test failures and flakes.
## Key Responsibilities
- * [Handle Device and Bot Failures](#botfailures)
- * [Handle Test Failures](#testfailures)
- * [Follow up on failures](#followup)
+* [Handle Device and Bot Failures](#botfailures)
+* [Handle Test Failures](#testfailures)
+* [Follow up on failures](#followup)
##<a name="waterfallstate"></a> Understanding the Waterfall State
@@ -24,18 +24,18 @@ the upstream and downstream views of the waterfall and bots, you can install the
[Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),
which adds a switching button to Chrome's URL bar.
-Note that there are four different views:
+Note that there are three different views:
- 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)
- makes it easier to see a summary.
- 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
- shows more details, including recent changes.
- 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
- recent builds. It takes url parameter arguments:
- * **master** can be chromium.perf, tryserver.chromium.perf
- * **builder** can be a builder or tester name, like
+1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) makes
+ it easier to see a summary.
+2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
+ shows more details, including recent changes.
+3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
+ recent builds. It takes url parameter arguments:
+ * **master** can be chromium.perf, tryserver.chromium.perf
+ * **builder** can be a builder or tester name, like
"Android Nexus5 Perf (2)"
- * **start_time** is seconds since the epoch.
+ * **start_time** is seconds since the epoch.
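
The Firefighter parameters listed above can be assembled programmatically. The following is a minimal sketch, not part of the patch: it assumes the parameters are passed as an ordinary query string on the Firefighter root URL, and the helper name `firefighter_url` is hypothetical. Only the parameter names (`master`, `builder`, `start_time`) and the example builder name come from the document.

```python
import calendar
from datetime import datetime
from urllib.parse import urlencode

# Root URL taken from the document; using a plain query string is an assumption.
FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start):
    """Builds a Firefighter URL (hypothetical helper).

    `start` is a naive datetime that is interpreted as UTC.
    """
    params = {
        "master": master,
        "builder": builder,
        # start_time is expressed in seconds since the Unix epoch.
        "start_time": calendar.timegm(start.timetuple()),
    }
    # urlencode escapes spaces and parentheses in the builder name.
    return FIREFIGHTER_URL + "?" + urlencode(params)


url = firefighter_url(
    "chromium.perf", "Android Nexus5 Perf (2)", datetime(2016, 3, 18))
print(url)
```

Using `calendar.timegm` rather than `time.mktime` keeps the conversion independent of the local timezone of the machine running the script.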
In addition to watching the waterfall directly,
[Sheriff-O-Matic](https://sheriff-o-matic.appspot.com/chromium.perf) may
@@ -69,38 +69,39 @@ which step is failing, and paste any relevant info from the logs into the bug.
There are two types of device failures:
-1. A device is blacklisted in the `device_status_check` step. You can look at
- the buildbot status page to see how many devices were listed as online during
- this step. You should always see 7 devices online. If you see fewer than 7
- devices online, there is a problem in the lab.
-2. A device is passing `device_status_check` but still in poor health. The
- symptom of this is that all the tests are failing on it. You can see that on
- the buildbot status page by looking at the `Device Affinity`. If all tests
- with the same device affinity number are failing, it's probably a device
- failure.
-
-For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
+1. A device is blacklisted in the `device_status_check` step. You can look at
+ the buildbot status page to see how many devices were listed as online
+ during this step. You should always see 7 devices online. If you see fewer
+ than 7 devices online, there is a problem in the lab.
+2. A device is passing `device_status_check` but still in poor health. The
+ symptom of this is that all the tests are failing on it. You can see that on
+ the buildbot status page by looking at the `Device Affinity`. If all tests
+ with the same device affinity number are failing, it's probably a device
+ failure.
+
+For both types of failures, please file a bug with
+[this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
which will add an issue to the infra labs queue.
If you need help triaging, here are the common labels you should use:
- * **Performance-BotHealth** should go on all bugs you file about the bots;
- it's the label we use to track all the issues.
- * **Infra-Troopers** adds the bug to the trooper queue. This is for high
- priority issues, like a build breakage. Please add a comment explaining
- what you want the trooper to do.
- * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
- problem, like an android device not responding or a bot that likely needs
- a restart, please use this label. Make sure you set the **OS-** label
- correctly as well, and add a comment explaining what you want the labs
- team to do.
- * **Infra** label is appropriate for bugs that are not high priority, but we
- need infra team's help to triage. For example, the buildbot status page
- UI is weird or we are getting some infra-related log spam. The infra team
- works to triage these bugs within 24 hours, so you should ping if you do
- not get a response.
- * **Cr-Tests-Telemetry** for telemetry failures.
- * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
+* **Performance-BotHealth** should go on all bugs you file about the bots;
+ it's the label we use to track all the issues.
+* **Infra-Troopers** adds the bug to the trooper queue. This is for high
+ priority issues, like a build breakage. Please add a comment explaining what
+ you want the trooper to do.
+* **Infra-Labs** adds the bug to the labs queue. If there is a hardware
+ problem, like an android device not responding or a bot that likely needs a
+ restart, please use this label. Make sure you set the **OS-** label
+ correctly as well, and add a comment explaining what you want the labs team
+ to do.
+* **Infra** label is appropriate for bugs that are not high priority, but we
+ need infra team's help to triage. For example, the buildbot status page UI
+ is weird or we are getting some infra-related log spam. The infra team works
+ to triage these bugs within 24 hours, so you should ping if you do not get a
+ response.
+* **Cr-Tests-Telemetry** for telemetry failures.
+* **Cr-Tests-AutoBisect** for bisect and perf try job failures.
If you still need help, ask the speed infra chat, or escalate to sullivan@.
@@ -109,48 +110,46 @@ If you need help triaging, here are the common labels you should use:
You want to keep the waterfall green! So any bot that is red or purple needs to
be investigated. When a test fails:
-1. File a bug using
- [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
- You'll want to be sure to include:
- * Link to buildbot status page of failing build.
- * Copy and paste of relevant failure snippet from the stdio.
- * CC the test owner from
- [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
- * The revision range the test occurred on.
- * A list of all platforms the test fails on.
-
-2. Disable the failing test if it is failing more than one out of five runs.
- (see below for instructions on telemetry and other types of tests). Make sure
- your disable cl includes a BUG= line with the bug from step 1 and the test
- owner is cc-ed on the bug.
-3. After the disable CL lands, you can downgrade the priority to Pri-2 and
- ensure that the bug title reflects something like "Fix and re-enable
- testname".
-4. Investigate the failure. Some tips for investigating:
- * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
- * If you suspect a specific CL in the range, you can revert it locally and
- run the test on the
- [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
- * You can run a return code bisect to narrow down the culprit CL:
- 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
- on one of the failing platforms.
- 2. Hover over a data point and click the "Bisect" button on the tooltip.
- 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit
- pos data was received from, the **Bad Revision** the last commit pos
- and set **Bisect mode** to `return_code`.
- * On Android and Mac, you can view platform-level screenshots of the device
- screen for failing tests, links to which are printed in the logs. Often
- this will immediately reveal failure causes that are opaque from the logs
- alone. On other platforms, Devtools will produce tab screenshots as long as
- the tab did not crash.
-
+1. File a bug using
+ [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
+ You'll want to be sure to include:
+ * Link to buildbot status page of failing build.
+ * Copy and paste of relevant failure snippet from the stdio.
+ * CC the test owner from
+ [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
+ * The revision range the test occurred on.
+ * A list of all platforms the test fails on.
+2. Disable the failing test if it is failing more than one out of five runs.
+ (see below for instructions on telemetry and other types of tests). Make
+ sure your disable cl includes a BUG= line with the bug from step 1 and the
+ test owner is cc-ed on the bug.
+3. After the disable CL lands, you can downgrade the priority to Pri-2 and
+ ensure that the bug title reflects something like "Fix and re-enable
+ testname".
+4. Investigate the failure. Some tips for investigating:
+ * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
+ * If you suspect a specific CL in the range, you can revert it locally and
+ run the test on the
+ [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
+ * You can run a return code bisect to narrow down the culprit CL:
+ 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
+ on one of the failing platforms.
+ 2. Hover over a data point and click the "Bisect" button on the
+ tooltip.
+ 3. Type the **Bug ID** from step 1, the **Good Revision** the last
+ commit pos data was received from, the **Bad Revision** the last
+ commit pos and set **Bisect mode** to `return_code`.
+ * On Android and Mac, you can view platform-level screenshots of the
+ device screen for failing tests, links to which are printed in the logs.
+ Often this will immediately reveal failure causes that are opaque from
+ the logs alone. On other platforms, Devtools will produce tab
+ screenshots as long as the tab did not crash.
###<a name="telemetryfailures"></a> Disabling Telemetry Tests
If the test is a telemetry test, its name will have a '.' in it, such as
-`thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the
-first dot will be a python file in [tools/perf/benchmarks](
-https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
+`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
+first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
@@ -159,28 +158,36 @@ has background on why the test was disabled, and also include a BUG= line in
the CL.**
Please disable the narrowest set of bots possible; for example, if
-the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.
-Supported disabled arguments include:
-
- * `win`
- * `mac`
- * `chromeos`
- * `linux`
- * `android`
- * `vista`
- * `win7`
- * `win8`
- * `yosemite`
- * `elcapitan`
- * `all` (please use as a last resort)
+the benchmark only fails on Windows Vista you can use
+`@benchmark.Disabled('vista')`. Supported disabled arguments include:
+
+* `win`
+* `mac`
+* `chromeos`
+* `linux`
+* `android`
+* `vista`
+* `win7`
+* `win8`
+* `yosemite`
+* `elcapitan`
+* `all` (please use as a last resort)
If the test fails consistently in a very narrow set of circumstances, you may
-consider implementing a ShouldDisable method on the benchmark instead.
+consider implementing a `ShouldDisable` method on the benchmark instead.
[Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is
an example of disabling a benchmark which OOMs on svelte.
-Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**
-submit with NOTRY=true.
+As a last resort, if you need to disable a benchmark on a particular Android
+device, you can do so by checking the return value of
+`possible_browser.platform.GetDeviceTypeName()` in `ShouldDisable`. Here are
+some [examples](https://code.google.com/p/chromium/codesearch#search/&q=ShouldDisable%20GetDeviceTypeName%20lang:py&sq=package:chromium&type=cs)
+of this. The type name of the failing device can be found by searching for the
+value of `ro.product.model` under the `provision_devices` step of the failing
+bot.
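
The device-specific check described above might look roughly like the sketch below. It is illustrative only: `FakePlatform`, `FakePossibleBrowser`, `ExampleBenchmark`, and the device name are hypothetical stand-ins so that the snippet runs outside a Chromium checkout, and the `ShouldDisable(cls, possible_browser)` classmethod signature is an assumption. Only `ShouldDisable` and `possible_browser.platform.GetDeviceTypeName()` are named in the document; the linked code search examples show real usages.

```python
class FakePlatform:
    """Hypothetical stand-in for the platform object telemetry provides."""

    def __init__(self, device_type_name):
        self._device_type_name = device_type_name

    def GetDeviceTypeName(self):
        return self._device_type_name


class FakePossibleBrowser:
    """Hypothetical stand-in for the possible_browser argument."""

    def __init__(self, device_type_name):
        self.platform = FakePlatform(device_type_name)


class ExampleBenchmark:
    """Hypothetical benchmark; a real one would subclass a telemetry class."""

    # Illustrative value; use the ro.product.model string reported under the
    # provision_devices step of the failing bot.
    _DISABLED_DEVICE = "Nexus 5X"

    @classmethod
    def ShouldDisable(cls, possible_browser):
        # Disable only on the affected device type, leaving other Android
        # devices enabled.
        device = possible_browser.platform.GetDeviceTypeName()
        return device == cls._DISABLED_DEVICE


print(ExampleBenchmark.ShouldDisable(FakePossibleBrowser("Nexus 5X")))
print(ExampleBenchmark.ShouldDisable(FakePossibleBrowser("Nexus 5")))
```

Comparing against the exact model string keeps the disable as narrow as possible, in line with the advice to disable the narrowest set of bots.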
+
+Disabling CLs can be TBR-ed to anyone in [tools/perf/OWNERS](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/OWNERS),
+but please do **not** submit with NOTRY=true.
###<a name="otherfailures"></a> Disabling Other Tests
@@ -196,12 +203,13 @@ priority. Pri-0 generally implies an entire waterfall is down.
**[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
should be pinged daily, and checked to make sure someone is following up. Pri-1
-bugs are for a red test (not yet disabled), purple bot, or failing device.
+bugs are for a red test (not yet disabled), purple bot, or failing device. Here
+is the [list of Pri-1 bugs that have not been pinged today](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-1%20modified-before:today&sort=modified).
**[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
are for disabled tests. These should be pinged weekly, and work towards fixing
should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the
-[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
+[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified).
<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?