author | petrcermak <petrcermak@chromium.org> | 2016-03-18 08:52:29 -0700 |
---|---|---|
committer | Commit bot <commit-bot@chromium.org> | 2016-03-18 15:54:51 +0000 |
commit | bff3b1b348e980f6b3c52fc57d6df44ec25b0d02 (patch) | |
tree | b0a5523cc37a2389816128745bc36b8382a71820 /tools/perf | |
parent | e9b221f0cf73fdbd4c06ed968fe347c58b865d15 (diff) | |
Improve perf bot sheriffing documentation
This patch:
1. Aligns the documentation with the Markdown style guide
(https://github.com/google/styleguide/blob/gh-pages/docguide/style.md).
2. Fixes some minor mistakes.
3. Adds more useful information (how to disable a test on a particular
Android device and some extra links).
Review URL: https://codereview.chromium.org/1815173002
Cr-Commit-Position: refs/heads/master@{#381981}
Diffstat (limited to 'tools/perf')
-rw-r--r-- | tools/perf/docs/perf_bot_sheriffing.md | 204 |
1 files changed, 106 insertions, 98 deletions
diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md
index 3f368ee..645d3c7 100644
--- a/tools/perf/docs/perf_bot_sheriffing.md
+++ b/tools/perf/docs/perf_bot_sheriffing.md
@@ -5,9 +5,9 @@ waterfall up and running, and triaging performance test failures and flakes.
 
 ## Key Responsibilities
 
-  * [Handle Device and Bot Failures](#botfailures)
-  * [Handle Test Failures](#testfailures)
-  * [Follow up on failures](#followup)
+*   [Handle Device and Bot Failures](#botfailures)
+*   [Handle Test Failures](#testfailures)
+*   [Follow up on failures](#followup)
 
 ##<a name="waterfallstate"></a> Understanding the Waterfall State
 
@@ -24,18 +24,18 @@ the upstream and downstream views of the waterfall and bots, you can install
 the [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),
 which adds a switching button to Chrome's URL bar.
 
-Note that there are four different views:
+Note that there are three different views:
 
-  1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)
-     makes it easier to see a summary.
-  2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
-     shows more details, including recent changes.
-  3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
-     recent builds. It takes url parameter arguments:
-      * **master** can be chromium.perf, tryserver.chromium.perf
-      * **builder** can be a builder or tester name, like
+1.  [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) makes
+    it easier to see a summary.
+2.  [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
+    shows more details, including recent changes.
+3.  [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
+    recent builds. It takes url parameter arguments:
+    *   **master** can be chromium.perf, tryserver.chromium.perf
+    *   **builder** can be a builder or tester name, like
         "Android Nexus5 Perf (2)"
-      * **start_time** is seconds since the epoch.
+    *   **start_time** is seconds since the epoch.
 
 In addition to watching the waterfall directly,
 [Sheriff-O-Matic](https://sheriff-o-matic.appspot.com/chromium.perf) may
@@ -69,38 +69,39 @@ which step is failing, and paste any relevant info from the logs into the bug.
 
 There are two types of device failures:
 
-1. A device is blacklisted in the `device_status_check` step. You can look at
-   the buildbot status page to see how many devices were listed as online during
-   this step. You should always see 7 devices online. If you see fewer than 7
-   devices online, there is a problem in the lab.
-2. A device is passing `device_status_check` but still in poor health. The
-   symptom of this is that all the tests are failing on it. You can see that on
-   the buildbot status page by looking at the `Device Affinity`. If all tests
-   with the same device affinity number are failing, it's probably a device
-   failure.
-
-For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
+1.  A device is blacklisted in the `device_status_check` step. You can look at
+    the buildbot status page to see how many devices were listed as online
+    during this step. You should always see 7 devices online. If you see fewer
+    than 7 devices online, there is a problem in the lab.
+2.  A device is passing `device_status_check` but still in poor health. The
+    symptom of this is that all the tests are failing on it. You can see that on
+    the buildbot status page by looking at the `Device Affinity`. If all tests
+    with the same device affinity number are failing, it's probably a device
+    failure.
+
+For both types of failures, please file a bug with
+[this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
 which will add an issue to the infra labs queue.
 
 If you need help triaging, here are the common labels you should use:
 
-  * **Performance-BotHealth** should go on all bugs you file about the bots;
-    it's the label we use to track all the issues.
-  * **Infra-Troopers** adds the bug to the trooper queue. This is for high
-    priority issues, like a build breakage. Please add a comment explaining
-    what you want the trooper to do.
-  * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
-    problem, like an android device not responding or a bot that likely needs
-    a restart, please use this label. Make sure you set the **OS-** label
-    correctly as well, and add a comment explaining what you want the labs
-    team to do.
-  * **Infra** label is appropriate for bugs that are not high priority, but we
-    need infra team's help to triage. For example, the buildbot status page
-    UI is weird or we are getting some infra-related log spam. The infra team
-    works to triage these bugs within 24 hours, so you should ping if you do
-    not get a response.
-  * **Cr-Tests-Telemetry** for telemetry failures.
-  * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
+*   **Performance-BotHealth** should go on all bugs you file about the bots;
+    it's the label we use to track all the issues.
+*   **Infra-Troopers** adds the bug to the trooper queue. This is for high
+    priority issues, like a build breakage. Please add a comment explaining what
+    you want the trooper to do.
+*   **Infra-Labs** adds the bug to the labs queue. If there is a hardware
+    problem, like an android device not responding or a bot that likely needs a
+    restart, please use this label. Make sure you set the **OS-** label
+    correctly as well, and add a comment explaining what you want the labs team
+    to do.
+*   **Infra** label is appropriate for bugs that are not high priority, but we
+    need infra team's help to triage. For example, the buildbot status page UI
+    is weird or we are getting some infra-related log spam. The infra team works
+    to triage these bugs within 24 hours, so you should ping if you do not get a
+    response.
+*   **Cr-Tests-Telemetry** for telemetry failures.
+*   **Cr-Tests-AutoBisect** for bisect and perf try job failures.
 
 If you still need help, ask the speed infra chat, or escalate to sullivan@.
 
@@ -109,48 +110,46 @@ If you need help triaging, here are the common labels you should use:
 You want to keep the waterfall green! So any bot that is red or purple needs to
 be investigated. When a test fails:
 
-1. File a bug using
-   [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
-   You'll want to be sure to include:
-  * Link to buildbot status page of failing build.
-  * Copy and paste of relevant failure snippet from the stdio.
-  * CC the test owner from
-    [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
-  * The revision range the test occurred on.
-  * A list of all platforms the test fails on.
-
-2. Disable the failing test if it is failing more than one out of five runs.
-   (see below for instructions on telemetry and other types of tests). Make sure
-   your disable cl includes a BUG= line with the bug from step 1 and the test
-   owner is cc-ed on the bug.
-3. After the disable CL lands, you can downgrade the priority to Pri-2 and
-   ensure that the bug title reflects something like "Fix and re-enable
-   testname".
-4. Investigate the failure. Some tips for investigating:
-  * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
-  * If you suspect a specific CL in the range, you can revert it locally and
-    run the test on the
-    [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
-  * You can run a return code bisect to narrow down the culprit CL:
-    1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
-       on one of the failing platforms.
-    2. Hover over a data point and click the "Bisect" button on the tooltip.
-    3. Type the **Bug ID** from step 1, the **Good Revision** the last commit
-       pos data was received from, the **Bad Revision** the last commit pos
-       and set **Bisect mode** to `return_code`.
-  * On Android and Mac, you can view platform-level screenshots of the device
-    screen for failing tests, links to which are printed in the logs. Often
-    this will immediately reveal failure causes that are opaque from the logs
-    alone. On other platforms, Devtools will produce tab screenshots as long as
-    the tab did not crash.
-
+1.  File a bug using
+    [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
+    You'll want to be sure to include:
+    *   Link to buildbot status page of failing build.
+    *   Copy and paste of relevant failure snippet from the stdio.
+    *   CC the test owner from
+        [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
+    *   The revision range the test occurred on.
+    *   A list of all platforms the test fails on.
+2.  Disable the failing test if it is failing more than one out of five runs.
+    (see below for instructions on telemetry and other types of tests). Make
+    sure your disable cl includes a BUG= line with the bug from step 1 and the
+    test owner is cc-ed on the bug.
+3.  After the disable CL lands, you can downgrade the priority to Pri-2 and
+    ensure that the bug title reflects something like "Fix and re-enable
+    testname".
+4.  Investigate the failure. Some tips for investigating:
+    *   [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
+    *   If you suspect a specific CL in the range, you can revert it locally and
+        run the test on the
+        [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
+    *   You can run a return code bisect to narrow down the culprit CL:
+        1.  Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
+            on one of the failing platforms.
+        2.  Hover over a data point and click the "Bisect" button on the
+            tooltip.
+        3.  Type the **Bug ID** from step 1, the **Good Revision** the last
+            commit pos data was received from, the **Bad Revision** the last
+            commit pos and set **Bisect mode** to `return_code`.
+    *   On Android and Mac, you can view platform-level screenshots of the
+        device screen for failing tests, links to which are printed in the logs.
+        Often this will immediately reveal failure causes that are opaque from
+        the logs alone. On other platforms, Devtools will produce tab
+        screenshots as long as the tab did not crash.
 
 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests
 
 If the test is a telemetry test, its name will have a '.' in it, such as
-`thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the
-first dot will be a python file in [tools/perf/benchmarks](
-https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
+`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
+first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
 
 If a telemetry test is failing and there is no clear culprit to revert
 immediately, disable the test. You can do this with the `@benchmark.Disabled`
@@ -159,28 +158,36 @@ has background on why the test was disabled, and also include a BUG= line in
 the CL.**
 
 Please disable the narrowest set of bots possible; for example, if
-the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.
-Supported disabled arguments include:
-
-  * `win`
-  * `mac`
-  * `chromeos`
-  * `linux`
-  * `android`
-  * `vista`
-  * `win7`
-  * `win8`
-  * `yosemite`
-  * `elcapitan`
-  * `all` (please use as a last resort)
+the benchmark only fails on Windows Vista you can use
+`@benchmark.Disabled('vista')`. Supported disabled arguments include:
+
+*   `win`
+*   `mac`
+*   `chromeos`
+*   `linux`
+*   `android`
+*   `vista`
+*   `win7`
+*   `win8`
+*   `yosemite`
+*   `elcapitan`
+*   `all` (please use as a last resort)
 
 If the test fails consistently in a very narrow set of circumstances, you may
-consider implementing a ShouldDisable method on the benchmark instead.
+consider implementing a `ShouldDisable` method on the benchmark instead.
 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72)
 is and example of disabling a benchmark which OOMs on svelte.
 
-Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**
-submit with NOTRY=true.
+As a last resort, if you need to disable a benchmark on a particular Android
+device, you can do so by checking the return value of
+`possible_browser.platform.GetDeviceTypeName()` in `ShouldDisable`. Here are
+some [examples](https://code.google.com/p/chromium/codesearch#search/&q=ShouldDisable%20GetDeviceTypeName%20lang:py&sq=package:chromium&type=cs)
+of this. The type name of the failing device can be found by searching for the
+value of `ro.product.model` under the `provision_devices` step of the failing
+bot.
+
+Disabling CLs can be TBR-ed to anyone in [tools/perf/OWNERS](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/OWNERS),
+but please do **not** submit with NOTRY=true.
 
 ###<a name="otherfailures"></a> Disabling Other Tests
 
@@ -196,12 +203,13 @@ priority. Pri-0 generally implies an entire waterfall is down.
 
 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
 should be pinged daily, and checked to make sure someone is following up. Pri-1
-bugs are for a red test (not yet disabled), purple bot, or failing device.
+bugs are for a red test (not yet disabled), purple bot, or failing device. Here
+is the [list of Pri-1 bugs that have not been pinged today](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-1%20modified-before:today&sort=modified).
 
 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
 are for disabled tests. These should be pinged weekly, and work towards fixing
 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the
-[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
+[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified).
 
 <!-- Unresolved issues:
 1. Do perf sheriffs watch the bisect waterfall?
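
The per-device disabling described in the patch (checking `possible_browser.platform.GetDeviceTypeName()` inside `ShouldDisable`) might look roughly like the sketch below. It is only an illustration under stated assumptions: `FakePlatform`, `FakePossibleBrowser` and `ExampleBenchmark` are hypothetical stand-ins so the snippet runs without Telemetry, the `ShouldDisable(cls, possible_browser)` classmethod shape is assumed from the linked examples rather than confirmed by this diff, and `'Nexus 5'` is a placeholder for whatever `ro.product.model` value the failing bot reports.

```python
class FakePlatform(object):
    """Hypothetical stand-in for the platform object of a possible browser."""

    def __init__(self, device_type_name):
        self._device_type_name = device_type_name

    def GetDeviceTypeName(self):
        # On a real Android bot this corresponds to the ro.product.model value.
        return self._device_type_name


class FakePossibleBrowser(object):
    """Hypothetical stand-in for the object passed to ShouldDisable."""

    def __init__(self, device_type_name):
        self.platform = FakePlatform(device_type_name)


class ExampleBenchmark(object):
    """Hypothetical benchmark; a real one would derive from Telemetry's base class."""

    # Assumed placeholder: use the model name printed by the failing bot.
    _DISABLED_DEVICE = 'Nexus 5'

    @classmethod
    def ShouldDisable(cls, possible_browser):
        # Disable only on the one device type that is known to fail.
        return (possible_browser.platform.GetDeviceTypeName() ==
                cls._DISABLED_DEVICE)
```

Comparing against a single exact model name keeps the disable as narrow as possible, in line with the document's advice to disable the narrowest set of bots.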