Improve perf bot sheriffing documentation

This patch:

1. Aligns the documentation with the Markdown style guide
   (https://github.com/google/styleguide/blob/gh-pages/docguide/style.md).
2. Fixes some minor mistakes.
3. Adds more useful information (how to disable a test on a particular
   Android device and some extra links).

Review URL: https://codereview.chromium.org/1815173002
Cr-Commit-Position: refs/heads/master@{#381981}
author: petrcermak <petrcermak@chromium.org> 2016-03-18 08:52:29 -0700
committer: Commit bot <commit-bot@chromium.org> 2016-03-18 15:54:51 +0000
commit: bff3b1b348e980f6b3c52fc57d6df44ec25b0d02 (patch)
tree: b0a5523cc37a2389816128745bc36b8382a71820 /tools/perf
parent: e9b221f0cf73fdbd4c06ed968fe347c58b865d15 (diff)
1 files changed, 106 insertions, 98 deletions
diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md
index 3f368ee..645d3c7 100644
--- a/tools/perf/docs/perf_bot_sheriffing.md
+++ b/tools/perf/docs/perf_bot_sheriffing.md
@@ -5,9 +5,9 @@ waterfall up and running, and triaging performance test failures and flakes.
 
 ## Key Responsibilities
 
-  * [Handle Device and Bot Failures](#botfailures)
-  * [Handle Test Failures](#testfailures)
-  * [Follow up on failures](#followup)
+*   [Handle Device and Bot Failures](#botfailures)
+*   [Handle Test Failures](#testfailures)
+*   [Follow up on failures](#followup)
 
 ##<a name="waterfallstate"></a> Understanding the Waterfall State
 
@@ -24,18 +24,18 @@ the upstream and downstream views of the waterfall and bots, you can install the
 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),
 which adds a switching button to Chrome's URL bar.
 
-Note that there are four different views:
+Note that there are three different views:
 
-   1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)
-      makes it easier to see a summary.
-   2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
-      shows more details, including recent changes.
-   3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
-      recent builds. It takes url parameter arguments:
-      * **master** can be chromium.perf, tryserver.chromium.perf
-      * **builder** can be a builder or tester name, like
+1.  [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) makes
+    it easier to see a summary.
+2.  [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
+    shows more details, including recent changes.
+3.  [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
+    recent builds. It takes url parameter arguments:
+    *   **master** can be chromium.perf, tryserver.chromium.perf
+    *   **builder** can be a builder or tester name, like
         "Android Nexus5 Perf (2)"
-      * **start_time** is seconds since the epoch.
+    *   **start_time** is seconds since the epoch.
 
 In addition to watching the waterfall directly,
 [Sheriff-O-Matic](https://sheriff-o-matic.appspot.com/chromium.perf) may
@@ -69,38 +69,39 @@ which step is failing, and paste any relevant info from the logs into the bug.
 
 There are two types of device failures:
 
-1. A device is blacklisted in the `device_status_check` step. You can look at
-   the buildbot status page to see how many devices were listed as online during
-   this step. You should always see 7 devices online. If you see fewer than 7
-   devices online, there is a problem in the lab.
-2. A device is passing `device_status_check` but still in poor health. The
-   symptom of this is that all the tests are failing on it. You can see that on
-   the buildbot status page by looking at the `Device Affinity`. If all tests
-   with the same device affinity number are failing, it's probably a device
-   failure.
-
-For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
+1.  A device is blacklisted in the `device_status_check` step. You can look at
+    the buildbot status page to see how many devices were listed as online
+    during this step. You should always see 7 devices online. If you see fewer
+    than 7 devices online, there is a problem in the lab.
+2.  A device is passing `device_status_check` but still in poor health. The
+    symptom of this is that all the tests are failing on it. You can see that on
+    the buildbot status page by looking at the `Device Affinity`. If all tests
+    with the same device affinity number are failing, it's probably a device
+    failure.
+
+For both types of failures, please file a bug with
+[this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
 which will add an issue to the infra labs queue.
 
 If you need help triaging, here are the common labels you should use:
 
-   * **Performance-BotHealth** should go on all bugs you file about the bots;
-     it's the label we use to track all the issues.
-   * **Infra-Troopers** adds the bug to the trooper queue. This is for high
-     priority issues, like a build breakage. Please add a comment explaining
-     what you want the trooper to do.
-   * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
-     problem, like an android device not responding or a bot that likely needs
-     a restart, please use this label. Make sure you set the **OS-** label
-     correctly as well, and add a comment explaining what you want the labs
-     team to do.
-   * **Infra** label is appropriate for bugs that are not high priority, but we
-     need infra team's help to triage. For example, the buildbot status page
-     UI is weird or we are getting some infra-related log spam. The infra team
-     works to triage these bugs within 24 hours, so you should ping if you do
-     not get a response.
-   * **Cr-Tests-Telemetry** for telemetry failures.
-   * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
+*   **Performance-BotHealth** should go on all bugs you file about the bots;
+    it's the label we use to track all the issues.
+*   **Infra-Troopers** adds the bug to the trooper queue. This is for high
+    priority issues, like a build breakage. Please add a comment explaining what
+    you want the trooper to do.
+*   **Infra-Labs** adds the bug to the labs queue. If there is a hardware
+    problem, like an android device not responding or a bot that likely needs a
+    restart, please use this label. Make sure you set the **OS-** label
+    correctly as well, and add a comment explaining what you want the labs team
+    to do.
+*   **Infra** label is appropriate for bugs that are not high priority, but we
+    need infra team's help to triage. For example, the buildbot status page UI
+    is weird or we are getting some infra-related log spam. The infra team works
+    to triage these bugs within 24 hours, so you should ping if you do not get a
+    response.
+*   **Cr-Tests-Telemetry** for telemetry failures.
+*   **Cr-Tests-AutoBisect** for bisect and perf try job failures.
 
  If you still need help, ask the speed infra chat, or escalate to sullivan@.
 
@@ -109,48 +110,46 @@ If you need help triaging, here are the common labels you should use:
 You want to keep the waterfall green! So any bot that is red or purple needs to
 be investigated. When a test fails:
 
-1. File a bug using
-   [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
-   You'll want to be sure to include:
-   * Link to buildbot status page of failing build.
-   * Copy and paste of relevant failure snippet from the stdio.
-   * CC the test owner from
-     [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
-   * The revision range the test occurred on.
-   * A list of all platforms the test fails on.
-
-2. Disable the failing test if it is failing more than one out of five runs.
-   (see below for instructions on telemetry and other types of tests). Make sure
-   your disable cl includes a BUG= line with the bug from step 1 and the test
-   owner is cc-ed on the bug.
-3. After the disable CL lands, you can downgrade the priority to Pri-2 and
-   ensure that the bug title reflects something like "Fix and re-enable
-   testname".
-4. Investigate the failure. Some tips for investigating:
-   * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
-   * If you suspect a specific CL in the range, you can revert it locally and
-     run the test on the
-     [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
-   * You can run a return code bisect to narrow down the culprit CL:
-      1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
-         on one of the failing platforms.
-      2. Hover over a data point and click the "Bisect" button on the tooltip.
-      3. Type the **Bug ID** from step 1, the **Good Revision** the last commit
-         pos data was received from, the **Bad Revision** the last commit pos
-         and set **Bisect mode** to `return_code`.
-   * On Android and Mac, you can view platform-level screenshots of the device
-     screen for failing tests, links to which are printed in the logs. Often
-     this will immediately reveal failure causes that are opaque from the logs
-     alone. On other platforms, Devtools will produce tab screenshots as long as
-     the tab did not crash.
-
+1.  File a bug using
+    [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
+    You'll want to be sure to include:
+    *   Link to buildbot status page of failing build.
+    *   Copy and paste of relevant failure snippet from the stdio.
+    *   CC the test owner from
+        [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
+    *   The revision range the test occurred on.
+    *   A list of all platforms the test fails on.
+2.  Disable the failing test if it is failing more than one out of five runs.
+    (see below for instructions on telemetry and other types of tests). Make
+    sure your disable cl includes a BUG= line with the bug from step 1 and the
+    test owner is cc-ed on the bug.
+3.  After the disable CL lands, you can downgrade the priority to Pri-2 and
+    ensure that the bug title reflects something like "Fix and re-enable
+    testname".
+4.  Investigate the failure. Some tips for investigating:
+    *   [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
+    *   If you suspect a specific CL in the range, you can revert it locally and
+        run the test on the
+        [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).
+    *   You can run a return code bisect to narrow down the culprit CL:
+        1.  Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)
+            on one of the failing platforms.
+        2.  Hover over a data point and click the "Bisect" button on the
+            tooltip.
+        3.  Type the **Bug ID** from step 1, the **Good Revision** the last
+            commit pos data was received from, the **Bad Revision** the last
+            commit pos and set **Bisect mode** to `return_code`.
+    *   On Android and Mac, you can view platform-level screenshots of the
+        device screen for failing tests, links to which are printed in the logs.
+        Often this will immediately reveal failure causes that are opaque from
+        the logs alone. On other platforms, Devtools will produce tab
+        screenshots as long as the tab did not crash.
 
 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests
 
 If the test is a telemetry test, its name will have a '.' in it, such as
-`thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the
-first dot will be a python file in [tools/perf/benchmarks](
-https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
+`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
+first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
 
 If a telemetry test is failing and there is no clear culprit to revert
 immediately, disable the test. You can do this with the `@benchmark.Disabled`
@@ -159,28 +158,36 @@ has background on why the test was disabled, and also include a BUG= line in
 the CL.**
 
 Please disable the narrowest set of bots possible; for example, if
-the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.
-Supported disabled arguments include:
-
-   * `win`
-   * `mac`
-   * `chromeos`
-   * `linux`
-   * `android`
-   * `vista`
-   * `win7`
-   * `win8`
-   * `yosemite`
-   * `elcapitan`
-   * `all` (please use as a last resort)
+the benchmark only fails on Windows Vista you can use
+`@benchmark.Disabled('vista')`. Supported disabled arguments include:
+
+*   `win`
+*   `mac`
+*   `chromeos`
+*   `linux`
+*   `android`
+*   `vista`
+*   `win7`
+*   `win8`
+*   `yosemite`
+*   `elcapitan`
+*   `all` (please use as a last resort)
 
 If the test fails consistently in a very narrow set of circumstances, you may
-consider implementing a ShouldDisable method on the benchmark instead.
+consider implementing a `ShouldDisable` method on the benchmark instead.
 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is
 and example of disabling a benchmark which OOMs on svelte.
 
-Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**
-submit with NOTRY=true.
+As a last resort, if you need to disable a benchmark on a particular Android
+device, you can do so by checking the return value of
+`possible_browser.platform.GetDeviceTypeName()` in `ShouldDisable`. Here are
+some [examples](https://code.google.com/p/chromium/codesearch#search/&q=ShouldDisable%20GetDeviceTypeName%20lang:py&sq=package:chromium&type=cs)
+of this. The type name of the failing device can be found by searching for the
+value of `ro.product.model` under the `provision_devices` step of the failing
+bot.
+
+Disabling CLs can be TBR-ed to anyone in [tools/perf/OWNERS](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/OWNERS),
+but please do **not** submit with NOTRY=true.
 
 ###<a name="otherfailures"></a> Disabling Other Tests
 
@@ -196,12 +203,13 @@ priority. Pri-0 generally implies an entire waterfall is down.
 
 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
 should be pinged daily, and checked to make sure someone is following up. Pri-1
-bugs are for a red test (not yet disabled), purple bot, or failing device.
+bugs are for a red test (not yet disabled), purple bot, or failing device. Here
+is the [list of Pri-1 bugs that have not been pinged today](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-1%20modified-before:today&sort=modified).
 
 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
 are for disabled tests. These should be pinged weekly, and work towards fixing
 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the
-[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
+[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified).
 
 <!-- Unresolved issues:
 1. Do perf sheriffs watch the bisect waterfall?
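
The device-specific disabling that this patch documents (checking `possible_browser.platform.GetDeviceTypeName()` inside `ShouldDisable`) might look roughly like the sketch below. It is not taken from the Chromium tree: `FakePlatform`, `FakePossibleBrowser`, `ExampleBenchmark`, and the device name `'Nexus 5'` are hypothetical stand-ins so that the snippet runs without a checkout, and the classmethod form of `ShouldDisable` is an assumption about how the hook is declared.

```python
# Illustrative stand-ins only: these classes mimic the shape of the objects
# named in the patch so the sketch runs without a Chromium checkout.


class FakePlatform(object):
    """Hypothetical stub for possible_browser.platform."""

    def __init__(self, device_type_name):
        self._device_type_name = device_type_name

    def GetDeviceTypeName(self):
        # Per the patched doc, the real value can be looked up as
        # ro.product.model under the provision_devices step.
        return self._device_type_name


class FakePossibleBrowser(object):
    """Hypothetical stub for the possible_browser argument."""

    def __init__(self, device_type_name):
        self.platform = FakePlatform(device_type_name)


class ExampleBenchmark(object):
    """Hypothetical benchmark; a real one would subclass a Telemetry class."""

    # Assumed device name, for illustration only.
    _DISABLED_DEVICE_TYPES = frozenset(['Nexus 5'])

    @classmethod
    def ShouldDisable(cls, possible_browser):
        # Disable the benchmark only on the listed device models.
        return (possible_browser.platform.GetDeviceTypeName()
                in cls._DISABLED_DEVICE_TYPES)
```

With these stubs, `ExampleBenchmark.ShouldDisable(FakePossibleBrowser('Nexus 5'))` evaluates to true and any other device name evaluates to false, which keeps the disable as narrow as the doc recommends.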