page.title=SMP Primer for Android
page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
  <li><a href="#theory">Theory</a>
    <ol class="nolist">
      <li style="margin: 3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#proc_consistency">Processor consistency</a></li>
          <li style="margin:0"><a href="#cpu_cache">CPU cache behavior</a></li>
          <li style="margin:0"><a href="#observability">Observability</a></li>
          <li style="margin:0"><a href="#ordering">ARM’s weak ordering</a></li>
        </ol>
      </li>
      <li style="margin:3px 0 0"><a href="#datamem_barriers">Data memory barriers</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#ss_ll">Store/store and load/load</a></li>
          <li style="margin:0"><a href="#ls_sl">Load/store and store/load</a></li>
          <li style="margin:0"><a href="#barrier_inst">Barrier instructions</a></li>
          <li style="margin:0"><a href="#addr_dep">Address dependencies and causal consistency</a></li>
          <li style="margin:0"><a href="#membarrier_summry">Memory barrier summary</a></li>
        </ol>
      </li>
      <li style="margin:3px 0 0"><a href="#atomic_ops">Atomic operations</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#atomic_essentials">Atomic essentials</a></li>
          <li style="margin:0"><a href="#atomic_barrierpairing">Atomic + barrier pairing</a></li>
          <li style="margin:0"><a href="#acq_rel">Acquire and release</a></li>
        </ol>
      </li>
    </ol>
  </li>
  <li><a href="#practice">Practice</a>
    <ol class="nolist">
      <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
          <li style="margin:0"><a href="#examplesc">Examples</a></li>
        </ol>
      </li>
      <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
          <li style="margin:0"><a href="#examplesj">Examples</a></li>
        </ol>
      </li>
      <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
        <ol class="nolist">
          <li style="margin:0"><a href="#advice">General advice</a></li>
          <li style="margin:0"><a href="#sync_guarantees">Synchronization primitive guarantees</a></li>
          <li style="margin:0"><a href="#ccpp_changes">Upcoming changes to C/C++</a></li>
        </ol>
      </li>
    </ol>
  </li>
  <li><a href="#closing_notes">Closing Notes</a></li>
  <li><a href="#appendix">Appendix</a>
    <ol class="nolist">
      <li style="margin:0"><a href="#smp_failure_example">SMP failure example</a></li>
      <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
      <li style="margin:0"><a href="#more">Further reading</a></li>
    </ol>
  </li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java
programming language (hereafter referred to simply as “Java” for the sake of
brevity). It's intended as a primer for Android app developers, not as a complete 
discussion on the subject. The focus is on the ARM CPU architecture.</p>

<p>If you’re in a hurry, you can skip the <a href="#theory">Theory</a> section
and go directly to <a href="#practice">Practice</a> for best practices, but this
is not recommended.</p>


<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”.  It describes a design in
which two or more identical CPU cores share access to main memory.  Until
a few years ago, all Android devices were UP (Uni-Processor).</p>

<p>Most &mdash; if not all &mdash; Android devices do have multiple CPUs, but generally one
of them is used to run applications while others manage various bits of device
hardware (for example, the radio).  The CPUs may have different architectures, and the
programs running on them can’t use main memory to communicate with each
other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers.  The sorts of race
conditions you might encounter in a multi-threaded program are much worse on SMP
when two or more of your threads are running simultaneously on different cores.
What’s more, SMP on ARM is more challenging to work with than SMP on x86.  Code
that has been thoroughly tested on x86 may break badly on ARM.</p>

<p>The rest of this document will explain why, and tell you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Theory</h2>

<p>This is a high-speed, glossy overview of a complex subject.  Some areas will
be incomplete, but none of it should be misleading or wrong.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<h3 id="mem_consistency">Memory consistency models</h3>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the hardware architecture makes about memory accesses.  For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time</li>
<li>All operations on a single processor appear to execute in the order described
by that processor's program.</li>
</ul>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order.  It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner.  (We’re ignoring
memory-mapped device driver I/O for the moment.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>.  These are assumed to execute in
<em>program order</em>, that is, the order in which the instructions appear here is
the order in which the CPU will execute them.  We don’t want to consider
instruction reordering performed by compilers just yet.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”.  All memory is
initially zero.  Instructions are executed from top to bottom.  Here, thread 1
stores the value 3 at location A, and then the value 5 at location B.  Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1.  (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores.  You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>


<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order.  On a
sequentially-consistent machine, that can’t happen.</p>

<p>Most uni-processors, including x86 and ARM, are sequentially consistent.
Most SMP systems, including x86 and ARM, are not.</p>

<h4 id="proc_consistency">Processor consistency</h4>

<p>x86 SMP provides <em>processor consistency</em>, which is slightly weaker than
sequential.  While the architecture guarantees that loads are not reordered with
respect to other loads, and stores are not reordered with respect to other
stores, it does not guarantee that a store followed by a load will be observed
in the expected order.</p>

<p>Consider the following example, which is a piece of Dekker’s Algorithm for
mutual exclusion:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = true<br />
reg1 = B<br />
if (reg1 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
<td><code>B = true<br />
reg2 = A<br />
if (reg2 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
</tr>
</table>

<p>The idea is that thread 1 uses A to indicate that it’s busy, and thread 2
uses B.  Thread 1 sets A and then checks to see if B is set; if not, it can
safely assume that it has exclusive access to the critical section.  Thread 2
does something similar.  (If a thread discovers that both A and B are set, a
turn-taking algorithm is used to ensure fairness.)</p>

<p>On a sequentially-consistent machine, this works correctly.  On x86 and ARM
SMP, the store to A and the load from B in thread 1 can be “observed” in a
different order by thread 2.  If that happened, we could actually appear to
execute this sequence (where blank lines have been inserted to highlight the
apparent order of operations):</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg1 = B<br />
<br />
<br />
A = true<br />
if (reg1 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>

<td><code><br />
B = true<br />
reg2 = A<br />
<br />
if (reg2 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
</tr>
</table>

<p>This results in both reg1 and reg2 set to “false”, allowing the threads to
execute code in the critical section simultaneously.  To understand how this can
happen, it’s useful to know a little about CPU caches.</p>

<h4 id="cpu_cache">CPU cache behavior</h4>

<p>This is a substantial topic in and of itself.  An extremely brief overview
follows.  (The motivation for this material is to provide some basis for
understanding why SMP systems behave as they do.)</p>

<p>Modern CPUs have one or more caches between the processor and main memory.
These are labeled L1, L2, and so on, with the higher numbers being successively
“farther” from the CPU.  Cache memory adds size and cost to the hardware, and
increases power consumption, so the ARM CPUs used in Android devices typically
have small L1 caches and little or no L2/L3.</p>

<p>Loading or storing a value into the L1 cache is very fast.  Doing the same to
main memory can be 10-100x slower.  The CPU will therefore try to operate out of
the cache as much as possible.  The <em>write policy</em> of a cache determines when data
written to it is forwarded to main memory.  A <em>write-through</em> cache will initiate
a write to memory immediately, while a <em>write-back</em> cache will wait until it runs
out of space and has to evict some entries.  In either case, the CPU will
continue executing instructions past the one that did the store, possibly
executing dozens of them before the write is visible in main memory.  (While the
write-through cache has a policy of immediately forwarding the data to main
memory, it only <strong>initiates</strong> the write.  It does not have to wait
for it to finish.)</p>

<p>The cache behavior becomes relevant to this discussion when each CPU core has
its own private cache.  In a simple model, the caches have no way to interact
with each other directly.  The values held by core #1’s cache are not shared
with or visible to core #2’s cache except as loads or stores from main memory.
The long latencies on memory accesses would make inter-thread interactions
sluggish, so it’s useful to define a way for the caches to share data.  This
sharing is called <em>cache coherency</em>, and the coherency rules are defined
by the CPU architecture’s <em>cache consistency model</em>.</p>

<p>With that in mind, let’s return to the Dekker example.  When core 1 executes
“A = true”, the value gets stored in core 1’s cache.  When core 2 loads A into
reg2, it might read from main memory or it might read from core 2’s cache;
either way it won’t see the store performed by core 1.  (“A” could be in core
2’s cache because of a previous load from “A”.)</p>

<p>For the memory consistency model to be sequentially consistent, core 1 would
have to wait for all other cores to be aware of “A = true” before it could
execute its load of B (either through strict cache coherency rules, or by disabling the
caches entirely so everything operates out of main memory).  This would impose a
performance penalty on every store operation.  Relaxing the rules for the
ordering of stores followed by loads improves performance but imposes a burden
on software developers.</p>

<p>The other guarantees made by the processor consistency model are less
expensive to make.  For example, to ensure that memory writes are not observed
out of order, it just needs to ensure that the stores are published to other
cores in the same order that they were issued.  It doesn’t need to wait for
store #1 to <strong>finish</strong> being published before it can start on store
#2, it just needs to ensure that it doesn’t finish publishing #2 before it
finishes publishing #1.  This avoids a performance bubble.</p>

<p>Relaxing the guarantees even further can provide additional opportunities for
CPU optimization, but creates more opportunities for code to behave in ways the
programmer didn’t expect.</p>

<p>One additional note: CPU caches don’t operate on individual bytes.  Data is
read or written as <em>cache lines</em>; for many ARM CPUs these are 32 bytes.  If you
read data from a location in main memory, you will also be reading some adjacent
values.  Writing data will cause the cache line to be read from memory and
updated.  As a result, you can cause a value to be loaded into cache as a
side-effect of reading or writing something nearby, adding to the general aura
of mystery.</p>

<h4 id="observability">Observability</h4>

<p>Before going further, it’s useful to define in a more rigorous fashion what
is meant by “observing” a load or store.  Suppose core 1 executes “A = 1”.  The
store is <em>initiated</em> when the CPU executes the instruction.  At some
point later, possibly through cache coherence activity, the store is
<em>observed</em> by core 2.  In a write-through cache it doesn’t really
<em>complete</em> until the store arrives in main memory, but the memory
consistency model doesn’t dictate when something completes, just when it can be
<em>observed</em>.</p>


<p>(In a kernel device driver that accesses memory-mapped I/O locations, it may
be very important to know when things actually complete.  We’re not going to go
into that here.)</p>

<p>Observability may be defined as follows:</p>

<ul>
<li>"A write to a location in memory is said to be observed by an observer Pn
when a subsequent read of the location by Pn would return the value written by
the write."</li>
<li>"A read of a location in memory is said to be observed by an observer Pm
when a subsequent write to the location by Pm would have no effect on the value
returned by the read." <span style="font-size:.9em;color:#777">(<em><a
href="#more" style="color:#777">Reasoning about the ARM weakly consistent memory
model</a></em>)</span></li>
</ul>


<p>A less formal way to describe it (where “you” and “I” are CPU cores) would be:</p>

<ul>
<li>I have observed your write when I can read what you wrote</li>
<li>I have observed your read when I can no longer affect the value you read</li>
</ul>

<p>The notion of observing a write is intuitive; observing a read is a bit less
so (don’t worry, it grows on you).</p>

<p>With this in mind, we’re ready to talk about ARM.</p>

<h4 id="ordering">ARM's weak ordering</h4>

<p>ARM SMP provides weak memory consistency guarantees.  It does not guarantee that
loads or stores are ordered with respect to each other.</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 41<br />
B = 1    // “A is ready”</code></td>
<td><code>loop_until (B == 1)<br />
reg = A</code></td>
</tr>
</table>

<p>Recall that all addresses are initially zero.  The “loop_until” instruction
reads B repeatedly, looping until we read 1 from B.  The idea here is that
thread 2 is waiting for thread 1 to update A.  Thread 1 sets A, and then sets B
to 1 to indicate data availability.</p>

<p>On x86 SMP, this is guaranteed to work.  Thread 2 will observe the stores
made by thread 1 in program order, and thread 1 will observe thread 2’s loads in
program order.</p>

<p>On ARM SMP, the loads and stores can be observed in any order.  It is
possible, after all the code has executed, for reg to hold 0.  It’s also
possible for it to hold 41.  Unless you explicitly define the ordering, you
don’t know how this will come out.</p>

<p>(For those with experience on other systems, ARM’s memory model is equivalent
to PowerPC in most respects.)</p>


<h3 id="datamem_barriers">Data memory barriers</h3>

<p>Memory barriers provide a way for your code to tell the CPU that memory
access ordering matters.  ARM/x86 uniprocessors offer sequential consistency,
and thus have no need for them.  (The barrier instructions can be executed but
aren’t useful; in at least one case they’re hideously expensive, motivating
separate builds for SMP targets.)</p>

<p>There are four basic situations to consider:</p>

<ol>
<li>store followed by another store</li>
<li>load followed by another load</li>
<li>load followed by store</li>
<li>store followed by load</li>
</ol>

<h4 id="ss_ll">Store/store and load/load</h4>

<p>Recall our earlier example:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 41<br />
B = 1    // “A is ready”</code></td>
<td><code>loop_until (B == 1)<br />
reg = A</code></td>
</tr>
</table>


<p>Thread 1 needs to ensure that the store to A happens before the store to B.
This is a “store/store” situation.  Similarly, thread 2 needs to ensure that the
load of B happens before the load of A; this is a load/load situation.  As
mentioned earlier, the loads and stores can be observed in any order.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>Going back to the cache discussion, assume A and B are on separate cache
lines, with minimal cache coherency.  If the store to A stays local but the
store to B is published, core 2 will see B=1 but won’t see the update to A.  On
the other side, assume we read A earlier, or it lives on the same cache line as
something else we recently read.  Core 2 spins until it sees the update to B,
then loads A from its local cache, where the value is still zero.</p>
</div>
</div>

<p>We can fix it like this:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 41<br />
<em>store/store barrier</em><br />
B = 1    // “A is ready”</code></td>
<td><code>loop_until (B == 1)<br />
<em>load/load barrier</em><br />
reg = A</code></td>
</tr>
</table>

<p>The store/store barrier guarantees that <strong>all observers</strong> will
observe the write to A before they observe the write to B.  It makes no
guarantees about the ordering of loads in thread 1, but we don’t have any of
those, so that’s okay.  The load/load barrier in thread 2 makes a similar
guarantee for the loads there.</p>

<p>Since the store/store barrier guarantees that thread 2 observes the stores in
program order, why do we need the load/load barrier in thread 2?  Because we
also need to guarantee that thread 1 observes the loads in program order.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>The store/store barrier could work by flushing all
dirty entries out of the local cache, ensuring that other cores see them before
they see any future stores.  The load/load barrier could purge the local cache
completely and wait for any “in-flight” loads to finish, ensuring that future
loads are observed after previous loads.  What the CPU actually does doesn’t
matter, so long as the appropriate guarantees are kept.  If we use a barrier in
core 1 but not in core 2, core 2 could still be reading A from its local
cache.</p>
</div>
</div>

<p>Because the architectures have different memory models, these barriers are
required on ARM SMP but not x86 SMP.</p>

<h4 id="ls_sl">Load/store and store/load</h4>

<p>The Dekker’s Algorithm fragment shown earlier illustrated the need for a
store/load barrier.  Here’s an example where a load/store barrier is
required:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = A<br />
B = 1    // “I have latched A”</code></td>
<td><code>loop_until (B == 1)<br />
A = 41    // update A</code></td>
</tr>
</table>

<p>Thread 2 could observe thread 1’s store of B=1 before it observes thread 1’s
load from A, and as a result store A=41 before thread 1 has a chance to read A.
Inserting a load/store barrier in each thread solves the problem:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = A<br />
<em>load/store barrier</em><br />
B = 1    // “I have latched A”</code></td>
<td><code>loop_until (B == 1)<br />
<em>load/store barrier</em><br />
A = 41    // update A</code></td>
</tr>
</table>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>A store to local cache may be observed before a load from main memory,
because accesses to main memory are so much slower.  In this case, assume core
1’s cache has the cache line for B but not A.  The load from A is initiated, and
while that’s in progress execution continues.  The store to B happens in local
cache, and by some means becomes available to core 2 while the load from A is
still in progress.  Thread 2 is able to exit the loop before it has observed
thread 1’s load from A.</p>

<p>A thornier question is: do we need a barrier in thread 2?  If the CPU doesn’t
perform speculative writes, and doesn’t execute instructions out of order, can
thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store
ordering?  (Answer: no.)  What if there’s a third core watching A and B?
(Answer: now you need one, or you could observe B==0 / A==41 on the third core.)
 It’s safest to insert barriers in both places and not worry about the
details.</p>
</div>
</div>

<p>As mentioned earlier, store/load barriers are the only kind required on x86
SMP.</p>

<h4 id="barrier_inst">Barrier instructions</h4>

<p>Different CPUs provide different flavors of barrier instruction.  For
example:</p>

<ul>
<li>Sparc V8 has a “membar” instruction that takes a 4-element bit vector.  The
four categories of barrier can be specified individually.</li>
<li>Alpha provides “rmb” (load/load), “wmb” (store/store), and “mb” (full).
(Trivia: the linux kernel provides three memory barrier functions with these
names and behaviors.)</li>
<li>x86 has a variety of options; “mfence” (introduced with SSE2) provides a
full barrier.</li>
<li>ARMv7 has “dmb st” (store/store) and “dmb sy” (full).</li>
</ul>

<p>“Full barrier” means all four categories are included.</p>

<p>It is important to recognize that the only thing guaranteed by barrier
instructions is ordering.  Do not treat them as cache coherency “sync points” or
synchronous “flush” instructions.  The ARM “dmb” instruction has no direct
effect on other cores.  This is important to understand when trying to figure
out where barrier instructions need to be issued.</p>


<h4 id="addr_dep">Address dependencies and causal consistency</h4>

<p><em>(This is a slightly more advanced topic and can be skipped.)</em></p>

<p>The ARM CPU provides one special case where a load/load barrier can be
avoided.  Consider the following example from earlier, modified slightly:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>[A+8] = 41<br />
<em>store/store barrier</em><br />
B = 1    // “A is ready”</code></td>
<td><code>loop:<br />
&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
reg1 = 8<br />
reg2 = [A + reg1]</code></td>
</tr>
</table>

<p>This introduces a new notation.  If “A” refers to a memory address, “A+n”
refers to a memory address offset by n bytes from A.  If A is the base address
of an object or array, [A+8] could be a field in the object or an element in the
array.</p>

<p>The “loop_until” seen in previous examples has been expanded to show the load
of B into reg0.  reg1 is assigned the numeric value 8, and reg2 is loaded from
the address [A+reg1] (same location that thread 1 is accessing).</p>

<p>This will not behave correctly because the load from B could be observed
after the load from [A+reg1].  We can fix this with a load/load barrier after
the loop, but on ARM we can also just do this:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>[A+8] = 41<br />
<em>store/store barrier</em><br />
B = 1    // “A is ready”</code></td>
<td><code>loop:<br />
&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
reg1 = 8 <strong>+ (reg0 & 0)</strong><br />
reg2 = [A + reg1]</code></td>
</tr>
</table>

<p>What we’ve done here is change the assignment of reg1 from a constant (8) to
a value that depends on what we loaded from B.  In this case, we do a bitwise
AND of the value with 0, which yields zero, which means reg1 still has the value
8.  However, the ARM CPU believes that the load from [A+reg1] depends upon the
load from B, and will ensure that the two are observed in program order.</p>

<p>This is called an <em>address dependency</em>.  Address dependencies exist
when the value returned by a load is used to compute the address of a subsequent
load or store.  It can let you avoid the need for an explicit barrier in certain
situations.</p>

<p>ARM does not provide <em>control dependency</em> guarantees.  To illustrate
this it’s necessary to dip into ARM code for a moment: <span
style="font-size:.9em;color:#777">(<em><a href="#more"
style="color:#777">Barrier Litmus Tests and Cookbook</a></em>)</span>.</p>

<pre>
LDR r1, [r0]
CMP r1, #55
LDRNE r2, [r3]
</pre>

<p>The loads from r0 and r3 may be observed out of order, even though the load
from r3 will not execute at all if [r0] doesn’t hold 55.  Inserting <code>AND r1, r1,
#0</code> and replacing the last instruction with <code>LDRNE r2, [r3, r1]</code> would ensure
proper ordering without an explicit barrier.  (This is a prime example of why
you can’t think about consistency issues in terms of instruction execution.
Always think in terms of memory accesses.)</p>

<p>While we’re hip-deep, it’s worth noting that ARM does not provide <em>causal
consistency</em>:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
<th>Thread 3</th>
</tr>
<tr>
<td><code>A = 1</code></td>
<td><code>loop_until (A == 1)<br />
B = 1</code></td>
<td><code>loop:<br />
&nbsp;&nbsp;reg0 = B<br />
&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
reg1 = reg0 & 0<br />
reg2 = [A+reg1]</code></td>
</tr>
</table>

<p>Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to
signal thread 3.  Thread 3 sees it and loads from A, using an address dependency
to ensure that the load of B and the load of A are observed in program
order.</p>

<p>It’s possible for reg2 to hold zero at the end of this.  The fact that a
store in thread 1 causes something to happen in thread 2 which causes something
to happen in thread 3 does not mean that thread 3 will observe the stores in
that order.  (Inserting a load/store barrier in thread 2 fixes this.)</p>

<h4 id="membarrier_summary">Memory barrier summary</h4>

<p>Barriers come in different flavors for different situations.  While there can
be performance advantages to using exactly the right barrier type, there are
code maintenance risks in doing so &mdash; unless the person updating the code
fully understands it, they might introduce the wrong type of operation and cause
a mysterious breakage.  Because of this, and because ARM doesn’t provide a wide
variety of barrier choices, many atomic primitives use full
barrier instructions when a barrier is required.</p>

<p>The key thing to remember about barriers is that they define ordering.  Don’t
think of them as a “flush” call that causes a series of actions to happen.
Instead, think of them as a dividing line in time for operations on the current
CPU core.</p>


<h3 id="atomic_ops">Atomic operations</h3>

<p>Atomic operations guarantee that an operation that requires a series of steps
always behaves as if it were a single operation.  For example, consider a
non-atomic increment (“++A”) executed on the same variable by two threads
simultaneously:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = A<br />
reg = reg + 1<br />
A = reg</code></td>
<td><code>reg = A<br />
reg = reg + 1<br />
A = reg</code></td>
</tr>
</table>

<p>If the threads execute concurrently from top to bottom, both threads will
load 0 from A, increment it to 1, and store it back, leaving a final result of
1.  If we used an atomic increment operation, you would be guaranteed that the
final result will be 2.</p>
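
<p>In C, the difference looks something like this.  (Here <code>atomic_inc()</code>
is a stand-in for whatever atomic primitive your platform provides, in the same
spirit as the hypothetical <code>atomic_cas()</code> and <code>atomic_dec()</code>
functions used later in this document.)</p>

<pre>int A = 0;    /* shared between threads */

void unsafe_increment()    /* runs in both threads */
{
    A = A + 1;             /* separate load, add, and store; result may be 1 */
}

void safe_increment()      /* runs in both threads */
{
    atomic_inc(&A);        /* single indivisible read-modify-write; result is 2 */
}</pre>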

<h4 id="atomic_essentials">Atomic essentials</h4>

<p>The most fundamental operations &mdash; loading and storing 32-bit values
&mdash; are inherently atomic on ARM so long as the data is aligned on a 32-bit
boundary.  For example:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = 0x00000000<br />
A = reg</code></td>
<td><code>reg = 0xffffffff<br />
A = reg</code></td>
</tr>
</table>

<p>The CPU guarantees that A will hold 0x00000000 or 0xffffffff.  It will never
hold 0x0000ffff or any other partial “mix” of bytes.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>The atomicity guarantee is lost if the data isn’t aligned.  Misaligned data
could straddle a cache line, so other cores could see the halves update
independently.  Consequently, the ARMv7 documentation declares that it provides
“single-copy atomicity” for all byte accesses, halfword accesses to
halfword-aligned locations, and word accesses to word-aligned locations.
Doubleword (64-bit) accesses are <strong>not</strong> atomic, unless the
location is doubleword-aligned and special load/store instructions are used.
This behavior is important to understand when multiple threads are performing
unsynchronized updates to packed structures or arrays of primitive types.</p>
</div>
</div>
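
<p>As a rough sketch of the kind of layout that loses the guarantee (whether a
torn value is actually observable depends on exactly where the field lands, but
single-copy atomicity no longer applies to it):</p>

<pre>struct Packed {
    char tag;
    unsigned int value;    /* starts at offset 1: not 32-bit aligned */
} __attribute__((packed));

struct Packed gPacked;

void writerA() { gPacked.value = 0x00000000; }    /* runs in thread 1 */
void writerB() { gPacked.value = 0xffffffff; }    /* runs in thread 2 */

/* A reader on another core may observe a mix of bytes from the two stores,
   which cannot happen with an aligned 32-bit location. */</pre>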

<p>There is no need for 32-bit “atomic read” or “atomic write” functions on ARM
or x86.  Where one is provided for completeness, it just does a trivial load or
store.</p>

<p>Operations that perform more complex actions on data in memory are
collectively known as <em>read-modify-write</em> (RMW) instructions, because
they load data, modify it in some way, and write it back.  CPUs vary widely in
how these are implemented.  ARM uses a technique called “Load Linked / Store
Conditional”, or LL/SC.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>A <em>linked</em> or <em>locked</em> load reads the data from memory as
usual, but also establishes a reservation, tagging the physical memory address.
The reservation is cleared when another core tries to write to that address.  To
perform an LL/SC, the data is read with a reservation, modified, and then a
conditional store instruction is used to try to write the data back.  If the
reservation is still in place, the store succeeds; if not, the store will fail.
Atomic functions based on LL/SC usually loop, retrying the entire
read-modify-write sequence until it completes without interruption.</p>
</div>
</div>
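
<p>At the C level, an LL/SC-based read-modify-write ends up looking like a retry
loop.  Here’s a sketch using the GCC <code>__sync</code> compare-and-swap builtin
(a compiler feature, not an Android API); on ARM the builtin is itself typically
compiled down to an LDREX/STREX sequence, and unlike the bare atomics discussed
below it also acts as a full memory barrier:</p>

<pre>int atomic_add(int* addr, int delta)
{
    int old;
    do {
        old = *addr;       /* read the current value */
        /* try to replace "old" with "old + delta"; this fails if another
           core modified *addr in the meantime, and we retry */
    } while (!__sync_bool_compare_and_swap(addr, old, old + delta));
    return old + delta;    /* the value we stored */
}</pre>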

<p>It’s worth noting that the read-modify-write operations would not work
correctly if they operated on stale data.  If two cores perform an atomic
increment on the same address, and one of them is not able to see what the other
did because each core is reading and writing from local cache, the operation
won’t actually be atomic.  The CPU’s cache coherency rules ensure that the
atomic RMW operations remain atomic in an SMP environment.</p>

<p>This should not be construed to mean that atomic RMW operations use a memory
barrier.  On ARM, atomics have no memory barrier semantics.  While a series of
atomic RMW operations on a single address will be observed in program order by
other cores, there are no guarantees when it comes to the ordering of atomic and
non-atomic operations.</p>

<p>It often makes sense to pair barriers and atomic operations together. The
next section describes this in more detail.</p>

<h4 id="atomic_barrierpairing">Atomic + barrier pairing</h4>

<p>As usual, it’s useful to illuminate the discussion with an example.  We’re
going to consider a basic mutual-exclusion primitive called a <em>spin
lock</em>.  The idea is that a memory address (which we’ll call “lock”)
initially holds zero.  When a thread wants to execute code in the critical
section, it sets the lock to 1, executes the critical code, and then changes it
back to zero when done.  If another thread has already set the lock to 1, we sit
and spin until the lock changes back to zero.</p>

<p>To make this work we use an atomic RMW primitive called
<em>compare-and-swap</em>.  The function takes three arguments: the memory
address, the expected current value, and the new value.  If the value currently
in memory matches what we expect, it is replaced with the new value, and the old
value is returned.  If the current value is not what we expect, we don’t change
anything.  A minor variation on this is called <em>compare-and-set</em>; instead
of returning the old value it returns a boolean indicating whether the swap
succeeded.  For our needs either will work, but compare-and-set is slightly
simpler for examples, so we use it and just refer to it as “CAS”.</p>

<p>The acquisition of the spin lock is written like this (using a C-like
language):</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)
} while (!success)

full_memory_barrier()

<em>critical-section</em></pre>

<p>If no thread holds the lock, the lock value will be 0, and the CAS operation
will set it to 1 to indicate that we now have it.  If another thread has it, the
lock value will be 1, and the CAS operation will fail because the expected
current value does not match the actual current value.  We loop and retry.
(Note this loop is on top of whatever loop the LL/SC code might be doing inside
the atomic_cas function.)</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>On SMP, a spin lock is a useful way to guard a small critical section.  If we
know that another thread is going to execute a handful of instructions and then
release the lock, we can just burn a few cycles while we wait our turn.
However, if the other thread happens to be executing on the same core, we’re
just wasting time because the other thread can’t make progress until the OS
schedules it again (either by migrating it to a different core or by preempting
us).  A proper spin lock implementation would optimistically spin a few times
and then fall back on an OS primitive (such as a Linux futex) that allows the
current thread to sleep while waiting for the other thread to finish up.  On a
uniprocessor you never want to spin at all.  For the sake of brevity we’re
ignoring all this.</p>
</div>
</div>

<p>The memory barrier is necessary to ensure that other threads observe the
acquisition of the lock before they observe any loads or stores in the critical
section.  Without that barrier, the memory accesses could be observed while the
lock is not held.</p>

<p>The <code>full_memory_barrier</code> call here actually does
<strong>two</strong> independent operations.  First, it issues the CPU’s full
barrier instruction.  Second, it tells the compiler that it is not allowed to
reorder code around the barrier.  That way, we know that the
<code>atomic_cas</code> call will be executed before anything in the critical
section.  Without this <em>compiler reorder barrier</em>, the compiler has a
great deal of freedom in how it generates code, and the order of instructions in
the compiled code might be much different from the order in the source code.</p>
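
<p>A sketch of such a function, assuming GCC-style inline assembly (the actual
platform implementation may differ):</p>

<pre>/* Hypothetical full_memory_barrier().  The "memory" clobber is the compiler
   reorder barrier; the instruction is the CPU barrier.  Requires an ARMv7
   (or SSE2-capable x86) build. */
static inline void full_memory_barrier(void)
{
#if defined(__arm__)
    __asm__ __volatile__ ("dmb sy" : : : "memory");     /* ARM full barrier */
#elif defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ ("mfence" : : : "memory");     /* x86 full barrier */
#endif
}</pre>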

<p>Of course, we also want to make sure that none of the memory accesses
performed in the critical section are observed after the lock is released.  The
full version of the simple spin lock is:</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)   // acquire
} while (!success)
full_memory_barrier()

<em>critical-section</em>

full_memory_barrier()
atomic_store(&lock, 0)                  // release</pre>

<p>We perform our second CPU/compiler memory barrier immediately
<strong>before</strong> we release the lock, so that loads and stores in the
critical section are observed before the release of the lock.</p>

<p>As mentioned earlier, the <code>atomic_store</code> operation is a simple
assignment on ARM and x86.  Unlike the atomic RMW operations, we don’t guarantee
that other threads will see this value immediately.  This isn’t a problem,
though, because we only need to keep the other threads <strong>out</strong>. The
other threads will stay out until they observe the store of 0.  If it takes a
little while for them to observe it, the other threads will spin a little
longer, but we will still execute code correctly.</p>

<p>It’s convenient to combine the atomic operation and the barrier call into a
single function.  It also provides other advantages, which will become clear
shortly.</p>


<h4 id="acq_rel">Acquire and release</h4>

<p>When acquiring the spinlock, we issue the atomic CAS and then the barrier.
When releasing the spinlock, we issue the barrier and then the atomic store.
This inspires a particular naming convention: operations followed by a barrier
are “acquiring” operations, while operations preceded by a barrier are
“releasing” operations.  (It would be wise to install the spin lock example
firmly in mind, as the names are not otherwise intuitive.)</p>

<p>Rewriting the spin lock example with this in mind:</p>

<pre>do {
    success = atomic_<strong>acquire</strong>_cas(&lock, 0, 1)
} while (!success)

<em>critical-section</em>

atomic_<strong>release</strong>_store(&lock, 0)</pre>

<p>This is a little more succinct and easier to read, but the real motivation
for doing this lies in a couple of optimizations we can now perform.</p>

<p>First, consider <code>atomic_release_store</code>.  We need to ensure that
the store of zero to the lock word is observed after any loads or stores in the
critical section above it.  In other words, we need a load/store and store/store
barrier.  In an earlier section we learned that these aren’t necessary on x86
SMP -- only store/load barriers are required.  The implementation of
<code>atomic_release_store</code> on x86 is therefore just a compiler reorder
barrier followed by a simple store.  No CPU barrier is required.</p>
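
<p>A sketch of that x86 version, again assuming GCC-style inline assembly:</p>

<pre>/* Hypothetical x86-only atomic_release_store(): no CPU barrier instruction,
   just a compiler reorder barrier followed by an ordinary aligned store. */
static inline void atomic_release_store(int* addr, int value)
{
    __asm__ __volatile__ ("" : : : "memory");    /* compiler barrier only */
    *addr = value;
}</pre>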

<p>The second optimization mostly applies to the compiler (although some CPUs,
such as the Itanium, can take advantage of it as well).  The basic principle is
that code can move across acquire and release barriers, but only in one
direction.</p>

<p>Suppose we have a mix of locally-visible and globally-visible memory
accesses, with some miscellaneous computation as well:</p>

<pre>local1 = arg1 / 41
local2 = threadStruct->field2
threadStruct->field3 = local2

do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local5 = globalStruct->field5
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>Here we see two completely independent sets of operations.  The first set
operates on a thread-local data structure, so we’re not concerned about clashes
with other threads.  The second set operates on a global data structure, which
must be protected with a lock.</p>

<p>A full compiler reorder barrier in the atomic ops will ensure that the
program order matches the source code order at the lock boundaries.  However,
allowing the compiler to interleave instructions can improve performance.  Loads
from memory can be slow, but the CPU can continue to execute instructions that
don’t require the result of that load while waiting for it to complete.  The
code might execute more quickly if it were written like this instead:</p>

<pre>do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local2 = threadStruct->field2
local5 = globalStruct->field5
local1 = arg1 / 41
threadStruct->field3 = local2
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>We issue both loads, do some unrelated computation, and then execute the
instructions that make use of the loads.  If the integer division takes less
time than one of the loads, we essentially get it for free, since it happens
during a period where the CPU would have stalled waiting for a load to
complete.</p>

<p>Note that <strong>all</strong> of the operations are now happening inside the
critical section.  Since none of the “threadStruct” operations are visible
outside the current thread, nothing else can see them until we’re finished here,
so it doesn’t matter exactly when they happen.</p>

<p>In general, it is always safe to move operations <strong>into</strong> a
critical section, but never safe to move operations <strong>out of</strong> a
critical section.  Put another way, you can migrate code “downward” across an
acquire barrier, and “upward” across a release barrier.  If the atomic ops used
a full barrier, this sort of migration would not be possible.</p>

<p>Returning to an earlier point, we can state that on x86 all loads are
acquiring loads, and all stores are releasing stores.  As a result:</p>

<ul>
<li>Loads may not be reordered with respect to each other.  You can’t take a
load and move it “upward” across another load’s acquire barrier.</li>
<li>Stores may not be reordered with respect to each other, because you can’t
move a store “downward” across another store’s release barrier.</li>
<li>A load followed by a store can’t be reordered, because neither instruction
will tolerate it.</li>
<li>A store followed by a load <strong>can</strong> be reordered, because each
instruction can move across the other in that direction.</li>
</ul>

<p>Hence, you only need store/load barriers on x86 SMP.</p>

<p>Labeling atomic operations with “acquire” or “release” describes not only
whether the barrier is executed before or after the atomic operation, but also
how the compiler is allowed to reorder code.</p>

<h2 id="practice">Practice</h2>

<p>Debugging memory consistency problems can be very difficult.  If a missing
memory barrier causes some code to read stale data, you may not be able to
figure out why by examining memory dumps with a debugger.  By the time you can
issue a debugger query, the CPU cores will have all observed the full set of
accesses, and the contents of memory and the CPU registers will appear to be in
an “impossible” state.</p>

<h3 id="c_dont">What not to do in C</h3>

<p>Here we present some examples of incorrect code, along with simple ways to
fix them.  Before we do that, we need to discuss the use of a basic language
feature.</p>

<h4 id="volatile">C/C+++ and "volatile"</h4>

<p>When writing single-threaded code, declaring a variable “volatile” can be
very useful.  The compiler will not omit or reorder accesses to volatile
locations.  Combine that with the sequential consistency provided by the
hardware, and you’re guaranteed that the loads and stores will appear to happen
in the expected order.</p>

<p>However, accesses to volatile storage may be reordered with non-volatile
accesses, so you have to be careful in multi-threaded uniprocessor environments
(explicit compiler reorder barriers may be required).  There are no atomicity
guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at
all in multi-threaded SMP environments.  The C and C++ language standards are
being updated to address this with built-in atomic operations.</p>

<p>If you think you need to declare something “volatile”, that is a strong
indicator that you should be using one of the atomic operations instead.</p>
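
<p>For example, the new atomic operations in the (at the time of this writing,
draft) C standard express the “publish a value” pattern like this.  Whether your
toolchain provides <code>&lt;stdatomic.h&gt;</code> depends on its version, so
treat this as a sketch:</p>

<pre>#include &lt;stdatomic.h&gt;

atomic_int ready = ATOMIC_VAR_INIT(0);
int payload;

void producer()    /* runs in thread 1 */
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer()     /* runs in thread 2 */
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    return payload;    /* guaranteed to see 42 */
}</pre>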

<h4 id="examplesc">Examples</h4>

<p>In most cases you’d be better off with a synchronization primitive (like a
pthread mutex) rather than an atomic operation, but we will employ the latter to
illustrate how they would be used in a practical situation.</p>

<p>For the sake of brevity we’re ignoring the effects of compiler optimizations
here &mdash; some of this code is broken even on uniprocessors &mdash; so for
all of these examples you must assume that the compiler generates
straightforward code (for example, compiled with gcc -O0).  The fixes presented here do
solve both compiler-reordering and memory-access-ordering issues, but we’re only
going to discuss the latter.</p>

<pre>MyThing* gGlobalThing = NULL;

void initGlobalThing()    // runs in thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing->x = 5;
    thing->y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing->x;    // could be 5, 0, or uninitialized data
        ...
    }
}</pre>

<p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we “publish” it by storing it in a global variable.  At that point,
any other thread can see it, but that’s fine since it’s fully initialized,
right?  At least, it would be on x86 SMP or a uniprocessor (again, making the
erroneous assumption that the compiler outputs code exactly as we have it in the
source).</p>

<p>Without a memory barrier, the store to <code>gGlobalThing</code> could be observed before
the fields are initialized on ARM.  Another thread reading from <code>thing->x</code> could
see 5, 0, or even uninitialized data.</p>

<p>This can be fixed by changing the last assignment to:</p>

<pre>    atomic_release_store(&gGlobalThing, thing);</pre>

<p>That ensures that all other threads will observe the writes in the proper
order, but what about reads?  In this case we should be okay on ARM, because the
address dependency rules will ensure that any loads from an offset of
<code>gGlobalThing</code> are observed after the load of
<code>gGlobalThing</code>.  However, it’s unwise to rely on architectural
details, since it means your code will be very subtly unportable.  The complete
fix also requires a barrier after the load:</p>

<pre>    MyThing* thing = atomic_acquire_load(&gGlobalThing);
    int i = thing->x;</pre>

<p>Now we know the ordering will be correct.  This may seem like an awkward way
to write code, and it is, but that’s the price you pay for accessing data
structures from multiple threads without using locks.  Besides, address
dependencies won’t always save us:</p>

<pre>MyThing gGlobalThing;

void initGlobalThing()    // runs in thread 1
{
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    /* initialization complete */
    gGlobalThing.initialized = true;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing.initialized) {
        int i = gGlobalThing.x;    // could be 5 or 0
    }
}</pre>

<p>Because there is no relationship between the <code>initialized</code> field and the
others, the reads and writes can be observed out of order.  (Note that global data is
initialized to zero by the OS, so it shouldn’t be possible to read “random”
uninitialized data.)</p>

<p>We need to replace the store with:</p>
<pre>    atomic_release_store(&gGlobalThing.initialized, true);</pre>

<p>and replace the load with:</p>
<pre>    int initialized = atomic_acquire_load(&gGlobalThing.initialized);</pre>

<p>Another example of the same problem occurs when implementing
reference-counted data structures.  The reference count itself will be
consistent so long as atomic increment and decrement operations are used, but
you can still run into trouble at the edges, for example:</p>

<pre>void RefCounted::release()
{
    int oldCount = atomic_dec(&mRefCount);
    if (oldCount == 1) {    // was decremented to zero
        recycleStorage();
    }
}

void useSharedThing(RefCountedThing sharedThing)
{
    int localVar = sharedThing->x;
    sharedThing->release();
    sharedThing = NULL;    // can’t use this pointer any more
    doStuff(localVar);    // value of localVar might be wrong
}</pre>

<p>The <code>release()</code> call decrements the reference count using a
barrier-free atomic decrement operation.  Because this is an atomic RMW
operation, we know that it will work correctly.  If the reference count goes to
zero, we recycle the storage.</p>

<p>The <code>useSharedThing()</code> function extracts what it needs from
<code>sharedThing</code> and then releases its copy.  However, because we didn’t
use a memory barrier, and atomic and non-atomic operations can be reordered,
it’s possible for other threads to observe the read of
<code>sharedThing->x</code> <strong>after</strong> they observe the recycle
operation.  It’s therefore possible for <code>localVar</code> to hold a value
from "recycled" memory, for example a new object created in the same
location by another thread after <code>release()</code> is called.</p>

<p>This can be fixed by replacing the call to <code>atomic_dec()</code> with
<code>atomic_release_dec()</code>. The barrier ensures that the reads from
<code>sharedThing</code> are observed before we recycle the object.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>In most cases the above won’t actually fail, because the “recycle” function
is likely guarded by functions that themselves employ barriers (libc heap
<code>free()</code>/<code>delete()</code>, or an object pool guarded by a
mutex).  If the recycle function used a lock-free algorithm implemented without
barriers, however, the above code could fail on ARM SMP.</p>
</div>
</div>

<h3 id="j_dont">What not to do in Java</h3>

<p>We haven’t discussed some relevant Java language features, so we’ll take a
quick look at those first.</p>

<h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>

<p>The “synchronized” keyword provides the Java language’s in-built locking
mechanism.  Every object has an associated “monitor” that can be used to provide
mutually exclusive access.</p>

<p>The implementation of the “synchronized” block has the same basic structure
as the spin lock example: it begins with an acquiring CAS, and ends with a
releasing store.  This means that compilers and code optimizers are free to
migrate code into a “synchronized” block.  One practical consequence: you must
<strong>not</strong> conclude that code inside a synchronized block happens
after the stuff above it or before the stuff below it in a function.  Going
further, if a method has two synchronized blocks that lock the same object, and
there are no operations in the intervening code that are observable by another
thread, the compiler may perform “lock coarsening” and combine them into a
single block.</p>
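
<p>As a rough sketch (the class and field names are invented for this
illustration), this is the kind of code where such a transformation could
apply:</p>

<pre>class Tally {
    private final Object mLock = new Object();
    private int mA, mB;    // guarded by mLock

    void bumpBoth() {
        synchronized (mLock) {
            mA++;
        }
        // No operations here are observable by another thread, so the
        // optimizer may merge the two blocks and hold the lock across both.
        synchronized (mLock) {
            mB++;
        }
    }
}</pre>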

<p>The other relevant keyword is “volatile”.  As defined in the specification
for Java 1.4 and earlier, a volatile declaration was about as weak as its C
counterpart.  The spec for Java 1.5 was updated to provide stronger guarantees,
almost to the level of monitor synchronization.</p>

<p>The effects of volatile accesses can be illustrated with an example.  If
thread 1 writes to a volatile field, and thread 2 subsequently reads from that
same field, then thread 2 is guaranteed to see that write and all writes
previously made by thread 1.  More generally, the writes made by
<strong>any</strong> thread up to the point where it writes the field will be
visible to thread 2 when it does the read.  In effect, writing to a volatile is
like a monitor release, and reading from a volatile is like a monitor
acquire.</p>

<p>Non-volatile accesses may be reordered with respect to volatile accesses in the
usual ways, for example the compiler could move a non-volatile load or store “above” a
volatile store, but couldn’t move it “below”.  Volatile accesses may not be
reordered with respect to each other.  The VM takes care of issuing the
appropriate memory barriers.</p>

<p>It should be mentioned that, while loads and stores of object references and
most primitive types are atomic, <code>long</code> and <code>double</code>
fields are not accessed atomically unless they are marked as volatile.
Multi-threaded updates to non-volatile 64-bit fields are problematic even on
uniprocessors.</p>
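
<p>As a brief sketch (the field name is invented here), this is the kind of
declaration that matters:</p>

<pre>class TrafficStats {
    // A plain "long" may be read and written as two separate 32-bit halves,
    // so a concurrent reader could observe a value that mixes two updates.
    // Declaring the field volatile makes each load and store atomic
    // (and adds the usual volatile ordering guarantees).
    volatile long mTotalBytes;
}</pre>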

<h4 id="examplesj">Examples</h4>

<p>Here’s a simple, incorrect implementation of a monotonic counter: <span
style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
theory and practice: Managing volatility</a></em>)</span>.</p>

<pre>class Counter {
    private int mValue;

    public int get() {
        return mValue;
    }
    public void incr() {
        mValue++;
    }
}</pre>

<p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
threads, and we want to be sure that every thread sees the current count when
<code>get()</code> is called.  The most glaring problem is that
<code>mValue++</code> is actually three operations:</p>

<ol>
<li><code>reg = mValue</code></li>
<li><code>reg = reg + 1</code></li>
<li><code>mValue = reg</code></li>
</ol>

<p>If two threads execute <code>incr()</code> simultaneously, one of the
updates could be lost.  To make the increment atomic, we need to declare
<code>incr()</code> “synchronized”.  With this change, the code will run
correctly in multi-threaded uniprocessor environments.</p>

<p>It’s still broken on SMP, however.  Different threads might see different
results from <code>get()</code>, because we’re reading the value with an ordinary load.  We
can correct the problem by declaring <code>get()</code> to be synchronized.
With this change, the code is obviously correct.</p>

<p>Unfortunately, we’ve introduced the possibility of lock contention, which
could hamper performance.  Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with “volatile”.  (Note that
<code>incr()</code> must still be declared <code>synchronized</code>.)  Now we know that
the volatile write to <code>mValue</code> will be visible to any subsequent volatile read of
<code>mValue</code>. <code>incr()</code> will be slightly slower, but
<code>get()</code> will be faster, so even in the absence of contention this is
a win if reads outnumber writes. (See also {@link
java.util.concurrent.atomic.AtomicInteger}.)</p>
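
<p>Putting the two changes together, a sketch of the corrected counter might
look like this; the second class shows an equivalent built on
<code>AtomicInteger</code>:</p>

<pre>class Counter {
    private volatile int mValue;

    public int get() {
        return mValue;                  // volatile read: observes completed increments
    }
    public synchronized void incr() {
        mValue++;                       // read-modify-write guarded by the monitor
    }
}

class AtomicCounter {
    private final java.util.concurrent.atomic.AtomicInteger mValue =
            new java.util.concurrent.atomic.AtomicInteger();

    public int get()   { return mValue.get(); }
    public void incr() { mValue.incrementAndGet(); }
}</pre>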

<p>Here’s another example, similar in form to the earlier C examples:</p>

<pre>class MyGoodies {
    public int x, y;
}
class MyClass {
    static MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;
    }

    void useGoodies() {    // runs in thread 2
        if (sGoodies != null) {
            int i = sGoodies.x;    // could be 5 or 0
            ....
        }
    }
}</pre>

<p>This has the same problem as the C code, namely that the assignment
<code>sGoodies = goods</code> might be observed before the initialization of the
fields in <code>goods</code>.  If you declare <code>sGoodies</code> with the
volatile keyword, you can think about the loads as if they were
<code>atomic_acquire_load()</code> calls, and the stores as if they were
<code>atomic_release_store()</code> calls.</p>

<p>(Note that only the <code>sGoodies</code> reference itself is volatile.  The
accesses to the fields inside it are not.  The statement <code>z =
sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
followed by a non-volatile load of <code>sGoodies.x</code>.  If you make a local
reference <code>MyGoodies localGoods = sGoodies</code>, <code>z =
localGoods.x</code> will not perform any volatile loads.)</p>

<p>A more common idiom in Java programming is the infamous “double-checked
locking”:</p>

<pre>class MyClass {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
}</pre>

<p>The idea is that we want to have a single instance of a <code>Helper</code>
object associated with an instance of <code>MyClass</code>.  We must only create
it once, so we create and return it through a dedicated <code>getHelper()</code>
function.  To avoid a race in which two threads create the instance, we need to
synchronize the object creation.  However, we don’t want to pay the overhead for
the “synchronized” block on every call, so we only do that part if
<code>helper</code> is currently null.</p>

<p>This doesn’t work correctly on uniprocessor systems, unless you’re using a
traditional Java source compiler and an interpreter-only VM.  Once you add fancy
code optimizers and JIT compilers it breaks down.  See the “‘Double Checked
Locking is Broken’ Declaration” link in the appendix for more details, or Item
71 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
2nd Edition</em>.</p>

<p>Running this on an SMP system introduces an additional way to fail.  Consider
the same code rewritten slightly, as if it were compiled into a C-like language
(I’ve added a couple of integer fields to represent <code>Helper</code>’s
constructor activity):</p>

<pre>if (helper == null) {
    // acquire monitor using spinlock
    while (atomic_acquire_cas(&this.lock, 0, 1) != success)
        ;
    if (helper == null) {
        newHelper = malloc(sizeof(Helper));
        newHelper->x = 5;
        newHelper->y = 10;
        helper = newHelper;
    }
    atomic_release_store(&this.lock, 0);
}</pre>

<p>Now the problem should be obvious: the store to <code>helper</code> is
happening before the memory barrier, which means another thread could observe
the non-null value of <code>helper</code> before the stores to the
<code>x</code>/<code>y</code> fields.</p>

<p>You could try to ensure that the store to <code>helper</code> happens after
the <code>atomic_release_store()</code> on <code>this.lock</code> by rearranging
the code, but that won’t help, because it’s okay to migrate code upward &mdash;
the compiler could move the assignment back above the
<code>atomic_release_store()</code> to its original position.</p>

<p>There are two ways to fix this:</p>
<ol>
<li>Do the simple thing and delete the outer check.  This ensures that we never
examine the value of <code>helper</code> outside a synchronized block.</li>
<li>Declare <code>helper</code> volatile.  With this one small change, the
double-checked locking code shown earlier will work correctly on Java 1.5 and
later.  (You may want to take a minute to convince yourself that this is true; a
sketch of the corrected version follows this list.)</li>
</ol>
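
<p>A sketch of the second fix follows; the local <code>result</code> variable is
an optional refinement (described in <em>Effective Java</em>) that avoids a
second volatile read on the common path:</p>

<pre>class MyClass {
    private volatile Helper helper = null;

    public Helper getHelper() {
        Helper result = helper;               // single volatile read
        if (result == null) {
            synchronized (this) {
                result = helper;
                if (result == null) {
                    // the volatile write publishes the fully constructed object
                    helper = result = new Helper();
                }
            }
        }
        return result;
    }
}</pre>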

<p>This next example illustrates two important issues when using volatile:</p>

<pre>class MyClass {
    int data1, data2;
    volatile int vol1, vol2;

    void setValues() {    // runs in thread 1
        data1 = 1;
        vol1 = 2;
        data2 = 3;
    }

    void useValues1() {    // runs in thread 2
        if (vol1 == 2) {
            int l1 = data1;    // okay
            int l2 = data2;    // wrong
        }
    }
    void useValues2() {    // runs in thread 2
        int dummy = vol2;
        int l1 = data1;    // wrong
        int l2 = data2;    // wrong
    }
}</pre>

<p>Looking at <code>useValues1()</code>, if thread 2 hasn’t yet observed the
update to <code>vol1</code>, then it can’t know if <code>data1</code> or
<code>data2</code> has been set yet.  Once it sees the update to
<code>vol1</code>, it knows that the change to <code>data1</code> is also
visible, because that was made before <code>vol1</code> was changed.  However,
it can’t make any assumptions about <code>data2</code>, because that store was
performed after the volatile store.</p>

<p>The code in <code>useValues2()</code> uses a second volatile field,
<code>vol2</code>, in an attempt to force the VM to generate a memory barrier.
This doesn’t generally work.  To establish a proper “happens-before”
relationship, both threads need to be interacting with the same volatile field.
You’d have to know that <code>vol2</code> was set after <code>data1/data2</code>
in thread 1.  (The fact that this doesn’t work is probably obvious from looking
at the code; the caution here is against trying to cleverly “cause” a memory
barrier instead of creating an ordered series of accesses.)</p>

<h3 id="bestpractice">What to do</h3>

<h4 id="advice">General advice</h4>

<p>In C/C++, use the <code>pthread</code> operations, like mutexes and
semaphores.  These include the proper memory barriers, providing correct and
efficient behavior on all Android platform versions.  Be sure to use them
correctly, for example be wary of signaling a condition variable without holding the
corresponding mutex.</p>

<p>It's best to avoid using atomic functions directly. Locking and
unlocking a pthread mutex require a single atomic operation each if there’s no
contention, so you’re not going to save much by replacing mutex calls with
atomic ops.  If you need a lock-free design, you must fully understand the
concepts in this entire document before you begin (or, better yet, find an
existing code library that is known to be correct on SMP ARM).</p>

<p>Be extremely circumspect with “volatile” in C/C++.  It often indicates a
concurrency problem waiting to happen.</p>

<p>In Java, the best answer is usually to use an appropriate utility class from
the {@link java.util.concurrent} package.  The code is well written and well
tested on SMP.</p>
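
<p>For example, a hand-written wait/notify work queue can usually be replaced
with a ready-made class.  This sketch (class and method names invented) hands
tasks between threads with a <code>LinkedBlockingQueue</code>, which supplies
both the blocking behavior and the memory-visibility guarantees:</p>

<pre>import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class WorkPipe {
    private final BlockingQueue&lt;Runnable&gt; mQueue =
            new LinkedBlockingQueue&lt;Runnable&gt;();

    void post(Runnable task) throws InterruptedException {
        mQueue.put(task);               // never blocks for an unbounded queue
    }

    void runLoop() throws InterruptedException {
        while (true) {
            Runnable task = mQueue.take();   // blocks until a task is available;
            task.run();                      // everything the posting thread wrote
        }                                    // before put() is visible here
    }
}</pre>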

<p>Perhaps the safest thing you can do is make your class immutable.  Objects
from classes like String and Integer hold data that cannot be changed once the
object is created, avoiding all synchronization issues.  The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in
particular the importance of declaring fields “final” <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>

<p>If neither of these options is viable, the Java “synchronized” statement
should be used to guard any field that can be accessed by more than one thread.
If mutexes won’t work for your situation, you should declare shared fields
“volatile”, but you must take great care to understand the interactions between
threads.  The volatile declaration won’t save you from common concurrent
programming mistakes, but it will help you avoid the mysterious failures
associated with optimizing compilers and SMP mishaps.</p>

<p>The Java Memory Model guarantees that assignments to final fields are visible
to all threads once the constructor has finished &mdash; this is what ensures
proper synchronization of fields in immutable classes.  This guarantee does not
hold if a partially-constructed object is allowed to become visible to other
threads.  It is necessary to follow safe construction practices <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Safe
Construction Techniques in Java</a>)</span>.</p>
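
<p>Here is a sketch of the kind of constructor escape to avoid (the
<code>Registry</code> and <code>Listener</code> interfaces are invented for this
example): once <code>this</code> becomes reachable from another thread before
the constructor returns, the final-field guarantee no longer applies.</p>

<pre>interface Registry {
    void register(Listener listener);   // may hand the reference to another thread
}

interface Listener {
    void onEvent();
}

class Unsafe implements Listener {
    private final int mValue;

    Unsafe(Registry registry) {
        registry.register(this);   // BAD: "this" escapes before construction finishes
        mValue = 42;
    }

    public void onEvent() {
        System.out.println(mValue);   // another thread could observe 0 here
    }
}</pre>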

<h4 id="sync_guarantees">Synchronization primitive guarantees</h4>

<p>The pthread library and VM make a couple of useful guarantees: all accesses
previously performed by a thread that creates a new thread are observable by
that new thread as soon as it starts, and all accesses performed by a thread
that is exiting are observable when a <code>join()</code> on that thread
returns.  This means you don’t need any additional synchronization when
preparing data for a new thread or examining the results of a joined thread.</p>
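
<p>A short sketch of that pattern (names invented): the input prepared before
<code>start()</code> and the result written before the worker exits need no
additional synchronization.</p>

<pre>static long sumOnWorkerThread() throws InterruptedException {
    final int[] input = new int[1000];
    final long[] result = new long[1];

    for (int i = 0; i &lt; input.length; i++) {
        input[i] = i;                       // prepared before start()
    }

    Thread worker = new Thread(new Runnable() {
        public void run() {
            long sum = 0;
            for (int v : input) {
                sum += v;                   // sees the writes made before start()
            }
            result[0] = sum;                // written before the thread exits
        }
    });
    worker.start();
    worker.join();                          // once join() returns, result[0] is visible
    return result[0];
}</pre>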

<p>Whether or not these guarantees apply to interactions with pooled threads
depends on the thread pool implementation.</p>

<p>In C/C++, the pthread library guarantees that any accesses made by a thread
before it unlocks a mutex will be observable by another thread after it locks
that same mutex.  It also guarantees that any accesses made before calling
<code>signal()</code> or <code>broadcast()</code> on a condition variable will
be observable by the woken thread.</p>

<p>Java language threads and monitors make similar guarantees for the comparable
operations.</p>
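
<p>In Java terms, a small sketch (names invented): whatever a thread writes
while holding a monitor is visible to the next thread that acquires the same
monitor, and <code>wait()</code>/<code>notifyAll()</code> play the role of the
condition variable.</p>

<pre>class Mailbox {
    private final Object mLock = new Object();
    private String mMessage;        // guarded by mLock
    private boolean mReady;         // guarded by mLock

    void post(String message) {
        synchronized (mLock) {
            mMessage = message;     // written while holding the monitor...
            mReady = true;
            mLock.notifyAll();
        }
    }

    String await() throws InterruptedException {
        synchronized (mLock) {
            while (!mReady) {
                mLock.wait();
            }
            return mMessage;        // ...and therefore visible after the monitor
        }                           // is re-acquired here
    }
}</pre>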

<h4 id="ccpp_changes">Upcoming changes to C/C++</h4>

<p>The C and C++ language standards are evolving to include a sophisticated
collection of atomic operations.  A full matrix of calls for common data types
is defined, with selectable memory barrier semantics (choose from relaxed,
consume, acquire, release, acq_rel, seq_cst).</p>

<p>See the <a href="#more">Further Reading</a> section for pointers to the
specifications.</p>


<h2 id="closing_notes">Closing Notes</h2>

<p>While this document does more than merely scratch the surface, it doesn’t
manage more than a shallow gouge.  This is a very broad and deep topic.  Some
areas for further exploration:</p>

<ul>
<li>Learn the definitions of <em>happens-before</em>,
<em>synchronizes-with</em>, and other essential concepts from the Java Memory
Model.  (It’s hard to understand what “volatile” really means without getting
into this.)</li>
<li>Explore what compilers are and aren’t allowed to do when reordering code.
(The JSR-133 spec has some great examples of legal transformations that lead to
unexpected results.)</li>
<li>Find out how to write immutable classes in Java and C++.  (There’s more to
it than just “don’t change anything after construction”.)</li>
<li>Internalize the recommendations in the Concurrency section of <em>Effective
Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
meant to be overridden while inside a synchronized block.)</li>
<li>Understand what sorts of barriers you can use on x86 and ARM.  (And other
CPUs for that matter, for example Itanium’s acquire/release instruction
modifiers.)</li>
<li>Read through the {@link java.util.concurrent} and {@link
java.util.concurrent.atomic} APIs to see what's available.  Consider using
concurrency annotations like <code>@ThreadSafe</code> and
<code>@GuardedBy</code> (from net.jcip.annotations).</li>
</ul>

<p>The <a href="#more">Further Reading</a> section in the appendix has links to
documents and web sites that will better illuminate these topics.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="smp_failure_example">SMP failure example</h3>

<p>This document describes a lot of “weird” things that can, in theory, happen.
If you’re not convinced that these issues are real, a practical example may be
useful.</p>

<p>Bill Pugh’s Java memory model <a
href="http://www.cs.umd.edu/~pugh/java/memoryModel/">web site</a> has a few
test programs on it.  One interesting test is ReadAfterWrite.java, which does
the following:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;a = i;<br />
&nbsp;&nbsp;&nbsp;&nbsp;BB[i] = b;<br />
}</code></td>
<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;b = i;<br />
&nbsp;&nbsp;&nbsp;&nbsp;AA[i] = a;<br />
}</code></td>
</tr>
</table>

<p>Here <code>a</code> and <code>b</code> are declared as volatile
<code>int</code> fields, and <code>AA</code> and <code>BB</code> are ordinary
integer arrays.</p>

<p>This is trying to determine if the VM ensures that, after a value is written
to a volatile, the next read from that volatile sees the new value.  The test
code executes these loops a million or so times, and then runs through afterward
and searches the results for inconsistencies.</p>

<p>At the end of execution, <code>AA</code> and <code>BB</code> will be full of
gradually-increasing integers.  The threads will not run side-by-side in a
predictable way, but we can assert a relationship between the array contents.
For example, consider this execution fragment:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>(initially a == 1534)<br />
a = 1535<br />
BB[1535] = 165<br />
a = 1536<br />
BB[1536] = 165<br />
<br />
<br />
<br />
<br />
a = 1537<br />
BB[1537] = 167</code></td>
<td><code>(initially b == 165)
<br />
<br />
<br />
<br />
<br />
b = 166<br />
AA[166] = 1536<br />
b = 167<br />
AA[167] = 1536<br />
<br /></code></td>
</tr>
</table>

<p>(This is written as if the threads were taking turns executing so that it’s
more obvious when results from one thread should be visible to the other, but in
practice that won’t be the case.)</p>

<p>Look at the assignment of <code>AA[166]</code> in thread 2.  We are capturing
the fact that, at the point where thread 2 was on iteration 166, it can see that
thread 1 was on iteration 1536.  If we look one step in the future, at thread
1’s iteration 1537, we expect to see that thread 1 saw that thread 2 was at
iteration 166 (or later).  <code>BB[1537]</code> holds 167, so it appears things
are working.</p>

<p>Now suppose we fail to observe a volatile write to <code>b</code>:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>(initially a == 1534)<br />
a = 1535<br />
BB[1535] = 165<br />
a = 1536<br />
BB[1536] = 165<br />
<br />
<br />
<br />
<br />
a = 1537<br />
BB[1537] = 165  // stale b</code></td>
<td><code>(initially b == 165)<br />
<br />
<br />
<br />
<br />
b = 166<br />
AA[166] = 1536<br />
b = 167<br />
AA[167] = 1536</code></td>
</tr>
</table>

<p>Now, <code>BB[1537]</code> holds 165, a smaller value than we expected, so we
know we have a problem.  Put succinctly, for <code>i=166</code>, <code>BB[AA[i]+1] &lt; i</code>.  (This also
catches failures by thread 2 to observe writes to <code>a</code>, for example if we
miss an update and assign <code>AA[166] = 1535</code>, we will get
<code>BB[AA[166]+1] == 165</code>.)</p>

<p>If you run the test program under Dalvik (Android 3.0 “Honeycomb” or later)
on an SMP ARM device, it will never fail.  If you remove the word “volatile”
from the declarations of <code>a</code> and <code>b</code>, it will consistently
fail.  The program is testing to see if the VM is providing sequentially
consistent ordering for accesses to <code>a</code> and <code>b</code>, so you
will only see correct behavior when the variables are volatile.  (The
non-volatile version will also pass if you run it on a uniprocessor device, or
while something else is using enough CPU that the kernel doesn’t schedule the
test threads on separate cores.)</p>

<p>If you run the modified test a few times you will note that it doesn’t fail
in the same place every time.  The test fails consistently because it performs
the operations a million times, and it only needs to see out-of-order accesses
once.  In practice, failures will be infrequent and difficult to locate.  This
test program could very well succeed on a broken VM if things just happen to
work out.</p>

<h3 id="sync_stores">Implementing synchronization stores</h3>

<p><em>(This isn’t something most programmers will find themselves implementing,
but the discussion is illuminating.)</em></p>

<p>Consider once again volatile accesses in Java.  Earlier we made reference to
their similarities with acquiring loads and releasing stores, which works as a
starting point but doesn’t tell the full story.</p>

<p>We start with a fragment of Dekker’s algorithm. Initially both
<code>flag1</code> and <code>flag2</code> are false:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>flag1 = true<br />
if (flag2 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
<td><code>flag2 = true<br />
if (flag1 == false)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
</tr>
</table>

<p><code>flag1</code> and <code>flag2</code> are declared as volatile boolean
fields.  The rules for acquiring loads and releasing stores would allow the
accesses in each thread to be reordered, breaking the algorithm.  Fortunately,
the JMM has a few things to say here.  Informally:</p>

<ul>
<li>A write to a volatile field <em>happens-before</em> every subsequent read of that
same field.  (For this example, it means that if one thread updates a flag, and
later on the other thread reads that flag, the reader is guaranteed to see the
write.)</li>
<li>Every execution has a total order over all volatile field accesses.  The
order is consistent with program order.</li>
</ul>

<p>Taken together, these rules say that the volatile accesses in our example
must be observable in program order by all threads. Thus, we will never see
these threads executing the “critical-stuff” simultaneously.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #999;padding:0 1em;font-style:italic;">
<p>Another way to think about this is in terms of <em>data races</em>.  A data race
occurs if two accesses to the same memory location by different threads are not
ordered, at least one of them stores to the memory location, and at least one of
them is not a synchronization action <span style="font-size:.9em;color:#777">(<a
href="#more" style="color:#777">Boehm and McKenney</a>)</span>. The memory model
declares that a program free of data races must behave as if executed by a
sequentially-consistent machine.  Because both <code>flag1</code> and
<code>flag2</code> are volatile, and volatile accesses are considered
synchronization actions, there are no data races and this code must execute in a
sequentially consistent manner.</p>
</div>
</div>

<p>As we saw in an earlier section, we need to insert a store/load barrier
between the two operations.  The code executed in the VM for a volatile access
will look something like this:</p>

<table>
<tr>
<th>volatile load</th>
<th>volatile store</th>
</tr>
<tr>
<td><code>reg = A<br />
<em>load/load + load/store barrier</em></code></td>
<td><code><em>store/store barrier</em><br />
A = reg<br />
<em>store/load barrier</em></code></td>
</tr>
</table>

<p>The volatile load is just an acquiring load.  The volatile store is similar
to a releasing store, but we’ve omitted load/store from the pre-store barrier,
and added a store/load barrier afterward.</p>

<p>What we’re really trying to guarantee, though, is that (using thread 1 as an
example) the write to flag1 is observed before the read of flag2.  We could
issue the store/load barrier before the volatile load instead and get the same
result, but because loads tend to outnumber stores it’s best to associate it
with the store.</p>

<p>On some architectures, it’s possible to implement volatile stores with an
atomic operation and skip the explicit store/load barrier. On x86, for example,
atomics provide a full barrier. The ARM LL/SC operations don’t include a
barrier, so for ARM we must use explicit barriers.</p>

<p>(Much of this is due to Doug Lea and his “JSR-133 Cookbook for Compiler
Writers” page.)</p>

<h3 id="more">Further reading</h3>

<p>Web pages and documents that provide greater depth or breadth.  The more generally useful articles are nearer the top of the list.</p>

<dl>
<dt>Shared Memory Consistency Models: A Tutorial</dt>
<dd>Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>

<dt>Memory Barriers</dt>
<dd>Nice little article summarizing the issues.
<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>

<dt>Threads Basics</dt>
<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm.  Excellent discussion of data races and basic synchronization methods.
<br /><a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html">http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html</a></dd>

<dt>Java Concurrency In Practice</dt>
<dd>Published in 2006, this book covers a wide range of topics in great detail.  Highly recommended for anyone writing multi-threaded code in Java.
<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>

<dt>JSR-133 (Java Memory Model) FAQ</dt>
<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>

<dt>Overview of package java.util.concurrent</dt>
<dd>The documentation for the <code>java.util.concurrent</code> package.  Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>

<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>

<dt>Java Theory and Practice: Managing Volatility</dt>
<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>

<dt>The “Double-Checked Locking is Broken” Declaration</dt>
<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken.  Includes C/C++ and Java.
<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>

<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code.  If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this.  Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly).
<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>

<dt>Linux Kernel Memory Barriers</dt>
<dd>Documentation for Linux kernel memory barriers.  Includes some useful examples and ASCII art.
<br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>

<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (“Atomic operations library”)</dt>
<dd>Draft standard for C++ atomic operation features.
<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf</a>
<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>

<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h&gt;”)</dt>
<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.)
<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf</a></dd>

<dt>Dekker’s algorithm</dt>
<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”.  The Wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>

<dt>Comments on ARM vs. Alpha and address dependencies</dt>
<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas.  Includes a nice summary of address and control dependencies.
<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>

<dt>What Every Programmer Should Know About Memory</dt>
<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>

<dt>Reasoning about the ARM weakly consistent memory model</dt>
<dd>This paper was written by Chong & Ishtiaq of ARM, Ltd.  It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion.  The definition of “observability” used here comes from this paper.
<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>

<dt>The JSR-133 Cookbook for Compiler Writers</dt>
<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation.  It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation.
<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>

<dt>The Semantics of Power and ARM Multiprocessor Machine Code</dt>
<dd>If you prefer your explanations in rigorous mathematical form, this is a fine place to go next.
<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf</a></dd>
</dl>