aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc/mm/numa.c
Commit message (Collapse)AuthorAgeFilesLines
* powerpc/mm/numa: Fix break placementAndrey Utkin2014-09-131-1/+1
| | | | | | | | | | commit b00fc6ec1f24f9d7af9b8988b6a198186eb3408c upstream. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631 Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc: fix numa distance for form0 device treeVaidyanathan Srinivasan2013-05-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7122beeee7bc1757682049780179d7c216dd1c83 upstream. The following commit breaks numa distance setup for old powerpc systems that use form0 encoding in device tree. commit 41eab6f88f24124df89e38067b3766b7bef06ddb powerpc/numa: Use form 1 affinity to setup node distance Device tree node /rtas/ibm,associativity-reference-points would index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or form1 encoding detected by ibm,architecture-vec-5 property. All modern systems use form1 and current kernel code is correct. However, on older systems with form0 encoding, the numa distance will get hard coded as LOCAL_DISTANCE for all nodes. This causes task scheduling anomaly since scheduler will skip building numa level domain (topmost domain with all cpus) if all numa distances are same. (value of 'level' in sched_init_numa() will remain 0) Prior to the above commit: ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) Restoring compatible behavior with this patch for old powerpc systems with device tree where numa distance are encoded as form0. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc/numa: NUMA topology support for PowerNVDipankar Sarma2011-11-081-7/+17
| | | | | | | | | | This patch adds support for numa topology on powernv platforms running OPAL formware. It checks for the type of platform at run time and sets the affinity form correctly so that NUMA topology can be discovered correctly. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds2011-11-061-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
| * powerpc: various straight conversions from module.h --> export.hPaul Gortmaker2011-10-311-1/+1
| | | | | | | | | | | | | | | | | | All these files were including module.h just for the basic EXPORT_SYMBOL infrastructure. We can shift them off to the export.h header which is a way smaller footprint and thus realize some compile time gains. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* | powerpc: Coding style cleanupsAnton Blanchard2011-09-201-3/+4
| | | | | | | | | | | | | | | | While converting code to use for_each_node_by_type I noticed a number of coding style issues. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: Use for_each_node_by_type instead of open coding itAnton Blanchard2011-09-201-5/+5
| | | | | | | | | | | | | | Use for_each_node_by_type instead of open coding it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc/numa: Remove double of_node_put in hot_add_node_scn_to_nidAnton Blanchard2011-09-201-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | During memory hotplug testing, I got the following warning: ERROR: Bad of_node_put() on /memory@0 of_node_release kref_put of_node_put of_find_node_by_type hot_add_node_scn_to_nid hot_add_scn_to_nid memory_add_physaddr_to_nid ... of_find_node_by_type() loop does the of_node_put for us so we only need the handle the case where we terminate the loop early. As suggested by Stephen Rothwell we can do the of_node_put unconditionally outside of the loop since of_node_put handles a NULL argument fine. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Convert old cpumask API into new oneKOSAKI Motohiro2011-05-041-1/+1
| | | | | | | | | | | | Adapt new API. Almost change is trivial. Most important change is the below line because we plan to change task->cpus_allowed implementation. - ctx->cpus_allowed = current->cpus_allowed; Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Look for ibm, associativity-reference-points at the rootMichael Ellerman2011-04-271-8/+7
| | | | | | | | | | If we don't find ibm,associativity-reference-points as a child of /rtas, look for it at the root of the tree instead. We use this on Book3E where we have no RTAS but still use the sPAPR conventions for NUMA. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Fix common misspellingsLucas De Marchi2011-03-311-5/+5
| | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* powerpc/pseries: Disable VPNH featureBenjamin Herrenschmidt2011-03-101-1/+2
| | | | | | | | | | This feature triggers nasty races in the scheduler between the rebuilding of the topology and the load balancing code, causing the machine to hang. Disable it for now until the races are fixed. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Fix bug in unmap_cpu_from_nodeAnton Blanchard2011-02-071-1/+1
| | | | | | | | | | | | | | | | | | When converting to the new cpumask code I screwed up: - if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) { - cpu_clear(cpu, numa_cpumask_lookup_table[node]); + if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) { + cpumask_set_cpu(cpu, node_to_cpumask_map[node]); This was introduced in commit 25863de07af9 (powerpc/cpumask: Convert NUMA code to new cpumask API) Fix it. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Disable VPHN on dedicated processor partitionsAnton Blanchard2011-02-071-1/+2
| | | | | | | | There is no need to start up the timer and monitor topology changes on a dedicated processor partition, so disable it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Add length when creating OF properties via VPHNAnton Blanchard2011-02-071-3/+9
| | | | | | | | | The rest of the NUMA code expects an OF associativity property with the first cell containing the length. Without this fix all topology changes cause us to misparse the property and put the cpu into node 0. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Check for all VPHN changesAnton Blanchard2011-02-071-1/+1
| | | | | | | | | The hypervisor uses unsigned 1 byte counters to signal topology changes to the OS. Since they can wrap we need to check for any difference, not just if the hypervisor count is greater than the previous count. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Only use active VPHN count fieldsAnton Blanchard2011-02-071-4/+6
| | | | | | | | | | | | | | VPHN supports up to 8 distance fields but the number of entries in ibm,associativity-reference-points signifies how many are in use. Don't look at all the VPHN counts, only distance_ref_points_depth worth. Since we already cap our distance metrics at MAX_DISTANCE_REF_POINTS, use that to size the VPHN arrays and add a BUILD_BUG_ON to avoid it growing larger than the VPHN maximum of 8. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Remove unnecessary variable initializations in numa.cJesse Larrew2011-02-071-9/+8
| | | | | | | Remove unnecessary variable initializations in VPHN functions. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Fix brace placement in numa.cJesse Larrew2011-02-071-6/+3
| | | | | | | Fix brace placement in VPHN code. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Fix typo in VPHN commentsJesse Larrew2011-02-071-1/+1
| | | | | | | Correct a spelling error in VPHN comments in numa.c. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Fix build of topology stuff without CONFIG_NUMABenjamin Herrenschmidt2011-01-121-1/+0
| | | | Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Fix VPHN build errors on non-SMP systemsJesse Larrew2011-01-111-3/+6
| | | | | | | | | | | | | | | | | | | | The header asm/hvcall.h was previously included indirectly via smp.h. On non-SMP systems, however, these declarations are excluded and the build breaks. This is easily fixed by including asm/hvcall.h directly. The VPHN feature is only meaningful on NUMA systems that implement the SPLPAR option, so exclude the VPHN code on systems without SPLPAR enabled. Also, expose unmap_cpu_from_node() on systems with SPLPAR enabled, even if CONFIG_HOTPLUG_CPU is disabled. Lastly, map_cpu_to_node() is now needed by VPHN to manipulate the node masks after boot time, so remove the __cpuinit annotation to fix a section mismatch. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
* powerpc/pseries: Poll VPA for topology changes and update NUMA mapsJesse Larrew2010-12-091-10/+267
| | | | | | | | | | | | | This patch sets a timer during boot that will periodically poll the associativity change counters in the VPA. When a change in associativity is detected, it retrieves the new associativity domain information via the H_HOME_NODE_ASSOCIATIVITY hcall and updates the NUMA node maps and sysfs entries accordingly. Note that since the ibm,associativity device tree property does not exist on configurations with both NUMA and SPLPAR enabled, no device tree updates are necessary. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Add memory_hotplug_max()Nishanth Aravamudan2010-11-291-0/+26
| | | | | | | | | | Add a function to get the maximum address that can be hotplug added. This is needed to calculate the size of the tce table needed to cover all memory in 1:1 mode. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* memblock, bootmem: Round pfn properly for memory and reserved regionsYinghai Lu2010-10-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | We need to round memory regions correctly -- specifically, we need to round reserved region in the more expansive direction (lower limit down, upper limit up) whereas usable memory regions need to be rounded in the more restrictive direction (lower limit up, upper limit down). This introduces two set of inlines: memblock_region_memory_base_pfn() memblock_region_memory_end_pfn() memblock_region_reserved_base_pfn() memblock_region_reserved_end_pfn() Although they are antisymmetric (and therefore are technically duplicates) the use of the different inlines explicitly documents the programmer's intention. The lack of proper rounding caused a bug on ARM, which was then found to also affect other architectures. Reported-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4CB4CDFD.4020105@kernel.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* Merge commit 'v2.6.36-rc3' into x86/memblockIngo Molnar2010-08-311-33/+89
|\ | | | | | | | | | | | | | | | | | | Conflicts: arch/x86/kernel/trampoline.c mm/memblock.c Merge reason: Resolve the conflicts, update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * Merge branch 'next' of ↵Linus Torvalds2010-08-051-33/+89
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (79 commits) powerpc/8xx: Add support for the MPC8xx based boards from TQC powerpc/85xx: Introduce support for the Freescale P1022DS reference board powerpc/85xx: Adding DTS for the STx GP3-SSA MPC8555 board powerpc/85xx: Change deprecated binding for 85xx-based boards powerpc/tqm85xx: add a quirk for ti1520 PCMCIA bridge powerpc/tqm85xx: update PCI interrupt-map attribute powerpc/mpc8308rdb: support for MPC8308RDB board from Freescale powerpc/fsl_pci: add quirk for mpc8308 pcie bridge powerpc/85xx: Cleanup QE initialization for MPC85xxMDS boards powerpc/85xx: Fix booting for P1021MDS boards powerpc/85xx: Fix SWIOTLB initalization for MPC85xxMDS boards powerpc/85xx: kexec for SMP 85xx BookE systems powerpc/5200/i2c: improve i2c bus error recovery of/xilinxfb: update tft compatible versions powerpc/fsl-diu-fb: Support setting display mode using EDID powerpc/5121: doc/dts-bindings: update doc of FSL DIU bindings powerpc/5121: shared DIU framebuffer support powerpc/5121: move fsl-diu-fb.h to include/linux powerpc/5121: fsl-diu-fb: fix issue with re-enabling DIU area descriptor powerpc/512x: add clock structure for Video-IN (VIU) unit ...
| | * Merge commit 'gcl/next' into nextBenjamin Herrenschmidt2010-08-041-42/+42
| | |\
| | * | powerpc/numa: Use form 1 affinity to setup node distanceAnton Blanchard2010-07-091-33/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Form 1 affinity allows multiple entries in ibm,associativity-reference-points which represent affinity domains in decreasing order of importance. The Linux concept of a node is always the first entry, but using the other values as an input to node_distance() allows the memory allocator to make better decisions on which node to go first when local memory has been exhausted. We keep things simple and create an array indexed by NUMA node, capped at 4 entries. Each time we lookup an associativity property we initialise the array which is overkill, but since we should only hit this path during boot it didn't seem worth adding a per node valid bit. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | | | memblock/powerpc: Use new accessorsBenjamin Herrenschmidt2010-08-041-8/+9
|/ / / | | | | | | | | | Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | | powerpc: Fix erroneous lmb->memblock conversionsBenjamin Herrenschmidt2010-07-231-12/+12
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | Oooops... we missed these. We incorrectly converted strings used when parsing the device-tree on pseries, thus breaking access to drconf memory and hotplug memory. While at it, also revert some variable names that represent something the FW calls "lmb" and thus don't need to be converted to "memblock". Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> ---
* | lmb: rename to memblockYinghai Lu2010-07-141-42/+42
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | via following scripts FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/lmb/memblock/g' \ -e 's/LMB/MEMBLOCK/g' \ $FILES for N in $(find . -name lmb.[ch]); do M=$(echo $N | sed 's/lmb/memblock/g') mv $N $M done and remove some wrong change like lmbench and dlmb etc. also move memblock.c from lib/ to mm/ Suggested-by: Ingo Molnar <mingo@elte.hu> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Use ibm,architecture-vec-5 to detect form 1 affinityAnton Blanchard2010-05-211-8/+12
| | | | | | | | | | | | | I've been told that the architected way to determine we are in form 1 affinity mode is by reading the ibm,architecture-vec-5 property which mirrors the layout of the fifth vector of the ibm,client-architecture structure. Eventually we may want to parse the ibm,architecture-vec-5 and create FW_FEATURE_* bits. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge commit 'origin/master' into nextBenjamin Herrenschmidt2010-05-071-2/+15
|\
| * powerpc/numa: Add form 1 NUMA affinityAnton Blanchard2010-04-281-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Firmware changed the way it represents memory and cpu affinity on POWER7. Unfortunately the old method now caps the topology to work around issues with legacy operating systems. For Linux to get the correct topology we need to use the new form 1 affinity information. We set the form 1 field in the client architecture, and if we see "1" in the ibm,associativity-form property firmware supports form 1 affinity and we should look at the first field in the ibm,associativity-reference-points array. If not we use the second field as we always have. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc/cpumask: Convert NUMA code to new cpumask APIAnton Blanchard2010-05-061-13/+45
|/ | | | | | | | | Convert NUMA code to new cpumask API. We shift the node to cpumask setup code until after we complete bootmem allocation so we can dynamically allocate the cpumasks. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* nodemask.h: remove macro any_online_nodeH Hartley Sweeten2010-03-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macro any_online_node() is prone to producing sparse warnings due to the local symbol 'node'. Since all the in-tree users are really requesting the first online node (the mask argument is either NODE_MASK_ALL or node_online_map) just use the first_online_node macro and remove the any_online_node macro since there are no users. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Milton Miller <miltonm@bga.com> Cc: Nathan Fontenot <nfont@austin.ibm.com> Cc: Geoff Levand <geoffrey.levand@am.sony.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: David S. Miller <davem@davemloft.net> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* powerpc: Set init_bootmem_done on NUMA platforms as wellBenjamin Herrenschmidt2009-06-091-0/+2
| | | | | | | | | | | | For some obscure reason, we only set init_bootmem_done after initializing bootmem when NUMA isn't enabled. We even document this next to the declaration of that global in system.h which of course I didn't read before I had to debug why some WIP code wasn't working properly... This patch changes it so that we always set it after bootmem is initialized which should have always been the case... go figure ! Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/numa: Cleanup hot_add_scn_to_nidNathan Fontenot2009-02-231-66/+73
| | | | | | | | | | | | | | | This patch reworks the hot_add_scn_to_nid and its supporting functions to make them easier to understand. There are no functional changes in this patch and has been tested on machine with memory represented in the device tree as memory nodes and in the ibm,dynamic-memory property. My previous patch that introduced support for hotplug memory add on systems whose memory was represented by the ibm,dynamic-memory property of the device tree only left the code more unintelligible. This will hopefully makes things easier to understand. Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge commit 'origin/master' into nextBenjamin Herrenschmidt2009-02-181-2/+3
|\ | | | | | | | | Manual merge of: arch/powerpc/include/asm/pgtable-ppc32.h
| * powerpc/mm: Fix numa reserve bootmem page selectionDave Hansen2009-02-131-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the powerpc NUMA reserve bootmem page selection logic. commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb (powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes) changed the logic for how the powerpc LMB reserved regions were converted to bootmen reserved regions. As the folowing discussion reports, the new logic was not correct. mark_reserved_regions_for_nid() goes through each LMB on the system that specifies a reserved area. It searches for active regions that intersect with that LMB and are on the specified node. It attempts to bootmem-reserve only the area where the active region and the reserved LMB intersect. We can not reserve things on other nodes as they may not have bootmem structures allocated, yet. We base the size of the bootmem reservation on two possible things. Normally, we just make the reservation start and stop exactly at the start and end of the LMB. However, the LMB reservations are not aware of NUMA nodes and on occasion a single LMB may cross into several adjacent active regions. Those may even be on different NUMA nodes and will require separate calls to the bootmem reserve functions. So, the bootmem reservation must be trimmed to fit inside the current active region. That's all fine and dandy, but we trim the reservation in a page-aligned fashion. That's bad because we start the reservation at a non-page-aligned address: physbase. The reservation may only span 2 bytes, but that those bytes may span two pfns and cause a reserve_size of 2*PAGE_SIZE. Take the case where you reserve 0x2 bytes at 0x0fff and where the active region ends at 0x1000. You'll jump into that if() statment, but node_ar.end_pfn=0x1 and start_pfn=0x0. You'll end up with a reserve_size=0x1000, and then call reserve_bootmem_node(node, physbase=0xfff, size=0x1000); 0x1000 may not be on the same node as 0xfff. Oops. In almost all the vm code, end_<anything> is not inclusive. If you have an end_pfn of 0x1234, page 0x1234 is not included in the range. Using PFN_UP instead of the (>> >> PAGE_SHIFT) will make this consistent with the other VM code. We also need to do math for the reserved size with physbase instead of start_pfn. node_ar.end_pfn << PAGE_SHIFT is *precisely* the end of the node. However, (start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning of the reserved area. That is, of course, physbase. If we don't use physbase here, the reserve_size can be made too large. From: Dave Hansen <dave@linux.vnet.ibm.com> Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Tested on PS3. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc/numa: Remove redundant find_cpu_node()Milton Miller2009-02-111-31/+2
| | | | | | | | | | | | | | | | Use of_get_cpu_node, which is a superset of numa.c's find_cpu_node in a less restrictive section (text vs cpuinit). Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc/numa: Avoid possible reference beyond prop. length in ↵Milton Miller2009-02-111-1/+1
|/ | | | | | | | | | | find_min_common_depth() find_min_common_depth() was checking the property length incorrectly. The value is in bytes not cells, and it is using the second entry. Signed-off-By: Milton Miller <miltonm@bga.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm: Cleanup careful_allocation(): consolidate memset()Dave Hansen2009-01-081-6/+5
| | | | | | | | | | | Both users of careful_allocation() immediately memset() the result. So, just do it in one place. Also give careful_allocation() a 'z' prefix to bring it in line with kzmalloc() and friends. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm: Make careful_allocation() return virtual addrsDave Hansen2009-01-081-17/+20
| | | | | | | | | | | | | | | | | | Since we memset() the result in both of the uses here, just make careful_alloc() return a virtual address. Also, add a separate variable to store the physial address that comes back from the lmb_alloc() functions. This makes it less likely that someone will screw it up forgetting to convert before returning since the vaddr is always in a void* and the paddr is always in an unsigned long. I admit this is arbitrary since one of its users needs a paddr and one a vaddr, but it does remove a good number of casts. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm:: Cleanup careful_allocation(): bootmem already panicsDave Hansen2009-01-081-5/+1
| | | | | | | | | | | | | | If we fail a bootmem allocation, the bootmem code itself panics. No need to redo it here. Also change the wording of the other panic. We don't strictly have to allocate memory on the specified node. It is just a hint and that node may not even *have* any memory on it. In that case we can and do fall back to other nodes. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm: Add better comment on careful_allocation()Dave Hansen2009-01-081-2/+10
| | | | | | | | | The behavior in careful_allocation() really confused me at first. Add a comment to hopefully make it easier on the next doofus that looks at it. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Fix bootmem reservation on uninitialized nodeDave Hansen2008-12-161-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | careful_allocation() was calling into the bootmem allocator for nodes which had not been fully initialized and caused a previous bug: http://patchwork.ozlabs.org/patch/10528/ So, I merged a few broken out loops in do_init_bootmem() to fix it. That changed the code ordering. I think this bug is triggered by having reserved areas for a node which are spanned by another node's contents. In the mark_reserved_regions_for_nid() code, we attempt to reserve the area for a node before we have allocated the NODE_DATA() for that nid. We do this since I reordered that loop. I suck. This is causing crashes at bootup on some systems, as reported by Jon Tollefson. This may only present on some systems that have 16GB pages reserved. But, it can probably happen on any system that is trying to reserve large swaths of memory that happen to span other nodes' contents. This commit ensures that we do not touch bootmem for any node which has not been initialized, and also removes a compile warning about an unused variable. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
* powerpc: Fix boot freeze on machine with empty memory nodeDave Hansen2008-12-011-47/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got a bug report about a distro kernel not booting on a particular machine. It would freeze during boot: > ... > Could not find start_pfn for node 1 > [boot]0015 Setup Done > Built 2 zonelists in Node order, mobility grouping on. Total pages: 123783 > Policy zone: DMA > Kernel command line: > [boot]0020 XICS Init > [boot]0021 XICS Done > PID hash table entries: 4096 (order: 12, 32768 bytes) > clocksource: timebase mult[7d0000] shift[22] registered > Console: colour dummy device 80x25 > console handover: boot [udbg0] -> real [hvc0] > Dentry cache hash table entries: 1048576 (order: 7, 8388608 bytes) > Inode-cache hash table entries: 524288 (order: 6, 4194304 bytes) > freeing bootmem node 0 I've reproduced this on 2.6.27.7. It is caused by commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb ("powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes"). The problem is that Jon took a loop which was (in pseudocode): for_each_node(nid) NODE_DATA(nid) = careful_alloc(nid); setup_bootmem(nid); reserve_node_bootmem(nid); and broke it up into: for_each_node(nid) NODE_DATA(nid) = careful_alloc(nid); setup_bootmem(nid); for_each_node(nid) reserve_node_bootmem(nid); The issue comes in when the 'careful_alloc()' is called on a node with no memory. It falls back to using bootmem from a previously-initialized node. But, bootmem has not yet been reserved when Jon's patch is applied. It gives back bogus memory (0xc000000000000000) and pukes later in boot. The following patch collapses the loop back together. It also breaks the mark_reserved_regions_for_nid() code out into a function and adds some comments. I think a huge part of introducing this bug is because for loop was too long and hard to read. The actual bug fix here is the: + if (end_pfn <= node->node_start_pfn || + start_pfn >= node_end_pfn) + continue; Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
* powerpc: Always trim numa memory to lmb_end_of_DRAM()Milton Miller2008-10-211-4/+2
| | | | | | | | | | | | numa_enforce_memory_limit tried to be smart and only call lmb_end_of_DRAM when a memory limit was set via mem= on the command line. However, the early boot code will also limit memory added to the lmb system when iommu=off is specified. When this happens, the page allocator is given pages not in the linear mapping and this results in a fatal data reference to the unmapped page. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>