aboutsummaryrefslogtreecommitdiffstats
path: root/arch/ia64/mm/discontig.c
Commit message (Collapse)AuthorAgeFilesLines
* arch, mm: filter disallowed nodes from arch specific show_mem functionsDavid Rientjes2011-05-251-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Architectures that implement their own show_mem() function did not pass the filter argument to show_free_areas() to appropriately avoid emitting the state of nodes that are disallowed in the current context. This patch now passes the filter argument to show_free_areas() so those nodes are now avoided. This patch also removes the show_free_areas() wrapper around __show_free_areas() and converts existing callers to pass an empty filter. ia64 emits additional information for each node, so skip_free_areas_zone() must be made global to filter disallowed nodes and it is converted to use a nid argument rather than a zone for this use case. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib, arch: add filter argument to show_mem and fix private implementationsDavid Rientjes2011-03-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* percpu: remove per_cpu__ prefix.Rusty Russell2009-10-291-1/+1
| | | | | | | | | | | | | | | | | | Now that the return from alloc_percpu is compatible with the address of per-cpu vars, it makes sense to hand around the address of per-cpu variables. To make this sane, we remove the per_cpu__ prefix we used created to stop people accidentally using these vars directly. Now we have sparse, we can use that (next patch). tj: * Updated to convert stuff which were missed by or added after the original patch. * Kill per_cpu_var() macro. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
* percpu: make percpu symbols in ia64 uniqueTejun Heo2009-10-291-2/+3
| | | | | | | | | | | | | | | | | | | This patch updates percpu related symbols in ia64 such that percpu symbols are unique and don't clash with local symbols. This serves two purposes of decreasing the possibility of global percpu symbol collision and allowing dropping per_cpu__ prefix from percpu symbols. * arch/ia64/kernel/setup.c: s/cpu_info/ia64_cpu_info/ Partly based on Rusty Russell's "alloc_percpu: rename percpu vars which cause name clashes" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: linux-ia64@vger.kernel.org
* ia64: convert to dynamic percpu allocatorTejun Heo2009-10-021-0/+85
| | | | | | | | | | | | | | | | | | | | | | | Unlike other archs, ia64 reserves space for percpu areas during early memory initialization. These areas occupy a contiguous region indexed by cpu number on contiguous memory model or are grouped by node on discontiguous memory model. As allocation and initialization are done by the arch code, all that setup_per_cpu_areas() needs to do is communicating the determined layout to the percpu allocator. This patch implements setup_per_cpu_areas() for both contig and discontig memory models and drops HAVE_LEGACY_PER_CPU_AREA. Please note that for contig model, the allocation itself is modified only to allocate for possible cpus instead of NR_CPUS. As dynamic percpu allocator can handle non-direct mapping, there's no reason to allocate memory for cpus which aren't possible. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: linux-ia64 <linux-ia64@vger.kernel.org>
* ia64: allocate percpu area for cpu0 like percpu areas for other cpusTejun Heo2009-10-021-11/+24
| | | | | | | | | | | | | | | | | | | | | cpu0 used special percpu area reserved by the linker, __cpu0_per_cpu, which is set up early in boot by head.S. However, this doesn't guarantee that the area will be on the same node as cpu0 and the percpu area for cpu0 ends up very far away from percpu areas for other cpus which cause problems for congruent percpu allocator. This patch makes percpu area initialization allocate percpu area for cpu0 like any other cpus and copy it from __cpu0_per_cpu which now resides in the __init area. This means that for cpu0, percpu area is first setup at __cpu0_per_cpu early by head.S and then moved to an area in the linear mapping during memory initialization and it's not allowed to take a pointer to percpu variables between head.S and memory initialization. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: linux-ia64 <linux-ia64@vger.kernel.org>
* ia64: don't alias VMALLOC_END to vmalloc_endTejun Heo2009-10-021-2/+2
| | | | | | | | | | | | | | | | | | | | If CONFIG_VIRTUAL_MEM_MAP is enabled, ia64 defines macro VMALLOC_END as unsigned long variable vmalloc_end which is adjusted to prepare room for vmemmap. This becomes probnlematic if a local variables vmalloc_end is defined in some function (not very unlikely) and VMALLOC_END is used in the function - the function thinks its referencing the global VMALLOC_END value but would be referencing its own local vmalloc_end variable. There's no reason VMALLOC_END should be a macro. Just define it as an unsigned long variable if CONFIG_VIRTUAL_MEM_MAP is set to avoid nasty surprises. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: linux-ia64 <linux-ia64@vger.kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org>
* [IA64] fix the difference between node_mem_map and node_start_pfnKen'ichi Ohmichi2008-11-041-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | makedumpfile[1] cannot run on ia64 discontigmem kernel, because the member node_mem_map of struct pgdat_list has invalid value. This patch fixes it. node_start_pfn shows the start pfn of each node, and node_mem_map should point 'struct page' of each node's node_start_pfn. On my machine, node0's node_start_pfn shows 0x400 and its node_mem_map points 0xa0007fffbf000000. This address is the same as vmem_map, so the node_mem_map points 'struct page' of pfn 0, even if its node_start_pfn shows 0x400. The cause is due to the round down of min_pfn in count_node_pages() and node0's node_mem_map points 'struct page' of inactive pfn (0x0). This patch fixes it. makedumpfile[1]: dump filtering command https://sourceforge.net/projects/makedumpfile/ Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Cc: Bernhard Walle <bwalle@suse.de> Cc: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] Put the space for cpu0 per-cpu area into .data sectionTony Luck2008-09-291-1/+1
| | | | | | | | | Initial fix for making sure that we can access percpu variables in all C code (commit: 10617bbe84628eb18ab5f723d3ba35005adde143) inadvertantly allocated the memory in the "percpu" section of the vmlinux ELF executable. This confused kexec/dump. Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] Ensure cpu0 can access per-cpu variables in early boot codeTony Luck2008-08-121-1/+5
| | | | | | | | | | | | | | | | | | | | ia64 handles per-cpu variables a litle differently from other architectures in that it maps the physical memory allocated for each cpu at a constant virtual address (0xffffffffffff0000). This mapping is not enabled until the architecture specific cpu_init() function is run, which causes problems since some generic code is run before this point. In particular when CONFIG_PRINTK_TIME is enabled, the boot cpu will trap on the access to per-cpu memory at the first printk() call so the boot will fail without the kernel printing anything to the console. Fix this by allocating percpu memory for cpu0 in the kernel data section and doing all initialization to enable percpu access in head.S before calling any generic code. Other cpus must take care not to access per-cpu variables too early, but their code path from start_secondary() to cpu_init() is all in arch/ia64 Signed-off-by: Tony Luck <tony.luck@intel.com>
* bootmem: replace node_boot_start in struct bootmem_dataJohannes Weiner2008-07-241-9/+10
| | | | | | | | | | | | Almost all users of this field need a PFN instead of a physical address, so replace node_boot_start with node_min_pfn. [Lee.Schermerhorn@hp.com: fix spurious BUG_ON() in mark_bootmem()] Signed-off-by: Johannes Weiner <hannes@saeureba.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move bootmem descriptors definition to a single placeJohannes Weiner2008-07-241-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | There are a lot of places that define either a single bootmem descriptor or an array of them. Use only one central array with MAX_NUMNODES items instead. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Pull miscellaneous into release branchTony Luck2008-04-171-3/+1
|\ | | | | | | | | | | Conflicts: arch/ia64/kernel/mca.c
| * [IA64] Fix NUMA configuration issueZoltan Menyhart2008-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a NUMA memory configuration issue in 2.6.24: A 2-node machine of ours has got the following memory layout: Node 0: 0 - 2 Gbytes Node 0: 4 - 8 Gbytes Node 1: 8 - 16 Gbytes Node 0: 16 - 18 Gbytes "efi_memmap_init()" merges the three last ranges into one. "register_active_ranges()" is called as follows: efi_memmap_walk(register_active_ranges, NULL); i.e. once for the 4 - 18 Gbytes range. It picks up the node number from the start address, and registers all the memory for the node #0. "register_active_ranges()" should be called as follows to make sure there is no merged address range at its entry: efi_memmap_walk(filter_memory, register_active_ranges); "filter_memory()" is similar to "filter_rsvd_memory()", but the reserved memory ranges are not filtered out. Signed-off-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net> Signed-off-by: Tony Luck <tony.luck@intel.com>
| * [IA64] remove redundant display of free swap space in show_mem()Johannes Weiner2008-04-091-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | show_mem() has no need to print the amount of free swap space manually because show_free_areas() does this already and is called by the former. The two outputs only differ in text formatting: printk("Free swap = %lukB\n", ...); printk("Free swap: %6ldkB\n", ...); Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | [IA64] Minimize per_cpu reservations.holt@sgi.com2008-04-081-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | This attached patch significantly shrinks boot memory allocation on ia64. It does this by not allocating per_cpu areas for cpus that can never exist. In the case where acpi does not have any numa node description of the cpus, I defaulted to assigning the first 32 round-robin on the known nodes.. For the !CONFIG_ACPI I used for_each_possible_cpu(). Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | [IA64] Correct pernodesize calculation.holt@sgi.com2008-04-081-0/+1
|/ | | | | | | | | | | | A simple fix. The existing pernodesize reservation is not taking into account a second array of pg_data_t structures. This is normally not important because the PAGE_ALIGN macro reserves adequate space. I made the compute_pernodesize steps in the same order as the fill_pernode steps to make the correlation more clear. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* Introduce flags for reserve_bootmem()Bernhard Walle2008-02-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | This patchset adds a flags variable to reserve_bootmem() and uses the BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions between crashkernel area and already used memory. This patch: Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE. If that flag is set, the function returns with -EBUSY if the memory already has been reserved in the past. This is to avoid conflicts. Because that code runs before SMP initialisation, there's no race condition inside reserve_bootmem_core(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix powerpc build] Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add vmcoreinfoKen'ichi Ohmichi2007-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch set frees the restriction that makedumpfile users should install a vmlinux file (including the debugging information) into each system. makedumpfile command is the dump filtering feature for kdump. It creates a small dumpfile by filtering unnecessary pages for the analysis. To distinguish unnecessary pages, it needs a vmlinux file including the debugging information. These days, the debugging package becomes a huge file, and it is hard to install it into each system. To solve the problem, kdump developers discussed it at lkml and kexec-ml. As the result, we reached the conclusion that necessary information for dump filtering (called "vmcoreinfo") should be embedded into the first kernel file and it should be accessed through /proc/vmcore during the second kernel. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html) Dan Aloni created the patch set for the above implementation. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html) And I updated it for multi architectures and memory models. (http://lists.infradead.org/pipermail/kexec/2007-August/000479.html) Signed-off-by: Dan Aloni <da-x@monatomic.org> Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* IA64: SPARSEMEM_VMEMMAP 16K page size supportChristoph Lameter2007-10-161-0/+8
| | | | | | | | | | | | | | | | | | | | Equip IA64 sparsemem with a virtual memmap. This is similar to the existing CONFIG_VIRTUAL_MEM_MAP functionality for DISCONTIGMEM. It uses a PAGE_SIZE mapping. This is provided as a minimally intrusive solution. We split the 128TB VMALLOC area into two 64TB areas and use one for the virtual memmap. This should replace CONFIG_VIRTUAL_MEM_MAP long term. [apw@shadowen.org: convert to new helper based initialisation] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [IA64] Stop bogus NMI & softlockup warnings in ia64 show_memPrarit Bhargava2007-09-011-0/+3
| | | | | | | | | | | When dumping memory via sysrq-m it is possible to take a bogus NMI watchdog or softlockup watchdog because the dump can take a long time on big memory systems. Occasionally tickle the watchdog when doing the dump. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] spelling fixes: arch/ia64/Simon Arlott2007-05-111-1/+1
| | | | | | | Spelling and apostrophe fixes in arch/ia64/. Signed-off-by: Simon Arlott <simon@fire.lp0.eu> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] Quicklist support for IA64Christoph Lameter2007-05-111-1/+1
| | | | | | | | | IA64 is the origin of the quicklist implementation. So cut out the pieces that are now in core code and modify the functions called. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* Fix section mismatch of memory hotplug related code.Yasunori Goto2007-05-081-0/+2
| | | | | | | | | This is to fix many section mismatches of code related to memory hotplug. I checked compile with memory hotplug on/off on ia64 and x86-64 box. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [IA64] min_low_pfn and max_low_pfn calculation fixZou Nan hai2007-03-201-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have seen bad_pte_print when testing crashdump on an SN machine in recent 2.6.20 kernel. There are tons of bad pte print (pfn < max_low_pfn) reports when the crash kernel boots up, all those reported bad pages are inside initmem range; That is because if the crash kernel code and data happens to be at the beginning of the 1st node. build_node_maps in discontig.c will bypass reserved regions with filter_rsvd_memory. Since min_low_pfn is calculated in build_node_map, so in this case, min_low_pfn will be greater than kernel code and data. Because pages inside initmem are freed and reused later, we saw pfn_valid check fail on those pages. I think this theoretically happen on a normal kernel. When I check min_low_pfn and max_low_pfn calculation in contig.c and discontig.c. I found more issues than this. 1. min_low_pfn and max_low_pfn calculation is inconsistent between contig.c and discontig.c, min_low_pfn is calculated as the first page number of boot memmap in contig.c (Why? Though this may work at the most of the time, I don't think it is the right logic). It is calculated as the lowest physical memory page number bypass reserved regions in discontig.c. max_low_pfn is calculated include reserved regions in contig.c. It is calculated exclude reserved regions in discontig.c. 2. If kernel code and data region is happen to be at the begin or the end of physical memory, when min_low_pfn and max_low_pfn calculation is bypassed kernel code and data, pages in initmem will report bad. 3. initrd is also in reserved regions, if it is at the begin or at the end of physical memory, kernel will refuse to reuse the memory. Because the virt_addr_valid check in free_initrd_mem. So it is better to fix and clean up those issues. Calculate min_low_pfn and max_low_pfn in a consistent way. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Acked-by: Jay Lan <jlan@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] point saved_max_pfn to the max_pfn of the entire systemHorms2007-03-061-6/+0
| | | | | | | | | | | | | | | | Make saved_max_pfn point to max_pfn of entire system. Without this patch is so that vmcore is zero length on ia64. This is because saved_max_pfn was wrongly being set to the max_pfn of the crash kernel's address space, rather than the max_pfg on the physical memory of the machine - the whole purpose of vmcore is to access physical memory that is not part of the crash kernel's addresss space. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Sort-Of-Acked-By: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [PATCH] optional ZONE_DMA: optional ZONE_DMA for ia64Christoph Lameter2007-02-111-0/+6
| | | | | | | | | | | | ZONE_DMA less operation for IA64 SGI platform Disable ZONE_DMA for SGI SN2. All memory is addressable by all devices and we do not need any special memory pool. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [IA64] clean up sparsemem memory_present callBob Picco2007-02-051-33/+3
| | | | | | | | | | Eliminate arch specific memory_present call ia64 NUMA by utilizing sparse_memory_present_with_active_regions. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] register memory ranges in a consistent mannerBob Picco2007-02-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | While pursuing and unrelated issue with 64Mb granules I noticed a problem related to inconsistent use of add_active_range. There doesn't appear any reason to me why FLATMEM versus DISCONTIG_MEM should register memory to add_active_range with different code. So I've changed the code into a common implementation. The other subtle issue fixed by this patch was calling add_active_range in count_node_pages before granule aligning is performed. We were lucky with 16MB granules but not so with 64MB granules. count_node_pages has reserved regions filtered out and as a consequence linked kernel text and data aren't covered by calls to count_node_pages. So linked kernel regions wasn't reported to add_active_regions. This resulted in free_initmem causing numerous bad_page reports. This won't occur with this patch because now all known memory regions are reported by register_active_ranges. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Bob Picco <bob.picco@hp.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] Zero size /proc/vmcore on ia64Horms2007-02-051-0/+6
| | | | | | | | | | | | | Set saved_max_pfn when discontig memory is in use. This sets up saved_max_pfn when disctontig memory is in use. This mirrors the code for contig memory. This patch does not entirely solve the problem of making vmcore work, however it does appear to be neccessary. Please consider applying. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [PATCH] mm: use symbolic names instead of indices for zone initialisationMel Gorman2006-10-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | Arch-independent zone-sizing is using indices instead of symbolic names to offset within an array related to zones (max_zone_pfns). The unintended impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the the machine fails to boot but will boot with CONFIG_HIGHMEM turned off. The following patch properly initialises the max_zone_pfns[] array and uses symbolic names instead of indices in each architecture using arch-independent zone-sizing. Two users have successfully booted their powerpcs with it (one an ibook G4). It has also been boot tested on x86, x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2. Credit to Benjamin Herrenschmidt for identifying the bug and rolling the first fix. Additional credit to Johannes Berg and Andreas Schwab for reporting the problem and testing on powerpc. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Merge branch 'release' of ↵Linus Torvalds2006-09-271-14/+14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] minor reformatting to vmlinux.lds.S [IA64] CMC/CPE: Reverse the order of fetching log and checking poll threshold [IA64] PAL calls need physical mode, stacked [IA64] ar.fpsr not set on MCA/INIT kernel entry [IA64] printing support for MCA/INIT [IA64] trim output of show_mem() [IA64] show_mem() printk levels [IA64] Make gp value point to Region 5 in mca handler Revert "[IA64] Unwire set/get_robust_list" [IA64] Implement futex primitives [IA64-SGI] Do not request DMA memory for BTE [IA64] Move perfmon tables from thread_struct to pfm_context [IA64] Add interface so modules can discover whether multithreading is on. [IA64] kprobes: fixup the pagefault exception caused by probehandlers [IA64] kprobe opcode 16 bytes alignment on IA64 [IA64] esi-support [IA64] Add "model name" to /proc/cpuinfo
| * [IA64] trim output of show_mem()Jes Sorensen2006-09-261-5/+4
| | | | | | | | | | | | | | | | Cut the number of lines of memory info output per node from five to one line. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
| * [IA64] show_mem() printk levelsJes Sorensen2006-09-261-14/+15
| | | | | | | | | | | | | | | | | | Use the default sysrq printk level for printing show_mem() output both for disconfig and contig versions. This is consistent with the printk level used on other architectures (well ia32 at least). Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | [PATCH] Have ia64 use add_active_range() and free_area_init_nodesMel Gorman2006-09-271-35/+9
|/ | | | | | | | | | | | | | | | | | | | Size zones and holes in an architecture independent manner for ia64. [bob.picco@hp.com: fix ia64 FLATMEM+VIRTUAL_MEM_MAP] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Bob Picco <bob.picco@hp.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [IA64] fix show_mem for VIRTUAL_MEM_MAP+FLATMEMBob Picco2006-08-031-63/+2
| | | | | | | | | | | | | | | | contig.c (FLATMEM) requires the same optimization as in discontig.c for show_mem when VIRTUAL_MEM_MAP is in use. Otherwise FLATMEM has softlockup timeouts. This was boot tested for memory configuration: SPARSEMEM, DISCONTIG+VIRTUAL_MEM_MAP, FLATMEM, FLATMEM+VIRTUAL_MEM_MAP and FLATMEM+VIRTUAL_MEM_MAP with largest memory gap less than LARGE_GAP by using boot parameter "mem=". This was boot tested and "echo m >/proc/sysrq-trigger" output evaluated for : FLATMEM, FLATMEM+VIRTUAL_MEM_MAP, DISCONTIGMEM+VIRTUAL_MEM_MAP and SPARSEMEM. Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] align high endpoint of VIRTUAL_MEM_MAPBob Picco2006-08-031-1/+2
| | | | | | | | | | | | | | | Assure that vmem_map's high endpoint is MAX_ORDER aligned. Not doing so violates the buddy allocator algorithm. Also anyone using mem=XXX on boot line and not aligned to MAX_ORDER requires this patch in order to satisfy buddy allocator. vmem_map always starts at pfn 0. The potentially large MAX_ORDER on ia64 (due to hugetlbfs) requires that the end of vmem_map be aligned to MAX_ORDER_NR_PAGES. This was boot tested for: FLATMEM, FLATMEM+VIRTUAL_MEM_MAP, DISCONTIGMEM+VIRTUAL_MEM_MAP and SPARSEMEM. Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [PATCH] Fix copying of pgdat array on each node for ia64 memory hotplugYasunori Goto2006-07-041-3/+13
| | | | | | | | | | | | | | | | | | | | | | | I found a bug in memory hot-add code for ia64. IA64's code has copies of pgdat's array on each node to reduce memory access over crossing node. This array is used by NODE_DATA() macro. When new node is hot-added, this pgdat's array should be updated and copied on new node too. However, I used for_each_online_node() in scatter_node_data() to copy it. This meant its array is not copied on new node. Because initialization of structures for new node was halfway, so online_node_map couldn't be set at this time. To copy arrays on new node, I changed it to check value of pgdat_list[] which is source array of copies. I tested this patch with my Memory Hotadd emulation on Tiger4. This patch is for 2.6.17-git20. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pgdat allocation and update for ia64 of memory hotplug: allocate ↵Yasunori Goto2006-06-271-2/+14
| | | | | | | | | | | | pgdat and per node data This is a patch to allocate pgdat and per node data area for ia64. The size for them can be calculated by compute_pernodesize(). Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pgdat allocation and update for ia64 of memory hotplug: update pgdat ↵Yasunori Goto2006-06-271-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | address array This is to refresh node_data[] array for ia64. As I mentioned previous patches, ia64 has copies of information of pgdat address array on each node as per node data. At v2 of node_add, this function used stop_machine_run() to update them. (I wished that they were copied safety as much as possible.) But, in this patch, this arrays are just copied simply, and set node_online_map bit after completion of pgdat initialization. So, kernel must touch NODE_DATA() macro after checking node_online_map(). (Current code has already done it.) This is more simple way for just hot-add..... Note : It will be problem when hot-remove will occur, because, even if online_map bit is set, kernel may touch NODE_DATA() due to race condition. :-( Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pgdat allocation and update for ia64 of memory hotplug: hold pgdat ↵Yasunori Goto2006-06-271-11/+8
| | | | | | | | | | | | | | | | address at system running This is a preparatory patch to make common code for updating of NODE_DATA() of ia64 between boottime and hotplug. Current code remembers pgdat address in mem_data which is used at just boot time. But its information can be used at hotplug time by moving to global value. The next patch uses this array. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [IA64] Make show_mem() skip holes in a pgdatRobin Holt2006-04-131-1/+65
| | | | | | | | | | | | | | This patch modifies ia64's show_mem() to walk the vmem_map page tables and rapidly skip forward across regions where the page tables are missing. This prevents the pfn_valid() check from causing numerous unnecessary page faults. Without this patch on a 512 node 512 cpu system where every node has four memory holes, the show_mem() call takes 1 hour 18 minutes. With this patch, it takes less than 3 seconds. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [PATCH] for_each_online_pgdat: remove sorting pgdatKAMEZAWA Hiroyuki2006-03-271-31/+0
| | | | | | | | | | | | | | Because pgdat_list was linked to pgdat_list in *reverse* order, (By default) some of arch has to sort it by themselves. for_each_pgdat has gone..for_each_online_pgdat() uses node_online_map, which doesn't need to be sorted. This patch removes codes for sorting pgdat. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] for_each_online_pgdat: renaming for_each_pgdatKAMEZAWA Hiroyuki2006-03-271-2/+2
| | | | | | | | Replace for_each_pgdat() with for_each_online_pgdat(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [IA64] add init declaration to cpu initialization functionsChen, Kenneth W2006-03-221-1/+1
| | | | | | | Add init declaration to cpu initialization functions. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] support for cpu0 removalAshok Raj2006-01-051-2/+7
| | | | | | | | | | | | | | here is the BSP removal support for IA64. Its pretty much the same thing that was released a while back, but has your feedback incorporated. - Removed CONFIG_BSP_REMOVE_WORKAROUND and associated cmdline param - Fixed compile issue with sn2/zx1 due to a undefined fix_b0_for_bsp - some formatting nits (whitespace etc) This has been tested on tiger and long back by alex on hp systems as well. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] Limit the maximum NODEDATA_ALIGN() offsetJack Steiner2005-12-061-1/+3
| | | | | | | | | | | | | The per-node data structures are allocated with strided offsets that are a function of the node number. This prevents excessive cache-aliasing from occurring. On systems with a large number of nodes, the strided offset becomes too large. This patch restricts the maximum offset to 32MB. This is far larger than the size of any current L3 cache. Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [IA64] fix memory less node allocationBob Picco2005-11-081-10/+9
| | | | | | | | | | | | | | | | | | | | The original memory less node allocation attempted to use NODEDATA_ALIGN for alignment. The bootmem allocator only allows a power of two alignments. This causes a BUG_ON for some nodes. For cpu only nodes just allocate with a PERCPU_PAGE_SIZE alignment. Some older firmware reports SLIT distances of 0xff and results in bestnode not being computed. This is now treated correctly. The failed allocation check was removed because it's redundant. The bootmem allocator already makes this check. This fix has been boot tested on 4 node machine which has 4 cpu only nodes and 1 memory node. Thanks to Pete Keilty for reporting this and helping me test it. Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* [PATCH] memory hotplug locking: node_size_lockDave Hansen2005-10-291-1/+6
| | | | | | | | | | | | | | | | | | | | pgdat->node_size_lock is basically only neeeded in one place in the normal code: show_mem(), which is the arch-specific sysrq-m printing function. Strictly speaking, the architectures not doing memory hotplug do no need this locking in show_mem(). However, they are all included for completeness. This should also make any future consolidation of all of the implementations a little more straightforward. This lock is also held in the sparsemem code during a memory removal, as sections are invalidated. This is the place there pfn_valid() is made false for a memory area that's being removed. The lock is only required when doing pfn_valid() operations on memory which the user does not already have a reference on the page, such as in show_mem(). Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>