Oracle bug 9107465
Chuck Anderson
chuck.anderson@oracle.com
07/26/2010

If an RHEL or OEL HVM guest uses the config parm "maxmem", it may crash
when run on an OVM 2.2 (Xen 3.4.0) system.  The crash is usually seen
when the guest is under memory pressure.  It may be difficult, if not
impossible, to collect crash info.  The only reported crash fingerprint
is the following console output obtained with "xm dmesg":
(XEN) domain_crash called from p2m.c:1173
(XEN) pg error: p2m_set_entry(): configure P2M table 4KB L2 entry with large 
and:
(XEN) p2m_pod_demand_populate: Out of populate-on-demand memory!
(XEN) domain_crash called from p2m.c:1091
(XEN) Domain 62 reported crashed by domain 0 on cpu#1:
(XEN) grant_table.c:350:d0 Iomem mapping not permitted ffffffffffffffff (domain 62)

Maxmem, if greater than the "memory" parm, will invoke Populate On Demand
(PoD) to allow an HVM guest to increase its memory allocation after boot.
For example, with the following vm.cfg parms:
	maxmem = '4096'
	memory = '512'
an HVM guest will boot with 512MB of memory allocated to it.  The guest can
vary its memory allocation using the Xen "xm mem-set" command, increasing it
up to the maxmem value of 4096MB.  In this example, using maxmem is similar
to having "memory = '4096'" and manually using mem-set to change the guest's
memory reservation to 512MB after boot.
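For instance, to grow the example guest from its boot-time 512MB toward the
4096MB maxmem (the domain name "myguest" is only a placeholder):
	# xm mem-set myguest 2048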

PoD, which was introduced in Xen 3.4.0, requires (1) that the HVM guest has
a balloon driver, i.e. that it is a PVHVM guest, and (2) that the balloon
driver is aware of PoD.  OVM balloon drivers do not allow for PoD, which
causes HVM guests using maxmem to crash when they are under memory pressure.
The crash/hang can be quickly and reliably reproduced by booting an OEL or
RHEL5u4 64-bit HVM or PVHVM guest that has the vm.cfg parms:
	maxmem = '512'
	memory = '400'
and running the following command:
	# dd if=/dev/hda of=/dev/null

With maxmem=512 and memory=400, the guest will find 512MB in the E820 map
at boot and assume it has 512MB of memory.  With PoD, 400MB of that 512MB
will have been allocated to PoD's cache, which is managed with a per-domain
list.  The remaining (512MB - 400MB) is not allocated.  Entries in the P2M
table are initially set to type PoD and do not have memory allocated to
them.  P2M entries are populated on demand from the PoD cache.  When the
balloon driver loads, it decreases the guest's memory reservation by
(512MB - 400MB), setting the current reservation to 400MB.
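In round numbers, the boot-time accounting for this example is:
	Guest-visible memory (E820):       512MB
	PoD cache (actually allocated):    400MB
	Unbacked P2M entries (type PoD):   512MB - 400MB = 112MB
Only after the balloon driver balloons out that 112MB is the number of
outstanding PoD entries guaranteed not to exceed the PoD cache.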

The crash occurs because the balloon driver uses the kernel variable
"totalram_pages" to determine an HVM guest's initial memory allocation
(current_pages).  On HVM, totalram_pages does not include certain pages,
such as those backing the kernel image or pages allocated using the
bootmem allocator.  This causes the balloon driver to use a different
value for the guest's initial memory allocation size than what Xen has.
For example:

	maxmem=512
	memory=400
	Initially:
		totalram_pages = 496288kB (does not include certain pages)
		512MB == 524288kB
		524288kB - totalram_pages (496288kB) = 28000kB == 7000 pages
		Xen would say the guest's memory reservation is 512MB
		The balloon driver would compute:
			current_pages = totalram_pages == 496288kB
		which is 28000kB less than what Xen thinks is there.
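Before the fix, the HVM path of balloon_init() did just
	current_pages = totalram_pages;
so it picked up the 496288kB value rather than the 512MB Xen expects
(this is the line removed by the patch below).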

The memory=400 parm causes the balloon driver to reduce the guest's memory
to 400MB.  The balloon driver balloons out (reduces the guest's memory
reservation) from what it thinks is current (496288kB) down to the specified
400MB.  Because it starts from the wrong value (496288kB), which is
lower than what Xen thinks it is (512MB), the balloon driver balloons
out less than expected (496288kB - 400MB) instead of what Xen expects
(512MB - 400MB).  Normally this wouldn't matter much.  With PoD, though,
it causes the guest to crash.  PoD depends on the balloon driver to
balloon out exactly (512MB - 400MB).  If the balloon driver does less,
as is the case here, then the P2M table will have more unpopulated
(type PoD) entries than there are pages in the PoD cache.  PoD will
eventually run out of PoD cache pages and cause the guest to crash.
In fact, I have seen empirically that Xen
adds 4MB to the maxmem parm value to determine the actual memory reservation.
In the example above, Xen's value for the guest's memory reservation was
516MB.  That additional 4MB makes the difference between Xen's and the guest's
value for the memory reservation even larger.
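Continuing the worked example with the observed 516MB value:
	516MB == 528384kB
	528384kB - totalram_pages (496288kB) = 32096kB == 8024 pages
so the balloon driver's idea of the initial allocation is now 8024 pages
short of Xen's rather than 7000.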

The balloon driver needs to initialize current_pages with Xen's value for
the guest's memory allocation rather than using a kernel variable value
that may not include all pages.  Xen's value can be obtained from the
XENMEM_maximum_reservation hypervisor call.
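In outline the fix looks like the following minimal sketch (it omits the
XENMEM_get_pod_target probing the real patch below uses to choose between
the maximum and current reservation; "domid" and "reservation" are just
local names for illustration):

	domid_t domid = DOMID_SELF;
	long reservation;

	/* Ask Xen how many pages this domain may have instead of
	 * trusting the kernel's (too small) totalram_pages. */
	reservation = HYPERVISOR_memory_op(XENMEM_maximum_reservation, &domid);
	if (reservation == -ENOSYS) {
		/* Hypercall unavailable: fall back to the old behavior. */
		totalram_bias = 0;
		current_pages = totalram_pages;
	} else {
		current_pages = reservation;
		/* Pages Xen counts that totalram_pages does not. */
		totalram_bias = reservation - totalram_pages;
	}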

Patch is based on several patches by Jan Beulich found online:
http://lists.xensource.com/archives/html/xen-devel/2010-02/txtEk3vRicfHF.txt
http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01601.html

The patch is kludgy, but (1) it is apparently the version that will go into
mainline, and (2) it needs to work on both OVM 2.2 and OVM releases prior to
2.2, for both PV and PVHVM guests.

Patch pseudo-code:

	if defined CONFIG_XEN
		// A PV guest.  Force totalram_bias = 0
		#define totalram_bias (unsigned long)0

	increase_reservation()
	decrease_reservation()
		// current_pages has Xen's value for pages available to the
		// guest.  The kernel's totalram_pages should == current_pages
		// but it is less because the kernel has subtracted some pages
		// from it at boot.  The difference has been saved in
		// totalram_bias.  Subtract totalram_bias from current_pages
		// to compute the new value for totalram_pages so that the
		// change to totalram_pages is just what we increased/decreased.

	if not defined CONFIG_XEN
		// PVHVM guest
		if not defined XENMEM_get_pod_target
			define XENMEM_get_pod_target
			define xen_pod_target_t
		rc = HYPERVISOR_memory_op(XENMEM_get_pod_target)
		if rc == -ENOSYS
			// no PoD
			op = XENMEM_current_reservation
		else if rc == 1
			// assume Xen prior to 3.4.0 has clipped
			// XENMEM_get_pod_target to 4 bits changing it from
			// 17 to 1 (XENMEM_decrease_reservation)
			op = XENMEM_current_reservation
		else
			// assume PoD support
			op = XENMEM_maximum_reservation
		totalram_bias = HYPERVISOR_memory_op(op)
		if totalram_bias == -ENOSYS
			// guessed wrong: no PoD support
			totalram_bias = 0
			current_pages = totalram_pages
		else
			// totalram_bias = our max or current reservation
			current_pages = totalram_bias
			// compute difference between the pages the kernel
			// thinks are available (totalram_pages) and the
			// pages Xen thinks are available (totalram_bias).
			// Difference is stored in totalram_bias.  It is
			// assumed to be static after very early in boot.
			totalram_bias -= totalram_pages

diff -uNrp linux-2.6.18.x86_64.orig/drivers/xen/balloon/balloon.c linux-2.6.18.x86_64/drivers/xen/balloon/balloon.c
--- linux-2.6.18.x86_64.orig/drivers/xen/balloon/balloon.c	2010-11-18 14:14:44.000000000 -0800
+++ linux-2.6.18.x86_64/drivers/xen/balloon/balloon.c	2010-11-18 17:11:17.000000000 -0800
@@ -105,6 +105,22 @@ extern unsigned long totalhigh_pages;
 #define dec_totalhigh_pages() ((void)0)
 #endif
 
+#ifndef CONFIG_XEN
+/*
+ * HVM guests should not use the kernel variable totalram_pages to determine
+ * their current memory reservation.  totalram_pages does not include
+ * certain pages causing it to be out of sync (lower) with what Xen expects.
+ * This can cause an HVM guest using PoD to crash (Oracle bug 9107465).
+ * HVM guests need to make a hypervisor call to determine their memory
+ * reservation.  totalram_bias is used to keep track of the difference between
+ * what the kernel uses for total pages (totalram_pages)
+ * and what Xen uses (returned by XENMEM_maximum_reservation).
+ */
+static unsigned long __read_mostly totalram_bias;
+#else
+#define totalram_bias (unsigned long)0
+#endif 
+
 /*
  * Drivers may alter the memory reservation independently, but they must
  * inform the balloon driver so that we can avoid hitting the hard limit.
@@ -262,7 +278,7 @@ static int increase_reservation(unsigned
 	}
 
 	current_pages += rc;
-	totalram_pages = current_pages;
+	totalram_pages = current_pages - totalram_bias;
 
  out:
 	balloon_unlock(flags);
@@ -335,7 +351,7 @@ static int decrease_reservation(unsigned
 	BUG_ON(ret != nr_pages);
 
 	current_pages -= nr_pages;
-	totalram_pages = current_pages;
+	totalram_pages = current_pages - totalram_bias;
 
 	balloon_unlock(flags);
 
@@ -496,10 +512,23 @@ static struct notifier_block xenstore_no
 
 static int __init balloon_init(void)
 {
-#if defined(CONFIG_X86) && defined(CONFIG_XEN) 
+#if !defined(CONFIG_XEN)
+# ifndef XENMEM_get_pod_target
+#  define XENMEM_get_pod_target 17
+	typedef struct xen_pod_target {
+		uint64_t target_pages;
+		uint64_t tot_pages;
+		uint64_t pod_cache_pages;
+		uint64_t pod_entries;
+		domid_t domid;
+	} xen_pod_target_t;
+# endif /* ifndef XENMEM_get_pod_target */
+	xen_pod_target_t pod_target = { .domid = DOMID_SELF };
+	int rc;
+#elif defined(CONFIG_X86)
 	unsigned long pfn;
 	struct page *page;
-#endif
+#endif /* defined(CONFIG_X86) */
 
 	if (!is_running_on_xen())
 		return -ENODEV;
@@ -510,8 +539,31 @@ static int __init balloon_init(void)
 	current_pages = min(xen_start_info->nr_pages, max_pfn);
 	totalram_pages = current_pages;
 #else
-	current_pages = totalram_pages;
-#endif
+	rc = HYPERVISOR_memory_op(XENMEM_get_pod_target, &pod_target);
+	/*
+	 * Xen prior to 3.4.0 masks the memory_op command to 4 bits, thus
+	 * converting XENMEM_get_pod_target to XENMEM_decrease_reservation.
+	 * Fortunately this results in a request with all input fields zero,
+	 * but (due to the way bit 4 and upwards get interpreted) a starting
+ * extent of 1. When start_extent > nr_extents (>= in newer Xen), we
+	 * simply get start_extent returned.
+	 */
+	totalram_bias = HYPERVISOR_memory_op(rc != -ENOSYS && rc != 1
+		? XENMEM_maximum_reservation : XENMEM_current_reservation,
+		&pod_target.domid);
+	if ((long)totalram_bias != -ENOSYS) {
+		BUG_ON(totalram_bias < totalram_pages);
+		current_pages = totalram_bias;
+		totalram_bias -= totalram_pages;
+	} else {
+		totalram_bias = 0;
+		current_pages = totalram_pages;
+	}
+#endif /* CONFIG_XEN */
+	IPRINTK("balloon driver: current pages %lu totalram_pages %lu RAM bias %lu\n",
+		current_pages,
+		totalram_pages,
+		totalram_bias);
 	target_pages  = current_pages;
 	balloon_low   = 0;
 	balloon_high  = 0;
@@ -630,7 +682,7 @@ struct page **alloc_empty_pages_and_page
 			goto err;
 		}
 
-		totalram_pages = --current_pages;
+		totalram_pages = --current_pages - totalram_bias; 
 
 		balloon_unlock(flags);
 	}