unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-03 12:14 [PATCH] Add LD_NUMA_REPLICATION for glibc Huang Shijie via Libc-alpha
@ 2021-09-03  6:28 ` Florian Weimer via Libc-alpha
  2021-09-03 15:15   ` Huang Shijie via Libc-alpha
  2021-09-09 10:19   ` Huang Shijie via Libc-alpha
  0 siblings, 2 replies; 8+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-09-03  6:28 UTC (permalink / raw)
  To: Huang Shijie via Libc-alpha; +Cc: Huang Shijie, zwang, patches

* Huang Shijie via Libc-alpha:

> This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
>
> If LD_NUMA_REPLICATION is set for program foo like this:
> 	#LD_NUMA_REPLICATION=1 ./foo
>
> At the time ld.so mmaps the shared libraries, it will uses
> 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> for them, and the mmap will trigger COW(copy on write) for the shared
> libraries at the NUMA node which the program `foo` runs. After the
> COW, the foo will have a copy of the shared library segment(mmap
> covered) which belong to the same NUMA node.
>
> So when enable LD_NUMA_REPLICATION, it will consume more memory,
> but it will reduce the remote-access in NUMA.

I think the kernel could do this in a much better way, avoiding
duplicating the pages within the same NUMA node.

The other issue is the temporary RWX mapping, which does not
interoperate well with some security hardening features.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] Add LD_NUMA_REPLICATION for glibc
@ 2021-09-03 12:14 Huang Shijie via Libc-alpha
  2021-09-03  6:28 ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 8+ messages in thread
From: Huang Shijie via Libc-alpha @ 2021-09-03 12:14 UTC (permalink / raw)
  To: carlos; +Cc: Huang Shijie, zwang, patches, libc-alpha

This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.

If LD_NUMA_REPLICATION is set for program foo like this:
	#LD_NUMA_REPLICATION=1 ./foo

At the time ld.so mmaps the shared libraries, it will uses
	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
for them, and the mmap will trigger COW(copy on write) for the shared libraries
at the NUMA node which the program `foo` runs. After the COW, the foo will have a copy of
the shared library segment(mmap covered) which belong to the same NUMA node.

So when enable LD_NUMA_REPLICATION, it will consume more memory,
but it will reduce the remote-access in NUMA.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 elf/dl-map-segments.h      | 28 ++++++++++++++++++++++++----
 elf/dl-support.c           |  4 ++++
 elf/rtld.c                 |  4 ++++
 sysdeps/generic/ldsodefs.h |  4 ++++
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/elf/dl-map-segments.h b/elf/dl-map-segments.h
index f9fb110e..ae6661a7 100644
--- a/elf/dl-map-segments.h
+++ b/elf/dl-map-segments.h
@@ -52,13 +52,33 @@ _dl_map_segments (struct link_map *l, int fd,
                                   c->mapstart & GLRO(dl_use_load_bias))
            - MAP_BASE_ADDR (l));
 
-      /* Remember which part of the address space this object uses.  */
-      l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
+      if (__glibc_unlikely(GLRO(dl_numa_replication)))
+      {
+	/* Trigger the linux kernel COW(copy on write) on purpose */
+        l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
+                                            c->prot|PROT_WRITE,
+                                            MAP_COPY|MAP_FILE|MAP_POPULATE,
+                                            fd, c->mapoff);
+        if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
+          return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+
+	/* Change back to c->prot if needed */
+	if (!(c->prot & PROT_WRITE))
+        {
+	  if (__mprotect((caddr_t)l->l_map_start, maplength, c->prot))
+            return DL_MAP_SEGMENTS_ERROR_MPROTECT;
+	}
+      }
+      else
+      {
+        /* Remember which part of the address space this object uses.  */
+        l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
                                             c->prot,
                                             MAP_COPY|MAP_FILE,
                                             fd, c->mapoff);
-      if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
-        return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+        if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
+          return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;
+      }
 
       l->l_map_end = l->l_map_start + maplength;
       l->l_addr = l->l_map_start - c->mapstart;
diff --git a/elf/dl-support.c b/elf/dl-support.c
index 01557181..d2eb3164 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -79,6 +79,10 @@ const char *_dl_origin_path;
 /* Nonzero if runtime lookup should not update the .got/.plt.  */
 int _dl_bind_not;
 
+  /* Do we want to do the replication(by linux copy on write) for shared libraries in NUMA?
+   Only valid in the linux system. */
+int _dl_numa_replication;
+
 /* A dummy link map for the executable, used by dlopen to access the global
    scope.  We don't export any symbols ourselves, so this can be minimal.  */
 static struct link_map _dl_main_map =
diff --git a/elf/rtld.c b/elf/rtld.c
index d733359e..10378c00 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -2788,7 +2788,11 @@ process_envvars (struct dl_main_state *state)
 	      GLRO(dl_verbose) = 1;
 	      GLRO(dl_debug_mask) |= DL_DEBUG_PRELINK;
 	      GLRO(dl_trace_prelink) = &envline[17];
+	      break;
 	    }
+
+	  if (memcmp (envline, "NUMA_REPLICATION", 16) == 0)
+	    GLRO(dl_numa_replication) = true;
 	  break;
 
 	case 20:
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 9c152592..f6114522 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -569,6 +569,10 @@ struct rtld_global_ro
   /* Nonzero if runtime lookups should not update the .got/.plt.  */
   EXTERN int _dl_bind_not;
 
+  /* Do we want to do the replication(by linux copy on write) for shared libraries in NUMA?
+     Only valid in the linux system. */
+  EXTERN int _dl_numa_replication;
+
   /* Nonzero if references should be treated as weak during runtime
      linking.  */
   EXTERN int _dl_dynamic_weak;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-03  6:28 ` Florian Weimer via Libc-alpha
@ 2021-09-03 15:15   ` Huang Shijie via Libc-alpha
  2021-09-03 22:16     ` Song Bao Hua (Barry Song) via Libc-alpha
  2021-09-09 10:19   ` Huang Shijie via Libc-alpha
  1 sibling, 1 reply; 8+ messages in thread
From: Huang Shijie via Libc-alpha @ 2021-09-03 15:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: song.bao.hua, Huang Shijie via Libc-alpha, zwang, willy, linux-mm,
	viro, patches, torvalds, akpm

On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> * Huang Shijie via Libc-alpha:
> 
> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
> >
> > If LD_NUMA_REPLICATION is set for program foo like this:
> > 	#LD_NUMA_REPLICATION=1 ./foo
> >
> > At the time ld.so mmaps the shared libraries, it will uses
> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > for them, and the mmap will trigger COW(copy on write) for the shared
> > libraries at the NUMA node which the program `foo` runs. After the
> > COW, the foo will have a copy of the shared library segment(mmap
> > covered) which belong to the same NUMA node.
> >
> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > but it will reduce the remote-access in NUMA.
> 
> I think the kernel could do this in a much better way, avoiding
> duplicating the pages within the same NUMA node.
I think maybe only the per-NUMA-node page cache can avoid it..

> 
> The other issue is the temporary RWX mapping, which does not
> interoperate well with some security hardening features.
> 
Thanks for the feedback.
I CC more people who may  have interesting about this.


Thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-03 15:15   ` Huang Shijie via Libc-alpha
@ 2021-09-03 22:16     ` Song Bao Hua (Barry Song) via Libc-alpha
  2021-09-06  9:14       ` Huang Shijie via Libc-alpha
  0 siblings, 1 reply; 8+ messages in thread
From: Song Bao Hua (Barry Song) via Libc-alpha @ 2021-09-03 22:16 UTC (permalink / raw)
  To: Huang Shijie, Florian Weimer
  Cc: Huang Shijie via Libc-alpha, zwang@amperecomputing.com,
	willy@infradead.org, linux-mm@kvack.org, viro@zeniv.linux.org.uk,
	patches@amperecomputing.com, torvalds@linux-foundation.org,
	akpm@linux-foundation.org



> -----Original Message-----
> From: Huang Shijie [mailto:shijie@os.amperecomputing.com]
> Sent: Saturday, September 4, 2021 3:16 AM
> To: Florian Weimer <fweimer@redhat.com>
> Cc: Huang Shijie via Libc-alpha <libc-alpha@sourceware.org>;
> carlos@systemhalted.org; zwang@amperecomputing.com;
> patches@amperecomputing.com; akpm@linux-foundation.org;
> torvalds@linux-foundation.org; viro@zeniv.linux.org.uk; willy@infradead.org;
> Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; linux-mm@kvack.org
> Subject: Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
> 
> On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> > * Huang Shijie via Libc-alpha:
> >
> > > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared
> libraries at run time.
> > >
> > > If LD_NUMA_REPLICATION is set for program foo like this:
> > > 	#LD_NUMA_REPLICATION=1 ./foo
> > >
> > > At the time ld.so mmaps the shared libraries, it will uses
> > > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > > for them, and the mmap will trigger COW(copy on write) for the shared
> > > libraries at the NUMA node which the program `foo` runs. After the
> > > COW, the foo will have a copy of the shared library segment(mmap
> > > covered) which belong to the same NUMA node.
> > >
> > > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > > but it will reduce the remote-access in NUMA.
> >
> > I think the kernel could do this in a much better way, avoiding
> > duplicating the pages within the same NUMA node.
> I think maybe only the per-NUMA-node page cache can avoid it..
> 
> >
> > The other issue is the temporary RWX mapping, which does not
> > interoperate well with some security hardening features.
> >
> Thanks for the feedback.
> I CC more people who may  have interesting about this.
> 

Hi Shijie,
Thanks!

Could we have some benchmark data for this patch? I assume it
will benefit frontend-bound large workset which might face
relatively more icache miss?

> 
> Thanks
> Huang Shijie

Thanks
Barry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-03 22:16     ` Song Bao Hua (Barry Song) via Libc-alpha
@ 2021-09-06  9:14       ` Huang Shijie via Libc-alpha
  0 siblings, 0 replies; 8+ messages in thread
From: Huang Shijie via Libc-alpha @ 2021-09-06  9:14 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Florian Weimer, Huang Shijie via Libc-alpha,
	zwang@amperecomputing.com, willy@infradead.org,
	linux-mm@kvack.org, viro@zeniv.linux.org.uk,
	patches@amperecomputing.com, torvalds@linux-foundation.org,
	akpm@linux-foundation.org

Hi Barry,
On Fri, Sep 03, 2021 at 10:16:14PM +0000, Song Bao Hua (Barry Song) wrote:
> Could we have some benchmark data for this patch? I assume it
> will benefit frontend-bound large workset which might face
> relatively more icache miss?
I did not find a proper program for the benchmark test..

Thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-03  6:28 ` Florian Weimer via Libc-alpha
  2021-09-03 15:15   ` Huang Shijie via Libc-alpha
@ 2021-09-09 10:19   ` Huang Shijie via Libc-alpha
  2021-09-10 11:01     ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 8+ messages in thread
From: Huang Shijie via Libc-alpha @ 2021-09-09 10:19 UTC (permalink / raw)
  To: Florian Weimer; +Cc: zwang, Huang Shijie via Libc-alpha, patches

Hi Florian,
On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
> * Huang Shijie via Libc-alpha:
> 
> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
> >
> > If LD_NUMA_REPLICATION is set for program foo like this:
> > 	#LD_NUMA_REPLICATION=1 ./foo
> >
> > At the time ld.so mmaps the shared libraries, it will uses
> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
> > for them, and the mmap will trigger COW(copy on write) for the shared
> > libraries at the NUMA node which the program `foo` runs. After the
> > COW, the foo will have a copy of the shared library segment(mmap
> > covered) which belong to the same NUMA node.
> >
> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
> > but it will reduce the remote-access in NUMA.
> 
> I think the kernel could do this in a much better way, avoiding
> duplicating the pages within the same NUMA node.

https://marc.info/?l=linux-kernel&m=163070220429222&w=2
Since Linus did not think it a good choice to do it in kernel,
glibc is the only place to do it now.
So could you please re-evaluate this patch?

> 
> The other issue is the temporary RWX mapping, which does not
> interoperate well with some security hardening features.
Could you please tell me in detail? I am confused at it.

Thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-09 10:19   ` Huang Shijie via Libc-alpha
@ 2021-09-10 11:01     ` Florian Weimer via Libc-alpha
  2021-09-13 14:40       ` Huang Shijie via Libc-alpha
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-09-10 11:01 UTC (permalink / raw)
  To: Huang Shijie; +Cc: zwang, Huang Shijie via Libc-alpha, patches

* Huang Shijie:

> Hi Florian,
> On Fri, Sep 03, 2021 at 08:28:57AM +0200, Florian Weimer wrote:
>> * Huang Shijie via Libc-alpha:
>> 
>> > This patch adds LD_NUMA_REPLICATION which influences the linkage of shared libraries at run time.
>> >
>> > If LD_NUMA_REPLICATION is set for program foo like this:
>> > 	#LD_NUMA_REPLICATION=1 ./foo
>> >
>> > At the time ld.so mmaps the shared libraries, it will uses
>> > 	mmap(, c->prot | PROT_WRITE, MAP_COPY | MAP_FILE | MAP_POPULATE,)
>> > for them, and the mmap will trigger COW(copy on write) for the shared
>> > libraries at the NUMA node which the program `foo` runs. After the
>> > COW, the foo will have a copy of the shared library segment(mmap
>> > covered) which belong to the same NUMA node.
>> >
>> > So when enable LD_NUMA_REPLICATION, it will consume more memory,
>> > but it will reduce the remote-access in NUMA.
>> 
>> I think the kernel could do this in a much better way, avoiding
>> duplicating the pages within the same NUMA node.
>
> https://marc.info/?l=linux-kernel&m=163070220429222&w=2
> Since Linus did not think it a good choice to do it in kernel,
> glibc is the only place to do it now.
> So could you please re-evaluate this patch?

The name of the environment variable is quite misleading.  It should
refer to MAP_POPULATE, not NUMA.

As far as I can tell, it does not necessarily have the desired effect
for multi-threaded applications (if some threads end up running on other
NUMA nodes).

And it would be helpful to have some performance numbers.

And I wonder if a FUSE file system could do better, by making one
backing copy per NUMA node instead of one copy per process.

>> The other issue is the temporary RWX mapping, which does not
>> interoperate well with some security hardening features.
> Could you please tell me in detail? I am confused at it.

Some environments block mapping files with PROT_WRITE | PROT_EXEC.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Add LD_NUMA_REPLICATION for glibc
  2021-09-10 11:01     ` Florian Weimer via Libc-alpha
@ 2021-09-13 14:40       ` Huang Shijie via Libc-alpha
  0 siblings, 0 replies; 8+ messages in thread
From: Huang Shijie via Libc-alpha @ 2021-09-13 14:40 UTC (permalink / raw)
  To: Florian Weimer; +Cc: zwang, Huang Shijie via Libc-alpha, patches

On Fri, Sep 10, 2021 at 01:01:46PM +0200, Florian Weimer wrote:
Hi Florian,
> > https://marc.info/?l=linux-kernel&m=163070220429222&w=2
> > Since Linus did not think it a good choice to do it in kernel,
> > glibc is the only place to do it now.
> > So could you please re-evaluate this patch?
> 
> The name of the environment variable is quite misleading.  It should
> refer to MAP_POPULATE, not NUMA.
In the kernel, in do_cow_fault:
https://lxr.missinglinkelectronics.com/linux/mm/memory.c#L3508
alloc_page_vma() will be passed with the numa_node_id() which will return
the current NUMA node.

This patch just uses MAP_POPULATE to implement the per-NUMA replication 
of the shared libraries.

So its name should have "NUMA". (I cannot think out a better name now.)

> 
> As far as I can tell, it does not necessarily have the desired effect
> for multi-threaded applications (if some threads end up running on other
> NUMA nodes).
Yes. Not suit for that case.

> 
> And it would be helpful to have some performance numbers.
I hope some one can test this patch, since I cannot find a proper benchmark
program. :(

> 
> And I wonder if a FUSE file system could do better, by making one
> backing copy per NUMA node instead of one copy per process.
Interesting...

> 
> >> The other issue is the temporary RWX mapping, which does not
> >> interoperate well with some security hardening features.
> > Could you please tell me in detail? I am confused at it.
> 
> Some environments block mapping files with PROT_WRITE | PROT_EXEC.
We can filer these environments by adding more check in the "if".

Thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-09-13  6:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-03 12:14 [PATCH] Add LD_NUMA_REPLICATION for glibc Huang Shijie via Libc-alpha
2021-09-03  6:28 ` Florian Weimer via Libc-alpha
2021-09-03 15:15   ` Huang Shijie via Libc-alpha
2021-09-03 22:16     ` Song Bao Hua (Barry Song) via Libc-alpha
2021-09-06  9:14       ` Huang Shijie via Libc-alpha
2021-09-09 10:19   ` Huang Shijie via Libc-alpha
2021-09-10 11:01     ` Florian Weimer via Libc-alpha
2021-09-13 14:40       ` Huang Shijie via Libc-alpha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).