Simple intro:
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.
A more detailed intro:
Understanding NUMA Architecture
Hardware and Software (Linux OS) perspective on NUMA:
What is NUMA? — The Linux Kernel documentation
ScienceDirect Collection (see UMA and NUMA):
https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-acc
2.5.1 UMA and NUMA
UMA and NUMA stand for uniform memory access and non-uniform memory access, respectively. The terms are most often used in the context of higher order memory systems and not specifically cache systems, but they are useful designations in the discussion of distributed/partitioned caches because UMA/NUMA systems illustrate the main issues of dealing with physically disjoint memories. A NUMA architecture is one in which all memory accesses have nominally the same latency; a NUMA architecture is one in which references to different addresses may have different latencies due to the nature of where the data is stored. The implication behind the deceptively simple “where the data is stored” is that an item's identity is tied to its location. For instance, in a solid-state setting, an item's physical address determines the memory partition in which it can be found. As an example, in a system of 16 memory nodes, the physical memory space could be divided into 16 regions, and the top 4 bits of the address would determine which node owns the data in question.
Figure 2.14(a) shows an organization in which a CPU's requests all go to the same memory system (which could be a cache, the DRAM system, or a disk; the details do not matter for purposes of this discussion). Thus, all memory requests have nominally the same latency—any observed variations in latency will be due to the specifics of the memory technology and not the memory organization. This organization is typical of shared cache organizations (e.g., the last-level cache of a chip-level multiprocessor) and symmetric multiprocessors.
Figure 2.14(b) shows an organization in which a CPU's memory request may be satisfied by one of many possible sub-memories. In particular, a request to any non-local memory (any memory not directly attached to the CPU making the request) is likely to exhibit a higher latency than a request to a local memory. This organization is used in many commercial settings, such as those based on AMD Opteron or Alpha 21364 (EV7) processors. The ccNUMA variant (for cache-coherent NUMA) is one in which each processor or processor cluster maintains a cache or cache system that caches requests to reduce internode traffic and average latency to non-local data.
The concepts embodied in the terms UMA and NUMA extend to cache systems and memory systems in general because the issues are those of data partitioning and placement, cost/benefit analysis, etc., which are universal. The Fully Buffered DIMM is a prime example of a non-uniform memory architecture which can be configured to behave as a uniform architecture by forcing all DIMMs in the system to stall and emulate the latency of the slowest DIMM in the system. For more detail, see Chapter 14, “The Fully Buffered DIMM Memory System.”
A deeper delve into the workload based performance of NUMA system
https://www.cse.wustl.edu/~angelee/cse539/papers/numa.pdf