Multi-GPU implementation.

#include "astaroth.h"
#include "errchk.h"
#include "math_utils.h"

Include dependency graph for node.cc:

Data Structures
struct	node_s

Functions
AcResult	acNodeCreate (const int id, const AcMeshInfo node_config, Node *node_handle)

AcResult	acNodeDestroy (Node node)

AcResult	acNodePrintInfo (const Node node)

AcResult	acNodeQueryDeviceConfiguration (const Node node, DeviceConfiguration *config)

AcResult	acNodeAutoOptimize (const Node node)

AcResult	acNodeSynchronizeStream (const Node node, const Stream stream)

AcResult	acNodeSynchronizeVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle)

AcResult	acNodeSynchronizeMesh (const Node node, const Stream stream)

AcResult	acNodeSwapBuffers (const Node node)

AcResult	acNodeLoadConstant (const Node node, const Stream stream, const AcRealParam param, const AcReal value)

AcResult	acNodeLoadVertexBufferWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices)

AcResult	acNodeLoadMeshWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const int3 src, const int3 dst, const int num_vertices)

AcResult	acNodeLoadVertexBuffer (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle)

AcResult	acNodeLoadMesh (const Node node, const Stream stream, const AcMesh host_mesh)

AcResult	acNodeStoreVertexBufferWithOffset (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh)

AcResult	acNodeStoreMeshWithOffset (const Node node, const Stream stream, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh)

AcResult	acNodeStoreVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, AcMesh *host_mesh)

AcResult	acNodeStoreMesh (const Node node, const Stream stream, AcMesh *host_mesh)

AcResult	acNodeIntegrateSubstep (const Node node, const Stream stream, const int isubstep, const int3 start, const int3 end, const AcReal dt)

AcResult	acNodeIntegrate (const Node node, const AcReal dt)

AcResult	acNodePeriodicBoundcondStep (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle)

AcResult	acNodePeriodicBoundconds (const Node node, const Stream stream)

AcResult	acNodeReduceScal (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle vtxbuf_handle, AcReal *result)

AcResult	acNodeReduceVec (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle a, const VertexBufferHandle b, const VertexBufferHandle c, AcReal *result)

Detailed Description

Multi-GPU implementation.

JP: The old way for computing boundary conditions conflicts with the way we have to do things with multiple GPUs.

The older approach relied on unified memory, which represented the whole memory area as one huge mesh instead of several smaller ones. However, unified memory in its current state is meant more for quick prototyping when performance is not an issue. Getting the CUDA driver to migrate data intelligently across GPUs is much more difficult than managing the memory explicitly.

In this new approach, I have simplified the multi- and single-GPU layers significantly. Quick rundown: New struct: Grid. There are two global variables, "grid" and "subgrid", which contain the extents of the whole simulation domain and the decomposed grids, respectively. To simplify things, we require that each GPU is assigned the same amount of work; therefore each GPU in the node is assigned a "subgrid.m"-sized block of data to work with.

    The whole simulation domain is decomposed with respect to the z dimension.
    For example, if the grid contains (nx, ny, nz) vertices, then the subgrids
    contain (nx, ny, nz / num_devices) vertices.

    A local index (i, j, k) in some subgrid can be mapped to the global grid with
            global idx = (i, j, k + device_id * subgrid.n.z)
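The mapping above can be sketched in plain C as follows. This is only an illustration: the `int3` typedef is a stand-in for CUDA's built-in type, and the helper names `local_to_global` and `subgrid_nz_from_grid` are hypothetical and do not appear in node.cc. The sketch also assumes that nz is evenly divisible by the number of devices, in line with the requirement that every GPU receives the same amount of work.

```c
#include <assert.h>

/* Stand-in for CUDA's int3 so that the sketch compiles as plain C. */
typedef struct {
    int x, y, z;
} int3;

/* Hypothetical helper: number of vertices in z per subgrid when the grid
 * is decomposed along z. Assumes grid_nz divides evenly among devices. */
static int
subgrid_nz_from_grid(const int grid_nz, const int num_devices)
{
    assert(num_devices > 0 && grid_nz % num_devices == 0);
    return grid_nz / num_devices;
}

/* Hypothetical helper: map a local index (i, j, k) of the subgrid owned by
 * device_id to the global grid. subgrid_nz plays the role of subgrid.n.z. */
static int3
local_to_global(const int3 local, const int device_id, const int subgrid_nz)
{
    assert(device_id >= 0 && subgrid_nz > 0);
    const int3 global = {local.x, local.y, local.z + device_id * subgrid_nz};
    return global;
}
```

For example, with nz = 128 and 4 devices each subgrid holds 32 vertices in z, and the local index (1, 2, 3) on device 2 maps to the global index (1, 2, 67).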

Terminology:

Single-GPU function: a function defined on the single-GPU layer (device.cu)

Changes required to this commented code block:

The thread block dimensions (tpb) are no longer passed to the kernel here but in device.cu instead. The same holds for any complex index calculations. Instead, the local coordinates should be passed as an int3 type without having to consider how the data is actually laid out in device memory.
The unified memory buffer no longer exists (d_buffer). Instead, we have an opaque handle of type "Device" which should be passed to single-GPU functions. In this file, all devices are stored in a global array "devices[num_devices]".
Every single-GPU function is executed asynchronously by default such that we can optimize Astaroth by executing memory transactions concurrently with computation. Therefore a StreamType should be passed as a parameter to single-GPU functions. Refresher: CUDA function calls are non-blocking when a stream is explicitly passed as a parameter and commands executing in different streams can be processed in parallel/concurrently.

Note on periodic boundaries (might be helpful when implementing other boundary conditions):

    With multiple GPUs, periodic boundary conditions applied on indices ranging from

            (0, 0, STENCIL_ORDER/2) to (subgrid.m.x, subgrid.m.y, subgrid.m.z - STENCIL_ORDER/2)

    on a single device are "local", in the sense that they can be computed without
    having to exchange data with neighboring GPUs. Special care is needed only for
    transferring the data to the front and back plates outside this range. In the
    solution we use here, we solve the local boundaries first, and then just
    exchange the front and back plates in a "ring", like so

            device_id (n) <-> 0 <-> 1 <-> ... <-> n <-> (0)
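The ring ordering of devices can be expressed with modular arithmetic. The sketch below is an assumption-laden illustration, not the code in node.cc: the helper names `ring_next` and `ring_prev` are hypothetical, and which of the two neighbors receives the front plate and which the back plate is not specified here.

```c
#include <assert.h>

/* Hypothetical helper: id of the next device in the ring.
 * The last device wraps around to device 0. */
static int
ring_next(const int device_id, const int num_devices)
{
    assert(num_devices > 0 && device_id >= 0 && device_id < num_devices);
    return (device_id + 1) % num_devices;
}

/* Hypothetical helper: id of the previous device in the ring.
 * Device 0 wraps around to the last device. Adding num_devices before
 * taking the remainder keeps the left operand non-negative. */
static int
ring_prev(const int device_id, const int num_devices)
{
    assert(num_devices > 0 && device_id >= 0 && device_id < num_devices);
    return (device_id + num_devices - 1) % num_devices;
}
```

With four devices, device 3 exchanges plates with devices 2 and 0, and device 0 exchanges plates with devices 3 and 1. With a single device both neighbors are the device itself, which reduces to ordinary periodic boundaries.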

Throughout this file we use the following notation and names for various index offsets

Global coordinates: coordinates with respect to the global grid (static Grid grid)
Local coordinates: coordinates with respect to the local subgrid (static Subgrid subgrid)

s0, s1: source indices in global coordinates
d0, d1: destination indices in global coordinates
da = max(s0, d0);
db = min(s1, d1);

These are used in at least
acLoad()
acStore()
acSynchronizeHalos()

 Here we decompose the host mesh and distribute it among the GPUs in
 the node.

 The host mesh is a huge contiguous block of data. Its dimensions are given by
 the global variable named "grid". A "grid" is decomposed into "subgrids",
 one for each GPU. Here we check which parts of the range s0...s1 map
 to the memory space stored by some GPU, ranging d0...d1, and transfer
 the data if needed.

 The index mapping is inherently quite involved, but here's a picture which
 hopefully helps make sense out of all this.


 Grid
                                  |----num_vertices---|
 xxx|....................................................|xxx
          ^                   ^   ^                   ^
         d0                  d1  s0 (src)            s1

 Subgrid

          xxx|.............|xxx
          ^                   ^
         d0                  d1

                              ^   ^
                             db  da
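The da/db notation amounts to intersecting the source range with the range owned by one device. The following one-dimensional sketch shows the idea under stated assumptions: indices are treated as scalars, ranges are half-open ([s0, s1) and [d0, d1)), and the `Range` type and the `overlap` and `overlap_count` helpers are hypothetical names that do not appear in node.cc.

```c
#include <assert.h>

/* Hypothetical half-open index range [a, b). */
typedef struct {
    int a, b;
} Range;

static int
max_int(const int a, const int b)
{
    return a > b ? a : b;
}

static int
min_int(const int a, const int b)
{
    return a < b ? a : b;
}

/* Intersect the source range [s0, s1) with the device range [d0, d1):
 * da = max(s0, d0), db = min(s1, d1). */
static Range
overlap(const int s0, const int s1, const int d0, const int d1)
{
    const Range r = {max_int(s0, d0), min_int(s1, d1)};
    return r;
}

/* Number of vertices to transfer; zero when the ranges do not intersect,
 * which is the case whenever db <= da. */
static int
overlap_count(const Range r)
{
    return r.b > r.a ? r.b - r.a : 0;
}
```

If the source range is [40, 60) and the device stores [10, 30), then da = 40 and db = 30, so nothing is transferred to that device. If the source range is [20, 60) instead, da = 20 and db = 30, and 10 vertices are transferred.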

Function Documentation

◆ acNodeAutoOptimize()

AcResult acNodeAutoOptimize ( const Node node )

◆ acNodeCreate()

AcResult acNodeCreate	(	const int	id,
		const AcMeshInfo	node_config,
		Node *	node
	)

Initializes all devices on the current node.

Devices on the node are configured based on the contents of AcMeshInfo.

Returns: Exit status. Places the newly created handle in the output parameter.

See also: AcMeshInfo

Usage example:

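// Load the default configuration into an AcMeshInfo
AcMeshInfo info;
acLoadConfig(AC_DEFAULT_CONFIG, &info);
 
// Create a node handle with id 0, then release it when done
Node node;
acNodeCreate(0, info, &node);
acNodeDestroy(node);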

◆ acNodeDestroy()

AcResult acNodeDestroy ( Node node )

Resets all devices on the current node.

See also: acNodeCreate()

◆ acNodeIntegrate()

AcResult acNodeIntegrate	(	const Node	node,
		const AcReal	dt
	)

◆ acNodeIntegrateSubstep()

AcResult acNodeIntegrateSubstep	(	const Node	node,
		const Stream	stream,
		const int	isubstep,
		const int3	start,
		const int3	end,
		const AcReal	dt
	)

◆ acNodeLoadConstant()

AcResult acNodeLoadConstant	(	const Node	node,
		const Stream	stream,
		const AcRealParam	param,
		const AcReal	value
	)

◆ acNodeLoadMesh()

AcResult acNodeLoadMesh	(	const Node	node,
		const Stream	stream,
		const AcMesh	host_mesh
	)

◆ acNodeLoadMeshWithOffset()

AcResult acNodeLoadMeshWithOffset	(	const Node	node,
		const Stream	stream,
		const AcMesh	host_mesh,
		const int3	src,
		const int3	dst,
		const int	num_vertices
	)

◆ acNodeLoadVertexBuffer()

AcResult acNodeLoadVertexBuffer	(	const Node	node,
		const Stream	stream,
		const AcMesh	host_mesh,
		const VertexBufferHandle	vtxbuf_handle
	)

Deprecated ?

◆ acNodeLoadVertexBufferWithOffset()

AcResult acNodeLoadVertexBufferWithOffset	(	const Node	node,
		const Stream	stream,
		const AcMesh	host_mesh,
		const VertexBufferHandle	vtxbuf_handle,
		const int3	src,
		const int3	dst,
		const int	num_vertices
	)

Deprecated ? Might be useful, though, if the user wants to load only one vtxbuf. But in this case the user should supply an AcReal* instead of vtxbuf_handle.

◆ acNodePeriodicBoundconds()

AcResult acNodePeriodicBoundconds	(	const Node	node,
		const Stream	stream
	)

◆ acNodePeriodicBoundcondStep()

AcResult acNodePeriodicBoundcondStep	(	const Node	node,
		const Stream	stream,
		const VertexBufferHandle	vtxbuf_handle
	)

◆ acNodePrintInfo()

AcResult acNodePrintInfo ( const Node node )

Prints information about the devices available on the current node.

Requires that Node has been initialized with acNodeCreate().

◆ acNodeQueryDeviceConfiguration()

AcResult acNodeQueryDeviceConfiguration	(	const Node	node,
		DeviceConfiguration *	config
	)

See also: DeviceConfiguration

◆ acNodeReduceScal()

AcResult acNodeReduceScal	(	const Node	node,
		const Stream	stream,
		const ReductionType	rtype,
		const VertexBufferHandle	vtxbuf_handle,
		AcReal *	result
	)

◆ acNodeReduceVec()

AcResult acNodeReduceVec	(	const Node	node,
		const Stream	stream,
		const ReductionType	rtype,
		const VertexBufferHandle	a,
		const VertexBufferHandle	b,
		const VertexBufferHandle	c,
		AcReal *	result
	)

◆ acNodeStoreMesh()

AcResult acNodeStoreMesh	(	const Node	node,
		const Stream	stream,
		AcMesh *	host_mesh
	)

◆ acNodeStoreMeshWithOffset()

AcResult acNodeStoreMeshWithOffset	(	const Node	node,
		const Stream	stream,
		const int3	src,
		const int3	dst,
		const int	num_vertices,
		AcMesh *	host_mesh
	)

◆ acNodeStoreVertexBuffer()

AcResult acNodeStoreVertexBuffer	(	const Node	node,
		const Stream	stream,
		const VertexBufferHandle	vtxbuf_handle,
		AcMesh *	host_mesh
	)

Deprecated ?

◆ acNodeStoreVertexBufferWithOffset()

AcResult acNodeStoreVertexBufferWithOffset	(	const Node	node,
		const Stream	stream,
		const VertexBufferHandle	vtxbuf_handle,
		const int3	src,
		const int3	dst,
		const int	num_vertices,
		AcMesh *	host_mesh
	)

Deprecated ?

◆ acNodeSwapBuffers()

AcResult acNodeSwapBuffers ( const Node node )

◆ acNodeSynchronizeMesh()

AcResult acNodeSynchronizeMesh	(	const Node	node,
		const Stream	stream
	)

◆ acNodeSynchronizeStream()

AcResult acNodeSynchronizeStream	(	const Node	node,
		const Stream	stream
	)

◆ acNodeSynchronizeVertexBuffer()

AcResult acNodeSynchronizeVertexBuffer	(	const Node	node,
		const Stream	stream,
		const VertexBufferHandle	vtxbuf_handle
	)

Deprecated ?
