Astaroth 2.2
Multi-GPU implementation.
[Figure: include dependency graph for node.cc]
Data Structures | |
| struct | node_s |
Functions | |
| AcResult | acNodeCreate (const int id, const AcMeshInfo node_config, Node *node_handle) |
| AcResult | acNodeDestroy (Node node) |
| AcResult | acNodePrintInfo (const Node node) |
| AcResult | acNodeQueryDeviceConfiguration (const Node node, DeviceConfiguration *config) |
| AcResult | acNodeAutoOptimize (const Node node) |
| AcResult | acNodeSynchronizeStream (const Node node, const Stream stream) |
| AcResult | acNodeSynchronizeVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle) |
| AcResult | acNodeSynchronizeMesh (const Node node, const Stream stream) |
| AcResult | acNodeSwapBuffers (const Node node) |
| AcResult | acNodeLoadConstant (const Node node, const Stream stream, const AcRealParam param, const AcReal value) |
| AcResult | acNodeLoadVertexBufferWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices) |
| AcResult | acNodeLoadMeshWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const int3 src, const int3 dst, const int num_vertices) |
| AcResult | acNodeLoadVertexBuffer (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle) |
| AcResult | acNodeLoadMesh (const Node node, const Stream stream, const AcMesh host_mesh) |
| AcResult | acNodeStoreVertexBufferWithOffset (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh) |
| AcResult | acNodeStoreMeshWithOffset (const Node node, const Stream stream, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh) |
| AcResult | acNodeStoreVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, AcMesh *host_mesh) |
| AcResult | acNodeStoreMesh (const Node node, const Stream stream, AcMesh *host_mesh) |
| AcResult | acNodeIntegrateSubstep (const Node node, const Stream stream, const int isubstep, const int3 start, const int3 end, const AcReal dt) |
| AcResult | acNodeIntegrate (const Node node, const AcReal dt) |
| AcResult | acNodePeriodicBoundcondStep (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle) |
| AcResult | acNodePeriodicBoundconds (const Node node, const Stream stream) |
| AcResult | acNodeReduceScal (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle vtxbuf_handle, AcReal *result) |
| AcResult | acNodeReduceVec (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle a, const VertexBufferHandle b, const VertexBufferHandle c, AcReal *result) |
Multi-GPU implementation.
JP: The old way of computing boundary conditions conflicts with the way we have to do things with multiple GPUs.
The older approach relied on unified memory, which represented the whole memory area as one huge mesh instead of several smaller ones. However, unified memory in its current state is meant more for quick prototyping when performance is not an issue. Getting the CUDA driver to migrate data intelligently across GPUs is much more difficult than managing the memory explicitly.
In this new approach, I have simplified the multi- and single-GPU layers significantly. Quick rundown: New struct: Grid. There are two global variables, "grid" and "subgrid", which contain the extents of the whole simulation domain and the decomposed grids, respectively. To simplify things, we require that each GPU is assigned the same amount of work; therefore each GPU in the node is assigned a "subgrid.m"-sized block of data to work with.
The whole simulation domain is decomposed with respect to the z dimension. For example, if the grid contains (nx, ny, nz) vertices, then the subgrids contain (nx, ny, nz / num_devices) vertices.
A local index (i, j, k) in some subgrid can be mapped to the global grid with
global idx = (i, j, k + device_id * subgrid.n.z)
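The decomposition and the index mapping above can be sketched as follows. This is a minimal illustration, not the code in node.cc: `Int3`, `subgrid_extents` and `local_to_global` are hypothetical names introduced here, `Int3` stands in for CUDA's `int3`, and the sketch assumes that nz is evenly divisible by the number of devices.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's int3 so the sketch compiles without CUDA headers.
struct Int3 {
    int x, y, z;
};

// Extents of one subgrid when the global grid (nx, ny, nz) is split along z.
// Assumption: nz is divisible by num_devices, since every GPU gets the same
// amount of work.
Int3 subgrid_extents(const Int3 grid_n, const int num_devices)
{
    assert(num_devices > 0);
    assert(grid_n.z % num_devices == 0);
    return Int3{grid_n.x, grid_n.y, grid_n.z / num_devices};
}

// Maps a local index (i, j, k) on device `device_id` to the global grid:
// global idx = (i, j, k + device_id * subgrid.n.z)
Int3 local_to_global(const Int3 local, const int device_id, const Int3 subgrid_n)
{
    return Int3{local.x, local.y, local.z + device_id * subgrid_n.z};
}
```

For instance, with a 128^3 grid on four devices each subgrid spans 32 vertices in z, so the local index (1, 2, 3) on device 2 maps to the global index (1, 2, 67).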
Terminology:
Changes required to this commented code block:
Note on periodic boundaries (might be helpful when implementing other boundary conditions):
With multiple GPUs, periodic boundary conditions applied on indices ranging from
(0, 0, STENCIL_ORDER/2) to (subgrid.m.x, subgrid.m.y, subgrid.m.z - STENCIL_ORDER/2)
on a single device are "local", in the sense that they can be computed without having to exchange data with neighboring GPUs. Special care is needed only for transferring the data to the front and back plates outside this range. In the solution we use here, we solve the local boundaries first, and then just exchange the front and back plates in a "ring", like so:
device_id (n) <-> 0 <-> 1 <-> ... <-> n <-> (0)
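The ring ordering can be expressed with modular arithmetic. The sketch below only shows how the neighbors of a device could be picked; `ring_prev` and `ring_next` are hypothetical helpers, not functions from node.cc, and the actual plate transfers are left out.

```cpp
#include <cassert>

// Hypothetical helper: the device before `device_id` in the ring.
// Device 0 wraps around to the last device.
int ring_prev(const int device_id, const int num_devices)
{
    assert(num_devices > 0);
    assert(device_id >= 0 && device_id < num_devices);
    return (device_id + num_devices - 1) % num_devices;
}

// Hypothetical helper: the device after `device_id` in the ring.
// The last device wraps around to device 0.
int ring_next(const int device_id, const int num_devices)
{
    assert(num_devices > 0);
    assert(device_id >= 0 && device_id < num_devices);
    return (device_id + 1) % num_devices;
}
```

With a single device both helpers return 0, so the device exchanges plates with itself, which is the ordinary single-GPU periodic case.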
Global coordinates: coordinates with respect to the global grid (static Grid grid)
Local coordinates: coordinates with respect to the local subgrid (static Subgrid subgrid)
s0, s1: source indices in global coordinates
d0, d1: destination indices in global coordinates
da = max(s0, d0);
db = min(s1, d1);
These are used in at least
acLoad()
acStore()
acSynchronizeHalos()
Here we decompose the host mesh and distribute it among the GPUs in
the node.
The host mesh is a huge contiguous block of data. Its dimensions are given by
the global variable named "grid". A "grid" is decomposed into "subgrids",
one for each GPU. Here we check which parts of the range s0...s1 map
to the memory space stored by some GPU, ranging d0...d1, and transfer
the data if needed.
The index mapping is inherently quite involved, but here's a picture which
hopefully helps make sense out of all this.
Grid
                                   |----num_vertices---|
xxx|....................................................|xxx
           ^                   ^   ^                   ^
           d0                  d1  s0 (src)            s1
Subgrid
        xxx|.............|xxx
           ^             ^
           d0            d1
                         ^   ^
                         db  da
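The overlap test described above (da = max(s0, d0), db = min(s1, d1)) can be sketched in one dimension along z. This is an illustration under stated assumptions rather than the code in node.cc: `Range`, `device_range` and `overlap` are hypothetical names, ranges are treated as half-open [begin, end), and halos are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical 1D range along z in global coordinates, half-open: [begin, end).
struct Range {
    int begin, end;
};

// Global z range owned by `device_id` (assumption: halos are ignored here).
Range device_range(const int device_id, const int subgrid_nz)
{
    assert(device_id >= 0 && subgrid_nz > 0);
    return Range{device_id * subgrid_nz, (device_id + 1) * subgrid_nz};
}

// Intersects the source range [s0, s1) with the device range [d0, d1).
// Returns true and writes [da, db) to `out` if some data must be transferred.
bool overlap(const Range src, const Range dev, Range* out)
{
    const int da = std::max(src.begin, dev.begin); // da = max(s0, d0)
    const int db = std::min(src.end, dev.end);     // db = min(s1, d1)
    if (da >= db)
        return false; // As in the picture: db lies before da, nothing to copy
    *out = Range{da, db};
    return true;
}
```

A load or store routine would loop over the devices, call `overlap` for each one, and issue a transfer only when it returns true.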
AcResult acNodeCreate (const int id, const AcMeshInfo node_config, Node *node)
Initializes all devices on the current node.
Devices on the node are configured based on the contents of the AcMeshInfo passed in node_config.
Usage example:
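The original example is not reproduced on this page, so the following is only a sketch of the create/destroy call pattern implied by the signatures of acNodeCreate and acNodeDestroy. Everything above `create_and_destroy_node` is a stand-in written for this sketch so that it compiles on its own: the type definitions, the `AC_SUCCESS`/`AC_FAILURE` values and the stub function bodies are assumptions, and real code would take them from the Astaroth headers instead. `create_and_destroy_node` is a hypothetical helper name.

```cpp
#include <cassert>

// --- Stand-ins (assumptions, NOT the Astaroth definitions) ---
enum AcResult { AC_SUCCESS = 0, AC_FAILURE = 1 };
struct AcMeshInfo {};
struct node_s {
    int id;
    AcMeshInfo config;
};
typedef node_s* Node;

// Stub with the documented signature; the real function initializes the devices.
AcResult acNodeCreate(const int id, const AcMeshInfo node_config, Node* node_handle)
{
    if (!node_handle)
        return AC_FAILURE;
    *node_handle = new node_s{id, node_config};
    return AC_SUCCESS;
}

// Stub with the documented signature; the real function resets the devices.
AcResult acNodeDestroy(Node node)
{
    delete node;
    return AC_SUCCESS;
}

// --- Call pattern: create the node, use it, destroy it ---
AcResult create_and_destroy_node(const AcMeshInfo info)
{
    Node node = nullptr;
    if (acNodeCreate(0, info, &node) != AC_SUCCESS)
        return AC_FAILURE;
    // ... load a mesh, integrate, store the results ...
    return acNodeDestroy(node);
}
```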
AcResult acNodeDestroy (Node node)
Resets all devices on the current node.
AcResult acNodeIntegrateSubstep (const Node node, const Stream stream, const int isubstep, const int3 start, const int3 end, const AcReal dt)
AcResult acNodeLoadConstant (const Node node, const Stream stream, const AcRealParam param, const AcReal value)
AcResult acNodeLoadMeshWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const int3 src, const int3 dst, const int num_vertices)
AcResult acNodeLoadVertexBuffer (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle)
Deprecated ?
AcResult acNodeLoadVertexBufferWithOffset (const Node node, const Stream stream, const AcMesh host_mesh, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices)
Deprecated? Might be useful, though, if the user wants to load only one vtxbuf. But in that case the user should supply an AcReal* instead of vtxbuf_handle.
AcResult acNodePeriodicBoundcondStep (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle)
AcResult acNodePrintInfo (const Node node)
Prints information about the devices available on the current node.
Requires that Node has been initialized with acNodeCreate().
AcResult acNodeQueryDeviceConfiguration (const Node node, DeviceConfiguration *config)
AcResult acNodeReduceScal (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle vtxbuf_handle, AcReal *result)
AcResult acNodeReduceVec (const Node node, const Stream stream, const ReductionType rtype, const VertexBufferHandle a, const VertexBufferHandle b, const VertexBufferHandle c, AcReal *result)
AcResult acNodeStoreMeshWithOffset (const Node node, const Stream stream, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh)
AcResult acNodeStoreVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, AcMesh *host_mesh)
Deprecated ?
AcResult acNodeStoreVertexBufferWithOffset (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle, const int3 src, const int3 dst, const int num_vertices, AcMesh *host_mesh)
Deprecated ?
AcResult acNodeSynchronizeVertexBuffer (const Node node, const Stream stream, const VertexBufferHandle vtxbuf_handle)
Deprecated ?