A module discussing the use of parallelism on shared memory machines. See the detailed documentation and Table of Contents below the lenghty list of members of this module. More...
![]() |
Namespaces | |
namespace | parallel |
A module discussing the use of parallelism on shared memory machines. See the detailed documentation and Table of Contents below the lenghty list of members of this module.
On machines with more than one processor (or multicore processors), it is often profitable to run several parts of the computations in parallel. For example, one could have several threads running in parallel, each of which assembles the cell matrices of a subset of the triangulation and then writes them into the global matrix. Since assembling matrices is often an expensive operation, this frequently leads to significant savings in compute time on multiprocessor machines.
deal.II supports operations running in parallel on shared-memory (SMP) machines through the functions and classes in the Threads namespace. The MultithreadInfo class allows to query certain properties of the system, such as the number of CPUs. These facilities for parallel computing are described in the following. The step-9, step-13, step-14, and step-35 tutorial programs also show their use in practice, with the ones starting with step-35 using a more modern style of doing things in which essentially we describe what can be done in parallel, while the older tutorial programs describe how things have to be done in parallel.
On the other hand, programs running on distributed memory machines (i.e. clusters) need a different programming model built on top of MPI and PETSc or Trilinos. This is described in the step-17, and step-18 example programs.
The traditional view of parallelism on shared memory machines has been to decompose a program into threads, i.e. running different parts of the program in parallel at the same time (if there are more threads than processor cores on your machine, the operating system will run each thread round-robin for a brief amount of time before switching execution to another thread, thereby simulating that threads run concurrently). deal.II's facilities for threads are described below (see Thread-based parallelism), but we would first like to discuss an abstraction that is often more suitable than threads: tasks.
Tasks are essentially the individual parts of a program. Some of them are independent, whereas others depend on previous tasks to be completed first. By way of example, consider the typical layout of a part of the setup_dofs
function that most of the tutorial programs have:
1 dof_handler.distribute_dofs (fe); 2 DoFTools::make_hanging_node_constraints (dof_handler, hanging_node_constraints); 3 DoFTools::make_sparsity_pattern (dof_handler, sparsity_pattern); 4 hanging_node_constraints.condense (sparsity_pattern);
Here, each of the operations require a significant amount of computations. But note that not all of them depend on each other: clearly we can not run statements 2-4 before 1, and 4 needs to wait for the completion of statements 2 and 3. But statements 2 and 3 are independent: they could be run in any order, or in parallel. In essence, we have identified four tasks, some of which are dependent on each other, whereas others are independent. In the current example, tasks are identified with individual C++ statements, but often they more generally coincide with entire code blocks.
The point here is this: To exploit the independence of tasks 2 and 3, we could start two threads and run each task on its own thread; we would then wait for the two threads to finish (an operation called "joining a thread") and go on with statement 4. As discussed in more detail below, code to achieve this would look like this:
dof_handler.distribute_dofs (fe); Threads::Thread<void> thread_1 = Threads::new_thread (&DoFTools::make_hanging_node_constraints, dof_handler, hanging_node_constraints); Threads::Thread<void> thread_2 = Threads::new_thread (&DoFTools::make_sparsity_pattern, dof_handler, sparsity_pattern); thread_1.join(); thread_2.join(); hanging_node_constraints.condense (sparsity_pattern);
But what if your computer has only one processor core, or if we have two but there is already a different part of the program running in parallel to the code above? In that case, we could of course still start new threads, but the program is not going to run faster since no additional compute resources are available; rather, the program will run slower since threads have to be created and destroyed, and the operating system has to schedule threads to oversubscribed compute resources.
A better scheme would identify independent tasks and then hand them off to a scheduler that maps tasks to available compute resources. This way, the program could, for example, start one thread per processor core and then let threads work on tasks. Tasks would run to completion, rather than concurrently, avoiding the overhead of interrupting threads to run a different thread. In this model, if two processor cores are available, tasks 2 and 3 above would run in parallel; if only one is available, the scheduler would first completely execute task 2 before doing task 3, or the other way around. This model is able to execute much more efficiently in particular if a large number of tasks is available for execution, see for example the discussion below in section Abstractions for tasks: Work streams. In essence, tasks are a high-level description of what needs to be done, whereas threads are a low-level way of implementing how these tasks can be completed. As in many other instances, being able to use a high-level description allows to find efficient low-level implementations; in this vein, it often pays off to use tasks, rather than threads, in a program.
deal.II does not implement scheduling tasks to threads itself. For this, we use the Threading Building Blocks (TBB) library for which we provide simple wrappers. TBB abstracts the details of how to start or stop threads, start tasks on individual threads, etc, and provides interfaces that are portable across many different systems.
Ideally, the syntax to start tasks (and similarly for threads, for that matter), would be something like this for the example above:
Threads::Task<void> thread = new_task DoFTools::make_hanging_node_constraints (dof_handler, hanging_node_constraints);
In other words, we would like to indicate the fact that the function call should be run on a separate task by simply prefixing the call with a keyword (such as new_task
here, with a similar keyword new_thread
for threads). Prefixing a call would return a handle for the task that we can use to wait for the tasks's completion and that we may use to query the return value of the function called (unless it is void, as it is here).
Since C++ does not support the creation of new keywords, we have to be a bit more creative. The way chosen is to introduce a function new_task
that takes as arguments the function to call as well as the arguments to the call. The new_task
function is overloaded to accomodate starting tasks with functions that take no, one, two, and up to 9 arguments. In deal.II, these functions live in the Threads namespace. Consequently, the actual code for what we try to do above looks like this:
Threads::Task<void> thread = Threads::new_task (&DoFTools::make_hanging_node_constraints, dof_handler, hanging_node_constraints);
Note that DoFTools::make_hanging_node_constraints is a static member function and so does not need an object of type DoFTools to work on. (In fact, DoFTools has only static member functions and could as well be a namespace instead of class; that it is a class at the time of writing this is mostly a historic relic.)
Similarly, if we want to call a member function on a different task, we can do so by specifying the object on which to call the function as first argument after the function pointer:
class C { public: double f(int); }; int main () { C c; // call f(13) as usual, i.e. using the current processor: c.f(13); // call f(42) as a separate task, to be scheduled // whenever processor resources become available: Threads::Task<double> task = Threads::new_task (&C::f, c, 42); // do something else in between: ...; // having finished our other business, wait until the task // above has terminated and get the value returns by c.f(42): double result = task.return_value();
Here, note first how we pass the object c
(i.e. the this
pointer the function C::f
will see) as if it was the first argument to the function. Secondly, note how we can acquire the value returned by the function on the separate task by calling Threads::Task::return_value(). This function implies waiting for the completion of the task, i.e. the last line is completely equivalent to
task.join ();
double result = task.return_value();
Note also that it is entirely valid if C::f
wants to start tasks of its own:
class C { public: double f(int); private: double f1(int); double f2(int); }; double C::f (int i) { Threads::Task<double> t1 = Threads::new_task (&C::f1, *this, i); Threads::Task<double> t2 = Threads::new_task (&C::f2, *this, i); return t1.return_value() + t2.return_value(); } int main () { C c; Threads::Task<double> task = Threads::new_task (&C::f, c, 42); // do something else in between: ...; double result = task.return_value();
Here, we let C::f
compute its return value as c.f1(i)+c.f2(i)
. If sufficient CPU resources are available, then the two parts of the addition as well as the other things in main()
will all run in parallel. If not, then we will eventually block at one of the places where the return value is needed, thereby freeing up the CPU resources necessary to run all those spawned tasks to completion.
In many cases, such as the introductory example of the setup_dofs
function outlined above, one can identify several independent jobs that can be run as tasks, but will have to wait for all of them to finish at one point. One can do so by storing the returned object from all the Threads::new_task() calls, and calling Threads::Task::join() on each one of them. A simpler way to do this is to put all of these task objects into a Threads::TaskGroup object and waiting for all of them at once. The code would then look like this:
dof_handler.distribute_dofs (fe); Threads::TaskGroup<void> task_group; task_group += Threads::new_task (&DoFTools::make_hanging_node_constraints, dof_handler, hanging_node_constraints); task_group += Threads::new_task (&DoFTools::make_sparsity_pattern, dof_handler, sparsity_pattern); task_group.join_all (); hanging_node_constraints.condense (sparsity_pattern);
The exact details of how tasks are scheduled to run are internal to the Threading Building Blocks (TBB) library that deal.II uses for tasks. The documentation of TBB gives a detailed description of how tasks are scheduled to threads but is rather quiet on how many threads are actually used. However, a reasonable guess is probably to assume that TBB creates as many threads as there are processor cores on your system. This way, it is able to fully utilize the entire system, without having too many threads that the operating system will then have to interrupt regularly so that other threads can run on the available processor cores.
The point then is that the TBB scheduler takes tasks and lets threads execute them. Threads execute tasks completely, i.e. the TBB scheduler does not interrupt a task half way through to make some halfway progress with another task. This makes sure that caches are always hot, for example, and avoids the overhead of preemptive interrupts.
The downside is that the CPU cores are only fully utilized if the threads are actually doing something, and that means that (i) there must be enough tasks available, and (ii) these tasks are actually doing something. Note that both conditions must be met; in particular, this means that CPU cores are underutilized if we have identified a sufficient number of tasks but if some of them twiddle thumbs, for example because a task is writing data to disk (a process where the CPU frequently has to wait for the disk to complete a transaction) or is waiting for input. Other cases are where tasks block on other external events, for example synchronising with other tasks or threads through a mutex. In such cases, the scheduler would let a task run on a thread, but doesn't notice that that thread doesn't fully utilize the CPU core.
In cases like these, it does make sense to create a new thread (see Thread-based parallelism below) that the operating system can put on hold while they are waiting for something external, and let a different thread (for example one running a task scheduled by TBB) use the CPU at the same time.
Some loops execute bodies on data that is completely independent and that can therefore be executed in parallel. Rather than a priori split the loop into a fixed number of chunks and executing them on tasks or threads, the TBB library uses the following concept: the range over which the loop iterates is split into a certain number of sub-ranges (for example two or three times as many as there are CPU cores) and are equally distributed among threads; threads then execute sub-ranges and, if they are done with their work, steal entire or parts of sub-ranges from other threads to keep busy. This way, work is load-balanced even if not every loop iteration takes equally much work, or if some of the CPU cores fall behind because the operating system interrupted them for some other work.
The TBB library primitives for this are a bit clumsy so deal.II has wrapper routines for the most frequently used operations. The simplest one is akin to what the std::transform does: it takes one or more ranges of input operators, one output iterator, and a function object. A typical implementation of std::transform would look like this:
template <typename InputIterator1, typename InputIterator, typename OutputIterator, typename FunctionObject> void transform (const InputIterator1 &begin_in_1, const InputIterator1 &end_in_1, const InputIterator2 &begin_in_2, const OutputIterator &begin_out, FunctionObject &function) { InputIterator1 in_1 = begin_in_1; InputIterator2 in_2 = begin_in_2; OutputIterator out = begin_out; for (; in_1 != end_in_1; ++in_1, ++in_2, ++out) *out = function(*in_1, *in_2); }
In many cases, function
has no state, and so we can split this loop into several sub-ranges as explained above. Consequently, deal.II has a set of functions parallel::transform that look like the one above but that do their work in parallel (there are several versions with one, two, and more input iterators for function objects that take one, two, or more arguments). The only difference in calling these functions is that they take an additional last argument that denotes the minimum size of sub-ranges of [begin_in_1,end_in_1)
; it should be big enough so that we don't spend more time on scheduling sub-ranges to processors but small enough that processors can be efficiently load balanced. A rule of thumb appears to be that a sub-range is too small if it takes less than 2000 instructions to execute it.
An example of how to use these functions are vector operations like the addition in where all three objects are of type Vector:
parallel::transform (x.begin(), x.end(), y.begin(), z.begin(), (boost::lambda::_1 + boost::lambda::_2), 1000);
In this example, we used the Boost Lambda library to construct, on the fly, a function object that takes two arguments and returns the sum of the two. This is exactly what we needed when we want to add the individual elements of vectors and
and write the sum of the two into the elements of
. Because of the way Boost Lambda is written, the function object that we get here is completely known to the compiler and when it expands the loop that results from parallel::transform will be as if we had written the loop in its obvious form:
InputIterator1 in_1 = x.begin();
InputIterator2 in_2 = y.begin();
OutputIterator out = z.begin();
for (; in_1 != x.end(); ++in_1, ++in_2, ++out)
*out = *in_1 + *in_2;
The next C++ standard will contain a more elegant way to achieve the same effect shown above using the Boost Lambda library, through a mechanism known as lambda expressions and closures.
Note also that we have made sure that no CPU ever gets a chunk of the whole loop that is smaller than 1000 iterations (unless the whole range is smaller).
The scheme shown in the previous section is effective if the operation done in each iteration is such that it does not require significant setup costs and can be inlined by the compiler. Lambda expressions are exactly of this kind because the compiler knows everything about the lambda expression and can inline it, thereby eliminating the overhead of calling an external function. However, there are cases where it is inefficient to call some object or function within each iteration.
An example for this case is sparse matrix-vector multiplication. If you know how data is stored in compressed row format like in the SparseMatrix class, then a matrix-vector product function looks like this:
void SparseMatrix::vmult_one_row (const Vector &src, Vector &dst) const { const double *val_ptr = &values[0]; const unsigned int *colnum_ptr = &colnums[0]; Vector::iterator dst_ptr = dst.begin(); for (unsigned int row=0; row<n_rows; ++row, ++dst_ptr) { double s = 0.; const double *const val_end_of_row = &values[rowstart[row+1]]; while (val_ptr != val_end_of_row) s += *val_ptr++ * src(*colnum_ptr++); *dst_ptr = s; } }
Inside the for loop, we compute the dot product of a single row of the matrix with the right hand side vector src
and write it into the corresponding element of the dst
vector. The code is made more efficient by utilizing that the elements of the next row follow the ones of the current row immediately, i.e. at the beginning of the loop body we do not have to re-set the pointers that point to the values and column numbers of each row.
Using the parallel::transform function above, we could in principle write this code as follows:
void SparseMatrix::vmult (const Vector &src, Vector &dst, Vector::iterator &dst_row) const { const unsigned int row = (dst_row - dst.begin()); const double *val_ptr = &values[rowstart[row]]; const unsigned int *colnum_ptr = &colnums[rowstart[row]]; double s = 0.; const double *const val_end_of_row = &values[rowstart[row+1]]; while (val_ptr != val_end_of_row) s += *val_ptr++ * src(*colnum_ptr++); *dst_row = s; } void SparseMatrix::vmult (const Vector &src, Vector &dst) const { parallel::transform (dst.begin(), dst.end(), boost::bind (&SparseMatrix::vmult_one_row, this, boost::cref(src), boost::ref(dst), _1), 200); }
Note how we use boost::bind to bind certain arguments to the vmult_one_row
function, leaving one argument open and thus allowing the parallel::transform function to consider the passed function argument as unary. Also note that we need to make the source and destination vectors as (const) references to prevent boost::bind from passing them by value (implying a copy for src
and writing the result into a temporary copy of dst
, neither of which is what we desired). Finally, notice the grainsize of a minimum of 200 rows of a matrix that should be processed by an individual CPU core.
The point is that while this is correct, it is not efficient: we have to set up the row, val_ptr, colnum_ptr
variables in each iteration of the loop. Furthermore, since now the function object to be called on each row is not a simple Boost Lambda expression any more, there is an implied function call including argument passing in each iteration of the loop.
A more efficient way is to let TBB split the original range into sub-ranges, and then call a target function not on each individual element of the loop, but on the entire range. This is facilitated by the parallel::apply_to_subranges function:
void SparseMatrix::vmult_on_subrange (const unsigned int begin_row, const unsigned int end_row, const Vector &src, Vector &dst) { const double *val_ptr = &values[rowstart[begin_row]]; const unsigned int *colnum_ptr = &colnums[rowstart[begin_row]]; Vector::iterator dst_ptr = dst.begin() + begin_row; for (unsigned int row=begin_row; row<end_row; ++row, ++dst_ptr) { double s = 0.; const double *const val_end_of_row = &values[rowstart[row+1]]; while (val_ptr != val_end_of_row) s += *val_ptr++ * src(*colnum_ptr++); *dst_ptr = s; } } void SparseMatrix::vmult (const Vector &src, Vector &dst) const { parallel::apply_to_subranges (0, n_rows(), boost::bind (vmult_on_subrange, this, _1, _2, boost::cref(src), boost::ref(dst)), 200); }
Here, we call the vmult_on_subrange
function on sub-ranges of at least 200 elements each, so that the initial setup cost can amortize.
A related operation is when the loops over elements each produce a result that must then be accumulated (other reduction operations than addition of numbers would work as well). An example is to form the matrix norm (it really is only a norm if
is positive definite, but let's assume for a moment that it is). A sequential implementation would look like this for sparse matrices:
double SparseMatrix::mat_norm (const Vector &x) const { const double *val_ptr = &values[0]; const unsigned int *colnum_ptr = &colnums[0]; double norm_sqr = 0; for (unsigned int row=0; row<n_rows; ++row, ++dst_ptr) { double s = 0.; const double *const val_end_of_row = &values[rowstart[row+1]]; while (val_ptr != val_end_of_row) s += *val_ptr++ * x(*colnum_ptr++); norm_sqr += x(row) * s; } return std::sqrt (norm_sqr); }
It would be nice if we could split this operation over several sub-ranges of rows, each of which compute their part of the square of the norm, add results together from the various sub-ranges, and then take the square root of the result. This is what the parallel::accumulate_from_subranges function does (note that you have to specify the result type as a template argument and that, as usual, the minimumum number of elements of the outer loop that can be scheduled on a single CPU core is given as the last argument):
double SparseMatrix::mat_norm_sqr_on_subrange (const unsigned int begin_row, const unsigned int end_row, const Vector &x) { const double *val_ptr = &values[rowstart[begin_row]]; const unsigned int *colnum_ptr = &colnums[rowstart[begin_row]]; Vector::iterator dst_ptr = dst.begin() + begin_row; double norm_sqr = 0; for (unsigned int row=begin_row; row<end_row; ++row, ++dst_ptr) { double s = 0.; const double *const val_end_of_row = &values[rowstart[row+1]]; while (val_ptr != val_end_of_row) s += *val_ptr++ * x(*colnum_ptr++); norm_sqr += x(row) * s; } return norm_sqr; } double SparseMatrix::mat_norm (const Vector &x) const { return std::sqrt (parallel::accumulate_from_subranges (0, n_rows(), boost::bind (mat_norm_sqr_on_subrange, this, _1, _2, boost::cref(x)), 200)); }
In the examples shown in the introduction we had identified a number of functions that can be run as independent tasks. Ideally, this number of tasks is larger than the number of CPU cores (to keep them busy) but is also not exceedingly huge (so as not to inundate the scheduler with millions of tasks that will then have to be distributed to 2 or 4 cores, for example). There are, however, cases where we have many thousands or even millions of relatively independent jobs: for example, assembling local contributions to the global linear system on each cell of a mesh; evaluating an error estimator on each cell; or postprocessing on each cell computed data for output fall into this class.
Code like this could then be written like this:
template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell) { ... } template <int dim> void MyClass<dim>::assemble_system () { Threads::TaskGroup<void> task_group; for (typename DoFHandler<dim>::active_cell_iterator cell = dof_handler.begin_active(); cell != dof_handler.end(); ++cell) task_group += Threads::new_task (&MyClass<dim>::assemble_on_one_cell, *this, cell); task_group.join_all (); }
On a big mesh, with maybe a million cells, this would create a massive number of tasks; while it would keep all CPU cores busy for a while, the overhead of first creating so many tasks, scheduling them, and then waiting for them would probably not lead to efficient code. A better strategy would be if the scheduler could somehow indicate that it has available resources, at which point we would feed it another newly created task, and we would do so until we run out of tasks and the ones that were created have been worked on.
This is essentially what the WorkStream class does: You give it an iterator range from which it can draw objects to work on (in the above case it is the interval given by dof_handler.begin_active()
to dof_handler.end()
), and a function that would do the work on each item (the function MyClass::assemble_on_one_cell
) together with an object if it is a member function.
Essentially, the way the MyClass::assemble_system
function could be written then is like this (note that this is not quite the correct syntax, as will be described below):
template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell) { ... } template <int dim> void MyClass<dim>::assemble_system () { WorkStream::run (dof_handler.begin_active(), dof_handler.end(), *this, &MyClass<dim>::assemble_on_one_cell); }
There are at least three problems with this, however:
First, let us take a look at how the MyClass::assemble_on_one_cell
function likely looks:
template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell) { FEValues<dim> fe_values (...); FullMatrix<double> cell_matrix (...); Vector<double> cell_rhs (...); // assemble local contributions fe_values.reinit (cell); for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) for (unsigned int q=0; q<fe_values.n_quadrature_points; ++q) cell_matrix(i,j) += ...; ...same for cell_rhs... // now copy results into global system std::vector<unsigned int> dof_indices (...); cell->get_dof_indices (dof_indices); for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) system_matrix.add (dof_indices[i], dof_indices[j], cell_matrix(i,j)); ...same for rhs... }
The problem here is that several tasks, each running MyClass::assemble_on_one_cell
, could potentially try to write into the object MyClass::system_matrix
at the same time. This could be avoided by explicit synchronisation using a Threads::Mutex, for example, and would look like this:
// now copy results into global system std::vector<unsigned int> dof_indices (...); cell->get_dof_indices (dof_indices); static Threads::Mutex mutex; mutex.acquire (); for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) system_matrix.add (dof_indices[i], dof_indices[j], cell_matrix(i,j)); ...same for rhs... mutex.release (); }
By making the mutex a static variable, it exists only once globally (i.e. once for all tasks that may be running in parallel) and only one of the tasks can enter the region protected by the acquire/release calls on the mutex. As an aside, a better way to write this code would be like this, ensuring that the mutex is released even in case an exception is thrown, and without the need to remember to write the call to Threads::Mutex::release():
// now copy results into global system static Threads::Mutex mutex; Threads::Mutex::ScopedLock lock (mutex); for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) system_matrix.add (dof_indices[i], dof_indices[j], cell_matrix(i,j)); ...same for rhs... }
Here, the mutex remains locked from the time the ScopedLock is created to where it is destroyed, at the end of the code block.
Note that although we now avoid the race condition that multiple threads could be writing to the same object, this code is not very efficient: mutexes are expensive on multicore machines, and we also block threads some of the time which is inefficient with tasks as explained above in the section on How scheduling tasks works and when task-based programming is not efficient.
A second correctness problem is that even if we do lock the global matrix and right hand side objects using a mutex, we do so in a more or less random order: while tasks are created in the order in which we traverse cells normally, there is no guarantee that by the time we get to the point where we want to copy the local into the global contributions the order is still as if we computed things sequentially. In other words, it may happen that we add the contributions of cell 1 before those of cell 0. That may seem harmless because addition is commutative and associative, but in fact it is not if done in floating point arithmetic: -- take for example
(because
in floating point arithmetic, using double precision).
As a consequence, the exact values that end up in the global matrix and right hand side will be close but may differ by amounts close to round-off depending on the order in which tasks happened to finish their job. That's not a desirable outcome, since results will not be reproducible this way.
As a consequence, the way the WorkStream class is designed is to use two functions: the MyClass::assemble_on_one_cell
computes the local contributions and stores them somewhere (we'll get to that next), and a second function, say MyClass::copy_local_to_global
, that copies the results computed on each cell into the global objects. The trick implemented in the WorkStream class is that (i) the MyClass::copy_local_to_global
never runs more than once in parallel, so we do not need to synchronise execution through a mutex, and (ii) it runs in exactly the same order on cells as they appear in the iterator range, i.e. we add elements into the global matrix the same way every time, independently of when the computation of these element finishes.
We now only have to discuss how the MyClass::assemble_on_one_cell
communicates to MyClass::copy_local_to_global
what it has computed. The way this is done is to use an object that holds all temporary data:
struct PerTaskData { FullMatrix<double> cell_matrix; Vector<double> cell_rhs; std::vector<unsigned int> dof_indices; } template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell, PerTaskData &data) { FEValues<dim> fe_values (...); data.cell_matrix = 0; data.cell_rhs = 0; // assemble local contributions fe_values.reinit (cell); for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) for (unsigned int q=0; q<fe_values.n_quadrature_points; ++q) data.cell_matrix(i,j) += ...; ...same for cell_rhs... cell->get_dof_indices (data.dof_indices); } template <int dim> void MyClass<dim>::copy_local_to_global (const PerTaskData &data) { for (unsigned int i=0; i<fe.dofs_per_cell; ++i) for (unsigned int j=0; j<fe.dofs_per_cell; ++j) system_matrix.add (data.dof_indices[i], data.dof_indices[j], data.cell_matrix(i,j)); ...same for rhs... } template <int dim> void MyClass<dim>::assemble_system () { PerTaskData per_task_data; ...initialize members of per_task_data to the correct sizes... WorkStream::run (dof_handler.begin_active(), dof_handler.end(), *this, &MyClass<dim>::assemble_on_one_cell, &MyClass<dim>::copy_local_to_global, per_task_data); }
The way this works is that we create a sample per_task_data
object that the work stream object will replicate once per task that runs in parallel. For each task, this object will be passed first to one of possibly several instances of MyClass::assemble_on_one_cell
running in parallel which fills it with the data obtained on a single cell, and then to a sequentially running MyClass::copy_local_to_global
that copies data into the global object. In practice, of course, we will not generate millions of per_task_data
objects if we have millions of cells; rather, we recycle these objects after they have been used by MyClass::copy_local_to_global
and feed them back into another instance of MyClass::assemble_on_one_cell
; this means that the number of such objects we actually do create is a small multiple of the number of threads the scheduler uses, which is typically about as many as there are CPU cores on a system.
The last issue that is worth addressing is that the way we wrote the MyClass::assemble_on_one_cell
function above, we create and destroy an FEValues object every time the function is called, i.e. once for each cell in the triangulation. That's an immensely expensive operation because the FEValues class tries to do a lot of work in its constructor in an attempt to reduce the number of operations we have to do on each cell (i.e. it increases the constant in the effort to initialize such an object in order to reduce the constant in the
operations to call FEValues::reinit on the
cells of a triangulation). Creating and destroying an FEValues object on each cell invalidates this effort.
The way to avoid this is to put the FEValues object into a second structure that will hold scratch data, and initialize it in the constructor:
struct PerTaskData { FullMatrix<double> cell_matrix; Vector<double> cell_rhs; std::vector<unsigned int> dof_indices; PerTaskData (const FiniteElement<dim> &fe) : cell_matrix (fe.dofs_per_cell, fe.dofs_per_cell), cell_rhs (fe.dofs_per_cell), dof_indices (fe.dofs_per_cell) {} } struct ScratchData { FEValues<dim> fe_values; ScratchData (const FiniteElement<dim> &fe, const Quadrature<dim> &quadrature, const UpdateFlags update_flags) : fe_values (fe, quadrature, update_flags) {} }
and then use this FEValues object in the assemble function:
template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell, ScratchData &scratch, PerTaskData &data) { scratch.fe_values.reinit (cell); ... }
The same approach, putting things into the ScratchData
data structure should be used for everything that is expensive to construct. This holds, in particular, for everything that needs to allocate memory upon construction; for example, if the values of a function need to be evaluated at quadrature points, then this is expensive:
template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell, ScratchData &scratch, PerTaskData &data) { std::vector<double> rhs_values (fe_values.n_quadrature_points); rhs_function.value_list (data.fe_values.get_quadrature_points, rhs_values) ... }
whereas this is a much cheaper way:
struct ScratchData { std::vector<double> rhs_values; FEValues<dim> fe_values; ScratchData (const FiniteElement<dim> &fe, const Quadrature<dim> &quadrature, const UpdateFlags update_flags) : rhs_values (quadrature.n_quadrature_points), fe_values (fe, quadrature, update_flags) {} } template <int dim> void MyClass<dim>::assemble_on_one_cell (const typename DoFHandler<dim>::active_cell_iterator &cell, ScratchData &scratch, PerTaskData &data) { rhs_function.value_list (scratch.fe_values.get_quadrature_points, scratch.rhs_values) ... }
As a final point: What if, for some reason, my assembler and copier function do not match the above signature with three and one argument, respectively? That's not a problem either. The WorkStream class offers two versions of the WorkStream::run() function: one that takes an object and the addresses of two member functions, and one that simply takes two function objects that can be called with three and one argument, respectively. So, in other words, the following two calls are exactly identical:
WorkStream::run (dof_handler.begin_active(), dof_handler.end(), *this, &MyClass<dim>::assemble_on_one_cell, &MyClass<dim>::copy_local_to_global, per_task_data); // ...is the same as: WorkStream::run (dof_handler.begin_active(), dof_handler.end(), boost::bind(&MyClass<dim>::assemble_on_one_cell, *this, _1, _2, _3), boost::bind(&MyClass<dim>::copy_local_to_global, *this, _1), per_task_data);
Note how boost::bind
produces a function object that takes three arguments by binding the member function to the *this
object. _1, _2
and _3
are placeholders for the first, second and third argument that can be specified later on. In other words, for example if p
is the result of the first call to boost::bind
, then the call p(cell, scratch_data, per_task_data)
will result in executing this->assemble_on_one_cell (cell, scratch_data, per_task_data)
, i.e. boost::bind
has bound the object to the function pointer but left the three arguments open for later.
Similarly, let us assume that MyClass::assemble_on_one_cell
has the following signature in the solver of a nonlinear, time-dependent problem:
template <int dim> void MyClass<dim>::assemble_on_one_cell (const Vector<double> &linearization_point, const typename DoFHandler<dim>::active_cell_iterator &cell, ScratchData &scratch, PerTaskData &data, const double current_time) { ... }
Because WorkStream expects to be able to call the worker function with just three arguments, the first of which is the iterator and the second and third the ScratchData and PerTaskData objects, we need to pass the following to it:
WorkStream::run (dof_handler.begin_active(), dof_handler.end(), boost::bind(&MyClass<dim>::assemble_on_one_cell, *this, current_solution, _1, _2, _3, previous_time+time_step), boost::bind(&MyClass<dim>::copy_local_to_global, *this, _1), per_task_data);
Here, we bind the object, the linearization point argument, and the current time argument to the function before we hand it off to WorkStream::run(). WorkStream::run() will then simply call the function with the cell and scratch and per task objects which will be filled in at the positions indicated by _1, _2
and _3
.
To see the WorkStream class used in practice on tasks like the ones outlined above, take a look at the step-35 tutorial program.
Even though tasks are a higher-level way to describe things, there are cases where they are poorly suited to a task. The main reason for not using tasks even for computations that are independent are listed in the section on How scheduling tasks works and when task-based programming is not efficient above. Primarily, jobs that are not able to fully utilize the CPU are bad fits for tasks.
In a case like this, you can resort to explicitly start threads, rather than tasks, using pretty much the same syntax as above. For example, if you had a function in your application that generates graphical output and then estimates the error to refine the mesh for the next iteration of an adaptive mesh scheme, it could look like this:
template <int dim> void MyClass<dim>::output_and_estimate_error () const { DataOut<dim> data_out; data_out.attach_dof_handler (dof_handler); data_out.add_data_vector (solution, "solution"); data_out.build_patches (); std::ofstream output ("solution.vtk"); Threads::Thread<void> thread = Threads::new_thread (&DataOut<dim>::write_vtk, data_out, output); Vector<float> error_per_cell (triangulation.n_active_cells()); KellyErrorEstimator<dim>::estimate (dof_handler, QGauss<dim-1>(3), typename FunctionMap<dim>::type(), solution, estimated_error_per_cell); thread.join ();
Here, Threads::new_thread starts the given function that writes to the output file on a new thread that can run in parallel to everything else. Note that this function is actually pretty well parallelized: both DataOut::build_patches() and KellyErrorEstimator::estimate() already use WorkStream and will therefore utilize pretty much all available compute resources. In parallel to the KellyErrorEstimator::estimate() function, the DataOut::write_vtk() function will run on a separate thread, independent of the scheduler that takes care of the tasks, but that is not a problem because writing lots of data to a file is not something that will keep a CPU very busy.
Creating threads works pretty much the same way as tasks, i.e. you can wait for the termination of a thread using Threads::Thread::join(), query the return value of a finished thread using Threads::Thread::return_value(), and you can group threads into a Threads::ThreadGroup object and wait for all of them to finish.
As mentioned earlier, deal.II does not implement scheduling tasks to threads or even starting threads itself. The TBB library does a good job at deciding how many threads to use and they do not recommend setting the number of threads explicitly. However, on large symmetric multiprocessing (SMP) machines, especially ones with a resource/job manager or on systems on which access to some parts of the memory is possible but very expensive for processors far away (e.g. large NUMA SMP machines), it may be necessary to explicitly set the number of threads to prevent the TBB from using too many CPUs. In this case the following commands can be used:
#include <tbb/task_scheduler_init.h> // In the program, before any task-based parallelism is reached. // Early in the main method is a good place to call this: tbb::task_scheduler_init init(n_desired_threads + 1);
The method of setting the number of threads relies on this call to task_scheduler_init
occurring before any other calls to the TBB functions. If this call is first, it will set the number of threads used by TBB for the remainder of the program. Notice that the call starts one less thread than the argument, since the main thread is counted. If this method does not seem successful (gdb can be used to see when and how many threads are started), ensure that the call is early enough in the program (for example right at the start of main()
), and check that there are no other libraries starting threads.
Note that a small number of places inside deal.II also uses thread-based parallelism controlled by MultithreadInfo::n_default_threads generally. Under some circumstances, deal.II also calls the BLAS library which may sometimes also start threads of its own. You will have to consult the documentation of your BLAS installation to determine how to set the number of threads for these operations.