Setting up the system matrix is usually done by looping over all cells and computing the contributions of each cell separately. While the computation of the local contributions is strictly independent between cells, these contributions have to be transferred to the global matrix afterward, and this transfer has to be synchronized to prevent one thread from overwriting values that another thread has just written.
In most cases, building the system matrix in parallel will look like the following template:
void MainClass::build_matrix ()
{
  // define how many threads will be used (here: 4)
  const unsigned int n_threads = 4;
  const unsigned int n_cells_per_thread
    = triangulation.n_active_cells () / n_threads;

  // define the Mutex that will be used to synchronise
  // accesses to the matrix
  ACE_Thread_Mutex mutex;

  // define thread manager
  ACE_Thread_Manager thread_manager;

  vector<DoFHandler<dim>::active_cell_iterator> first_cells (n_threads),
                                                end_cells (n_threads);
  DoFHandler<dim>::active_cell_iterator present_cell = dof_handler.begin_active ();

  for (unsigned int thread=0; thread<n_threads; ++thread)
    {
      // for each thread: first determine the range of cells on
      // which it shall operate
      first_cells[thread] = present_cell;

      end_cells[thread] = first_cells[thread];
      if (thread != n_threads-1)
        for (unsigned int i=0; i<n_cells_per_thread; ++i)
          ++end_cells[thread];
      else
        end_cells[thread] = dof_handler.end();

      // now start a new thread that builds the contributions of
      // the cells in the given range
      Threads::spawn (thread_manager,
                      Threads::encapsulate (&MainClass::build_matrix_threaded)
                        .collect_args (this,
                                       first_cells[thread],
                                       end_cells[thread],
                                       mutex));

      // set start iterator for next thread
      present_cell = end_cells[thread];
    };

  // wait for the threads to finish
  thread_manager.wait ();
};


void MainClass::build_matrix_threaded (const DoFHandler<dim>::active_cell_iterator &first_cell,
                                       const DoFHandler<dim>::active_cell_iterator &end_cell,
                                       ACE_Thread_Mutex                            &mutex)
{
  FullMatrix<double>   cell_matrix;
  vector<unsigned int> local_dof_indices;

  DoFHandler<dim>::active_cell_iterator cell;
  for (cell=first_cell; cell!=end_cell; ++cell)
    {
      // compute the elements of the cell matrix
      ...

      // get the indices of the DoFs of this cell
      cell->get_dof_indices (local_dof_indices);

      // now transfer the local matrix into the global one;
      // synchronise this with the other threads
      mutex.acquire ();
      for (unsigned int i=0; i<dofs_per_cell; ++i)
        for (unsigned int j=0; j<dofs_per_cell; ++j)
          global_matrix.add (local_dof_indices[i],
                             local_dof_indices[j],
                             cell_matrix(i,j));
      mutex.release ();
    };
};
Note that since the build_matrix_threaded function takes its arguments as references, we have to make sure that the variables to which these references refer live at least as long as the spawned threads. It is thus not possible to use the same variables for the start and end iterators of all threads, as the following example does:
....
DoFHandler<dim>::active_cell_iterator first_cell = dof_handler.begin_active ();

for (unsigned int thread=0; thread<n_threads; ++thread)
  {
    // for each thread: first determine the range of cells on
    // which it shall operate
    DoFHandler<dim>::active_cell_iterator end_cell = first_cell;
    if (thread != n_threads-1)
      for (unsigned int i=0; i<n_cells_per_thread; ++i)
        ++end_cell;
    else
      end_cell = dof_handler.end();

    // now start a new thread that builds the contributions of
    // the cells in the given range
    Threads::spawn (thread_manager,
                    Threads::encapsulate (&MainClass::build_matrix_threaded)
                      .collect_args (this, first_cell, end_cell, mutex));

    // set start iterator for next thread
    first_cell = end_cell;
  };
....
Since splitting a range of iterators (for example the range begin_active() to end()) is a very common task when setting up threads, there is a function
template <typename ForwardIterator>
vector<pair<ForwardIterator,ForwardIterator> >
split_range (const ForwardIterator &begin,
             const ForwardIterator &end,
             const unsigned int     n_intervals);

in the Threads namespace that splits the range [begin,end) into n_intervals subintervals of approximately the same size.
Using this function, the thread creation function can now be written as follows:
void MainClass::build_matrix ()
{
  const unsigned int n_threads = 4;
  ACE_Thread_Mutex   mutex;
  ACE_Thread_Manager thread_manager;

  // define starting and end point for each thread
  typedef DoFHandler<dim>::active_cell_iterator active_cell_iterator;
  vector<pair<active_cell_iterator,active_cell_iterator> >
    thread_ranges = split_range<active_cell_iterator> (dof_handler.begin_active (),
                                                       dof_handler.end (),
                                                       n_threads);

  for (unsigned int thread=0; thread<n_threads; ++thread)
    spawn (thread_manager,
           encapsulate (&MainClass::build_matrix_threaded)
             .collect_args (this,
                            thread_ranges[thread].first,
                            thread_ranges[thread].second,
                            mutex));

  thread_manager.wait ();
};

We have omitted the Threads:: prefix here to make things more readable. Note that we had to specify the iterator type active_cell_iterator explicitly when calling split_range, since the two iterators passed to it have different types: dof_handler.end() is of type DoFHandler<dim>::raw_cell_iterator, which can be converted to DoFHandler<dim>::active_cell_iterator, but C++ requires that the template argument either be given explicitly or be deducible uniquely from the function arguments.
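For illustration only, the splitting itself could be sketched along the following lines. This is not the implementation used by the library, and the name my_split_range is made up here; the sketch merely shows one way to distribute a range of forward iterators onto subintervals of approximately equal size:

#include <iterator>
#include <utility>
#include <vector>

// Sketch of splitting a range [begin,end) into n_intervals subranges of
// approximately equal size. Not the library's actual implementation.
template <typename ForwardIterator>
std::vector<std::pair<ForwardIterator,ForwardIterator> >
my_split_range (const ForwardIterator &begin,
                const ForwardIterator &end,
                const unsigned int     n_intervals)
{
  const unsigned int n_elements
    = static_cast<unsigned int>(std::distance (begin, end));

  std::vector<std::pair<ForwardIterator,ForwardIterator> > ranges (n_intervals);
  ForwardIterator present = begin;
  for (unsigned int i=0; i<n_intervals; ++i)
    {
      ranges[i].first = present;
      // distribute the remainder over the first
      // (n_elements % n_intervals) intervals
      const unsigned int length = n_elements / n_intervals
                                  + (i < n_elements % n_intervals ? 1 : 0);
      std::advance (present, length);
      ranges[i].second = present;
    }
  return ranges;
}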
A word of caution is in order here: since the system matrix in finite element computations is usually ill-conditioned, small changes in a data vector or in the matrix can lead to significant changes in the output. Unfortunately, when using multiple threads, the order in which contributions to an element of the matrix or of a vector are accumulated cannot be predicted, so round-off comes into play. For example, in a real-world program the following contributions to one element of a right hand side vector were computed on four cells: -3.255208333333328815, -3.255208333333333694, -3.255208333333333694, and -3.255208333333331526. Due to round-off, the sum of these numbers depends on the order in which they are added, so the resulting vector element differed depending on the number of threads used, the number of other programs running on the computer, and other sources of randomness. In subsequent runs of exactly the same program, the sum was either -13.02083333333332827 or -13.02083333333332610. Although the difference is still only of the order of round-off error, it caused a change in the fourth digit of a derived, very ill-conditioned quantity after the matrix had been inverted several times (this accuracy was not really needed for that quantity, but it showed up in the output and also led to different grid refinement because of comparisons with other values of almost the same size). Tracking down the source of such problems is extremely difficult and frustrating, since they occur non-deterministically between runs of the same program, and it can take several days until the actual cause is found.
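The effect itself is easy to reproduce in isolation: floating point addition is not associative, so summing the same values in different orders can legitimately yield results that differ in the last bits. The following standalone snippet (not part of the library) sums the four contributions quoted above in two different orders; depending on platform and rounding, the two printed values may or may not agree bit for bit:

#include <cstdio>

int main ()
{
  // the four per-cell contributions quoted above
  const double c[4] = { -3.255208333333328815,
                        -3.255208333333333694,
                        -3.255208333333333694,
                        -3.255208333333331526 };

  // sum them front-to-back and back-to-front; since floating point
  // addition is not associative, the two results need not be identical
  const double forward  = ((c[0] + c[1]) + c[2]) + c[3];
  const double backward = ((c[3] + c[2]) + c[1]) + c[0];

  std::printf ("%.20g\n%.20g\n", forward, backward);
  return 0;
}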
One possible work-around is to reduce the accuracy of the summands so that the value of the sum becomes independent of their order. A rather crude method is to convert each summand to data type float and back; the update loop from above would then look as follows:
for (unsigned int i=0; i<dofs_per_cell; ++i)
  for (unsigned int j=0; j<dofs_per_cell; ++j)
    global_matrix.add (local_dof_indices[i],
                       local_dof_indices[j],
                       static_cast<float>(cell_matrix(i,j)));

Note that the cast back to double is performed implicitly here. Whether a reduction in accuracy of the order shown here is tolerable is problem dependent. There are methods that lose less accuracy than the one shown above.
The other, computationally cheaper possibility would be to reduce the accuracy of the resulting sum, in the hope that the accumulated round-off error is thereby eliminated. However, this is unsafe, since the order dependence remains and may even be amplified if the value of the sum lies near a boundary at which values are rounded up or down when the accuracy is reduced. Furthermore, problems arise if the summands have different signs and the result of the summation consists mostly of round-off error.
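To illustrate this last point with invented numbers: if two large contributions of opposite sign nearly cancel, the computed sum is dominated by round-off and depends strongly on the order of summation, and casting the final result to float afterwards does nothing to remove that order dependence. A minimal standalone sketch (the values are made up for illustration and not taken from an actual computation):

#include <cstdio>

int main ()
{
  // two large contributions of opposite sign and a small one;
  // the values are invented for illustration only
  const double a =  1.0e+16;
  const double b = -1.0e+16;
  const double c =  1.0;

  // the exact sum is 1.0, but in IEEE-754 double precision the result
  // depends on the order of summation: adding c to a first loses it
  const double order1 = (a + c) + b;   // 0.0 in double precision
  const double order2 = (a + b) + c;   // 1.0 in double precision

  // reducing the accuracy of the final result does not remove the
  // order dependence; the two values remain different after the cast
  std::printf ("%g %g\n",
               static_cast<double>(static_cast<float>(order1)),
               static_cast<double>(static_cast<float>(order2)));
  return 0;
}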