Question: How do I efficiently share (and multiply with) a large constant matrix across procs and tasks?

Hello Maple wizards,

I have two questions for you today.

First, a program I'm developing in Maple 15 does frequent matrix multiplication with a constant float[8] Matrix. I hope to take advantage of the multiple processors in my 6-way desktop machine and/or the CUDA features of the Nvidia GPU card. The program is large enough that maintainability and good programming practice dictate that it be broken down into multiple procs. In addition, I'm considering an implementation using the task model for scalability. Either way, I would like to save space by sharing a single copy of the matrix across all the procs/tasks that use it, but the correct scope and datatype to achieve this are eluding me:

  • If I create a Matrix that is global (or perhaps local to an outer proc, with the computation in nested procs), I believe that because Matrix is a mutable datatype, its entries will be locked when accessed in parallel. I don't suppose I avoid the locking by specifying readonly=true? That is probably wishful thinking. (There's a sketch of what I mean just after this list.)
  • If I create a Matrix and pass it to the inner procs, they will be creating their own copies on each call, which wastes a lot of space.
  • If I represent the matrix as an immutable list of lists, shared access won't be locked, but I can't call LinearAlgebra[MatrixMatrixMultiply] directly because the constant argument is not a Matrix. And if I convert it to a Matrix inside each proc, I'm back to creating a local copy and losing the space I set out to save.
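
To show what I mean, here is a stripped-down sketch that combines the first option above with the task-model structure I'm considering (Outer, MultiplyBlock, the matrix sizes, and the block threshold are all placeholders, not my real code); what I can't tell is whether the readonly storage actually spares the parallel reads from being locked:

    # Option 1 sketch: one constant Matrix local to an outer proc,
    # used by a nested task proc through lexical scoping.
    Outer := proc(V::Matrix)
        local K, MultiplyBlock;

        # Built once per call to Outer; readonly in the hope that
        # parallel reads are not locked.
        K := Matrix(500, 500, (i, j) -> evalf(1/(i + j)),
                    datatype = float[8], readonly = true);

        # The nested proc sees K (and V) lexically -- no copy is passed to it.
        MultiplyBlock := proc(lo::posint, hi::posint)
            if hi - lo < 100 then
                # Small enough: multiply this block of columns directly.
                LinearAlgebra:-MatrixMatrixMultiply(K, V[1 .. -1, lo .. hi]);
            else
                # Otherwise split the column range into two child tasks
                # and rejoin the halves side by side.
                Threads:-Task:-Continue((a, b) -> <a | b>,
                    Task = [MultiplyBlock, lo, iquo(lo + hi, 2)],
                    Task = [MultiplyBlock, iquo(lo + hi, 2) + 1, hi]);
            end if;
        end proc;

        Threads:-Task:-Start(MultiplyBlock, 1, LinearAlgebra:-ColumnDimension(V));
    end proc: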

For development and testing with small matrices, I am getting by with creating a local copy for each proc call. At some point, however, these matrices could get large enough that the extra copies cause what seem like unnecessary cache misses and, eventually, TLB faults. A lot of thought went into the design of Maple, so there must be a simple approach to sharing a read-only matrix that I am missing. Can someone please hit this stumped chump over the head with it?
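
For reference, that stopgap looks roughly like this (ApplyK and Kdata are placeholder names, not my real code), with every call paying for its own conversion and copy:

    # Stopgap: the constant data lives in an immutable list of lists (Kdata),
    # and each proc call converts it to a fresh local float[8] Matrix.
    ApplyK := proc(V::Matrix, Kdata::listlist)
        local K;
        K := Matrix(Kdata, datatype = float[8]);   # new copy on every call
        LinearAlgebra:-MatrixMatrixMultiply(K, V)
    end proc: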

Second, I am curious about a multiprocessor/GPU tradeoff when executing in Maple 15. When I run the program as a single thread (no task-model programming) without CUDA, I get a nice speedup from Maple 15 spreading the computation across all the processors. Enabling CUDA seems to throttle Maple 15's ability to use more than one processor: total CPU time consumed is lower, presumably because the GPU is doing the multiplies, but elapsed time is higher, and CPU utilization on all but one processor is flatlined. Are Maple 15's multiprocessing and GPU speedups mutually exclusive? The CUDA documentation doesn't mention such a tradeoff.
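
In case it helps anyone reproduce or refute this, here is the kind of standalone timing comparison I have in mind (2000 x 2000 is an arbitrary size, and this is a sketch rather than my actual program):

    # Two float[8] matrices big enough for the multiply to matter.
    A := LinearAlgebra:-RandomMatrix(2000, 2000, outputoptions = [datatype = float[8]]):
    B := LinearAlgebra:-RandomMatrix(2000, 2000, outputoptions = [datatype = float[8]]):

    # CPU-only run: the float[8] multiply is spread across the cores.
    CUDA:-Enable(false):
    c0, r0 := time(), time[real]():
    C1 := A . B:
    printf("CPU only:  cpu = %.2f s, real = %.2f s\n", time() - c0, time[real]() - r0):

    # CUDA run: on my machine, total cpu time drops but elapsed time goes up.
    CUDA:-Enable(true):
    c0, r0 := time(), time[real]():
    C2 := A . B:
    printf("With CUDA: cpu = %.2f s, real = %.2f s\n", time() - c0, time[real]() - r0):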

Perhaps the cost of transferring data between the GPU and RAM is so high that there isn't enough computation ready to drive up utilization of more than one CPU. Cache-miss counts would be suggestive, but a tool for directly observing GPU utilization would help even more.
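
About the only GPU-side visibility I have found from within Maple itself is the CUDA package's device query, so at the moment I'm guessing (unless I've missed something):

    # Confirm CUDA is active and look at the device Maple found.
    CUDA:-IsEnabled();
    CUDA:-Properties();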

I haven't seen many CUDA posts here lately, and would love to hear from anyone using multi-CPU and GPU computation at the same time: did you run into this problem, and if so, how did you overcome it?

Thanks, wizards. I'm looking forward to your insights.

 - Jimmy
