jimmyinhmb

MaplePrimes Activity

These are replies submitted by jimmyinhmb

@acer After trying several of the alternatives, I settled on the compiled versions of the updating operations. Eliminating the Array copies and controlling how much of the Array is updated reduced execution time by a whopping 80%!
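
In case it helps anyone following the thread, the compiled updater is essentially the loop below (a sketch only -- the names are illustrative, and the exp-style update discussed further down this page stands in for the real one):

# Sketch of the compiled in-place updater; assumes G and H are float[8]
# Vectors and that only entries Mid..Last need to be touched.
UpdateSlice := proc(G::Vector(datatype = float[8]),
                    H::Vector(datatype = float[8]),
                    c::float[8], Mid::integer[4], Last::integer[4])
    local i::integer[4];
    for i from Mid to Last do
        G[i] := G[i] + exp(c + H[i]);    # update in place, no Array copies
    end do;
    return 0;                            # give the Compiler a concrete return type
end proc:

cUpdateSlice := Compiler:-Compile(UpdateSlice):
# cUpdateSlice(G, H, 0.5, Mid, Last):    # modifies G in place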

Score another one for Acer, with thanks.

Now I need to deal with a different memory issue: when I scale up the computation (millions of Vector updates), my memory allocation creeps up until alloc() fails. I have NULL operations at the end of all procs, and have trapped numeric events in case they are piling up, but the only one I see is "inexact", which seems reasonable given the floating-point essence of the work. So I have set the 'inexact' numeric event flag, and it seems to have no effect on the creeping memory use.

I recall reading in the help pages that even NULL is a return value, and so I suspect I'm losing a byte on an incomplete stack frame pop or something.... I remember reading a Mapleprimes thread (or two...) discussing slow growth of memory consumption. Gotta go find it, and maybe play with the profiling features.
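
In the meantime, the crude check I plan to run is just to watch the kernel's allocation counter around batches of updates (a sketch -- DoBatch is a hypothetical stand-in for one batch of work):

# Watch kernel memory growth around a batch of updates (sketch).
mem0 := kernelopts(bytesalloc):
# DoBatch():                       # hypothetical: one batch of Vector updates
gc():                              # collect first, so growth is not just uncollected garbage
mem1 := kernelopts(bytesalloc):
printf("allocation grew by %d bytes\n", mem1 - mem0);

# CodeTools:-Usage(DoBatch()) would report time and memory for a single call.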

 - jimmyinhmb

@acer: Thank you for the additional comments, and no need to apologize. I've just now installed Maple 16 (on Win 7, AMD 6-way, MKL); as long as I am investing this time, I may as well do it with current software. Was Maple 16 assumed in your first snippets of code?

While I chase that, I have a somewhat similar problem involving Vectors, but the operation here is not addition; it is taking the element-wise exponential of an element-wise sum. If G and H are float[8] Vectors containing 1..Last elements, and c is a float[8] scalar, then for a changing intermediate index Mid the operation would be

G[Mid..Last] := G[Mid..Last] + exp~( c +~ H[Mid..Last] ).
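
In isolation (with made-up sizes and values) the statement I am timing looks like this:

# Illustrative setup only -- sizes and values are invented.
Last := 10^6:  Mid := Last/2:
c := 0.5:
G := Vector(Last, i -> 1.0*i,   datatype = float[8]):
H := Vector(Last, i -> 0.001*i, datatype = float[8]):

G[Mid .. Last] := G[Mid .. Last] + exp~( c +~ H[Mid .. Last] ):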

It eventually dawned on me that this is a natural application for LinearAlgebra[Zip].  I tried

LinearAlgebra[Zip]( proc(a, b) options operator, arrow; a + exp(b + c) end proc,
                    G[Mid..Last], H[Mid..Last], inplace ):

with no luck -- got back floating point zeros, and it took 70% more time than the element-wise operator version.

I also tried to avoid range operators with an awkward combination of zero-Fill and Copy, Map2 with a filter to limit computational effort, and vector addition (using a temporary float[8] Vector C in addition to the scalar c mentioned earlier):

Offset := Mid - 1:
Length := Last - Offset:
ArrayTools[Fill](Offset, 0.0, C):
ArrayTools[Copy](Length, H, Offset, C, Offset):
LinearAlgebra[Map2][proc (ii) options operator, arrow; evalb(Mid < ii) end proc]
                   (proc (x, a) options operator, arrow; exp(x + a) end proc, c, C):
LinearAlgebra[VectorAdd] (G, C, inplace):

This took as long as the Zip experiment did, but it failed differently -- the filter seems to always have evaluated to true. I suppose these could be Maple bugs, but it is much more likely that my inexperience with these Maple functions is revealing itself.

So for the moment I am stuck with the element-wise implementation. I too was disappointed to find that MKL did not make any use of the multiple cores on the AMD processors during any of these experiments. Perhaps there are special versions of these libraries that I need to find. But in this case, I also believe that the exp() function is throttling parallelism -- I was surprised to find exp() and ln() aren't thread-safe, so hand-programming threads isn't going to help unless I write my own expansions for these functions. I suppose the functions use shared intermediate storage, but so would other expansion approximations for sin(), cos(), etc.

Any suggestions on this twist of the original question are equally welcome, along with insight on why such elementary functions would not be thread-safe.

With thanks and every good wish,

jimmyinhmb

@acer Thanks for the clarity of thought in your reply. If I can get the call to daxpy (or its equivalent in the Win 7 MKL) straight, there are some nice dividends in this particular application, because within its critical section

  • the matrix B is fixed (only how much of it gets added to A changes), and
  • those additions from B are the only changes made to A.

So establishing the temp matrices at the beginning of the critical section and the BlockCopy would be the only extra overhead.
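
The way I picture that goes roughly as follows (a sketch only -- the names and sizes are invented, and ordinary block assignment stands in for the BlockCopy step):

# Illustrative sizes and data; in the real code A and B already exist.
m := 1000:  n := 800:  k := 300:
A := Matrix(m, n, (i, j) -> 1.0*i + j, datatype = float[8]):
B := Matrix(m, n, (i, j) -> 0.001*i*j, datatype = float[8]):

# Scratch Matrix, created once before entering the critical section.
T := Matrix(m, n, datatype = float[8]):

# Inside the critical section, for the current block width k:
ArrayTools:-Fill(m*n, 0.0, T):                   # clear T's storage
T[1 .. m, 1 .. k] := B[1 .. m, 1 .. k]:          # stands in for the BlockCopy step
LinearAlgebra:-MatrixAdd(A, T, 1, 1, inplace):   # A := A + T, updated in place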

But I like the idea of a Maple 17 BlockAdd capability much better. :)

I am much obliged once again.

jimmyinhmb

Hi Acer,

That was a great reply, and fast -- thank you very much.

I should have reread the brain-dead part of that draft before I posted. Just as the programming guide says

In Maple, data is always passed by reference, but the immutability of most data types ensures that the procedure cannot modify the caller's copy of the data. The exceptions are Maple's mutable data structures: tables, Arrays, Matrices, Vectors, records, and objects. Modifying these within a procedure will modify the caller's copy. Fortunately, these larger data structures are the ones that you would most often want to pass by reference, since copying such data consumes time and space.
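
(A minimal illustration of the mutable case the guide describes:)

# A proc really does modify the caller's Vector, since Vectors are mutable.
bump := proc(v::Vector(datatype = float[8]))
    v[1] := v[1] + 1.0;      # writes into the caller's storage
    return NULL;
end proc:

V := Vector([1.0, 2.0], datatype = float[8]):
bump(V):
V[1];                        # 2.0 -- the caller's copy was changed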

My bad, good of you to correct an RTFM issue so graciously.

I'm working in Windows 7 with an AMD Phenom II X6 1100T running at 3.3 GHz, and the MKL seems to be doing a very good job of parceling out the work. The task breakdown I have in mind is coarser than matrix multiply -- the tasks would share the constant matrix but very little else. But if only one Maple thread can use the dgemm routines at a time, it sounds like this might only be useful in a grid computing situation, and of course each grid computing element would have its own version of the matrix. That's down the road.

Your explanation of how the GPU memory objects are handled makes it clear that I shouldn't be thinking about CUDA in Maple for this problem right now. Maybe when GPU-resident objects are readdressable -- actually, maybe when Maple / BLAS do it under the covers. For now I am happy to let Windows MKL keep the hardware busy -- especially if not running Maple threads resolves the mutex problem I was worried about.

Hmm, I guess one could always unbind and redefine a variable that was originally created as read-only, so the readonly=true option in Matrix creation has no effect at the cache level, e.g., letting each processor in a multiprocessor system know that its locally cached copy of a shared variable will never be out of date. Does it improve Maple's efficiency in other ways, or is it primarily an assert capability that catches an erroneous write for Maple? Inquiring minds want to know, but I will use it anyway.
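
(For concreteness, the write-protection I mean is just this behaviour -- illustrative only:)

M := Matrix(2, 2, [[1.0, 2.0], [3.0, 4.0]], datatype = float[8], readonly = true):
M[1, 1] := 9.9;    # raises an error, since the Matrix was created read-only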

If the full C <== alpha*A*B + C functionality can be exposed, I think I can make inplace work pretty easily; but without the "+ C" part of the operation it's more contrived. Please count me as being interested in how you exploit that additional dgemm capability, and hoping to study your Win 7 example.
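
Until then, the two-step fallback I have in mind goes through a temporary product (a sketch -- alpha and the Matrices here are only illustrative):

# Illustrative data; in the real code A, B and C already exist.
A := Matrix(3, 4, (i, j) -> 1.0*i*j, datatype = float[8]):
B := Matrix(4, 5, (i, j) -> 1.0/(i + j), datatype = float[8]):
C := Matrix(3, 5, fill = 1.0, datatype = float[8]):
alpha := 2.5:

T := LinearAlgebra:-MatrixMatrixMultiply(A, B):     # T := A . B, a brand-new Matrix
LinearAlgebra:-MatrixAdd(C, T, 1, alpha, inplace):  # C := C + alpha*T, done in place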

I'll implement it as a module-scoped Matrix tomorrow, and save your remaining speedup advice to chew on a bit later. For now, a big THANKS for sharing your deep and helpful insight.

- Jimmy

@acer 

Very informative post as usual, Acer. I'll try to digest this on Thursday or Friday, have to get ready for an unrelated meeting tomorrow.

I will certainly let you know where these twists and turns lead.

- Jimmy

@Axel Vogt 

Axel, I have the Kahan summation formula as an appendix to Goldberg, D., "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, March 1991. The bounds analysis is nice, but it concludes that double-precision arithmetic has much better bounds. I'm not surprised to hear that it's outdated, but I haven't spent time in the fp world for the last two decades. Can you point me to what you think is an improved summation formula that is worthy of implementation?
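
For reference, the compensated summation in that appendix is the familiar Kahan loop; in plain Maple it amounts to the sketch below (for hardware doubles one would push it through evalhf or the Compiler):

# Kahan compensated summation (sketch).
KahanSum := proc(x::Vector(datatype = float[8]))
    local s, comp, y, t, i;
    s := 0.0;                 # running sum
    comp := 0.0;              # running compensation for lost low-order bits
    for i from 1 to LinearAlgebra:-Dimension(x) do
        y := x[i] - comp;     # corrected next term
        t := s + y;           # low-order digits of y can be lost here
        comp := (t - s) - y;  # recover what was lost ...
        s := t;               # ... and fold it into the next step
    end do;
    return s;
end proc:

v := Vector([1.0, 1.0e16, 1.0, -1.0e16], datatype = float[8]):
KahanSum(v);                  # compare with add(v[i], i = 1 .. 4)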

Thank you again,

 - Jimmy

The uploaded file example_for_sort_and.mw contains a little proc that I hope will illustrate most of the questions, although the attempt to reuse each Array with the "for entry in bigarray do bigarray[lhs(entry)] := ... end do:" bit is out. Please let me know if you have any trouble downloading it.

Axel, I know you hate 2D math, so I consider it "grr'd" already.

Thank you for the interest, looking forward to your thoughts.

 - Jimmy

Thanks for the prompt and thorough explanations, gents. Now that FIFA is resting for a couple of days, I'll get back to work.

 - Jimmy
