I need a bit of advice from you, and I hope it won't take a lot of your time.
So here is my question: I have a small dense square matrix, with possible sizes 4x4, 8x8, or 16x16, and I want to invert it using CUDA.
The special part of the question is that I have 1024 idle CUDA threads available for this task. I suspect that the most widespread inversion methods, like Gauss-Jordan, won't work well here, because they offer only limited parallelism and will use only about 4-16 threads out of the 1024 available.
But how else can I invert these matrices using all available threads?
Thank you for your attention!
There are at least two possible ready-made options for this sort of problem:
[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]
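
As a side note on the "only 4-16 threads" concern in the question: each Gauss-Jordan elimination step updates every entry of the augmented matrix [A | I] independently, so one thread per entry can be kept busy, that is 2N² threads (32, 128, or 512 for N = 4, 8, 16) rather than N. Below is a minimal sketch of that idea, not a tuned implementation. The names `gauss_jordan_inverse`, `invert_on_gpu`, and `MAX_N` are hypothetical, the matrix is assumed to be row-major single precision, and a single thread block with shared memory is assumed. Whether this beats a batched library routine for such small sizes is something you would have to measure.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>

constexpr int MAX_N = 16;  // assumption: largest size mentioned in the question

// One thread per entry of the augmented matrix [A | I] (n rows, 2n columns).
// Launch with a single block of dim3(2 * n, n) threads.
__global__ void gauss_jordan_inverse(const float* a, float* inv, int n, int* singular)
{
    __shared__ float aug[MAX_N][2 * MAX_N];
    __shared__ int   pivot_row;
    __shared__ float pivot_val;

    const int row = threadIdx.y;
    const int col = threadIdx.x;

    // Left half is A, right half is the identity.
    aug[row][col] = (col < n) ? a[row * n + col] : ((col - n == row) ? 1.0f : 0.0f);
    __syncthreads();

    for (int k = 0; k < n; ++k) {
        // Partial pivoting: one thread picks the largest entry in column k.
        if (row == 0 && col == 0) {
            int best = k;
            for (int r = k + 1; r < n; ++r)
                if (fabsf(aug[r][k]) > fabsf(aug[best][k])) best = r;
            pivot_row = best;
            pivot_val = aug[best][k];
        }
        __syncthreads();

        const int   p  = pivot_row;
        const float pv = pivot_val;
        if (pv == 0.0f) {  // exact-zero test only; the whole block exits together
            if (row == 0 && col == 0) *singular = 1;
            return;
        }

        // Threads of row k swap it with the pivot row.
        if (row == k && p != k) {
            const float tmp = aug[k][col];
            aug[k][col] = aug[p][col];
            aug[p][col] = tmp;
        }
        __syncthreads();

        // Read everything this step needs before any thread overwrites it.
        const float pivot_elem = aug[k][col] / pv;  // normalized pivot-row entry
        const float factor     = aug[row][k];       // multiplier for this row
        __syncthreads();

        aug[row][col] = (row == k) ? pivot_elem : aug[row][col] - factor * pivot_elem;
        __syncthreads();
    }

    // The right half now holds the inverse.
    if (col >= n) inv[row * n + (col - n)] = aug[row][col];
}

// Host helper: returns false for unsupported sizes, CUDA errors, or a zero pivot.
bool invert_on_gpu(const float* h_a, float* h_inv, int n)
{
    assert(h_a != nullptr && h_inv != nullptr);
    if (n < 1 || n > MAX_N) return false;

    const size_t bytes = sizeof(float) * n * n;
    float* d_a = nullptr;
    float* d_inv = nullptr;
    int*   d_singular = nullptr;
    int    singular = 1;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_inv, bytes) == cudaSuccess
           && cudaMalloc(&d_singular, sizeof(int)) == cudaSuccess
           && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(d_singular, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        gauss_jordan_inverse<<<1, dim3(2 * n, n)>>>(d_a, d_inv, n, d_singular);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&singular, d_singular, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
          && singular == 0
          && cudaMemcpy(h_inv, d_inv, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_inv);
    cudaFree(d_singular);
    return ok;
}
```

Calling `invert_on_gpu(a, inv, 16)` with two host arrays of 256 floats launches 512 threads. Even for 16x16 this still leaves half of the 1024 threads idle, and the repeated `__syncthreads()` calls may dominate the runtime, so if you actually have many small matrices, processing several of them per block is probably the better way to fill the device.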