Small matrix inversion on CUDA

I need a bit of advice from you, and I hope it won't take a lot of your time.

So here is my question: I have a small, dense, square matrix, with possible sizes 4x4, 8x8, or 16x16, and I want to invert it using CUDA.

The special part of the question is that I have 1024 idle CUDA threads available for this task. I suspect that the most widespread inversion methods, such as Gauss-Jordan, won't work well here, because they expose only limited parallelism and would use only about 4 to 16 of the 1024 threads.

But how else can I invert these matrices using all available threads?

Thank you for your attention!

There are at least two possible ready-made options for this sort of problem:

  1. Use the batched solvers shipping in recent versions of the cuBLAS library.
  2. Use the BSD-licensed Gauss-Jordan elimination device code functions which NVIDIA distributes to registered developers. These were intended to invert small matrices using one thread per matrix.
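
To give a rough idea of what the "one thread per matrix" approach in option 2 looks like, here is a minimal sketch of Gauss-Jordan inversion with partial pivoting. It is not NVIDIA's code: the names `invert_small`, `invert_batch`, and `HD`, the row-major layout, and the `1e-12f` pivot threshold are all assumptions made for illustration. The function is written so that it compiles as plain C++, and the macro should make it callable from a kernel when built with nvcc.

```cpp
// Hypothetical helper macro: empty in plain C++, host/device qualifiers under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Invert one N x N row-major matrix in place (Gauss-Jordan, partial pivoting).
// Returns false if the matrix looks singular. Names and threshold are illustrative.
template <int N>
HD bool invert_small(float* a)
{
    float inv[N * N];  // starts as the identity, ends up holding the inverse
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            inv[i * N + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < N; ++col) {
        // Pick the row with the largest absolute value in this column.
        int piv = col;
        float best = a[col * N + col];
        if (best < 0.0f) best = -best;
        for (int r = col + 1; r < N; ++r) {
            float v = a[r * N + col];
            if (v < 0.0f) v = -v;
            if (v > best) { best = v; piv = r; }
        }
        if (best < 1e-12f) return false;  // assumed singularity threshold

        // Swap the pivot row into place in both matrices.
        if (piv != col) {
            for (int j = 0; j < N; ++j) {
                float t = a[col * N + j];
                a[col * N + j] = a[piv * N + j];
                a[piv * N + j] = t;
                t = inv[col * N + j];
                inv[col * N + j] = inv[piv * N + j];
                inv[piv * N + j] = t;
            }
        }

        // Scale the pivot row so that the pivot element becomes 1.
        float s = 1.0f / a[col * N + col];
        for (int j = 0; j < N; ++j) {
            a[col * N + j] *= s;
            inv[col * N + j] *= s;
        }

        // Eliminate this column from every other row.
        for (int r = 0; r < N; ++r) {
            if (r == col) continue;
            float f = a[r * N + col];
            for (int j = 0; j < N; ++j) {
                a[r * N + j] -= f * a[col * N + j];
                inv[r * N + j] -= f * inv[col * N + j];
            }
        }
    }

    for (int i = 0; i < N * N; ++i) a[i] = inv[i];
    return true;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread inverts one matrix of a contiguous batch.
template <int N>
__global__ void invert_batch(float* mats, int count, int* ok)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) ok[i] = invert_small<N>(mats + i * N * N) ? 1 : 0;
}
#endif
```

Note that this only keeps many threads busy when there are many matrices to invert; for a single 16x16 matrix it uses one thread, which is the limitation raised in the question.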

[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]