cuBLAS :: CUDA Toolkit Documentation

2.7.7. cublas<t>syr2k()

cublasStatus_t cublasSsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const float           *alpha,
                            const float           *A, int lda,
                            const float           *B, int ldb,
                            const float           *beta,
                            float           *C, int ldc)
cublasStatus_t cublasDsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const double          *alpha,
                            const double          *A, int lda,
                            const double          *B, int ldb,
                            const double          *beta,
                            double          *C, int ldc)
cublasStatus_t cublasCsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, int ldc)

This function performs the symmetric rank-$2k$ update

$C = α \left( \mathrm{op}(A)\, \mathrm{op}(B)^{T} + \mathrm{op}(B)\, \mathrm{op}(A)^{T} \right) + β C$

where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $\mathrm{op}(A)$ and $\mathrm{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\left( \mathrm{op}(A), \mathrm{op}(B) \right) = \begin{cases} (A, B) & \text{if trans == CUBLAS\_OP\_N} \\ (A^{T}, B^{T}) & \text{if trans == CUBLAS\_OP\_T} \end{cases}$
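
As an illustration of the update above, the following is a minimal host-side sketch in plain C, not a cuBLAS call. It assumes column-major storage, lower fill mode and `trans == CUBLAS_OP_N`; the function name `ref_ssyr2k_lower_n` and the `IDX` macro are hypothetical helpers introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/*
 * Hypothetical CPU reference (not part of cuBLAS) for the lower-mode,
 * non-transposed case: C = alpha * (A * B^T + B * A^T) + beta * C.
 * A and B are n x k, C is n x n; only the lower triangle of C is touched.
 */
void ref_ssyr2k_lower_n(int n, int k, float alpha,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    assert(n >= 0 && k >= 0);              /* mirrors the n,k < 0 error case */
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {      /* lower triangle: i >= j */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[IDX(i, p, lda)] * B[IDX(j, p, ldb)]
                     + B[IDX(i, p, ldb)] * A[IDX(j, p, lda)];
            }
            /* When beta == 0 the previous contents of C are never read. */
            float old = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + old;
        }
    }
}
```

Elements above the diagonal of `C` are left untouched, which matches the statement that the other symmetric part is not referenced.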

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimensions `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyr2k, dsyr2k, csyr2k, zsyr2k

2.7.8. cublas<t>syrkx()


cublasStatus_t cublasSsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const float           *alpha,
                            const float           *A, int lda,
                            const float           *B, int ldb,
                            const float           *beta,
                            float           *C, int ldc)
cublasStatus_t cublasDsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const double          *alpha,
                            const double          *A, int lda,
                            const double          *B, int ldb,
                            const double          *beta,
                            double          *C, int ldc)
cublasStatus_t cublasCsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, int ldc)

This function performs a variation of the symmetric rank-$k$ update

$C = α\, \mathrm{op}(A)\, \mathrm{op}(B)^{T} + β C$

$\left( \mathrm{op}(A), \mathrm{op}(B) \right) = \begin{cases} (A, B) & \text{if trans == CUBLAS\_OP\_N} \\ (A^{T}, B^{T}) & \text{if trans == CUBLAS\_OP\_T} \end{cases}$

This routine can be used when B is such that the result is guaranteed to be symmetric. A common example is when the matrix B is a scaled form of the matrix A: this is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublas<t>dgmm.
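
To make the scaled-matrix case concrete, here is a small host-side sketch in plain C. It is not cuBLAS code: `ref_scale_columns`, `ref_abt_entry` and `ref_ssyrkx_lower_n` are hypothetical names, and the sketch assumes column-major storage, lower fill mode and `trans == CUBLAS_OP_N`. With B = A·diag(d), entry (i, j) of A·Bᵀ is Σₚ d[p]·A(i,p)·A(j,p), which is unchanged when i and j are swapped, so storing only one triangle loses nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* B = A * diag(d): scale column p of the n x k matrix A by d[p]. */
void ref_scale_columns(int n, int k, const float *A, int lda,
                       const float *d, float *B, int ldb)
{
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < n; ++i)
            B[IDX(i, p, ldb)] = A[IDX(i, p, lda)] * d[p];
}

/* Entry (i, j) of the full product A * B^T, with A and B both n x k. */
float ref_abt_entry(int i, int j, int k,
                    const float *A, int lda, const float *B, int ldb)
{
    float acc = 0.0f;
    for (int p = 0; p < k; ++p)
        acc += A[IDX(i, p, lda)] * B[IDX(j, p, ldb)];
    return acc;
}

/* Hypothetical CPU reference: C = alpha * A * B^T + beta * C, lower triangle only. */
void ref_ssyrkx_lower_n(int n, int k, float alpha,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            float old = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * ref_abt_entry(i, j, k, A, lda, B, ldb) + old;
        }
    }
}
```

If B is not of this form, A·Bᵀ is generally not symmetric and the unstored triangle would differ from the stored one.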

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimensions `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

2.7.9. cublas<t>trmm()

cublasStatus_t cublasStrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           float                 *C, int ldc)
cublasStatus_t cublasDtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           double                *C, int ldc)
cublasStatus_t cublasCtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           cuComplex             *C, int ldc)
cublasStatus_t cublasZtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           cuDoubleComplex       *C, int ldc)

This function performs the triangular matrix-matrix multiplication

$C = \begin{cases} α\, \mathrm{op}(A)\, B & \text{if side == CUBLAS\_SIDE\_LEFT} \\ α\, B\, \mathrm{op}(A) & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $B$ and $C$ are $m \times n$ matrices, and $α$ is a scalar. Also, for matrix $A$

$\mathrm{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

Note that, in order to achieve better parallelism, this is the only routine for which the cuBLAS API differs from the BLAS API. The BLAS API assumes an in-place implementation (with results written back to B), while the cuBLAS API assumes an out-of-place implementation (with results written into C). The application can obtain the in-place functionality of BLAS in the cuBLAS API by passing the address of the matrix B in place of the matrix C. No other overlapping in the input parameters is supported.
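
The out-of-place and in-place calling patterns can be illustrated with a host-side sketch in plain C. This is not how cuBLAS implements the routine: `ref_strmm_left_lower_n` is a hypothetical name, and the sketch covers only `side == CUBLAS_SIDE_LEFT`, lower fill mode and no transpose, with column-major storage. In this particular case, walking the rows from bottom to top means each output element only overwrites an input element that is no longer needed, so passing the same pointer (and leading dimension) for `B` and `C` gives the BLAS-style in-place result.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/*
 * Hypothetical CPU reference: C = alpha * A * B, with A an m x m lower
 * triangular matrix and B, C of size m x n. If unit_diag is nonzero the
 * diagonal of A is taken to be 1 and is not accessed.
 */
void ref_strmm_left_lower_n(int unit_diag, int m, int n, float alpha,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        /* Bottom-up: row i needs B(p, j) only for p <= i, which are still intact. */
        for (int i = m - 1; i >= 0; --i) {
            float acc = unit_diag ? B[IDX(i, j, ldb)]
                                  : A[IDX(i, i, lda)] * B[IDX(i, j, ldb)];
            for (int p = 0; p < i; ++p)
                acc += A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            C[IDX(i, j, ldc)] = alpha * acc;
        }
    }
}
```

Calling the sketch with `C` pointing at `B` reproduces the in-place behaviour; the strictly upper part of `A` is never read in either mode.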

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose, transpose, or conjugate transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication. If `alpha==0`, then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
C	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strmm, dtrmm, ctrmm, ztrmm

2.7.10. cublas<t>trsm()

cublasStatus_t cublasStrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           float           *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           double          *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           cuComplex       *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *B, int ldb)

This function solves the triangular linear system with multiple right-hand-sides

$\begin{cases} \mathrm{op}(A)\, X = α B & \text{if side == CUBLAS\_SIDE\_LEFT} \\ X\, \mathrm{op}(A) = α B & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X$ and $B$ are $m \times n$ matrices, and $α$ is a scalar. Also, for matrix $A$

$\mathrm{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

The solution $X$ overwrites the right-hand-sides $B$ on exit.

No test for singularity or near-singularity is included in this function.
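
For intuition, the left-sided, lower-mode, non-transposed case reduces to forward substitution on each column of B. The sketch below is plain host C with a hypothetical name (`ref_strsm_left_lower_n`); it assumes column-major storage and is not the algorithm cuBLAS uses on the GPU. Like the routine described here, it performs no singularity check, so a zero on the diagonal simply produces infinities or NaNs.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/*
 * Hypothetical CPU reference: solve A * X = alpha * B for X, where A is an
 * m x m lower triangular matrix and B is m x n. X overwrites B on exit.
 * If unit_diag is nonzero the diagonal of A is taken to be 1 and not accessed.
 */
void ref_strsm_left_lower_n(int unit_diag, int m, int n, float alpha,
                            const float *A, int lda,
                            float *B, int ldb)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float x = alpha * B[IDX(i, j, ldb)];
            /* Rows p < i of this column already hold the solution X(p, j). */
            for (int p = 0; p < i; ++p)
                x -= A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            if (!unit_diag)
                x /= A[IDX(i, i, lda)];    /* no test for a zero pivot */
            B[IDX(i, j, ldb)] = x;
        }
    }
}
```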

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `X`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose, transpose, or conjugate transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication. If `alpha==0`, then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	in/out	<type> array. It has dimensions `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strsm, dtrsm, ctrsm, ztrsm

2.7.11. cublas<t>trsmBatched()

cublasStatus_t cublasStrsmBatched( cublasHandle_t    handle, 
                                   cublasSideMode_t  side, 
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans, 
                                   cublasDiagType_t  diag,
                                   int m, 
                                   int n, 
                                   const float *alpha,
                                   float *A[], 
                                   int lda,
                                   float *B[], 
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasDtrsmBatched( cublasHandle_t    handle, 
                                   cublasSideMode_t  side, 
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans, 
                                   cublasDiagType_t  diag,
                                   int m, 
                                   int n, 
                                   const double *alpha,
                                   double *A[], 
                                   int lda,
                                   double *B[], 
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasCtrsmBatched( cublasHandle_t    handle, 
                                   cublasSideMode_t  side, 
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans, 
                                   cublasDiagType_t  diag,
                                   int m, 
                                   int n, 
                                   const cuComplex *alpha,
                                   cuComplex *A[], 
                                   int lda,
                                   cuComplex *B[], 
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasZtrsmBatched( cublasHandle_t    handle, 
                                   cublasSideMode_t  side, 
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans, 
                                   cublasDiagType_t  diag,
                                   int m, 
                                   int n, 
                                   const cuDoubleComplex *alpha,
                                   cuDoubleComplex *A[], 
                                   int lda,
                                   cuDoubleComplex *B[], 
                                   int ldb,
                                   int batchCount);

This function solves an array of triangular linear systems with multiple right-hand-sides

$\begin{cases} \mathrm{op}(A[i])\, X[i] = α B[i] & \text{if side == CUBLAS\_SIDE\_LEFT} \\ X[i]\, \mathrm{op}(A[i]) = α B[i] & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A[i]$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X[i]$ and $B[i]$ are $m \times n$ matrices, and $α$ is a scalar. Also, for matrix $A[i]$

$\mathrm{op}(A[i]) = \begin{cases} A[i] & \text{if trans == CUBLAS\_OP\_N} \\ A[i]^{T} & \text{if trans == CUBLAS\_OP\_T} \\ A[i]^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

The solution $X [i]$ overwrites the right-hand-sides $B [i]$ on exit.

No test for singularity or near-singularity is included in this function.

This function works for any sizes but is intended to be used for matrices of small sizes where the launch overhead is a significant factor. For bigger sizes, it might be advantageous to call the regular cublas<t>trsm batchCount times within a set of CUDA streams.
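
The batched interface takes arrays of pointers, one per system, all sharing the same `m`, `n`, `lda` and `ldb`. The host-side sketch below only illustrates that data layout and the per-system semantics; `ref_strsm_batched_left_lower_n` is a hypothetical name, the pointer arrays here live in host memory (the real routine expects them in device memory), and only the left-sided, lower-mode, non-transposed, non-unit-diagonal case is covered.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/*
 * Hypothetical CPU reference: for each b in [0, batchCount), solve
 * A[b] * X[b] = alpha * B[b] by forward substitution, with X[b]
 * overwriting B[b]. Every A[b] is m x m lower triangular, every B[b] is m x n.
 */
void ref_strsm_batched_left_lower_n(int m, int n, float alpha,
                                    float *A[], int lda,
                                    float *B[], int ldb,
                                    int batchCount)
{
    assert(m >= 0 && n >= 0 && batchCount >= 0);
    for (int b = 0; b < batchCount; ++b) {
        const float *Ab = A[b];
        float *Bb = B[b];
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float x = alpha * Bb[IDX(i, j, ldb)];
                for (int p = 0; p < i; ++p)
                    x -= Ab[IDX(i, p, lda)] * Bb[IDX(p, j, ldb)];
                Bb[IDX(i, j, ldb)] = x / Ab[IDX(i, i, lda)];  /* no pivot check */
            }
        }
    }
}
```

The systems are independent of one another, which is what allows a single batched launch to amortize the launch overhead across many small solves.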

The current implementation is limited to devices with compute capability 2.0 or higher.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A[i]` is on the left or right of `X[i]`.
uplo		input	indicates whether the lower or upper part of matrix `A[i]` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A[i]`): non-transpose, transpose, or conjugate transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A[i]` are unity and should not be accessed.
m		input	number of rows of matrix `B[i]`, with matrix `A[i]` sized accordingly.
n		input	number of columns of matrix `B[i]`, with matrix `A[i]` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication. If `alpha==0`, then `A[i]` is not referenced and `B[i]` does not have to be a valid input.
A	device	input	array of pointers to <type> array, with each array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A[i]`.
B	device	in/out	array of pointers to <type> array, with each array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B[i]`.
batchCount		input	number of pointers contained in A and B.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device is below compute capability 2.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strsm, dtrsm, ctrsm, ztrsm

2.7.12. cublas<t>hemm()

cublasStatus_t cublasChemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZhemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

This function performs the Hermitian matrix-matrix multiplication

$C = \begin{cases} α A B + β C & \text{if side == CUBLAS\_SIDE\_LEFT} \\ α B A + β C & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $α$ and $β$ are scalars.
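
The sketch below shows how a Hermitian matrix can be used when only one triangle is stored. It is a host-side illustration in plain C, not cuBLAS code: the `cplx` struct is a hypothetical stand-in laid out like `cuComplex` (`x` real part, `y` imaginary part), and `herm_lower_at` and `ref_chemm_left_lower` are names invented for this example. It assumes column-major storage, `side == CUBLAS_SIDE_LEFT` and lower fill mode.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical stand-in for cuComplex: x = real part, y = imaginary part. */
typedef struct { float x, y; } cplx;

static cplx cadd(cplx a, cplx b) { cplx r = { a.x + b.x, a.y + b.y }; return r; }
static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Element (i, p) of a Hermitian matrix whose lower triangle is stored. */
cplx herm_lower_at(const cplx *A, int lda, int i, int p)
{
    cplx v;
    if (i == p) {                 /* diagonal: imaginary part assumed zero */
        v.x = A[IDX(i, i, lda)].x;
        v.y = 0.0f;
    } else if (i > p) {           /* stored element */
        v = A[IDX(i, p, lda)];
    } else {                      /* inferred: conjugate of the mirrored element */
        v = A[IDX(p, i, lda)];
        v.y = -v.y;
    }
    return v;
}

/* Hypothetical CPU reference: C = alpha * A * B + beta * C, A is m x m Hermitian. */
void ref_chemm_left_lower(int m, int n, cplx alpha,
                          const cplx *A, int lda,
                          const cplx *B, int ldb,
                          cplx beta, cplx *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            cplx acc = { 0.0f, 0.0f };
            for (int p = 0; p < m; ++p)
                acc = cadd(acc, cmul(herm_lower_at(A, lda, i, p), B[IDX(p, j, ldb)]));
            cplx out = cmul(alpha, acc);
            if (beta.x != 0.0f || beta.y != 0.0f)   /* C is not read when beta == 0 */
                out = cadd(out, cmul(beta, C[IDX(i, j, ldc)]));
            C[IDX(i, j, ldc)] = out;
        }
    }
}
```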

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

chemm, zhemm

2.7.13. cublas<t>herk()

cublasStatus_t cublasCherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float  *alpha,
                           const cuComplex       *A, int lda,
                           const float  *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double *alpha,
                           const cuDoubleComplex *A, int lda,
                           const double *beta,
                           cuDoubleComplex *C, int ldc)

This function performs the Hermitian rank-$k$ update

$C = α\, \mathrm{op}(A)\, \mathrm{op}(A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $\mathrm{op}(A)$ is an $n \times k$ matrix. Also, for matrix $A$

$\mathrm{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$
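
A host-side sketch of this update, for the lower-mode, non-transposed case, is given below. It is plain C rather than a cuBLAS call; the `cplx` struct is a hypothetical stand-in laid out like `cuComplex` (`x` real, `y` imaginary) and `ref_cherk_lower_n` is a name made up for the example. Note that `alpha` and `beta` are real, as in the prototypes above, and that the imaginary part of each diagonal element of `C` is ignored on input and written as zero.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical stand-in for cuComplex: x = real part, y = imaginary part. */
typedef struct { float x, y; } cplx;

/*
 * Hypothetical CPU reference: C = alpha * A * A^H + beta * C, where A is
 * n x k and C is n x n Hermitian; only the lower triangle of C is touched.
 */
void ref_cherk_lower_n(int n, int k, float alpha,
                       const cplx *A, int lda,
                       float beta, cplx *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            float re = 0.0f, im = 0.0f;
            for (int p = 0; p < k; ++p) {
                cplx a = A[IDX(i, p, lda)];
                cplx b = A[IDX(j, p, lda)];
                /* Accumulate a * conj(b). */
                re += a.x * b.x + a.y * b.y;
                im += a.y * b.x - a.x * b.y;
            }
            re *= alpha;
            im *= alpha;
            if (beta != 0.0f) {            /* C is not read when beta == 0 */
                re += beta * C[IDX(i, j, ldc)].x;
                im += beta * C[IDX(i, j, ldc)].y;
            }
            C[IDX(i, j, ldc)].x = re;
            C[IDX(i, j, ldc)].y = (i == j) ? 0.0f : im;  /* real diagonal */
        }
    }
}
```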

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose or conjugate transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

2.7.14. cublas<t>her2k()

cublasStatus_t cublasCher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

This function performs the Hermitian rank-$2k$ update

$C = α\, \mathrm{op}(A)\, \mathrm{op}(B)^{H} + \bar{α}\, \mathrm{op}(B)\, \mathrm{op}(A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $\mathrm{op}(A)$ and $\mathrm{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\left( \mathrm{op}(A), \mathrm{op}(B) \right) = \begin{cases} (A, B) & \text{if trans == CUBLAS\_OP\_N} \\ (A^{H}, B^{H}) & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`): non-transpose or conjugate transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cher2k, zher2k

2.7.15. cublas<t>herkx()

cublasStatus_t cublasCherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

This function performs a variation of the Hermitian rank-$k$ update

$C = α op (A) op (B)^{H} + β C$

$op (A) \text{ and } op (B) = \begin{cases} A \text{ and } B & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} \text{ and } B^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

This routine can be used when matrix B is such that the result is guaranteed to be Hermitian. A common example is when B is a scaled form of A, which is equivalent to B being the product of A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublas<t>dgmm.
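
The scaled case can be checked on the CPU. If B = A D with D a real diagonal matrix, then A B^H = A D A^H, which equals its own conjugate transpose. The following C sketch does not call cuBLAS; the helper names `ref_herkx_n` and `is_hermitian` are hypothetical, only `trans == CUBLAS_OP_N` is covered, and the requirement that D (and alpha) be real is an assumption made for this illustration.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical CPU reference (not a cuBLAS call): C = alpha*A*B^H + beta*C
 * for trans == CUBLAS_OP_N. A and B are n x k, C is n x n, all column-major.
 * For clarity the full matrix C is computed, not just one triangle. */
static void ref_herkx_n(int n, int k, double complex alpha,
                        const double complex *A, int lda,
                        const double complex *B, int ldb,
                        double beta, double complex *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double complex acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * conj(B[j + p * ldb]);
            /* When beta == 0 the previous contents of C are not read. */
            double complex prev = (beta == 0.0) ? 0.0 : beta * C[i + j * ldc];
            C[i + j * ldc] = alpha * acc + prev;
        }
    }
}

/* Returns 1 if the n x n column-major matrix C equals its conjugate transpose. */
static int is_hermitian(int n, const double complex *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            if (C[i + j * ldc] != conj(C[j + i * ldc]))
                return 0;
    return 1;
}
```

Calling `ref_herkx_n` with B built by scaling the columns of A by real factors, then passing the result to `is_hermitian`, should return 1.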

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

2.8. BLAS-like Extension

In this chapter we describe the BLAS-extension functions that perform matrix-matrix operations.

2.8.1. cublas<t>geam()

cublasStatus_t cublasSgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const float           *alpha,
                          const float           *A, int lda,
                          const float           *beta,
                          const float           *B, int ldb,
                          float           *C, int ldc)
cublasStatus_t cublasDgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const double          *alpha,
                          const double          *A, int lda,
                          const double          *beta,
                          const double          *B, int ldb,
                          double          *C, int ldc)
cublasStatus_t cublasCgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const cuComplex       *alpha,
                          const cuComplex       *A, int lda,
                          const cuComplex       *beta,
                          const cuComplex       *B, int ldb,
                          cuComplex       *C, int ldc)
cublasStatus_t cublasZgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const cuDoubleComplex *alpha,
                          const cuDoubleComplex *A, int lda,
                          const cuDoubleComplex *beta,
                          const cuDoubleComplex *B, int ldb,
                          cuDoubleComplex *C, int ldc)

This function performs the matrix-matrix addition/transposition

$C = α op (A) + β op (B)$

where $α$ and $β$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format with dimensions $op (A)$ $m \times n$ , $op (B)$ $m \times n$ and $C$ $m \times n$ , respectively. Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{cases}$

and $op (B)$ is defined similarly for matrix $B$ .

The operation is out-of-place if C does not overlap A or B.

The in-place mode supports the following two operations,

$C = α * C + β op (B)$

$C = α op (A) + β * C$

In in-place mode, if C = A then ldc must equal lda and transa must be CUBLAS_OP_N. If C = B then ldc must equal ldb and transb must be CUBLAS_OP_N. If these requirements are not met, CUBLAS_STATUS_INVALID_VALUE is returned.
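
The in-place rules can be expressed as a small pre-flight check on the host. The C sketch below is only an illustration: `geam_inplace_ok` is a hypothetical helper, not a cuBLAS function, and it detects exact aliasing (C == A or C == B) only, not partial overlap.

```c
#include <assert.h>

/* Hypothetical pre-flight check (not part of cuBLAS): returns 1 if the
 * argument combination satisfies the in-place rules, 0 otherwise.
 * transa_n / transb_n are nonzero when the operation is CUBLAS_OP_N.
 * Only exact aliasing (C == A or C == B) is detected. */
static int geam_inplace_ok(const void *A, int lda, int transa_n,
                           const void *B, int ldb, int transb_n,
                           const void *C, int ldc)
{
    assert(C != 0);
    if (C == A && !(ldc == lda && transa_n))
        return 0;   /* C aliases A but ldc != lda or op(A) is a transpose */
    if (C == B && !(ldc == ldb && transb_n))
        return 0;   /* C aliases B but ldc != ldb or op(B) is a transpose */
    return 1;
}
```

A return value of 0 corresponds to the argument combinations for which the text above says CUBLAS_STATUS_INVALID_VALUE is returned.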

The operation includes the following special cases:

the user can reset matrix C to zero by setting *alpha=*beta=0.

the user can transpose matrix A by setting *alpha=1, *beta=0 and transa = CUBLAS_OP_T.
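
The semantics of the operation, including the two special cases, can be sketched with a CPU reference in C. This is an illustration under stated assumptions, not the cuBLAS implementation: `ref_sgeam` is a hypothetical name, it handles real single precision only, it takes plain integer flags in place of `cublasOperation_t`, and it supports out-of-place use only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for real single precision (not a cuBLAS call).
 * ta / tb are nonzero when op(A) / op(B) is the transpose.
 * All matrices are column-major; C is m x n with leading dimension ldc. */
static void ref_sgeam(int ta, int tb, int m, int n,
                      float alpha, const float *A, int lda,
                      float beta,  const float *B, int ldb,
                      float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* op(X)(i,j) is X(i,j) without transpose and X(j,i) with it.
             * A is not read when alpha == 0; B is not read when beta == 0. */
            float a = 0.0f, b = 0.0f;
            if (alpha != 0.0f)
                a = alpha * (ta ? A[j + (size_t)i * lda] : A[i + (size_t)j * lda]);
            if (beta != 0.0f)
                b = beta * (tb ? B[j + (size_t)i * ldb] : B[i + (size_t)j * ldb]);
            C[i + (size_t)j * ldc] = a + b;
        }
    }
}
```

With `ta` set, `alpha = 1` and `beta = 0`, the call writes the transpose of A into C; with `alpha = beta = 0` it fills C with zeros without reading A or B.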

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
alpha	host or device	input	<type> scalar used for multiplication. If `*alpha == 0`, `A` does not have to be a valid input.
A	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)` if `transb == CUBLAS_OP_N` and `ldb x m` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `*beta == 0`, `B` does not have to be a valid input.
C	device	output	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0`, `alpha,beta=NULL` or improper settings of in-place mode
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.2. cublas<t>dgmm()

cublasStatus_t cublasSdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const float           *A, int lda,
                          const float           *x, int incx,
                          float           *C, int ldc)
cublasStatus_t cublasDdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const double          *A, int lda,
                          const double          *x, int incx,
                          double          *C, int ldc)
cublasStatus_t cublasCdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const cuComplex       *A, int lda,
                          const cuComplex       *x, int incx,
                          cuComplex       *C, int ldc)
cublasStatus_t cublasZdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const cuDoubleComplex *A, int lda,
                          const cuDoubleComplex *x, int incx,
                          cuDoubleComplex *C, int ldc)

This function performs the matrix-matrix multiplication

$C = \begin{cases} A \times \operatorname{diag}(X) & \text{if mode == CUBLAS\_SIDE\_RIGHT} \\ \operatorname{diag}(X) \times A & \text{if mode == CUBLAS\_SIDE\_LEFT} \end{cases}$

where $A$ and $C$ are matrices stored in column-major format with dimensions $m \times n$ . $X$ is a vector of size $n$ if mode == CUBLAS_SIDE_RIGHT and of size $m$ if mode == CUBLAS_SIDE_LEFT. $X$ is gathered from the one-dimensional array x with stride incx. The absolute value of incx is the stride and the sign of incx gives the direction of traversal. If incx is non-negative, x is traversed forward from its first element. Otherwise, x is traversed backward from its last element. The formula for X is

$X [j] = \begin{cases} x [j \times incx] & \text{if } incx \geq 0 \\ x [(χ - 1) \times |incx| - j \times |incx|] & \text{if } incx < 0 \end{cases}$

where $χ = m$ if mode == CUBLAS_SIDE_LEFT and $χ = n$ if mode == CUBLAS_SIDE_RIGHT.
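
The index formula above can be written as a small C function. This is a sketch for illustration only; `dgmm_x_offset` is a hypothetical helper name and is not part of cuBLAS.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: offset into x of the logical element X[j].
 * chi is m for CUBLAS_SIDE_LEFT and n for CUBLAS_SIDE_RIGHT. */
static long dgmm_x_offset(int j, int incx, int chi)
{
    assert(j >= 0 && j < chi);
    long stride = labs((long)incx);
    if (incx >= 0)
        return (long)j * stride;                        /* forward from x[0] */
    return (long)(chi - 1) * stride - (long)j * stride; /* backward from the end */
}
```

For example, with `chi = 3` and `incx = -2` the elements X[0], X[1], X[2] are read from offsets 4, 2 and 0 of x.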

Example 1: if the user wants to perform $\operatorname{diag}(\operatorname{diag}(B)) \times A$ , then $incx = ldb + 1$ where $ldb$ is the leading dimension of matrix B, either row-major or column-major.

Example 2: if the user wants to perform $α \times A$ , then there are two choices, either cublas<t>geam with *beta=0 and transa == CUBLAS_OP_N or cublas<t>dgmm with incx=0 and x[0]=alpha.
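
Both examples can be reproduced with a CPU reference in C. The sketch below is not the cuBLAS implementation: `ref_sdgmm` is a hypothetical name, it handles real single precision only, and it takes an integer flag `left` in place of `cublasSideMode_t`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical CPU reference for real single precision (not a cuBLAS call).
 * left != 0 computes C = diag(X) * A, otherwise C = A * diag(X).
 * A and C are m x n, column-major. */
static void ref_sdgmm(int left, int m, int n,
                      const float *A, int lda,
                      const float *x, int incx,
                      float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    int chi = left ? m : n;                 /* length of X */
    long stride = labs((long)incx);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            int e = left ? i : j;           /* X element that scales A(i,j) */
            long off = (incx >= 0) ? e * stride
                                   : (long)(chi - 1) * stride - e * stride;
            C[i + (long)j * ldc] = x[off] * A[i + (long)j * lda];
        }
    }
}
```

Passing a matrix B as `x` with `incx = ldb + 1` walks the diagonal of B (Example 1), and passing a pointer to a single scalar with `incx = 0` scales every element of A by that scalar (Example 2).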

The operation is out-of-place. In-place operation works only if lda = ldc.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
mode		input	left multiply if `mode == CUBLAS_SIDE_LEFT` or right multiply if `mode == CUBLAS_SIDE_RIGHT`
m		input	number of rows of matrix `A` and `C`.
n		input	number of columns of matrix `A` and `C`.
A	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)`
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
x	device	input	one-dimensional <type> array of size $|incx| \times m$ if `mode == CUBLAS_SIDE_LEFT` and $|incx| \times n$ if `mode == CUBLAS_SIDE_RIGHT`
incx		input	stride of one-dimensional array `x`.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0` or `mode != CUBLAS_SIDE_LEFT, CUBLAS_SIDE_RIGHT`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.3. cublas<t>getrfBatched()

cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle,
                                   int n, 
                                   float *Aarray[],
                                   int lda, 
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasDgetrfBatched(cublasHandle_t handle,
                                   int n, 
                                   double *Aarray[],
                                   int lda, 
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasCgetrfBatched(cublasHandle_t handle,
                                   int n, 
                                   cuComplex *Aarray[],
                                   int lda, 
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasZgetrfBatched(cublasHandle_t handle,
                                   int n, 
                                   cuDoubleComplex *Aarray[],
                                   int lda, 
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

Aarray is an array of pointers to matrices stored in column-major format with dimensions nxn and leading dimension lda.

This function performs the LU factorization of each Aarray[i] for i = 0, ..., batchSize-1 by the following equation

$P * Aarray [i] = L * U$

where P is a permutation matrix which represents partial pivoting with row interchanges. L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix.

Formally P is written as a product of permutation matrices Pj, for j = 1,2,...,n, say P = P1 * P2 * P3 * .... * Pn. Pj is a permutation matrix which interchanges two rows of vector x when performing Pj*x. Pj can be constructed from the j-th element of PivotArray[i] by the following MATLAB code

% piv is the pivot vector of matrix i (PivotArray[i] in the notation above).
% MATLAB indexes piv from 1; C indexes it from 0.
Pj = eye(n);
Pj([j piv(j)], :) = Pj([piv(j) j], :);   % swap rows j and piv(j)

L and U are written back to the original matrix A, and the diagonal elements of L are discarded. L and U can be constructed by the following MATLAB code

% A is the n x n matrix after getrf.
L = eye(n);
for j = 1:n
    L(j+1:n,j) = A(j+1:n,j);
end
U = zeros(n);
for i = 1:n
    U(i,i:n) = A(i,i:n);
end
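
The same unpacking can be sketched in C for a matrix that has been copied back to the host. `unpack_lu` is a hypothetical helper, not a cuBLAS function, and it assumes real double precision and 0-based indexing.

```c
#include <assert.h>

/* Hypothetical helper: split a packed n x n LU result (column-major, leading
 * dimension lda) into explicit L (unit lower triangular) and U (upper
 * triangular), both stored column-major with leading dimension n. */
static void unpack_lu(int n, const double *A, int lda, double *L, double *U)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double a = A[i + j * lda];
            /* Strictly below the diagonal belongs to L; its diagonal is 1. */
            L[i + j * n] = (i > j) ? a : (i == j ? 1.0 : 0.0);
            /* The diagonal and everything above it belongs to U. */
            U[i + j * n] = (i <= j) ? a : 0.0;
        }
    }
}
```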

If matrix A(=Aarray[i]) is singular, getrf still works and the value of info(=infoArray[i]) reports the first row index at which the LU factorization cannot proceed. If info is k, U(k,k) is zero. The equation P*A=L*U still holds, however L and U are constructed by the following MATLAB code

% A is the n x n matrix after getrf.
% info is k, which means U(k,k) is zero.
L = eye(n);
for j = 1:k-1
    L(j+1:n,j) = A(j+1:n,j);
end
U = zeros(n);
for i = 1:k-1
    U(i,i:n) = A(i,i:n);
end
for i = k:n
    U(i,k:n) = A(i,k:n);
end

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrfBatched supports non-pivot LU factorization if PivotArray is NULL.

cublas<t>getrfBatched supports arbitrary dimension.

cublas<t>getrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `Aarray[i]`.
Aarray	device	input/output	array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
PivotArray	device	output	array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `PivotArray` is NULL, pivoting is disabled.
infoArray	device	output	array of size `batchSize`, where info(=infoArray[i]) contains the information of factorization of `Aarray[i]`. If info=0, the execution is successful. If info = -j, the j-th parameter had an illegal value. If info = k, U(k,k) is 0. The factorization has been completed, but U is exactly singular.
batchSize		input	number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,batchSize,lda <0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability < 200
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgetrf, dgetrf, cgetrf, zgetrf

2.8.4. cublas<t>getrsBatched()

cublasStatus_t cublasSgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans, 
                                   int n, 
                                   int nrhs, 
                                   const float *Aarray[], 
                                   int lda, 
                                   const int *devIpiv, 
                                   float *Barray[], 
                                   int ldb, 
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasDgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans, 
                                   int n, 
                                   int nrhs, 
                                   const double *Aarray[], 
                                   int lda, 
                                   const int *devIpiv, 
                                   double *Barray[], 
                                   int ldb, 
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasCgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans, 
                                   int n, 
                                   int nrhs, 
                                   const cuComplex *Aarray[], 
                                   int lda, 
                                   const int *devIpiv, 
                                   cuComplex *Barray[], 
                                   int ldb, 
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasZgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans, 
                                   int n, 
                                   int nrhs, 
                                   const cuDoubleComplex *Aarray[], 
                                   int lda, 
                                   const int *devIpiv, 
                                   cuDoubleComplex *Barray[], 
                                   int ldb, 
                                   int *info,
                                   int batchSize);

This function solves an array of systems of linear equations of the form:

$op (A [i]) X [i] = B [i]$

where $A [i]$ is a matrix which has been LU factorized with pivoting, and $X [i]$ and $B [i]$ are $n \times nrhs$ matrices. Also, for matrix $A$

$op (A [i]) = \begin{cases} A [i] & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} [i] & \text{if trans == CUBLAS\_OP\_T} \\ A^{H} [i] & \text{if trans == CUBLAS\_OP\_C} \end{cases}$
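
For a single system with `trans == CUBLAS_OP_N`, the solve step can be sketched on the CPU as a row permutation followed by forward and backward substitution. This is an illustration, not the cuBLAS implementation: `ref_getrs_n` is a hypothetical name, it handles real double precision only, and it assumes the pivot entries are 1-based row indices as in LAPACK.

```c
#include <assert.h>

/* Hypothetical CPU reference (not a cuBLAS call) for one system with
 * trans == CUBLAS_OP_N: solves A * X = B in place in B, given the packed
 * LU factors of A (column-major, unit diagonal of L not stored).
 * ipiv may be a null pointer (no pivoting); otherwise its entries are
 * assumed to be 1-based row indices, as in LAPACK. */
static void ref_getrs_n(int n, int nrhs, const double *LU, int lda,
                        const int *ipiv, double *B, int ldb)
{
    assert(n >= 0 && nrhs >= 0);
    for (int c = 0; c < nrhs; ++c) {
        double *b = B + (long)c * ldb;
        /* Apply the recorded row interchanges in order: b <- P * b. */
        if (ipiv) {
            for (int j = 0; j < n; ++j) {
                int p = ipiv[j] - 1;
                double t = b[j]; b[j] = b[p]; b[p] = t;
            }
        }
        /* Forward substitution with the unit lower triangular L. */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < i; ++k)
                b[i] -= LU[i + (long)k * lda] * b[k];
        /* Back substitution with the upper triangular U. */
        for (int i = n - 1; i >= 0; --i) {
            for (int k = i + 1; k < n; ++k)
                b[i] -= LU[i + (long)k * lda] * b[k];
            b[i] /= LU[i + (long)i * lda];
        }
    }
}
```

On return, each column of B holds the corresponding column of the solution X.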

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrsBatched supports non-pivot LU factorization if devIpiv is NULL.

cublas<t>getrsBatched supports arbitrary dimension.

cublas<t>getrsBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows and columns of `Aarray[i]`.
nrhs		input	number of columns of `Barray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
devIpiv	device	input	array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `devIpiv` is NULL, pivoting for all `Aarray[i]` is ignored.
Barray	device	input/output	array of pointers to <type> array, with each array of dim. `n x nrhs` with `ldb>=max(1,n)`.
ldb		input	leading dimension of two-dimensional array used to store each solution matrix `Barray[i]`.
info	host	output	If info=0, the execution is successful. If info = -j, the j-th parameter had an illegal value.
batchSize		input	number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,batchSize,lda <0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability < 200
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgetrs, dgetrs, cgetrs, zgetrs

2.8.5. cublas<t>getriBatched()

cublasStatus_t cublasSgetriBatched(cublasHandle_t handle,
                                   int n,
                                   float *Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   float *Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasDgetriBatched(cublasHandle_t handle,
                                   int n,
                                   double *Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   double *Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasCgetriBatched(cublasHandle_t handle,
                                   int n,
                                   cuComplex *Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   cuComplex *Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasZgetriBatched(cublasHandle_t handle,
                                   int n,
                                   cuDoubleComplex *Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   cuDoubleComplex *Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

Aarray and Carray are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimension lda and ldc respectively.

This function performs the inversion of matrices A[i] for i = 0, ..., batchSize-1.

Prior to calling cublas<t>getriBatched, the matrix A[i] must be factorized first using the routine cublas<t>getrfBatched. After the call of cublas<t>getrfBatched, the matrix pointed to by Aarray[i] will contain the LU factors of the matrix A[i] and the vector pointed to by (PivotArray+i*n) will contain the pivoting sequence.

Following the LU factorization, cublas<t>getriBatched uses forward and backward triangular solvers to complete inversion of matrices A[i] for i = 0, ..., batchSize-1. The inversion is out-of-place, so memory space of Carray[i] cannot overlap memory space of Aarray[i].

Typically all parameters in cublas<t>getrfBatched would be passed into cublas<t>getriBatched. For example,

// step 1: perform in-place LU decomposition, P*A = L*U.
//      Aarray[i] is the n*n matrix A[i]
    cublasDgetrfBatched(handle, n, Aarray, lda, PivotArray, infoArray, batchSize);
//      check infoArray[i] to see if factorization of A[i] is successful or not.
//      Aarray[i] now contains the LU factorization of A[i]

// step 2: perform out-of-place inversion, Carray[i] = inv(A[i])
    cublasDgetriBatched(handle, n, Aarray, lda, PivotArray, Carray, ldc, infoArray, batchSize);
//      check infoArray[i] to see if inversion of A[i] is successful or not.

The user can check singularity from either cublas<t>getrfBatched or cublas<t>getriBatched.
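
Because infoArray resides in device memory, it has to be copied to the host (for example with cudaMemcpy) before it can be inspected. The check itself can be sketched in C as follows; `first_failed_matrix` is a hypothetical helper, not a cuBLAS function, and it assumes `info` is already a host copy of infoArray.

```c
#include <assert.h>

/* Hypothetical host-side helper: info is a host copy of infoArray.
 * Returns the index of the first matrix whose info is nonzero, or -1 if
 * every entry is 0. */
static int first_failed_matrix(const int *info, int batchSize)
{
    assert(batchSize >= 0);
    for (int i = 0; i < batchSize; ++i)
        if (info[i] != 0)
            return i;
    return -1;
}
```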

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

If cublas<t>getrfBatched is performed without pivoting, PivotArray of cublas<t>getriBatched should be NULL.

cublas<t>getriBatched supports arbitrary dimension.

cublas<t>getriBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `Aarray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
PivotArray	device	input	array of size `n*batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `PivotArray` is NULL, pivoting is disabled.
Carray	device	output	array of pointers to <type> array, with each array of dimension `n*n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
infoArray	device	output	array of size `batchSize`, where info(=infoArray[i]) contains the information of inversion of `A[i]`. If info=0, the execution is successful. If info = k, U(k,k) is 0. The U is exactly singular and the inversion failed.
batchSize		input	number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,batchSize,lda,ldc <0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability < 200
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.6. cublas<t>matinvBatched()

cublasStatus_t cublasSmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const float *A[],          
                                    int lda,
                                    float *Ainv[],              
                                    int lda_inv,
                                    int *info,                   
                                    int batchSize);

cublasStatus_t cublasDmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const double *A[],          
                                    int lda,
                                    double *Ainv[],              
                                    int lda_inv,
                                    int *info,                   
                                    int batchSize);

cublasStatus_t cublasCmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const cuComplex *A[],          
                                    int lda,
                                    cuComplex *Ainv[],              
                                    int lda_inv,
                                    int *info,                   
                                    int batchSize);

cublasStatus_t cublasZmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const cuDoubleComplex *A[],          
                                    int lda,
                                    cuDoubleComplex *Ainv[],              
                                    int lda_inv,
                                    int *info,                   
                                    int batchSize);

A and Ainv are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimension lda and lda_inv respectively.

This function performs the inversion of matrices A[i] for i = 0, ..., batchSize-1.

This function is a shortcut for cublas<t>getrfBatched plus cublas<t>getriBatched. However, it only works if n is no greater than 32. Otherwise, the user has to go through cublas<t>getrfBatched and cublas<t>getriBatched.

If the matrix A[i] is singular, then info[i] reports singularity, the same as cublas<t>getrfBatched.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `A[i]`.
A	device	input	array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `A[i]`.
Ainv	device	output	array of pointers to <type> array, with each array of dimension `n*n` with `lda_inv>=max(1,n)`.
lda_inv		input	leading dimension of two-dimensional array used to store each matrix `Ainv[i]`.
info	device	output	array of size `batchSize`, where info[i] contains the information of inversion of `A[i]`. If info[i]=0, the execution is successful. If info[i]=k, U(k,k) is 0. The U is exactly singular and the inversion failed.
batchSize		input	number of pointers contained in A.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,batchSize,lda,lda_inv <0; or n >32`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability < 200
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.7. cublas<t>geqrfBatched()

cublasStatus_t cublasSgeqrfBatched( cublasHandle_t handle, 
                                    int m, 
                                    int n,
                                    float *Aarray[],  
                                    int lda, 
                                    float *TauArray[],                                                         
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasDgeqrfBatched( cublasHandle_t handle, 
                                    int m, 
                                    int n,
                                    double *Aarray[],  
                                    int lda, 
                                    double *TauArray[],                                                         
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasCgeqrfBatched( cublasHandle_t handle, 
                                    int m, 
                                    int n,
                                    cuComplex *Aarray[],  
                                    int lda, 
                                    cuComplex *TauArray[],                                                           
                                    int *info,
                                    int batchSize);
                                                            
cublasStatus_t cublasZgeqrfBatched( cublasHandle_t handle, 
                                    int m, 
                                    int n,
                                    cuDoubleComplex *Aarray[],  
                                    int lda, 
                                    cuDoubleComplex *TauArray[],                                                        
                                    int *info,
                                    int batchSize);

Aarray is an array of pointers to matrices stored in column-major format with dimensions m x n and leading dimension lda. TauArray is an array of pointers to vectors of dimension at least max(1, min(m, n)).

This function performs the QR factorization of each Aarray[i] for i = 0, ..., batchSize-1 using Householder reflections. Each matrix Q[i] is represented as a product of elementary reflectors and is stored in the lower part of each Aarray[i] as follows:

     Q[j] = H[j][1] H[j][2] ... H[j][k], where k = min(m,n).

     Each H[j][i] has the form

          H[j][i] = I - tau[j] * v * v'
          where tau[j] is a real scalar, and v is a real vector with
          v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in Aarray[j][i+1:m,i],
          and tau in TauArray[j][i]
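
As a minimal host-side sketch of how one stored reflector could be consumed, the function below applies H = I - tau * v * v' to a vector x, reading v from one column of a factored matrix that has been copied back to the host. The name `apply_reflector`, the 0-based indexing and the restriction to real double data are assumptions made for illustration; none of them is part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuBLAS routine): computes x := (I - tau*v*v') * x
 * for reflector i (0-based) of a factored m x n matrix A stored in
 * column-major order with leading dimension lda.
 * v[0..i-1] = 0, v[i] = 1 (implicit), v[i+1..m-1] = A[i+1..m-1, i]. */
void apply_reflector(int m, int i, const double *A, int lda,
                     double tau, double *x)
{
    assert(i >= 0 && i < m && lda >= m);
    const double *col = A + (size_t)i * (size_t)lda;

    /* w = v' * x, using the implicit unit entry v[i] = 1 */
    double w = x[i];
    for (int r = i + 1; r < m; ++r)
        w += col[r] * x[r];

    /* x := x - tau * w * v */
    x[i] -= tau * w;
    for (int r = i + 1; r < m; ++r)
        x[r] -= tau * w * col[r];
}
```

Applying the reflectors for i = k-1 down to 0 to a vector would give the product of Q with that vector under these assumptions.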

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>geqrfBatched supports arbitrary dimensions.

cublas<t>geqrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
m		input	number of rows of `Aarray[i]`.
n		input	number of columns of `Aarray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
TauArray	device	output	array of pointers to <type> vector, with each vector of dim. `max(1,min(m,n))`.
info	host	output	If info=0, the parameters passed to the function are valid. If info<0, the parameter in position -info is invalid.
batchSize		input	number of pointers contained in Aarray

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,batchSize <0 or lda < imax(1,m)`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 2.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgeqrf, dgeqrf, cgeqrf, zgeqrf

cublas<t>gelsBatched()

cublasStatus_t cublasSgelsBatched( cublasHandle_t handle, 
                                   cublasOperation_t trans, 
                                   int m, 
                                   int n,
                                   int nrhs,
                                   float *Aarray[],  
                                   int lda, 
                                   float *Carray[],
                                   int ldc,                                                                 
                                   int *info, 
                                   int *devInfoArray,
                                   int batchSize );

cublasStatus_t cublasDgelsBatched( cublasHandle_t handle,
                                   cublasOperation_t trans,  
                                   int m, 
                                   int n,
                                   int nrhs,
                                   double *Aarray[],  
                                   int lda, 
                                   double *Carray[],
                                   int ldc,                                                                 
                                   int *info, 
                                   int *devInfoArray,
                                   int batchSize );

cublasStatus_t cublasCgelsBatched( cublasHandle_t handle, 
                                   cublasOperation_t trans,  
                                   int m, 
                                   int n,
                                   int nrhs,
                                   cuComplex *Aarray[],  
                                   int lda, 
                                   cuComplex *Carray[],
                                   int ldc,                                                                 
                                   int *info, 
                                   int *devInfoArray,
                                   int batchSize );
                                                            
cublasStatus_t cublasZgelsBatched( cublasHandle_t handle, 
                                   cublasOperation_t trans, 
                                   int m, 
                                   int n,
                                   int nrhs,
                                   cuDoubleComplex *Aarray[],  
                                   int lda, 
                                   cuDoubleComplex *Carray[],
                                   int ldc,                                                                 
                                   int *info, 
                                   int *devInfoArray,
                                   int batchSize );

Aarray is an array of pointers to matrices stored in column-major format with dimensions m x n and leading dimension lda. Carray is an array of pointers to matrices stored in column-major format with dimensions n x nrhs and leading dimension ldc.

This function finds the least squares solution of a batch of overdetermined systems. It solves the least squares problem described as follows:

            minimize  || Carray[i] - Aarray[i]*Xarray[i] || , with i = 0, ..., batchSize-1

On exit, each Aarray[i] is overwritten with its QR factorization and each Carray[i] is overwritten with the least squares solution.
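
The objective being minimized can be evaluated on the host for one problem of the batch, for example to check a returned solution. The sketch below is an illustration only: the name `ls_residual_sq` is hypothetical, double precision is assumed, and the squared Frobenius norm is returned so that no math library call is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side check (not part of cuBLAS): returns the squared
 * Frobenius norm of C - A*X for one problem of the batch.
 * A is m x n (leading dimension lda), X is n x nrhs (ldx),
 * C is m x nrhs (ldc); all are column-major. */
double ls_residual_sq(int m, int n, int nrhs,
                      const double *A, int lda,
                      const double *X, int ldx,
                      const double *C, int ldc)
{
    assert(m >= n && n >= 0 && nrhs >= 0);
    assert(lda >= m && ldx >= n && ldc >= m);
    double sum = 0.0;
    for (int j = 0; j < nrhs; ++j) {
        for (int i = 0; i < m; ++i) {
            /* r = C(i,j) - sum_p A(i,p) * X(p,j) */
            double r = C[i + (size_t)j * ldc];
            for (int p = 0; p < n; ++p)
                r -= A[i + (size_t)p * lda] * X[p + (size_t)j * ldx];
            sum += r * r;
        }
    }
    return sum;
}
```

Because both Aarray[i] and Carray[i] are overwritten on exit, such a check needs copies of the original matrix and right-hand side taken before the call.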

cublas<t>gelsBatched supports only the non-transpose operation and only solves over-determined systems (m >= n).

cublas<t>gelsBatched only supports compute capability 2.0 or above.

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
trans		input	operation op(`Aarray[i]`) that is non- or (conj.) transpose. Only non-transpose operation is currently supported.
m		input	number of rows of each `Aarray[i]`.
n		input	number of columns of each `Aarray[i]` and rows of each `Carray[i]`.
nrhs		input	number of columns of each `Carray[i]`.
Aarray	device	input/output	array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
Carray	device	input/output	array of pointers to <type> array, with each array of dim. `n x nrhs` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
info	host	output	If info=0, the parameters passed to the function are valid. If info<0, the parameter in position -info is invalid.
devInfoArray	device	output	optional array of integers of dimension batchSize. If non-null, every element devInfoArray[i] contains a value V with the following meaning: V = 0: the i-th problem was successfully solved; V > 0: the V-th diagonal element of Aarray[i] is zero, so Aarray[i] does not have full rank.
batchSize		input	number of pointers contained in Aarray and Carray
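
Since devInfoArray resides in device memory, its contents would first have to be copied to the host (for instance with cudaMemcpy) before being inspected. The sketch below shows one possible way to scan such a host copy; the function name `first_rank_deficient` and the parameter name `hostInfoArray` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of cuBLAS): scans a host copy of
 * devInfoArray and returns the index of the first problem whose value V is
 * greater than zero (Aarray[i] does not have full rank), or -1 if every
 * problem reports V = 0. */
int first_rank_deficient(const int *hostInfoArray, int batchSize)
{
    assert(hostInfoArray != NULL && batchSize >= 0);
    for (int i = 0; i < batchSize; ++i) {
        if (hostInfoArray[i] > 0)
            return i;
    }
    return -1;
}
```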

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,batchSize <0` , `lda < imax(1,m)` or `ldc < imax(1,m)`
`CUBLAS_STATUS_NOT_SUPPORTED`	the parameter `m<n`, or `trans` is different from non-transpose
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 2.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgels, dgels, cgels, zgels

cublas<t>tpttr()

cublasStatus_t cublasStpttr ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const float *AP,
                              float *A,
                              int lda );
                                       
cublasStatus_t cublasDtpttr ( cublasHandle_t handle, 
                              cublasFillMode_t uplo,
                              int n,
                              const double *AP,
                              double *A,
                              int lda );

cublasStatus_t cublasCtpttr ( cublasHandle_t handle, 
                              cublasFillMode_t uplo, 
                              int n,
                              const cuComplex *AP,
                              cuComplex *A,
                              int lda );
                                       
cublasStatus_t cublasZtpttr ( cublasHandle_t handle, 
                              cublasFillMode_t uplo,
                              int n,
                              const cuDoubleComplex *AP,
                              cuDoubleComplex *A,
                              int lda );

This function performs the conversion from the triangular packed format to the triangular format.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements of AP are copied into the lower triangular part of the triangular matrix A and the upper part of A is left untouched. If uplo == CUBLAS_FILL_MODE_UPPER then the elements of AP are copied into the upper triangular part of the triangular matrix A and the lower part of A is left untouched.
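
The copy can be pictured with a small host-side reference. The sketch below assumes the column-by-column packed layout without gaps (as in LAPACK packed storage), which this section does not restate; `tpttr_ref` and the `sketch_fill_t` enum are hypothetical stand-ins, not cuBLAS names.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for cublasFillMode_t (illustration only). */
typedef enum { SKETCH_FILL_LOWER, SKETCH_FILL_UPPER } sketch_fill_t;

/* Hypothetical host reference (not part of cuBLAS): unpacks AP into the
 * selected triangle of the column-major n x n matrix A. Entries of the
 * opposite triangle of A are never written. */
void tpttr_ref(sketch_fill_t uplo, int n, const float *AP, float *A, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    size_t p = 0; /* next unread position in AP */
    for (int j = 0; j < n; ++j) {
        /* rows of column j that belong to the selected triangle */
        int first = (uplo == SKETCH_FILL_LOWER) ? j : 0;
        int last  = (uplo == SKETCH_FILL_LOWER) ? n - 1 : j;
        for (int i = first; i <= last; ++i)
            A[i + (size_t)j * lda] = AP[p++];
    }
}
```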

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `AP` contains lower or upper part of matrix `A`.
n		input	number of rows and columns of matrix `A`.
AP	device	input	<type> array with $A$ stored in packed format.
A	device	output	<type> array of dimensions `lda x n` , with `lda>=max(1,n)`. The opposite side of A is left untouched.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

stpttr, dtpttr, ctpttr, ztpttr

cublas<t>trttp()

cublasStatus_t cublasStrttp ( cublasHandle_t handle, 
                              cublasFillMode_t uplo, 
                              int n,
                              const float *A,
                              int lda,
                              float *AP );

cublasStatus_t cublasDtrttp ( cublasHandle_t handle, 
                              cublasFillMode_t uplo, 
                              int n, 
                              const double *A,
                              int lda,
                              double *AP );

cublasStatus_t cublasCtrttp ( cublasHandle_t handle, 
                              cublasFillMode_t uplo, 
                              int n,
                              const cuComplex *A,
                              int lda,
                              cuComplex *AP );
                              
cublasStatus_t cublasZtrttp ( cublasHandle_t handle, 
                              cublasFillMode_t uplo, 
                              int n, 
                              const cuDoubleComplex *A,
                              int lda,
                              cuDoubleComplex *AP );

This function performs the conversion from the triangular format to the triangular packed format.

If uplo == CUBLAS_FILL_MODE_LOWER then the lower triangular part of the triangular matrix A is copied into the array AP. If uplo == CUBLAS_FILL_MODE_UPPER then the upper triangular part of the triangular matrix A is copied into the array AP.
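
A host-side reference for this direction is sketched below. It assumes the column-by-column packed layout without gaps (as in LAPACK packed storage), so AP receives n*(n+1)/2 elements; `trttp_ref` and the `sketch_fill_t` enum are hypothetical stand-ins, not cuBLAS names.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for cublasFillMode_t (illustration only). */
typedef enum { SKETCH_FILL_LOWER, SKETCH_FILL_UPPER } sketch_fill_t;

/* Hypothetical host reference (not part of cuBLAS): packs the selected
 * triangle of the column-major n x n matrix A into AP, column by column.
 * Entries of the opposite triangle of A are never read. */
void trttp_ref(sketch_fill_t uplo, int n, const float *A, int lda, float *AP)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    size_t p = 0; /* next free position in AP */
    for (int j = 0; j < n; ++j) {
        /* rows of column j that belong to the selected triangle */
        int first = (uplo == SKETCH_FILL_LOWER) ? j : 0;
        int last  = (uplo == SKETCH_FILL_LOWER) ? n - 1 : j;
        for (int i = first; i <= last; ++i)
            AP[p++] = A[i + (size_t)j * lda];
    }
}
```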

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `A` is referenced.
n		input	number of rows and columns of matrix `A`.
A	device	input	<type> array of dimensions `lda x n` , with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
AP	device	output	<type> array with $A$ stored in packed format.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strttp, dtrttp, ctrttp, ztrttp

2.8.12. cublasGemmEx()

cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                           cublasOperation_t transa, 
                           cublasOperation_t transb,
                           int m, 
                           int n, 
                           int k,
                           const void    *alpha,
                           const void     *A, 
                           cudaDataType_t Atype, 
                           int lda,
                           const void     *B, 
                           cudaDataType_t Btype, 
                           int ldb,
                           const void    *beta,
                           void           *C, 
                           cudaDataType_t Ctype, 
                           int ldc,
                           cudaDataType_t computeType, 
                           cublasGemmAlgo_t algo)

This function is an extension of cublas<t>gemm that allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run. Currently supported combinations of arguments are listed further down in this section.

$C = α op (A) op (B) + β C$

$op (A) = \begin{cases} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{cases}$

and $op (B)$ is defined similarly for matrix $B$ .
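
The update can be written out as a plain host-side loop nest. The sketch below is only a reference for the arithmetic on real single-precision data, where the conjugate transpose coincides with the transpose; `gemm_ref`, `op_elem` and the `sketch_op_t` enum are hypothetical names, and no mixed-precision behaviour is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for cublasOperation_t, real data only (illustration). */
typedef enum { SKETCH_OP_N, SKETCH_OP_T } sketch_op_t;

/* Element (i,j) of op(M) for a column-major M with leading dimension ld. */
static float op_elem(sketch_op_t op, const float *M, int ld, int i, int j)
{
    return (op == SKETCH_OP_N) ? M[i + (size_t)j * ld]
                               : M[j + (size_t)i * ld];
}

/* Hypothetical host reference (not part of cuBLAS):
 * C := alpha * op(A) * op(B) + beta * C, with C of size m x n. */
void gemm_ref(sketch_op_t transa, sketch_op_t transb, int m, int n, int k,
              float alpha, const float *A, int lda,
              const float *B, int ldb,
              float beta, float *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0 && ldc >= (m > 1 ? m : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += op_elem(transa, A, lda, i, p) *
                       op_elem(transb, B, ldb, p, j);
            float *c = &C[i + (size_t)j * ldc];
            /* when beta == 0, C is not read, so it need not hold valid data */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```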

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host or device	input	scalar scaling factor for A*B; of same type as computeType.
A	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
Btype		input	enumerant specifying the datatype of matrix `B`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	scalar scaling factor for C; of same type as computeType. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.
computeType		input	enumerant specifying the computation type for `cublasGemmEx`.
algo		input	enumerant specifying the algorithm for `cublasGemmEx`.

The computation types supported by cublasGemmEx are listed below:

computeType
`CUDA_R_16F`
`CUDA_R_32F`
`CUDA_R_32I`
`CUDA_R_64F`
`CUDA_C_32F`
`CUDA_C_64F`

For the CUDA_R_16F computation type, the matrix type combinations supported by cublasGemmEx are listed below:

A	B	C
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`

For the CUDA_R_32I computation type, the matrix type combinations supported by cublasGemmEx are listed below. This path is supported only when alpha and beta are each either 1 or 0, A and B are 32-bit aligned, and lda and ldb are multiples of 4.

A	B	C
`CUDA_R_8I`	`CUDA_R_8I`	`CUDA_R_32I`
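
The restrictions on the CUDA_R_32I path can be checked on the host before the call. The sketch below is a hypothetical pre-flight test, not a cuBLAS function; it takes alpha and beta by value as int, following the rule that the scalars have the same type as computeType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-flight check (not part of cuBLAS) for the CUDA_R_32I
 * path: alpha and beta must each be 0 or 1, A and B must be 32-bit
 * (4-byte) aligned, and lda and ldb must be multiples of 4.
 * Returns 1 when all conditions hold, 0 otherwise. */
int int8_gemm_args_ok(int alpha, int beta,
                      const void *A, int lda,
                      const void *B, int ldb)
{
    assert(A != NULL && B != NULL);
    int scalars_ok = (alpha == 0 || alpha == 1) && (beta == 0 || beta == 1);
    int aligned_ok = ((uintptr_t)A % 4 == 0) && ((uintptr_t)B % 4 == 0);
    int ld_ok      = (lda % 4 == 0) && (ldb % 4 == 0);
    return scalars_ok && aligned_ok && ld_ok;
}
```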

For the CUDA_R_32F computation type, the matrix type combinations supported by cublasGemmEx are listed below:

A	B	C
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_8I`	`CUDA_R_8I`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`

For the CUDA_R_64F computation type, the matrix type combinations supported by cublasGemmEx are listed below:

A	B	C
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`

For the CUDA_C_32F computation type, the matrix type combinations supported by cublasGemmEx are listed below:

A	B	C
`CUDA_C_8I`	`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`

For the CUDA_C_64F computation type, the matrix type combinations supported by cublasGemmEx are listed below:

A	B	C
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`
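
The tables above can be folded into a single host-side lookup, which may be convenient for rejecting an unsupported combination before the call. The sketch below transcribes only the rows listed in this section; the `sketch_dtype` enumerants are local stand-ins for the cudaDataType_t values, and `combo_supported` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the cudaDataType_t enumerants named in the tables;
 * a real program would use the values from the CUDA headers instead. */
typedef enum {
    T_R_16F, T_R_32F, T_R_64F, T_R_8I, T_R_32I,
    T_C_8I, T_C_32F, T_C_64F
} sketch_dtype;

typedef struct { sketch_dtype compute, a, b, c; } sketch_combo;

/* One row per (computeType, A, B, C) combination listed above. */
static const sketch_combo supported[] = {
    { T_R_16F, T_R_16F, T_R_16F, T_R_16F },
    { T_R_32I, T_R_8I,  T_R_8I,  T_R_32I },
    { T_R_32F, T_R_16F, T_R_16F, T_R_16F },
    { T_R_32F, T_R_16F, T_R_16F, T_R_32F },
    { T_R_32F, T_R_8I,  T_R_8I,  T_R_32F },
    { T_R_32F, T_R_32F, T_R_32F, T_R_32F },
    { T_R_64F, T_R_64F, T_R_64F, T_R_64F },
    { T_C_32F, T_C_8I,  T_C_8I,  T_C_32F },
    { T_C_32F, T_C_32F, T_C_32F, T_C_32F },
    { T_C_64F, T_C_64F, T_C_64F, T_C_64F },
};

/* Returns 1 if the combination appears in the table, 0 otherwise. */
int combo_supported(sketch_dtype compute, sketch_dtype a,
                    sketch_dtype b, sketch_dtype c)
{
    for (size_t i = 0; i < sizeof supported / sizeof supported[0]; ++i) {
        const sketch_combo *s = &supported[i];
        if (s->compute == compute && s->a == a && s->b == b && s->c == c)
            return 1;
    }
    return 0;
}
```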

The cublasGemmEx routine can be run with the following algorithms.

cublasGemmAlgo_t	Meaning
`CUBLAS_GEMM_DFALT`	Apply Heuristics to select the GEMM algorithm
`CUBLAS_GEMM_ALGO0 to CUBLAS_GEMM_ALGO17`	Explicitly choose an algorithm
`CUBLAS_GEMM_DFALT_TENSOR_OP`	Apply Heuristics to select the GEMM algorithm while allowing the use of Tensor Core operations if possible
`CUBLAS_GEMM_ALGO0_TENSOR_OP to CUBLAS_GEMM_ALGO2_TENSOR_OP`	Explicitly choose a GEMM algorithm allowing it to use Tensor Core operations if possible

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ARCH_MISMATCH`	`cublasGemmEx` is only supported for GPUs with architecture capabilities equal to or greater than 5.0
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype`, `Btype` and `Ctype` and the algorithm type, `algo` is not supported
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgemm

2.8.13. cublasCsyrkEx()

cublasStatus_t cublasCsyrkEx(cublasHandle_t handle,
                             cublasFillMode_t uplo, 
                             cublasOperation_t trans,
                             int n, 
                             int k,
                             const float     *alpha,
                             const void      *A, 
                             cudaDataType    Atype,
                             int lda,
                             const float    *beta,
                             cuComplex      *C,
                             cudaDataType   Ctype,
                             int ldc)

This function is an extension of cublasCsyrk where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the symmetric rank- $k$ update

$C = α op (A) op (A)^{T} + β C$

where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix such that $op (A)$ has dimensions $n \times k$. Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \end{cases}$
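
For the non-transposed, lower-mode case the update can be sketched on the host as follows. This is a reference for the arithmetic only: `csyrk_lower_ref`, `cmul` and the `cplx_t` struct (a stand-in for cuComplex) are hypothetical, the scalars are taken as complex values, and no reduced-precision input is modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx_t; /* stand-in for cuComplex */

static cplx_t cmul(cplx_t x, cplx_t y)
{
    cplx_t r = { x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re };
    return r;
}

/* Hypothetical host reference (not part of cuBLAS):
 * C := alpha * A * A^T + beta * C for an n x k column-major A,
 * writing only the entries C(i,j) with i >= j (lower mode). */
void csyrk_lower_ref(int n, int k, cplx_t alpha,
                     const cplx_t *A, int lda,
                     cplx_t beta, cplx_t *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    assert(lda >= (n > 1 ? n : 1) && ldc >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            cplx_t acc = { 0.0f, 0.0f };
            for (int p = 0; p < k; ++p) {
                /* plain transpose: A(i,p) * A(j,p), no conjugation */
                cplx_t t = cmul(A[i + (size_t)p * lda],
                                A[j + (size_t)p * lda]);
                acc.re += t.re;
                acc.im += t.im;
            }
            cplx_t *c = &C[i + (size_t)j * ldc];
            cplx_t s = cmul(alpha, acc);
            cplx_t b = cmul(beta, *c);
            c->re = s.re + b.re;
            c->im = s.im + b.im;
        }
    }
}
```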

Note: This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store matrix A.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The matrix type combinations supported by cublasCsyrkEx are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype` and `Ctype` is not supported
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 5.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

2.8.14. cublasCsyrk3mEx()

cublasStatus_t cublasCsyrk3mEx(cublasHandle_t handle,
                               cublasFillMode_t uplo, 
                               cublasOperation_t trans,
                               int n, 
                               int k,
                               const float     *alpha,
                               const void      *A, 
                               cudaDataType    Atype,
                               int lda,
                               const float    *beta,
                               cuComplex      *C,
                               cudaDataType   Ctype,
                               int ldc)

This function is an extension of cublasCsyrk where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can increase performance by up to 25%.
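
This section does not describe how the routine applies the reduction internally. As background only, the scalar identity usually attributed to Gauss forms a complex product with three real multiplications instead of four, which is consistent with the quoted figure of up to 25%. The sketch below shows that identity; `gauss_mul` and the `cplx_t` struct (a stand-in for cuComplex) are hypothetical names.

```c
#include <assert.h>

typedef struct { float re, im; } cplx_t; /* stand-in for cuComplex */

/* Hypothetical illustration (not part of cuBLAS): computes x * y for
 * x = a + bi and y = c + di with three real multiplications.
 *   k1 = c*(a + b), k2 = a*(d - c), k3 = b*(c + d)
 *   real part = k1 - k3 = ac - bd
 *   imag part = k1 + k2 = ad + bc */
cplx_t gauss_mul(cplx_t x, cplx_t y)
{
    float k1 = y.re * (x.re + x.im);
    float k2 = x.re * (y.im - y.re);
    float k3 = x.im * (y.re + y.im);
    cplx_t r = { k1 - k3, k1 + k2 };
    return r;
}
```

The saving in multiplications is paid for with extra additions, and the rounding behaviour can differ slightly from the four-multiplication form.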

This function performs the symmetric rank- $k$ update

$C = α op (A) op (A)^{T} + β C$

where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix such that $op (A)$ has dimensions $n \times k$. Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \end{cases}$

Note: This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store matrix A.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The matrix type combinations supported by cublasCsyrk3mEx are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype` and `Ctype` is not supported
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 5.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

2.8.15. cublasCherkEx()

cublasStatus_t cublasCherkEx(cublasHandle_t handle,
                           cublasFillMode_t uplo, 
                           cublasOperation_t trans,
                           int n, 
                           int k,
                           const float     *alpha,
                           const void      *A, 
                           cudaDataType    Atype,
                           int lda,
                           const float    *beta,
                           cuComplex      *C,
                           cudaDataType   Ctype,
                           int ldc)

This function is an extension of cublasCherk where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the Hermitian rank- $k$ update

$C = α op (A) op (A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix such that $op (A)$ has dimensions $n \times k$. Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$
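
For the non-transposed, lower-mode case the Hermitian update can be sketched on the host as follows. The scalars are real, matching the `const float *` alpha and beta of the prototype, and the imaginary part of each diagonal entry is forced to zero. `cherk_lower_ref` and the `cplx_t` struct (a stand-in for cuComplex) are hypothetical, and no reduced-precision input is modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx_t; /* stand-in for cuComplex */

/* Hypothetical host reference (not part of cuBLAS):
 * C := alpha * A * A^H + beta * C for an n x k column-major A,
 * writing only the entries C(i,j) with i >= j (lower mode). */
void cherk_lower_ref(int n, int k, float alpha,
                     const cplx_t *A, int lda,
                     float beta, cplx_t *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    assert(lda >= (n > 1 ? n : 1) && ldc >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            float re = 0.0f, im = 0.0f;
            for (int p = 0; p < k; ++p) {
                cplx_t x = A[i + (size_t)p * lda];
                cplx_t y = A[j + (size_t)p * lda];
                /* accumulate x * conj(y) */
                re += x.re * y.re + x.im * y.im;
                im += x.im * y.re - x.re * y.im;
            }
            cplx_t *c = &C[i + (size_t)j * ldc];
            c->re = alpha * re + beta * c->re;
            /* diagonal entries of a Hermitian matrix are real */
            c->im = (i == j) ? 0.0f : alpha * im + beta * c->im;
        }
    }
}
```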

Note: This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta		input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The matrix type combinations supported by cublasCherkEx are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype` and `Ctype` is not supported
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 5.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk

2.8.16. cublasCherk3mEx()

cublasStatus_t cublasCherk3mEx(cublasHandle_t handle,
                           cublasFillMode_t uplo, 
                           cublasOperation_t trans,
                           int n, 
                           int k,
                           const float     *alpha,
                           const void      *A, 
                           cudaDataType    Atype,
                           int lda,
                           const float    *beta,
                           cuComplex      *C,
                           cudaDataType   Ctype,
                           int ldc)

This function is an extension of cublasCherk where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can lead to a performance increase of up to 25%.
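
The scalar identity behind Gauss's complexity reduction can be sketched as follows: a complex product is formed with three real multiplications instead of four. How the library applies this idea to whole matrices is not described here, so this is only an illustration of the identity; `cplx` and `cmul_3m` are hypothetical names introduced for the example.

```c
/* Host-only stand-in with the same two-float layout as cuComplex (assumption). */
typedef struct { float x, y; } cplx;

/* (a + bi)(c + di) with three real multiplications:
 *   t1 = c(a + b), t2 = a(d - c), t3 = b(c + d)
 *   real = t1 - t3 = ac - bd,  imag = t1 + t2 = ad + bc */
cplx cmul_3m(cplx p, cplx q)
{
    float t1 = q.x * (p.x + p.y);
    float t2 = p.x * (q.y - q.x);
    float t3 = p.y * (q.x + q.y);
    cplx r = { t1 - t3, t1 + t2 };
    return r;
}
```

Replacing four multiplications by three removes a quarter of the multiplications, which is consistent with the "up to 25%" figure quoted above; the extra additions change the rounding behaviour compared with the four-multiplication form.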

This function performs the Hermitian rank- $k$ update

$C = α op (A) op (A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $op (A)$ is a matrix with dimensions $n \times k$ . Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

Note: This routine is only supported on GPUs with compute capability 5.0 or higher.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta		input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The matrix type combinations supported by cublasCherk3mEx are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype` and `Ctype` is not supported
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 5.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk

2.8.17. cublasNrm2Ex()

cublasStatus_t  cublasNrm2Ex( cublasHandle_t handle, 
                              int n, 
                              const void *x, 
                              cudaDataType xType,
                              int incx, 
                              void *result,
                              cudaDataType resultType,
                              cudaDataType executionType)

This function is an API generalization of the routine cublas<t>nrm2 where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of elements in the vector `x`.
x	device	input	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
result	host or device	output	the resulting norm, which is `0.0` if `n,incx<=0`.
resultType		input	enumerant specifying the datatype of the `result`.
executionType		input	enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported by cublasNrm2Ex are listed below:

x	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	the reduction buffer could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType`, `resultType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

snrm2, dnrm2, scnrm2, dznrm2

2.8.18. cublasAxpyEx()

cublasStatus_t cublasAxpyEx (cublasHandle_t handle,
                             int n,
                             const void *alpha,
                             cudaDataType alphaType,
                             const void *x,
                             cudaDataType xType,
                             int incx,
                             void *y,
                             cudaDataType yType,
                             int incy,
                             cudaDataType executionType);

This function is an API generalization of the routine cublas<t>axpy where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
alpha	host or device	input	<type> scalar used for multiplication.
alphaType		input	enumerant specifying the datatype of scalar `alpha`.
n		input	number of elements in the vector `x` and `y`.
x	device	input	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
y	device	in/out	<type> vector with `n` elements.
yType		input	enumerant specifying the datatype of vector `y`.
incy		input	stride between consecutive elements of `y`.
executionType		input	enumerant specifying the datatype in which the computation is executed.
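
The role of the strides and of a separate execution type can be illustrated with a small host-side sketch. It is not the library's implementation: the name `axpy_ref` is hypothetical, positive strides are assumed, and the float storage with double arithmetic pairing is only a stand-in chosen because standard C has no half-precision type (it is not one of the combinations the routine supports).

```c
/* y[i*incy] = alpha * x[i*incx] + y[i*incy] for i in [0, n).
 * Data is stored as float; the arithmetic is carried out in double to
 * mimic an execution type that is wider than the storage type. */
void axpy_ref(int n, float alpha, const float *x, int incx, float *y, int incy)
{
    if (n <= 0) return;
    for (int i = 0; i < n; ++i) {
        double v = (double)alpha * x[(long)i * incx] + y[(long)i * incy];
        y[(long)i * incy] = (float)v;   /* round back to the storage type */
    }
}
```

With `incy = 2`, only every second element of `y` is read and updated; the elements in between are left untouched.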

The datatype combinations currently supported by cublasAxpyEx are listed below:

x	y	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType`,`yType`, and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

saxpy, daxpy, caxpy, zaxpy

2.8.19. cublasDotEx()

cublasStatus_t cublasDotEx (cublasHandle_t handle,
                            int n, 
                            const void *x,
                            cudaDataType xType, 
                            int incx, 
                            const void *y, 
                            cudaDataType yType,
                            int incy,
                            void *result,
                            cudaDataType resultType,
                            cudaDataType executionType);
                            
cublasStatus_t cublasDotcEx (cublasHandle_t handle,
                             int n, 
                             const void *x,
                             cudaDataType xType, 
                             int incx, 
                             const void *y, 
                             cudaDataType yType,
                             int incy,
                             void *result,
                             cudaDataType resultType,
                             cudaDataType executionType);

These functions are an API generalization of the routines cublas<t>dot and cublas<t>dotc where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of elements in the vectors `x` and `y`.
x	device	input	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
y	device	input	<type> vector with `n` elements.
yType		input	enumerant specifying the datatype of vector `y`.
incy		input	stride between consecutive elements of `y`.
result	host or device	output	the resulting dot product, which is `0.0` if `n<=0`.
resultType		input	enumerant specifying the datatype of the `result`.
executionType		input	enumerant specifying the datatype in which the computation is executed.
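
The difference between the two variants can be sketched on the host as follows, assuming the usual BLAS convention that the `c` variant conjugates the elements of `x`. The names `cplx` and `dot_ref` and the flag `conj_x` are hypothetical and introduced only for this example; positive strides are assumed.

```c
/* Host-only stand-in with the same two-float layout as cuComplex (assumption). */
typedef struct { float x, y; } cplx;

/* conj_x == 0: sum of x[i] * y[i]        (dot, unconjugated)
 * conj_x != 0: sum of conj(x[i]) * y[i]  (dotc)
 * Returns 0 when n <= 0. */
cplx dot_ref(int n, const cplx *x, int incx, const cplx *y, int incy, int conj_x)
{
    cplx r = {0.0f, 0.0f};
    if (n <= 0) return r;
    for (int i = 0; i < n; ++i) {
        cplx a = x[(long)i * incx];
        cplx b = y[(long)i * incy];
        if (conj_x) a.y = -a.y;
        r.x += a.x * b.x - a.y * b.y;
        r.y += a.x * b.y + a.y * b.x;
    }
    return r;
}
```

For real data the two variants coincide; for the single-element vectors x = (i) and y = (i), the unconjugated product is -1 while the conjugated one is 1.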

The datatype combinations currently supported by cublasDotEx and cublasDotcEx are listed below:

x	y	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by these functions and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	the reduction buffer could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType`,`yType`, `resultType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sdot, ddot, cdotu, cdotc, zdotu, zdotc

2.8.20. cublasScalEx()

cublasStatus_t  cublasScalEx(cublasHandle_t handle, 
                             int n, 
                             const void *alpha,
                             cudaDataType alphaType,
                             void *x, 
                             cudaDataType xType,
                             int incx,
                             cudaDataType executionType);

This function is an API generalization of the routine cublas<t>scal where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
alpha	host or device	input	<type> scalar used for multiplication.
alphaType		input	enumerant specifying the datatype of scalar `alpha`.
n		input	number of elements in the vector `x`.
x	device	in/out	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
executionType		input	enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported by cublasScalEx are listed below:

x	execution
`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sscal, dscal, csscal, cscal, zdscal, zscal

Using the CUBLASXT API

General description

The cublasXt API of cuBLAS exposes a multi-GPU capable Host interface: when using this API, the application only needs to allocate the required matrices in the Host memory space. There is no restriction on the sizes of the matrices as long as they fit into the Host memory. The cublasXt API takes care of allocating the memory across the designated GPUs, dispatches the workload between them, and finally retrieves the results back to the Host. The cublasXt API supports only the compute-intensive BLAS3 routines (e.g., matrix-matrix operations), where the PCI transfers back and forth from the GPU can be amortized. The cublasXt API has its own header file, cublasXt.h.

Starting with release 8.0, cublasXt API allows any of the matrices to be located on a GPU device.

Note: The cublasXt API is only supported on 64-bit platforms.

Tiling design approach

To share the workload between multiple GPUs, the cublasXt API uses a tiling strategy: every matrix is divided into square tiles of user-controllable dimension BlockDim x BlockDim. The resulting matrix tiling defines the static scheduling policy: each resulting tile is assigned to a GPU in a round-robin fashion. One CPU thread is created per GPU and is responsible for performing the memory transfers and cuBLAS operations needed to compute all the tiles assigned to it. From a performance point of view, due to this static scheduling strategy, it is better if the compute capabilities and PCI bandwidth are the same for every GPU. The figure below illustrates the distribution of tiles between 3 GPUs. To compute the first tile G0 of C, CPU thread 0, which is responsible for GPU0, has to load 3 tiles from the first row of A and tiles from the first column of B in a pipelined fashion, in order to overlap memory transfers and computations, and sum the results into the first tile G0 of C before moving on to the next tile G0.

Figure 1. Example of cublasXt<t>gemm() tiling for 3 GPUs

When the dimensions of C are not an exact multiple of the tile dimension, some tiles are partially filled on the right border and/or the bottom border. The current implementation does not pad the incomplete tiles but simply keeps track of them by performing correspondingly reduced cuBLAS operations: this way, no extra computation is done. However, it can still lead to some load imbalance when the GPUs do not all have the same number of incomplete tiles to work on.
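
The tiling arithmetic described above can be sketched in a few lines of host code. This is a minimal illustration, not the library's scheduler: the function names are hypothetical, and the column-major linear ordering used to hand out tiles in round-robin fashion is an assumption, since the exact traversal order is not stated here.

```c
/* Number of tiles needed to cover `dim` elements with tiles of size blockDim. */
int tile_count(int dim, int blockDim)
{
    return (dim + blockDim - 1) / blockDim;
}

/* Extent of tile `t` along one dimension; the last tile may be partially
 * filled when dim is not a multiple of blockDim. */
int tile_extent(int dim, int blockDim, int t)
{
    int rest = dim - t * blockDim;
    return rest < blockDim ? rest : blockDim;
}

/* Round-robin owner of tile (ti, tj) of C among nbDevices GPUs.
 * Assumption: tiles are numbered column by column. */
int tile_owner(int ti, int tj, int tilesPerCol, int nbDevices)
{
    return (ti + tj * tilesPerCol) % nbDevices;
}
```

For a dimension of 5000 and a block dimension of 2048, this gives three tiles along that dimension, the last of which covers only 904 elements; a GPU that owns more of these border tiles does less work, which is the load imbalance mentioned above.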

When one or more matrices are located on some GPU devices, the same tiling approach and workload sharing is applied. The memory transfers are in this case done between devices. However, when the computation of a tile and some data are located on the same GPU device, the memory transfer to/from the local data into tiles is bypassed and the GPU operates directly on the local data. This can lead to a significant performance increase, especially when only one GPU is used for the computation.

The matrices can be located on any GPU device, and do not have to be located on the same GPU device. Furthermore, the matrices can even be located on a GPU device that does not participate in the computation.

Unlike the cuBLAS API, the cublasXt API is a blocking API from the Host point of view, even if all matrices are located on the same device: the results, wherever they are located, are valid when the call returns, and no device synchronization is required.

Hybrid CPU-GPU computation

In the case of very large problems, the cublasXt API offers the possibility to offload some of the computation to the Host CPU. This feature can be set up with the routines cublasXtSetCpuRoutine() and cublasXtSetCpuRatio(). The workload assigned to the CPU is put aside: it is simply a percentage of the resulting matrix taken from the bottom or the right side, whichever dimension is bigger. The GPU tiling is then done on the reduced resulting matrix.

If any of the matrices is located on a GPU device, the feature is ignored and all computation is done only on the GPUs.

This feature should be used with caution because it could interfere with the CPU threads responsible for feeding the GPUs.

Currently, only the routine cublasXt<t>gemm() supports this feature.
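
The way the CPU share is carved out of the result matrix can be sketched as follows. This is only an illustration of the description above: the type `hybrid_split` and the function `split_result` are hypothetical, the ratio is assumed to be a fraction between 0 and 1, and both the truncating rounding and the handling of the square case (treated as "rows") are assumptions.

```c
#include <stddef.h>

typedef struct {
    size_t gpuRows, gpuCols;  /* part of C that is tiled across the GPUs */
    size_t cpuRows, cpuCols;  /* strip of C set aside for the CPU */
} hybrid_split;

/* Split an m x n result matrix between the CPU and the GPUs. */
hybrid_split split_result(size_t m, size_t n, float ratio)
{
    hybrid_split s;
    if (m >= n) {             /* more rows: take a strip from the bottom */
        s.cpuRows = (size_t)(ratio * (float)m);
        s.cpuCols = n;
        s.gpuRows = m - s.cpuRows;
        s.gpuCols = n;
    } else {                  /* more columns: take a strip from the right */
        s.cpuRows = m;
        s.cpuCols = (size_t)(ratio * (float)n);
        s.gpuRows = m;
        s.gpuCols = n - s.cpuCols;
    }
    return s;
}
```

For a 1000 x 400 result and a ratio of 0.25, the CPU would receive the bottom 250 rows and the GPU tiling would be applied to the remaining 750 x 400 block.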

3.1.3. Results reproducibility

Currently, all cublasXt API routines from a given toolkit version generate the same bit-wise results when the following conditions are respected:

all GPUs participating in the computation have the same compute capabilities and the same number of SMs.

the tile size is kept the same between runs.

either the CPU hybrid computation is not used or the CPU BLAS provided is also guaranteed to produce reproducible results.
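
One plausible reason the tile size matters is that floating-point addition is not associative, so changing how partial results are grouped can change the rounded result. The sketch below shows the effect on a plain sum; `blocked_sum` is a hypothetical helper and is not related to how the library accumulates tiles internally.

```c
/* Sum x[0..n) by accumulating each block of `tile` elements separately
 * and then adding the partial sums in order. */
float blocked_sum(const float *x, int n, int tile)
{
    float total = 0.0f;
    for (int start = 0; start < n; start += tile) {
        float partial = 0.0f;
        for (int i = start; i < n && i < start + tile; ++i)
            partial += x[i];
        total += partial;
    }
    return total;
}
```

With the input `{1e20f, 1.0f, -1e20f, 1.0f}`, a block size of 4 yields 1 while a block size of 2 yields 0, because each small term is absorbed by a large one before the large terms cancel.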

cublasXt API Datatypes Reference

cublasXtHandle_t

The cublasXtHandle_t type is a pointer type to an opaque structure holding the cublasXt API context. The cublasXt API context must be initialized using cublasXtCreate() and the returned handle must be passed to all subsequent cublasXt API function calls. The context should be destroyed at the end using cublasXtDestroy().

cublasXtOpType_t

The cublasXtOpType_t type enumerates the four possible types supported by BLAS routines. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_FLOAT`	float or single precision type
`CUBLASXT_DOUBLE`	double precision type
`CUBLASXT_COMPLEX`	single precision complex
`CUBLASXT_DOUBLECOMPLEX`	double precision complex

cublasXtBlasOp_t

The cublasXtBlasOp_t type enumerates the BLAS3 or BLAS-like routines supported by the cublasXt API. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_GEMM`	GEMM routine
`CUBLASXT_SYRK`	SYRK routine
`CUBLASXT_HERK`	HERK routine
`CUBLASXT_SYMM`	SYMM routine
`CUBLASXT_HEMM`	HEMM routine
`CUBLASXT_TRSM`	TRSM routine
`CUBLASXT_SYR2K`	SYR2K routine
`CUBLASXT_HER2K`	HER2K routine
`CUBLASXT_SPMM`	SPMM routine
`CUBLASXT_SYRKX`	SYRKX routine
`CUBLASXT_HERKX`	HERKX routine

cublasXtPinningMemMode_t

This type is used to enable or disable the Pinning Memory mode through the routine cublasXtSetPinningMemMode.

Value	Meaning
`CUBLASXT_PINNING_DISABLED`	the Pinning Memory mode is disabled
`CUBLASXT_PINNING_ENABLED`	the Pinning Memory mode is enabled

cublasXt API Helper Function Reference

cublasXtCreate()

cublasStatus_t
cublasXtCreate(cublasXtHandle_t *handle)

This function initializes the cublasXt API and creates a handle to an opaque structure holding the cublasXt API context. It allocates hardware resources on the host and device and must be called prior to making any other cublasXt API calls.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the initialization succeeded
`CUBLAS_STATUS_ALLOC_FAILED`	the resources could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	cublasXt API is only supported on 64-bit platforms

cublasXtDestroy()

cublasStatus_t
cublasXtDestroy(cublasXtHandle_t handle)

This function releases hardware resources used by the cublasXt API context. The release of GPU resources may be deferred until the application exits. This function is usually the last call with a particular handle to the cublasXt API.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the shut down succeeded
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

cublasXtDeviceSelect()

cublasStatus_t
cublasXtDeviceSelect(cublasXtHandle_t handle, int nbDevices, int deviceId[])

This function allows the user to provide the number of GPU devices and their respective Ids that will participate in the subsequent cublasXt API Math function calls. This function will create a cuBLAS context for every GPU provided in that list. Currently the device configuration is static and cannot be changed between Math function calls. In that regard, this function should be called only once after cublasXtCreate. To be able to run multiple configurations, multiple cublasXt API contexts should be created.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	User call was successful
`CUBLAS_STATUS_INVALID_VALUE`	Access to at least one of the devices could not be done or a cuBLAS context could not be created on at least one of the devices
`CUBLAS_STATUS_ALLOC_FAILED`	Some resources could not be allocated.

cublasXtSetBlockDim()

cublasStatus_t
cublasXtSetBlockDim(cublasXtHandle_t handle, int blockDim)

This function allows the user to set the block dimension used for the tiling of the matrices for the subsequent Math function calls. Matrices are split into square tiles of blockDim x blockDim dimension. This function can be called at any time and will take effect for the following Math function calls. The block dimension should be chosen so as to optimize the math operation and to make sure that the PCI transfers are well overlapped with the computation.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blockDim <= 0

cublasXtGetBlockDim()

cublasStatus_t
cublasXtGetBlockDim(cublasXtHandle_t handle, int *blockDim)

This function allows the user to query the block dimension used for the tiling of the matrices.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

cublasXtSetCpuRoutine()

cublasStatus_t
cublasXtSetCpuRoutine(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, void *blasFunctor)

This function allows the user to provide a CPU implementation of the corresponding BLAS routine. This function can be used with the function cublasXtSetCpuRatio() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

cublasXtSetCpuRatio()

cublasStatus_t
cublasXtSetCpuRatio(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, float ratio)

This function allows the user to define the percentage of workload that should be done on a CPU in the context of a hybrid computation. This function can be used with the function cublasXtSetCpuRoutine() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

cublasXtSetPinningMemMode()

cublasStatus_t
cublasXtSetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t mode)

This function allows the user to enable or disable the Pinning Memory mode. When enabled, the matrices passed in subsequent cublasXt API calls will be pinned/unpinned using the CUDART routines cudaHostRegister and cudaHostUnregister respectively, if the matrices are not already pinned. If a matrix happens to be partially pinned, it will also not be pinned. Pinning the memory improves PCI transfer performance and allows PCI memory transfers to overlap with computation. However, pinning/unpinning the memory takes some time, which might not be amortized. It is advised that users pin the memory themselves using cudaMallocHost or cudaHostRegister and unpin it when the computation sequence is completed. By default, the Pinning Memory mode is disabled.

Note: The Pinning Memory mode should not be enabled when matrices used for different calls to the cublasXt API overlap. cublasXt determines whether a matrix is pinned by checking, with cudaHostGetFlags, whether the first address of that matrix is pinned; it therefore cannot know whether the matrix is already partially pinned. This is especially true in multi-threaded applications, where memory could be partially or totally pinned or unpinned while another thread is accessing that memory.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	the mode value is different from `CUBLASXT_PINNING_DISABLED` and `CUBLASXT_PINNING_ENABLED`

cublasXtGetPinningMemMode()

cublasStatus_t
cublasXtGetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t *mode)

This function allows the user to query the Pinning Memory mode. By default, the Pinning Memory mode is disabled.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

cublasXt API Math Functions Reference

In this chapter we describe the actual Linear Algebra routines that the cublasXt API supports. We will use the abbreviations <type> for type and <t> for the corresponding short type to make the presentation of the implemented functions more concise and clear. Unless otherwise specified, <type> and <t> have the following meanings:

<type>	<t>	Meaning
`float`	‘s’ or ‘S’	real single-precision
`double`	‘d’ or ‘D’	real double-precision
`cuComplex`	‘c’ or ‘C’	complex single-precision
`cuDoubleComplex`	‘z’ or ‘Z’	complex double-precision

cublasXt<t>gemm()

cublasStatus_t cublasXtSgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const float           *alpha,
                           const float           *A, size_t lda,
                           const float           *B, size_t ldb,
                           const float           *beta,
                           float           *C, size_t ldc)
cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const double          *alpha,
                           const double          *A, size_t lda,
                           const double          *B, size_t ldb,
                           const double          *beta,
                           double          *C, size_t ldc)
cublasStatus_t cublasXtCgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the matrix-matrix multiplication

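$C = α op (A) op (B) + β C$

where $α$ and $β$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format with dimensions $op (A)$ $m \times k$ , $op (B)$ $k \times n$ and $C$ $m \times n$ , respectively. Also, for matrix $A$

$op (A) = \begin{cases} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{cases}$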

and $op (B)$ is defined similarly for matrix $B$ .

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.
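
To make the roles of `transa`, `transb` and the leading dimensions concrete, here is a minimal host-side reference for the real single-precision case. It is a sketch for checking results on small inputs, not the library's implementation; `gemm_ref` and `op_at` are hypothetical names, and the transpose flags are plain integers (0 for no transpose, nonzero for transpose) rather than `cublasOperation_t` values.

```c
#include <stddef.h>

/* Element (i, j) of op(M) for a column-major matrix M with leading dimension ld. */
static float op_at(const float *M, size_t ld, int trans, size_t i, size_t j)
{
    return trans ? M[j + i * ld] : M[i + j * ld];
}

/* C = alpha * op(A) * op(B) + beta * C, all matrices column-major.
 * op(A) is m x k, op(B) is k x n and C is m x n. For real data the
 * transpose and the conjugate transpose coincide. */
void gemm_ref(int transa, int transb, size_t m, size_t n, size_t k,
              float alpha, const float *A, size_t lda,
              const float *B, size_t ldb,
              float beta, float *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += op_at(A, lda, transa, i, p) * op_at(B, ldb, transb, p, j);
            /* beta == 0 means C is not read, so it may hold garbage on entry. */
            float old = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
            C[i + j * ldc] = alpha * acc + old;
        }
    }
}
```

Comparing the output of such a reference against the multi-GPU result on small matrices is one simple way to validate a chosen block dimension or device list.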

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgemm, dgemm, cgemm, zgemm

cublasXt<t>hemm()

cublasStatus_t cublasXtChemm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZhemm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian matrix-matrix multiplication

$C = \begin{cases} α A B + β C & \text{if side == CUBLAS\_SIDE\_LEFT} \\ α B A + β C & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $α$ and $β$ are scalars.

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
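
The statement that only one triangle of `A` is referenced can be illustrated with a host-side sketch of the left-side case. It is not the library's implementation: `cplx` (a stand-in with the same two-float layout as `cuComplex`), `herm_at`, `cmul` and `hemm_left_ref` are hypothetical names, and the flag `lower` replaces the `uplo` enumerant.

```c
#include <stddef.h>

/* Host-only stand-in with the same two-float layout as cuComplex (assumption). */
typedef struct { float x, y; } cplx;

static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Element (i, j) of Hermitian A, reading only the stored triangle. The other
 * triangle is inferred as the conjugate of the stored one, and the imaginary
 * part of the diagonal is treated as zero. */
cplx herm_at(const cplx *A, size_t lda, int lower, size_t i, size_t j)
{
    int stored = lower ? (i >= j) : (i <= j);
    cplx v = stored ? A[i + j * lda] : A[j + i * lda];
    if (!stored) v.y = -v.y;
    if (i == j)  v.y = 0.0f;
    return v;
}

/* C = alpha * A * B + beta * C for the left-side case:
 * A is m x m Hermitian, B and C are m x n, all column-major. */
void hemm_left_ref(int lower, size_t m, size_t n, cplx alpha,
                   const cplx *A, size_t lda, const cplx *B, size_t ldb,
                   cplx beta, cplx *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            cplx acc = {0.0f, 0.0f};
            for (size_t p = 0; p < m; ++p) {
                cplx t = cmul(herm_at(A, lda, lower, i, p), B[p + j * ldb]);
                acc.x += t.x;
                acc.y += t.y;
            }
            cplx out = cmul(alpha, acc);
            if (beta.x != 0.0f || beta.y != 0.0f) {  /* C is read only if beta != 0 */
                cplx old = cmul(beta, C[i + j * ldc]);
                out.x += old.x;
                out.y += old.y;
            }
            C[i + j * ldc] = out;
        }
    }
}
```

Because `herm_at` never reads the unstored triangle, that part of the array may hold arbitrary values without affecting the result.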

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

chemm, zhemm

cublasXt<t>symm()

cublasStatus_t cublasXtSsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const float           *alpha,
                           const float           *A, size_t lda,
                           const float           *B, size_t ldb,
                           const float           *beta,
                           float           *C, size_t ldc)
cublasStatus_t cublasXtDsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const double          *alpha,
                           const double          *A, size_t lda,
                           const double          *B, size_t ldb,
                           const double          *beta,
                           double          *C, size_t ldc)
cublasStatus_t cublasXtCsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the symmetric matrix-matrix multiplication

$C = \begin{cases} α A B + β C & \text{if side == CUBLAS_SIDE_LEFT} \\ α B A + β C & \text{if side == CUBLAS_SIDE_RIGHT} \end{cases}$

where $A$ is a symmetric matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $α$ and $β$ are scalars.

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta == 0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
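
The `side` and `uplo` semantics above can be sketched as a plain C host reference. This is only an illustration of the math under the column-major layout described in the table, not the cublasXt implementation; the names `sym_at` and `ref_dsymm` and the integer flags `left` and `lower` are hypothetical stand-ins for `side` and `uplo`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element (i, j) of a symmetric matrix of which only the
   lower (lower != 0) or upper triangle is stored, in column-major order. */
static double sym_at(const double *A, size_t lda, int lower, size_t i, size_t j)
{
    int stored = lower ? (i >= j) : (i <= j);
    return stored ? A[i + j * lda] : A[j + i * lda];
}

/* Reference for C = alpha*A*B + beta*C (left != 0) or C = alpha*B*A + beta*C. */
static void ref_dsymm(int left, int lower, size_t m, size_t n, double alpha,
                      const double *A, size_t lda, const double *B, size_t ldb,
                      double beta, double *C, size_t ldc)
{
    size_t ka = left ? m : n; /* order of the square matrix A */
    assert(lda >= (ka ? ka : 1) && ldb >= (m ? m : 1) && ldc >= (m ? m : 1));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < ka; ++p) {
                acc += left ? sym_at(A, lda, lower, i, p) * B[p + j * ldb]
                            : B[i + p * ldb] * sym_at(A, lda, lower, p, j);
            }
            /* With beta == 0 the old contents of C are never read. */
            C[i + j * ldc] = alpha * acc + (beta == 0.0 ? 0.0 : beta * C[i + j * ldc]);
        }
    }
}
```

Because `sym_at` mirrors every access into the stored triangle, the other triangle of `A` may hold arbitrary values without affecting the result.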

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssymm, dsymm, csymm, zsymm

cublasXt<t>syrk()

cublasStatus_t cublasXtSsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const float           *alpha,
                           const float           *A, size_t lda,
                           const float           *beta,
                           float           *C, size_t ldc)
cublasStatus_t cublasXtDsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const double          *alpha,
                           const double          *A, size_t lda,
                           const double          *beta,
                           double          *C, size_t ldc)
cublasStatus_t cublasXtCsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the symmetric rank- $k$ update

$C = α op (A) op (A)^{T} + β C$

where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $op(A)$ is an $n \times k$ matrix. Also, for matrix $A$

$op(A) = \begin{cases} A & \text{if trans == CUBLAS_OP_N} \\ A^{T} & \text{if trans == CUBLAS_OP_T} \end{cases}$

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
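
As a rough illustration of how `uplo`, `trans`, `lda` and `ldc` interact, here is a minimal plain C host reference for the rank-k update. It is a sketch of the formula only, not how cublasXt computes it; `ref_dsyrk` and the integer flags `lower` and `transpose` are hypothetical names standing in for `uplo` and `trans`.

```c
#include <assert.h>
#include <stddef.h>

/* Reference for C = alpha*op(A)*op(A)^T + beta*C on column-major data.
   transpose == 0: A is n x k, op(A) = A.  transpose != 0: A is k x n, op(A) = A^T.
   Only the triangle of C selected by `lower` is read or written. */
static void ref_dsyrk(int lower, int transpose, size_t n, size_t k, double alpha,
                      const double *A, size_t lda, double beta, double *C, size_t ldc)
{
    size_t rows = transpose ? k : n; /* row count of A as stored */
    assert(lda >= (rows ? rows : 1) && ldc >= (n ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i0 = lower ? j : 0;
        size_t i1 = lower ? n : j + 1;
        for (size_t i = i0; i < i1; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                double aip = transpose ? A[p + i * lda] : A[i + p * lda];
                double ajp = transpose ? A[p + j * lda] : A[j + p * lda];
                acc += aip * ajp;
            }
            /* With beta == 0 the old contents of C are never read. */
            C[i + j * ldc] = alpha * acc + (beta == 0.0 ? 0.0 : beta * C[i + j * ldc]);
        }
    }
}
```

Elements of `C` outside the selected triangle keep whatever value they had on entry, which matches the table's statement that the other part is not referenced.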

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

cublasXt<t>syr2k()

cublasStatus_t cublasXtSsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const float           *alpha,
                            const float           *A, size_t lda,
                            const float           *B, size_t ldb,
                            const float           *beta,
                            float           *C, size_t ldc)
cublasStatus_t cublasXtDsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const double          *alpha,
                            const double          *A, size_t lda,
                            const double          *B, size_t ldb,
                            const double          *beta,
                            double          *C, size_t ldc)
cublasStatus_t cublasXtCsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs the symmetric rank- $2 k$ update

$C = α (op (A) op (B)^{T} + op (B) op (A)^{T}) + β C$

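where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $op(A)$ and $op(B)$ are $n \times k$ matrices. Also, for matrices $A$ and $B$

$op(A) \text{ and } op(B) = \begin{cases} A \text{ and } B & \text{if trans == CUBLAS_OP_N} \\ A^{T} \text{ and } B^{T} & \text{if trans == CUBLAS_OP_T} \end{cases}$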

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0`, then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
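
The rank-2k formula can be checked against a small plain C host reference such as the sketch below. It only illustrates the arithmetic and the storage conventions from the table; `op_at`, `ref_dsyr2k` and the flags `lower` and `transpose` are hypothetical names, not part of cublasXt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element (i, p) of op(M) for a column-major array M. */
static double op_at(const double *M, size_t ld, int transpose, size_t i, size_t p)
{
    return transpose ? M[p + i * ld] : M[i + p * ld];
}

/* Reference for C = alpha*(op(A)*op(B)^T + op(B)*op(A)^T) + beta*C,
   touching only the triangle of C selected by `lower`. */
static void ref_dsyr2k(int lower, int transpose, size_t n, size_t k, double alpha,
                       const double *A, size_t lda, const double *B, size_t ldb,
                       double beta, double *C, size_t ldc)
{
    size_t rows = transpose ? k : n; /* row count of A and B as stored */
    assert(lda >= (rows ? rows : 1) && ldb >= (rows ? rows : 1));
    assert(ldc >= (n ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i0 = lower ? j : 0;
        size_t i1 = lower ? n : j + 1;
        for (size_t i = i0; i < i1; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += op_at(A, lda, transpose, i, p) * op_at(B, ldb, transpose, j, p)
                     + op_at(B, ldb, transpose, i, p) * op_at(A, lda, transpose, j, p);
            }
            /* With beta == 0 the old contents of C are never read. */
            C[i + j * ldc] = alpha * acc + (beta == 0.0 ? 0.0 : beta * C[i + j * ldc]);
        }
    }
}
```

The two products are transposes of each other, so their sum is symmetric for any `A` and `B`; that is why storing a single triangle of `C` loses no information.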

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyr2k, dsyr2k, csyr2k, zsyr2k

cublasXt<t>syrkx()


cublasStatus_t cublasXtSsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const float           *alpha,
                            const float           *A, size_t lda,
                            const float           *B, size_t ldb,
                            const float           *beta,
                            float           *C, size_t ldc)
cublasStatus_t cublasXtDsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const double          *alpha,
                            const double          *A, size_t lda,
                            const double          *B, size_t ldb,
                            const double          *beta,
                            double          *C, size_t ldc)
cublasStatus_t cublasXtCsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs a variation of the symmetric rank- $k$ update

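$C = α op(A) op(B)^{T} + β C$

where $α$ and $β$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $op(A)$ and $op(B)$ are $n \times k$ matrices. Also, for matrices $A$ and $B$

$op(A) \text{ and } op(B) = \begin{cases} A \text{ and } B & \text{if trans == CUBLAS_OP_N} \\ A^{T} \text{ and } B^{T} & \text{if trans == CUBLAS_OP_T} \end{cases}$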

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0`, then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
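
Unlike the rank-2k update, the product $op(A) op(B)^{T}$ is not symmetric for arbitrary inputs, so storing one triangle of `C` only makes sense when the caller knows the product is symmetric. One such case is `B` equal to `A` with its columns scaled by a diagonal matrix. The plain C host sketch below illustrates that case; `scale_columns`, `ref_dsyrkx` and the flag `lower` are hypothetical names, and for brevity the sketch covers only the non-transposed case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: B = A * diag(d), i.e. column p of A scaled by d[p]. */
static void scale_columns(size_t n, size_t k, const double *A, size_t lda,
                          const double *d, double *B, size_t ldb)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < n; ++i)
            B[i + p * ldb] = A[i + p * lda] * d[p];
}

/* Reference for C = alpha*A*B^T + beta*C (trans == CUBLAS_OP_N case only),
   touching only the triangle of C selected by `lower`. */
static void ref_dsyrkx(int lower, size_t n, size_t k, double alpha,
                       const double *A, size_t lda, const double *B, size_t ldb,
                       double beta, double *C, size_t ldc)
{
    assert(lda >= (n ? n : 1) && ldb >= (n ? n : 1) && ldc >= (n ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i0 = lower ? j : 0;
        size_t i1 = lower ? n : j + 1;
        for (size_t i = i0; i < i1; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[j + p * ldb];
            /* With beta == 0 the old contents of C are never read. */
            C[i + j * ldc] = alpha * acc + (beta == 0.0 ? 0.0 : beta * C[i + j * ldc]);
        }
    }
}
```

With `B` built by `scale_columns`, the lower and upper calls produce mirror-image triangles, since $A D A^{T}$ is symmetric for any diagonal $D$.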

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

cublasXt<t>herk()

cublasStatus_t cublasXtCherk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const float  *alpha,
                           const cuComplex       *A, size_t lda,
                           const float  *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZherk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           size_t n, size_t k,
                           const double *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const double *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian rank- $k$ update

$C = α op (A) op (A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $op(A)$ is an $n \times k$ matrix. Also, for matrix $A$

$op(A) = \begin{cases} A & \text{if trans == CUBLAS_OP_N} \\ A^{H} & \text{if trans == CUBLAS_OP_C} \end{cases}$

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host	input	real scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
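
The Hermitian case differs from `syrk` in three ways: `alpha` and `beta` are real, the second factor is conjugated, and the diagonal of `C` is kept real. A minimal plain C host sketch of those rules follows. The struct `zcomplex` is a hypothetical stand-in that mimics the `x` (real) and `y` (imaginary) fields of `cuDoubleComplex`; `op_at`, `ref_zherk` and the flags `lower` and `conj_transpose` are likewise illustrative names, not cublasXt API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cuDoubleComplex: x is the real part, y the imaginary part. */
typedef struct { double x, y; } zcomplex;

/* Hypothetical helper: element (i, p) of op(A), conjugating when op(A) = A^H. */
static zcomplex op_at(const zcomplex *A, size_t lda, int conj_transpose,
                      size_t i, size_t p)
{
    zcomplex v;
    if (!conj_transpose)
        return A[i + p * lda];
    v = A[p + i * lda];
    v.y = -v.y;
    return v;
}

/* Reference for C = alpha*op(A)*op(A)^H + beta*C with real alpha and beta,
   touching only the triangle of C selected by `lower`. */
static void ref_zherk(int lower, int conj_transpose, size_t n, size_t k, double alpha,
                      const zcomplex *A, size_t lda, double beta,
                      zcomplex *C, size_t ldc)
{
    size_t rows = conj_transpose ? k : n; /* row count of A as stored */
    assert(lda >= (rows ? rows : 1) && ldc >= (n ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i0 = lower ? j : 0;
        size_t i1 = lower ? n : j + 1;
        for (size_t i = i0; i < i1; ++i) {
            double re = 0.0, im = 0.0;
            zcomplex *c = &C[i + j * ldc];
            for (size_t p = 0; p < k; ++p) {
                zcomplex a = op_at(A, lda, conj_transpose, i, p);
                zcomplex b = op_at(A, lda, conj_transpose, j, p);
                re += a.x * b.x + a.y * b.y; /* real part of a * conj(b) */
                im += a.y * b.x - a.x * b.y; /* imaginary part of a * conj(b) */
            }
            double old_re = beta == 0.0 ? 0.0 : beta * c->x;
            double old_im = beta == 0.0 ? 0.0 : beta * c->y;
            c->x = alpha * re + old_re;
            /* Diagonal: imaginary part is ignored on input and set to zero. */
            c->y = (i == j) ? 0.0 : alpha * im + old_im;
        }
    }
}
```

Forcing the diagonal's imaginary part to zero reflects the table's note that those values are "assumed and set to zero", so stray imaginary values on the diagonal of the input do not leak into the result.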

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

cublasXt<t>her2k()

cublasStatus_t cublasXtCher2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const float  *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZher2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const double *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian rank- $2 k$ update

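$C = α op(A) op(B)^{H} + \bar{α} op(B) op(A)^{H} + β C$

where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $op(A)$ and $op(B)$ are $n \times k$ matrices. Also, for matrices $A$ and $B$

$op(A) \text{ and } op(B) = \begin{cases} A \text{ and } B & \text{if trans == CUBLAS_OP_N} \\ A^{H} \text{ and } B^{H} & \text{if trans == CUBLAS_OP_C} \end{cases}$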

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
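
To make the role of the complex `alpha`, its conjugate, and the real `beta` concrete, the following plain C host sketch evaluates the rank-2k formula directly. It illustrates the arithmetic only and is not the cublasXt code path. `zcomplex` is a hypothetical stand-in mimicking the `x`/`y` fields of `cuDoubleComplex`, and `zmul`, `zconj`, `op_at`, `ref_zher2k` and the flags `lower` and `ct` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cuDoubleComplex: x is the real part, y the imaginary part. */
typedef struct { double x, y; } zcomplex;

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

static zcomplex zconj(zcomplex a)
{
    zcomplex r = { a.x, -a.y };
    return r;
}

/* Hypothetical helper: element (i, p) of op(M); ct != 0 selects op(M) = M^H. */
static zcomplex op_at(const zcomplex *M, size_t ld, int ct, size_t i, size_t p)
{
    return ct ? zconj(M[p + i * ld]) : M[i + p * ld];
}

/* Reference for C = alpha*op(A)*op(B)^H + conj(alpha)*op(B)*op(A)^H + beta*C. */
static void ref_zher2k(int lower, int ct, size_t n, size_t k, zcomplex alpha,
                       const zcomplex *A, size_t lda, const zcomplex *B, size_t ldb,
                       double beta, zcomplex *C, size_t ldc)
{
    size_t rows = ct ? k : n; /* row count of A and B as stored */
    assert(lda >= (rows ? rows : 1) && ldb >= (rows ? rows : 1));
    assert(ldc >= (n ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i0 = lower ? j : 0;
        size_t i1 = lower ? n : j + 1;
        for (size_t i = i0; i < i1; ++i) {
            zcomplex s1 = { 0.0, 0.0 }, s2 = { 0.0, 0.0 };
            zcomplex *c = &C[i + j * ldc];
            for (size_t p = 0; p < k; ++p) {
                zcomplex t1 = zmul(op_at(A, lda, ct, i, p), zconj(op_at(B, ldb, ct, j, p)));
                zcomplex t2 = zmul(op_at(B, ldb, ct, i, p), zconj(op_at(A, lda, ct, j, p)));
                s1.x += t1.x; s1.y += t1.y;
                s2.x += t2.x; s2.y += t2.y;
            }
            zcomplex u = zmul(alpha, s1);
            zcomplex v = zmul(zconj(alpha), s2);
            double old_re = beta == 0.0 ? 0.0 : beta * c->x;
            double old_im = beta == 0.0 ? 0.0 : beta * c->y;
            c->x = u.x + v.x + old_re;
            /* Diagonal: imaginary part is ignored on input and set to zero. */
            c->y = (i == j) ? 0.0 : u.y + v.y + old_im;
        }
    }
}
```

The second term is the conjugate transpose of the first, which is what makes the sum Hermitian even when `alpha` has a nonzero imaginary part; a real `beta` preserves that property when the old `C` is added.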

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cher2k, zher2k

cublasXt<t>herkx()

cublasStatus_t cublasXtCherkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const float  *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZherkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const double *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs a variation of the Hermitian rank- $k$ update

$C = α op (A) op (B)^{H} + β C$

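where $α$ and $β$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $op(A)$ and $op(B)$ are $n \times k$ matrices. Also, for matrices $A$ and $B$

$op(A) \text{ and } op(B) = \begin{cases} A \text{ and } B & \text{if trans == CUBLAS_OP_N} \\ A^{H} \text{ and } B^{H} & \text{if trans == CUBLAS_OP_C} \end{cases}$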

Param.	Memory	In/out	Meaning
handle		input	handle to the cublasXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_ARCH_MISMATCH`	the device does not support double-precision
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk