1. Introduction
The cuSolver library is a high-level package based on the cuBLAS and cuSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries.
The intent of cuSolver is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition cuSolver provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
The first part of cuSolver is called cuSolverDN, and deals with dense matrix factorization and solve routines such as LU, QR, SVD and LDLT, as well as useful utilities such as matrix and vector permutations.
Next, cuSolverSP provides a new set of sparse routines based on a sparse QR factorization. Not all matrices have a good sparsity pattern for parallelism in factorization, so the cuSolverSP library also provides a CPU path to handle those sequential-like matrices. For those matrices with abundant parallelism, the GPU path will deliver higher performance. The library is designed to be called from C and C++.
The final part is cuSolverRF, a sparse re-factorization package that can provide very good performance when solving a sequence of matrices where only the coefficients are changed but the sparsity pattern remains the same.
The GPU path of the cuSolver library assumes data is already in the device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().
1.1. cuSolverDN: Dense LAPACK
The cuSolverDN library was designed to solve dense linear systems of the form
$$Ax = b$$
where the coefficient matrix $A \in \mathbb{R}^{n \times n}$, right-hand-side vector $b \in \mathbb{R}^{n}$ and solution vector $x \in \mathbb{R}^{n}$.
The cuSolverDN library provides QR factorization and LU with partial pivoting to handle a general matrix A, which may be non-symmetric. Cholesky factorization is also provided for symmetric/Hermitian matrices. For symmetric indefinite matrices, we provide Bunch-Kaufman (LDL) factorization.
The cuSolverDN library also provides a helpful bidiagonalization routine and singular value decomposition (SVD).
The cuSolverDN library targets computationally-intensive and popular routines in LAPACK, and provides an API compatible with LAPACK. The user can accelerate these time-consuming routines with cuSolverDN and keep others in LAPACK without a major change to existing code.
1.2. cuSolverSP: Sparse LAPACK
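The cuSolverSP library was mainly designed to solve a sparse linear system
$$Ax = b$$
and the least-squares problem
$$x = \operatorname{argmin}_{z} \|Az - b\|$$
where sparse matrix $A \in \mathbb{R}^{m \times n}$, right-hand-side vector $b \in \mathbb{R}^{m}$ and solution vector $x \in \mathbb{R}^{n}$. For a linear system, we require m=n.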
The core algorithm is based on sparse QR factorization. The matrix A is accepted in CSR format. If matrix A is symmetric/Hermitian, the user has to provide the full matrix, i.e., fill in the missing lower or upper part.
If matrix A is symmetric positive definite and the user only needs to solve $Ax = b$, Cholesky factorization can work and the user only needs to provide the lower triangular part of A.
On top of the linear and least-squares solvers, the cuSolverSP library provides a simple eigenvalue solver based on shift-inverse power method, and a function to count the number of eigenvalues contained in a box in the complex plane.
1.3. cuSolverRF: Refactorization
The cuSolverRF library was designed to accelerate the solution of sets of linear systems by fast re-factorization when given new coefficients in the same sparsity pattern
$$A_i x_i = f_i$$
where a sequence of coefficient matrices $A_i \in \mathbb{R}^{n \times n}$, right-hand-sides $f_i \in \mathbb{R}^{n}$ and solutions $x_i \in \mathbb{R}^{n}$ are given for i=1,...,k.
The cuSolverRF library is applicable when the sparsity pattern of the coefficient matrices $A_i$, as well as the reordering to minimize fill-in and the pivoting used during the LU factorization, remain the same across these linear systems. In that case, the first linear system (i=1) requires a full LU factorization, while the subsequent linear systems (i=2,...,k) require only the LU re-factorization. The latter can be performed using the cuSolverRF library.
Notice that because the sparsity pattern of the coefficient matrices, the reordering and the pivoting remain the same, the sparsity pattern of the resulting triangular factors $L_i$ and $U_i$ also remains the same. Therefore, the real difference between the full LU factorization and LU re-factorization is that the required memory is known ahead of time.
1.4. Naming Conventions
The cuSolverDN library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:
cusolverDn<t><operation>
where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively. <operation> can be Cholesky factorization (potrf), LU with partial pivoting (getrf), QR factorization (geqrf) and Bunch-Kaufman factorization (sytrf).
The cuSolverSP library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:
cusolverSp[Host]<t>[<matrix data format>]<operation>[<output matrix data format>]<based on>
where cusolverSp is the GPU path and cusolverSpHost is the corresponding CPU path. <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.
The <matrix data format> is csr, compressed sparse row format.
The <operation> can be ls, lsq, eig, eigs, corresponding to linear solver, least-square solver, eigenvalue solver and number of eigenvalues in a box, respectively.
The <output matrix data format> can be v or m, corresponding to a vector or a matrix.
<based on> describes which algorithm is used. For example, qr (sparse QR factorization) is used in linear solver and least-square solver.
All of the functions have the return type cusolverStatus_t and are explained in more detail in the chapters that follow.
routine | data format | operation | output format | based on |
csrlsvlu | csr | linear solver (ls) | vector (v) | LU (lu) with partial pivoting |
csrlsvqr | csr | linear solver (ls) | vector (v) | QR factorization (qr) |
csrlsvchol | csr | linear solver (ls) | vector (v) | Cholesky factorization (chol) |
csrlsqvqr | csr | least-square solver (lsq) | vector (v) | QR factorization (qr) |
csreigvsi | csr | eigenvalue solver (eig) | vector (v) | shift-inverse |
csreigs | csr | number of eigenvalues in a box (eigs) | | |
csrsymrcm | csr | Symmetric Reverse Cuthill-McKee (symrcm) | | |
The cuSolverRF library routines are available for data type double. Most of the routines follow the naming convention:
cusolverRf<operation>[Host](...)
where the trailing optional Host qualifier indicates the data is accessed on the host versus on the device, which is the default. The <operation> can be Setup, Analyze, Refactor, Solve, ResetValues, AccessBundledFactors and ExtractSplitFactors.
Finally, the return type of the cuSolverRF library routines is cusolverStatus_t.
1.5. Asynchronous Execution
The cuSolver library functions execute asynchronously whenever possible. Developers can always use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSolver library routine has completed.
A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.
1.6. Library Property
The libraryPropertyType data type is an enumeration of library property types. For example, CUDA version X.Y.Z yields MAJOR_VERSION=X, MINOR_VERSION=Y, PATCH_LEVEL=Z.
typedef enum libraryPropertyType_t
{
    MAJOR_VERSION,
    MINOR_VERSION,
    PATCH_LEVEL
} libraryPropertyType;
The following code prints the version of the cuSolver library.
int major = -1, minor = -1, patch = -1;
cusolverGetProperty(MAJOR_VERSION, &major);
cusolverGetProperty(MINOR_VERSION, &minor);
cusolverGetProperty(PATCH_LEVEL, &patch);
printf("CUSOLVER Version (Major,Minor,PatchLevel): %d.%d.%d\n", major, minor, patch);
1.7. Link OpenMP
The cuSolver library uses OpenMP to improve the performance of its CPU path. OpenMP support is enabled only on Linux. On Linux, the user must link an OpenMP library explicitly, either through the compiler option -fopenmp or through a third-party OpenMP library such as libiomp5 from MKL.
Link the OpenMP library with -fopenmp:
nvcc -ccbin g++ -Xcompiler -fopenmp <object files>
or
g++ -fopenmp <object files>
Link the OpenMP library from Intel MKL:
g++ <object files> -L<path to MKL> -liomp5
2. Using the cuSolver API
This chapter describes how to use the cuSolver library API. It is not a reference for the cuSolver API data types and functions; that is provided in subsequent chapters.
2.1. Thread Safety
The library is thread safe and its functions can be called from multiple host threads.
2.2. Scalar Parameters
In the cuSolver API, the scalar parameters can be passed by reference on the host.
2.3. Parallelism with Streams
If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, CUDA streams can be used to overlap these tasks.
The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should create CUDA streams using the function cudaStreamCreate() and set the stream to be used by each individual cuSolver library routine by calling for example cusolverDnSetStream() just before calling the actual cuSolverDN routine. Then, computations performed in separate streams would be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.
3. cuSolver Types Reference
3.1. cuSolverDN Types
The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h. In addition, cuSolverDN uses some familiar types from cuBLAS.
3.1.1. cusolverDnHandle_t
This is a pointer type to an opaque cuSolverDN context, which the user must initialize by calling cusolverDnCreate() prior to calling any other library function. An uninitialized handle will lead to unexpected behavior, including crashes of cuSolverDN. The handle created and returned by cusolverDnCreate() must be passed to every cuSolverDN function.
3.1.2. cublasFillMode_t
The type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower) and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.
Value | Meaning |
CUBLAS_FILL_MODE_LOWER | the lower part of the matrix is filled |
CUBLAS_FILL_MODE_UPPER | the upper part of the matrix is filled |
3.1.3. cublasOperation_t
The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-transpose), ‘T’ or ‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to legacy BLAS implementations.
Value | Meaning |
CUBLAS_OP_N | the non-transpose operation is selected |
CUBLAS_OP_T | the transpose operation is selected |
CUBLAS_OP_C | the conjugate transpose operation is selected |
3.1.4. cusolverEigType_t
The cusolverEigType_t type indicates which type of generalized eigenvalue problem is solved. Its values correspond to Fortran integer 1 (A*x = lambda*B*x), 2 (A*B*x = lambda*x), 3 (B*A*x = lambda*x), used as parameters to legacy LAPACK implementations.
Value | Meaning |
CUSOLVER_EIG_TYPE_1 | A*x = lambda*B*x |
CUSOLVER_EIG_TYPE_2 | A*B*x = lambda*x |
CUSOLVER_EIG_TYPE_3 | B*A*x = lambda*x |
3.1.5. cusolverEigMode_t
The cusolverEigMode_t type indicates whether or not eigenvectors are computed. Its values correspond to Fortran character 'N' (only eigenvalues are computed), 'V' (both eigenvalues and eigenvectors are computed) used as parameters to legacy LAPACK implementations.
Value | Meaning |
CUSOLVER_EIG_MODE_NOVECTOR | only eigenvalues are computed |
CUSOLVER_EIG_MODE_VECTOR | both eigenvalues and eigenvectors are computed |
3.2. cuSolverSP Types
The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h.
3.2.1. cusolverSpHandle_t
This is a pointer type to an opaque cuSolverSP context, which the user must initialize by calling cusolverSpCreate() prior to calling any other library function. An uninitialized handle will lead to unexpected behavior, including crashes of cuSolverSP. The handle created and returned by cusolverSpCreate() must be passed to every cuSolverSP function.
3.2.2. cusparseMatDescr_t
We have chosen to keep the same structure as exists in cuSPARSE to describe the shape and properties of a matrix. This enables calls to either cuSPARSE or cuSolver using the same matrix description.
typedef struct {
    cusparseMatrixType_t MatrixType;
    cusparseFillMode_t FillMode;
    cusparseDiagType_t DiagType;
    cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;
Please read the documentation of the cuSPARSE library to understand each field of cusparseMatDescr_t.
3.2.3. cusolverStatus_t
This is a status type returned by the library functions and it can have the following values.
CUSOLVER_STATUS_SUCCESS |
The operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED |
The cuSolver library was not initialized. This is usually caused by the lack of a prior create call, an error in the CUDA Runtime API called by the cuSolver routine, or an error in the hardware setup. To correct: call the matching create routine (cusolverDnCreate(), cusolverSpCreate() or cusolverRfCreate()) prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. |
CUSOLVER_STATUS_ALLOC_FAILED |
Resource allocation failed inside the cuSolver library. This is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible. |
CUSOLVER_STATUS_INVALID_VALUE |
An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. |
CUSOLVER_STATUS_ARCH_MISMATCH |
The function requires a feature absent from the device architecture; usually caused by the lack of support for atomic operations or double precision. To correct: compile and run the application on a device with compute capability 2.0 or above. |
CUSOLVER_STATUS_EXECUTION_FAILED |
The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. |
CUSOLVER_STATUS_INTERNAL_ERROR |
An internal cuSolver operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED |
The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function. To correct: check that the fields in descrA were set correctly. |
3.3. cuSolverRF Types
cuSolverRF only supports double.
3.3.1. cusolverRfHandle_t
The cusolverRfHandle_t is a pointer to an opaque data structure that contains the cuSolverRF library handle. The user must initialize the handle by calling cusolverRfCreate() prior to any other cuSolverRF library calls. The handle is passed to all other cuSolverRF library calls.
3.3.2. cusolverRfMatrixFormat_t
The cusolverRfMatrixFormat_t is an enum that indicates the input/output matrix format assumed by the cusolverRfSetupDevice(), cusolverRfSetupHost(), cusolverRfResetValues(), cusolverRfExtractBundledFactorsHost() and cusolverRfExtractSplitFactorsHost() routines.
Value | Meaning |
CUSOLVER_MATRIX_FORMAT_CSR | matrix format CSR is assumed. (default) |
CUSOLVER_MATRIX_FORMAT_CSC | matrix format CSC is assumed. |
3.3.3. cusolverRfNumericBoostReport_t
The cusolverRfNumericBoostReport_t is an enum that indicates whether numeric boosting (of the pivot) was used during the cusolverRfRefactor() and cusolverRfSolve() routines. The numeric boosting is disabled by default.
Value | Meaning |
CUSOLVER_NUMERIC_BOOST_NOT_USED | numeric boosting not used. (default) |
CUSOLVER_NUMERIC_BOOST_USED | numeric boosting used. |
3.3.4. cusolverRfResetValuesFastMode_t
The cusolverRfResetValuesFastMode_t is an enum that indicates the mode used for the cusolverRfResetValues() routine. The fast mode requires extra memory and is recommended only if very fast calls to cusolverRfResetValues() are needed.
Value | Meaning |
CUSOLVER_RESET_VALUES_FAST_MODE_OFF | fast mode disabled. (default) |
CUSOLVER_RESET_VALUES_FAST_MODE_ON | fast mode enabled. |
3.3.5. cusolverRfFactorization_t
The cusolverRfFactorization_t is an enum that indicates which (internal) algorithm is used for refactorization in the cusolverRfRefactor() routine.
Value | Meaning |
CUSOLVER_FACTORIZATION_ALG0 | algorithm 0. (default) |
CUSOLVER_FACTORIZATION_ALG1 | algorithm 1. |
CUSOLVER_FACTORIZATION_ALG2 | algorithm 2. Domino-based scheme. |
3.3.6. cusolverRfTriangularSolve_t
The cusolverRfTriangularSolve_t is an enum that indicates which (internal) algorithm is used for triangular solve in the cusolverRfSolve() routine.
Value | Meaning |
CUSOLVER_TRIANGULAR_SOLVE_ALG0 | algorithm 0. |
CUSOLVER_TRIANGULAR_SOLVE_ALG1 | algorithm 1. (default) |
CUSOLVER_TRIANGULAR_SOLVE_ALG2 | algorithm 2. Domino-based scheme. |
CUSOLVER_TRIANGULAR_SOLVE_ALG3 | algorithm 3. Domino-based scheme. |
3.3.7. cusolverRfUnitDiagonal_t
The cusolverRfUnitDiagonal_t is an enum that indicates whether and where the unit diagonal is stored in the input/output triangular factors in the cusolverRfSetupDevice(), cusolverRfSetupHost() and cusolverRfExtractSplitFactorsHost() routines.
Value | Meaning |
CUSOLVER_UNIT_DIAGONAL_STORED_L | unit diagonal is stored in lower triangular factor. (default) |
CUSOLVER_UNIT_DIAGONAL_STORED_U | unit diagonal is stored in upper triangular factor. |
CUSOLVER_UNIT_DIAGONAL_ASSUMED_L | unit diagonal is assumed in lower triangular factor. |
CUSOLVER_UNIT_DIAGONAL_ASSUMED_U | unit diagonal is assumed in upper triangular factor. |
3.3.8. cusolverStatus_t
The cusolverStatus_t is an enum that indicates success or failure of the cuSolverRF library call. It is returned by all the cuSolver library routines, and it uses the same enumerated values as the sparse and dense LAPACK routines.
4. cuSolver Formats Reference
4.1. Index Base Format
The CSR or CSC format requires either zero-based or one-based index for a sparse matrix A. The GLU library supports only zero-based indexing. Otherwise, both one-based and zero-based indexing are supported in cuSolver.
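The difference between the two index bases can be sketched in plain C. The helper below is a hypothetical illustration (shift_index_base is not a cuSolver routine); it assumes the usual convention that the last pointer entry equals the number of non-zeros plus the index base, and it shifts the pointer and index arrays of a CSR or CSC matrix in place.

```c
#include <assert.h>

/* Hypothetical helper (not part of cuSolver): convert the index arrays of a
 * CSR or CSC matrix between zero-based and one-based indexing, in place.
 *   n        number of rows (CSR) or columns (CSC)
 *   ptr      offset array with n+1 entries (csrRowPtr or cscColPtr)
 *   ind      index array with nnz entries (csrColInd or cscRowInd)
 *   old_base index base currently used by ptr and ind (0 or 1)
 *   new_base index base wanted on return (0 or 1)
 * The value array is unaffected by the index base and is not touched. */
static void shift_index_base(int n, int *ptr, int *ind,
                             int old_base, int new_base)
{
    assert(n >= 0);
    assert(old_base == 0 || old_base == 1);
    assert(new_base == 0 || new_base == 1);

    /* Assumed convention: the last offset stores nnz plus the index base. */
    int nnz = ptr[n] - old_base;
    int delta = new_base - old_base;

    for (int k = 0; k < nnz; ++k)
        ind[k] += delta;
    for (int i = 0; i <= n; ++i)
        ptr[i] += delta;
}
```

For a 4x4 matrix with nine non-zeros, calling shift_index_base(4, ptr, ind, 0, 1) turns a zero-based offset array {0, 2, 4, 8, 9} into the one-based {1, 3, 5, 9, 10}.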
4.2. Vector (Dense) Format
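The vectors are assumed to be stored linearly in memory. For example, the vector
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
is represented as
$$\begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}$$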
4.3. Matrix (Dense) Format
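The dense matrices are assumed to be stored in column-major order in memory. The sub-matrix can be accessed using the leading dimension of the original matrix. For example, the m*n (sub-)matrix
$$\begin{pmatrix} a_{1,1} & \ldots & a_{1,n} \\ a_{2,1} & \ldots & a_{2,n} \\ \vdots & & \vdots \\ a_{m,1} & \ldots & a_{m,n} \end{pmatrix}$$
is represented as
$$\begin{pmatrix} a_{1,1} & \ldots & a_{1,n} \\ a_{2,1} & \ldots & a_{2,n} \\ \vdots & \ddots & \vdots \\ a_{m,1} & \ldots & a_{m,n} \\ \vdots & \ddots & \vdots \\ a_{lda,1} & \ldots & a_{lda,n} \end{pmatrix}$$
with its elements arranged linearly in memory as
$$\begin{pmatrix} a_{1,1} & a_{2,1} & \ldots & a_{m,1} & \ldots & a_{lda,1} & \ldots & a_{1,n} & a_{2,n} & \ldots & a_{m,n} & \ldots & a_{lda,n} \end{pmatrix}$$
where lda ≥ m is the leading dimension of A.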
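As a minimal sketch of the column-major addressing described above, the following plain C helpers compute the linear offset of an element and copy a sub-matrix out of a larger parent matrix. The names colmajor_index and copy_submatrix are hypothetical and are not part of the cuSolver API; indices are zero-based here, as is usual in C.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (i, j), zero-based, in a column-major array
 * whose leading dimension is lda (lda >= number of rows). */
static size_t colmajor_index(size_t i, size_t j, size_t lda)
{
    assert(i < lda); /* a row index can never reach the leading dimension */
    return i + j * lda;
}

/* Copy the m-by-n sub-matrix whose top-left corner is (row0, col0) of a
 * column-major parent matrix (leading dimension lda) into a tightly packed
 * column-major buffer (leading dimension m). */
static void copy_submatrix(const double *parent, size_t lda,
                           size_t row0, size_t col0,
                           size_t m, size_t n, double *out)
{
    /* The sub-matrix starts at (row0, col0) and keeps the parent's lda. */
    const double *sub = parent + colmajor_index(row0, col0, lda);

    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            out[colmajor_index(i, j, m)] = sub[colmajor_index(i, j, lda)];
}
```

The point of the sketch is that a sub-matrix needs no copy to be passed to a routine: a pointer to its first element together with the parent's leading dimension describes it fully. The explicit copy is shown only to make the addressing visible.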
4.4. Matrix (CSR) Format
In CSR format the matrix is represented by the following parameters
parameter | type | size | Meaning |
n | (int) | | the number of rows (and columns) in the matrix. |
nnz | (int) | | the number of non-zero elements in the matrix. |
csrRowPtr | (int *) | n+1 | the array of offsets corresponding to the start of each row in the arrays csrColInd and csrVal. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. |
csrColInd | (int *) | nnz | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. |
csrVal | (S|D|C|Z)* | nnz | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. |
Note that in our CSR format sparse matrices are assumed to be stored in row-major order, in other words, the index arrays are first sorted by row indices and then within each row by column indices. Also it is assumed that each pair of row and column indices appears only once.
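For example, the 4x4 matrix
$$A = \begin{pmatrix} 1.0 & 3.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 6.0 & 0.0 \\ 2.0 & 5.0 & 7.0 & 8.0 \\ 0.0 & 0.0 & 0.0 & 9.0 \end{pmatrix}$$
is represented as
$$\text{csrRowPtr} = \begin{pmatrix} 0 & 2 & 4 & 8 & 9 \end{pmatrix}$$
$$\text{csrColInd} = \begin{pmatrix} 0 & 1 & 1 & 2 & 0 & 1 & 2 & 3 & 3 \end{pmatrix}$$
$$\text{csrVal} = \begin{pmatrix} 1.0 & 3.0 & 4.0 & 6.0 & 2.0 & 5.0 & 7.0 & 8.0 & 9.0 \end{pmatrix}$$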
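The CSR layout described in this section can be sketched in plain C. The two helpers below are hypothetical illustrations, not cuSolver routines: dense_to_csr builds zero-based CSR arrays from a dense column-major matrix, and csr_matvec walks the arrays to compute y = A*x, which shows how csrRowPtr delimits each row.

```c
#include <assert.h>

/* Hypothetical helper: build zero-based CSR arrays from a dense n-by-n
 * matrix stored in column-major order with leading dimension lda.
 * csrRowPtr must hold n+1 entries; csrColInd and csrVal must have room
 * for every non-zero. Returns the number of non-zeros (nnz). */
static int dense_to_csr(int n, const double *A, int lda,
                        int *csrRowPtr, int *csrColInd, double *csrVal)
{
    assert(n >= 0 && lda >= n);
    int nnz = 0;
    for (int row = 0; row < n; ++row) {
        csrRowPtr[row] = nnz;               /* start of this row */
        for (int col = 0; col < n; ++col) { /* ascending columns keep rows sorted */
            double v = A[row + col * lda];
            if (v != 0.0) {
                csrColInd[nnz] = col;
                csrVal[nnz] = v;
                ++nnz;
            }
        }
    }
    csrRowPtr[n] = nnz;                     /* extra entry: total nnz */
    return nnz;
}

/* Hypothetical helper: y = A * x for a zero-based CSR matrix. */
static void csr_matvec(int n, const int *csrRowPtr, const int *csrColInd,
                       const double *csrVal, const double *x, double *y)
{
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int k = csrRowPtr[row]; k < csrRowPtr[row + 1]; ++k)
            sum += csrVal[k] * x[csrColInd[k]];
        y[row] = sum;
    }
}
```

For the dense matrix with rows (1 3 0 0), (0 4 6 0), (2 5 7 8) and (0 0 0 9), dense_to_csr returns 9 and fills csrRowPtr with {0, 2, 4, 8, 9} and csrColInd with {0, 1, 1, 2, 0, 1, 2, 3, 3}.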
4.5. Matrix (CSC) Format
In CSC format the matrix is represented by the following parameters
parameter | type | size | Meaning |
n | (int) | | the number of rows (and columns) in the matrix. |
nnz | (int) | | the number of non-zero elements in the matrix. |
cscColPtr | (int *) | n+1 | the array of offsets corresponding to the start of each column in the arrays cscRowInd and cscVal. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. |
cscRowInd | (int *) | nnz | the array of row indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. |
cscVal | (S|D|C|Z)* | nnz | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. |
Note that in our CSC format sparse matrices are assumed to be stored in column-major order, in other words, the index arrays are first sorted by column indices and then within each column by row indices. Also it is assumed that each pair of row and column indices appears only once.
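For example, the 4x4 matrix
$$A = \begin{pmatrix} 1.0 & 3.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 6.0 & 0.0 \\ 2.0 & 5.0 & 7.0 & 8.0 \\ 0.0 & 0.0 & 0.0 & 9.0 \end{pmatrix}$$
is represented as
$$\text{cscColPtr} = \begin{pmatrix} 0 & 2 & 5 & 7 & 9 \end{pmatrix}$$
$$\text{cscRowInd} = \begin{pmatrix} 0 & 2 & 0 & 1 & 2 & 1 & 2 & 2 & 3 \end{pmatrix}$$
$$\text{cscVal} = \begin{pmatrix} 1.0 & 2.0 & 3.0 & 4.0 & 5.0 & 6.0 & 7.0 & 8.0 & 9.0 \end{pmatrix}$$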
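The relation between the CSR and CSC layouts can be illustrated with a small host-side conversion. The function csr_to_csc below is a hypothetical sketch in plain C, not a cuSolver routine, and assumes zero-based indexing. It counts the entries of each column, turns the counts into offsets with a prefix sum, and then scatters the entries row by row, so the row indices inside each column come out sorted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: convert a zero-based n-by-n CSR matrix into
 * zero-based CSC. cscColPtr must hold n+1 entries; cscRowInd and cscVal
 * must hold nnz entries. Returns 0 on success, -1 if the temporary
 * buffer cannot be allocated. */
static int csr_to_csc(int n, int nnz,
                      const int *csrRowPtr, const int *csrColInd,
                      const double *csrVal,
                      int *cscColPtr, int *cscRowInd, double *cscVal)
{
    assert(n >= 0 && csrRowPtr[0] == 0 && csrRowPtr[n] == nnz);

    int *next = malloc(((size_t)n + 1) * sizeof *next);
    if (next == NULL)
        return -1;

    /* 1. Count the entries of each column into cscColPtr[col + 1]. */
    for (int j = 0; j <= n; ++j)
        cscColPtr[j] = 0;
    for (int k = 0; k < nnz; ++k)
        cscColPtr[csrColInd[k] + 1]++;

    /* 2. Prefix sum turns the counts into column start offsets. */
    for (int j = 0; j < n; ++j)
        cscColPtr[j + 1] += cscColPtr[j];
    for (int j = 0; j < n; ++j)
        next[j] = cscColPtr[j];

    /* 3. Scatter. Rows are visited in ascending order, so the row
     *    indices within each column end up sorted. */
    for (int row = 0; row < n; ++row) {
        for (int k = csrRowPtr[row]; k < csrRowPtr[row + 1]; ++k) {
            int dst = next[csrColInd[k]]++;
            cscRowInd[dst] = row;
            cscVal[dst] = csrVal[k];
        }
    }

    free(next);
    return 0;
}
```

Applied to the CSR arrays csrRowPtr = {0, 2, 4, 8, 9}, csrColInd = {0, 1, 1, 2, 0, 1, 2, 3, 3} and csrVal = {1, 3, 4, 6, 2, 5, 7, 8, 9}, the sketch produces cscColPtr = {0, 2, 5, 7, 9} and cscRowInd = {0, 2, 0, 1, 2, 1, 2, 2, 3}.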
5. cuSolverDN: Dense LAPACK Function Reference
This chapter describes the API of cuSolverDN, which provides a subset of dense LAPACK functions.
5.1. cuSolverDN Helper Function Reference
The cuSolverDN helper functions are described in this section.
5.1.1. cusolverDnCreate()
cusolverStatus_t cusolverDnCreate(cusolverDnHandle_t *handle);
This function initializes the cuSolverDN library and creates a handle on the cuSolverDN context. It must be called before any other cuSolverDN API function is invoked. It allocates hardware resources necessary for accessing the GPU.
parameter | Memory | In/out | Meaning |
handle | host | output | the pointer to the handle to the cuSolverDN context. |
CUSOLVER_STATUS_SUCCESS | the initialization succeeded. |
CUSOLVER_STATUS_NOT_INITIALIZED | the CUDA Runtime initialization failed. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_ARCH_MISMATCH | the library supports only devices of compute capability 2.0 and above. |
5.1.2. cusolverDnDestroy()
cusolverStatus_t cusolverDnDestroy(cusolverDnHandle_t handle);
This function releases CPU-side resources used by the cuSolverDN library.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
CUSOLVER_STATUS_SUCCESS | the shutdown succeeded. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
5.1.3. cusolverDnSetStream()
cusolverStatus_t cusolverDnSetStream(cusolverDnHandle_t handle, cudaStream_t streamId)
This function sets the stream to be used by the cuSolverDN library to execute its routines.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
streamId | host | input | the stream to be used by the library. |
CUSOLVER_STATUS_SUCCESS | the stream was set successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
5.1.4. cusolverDnGetStream()
cusolverStatus_t cusolverDnGetStream(cusolverDnHandle_t handle, cudaStream_t *streamId)
This function gets the stream used by the cuSolverDN library to execute its routines.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
streamId | host | output | the stream to be used by the library. |
CUSOLVER_STATUS_SUCCESS | the stream was returned successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
5.1.5. cusolverDnCreateSyevjInfo()
cusolverStatus_t cusolverDnCreateSyevjInfo( syevjInfo_t *info);
This function creates and initializes the structure of syevj, syevjBatched and sygvj to default values.
parameter | Memory | In/out | Meaning |
info | host | output | the pointer to the structure of syevj. |
CUSOLVER_STATUS_SUCCESS | the structure was initialized successfully. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
5.1.6. cusolverDnDestroySyevjInfo()
cusolverStatus_t cusolverDnDestroySyevjInfo( syevjInfo_t info);
This function destroys and releases any memory required by the structure.
parameter | Memory | In/out | Meaning |
info | host | input | the pointer to the structure of syevj. |
CUSOLVER_STATUS_SUCCESS | the resources are released successfully. |
5.1.7. cusolverDnXsyevjSetTolerance()
cusolverStatus_t cusolverDnXsyevjSetTolerance( syevjInfo_t info, double tolerance)
This function configures tolerance of syevj.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of syevj. |
tolerance | host | input | accuracy of numerical eigenvalues. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.8. cusolverDnXsyevjSetMaxSweeps()
cusolverStatus_t cusolverDnXsyevjSetMaxSweeps( syevjInfo_t info, int max_sweeps)
This function configures maximum number of sweeps in syevj. The default value is 100.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of syevj. |
max_sweeps | host | input | maximum number of sweeps. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.9. cusolverDnXsyevjSetSortEig()
cusolverStatus_t cusolverDnXsyevjSetSortEig( syevjInfo_t info, int sort_eig)
If sort_eig is zero, the eigenvalues are not sorted. This function only works for syevjBatched; syevj and sygvj always sort eigenvalues in ascending order. By default, eigenvalues are sorted in ascending order.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of syevj. |
sort_eig | host | input | if sort_eig is zero, the eigenvalues are not sorted. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.10. cusolverDnXsyevjGetResidual()
cusolverStatus_t cusolverDnXsyevjGetResidual( cusolverDnHandle_t handle, syevjInfo_t info, double *residual)
This function reports residual of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
info | host | input | the pointer to the structure of syevj. |
residual | host | output | residual of syevj. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_SUPPORTED | does not support batched version |
5.1.11. cusolverDnXsyevjGetSweeps()
cusolverStatus_t cusolverDnXsyevjGetSweeps( cusolverDnHandle_t handle, syevjInfo_t info, int *executed_sweeps)
This function reports number of executed sweeps of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
info | host | input | the pointer to the structure of syevj. |
executed_sweeps | host | output | number of executed sweeps. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_SUPPORTED | does not support batched version |
5.1.12. cusolverDnCreateGesvdjInfo()
cusolverStatus_t cusolverDnCreateGesvdjInfo( gesvdjInfo_t *info);
This function creates and initializes the structure of gesvdj and gesvdjBatched to default values.
parameter | Memory | In/out | Meaning |
info | host | output | the pointer to the structure of gesvdj. |
CUSOLVER_STATUS_SUCCESS | the structure was initialized successfully. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
5.1.13. cusolverDnDestroyGesvdjInfo()
cusolverStatus_t cusolverDnDestroyGesvdjInfo( gesvdjInfo_t info);
This function destroys and releases any memory required by the structure.
parameter | Memory | In/out | Meaning |
info | host | input | the pointer to the structure of gesvdj. |
CUSOLVER_STATUS_SUCCESS | the resources are released successfully. |
5.1.14. cusolverDnXgesvdjSetTolerance()
cusolverStatus_t cusolverDnXgesvdjSetTolerance( gesvdjInfo_t info, double tolerance)
This function configures tolerance of gesvdj.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of gesvdj. |
tolerance | host | input | accuracy of numerical singular values. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.15. cusolverDnXgesvdjSetMaxSweeps()
cusolverStatus_t cusolverDnXgesvdjSetMaxSweeps( gesvdjInfo_t info, int max_sweeps)
This function configures maximum number of sweeps in gesvdj. The default value is 100.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of gesvdj. |
max_sweeps | host | input | maximum number of sweeps. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.16. cusolverDnXgesvdjSetSortEig()
cusolverStatus_t cusolverDnXgesvdjSetSortEig( gesvdjInfo_t info, int sort_svd)
If sort_svd is zero, the singular values are not sorted. This function only works for gesvdjBatched; gesvdj always sorts singular values in descending order. By default, singular values are sorted in descending order.
parameter | Memory | In/out | Meaning |
info | host | in/out | the pointer to the structure of gesvdj. |
sort_svd | host | input | if sort_svd is zero, the singular values are not sorted. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
5.1.17. cusolverDnXgesvdjGetResidual()
cusolverStatus_t cusolverDnXgesvdjGetResidual( cusolverDnHandle_t handle, gesvdjInfo_t info, double *residual)
This function reports residual of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
info | host | input | the pointer to the structure of gesvdj. |
residual | host | output | residual of gesvdj. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_SUPPORTED | does not support batched version |
5.1.18. cusolverDnXgesvdjGetSweeps()
cusolverStatus_t cusolverDnXgesvdjGetSweeps( cusolverDnHandle_t handle, gesvdjInfo_t info, int *executed_sweeps)
This function reports number of executed sweeps of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
info | host | input | the pointer to the structure of gesvdj. |
executed_sweeps | host | output | number of executed sweeps. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_SUPPORTED | does not support batched version |
Dense Linear Solver Reference
This chapter describes the linear solver API of cuSolverDN, including Cholesky factorization, LU with partial pivoting, QR factorization and Bunch-Kaufman (LDLT) factorization.
cusolverDn<t>potrf()
cusolverStatus_t cusolverDnSpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnDpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnCpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnZpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, float *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnDpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, double *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnCpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, cuComplex *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnZpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *Workspace, int Lwork, int *devInfo);
This function computes the Cholesky factorization of a Hermitian positive-definite matrix.
A is an n×n Hermitian matrix; only the lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.
If input parameter uplo is CUBLAS_FILL_MODE_LOWER, only the lower triangular part of A is processed and replaced by the lower triangular Cholesky factor L.
If input parameter uplo is CUBLAS_FILL_MODE_UPPER, only the upper triangular part of A is processed and replaced by the upper triangular Cholesky factor U.
The user has to provide working space, pointed to by input parameter Workspace. The input parameter Lwork is the size of the working space, as returned by potrf_bufferSize().
If the Cholesky factorization fails, i.e. some leading minor of A is not positive definite (equivalently, some diagonal element of L or U is not a real number), the output parameter devInfo indicates the smallest leading minor of A that is not positive definite.
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
n | host | input | number of rows and columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,n). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
Workspace | device | in/out | working space, <type> array of size Lwork. |
Lwork | host | input | size of Workspace, returned by potrf_bufferSize. |
devInfo | device | output | if devInfo = 0, the Cholesky factorization is successful. if devInfo = -i, the i-th parameter is wrong. if devInfo = i, the leading minor of order i is not positive definite. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda<max(1,n)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
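The meaning of a positive devInfo can be illustrated on the host. The following is a minimal sketch, not the cuSolver implementation: the helper name ref_first_bad_minor, the restriction to real double-precision data, and the square-root-free recurrence are all assumptions made for illustration. It reads only the lower triangle of a column-major matrix and returns the order of the first leading minor that is not positive definite, or 0 if there is none.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * A    : n-by-n real symmetric matrix, column-major, only the lower part is read.
 * L, d : caller-provided scratch, n*n and n doubles (L uses leading dimension n).
 * Returns 0 if all leading minors are positive definite, otherwise the 1-based
 * order of the first one that is not (the role played by devInfo = i). */
int ref_first_bad_minor(int n, const double *A, int lda, double *L, double *d)
{
    for (int j = 0; j < n; ++j) {
        /* d[j] equals the square of the j-th Cholesky pivot. */
        double s = A[IDX(j, j, lda)];
        for (int k = 0; k < j; ++k)
            s -= L[IDX(j, k, n)] * L[IDX(j, k, n)] * d[k];
        if (!(s > 0.0))
            return j + 1;              /* leading minor of order j+1 fails */
        d[j] = s;
        for (int i = j + 1; i < n; ++i) {
            double t = A[IDX(i, j, lda)];
            for (int k = 0; k < j; ++k)
                t -= L[IDX(i, k, n)] * L[IDX(j, k, n)] * d[k];
            L[IDX(i, j, n)] = t / s;
        }
    }
    return 0;
}
```

For the 2×2 matrix with rows (1, 2) and (2, 1), the helper returns 2, which corresponds to potrf reporting that the leading minor of order 2 is not positive definite.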
cusolverDn<t>potrs()
cusolverStatus_t cusolverDnSpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const float *A, int lda, float *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnDpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const double *A, int lda, double *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnCpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const cuComplex *A, int lda, cuComplex *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnZpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const cuDoubleComplex *A, int lda, cuDoubleComplex *B, int ldb, int *devInfo);
This function solves a system of linear equations
A*X = B
where A is an n×n Hermitian matrix; only the lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.
The user has to call potrf first to factorize matrix A. If input parameter uplo is CUBLAS_FILL_MODE_LOWER, A is the lower triangular Cholesky factor L corresponding to A = L*L^H. If input parameter uplo is CUBLAS_FILL_MODE_UPPER, A is the upper triangular Cholesky factor U corresponding to A = U^H*U.
The operation is in-place, i.e. matrix X overwrites matrix B with the same leading dimension ldb.
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
n | host | input | number of rows and columns of matrix A. |
nrhs | host | input | number of columns of matrix X and B. |
A | device | input | <type> array of dimension lda * n, with lda not less than max(1,n). A is either the lower Cholesky factor L or the upper Cholesky factor U. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
B | device | in/out | <type> array of dimension ldb * nrhs, with ldb not less than max(1,n). On input, B is the right-hand side matrix. On output, B is the solution matrix. |
ldb | host | input | leading dimension of two-dimensional array used to store matrix B. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, nrhs<0, lda<max(1,n) or ldb<max(1,n)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
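The in-place solve can be pictured as two triangular sweeps over each column of B. The sketch below is a host-side reference under stated assumptions (real double-precision data, the lower factor L with A = L*L^T, nonzero diagonal); the name ref_potrs_lower is hypothetical and this is not how cuSolver implements potrs.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * L : n-by-n lower Cholesky factor, column-major; the upper part is not read.
 * B : n-by-nrhs right-hand sides on entry, overwritten by the solution X. */
void ref_potrs_lower(int n, int nrhs, const double *L, int lda,
                     double *B, int ldb)
{
    for (int c = 0; c < nrhs; ++c) {
        double *b = B + (size_t)c * (size_t)ldb;
        /* Forward substitution: solve L*y = b. */
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int k = 0; k < i; ++k)
                s -= L[IDX(i, k, lda)] * b[k];
            b[i] = s / L[IDX(i, i, lda)];
        }
        /* Backward substitution: solve L^T*x = y, using (L^T)(i,k) = L(k,i). */
        for (int i = n - 1; i >= 0; --i) {
            double s = b[i];
            for (int k = i + 1; k < n; ++k)
                s -= L[IDX(k, i, lda)] * b[k];
            b[i] = s / L[IDX(i, i, lda)];
        }
    }
}
```

With L having rows (2, 0) and (1, 1), so that A has rows (4, 2) and (2, 2), the right-hand side (8, 6) is overwritten by the solution (1, 2).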
cusolverDn<t>getrf()
cusolverStatus_t cusolverDnSgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, float *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnDgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, double *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnCgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnZgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSgetrf(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *Workspace, int *devIpiv, int *devInfo);
cusolverStatus_t cusolverDnDgetrf(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *Workspace, int *devIpiv, int *devInfo);
cusolverStatus_t cusolverDnCgetrf(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, cuComplex *Workspace, int *devIpiv, int *devInfo);
cusolverStatus_t cusolverDnZgetrf(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *Workspace, int *devIpiv, int *devInfo);
This function computes the LU factorization of an m×n matrix
P*A = L*U
where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.
The user has to provide working space, pointed to by input parameter Workspace. The size of the working space, Lwork, is returned by getrf_bufferSize().
If the LU factorization fails, i.e. matrix A (U) is singular, the output parameter devInfo = i indicates U(i,i) = 0.
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
If devIpiv is null, no pivoting is performed. The factorization is A=L*U, which is not numerically stable.
Whether or not the LU factorization fails, the output parameter devIpiv contains the pivoting sequence: row i is interchanged with row devIpiv(i).
The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
m | host | input | number of rows of matrix A. |
n | host | input | number of columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,m). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
Workspace | device | in/out | working space, <type> array of size Lwork. |
devIpiv | device | output | array of size at least min(m,n), containing pivot indices. |
devInfo | device | output | if devInfo = 0, the LU factorization is successful. if devInfo = -i, the i-th parameter is wrong. if devInfo = i, U(i,i) = 0. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda<max(1,m)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
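The storage convention of the factorization (unit-diagonal L below the diagonal, U on and above it, 1-based pivot indices, devInfo = i for a zero pivot) can be reproduced with a small host-side reference. This is a sketch under stated assumptions: real double-precision data, a hypothetical function name ref_getrf, and a textbook right-looking algorithm rather than whatever cuSolver uses internally.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

static double ref_abs(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical host-side helper, not part of cuSolver.
 * Overwrites the m-by-n column-major matrix A with L (strictly lower part,
 * unit diagonal implied) and U (upper part). ipiv has min(m,n) entries and is
 * 1-based: row k+1 was interchanged with row ipiv[k].
 * Returns 0 on success, or the 1-based index i of the first U(i,i) == 0. */
int ref_getrf(int m, int n, double *A, int lda, int *ipiv)
{
    int info = 0;
    int kmax = (m < n) ? m : n;
    for (int k = 0; k < kmax; ++k) {
        /* Partial pivoting: largest magnitude in column k, rows k..m-1. */
        int p = k;
        double best = ref_abs(A[IDX(k, k, lda)]);
        for (int i = k + 1; i < m; ++i) {
            double v = ref_abs(A[IDX(i, k, lda)]);
            if (v > best) { best = v; p = i; }
        }
        ipiv[k] = p + 1;
        if (best == 0.0) {             /* exactly singular pivot column */
            if (info == 0) info = k + 1;
            continue;
        }
        if (p != k) {                  /* swap full rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = A[IDX(k, j, lda)];
                A[IDX(k, j, lda)] = A[IDX(p, j, lda)];
                A[IDX(p, j, lda)] = t;
            }
        }
        for (int i = k + 1; i < m; ++i)
            A[IDX(i, k, lda)] /= A[IDX(k, k, lda)];
        for (int j = k + 1; j < n; ++j)
            for (int i = k + 1; i < m; ++i)
                A[IDX(i, j, lda)] -= A[IDX(i, k, lda)] * A[IDX(k, j, lda)];
    }
    return info;
}
```

For the matrix with rows (0, 1) and (2, 3), the two rows are swapped, so ipiv is (2, 2) and the return value is 0. For the singular matrix with rows (1, 2) and (2, 4), the return value is 2 because U(2,2) = 0.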
cusolverDn<t>getrs()
cusolverStatus_t cusolverDnSgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const float *A, int lda, const int *devIpiv, float *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnDgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const double *A, int lda, const int *devIpiv, double *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnCgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const cuComplex *A, int lda, const int *devIpiv, cuComplex *B, int ldb, int *devInfo);
cusolverStatus_t cusolverDnZgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const cuDoubleComplex *A, int lda, const int *devIpiv, cuDoubleComplex *B, int ldb, int *devInfo);
This function solves a linear system with multiple right-hand sides
op(A)*X = B
where A is an n×n matrix that was LU-factored by getrf, that is, the lower triangular part of A is L and the upper triangular part (including diagonal elements) of A is U. B is an n×nrhs right-hand side matrix.
The input parameter trans is defined by
op(A) = A if trans == CUBLAS_OP_N
op(A) = A^T if trans == CUBLAS_OP_T
op(A) = A^H if trans == CUBLAS_OP_C
The input parameter devIpiv is an output of getrf. It contains pivot indices, which are used to permute the right-hand sides.
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
trans | host | input | operation op(A) that is non- or (conj.) transpose. |
n | host | input | number of rows and columns of matrix A. |
nrhs | host | input | number of right-hand sides. |
A | device | input | <type> array of dimension lda * n, with lda not less than max(1,n). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
devIpiv | device | input | array of size at least n, containing pivot indices. |
B | device | in/out | <type> array of dimension ldb * nrhs, with ldb not less than max(1,n). On input, B is the right-hand side matrix; on output, B is overwritten by the solution matrix X. |
ldb | host | input | leading dimension of two-dimensional array used to store matrix B. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda<max(1,n) or ldb<max(1,n)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
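How the pivot indices and the packed L and U factors are consumed can be sketched on the host for the non-transposed case. The function name ref_getrs_notrans is hypothetical, the data are assumed real double-precision with 1-based pivots as produced by an LAPACK-style getrf, and this is an illustration rather than the cuSolver implementation.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * LU   : n-by-n packed factors (unit-diagonal L below, U on and above diagonal).
 * ipiv : n pivot indices, 1-based.
 * B    : n-by-nrhs right-hand sides on entry, solution X on exit. */
void ref_getrs_notrans(int n, int nrhs, const double *LU, int lda,
                       const int *ipiv, double *B, int ldb)
{
    for (int c = 0; c < nrhs; ++c) {
        double *b = B + (size_t)c * (size_t)ldb;
        /* Apply the row interchanges in the order they were recorded. */
        for (int i = 0; i < n; ++i) {
            int p = ipiv[i] - 1;
            if (p != i) { double t = b[i]; b[i] = b[p]; b[p] = t; }
        }
        /* Forward substitution with unit-diagonal L. */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < i; ++k)
                b[i] -= LU[IDX(i, k, lda)] * b[k];
        /* Backward substitution with U. */
        for (int i = n - 1; i >= 0; --i) {
            for (int k = i + 1; k < n; ++k)
                b[i] -= LU[IDX(i, k, lda)] * b[k];
            b[i] /= LU[IDX(i, i, lda)];
        }
    }
}
```

Using the packed factors of the matrix with rows (0, 1) and (2, 3), that is, U with rows (2, 3) and (0, 1), a zero multiplier in L, and pivots (2, 2), the right-hand side (2, 8) is overwritten by the solution (1, 2).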
cusolverDn<t>geqrf()
cusolverStatus_t cusolverDnSgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, float *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnDgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, double *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnCgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnZgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSgeqrf(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *TAU, float *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnDgeqrf(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *TAU, double *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnCgeqrf(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, cuComplex *TAU, cuComplex *Workspace, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnZgeqrf(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *TAU, cuDoubleComplex *Workspace, int Lwork, int *devInfo);
This function computes the QR factorization of an m×n matrix
A = Q*R
where A is an m×n matrix, Q is an m×n matrix, and R is an n×n upper triangular matrix.
The user has to provide working space, pointed to by input parameter Workspace. The input parameter Lwork is the size of the working space, as returned by geqrf_bufferSize().
The matrix R is overwritten in the upper triangular part of A, including diagonal elements.
The matrix Q is not formed explicitly; instead, a sequence of Householder vectors is stored in the lower triangular part of A. The leading nonzero element of each Householder vector is assumed to be 1, such that output parameter TAU contains the scaling factor τ. If v is the original Householder vector and q is the new Householder vector corresponding to τ, they satisfy the following relation
I - 2*v*v^H = I - τ*q*q^H
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
m | host | input | number of rows of matrix A. |
n | host | input | number of columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,m). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
TAU | device | output | <type> array of dimension at least min(m,n). |
Workspace | device | in/out | working space, <type> array of size Lwork. |
Lwork | host | input | size of working array Workspace. |
devInfo | device | output | if devInfo = 0, the QR factorization is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda<max(1,m)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
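The rescaling of a Householder vector so that its leading element is 1 can be checked numerically. The sketch below assumes a real vector v with unit 2-norm and nonzero v[0], and assumes the relation I - 2*v*v^T = I - τ*q*q^T between the original vector v and the stored vector q; the helper name ref_rescale_reflector is hypothetical and not part of cuSolver.

```c
/* Hypothetical host-side helper, not part of cuSolver.
 * Given a real Householder vector v of unit 2-norm with v[0] != 0,
 * produce q with q[0] == 1 and the scalar tau such that
 * 2*v*v^T == tau*q*q^T, hence I - 2*v*v^T == I - tau*q*q^T. */
void ref_rescale_reflector(int len, const double *v, double *q, double *tau)
{
    for (int i = 0; i < len; ++i)
        q[i] = v[i] / v[0];        /* leading element becomes exactly 1 */
    *tau = 2.0 * v[0] * v[0];      /* absorbs the scaling removed from v */
}
```

For v = (0.6, 0.8), this gives q[0] = 1 and τ = 2*0.6^2, and every entry of 2*v*v^T agrees with the corresponding entry of τ*q*q^T up to rounding.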
cusolverDn<t>ormqr()
cusolverStatus_t cusolverDnSormqr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const float *A, int lda, const float *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnDormqr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const double *A, int lda, const double *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnCunmqr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnZunmqr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnSormqr(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const float *A, int lda, const float *tau, float *C, int ldc, float *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnDormqr(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const double *A, int lda, const double *tau, double *C, int ldc, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCunmqr(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, cuComplex *C, int ldc, cuComplex *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnZunmqr(cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *C, int ldc, cuDoubleComplex *work, int lwork, int *devInfo);
This function overwrites the m×n matrix C by
C = op(Q)*C if side == CUBLAS_SIDE_LEFT
C = C*op(Q) if side == CUBLAS_SIDE_RIGHT
The operation of Q is defined by
op(Q) = Q if trans == CUBLAS_OP_N
op(Q) = Q^T if trans == CUBLAS_OP_T
op(Q) = Q^H if trans == CUBLAS_OP_C
Q is a unitary matrix formed by a sequence of elementary reflection vectors from the QR factorization (geqrf) of A.
Q=H(1)H(2) ... H(k)
Q is of order m if side = CUBLAS_SIDE_LEFT and of order n if side = CUBLAS_SIDE_RIGHT.
The user has to provide working space, pointed to by input parameter work. The input parameter lwork is the size of the working space, as returned by geqrf_bufferSize() or ormqr_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
The user can combine geqrf, ormqr and trsm to complete a linear solver or a least-square solver. Please refer to appendix C.1.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
side | host | input | indicates if matrix Q is on the left or right of C. |
trans | host | input | operation op(Q) that is non- or (conj.) transpose. |
m | host | input | number of rows of matrix C. |
n | host | input | number of columns of matrix C. |
k | host | input | number of elementary reflections. |
A | device | input | <type> array of dimension lda * k, with lda not less than max(1,m). The matrix A is from geqrf, so the i-th column contains an elementary reflection vector. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. if side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). |
tau | device | input | <type> array of dimension at least min(m,n). The vector tau is from geqrf, so tau(i) is the scalar of the i-th elementary reflection vector. |
C | device | in/out | <type> array of size ldc * n. On exit, C is overwritten by op(Q)*C. |
ldc | host | input | leading dimension of two-dimensional array of matrix C. ldc >= max(1,m). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working array work. |
devInfo | device | output | if devInfo = 0, the ormqr is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda or ldc). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
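Applying Q without forming it amounts to applying the reflectors one at a time. The following host-side sketch covers only the left-sided, real double-precision case and assumes each reflector has the form H(i) = I - tau(i)*q_i*q_i^T, with q_i stored below the diagonal of column i and an implicit 1 on the diagonal. The name ref_ormqr_left is hypothetical, and this is not the cuSolver implementation.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * Overwrites the m-by-n matrix C with Q*C (transpose == 0) or Q^T*C
 * (transpose != 0), where Q = H(1)H(2)...H(k) is encoded in A and tau. */
void ref_ormqr_left(int transpose, int m, int n, int k,
                    const double *A, int lda, const double *tau,
                    double *C, int ldc)
{
    for (int step = 0; step < k; ++step) {
        /* Q*C applies H(k) first and H(1) last; Q^T*C uses the reverse order. */
        int i = transpose ? step : (k - 1 - step);
        for (int j = 0; j < n; ++j) {
            double *c = C + (size_t)j * (size_t)ldc;
            double w = c[i];                       /* implicit q_i(i) = 1 */
            for (int r = i + 1; r < m; ++r)
                w += A[IDX(r, i, lda)] * c[r];     /* w = q_i^T * c */
            w *= tau[i];
            c[i] -= w;                             /* c -= tau * w * q_i */
            for (int r = i + 1; r < m; ++r)
                c[r] -= w * A[IDX(r, i, lda)];
        }
    }
}
```

With two reflectors on a 2×2 problem (q_1 = (1, 1), tau = (1, 2)), Q has rows (0, 1) and (-1, 0), so the vector (3, 4) becomes (4, -3) under Q and (-4, 3) under Q^T. Entries of A on or above the diagonal belong to R and are never read.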
cusolverDn<t>orgqr()
cusolverStatus_t cusolverDnSorgqr_bufferSize(cusolverDnHandle_t handle, int m, int n, int k, const float *A, int lda, int *lwork);
cusolverStatus_t cusolverDnDorgqr_bufferSize(cusolverDnHandle_t handle, int m, int n, int k, const double *A, int lda, int *lwork);
cusolverStatus_t cusolverDnCungqr_bufferSize(cusolverDnHandle_t handle, int m, int n, int k, const cuComplex *A, int lda, int *lwork);
cusolverStatus_t cusolverDnZungqr_bufferSize(cusolverDnHandle_t handle, int m, int n, int k, const cuDoubleComplex *A, int lda, int *lwork);
cusolverStatus_t cusolverDnSorgqr(cusolverDnHandle_t handle, int m, int n, int k, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnDorgqr(cusolverDnHandle_t handle, int m, int n, int k, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungqr(cusolverDnHandle_t handle, int m, int n, int k, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnZungqr(cusolverDnHandle_t handle, int m, int n, int k, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function overwrites the m×n matrix A by
Q=H(1)H(2) ... H(k)
where Q is a unitary matrix formed by a sequence of elementary reflection vectors stored in A.
The user has to provide working space, pointed to by input parameter work. The input parameter lwork is the size of the working space, as returned by orgqr_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
The user can combine geqrf and orgqr to complete orthogonalization. Please refer to appendix C.2.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
m | host | input | number of rows of matrix Q. m >= 0. |
n | host | input | number of columns of matrix Q. m >= n >= 0. |
k | host | input | number of elementary reflections whose product defines the matrix Q. n >= k >= 0. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,m). The i-th column of A contains an elementary reflection vector. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,m). |
tau | device | input | <type> array of dimension k. tau(i) is the scalar of the i-th elementary reflection vector. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working array work. |
devInfo | device | output | if devInfo = 0, the orgqr is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,k<0, n>m, k>n or lda<m). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
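Forming Q explicitly is equivalent to applying the reflectors to the leading columns of the identity. The sketch below does this on the host, out of place for clarity (the library routine works in place on A), for real double-precision data and reflectors of the assumed form H(i) = I - tau(i)*q_i*q_i^T with an implicit unit leading element. The name ref_orgqr and the separate output array Q are assumptions made for illustration.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * A, tau : k elementary reflectors stored in geqrf layout.
 * Q      : m-by-n output, the first n columns of H(1)H(2)...H(k). */
void ref_orgqr(int m, int n, int k, const double *A, int lda,
               const double *tau, double *Q, int ldq)
{
    /* Start from the first n columns of the m-by-m identity. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            Q[IDX(i, j, ldq)] = (i == j) ? 1.0 : 0.0;

    /* Q = H(1)...H(k)*I: apply H(k) first and H(1) last. */
    for (int i = k - 1; i >= 0; --i) {
        for (int j = 0; j < n; ++j) {
            double *c = Q + (size_t)j * (size_t)ldq;
            double w = c[i];                       /* implicit q_i(i) = 1 */
            for (int r = i + 1; r < m; ++r)
                w += A[IDX(r, i, lda)] * c[r];
            w *= tau[i];
            c[i] -= w;
            for (int r = i + 1; r < m; ++r)
                c[r] -= w * A[IDX(r, i, lda)];
        }
    }
}
```

For the two reflectors q_1 = (1, 1) with tau(1) = 1 and q_2 = (0, 1) with tau(2) = 2, the resulting Q has rows (0, 1) and (-1, 0), which is orthogonal as expected.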
cusolverDn<t>sytrf()
cusolverStatus_t cusolverDnSsytrf_bufferSize(cusolverDnHandle_t handle, int n, float *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnDsytrf_bufferSize(cusolverDnHandle_t handle, int n, double *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnCsytrf_bufferSize(cusolverDnHandle_t handle, int n, cuComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnZsytrf_bufferSize(cusolverDnHandle_t handle, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, int *ipiv, float *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnDsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, int *ipiv, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, int *ipiv, cuComplex *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnZsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, int *ipiv, cuDoubleComplex *work, int lwork, int *devInfo);
This function computes the Bunch-Kaufman factorization of an n×n symmetric indefinite matrix.
A is an n×n symmetric matrix; only the lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.
If input parameter uplo is CUBLAS_FILL_MODE_LOWER, only the lower triangular part of A is processed and replaced by the lower triangular factor L and the block diagonal matrix D. Each block of D is either a 1x1 or a 2x2 block, depending on pivoting.
If input parameter uplo is CUBLAS_FILL_MODE_UPPER, only the upper triangular part of A is processed and replaced by the upper triangular factor U and the block diagonal matrix D.
The user has to provide working space, pointed to by input parameter work. The input parameter lwork is the size of the working space, as returned by sytrf_bufferSize().
If the Bunch-Kaufman factorization fails, i.e. A is singular, the output parameter devInfo = i indicates D(i,i)=0.
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
The output parameter ipiv contains the pivoting sequence. If ipiv(i) = k > 0, D(i,i) is a 1x1 block, and the i-th row/column of A is interchanged with the k-th row/column of A. If uplo is CUBLAS_FILL_MODE_UPPER and ipiv(i-1) = ipiv(i) = -m < 0, D(i-1:i,i-1:i) is a 2x2 block, and the (i-1)-th row/column is interchanged with the m-th row/column. If uplo is CUBLAS_FILL_MODE_LOWER and ipiv(i+1) = ipiv(i) = -m < 0, D(i:i+1,i:i+1) is a 2x2 block, and the (i+1)-th row/column is interchanged with the m-th row/column.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
n | host | input | number of rows and columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,n). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
ipiv | device | output | array of size at least n, containing pivot indices. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working space work. |
devInfo | device | output | if devInfo = 0, the Bunch-Kaufman factorization is successful. if devInfo = -i, the i-th parameter is wrong. if devInfo = i, D(i,i) = 0. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda<max(1,n)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
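The pivot encoding for the lower-triangular case can be decoded on the host once ipiv has been copied back from the device. The sketch below is a hypothetical helper (the name ref_sytrf_blocks_lower and its return convention are assumptions): a positive entry is read as a 1x1 block of D, and two equal negative entries in a row are read as a 2x2 block.

```c
/* Hypothetical host-side helper, not part of cuSolver.
 * ipiv       : n pivot indices for the lower-triangular case, copied to the host.
 * block_size : output, size of each diagonal block of D in order (1 or 2);
 *              must have room for n entries.
 * Returns the number of blocks, or -1 if ipiv does not follow the convention. */
int ref_sytrf_blocks_lower(int n, const int *ipiv, int *block_size)
{
    int count = 0;
    int i = 0;
    while (i < n) {
        if (ipiv[i] > 0) {
            block_size[count++] = 1;       /* 1x1 block D(i,i) */
            i += 1;
        } else if (ipiv[i] < 0 && i + 1 < n && ipiv[i + 1] == ipiv[i]) {
            block_size[count++] = 2;       /* 2x2 block D(i:i+1,i:i+1) */
            i += 2;
        } else {
            return -1;                     /* zero or unpaired negative entry */
        }
    }
    return count;
}
```

For ipiv = (1, -3, -3, 4), the helper reports three blocks of sizes 1, 2 and 1.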
Dense Eigenvalue Solver Reference
This chapter describes the eigenvalue solver API of cuSolverDN, including bidiagonalization and SVD.
cusolverDn<t>gebrd()
cusolverStatus_t cusolverDnSgebrd_bufferSize(cusolverDnHandle_t handle, int m, int n, int *Lwork);
cusolverStatus_t cusolverDnDgebrd_bufferSize(cusolverDnHandle_t handle, int m, int n, int *Lwork);
cusolverStatus_t cusolverDnCgebrd_bufferSize(cusolverDnHandle_t handle, int m, int n, int *Lwork);
cusolverStatus_t cusolverDnZgebrd_bufferSize(cusolverDnHandle_t handle, int m, int n, int *Lwork);
cusolverStatus_t cusolverDnSgebrd(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *D, float *E, float *TAUQ, float *TAUP, float *Work, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnDgebrd(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *D, double *E, double *TAUQ, double *TAUP, double *Work, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnCgebrd(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, float *D, float *E, cuComplex *TAUQ, cuComplex *TAUP, cuComplex *Work, int Lwork, int *devInfo);
cusolverStatus_t cusolverDnZgebrd(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, double *D, double *E, cuDoubleComplex *TAUQ, cuDoubleComplex *TAUP, cuDoubleComplex *Work, int Lwork, int *devInfo);
This function reduces a general m×n matrix A to a real upper or lower bidiagonal form B by an orthogonal transformation:
Q^H*A*P = B
If m>=n, B is upper bidiagonal; if m<n, B is lower bidiagonal.
The matrices Q and P are overwritten into matrix A in the following sense:
If m>=n, the diagonal and the first superdiagonal are overwritten with the upper bidiagonal matrix B; the elements below the diagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the first superdiagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.
If m<n, the diagonal and the first subdiagonal are overwritten with the lower bidiagonal matrix B; the elements below the first subdiagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the diagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.
The user has to provide working space, pointed to by input parameter Work. The input parameter Lwork is the size of the working space, as returned by gebrd_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
Remark: gebrd only supports m>=n.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
m | host | input | number of rows of matrix A. |
n | host | input | number of columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, with lda not less than max(1,m). |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
D | device | output | real array of dimension min(m,n). The diagonal elements of the bidiagonal matrix B: D(i) = A(i,i). |
E | device | output | real array of dimension min(m,n). The off-diagonal elements of the bidiagonal matrix B: if m>=n, E(i) = A(i,i+1) for i = 1,2,...,n-1; if m<n, E(i) = A(i+1,i) for i = 1,2,...,m-1. |
TAUQ | device | output | <type> array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. |
TAUP | device | output | <type> array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix P. |
Work | device | in/out | working space, <type> array of size Lwork. |
Lwork | host | input | size of Work, returned by gebrd_bufferSize. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0, or lda<max(1,m)). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
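For the supported case m>=n, the outputs D and E describe an upper bidiagonal matrix with D(i) = B(i,i) and E(i) = B(i,i+1). The sketch below expands them into a dense column-major array on the host, which can be handy when checking results against a reference. The helper name ref_upper_bidiagonal is hypothetical and real double-precision data are assumed.

```c
#include <stddef.h>

/* Column-major offset of element (i,j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host-side helper, not part of cuSolver.
 * D : n diagonal entries; E : n-1 superdiagonal entries (host copies).
 * B : n-by-n dense output, upper bidiagonal, column-major. */
void ref_upper_bidiagonal(int n, const double *D, const double *E,
                          double *B, int ldb)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            B[IDX(i, j, ldb)] = 0.0;
    for (int i = 0; i < n; ++i) {
        B[IDX(i, i, ldb)] = D[i];              /* D(i) = B(i,i)   */
        if (i + 1 < n)
            B[IDX(i, i + 1, ldb)] = E[i];      /* E(i) = B(i,i+1) */
    }
}
```

With D = (1, 2, 3) and E = (4, 5), the dense matrix has rows (1, 4, 0), (0, 2, 5) and (0, 0, 3).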
cusolverDn<t>orgbr()
cusolverStatus_t cusolverDnSorgbr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const float *A, int lda, const float *tau, int *lwork);
cusolverStatus_t cusolverDnDorgbr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const double *A, int lda, const double *tau, int *lwork);
cusolverStatus_t cusolverDnCungbr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, int *lwork);
cusolverStatus_t cusolverDnZungbr_bufferSize(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSorgbr(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnDorgbr(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungbr(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnZungbr(cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function generates one of the unitary matrices Q or P**H determined by gebrd when reducing a matrix A to bidiagonal form: $Q^{H} \cdot A \cdot P = B$
Q and P**H are defined as products of elementary reflectors H(i) or G(i) respectively.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by orgbr_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
side | host | input | if side = CUBLAS_SIDE_LEFT, generate Q. if side = CUBLAS_SIDE_RIGHT, generate P**H. |
m | host | input | number of rows of matrix Q or P**H. |
n | host | input | if side = CUBLAS_SIDE_LEFT, m>= n>= min(m,k). if side = CUBLAS_SIDE_RIGHT, n>= m>= min(n,k). |
k | host | input | if side = CUBLAS_SIDE_LEFT, the number of columns in the original m-by-k matrix reduced by gebrd. if side = CUBLAS_SIDE_RIGHT, the number of rows in the original k-by-n matrix reduced by gebrd. |
A | device | in/out | <type> array of dimension lda * n. On entry, the vectors which define the elementary reflectors, as returned by gebrd. On exit, the m-by-n matrix Q or P**H. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,m). |
tau | device | input | <type> array of dimension min(m,k) if side is CUBLAS_SIDE_LEFT; of dimension min(n,k) if side is CUBLAS_SIDE_RIGHT; tau(i) must contain the scalar factor of the elementary reflector H(i) or G(i), which determines Q or P**H, as returned by gebrd in its array argument TAUQ or TAUP. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working array work. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda ). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
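The shape constraints in the orgbr parameter table can be checked on the host before the call. The following is a minimal sketch, not part of the library: the names side_t, orgbr_tau_len and orgbr_shape_ok are hypothetical, side_t stands in for cublasSideMode_t so the code builds without the CUDA headers, and rejecting k<0 is an assumption that the table does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for cublasSideMode_t (keeps the sketch header-free). */
typedef enum { SIDE_LEFT, SIDE_RIGHT } side_t;

static int min_int(int a, int b) { return a < b ? a : b; }

/* Length of tau expected by orgbr: min(m,k) when generating Q,
 * min(n,k) when generating P**H. */
int orgbr_tau_len(side_t side, int m, int n, int k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    return side == SIDE_LEFT ? min_int(m, k) : min_int(n, k);
}

/* Host-side check of the constraints listed in the parameter table:
 *   left : m >= n >= min(m,k)
 *   right: n >= m >= min(n,k)
 * plus lda >= max(1,m). Rejecting k < 0 is an assumption of this sketch. */
bool orgbr_shape_ok(side_t side, int m, int n, int k, int lda)
{
    if (m < 0 || n < 0 || k < 0) return false;
    if (lda < (m > 1 ? m : 1)) return false;
    if (side == SIDE_LEFT)
        return m >= n && n >= min_int(m, k);
    return n >= m && m >= min_int(n, k);
}
```

For example, orgbr_shape_ok(SIDE_LEFT, 5, 3, 3, 5) accepts a 5-by-3 Q built from a 5-by-3 matrix reduced by gebrd, and orgbr_tau_len(SIDE_LEFT, 5, 3, 3) gives the matching tau length of 3.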
cusolverDn<t>sytrd()
cusolverStatus_t cusolverDnSsytrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const float *A, int lda, const float *d, const float *e, const float *tau, int *lwork); cusolverStatus_t cusolverDnDsytrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const double *A, int lda, const double *d, const double *e, const double *tau, int *lwork); cusolverStatus_t cusolverDnChetrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const float *d, const float *e, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZhetrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const double *d, const double *e, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSsytrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, float *d, float *e, float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDsytrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, double *d, double *e, double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnChetrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, float *d, float *e, cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZhetrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, double *d, double *e, cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function reduces a general symmetric (Hermitian) n×n matrix A to real symmetric tridiagonal form T by an orthogonal transformation: $Q^{H} \cdot A \cdot Q = T$
As an output, A contains T and Householder reflection vectors. If uplo = CUBLAS_FILL_MODE_UPPER, the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors; If uplo = CUBLAS_FILL_MODE_LOWER, the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by sytrd_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A is stored. |
n | host | input | number of rows (columns) of matrix A. |
A | device | in/out | <type> array of dimension lda * n, where lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. On exit, A is overwritten by T and Householder reflection vectors. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,n). |
D | device | output | real array of dimension n. The diagonal elements of the tridiagonal matrix T: D(i) = A(i,i). |
E | device | output | real array of dimension (n-1). The off-diagonal elements of the tridiagonal matrix T: if uplo = CUBLAS_FILL_MODE_UPPER, E(i) = A(i,i+1). if uplo = CUBLAS_FILL_MODE_LOWER, E(i) = A(i+1,i). |
tau | device | output | <type> array of dimension (n-1). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by sytrd_bufferSize. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda<max(1,n), or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
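To make the roles of the two real output arrays concrete, the sketch below assembles the dense tridiagonal matrix T from the diagonal and off-diagonal arrays. It is an illustrative host-side helper under stated assumptions, not a library routine: the name tridiag_from_de is hypothetical, d and e are assumed to be host copies of the sytrd outputs (for example after cudaMemcpy), and T is stored column-major.

```c
#include <assert.h>
#include <stddef.h>

/* Fill the n-by-n column-major array T (leading dimension ldt) with the real
 * symmetric tridiagonal matrix whose diagonal is d (n entries) and whose
 * first super/subdiagonal is e (n-1 entries).
 * d and e are assumed to be host copies of the arrays returned by sytrd. */
void tridiag_from_de(int n, const double *d, const double *e,
                     double *T, int ldt)
{
    assert(n >= 0 && ldt >= (n > 1 ? n : 1));

    /* Start from an all-zero matrix. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            T[(size_t)i + (size_t)ldt * j] = 0.0;

    for (int i = 0; i < n; ++i) {
        T[(size_t)i + (size_t)ldt * i] = d[i];             /* T(i,i)   */
        if (i + 1 < n) {
            T[(size_t)i + (size_t)ldt * (i + 1)] = e[i];   /* T(i,i+1) */
            T[(size_t)(i + 1) + (size_t)ldt * i] = e[i];   /* T(i+1,i) */
        }
    }
}
```

Because T is symmetric, the same helper applies to both values of uplo; only the location inside A from which sytrd takes the off-diagonal entries differs.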
cusolverDn<t>ormtr()
cusolverStatus_t cusolverDnSormtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const float *A, int lda, const float *tau, const float *C, int ldc, int *lwork); cusolverStatus_t cusolverDnDormtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const double *A, int lda, const double *tau, const double *C, int ldc, int *lwork); cusolverStatus_t cusolverDnCunmtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const cuComplex *A, int lda, const cuComplex *tau, const cuComplex *C, int ldc, int *lwork); cusolverStatus_t cusolverDnZunmtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, const cuDoubleComplex *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnSormtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, float *A, int lda, float *tau, float *C, int ldc, float *work, int lwork, int *info); cusolverStatus_t cusolverDnDormtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, double *A, int lda, double *tau, double *C, int ldc, double *work, int lwork, int *info);
cusolverStatus_t cusolverDnCunmtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, cuComplex *A, int lda, cuComplex *tau, cuComplex *C, int ldc, cuComplex *work, int lwork, int *info); cusolverStatus_t cusolverDnZunmtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *tau, cuDoubleComplex *C, int ldc, cuDoubleComplex *work, int lwork, int *info);
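This function overwrites m×n matrix C by $C = \operatorname{op}(Q) \cdot C$ if side = CUBLAS_SIDE_LEFT, or by $C = C \cdot \operatorname{op}(Q)$ if side = CUBLAS_SIDE_RIGHT,
where Q is a unitary matrix formed by a sequence of elementary reflection vectors from sytrd.
The operation on Q is defined by $\operatorname{op}(Q) = Q$ if trans = CUBLAS_OP_N, $\operatorname{op}(Q) = Q^{T}$ if trans = CUBLAS_OP_T, and $\operatorname{op}(Q) = Q^{H}$ if trans = CUBLAS_OP_C.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by ormtr_bufferSize().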
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
side | host | input | side = CUBLAS_SIDE_LEFT, apply Q or Q**T from the Left; side = CUBLAS_SIDE_RIGHT, apply Q or Q**T from the Right. |
uplo | host | input | uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. |
trans | host | input | operation op(Q) that is non- or (conj.) transpose. |
m | host | input | number of rows of matrix C. |
n | host | input | number of columns of matrix C. |
A | device | in/out | <type> array of dimension lda * m if side = CUBLAS_SIDE_LEFT; lda * n if side = CUBLAS_SIDE_RIGHT. The matrix A from sytrd contains the elementary reflectors. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. if side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). |
tau | device | input | <type> array of dimension (m-1) if side is CUBLAS_SIDE_LEFT; of dimension (n-1) if side is CUBLAS_SIDE_RIGHT; The vector tau is from sytrd, so tau(i) is the scalar of i-th elementary reflection vector. |
C | device | in/out | <type> array of size ldc * n. On exit, C is overwritten by op(Q)*C or C*op(Q). |
ldc | host | input | leading dimension of two-dimensional array of matrix C. ldc >= max(1,m). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working array work. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda or ldc). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>orgtr()
cusolverStatus_t cusolverDnSorgtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const float *A, int lda, const float *tau, int *lwork); cusolverStatus_t cusolverDnDorgtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const double *A, int lda, const double *tau, int *lwork); cusolverStatus_t cusolverDnCungtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZungtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSorgtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDorgtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZungtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function generates a unitary matrix Q which is defined as the product of n-1 elementary reflectors of order n, as returned by sytrd: $Q = H(n-1) \cdots H(2) \, H(1)$ if uplo = CUBLAS_FILL_MODE_UPPER, and $Q = H(1) \, H(2) \cdots H(n-1)$ if uplo = CUBLAS_FILL_MODE_LOWER.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by orgtr_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
uplo | host | input | uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. |
n | host | input | number of rows (columns) of matrix Q. |
A | device | in/out | <type> array of dimension lda * n. On entry, matrix A from sytrd contains the elementary reflectors. On exit, matrix A contains the n-by-n orthogonal matrix Q. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,n). |
tau | device | input | <type> array of dimension (n-1). tau(i) is the scalar of i-th elementary reflection vector. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of working array work. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or wrong lda ). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>gesvd()
cusolverStatus_t cusolverDnSgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnDgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnCgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnZgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork );
cusolverStatus_t cusolverDnSgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *VT, int ldvt, float *work, int lwork, float *rwork, int *devInfo); cusolverStatus_t cusolverDnDgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *VT, int ldvt, double *work, int lwork, double *rwork, int *devInfo);
cusolverStatus_t cusolverDnCgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *VT, int ldvt, cuComplex *work, int lwork, float *rwork, int *devInfo); cusolverStatus_t cusolverDnZgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *VT, int ldvt, cuDoubleComplex *work, int lwork, double *rwork, int *devInfo);
This function computes the singular value decomposition (SVD) of an m×n matrix A and the corresponding left and/or right singular vectors. The SVD is written $A = U \cdot \Sigma \cdot V^{H}$
where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvd_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If bdsqr did not converge, devInfo specifies how many superdiagonals of an intermediate bidiagonal form did not converge to zero.
The rwork is a real array of dimension (min(m,n)-1). If devInfo>0 and rwork is not NULL, rwork contains the unconverged superdiagonal elements of an upper bidiagonal matrix. This is slightly different from LAPACK, which puts unconverged superdiagonal elements in work if the type is real and in rwork if the type is complex. rwork can be a NULL pointer if the user does not want the information from the superdiagonal.
Appendix F.1 provides a simple example of gesvd.
Remark 1: gesvd only supports m>=n.
Remark 2: the routine returns $V^{H}$, not V.
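Since gesvd returns the conjugate transpose of V, the array VT enters the product as it is stored, without a further transpose. The sketch below illustrates this for real data by rebuilding A from host copies of the outputs. It is an assumption-laden illustration rather than library code: the name svd_reconstruct_vt is hypothetical, the arrays are assumed to have been copied back to the host, and all matrices are column-major.

```c
#include <assert.h>
#include <stddef.h>

/* Rebuild A = U * diag(S) * VT from host copies of real-valued gesvd outputs.
 * All matrices are column-major. Only the first r = min(m,n) columns of U and
 * the first r rows of VT contribute. VT is used exactly as returned. */
void svd_reconstruct_vt(int m, int n,
                        const double *U, int ldu,
                        const double *S,
                        const double *VT, int ldvt,
                        double *A, int lda)
{
    int r = m < n ? m : n;
    assert(ldu >= m && ldvt >= r && lda >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int k = 0; k < r; ++k)
                sum += U[i + (size_t)ldu * k] * S[k] * VT[k + (size_t)ldvt * j];
            A[i + (size_t)lda * j] = sum;   /* A(i,j) */
        }
    }
}
```

Comparing the rebuilt matrix with a saved copy of the input is one way to check a result, because gesvd destroys the contents of A on exit.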
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobu | host | input | specifies options for computing all or part of the matrix U: = 'A': all m columns of U are returned in array U; = 'S': the first min(m,n) columns of U (the left singular vectors) are returned in the array U; = 'O': the first min(m,n) columns of U (the left singular vectors) are overwritten on the array A; = 'N': no columns of U (no left singular vectors) are computed. |
jobvt | host | input | specifies options for computing all or part of the matrix V**T: = 'A': all n rows of V**T are returned in the array VT; = 'S': the first min(m,n) rows of V**T (the right singular vectors) are returned in the array VT; = 'O': the first min(m,n) rows of V**T (the right singular vectors) are overwritten on the array A; = 'N': no rows of V**T (no right singular vectors) are computed. |
m | host | input | number of rows of matrix A. |
n | host | input | number of columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, where lda is not less than max(1,m). On exit, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
S | device | output | real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). |
U | device | output | <type> array of dimension ldu * m, where ldu is not less than max(1,m). U contains the m×m unitary matrix U. |
ldu | host | input | leading dimension of two-dimensional array used to store matrix U. |
VT | device | output | <type> array of dimension ldvt * n, where ldvt is not less than max(1,n). VT contains the n×n unitary matrix V**T. |
ldvt | host | input | leading dimension of two-dimensional array used to store matrix VT. |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by gesvd_bufferSize. |
rwork | device | output | real array of dimension min(m,n)-1. It contains the unconverged superdiagonal elements of an upper bidiagonal matrix if devInfo > 0. |
devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. if devInfo > 0, devInfo indicates how many superdiagonals of an intermediate bidiagonal form did not converge to zero. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldvt<max(1,n) ). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>gesvdj()
cusolverStatus_t cusolverDnSgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const float *A, int lda, const float *S, const float *U, int ldu, const float *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnDgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const double *A, int lda, const double *S, const double *U, int ldu, const double *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnCgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const cuComplex *A, int lda, const float *S, const cuComplex *U, int ldu, const cuComplex *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnZgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const cuDoubleComplex *A, int lda, const double *S, const cuDoubleComplex *U, int ldu, const cuDoubleComplex *V, int ldv, int *lwork, gesvdjInfo_t params);
cusolverStatus_t cusolverDnSgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *V, int ldv, float *work, int lwork, int *info, gesvdjInfo_t params); cusolverStatus_t cusolverDnDgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *V, int ldv, double *work, int lwork, int *info, gesvdjInfo_t params);
cusolverStatus_t cusolverDnCgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *V, int ldv, cuComplex *work, int lwork, int *info, gesvdjInfo_t params); cusolverStatus_t cusolverDnZgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *V, int ldv, cuDoubleComplex *work, int lwork, int *info, gesvdjInfo_t params);
This function computes the singular value decomposition (SVD) of an m×n matrix A and the corresponding left and/or right singular vectors. The SVD is written $A = U \cdot \Sigma \cdot V^{H}$
where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
gesvdj has the same functionality as gesvd. The difference is that gesvd uses the QR algorithm and gesvdj uses the Jacobi method. The parallelism of the Jacobi method gives the GPU better performance on small and medium-sized matrices. Moreover, the user can configure gesvdj to perform approximation up to a certain accuracy.
gesvdj iteratively generates a sequence of unitary matrices to transform matrix A to the following form $U^{H} \cdot A \cdot V = S + E$
where S is diagonal and the diagonal of E is zero.
During the iterations, the Frobenius norm of E decreases monotonically. As E goes down to zero, S is the set of singular values. In practice, the Jacobi method stops if $\|E\|_{F} \le \mathrm{eps} \cdot \|A\|_{F}$
where eps is the given tolerance.
gesvdj has two parameters to control the accuracy. The first parameter is the tolerance (eps). The default value is machine accuracy, but the user can use the function cusolverDnXgesvdjSetTolerance to set an a priori tolerance. The second parameter is the maximum number of sweeps, which controls the number of iterations of the Jacobi method. The default value is 100, but the user can use the function cusolverDnXgesvdjSetMaxSweeps to set a proper bound. Experiments show that 15 sweeps are good enough to converge to machine accuracy. gesvdj stops when either the tolerance is met or the maximum number of sweeps is reached.
The Jacobi method has quadratic convergence, so the accuracy is not proportional to the number of sweeps. To guarantee a certain accuracy, the user should configure the tolerance only.
The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdj_bufferSize().
If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = min(m,n)+1, gesvdj does not converge under given tolerance and maximum sweeps.
If the user sets an improper tolerance, gesvdj may not converge. For example, tolerance should not be smaller than machine accuracy.
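The stopping rule described above compares the Frobenius norm of the off-diagonal part E against eps times the Frobenius norm of A. The sketch below shows one way to evaluate that test on the host for a real matrix B that plays the role of S + E. It illustrates the criterion only and is not how the library implements it: the name jacobi_converged is hypothetical, and the helper relies on the fact that unitary transformations preserve the Frobenius norm, so the norm of B can stand in for the norm of A. Squared norms are compared to avoid a square root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Evaluate the stopping test ||E||_F <= eps * ||A||_F for the m-by-n
 * column-major matrix B = S + E, where E is the off-diagonal part of B.
 * ||A||_F is taken to be ||B||_F, which holds when B was obtained from A by
 * unitary transformations. Both sides are non-negative, so comparing the
 * squared quantities gives the same answer. */
bool jacobi_converged(int m, int n, const double *B, int ldb, double eps)
{
    assert(m >= 0 && n >= 0 && ldb >= (m > 1 ? m : 1) && eps >= 0.0);

    double off2 = 0.0;    /* ||E||_F squared */
    double total2 = 0.0;  /* ||B||_F squared */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double v = B[i + (size_t)ldb * j];
            total2 += v * v;
            if (i != j)
                off2 += v * v;
        }
    }
    return off2 <= eps * eps * total2;
}
```

A tolerance below machine accuracy makes the right-hand side smaller than the rounding error in E, which is why such a setting may never satisfy the test.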
Appendix F.2 provides a simple example of gesvdj.
Remark 1: gesvdj supports any combination of m and n.
Remark 2: the routine returns V, not $V^{H}$. This is different from gesvd.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobz | host | input | specifies options to either compute singular value only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute singular values and singular vectors. |
econ | host | input | econ = 1 for economy size for U and V. |
m | host | input | number of rows of matrix A. |
n | host | input | number of columns of matrix A. |
A | device | in/out | <type> array of dimension lda * n, where lda is not less than max(1,m). On exit, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
S | device | output | real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). |
U | device | output | <type> array of dimension ldu * m if econ is zero. If econ is nonzero, the dimension is ldu * min(m,n). U contains the left singular vectors. |
ldu | host | input | leading dimension of two-dimensional array used to store matrix U. ldu is not less than max(1,m). |
V | device | output | <type> array of dimension ldv * n if econ is zero. If econ is nonzero, the dimension is ldv * min(m,n). V contains the right singular vectors. |
ldv | host | input | leading dimension of two-dimensional array used to store matrix V. ldv is not less than max(1,n). |
work | device | in/out | <type> array of size lwork, working space. |
lwork | host | input | size of work, returned by gesvdj_bufferSize. |
info | device | output | if info = 0, the operation is successful. if info = -i, the i-th parameter is wrong. if info = min(m,n)+1, gesvdj does not converge under given tolerance and maximum sweeps. |
params | host | in/out | structure filled with parameters of Jacobi algorithm and results of gesvdj. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldv<max(1,n) or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR ). |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
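The sizes of the S, U and V buffers depend on econ, as listed in the parameter table. The sketch below computes the element counts that a caller would pass to cudaMalloc after multiplying by the element size. It is a host-side convenience under stated assumptions: the struct gesvdj_sizes_t and the function gesvdj_output_sizes are hypothetical names, not part of cuSolverDN.

```c
#include <assert.h>
#include <stddef.h>

/* Element counts of the gesvdj output arrays (hypothetical helper type). */
typedef struct {
    size_t s_len;  /* entries in S: min(m,n)                        */
    size_t u_len;  /* entries in U: ldu*m, or ldu*min(m,n) if econ  */
    size_t v_len;  /* entries in V: ldv*n, or ldv*min(m,n) if econ  */
} gesvdj_sizes_t;

gesvdj_sizes_t gesvdj_output_sizes(int econ, int m, int n, int ldu, int ldv)
{
    assert(m >= 0 && n >= 0);
    assert(ldu >= (m > 1 ? m : 1) && ldv >= (n > 1 ? n : 1));

    size_t r = (size_t)(m < n ? m : n);
    gesvdj_sizes_t out;
    out.s_len = r;
    out.u_len = (size_t)ldu * (econ ? r : (size_t)m);
    out.v_len = (size_t)ldv * (econ ? r : (size_t)n);
    return out;
}
```

For a 5-by-3 matrix with ldu = 5 and ldv = 3, the full-size call needs 25 entries for U, while econ = 1 reduces this to 15; V needs 9 entries in both cases.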
cusolverDn<t>gesvdjBatched()
cusolverStatus_t cusolverDnSgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const float *A, int lda, const float *S, const float *U, int ldu, const float *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnDgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const double *A, int lda, const double *S, const double *U, int ldu, const double *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnCgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const cuComplex *A, int lda, const float *S, const cuComplex *U, int ldu, const cuComplex *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnZgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const cuDoubleComplex *A, int lda, const double *S, const cuDoubleComplex *U, int ldu, const cuDoubleComplex *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize);
cusolverStatus_t cusolverDnSgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *V, int ldv, float *work, int lwork, int *info, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnDgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *V, int ldv, double *work, int lwork, int *info, gesvdjInfo_t params, int batchSize);
cusolverStatus_t cusolverDnCgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *V, int ldv, cuComplex *work, int lwork, int *info, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnZgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *V, int ldv, cuDoubleComplex *work, int lwork, int *info, gesvdjInfo_t params, int batchSize);
This function computes singular values and singular vectors of a sequence of general m×n matrices $A_j = U_j \cdot \Sigma_j \cdot V_j^{H}$
where $\Sigma_j$ is a real m×n diagonal matrix which is zero except for its min(m,n) diagonal elements. $U_j$ (left singular vectors) is an m×m unitary matrix and $V_j$ (right singular vectors) is an n×n unitary matrix. The diagonal elements of $\Sigma_j$ are the singular values of $A_j$ in either descending order or non-sorting order.
gesvdjBatched performs gesvdj on each matrix. It requires that all matrices are of the same size, with m and n no greater than 32, and are packed contiguously,
$A = (A_0 \;\; A_1 \;\; \cdots)$
Each matrix is column-major with leading dimension lda, so the formula for random access is $A_k(i,j) = A[\,i + \mathrm{lda} \cdot j + \mathrm{lda} \cdot n \cdot k\,]$.
The parameter S also contains the singular values of each matrix contiguously,
$S = (S_0 \;\; S_1 \;\; \cdots)$
The formula for random access of S is $S_k(j) = S[\,j + \min(m,n) \cdot k\,]$.
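The packed layout can be expressed as two small index helpers, shown below as a sketch. The function names batched_a_index and batched_s_index are hypothetical; the offsets assume zero-based i, j and k, column-major storage with leading dimension lda, and lda*n entries per matrix as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i,j) of the k-th matrix in the packed array A:
 * i + lda*j + lda*n*k, with zero-based i, j, k. */
size_t batched_a_index(int i, int j, int k, int lda, int n)
{
    assert(i >= 0 && i < lda && j >= 0 && j < n && k >= 0);
    return (size_t)i + (size_t)lda * j + (size_t)lda * n * k;
}

/* Offset of the j-th singular value of the k-th matrix in the packed
 * array S: j + min(m,n)*k, with zero-based j, k. */
size_t batched_s_index(int j, int k, int m, int n)
{
    int r = m < n ? m : n;
    assert(j >= 0 && j < r && k >= 0);
    return (size_t)j + (size_t)r * k;
}
```

With lda = 4 and n = 3, each matrix occupies 12 entries, so element (1,2) of matrix 3 sits at offset 1 + 8 + 36 = 45.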
In addition to tolerance and maximum sweeps, gesvdjBatched can either sort the singular values in descending order (default) or leave them as-is (without sorting), as selected by the function cusolverDnXgesvdjSetSortEig. If the user packs several tiny matrices into diagonal blocks of one matrix, the non-sorting option can separate the singular values of those tiny matrices.
gesvdjBatched cannot report residual and executed sweeps by function cusolverDnXgesvdjGetResidual and cusolverDnXgesvdjGetSweeps. Any call of the above two returns CUSOLVER_STATUS_NOT_SUPPORTED. The user needs to compute residual explicitly.
The user has to provide working space pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdjBatched_bufferSize().
The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = -i (less than zero) indicates i-th parameter is wrong. Otherwise, if info[i] = min(m,n)+1, gesvdjBatched does not converge on i-th matrix under given tolerance and maximum sweeps.
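After the call returns successfully, the info array can be scanned for matrices that did not converge. The sketch below counts entries equal to min(m,n)+1. It is a host-side illustration under stated assumptions: the name count_unconverged is hypothetical, and info_host is assumed to be a host copy of the device array info (for example obtained with cudaMemcpy).

```c
#include <assert.h>

/* Count the matrices whose info entry equals min(m,n)+1, which the text above
 * defines as "did not converge under the given tolerance and maximum sweeps".
 * info_host is assumed to be a host copy of the device array info. */
int count_unconverged(const int *info_host, int batchSize, int m, int n)
{
    assert(batchSize >= 0 && m >= 0 && n >= 0);

    int flag = (m < n ? m : n) + 1;
    int count = 0;
    for (int i = 0; i < batchSize; ++i)
        if (info_host[i] == flag)
            ++count;
    return count;
}
```

This check applies only when the function did not return CUSOLVER_STATUS_INVALID_VALUE; in that case info[0] carries the negated index of the offending parameter instead.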
Appendix F.3 provides a simple example of gesvdjBatched.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobz | host | input | specifies options to either compute singular value only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute singular values and singular vectors. |
m | host | input | number of rows of matrix Aj. m is no greater than 32. |
n | host | input | number of columns of matrix Aj. n is no greater than 32. |
A | device | in/out | <type> array of dimension lda * n * batchSize, where lda is not less than max(1,m). On exit, the contents of Aj are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix Aj. |
S | device | output | a real array of dimension min(m,n)*batchSize. It stores the singular values of Aj in descending order or non-sorting order. |
U | device | output | <type> array of dimension ldu * m * batchSize. Uj contains the left singular vectors of Aj. |
ldu | host | input | leading dimension of two-dimensional array used to store matrix Uj. ldu is not less than max(1,m). |
V | device | output | <type> array of dimension ldv * n * batchSize. Vj contains the right singular vectors of Aj. |
ldv | host | input | leading dimension of two-dimensional array used to store matrix Vj. ldv is not less than max(1,n). |
work | device | in/out | <type> array of size lwork, working space. |
lwork | host | input | size of work, returned by gesvdjBatched_bufferSize. |
info | device | output | an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = 0, the operation is successful. If info[i] = min(m,n)+1, gesvdjBatched does not converge on the i-th matrix under the given tolerance and maximum sweeps. |
params | host | in/out | structure filled with parameters of Jacobi algorithm. |
batchSize | host | input | number of matrices. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldv<max(1,n) or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR , or batchSize<0 ). |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>syevd()
cusolverStatus_t cusolverDnSsyevd_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const float *A, int lda, const float *W, int *lwork); cusolverStatus_t cusolverDnDsyevd_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const double *A, int lda, const double *W, int *lwork); cusolverStatus_t cusolverDnCheevd_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const float *W, int *lwork); cusolverStatus_t cusolverDnZheevd_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const double *W, int *lwork);
cusolverStatus_t cusolverDnSsyevd( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, float *A, int lda, float *W, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDsyevd( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, double *A, int lda, double *W, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCheevd( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuComplex *A, int lda, float *W, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZheevd( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, double *W, cuDoubleComplex *work, int lwork, int *devInfo);
This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix A. The standard symmetric eigenvalue problem is
A*V = V*Λ
where Λ is a real n×n diagonal matrix. V is an n×n unitary matrix. The diagonal elements of Λ are the eigenvalues of A in ascending order.
The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevd_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If devInfo = i (greater than zero), i off-diagonal elements of an intermediate tridiagonal form did not converge to zero.
If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthonormal eigenvectors of the matrix A. The eigenvectors are computed by a divide and conquer algorithm.
Appendix E.1 provides a simple example of syevd.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobz | host | input | specifies options to either compute eigenvalue only or compute eigen-pair: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute eigenvalues and eigenvectors. |
uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A is stored. |
n | host | input | number of rows (or columns) of matrix A. |
A | device | in/out | <type> array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR, and devInfo = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
W | device | output | a real array of dimension n. The eigenvalues of A in ascending order, i.e., sorted so that W(i) <= W(i+1). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by syevd_bufferSize. |
devInfo | device | output | If devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i (> 0), i off-diagonal elements of an intermediate tridiagonal form did not converge to zero. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda<max(1,n), or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR, or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device's compute capability is not supported; only compute capability 2.0 and above is supported. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>sygvd()
cusolverStatus_t cusolverDnSsygvd_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const float *A, int lda, const float *B, int ldb, const float *W, int *lwork); cusolverStatus_t cusolverDnDsygvd_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const double *A, int lda, const double *B, int ldb, const double *W, int *lwork); cusolverStatus_t cusolverDnChegvd_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const cuComplex *B, int ldb, const float *W, int *lwork); cusolverStatus_t cusolverDnZhegvd_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const double *W, int *lwork);
cusolverStatus_t cusolverDnSsygvd( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, float *A, int lda, float *B, int ldb, float *W, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDsygvd( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, double *A, int lda, double *B, int ldb, double *W, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnChegvd( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuComplex *A, int lda, cuComplex *B, int ldb, float *W, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZhegvd( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *B, int ldb, double *W, cuDoubleComplex *work, int lwork, int *devInfo);
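This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix-pair (A,B). The generalized symmetric-definite eigenvalue problem is
A*V = B*V*Λ if itype = CUSOLVER_EIG_TYPE_1
A*B*V = V*Λ if itype = CUSOLVER_EIG_TYPE_2
B*A*V = V*Λ if itype = CUSOLVER_EIG_TYPE_3
where the matrix B is positive definite. Λ is a real n×n diagonal matrix. The diagonal elements of Λ are the eigenvalues of (A, B) in ascending order. V is an n×n orthogonal matrix. The eigenvectors are normalized as follows:
V^H*B*V = I if itype = CUSOLVER_EIG_TYPE_1 or CUSOLVER_EIG_TYPE_2
V^H*inv(B)*V = I if itype = CUSOLVER_EIG_TYPE_3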
The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by sygvd_bufferSize().
If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If devInfo = i (i > 0 and i<=n) and jobz = CUSOLVER_EIG_MODE_NOVECTOR, i off-diagonal elements of an intermediate tridiagonal form did not converge to zero. If devInfo = n + i (i > 0), then the leading minor of order i of B is not positive definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the eigenvectors of the matrix pair (A, B), normalized as described above. The eigenvectors are computed by a divide and conquer algorithm.
Appendix E.2 provides a simple example of sygvd.
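The three ranges of devInfo described above can be decoded mechanically once the value has been copied to the host. The sketch below is illustrative only: the enum and function names are hypothetical, and it ignores the jobz condition attached to the 1..n range.

```c
#include <assert.h>

/* Hypothetical classification of a sygvd devInfo value. */
typedef enum {
    SYGVD_OK,             /* devInfo == 0 */
    SYGVD_BAD_PARAMETER,  /* devInfo == -i: the i-th parameter is wrong */
    SYGVD_NOT_CONVERGED,  /* 0 < devInfo <= n */
    SYGVD_B_NOT_POSDEF    /* devInfo == n + i, i > 0 */
} sygvd_outcome;

/* Classify devInfo for an n-by-n problem. *detail receives the parameter
 * index, the number of unconverged off-diagonal elements, or the order of
 * the leading minor of B that is not positive definite. */
static sygvd_outcome classify_sygvd_info(int devInfo, int n, int *detail)
{
    assert(n >= 0 && detail != 0);
    if (devInfo == 0) { *detail = 0; return SYGVD_OK; }
    if (devInfo < 0)  { *detail = -devInfo; return SYGVD_BAD_PARAMETER; }
    if (devInfo <= n) { *detail = devInfo; return SYGVD_NOT_CONVERGED; }
    *detail = devInfo - n;
    return SYGVD_B_NOT_POSDEF;
}
```

With n = 5, a devInfo of 7 would be reported as SYGVD_B_NOT_POSDEF with detail 2, meaning the leading minor of order 2 of B is not positive definite.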
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
itype | host | input | Specifies the problem type to be solved: itype=CUSOLVER_EIG_TYPE_1: A*x = (lambda)*B*x. itype=CUSOLVER_EIG_TYPE_2: A*B*x = (lambda)*x. itype=CUSOLVER_EIG_TYPE_3: B*A*x = (lambda)*x. |
jobz | host | input | specifies options to either compute eigenvalue only or compute eigen-pair: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute eigenvalues and eigenvectors. |
uplo | host | input | specifies which part of A and B are stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A and B are stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A and B are stored. |
n | host | input | number of rows (or columns) of matrix A and B. |
A | device | in/out | <type> array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR, and devInfo = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda is not less than max(1,n). |
B | device | in/out | <type> array of dimension ldb * n. If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of B contains the upper triangular part of the matrix B. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of B contains the lower triangular part of the matrix B. On exit, if devInfo is less than n, B is overwritten by triangular factor U or L from the Cholesky factorization of B. |
ldb | host | input | leading dimension of two-dimensional array used to store matrix B. ldb is not less than max(1,n). |
W | device | output | a real array of dimension n. The eigenvalues of (A, B) in ascending order, i.e., sorted so that W(i) <= W(i+1). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by sygvd_bufferSize. |
devInfo | device | output | If devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i (> 0), either potrf or syevd failed. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda<max(1,n), or ldb<max(1,n), or itype is not 1, 2 or 3, or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR, or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_ARCH_MISMATCH | the device's compute capability is not supported; only compute capability 2.0 and above is supported. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>syevj()
cusolverStatus_t cusolverDnSsyevj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const float *A, int lda, const float *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnDsyevj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const double *A, int lda, const double *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnCheevj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const float *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnZheevj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const double *W, int *lwork, syevjInfo_t params);
cusolverStatus_t cusolverDnSsyevj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, float *A, int lda, float *W, float *work, int lwork, int *info, syevjInfo_t params); cusolverStatus_t cusolverDnDsyevj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, double *A, int lda, double *W, double *work, int lwork, int *info, syevjInfo_t params);
cusolverStatus_t cusolverDnCheevj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuComplex *A, int lda, float *W, cuComplex *work, int lwork, int *info, syevjInfo_t params); cusolverStatus_t cusolverDnZheevj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, double *W, cuDoubleComplex *work, int lwork, int *info, syevjInfo_t params);
This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix A. The standard symmetric eigenvalue problem is
A*Q = Q*Λ
where Λ is a real n×n diagonal matrix. Q is an n×n unitary matrix. The diagonal elements of Λ are the eigenvalues of A in ascending order.
syevj has the same functionality as syevd. The difference is that syevd uses the QR algorithm and syevj uses the Jacobi method. The parallelism of the Jacobi method gives the GPU better performance on small and medium-sized matrices. Moreover, the user can configure syevj to compute an approximation up to a certain accuracy.
How does it work?
syevj iteratively generates a sequence of unitary matrices to transform matrix A to the following form
V^H*A*V = W + E
where W is diagonal and E is symmetric with zero diagonal.
During the iterations, the Frobenius norm of E decreases monotonically. As E goes to zero, the diagonal of W converges to the set of eigenvalues. In practice, the Jacobi method stops if
||E||_F <= eps * ||A||_F
where eps is the given tolerance.
syevj has two parameters to control the accuracy. The first parameter is the tolerance (eps). The default value is machine accuracy, but the user can use function cusolverDnXsyevjSetTolerance to set an a priori tolerance. The second parameter is the maximum number of sweeps, which controls the number of iterations of the Jacobi method. The default value is 100, but the user can use function cusolverDnXsyevjSetMaxSweeps to set a proper bound. Experiments show that 15 sweeps are good enough to converge to machine accuracy. syevj stops when either the tolerance is met or the maximum number of sweeps is reached.
The Jacobi method has quadratic convergence, so the accuracy is not proportional to the number of sweeps. To guarantee a certain accuracy, the user should configure the tolerance only.
After syevj, the user can query the residual by function cusolverDnXsyevjGetResidual and the number of executed sweeps by function cusolverDnXsyevjGetSweeps. However, the user needs to be aware that the residual is the Frobenius norm of E, not the accuracy of an individual eigenvalue, i.e.
residual = ||E||_F = ||Λ - W||_F
As with syevd, the user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevj_bufferSize().
If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = n+1, syevj does not converge under given tolerance and maximum sweeps.
If the user sets an improper tolerance, syevj may not converge. For example, tolerance should not be smaller than machine accuracy.
If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthonormal eigenvectors V.
Appendix E.3 provides a simple example of syevj.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobz | host | input | specifies options to either compute eigenvalue only or compute eigen-pair: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute eigenvalues and eigenvectors. |
uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A is stored. |
n | host | input | number of rows (or columns) of matrix A. |
A | device | in/out | <type> array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR, and info = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
W | device | output | a real array of dimension n. The eigenvalues of A in ascending order, i.e., sorted so that W(i) <= W(i+1). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by syevj_bufferSize. |
info | device | output | If info = 0, the operation is successful. If info = -i, the i-th parameter is wrong. If info = n+1, syevj does not converge under the given tolerance and maximum sweeps. |
params | host | in/out | structure filled with parameters of Jacobi algorithm and results of syevj. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda<max(1,n), or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR, or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>sygvj()
cusolverStatus_t cusolverDnSsygvj_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const float *A, int lda, const float *B, int ldb, const float *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnDsygvj_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const double *A, int lda, const double *B, int ldb, const double *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnChegvj_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const cuComplex *B, int ldb, const float *W, int *lwork, syevjInfo_t params); cusolverStatus_t cusolverDnZhegvj_bufferSize( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const double *W, int *lwork, syevjInfo_t params);
cusolverStatus_t cusolverDnSsygvj( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, float *A, int lda, float *B, int ldb, float *W, float *work, int lwork, int *info, syevjInfo_t params); cusolverStatus_t cusolverDnDsygvj( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, double *A, int lda, double *B, int ldb, double *W, double *work, int lwork, int *info, syevjInfo_t params);
cusolverStatus_t cusolverDnChegvj( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuComplex *A, int lda, cuComplex *B, int ldb, float *W, cuComplex *work, int lwork, int *info, syevjInfo_t params); cusolverStatus_t cusolverDnZhegvj( cusolverDnHandle_t handle, cusolverEigType_t itype, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *B, int ldb, double *W, cuDoubleComplex *work, int lwork, int *info, syevjInfo_t params);
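This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix-pair (A,B). The generalized symmetric-definite eigenvalue problem is
A*V = B*V*Λ if itype = CUSOLVER_EIG_TYPE_1
A*B*V = V*Λ if itype = CUSOLVER_EIG_TYPE_2
B*A*V = V*Λ if itype = CUSOLVER_EIG_TYPE_3
where the matrix B is positive definite. Λ is a real n×n diagonal matrix. The diagonal elements of Λ are the eigenvalues of (A, B) in ascending order. V is an n×n orthogonal matrix. The eigenvectors are normalized as follows:
V^H*B*V = I if itype = CUSOLVER_EIG_TYPE_1 or CUSOLVER_EIG_TYPE_2
V^H*inv(B)*V = I if itype = CUSOLVER_EIG_TYPE_3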
This function has the same functionality as sygvd except that syevd in sygvd is replaced by syevj in sygvj. Therefore, sygvj inherits the properties of syevj: the user can use cusolverDnXsyevjSetTolerance and cusolverDnXsyevjSetMaxSweeps to configure tolerance and maximum sweeps.
However, the meaning of the residual is different from syevj. sygvj first computes the Cholesky factorization of matrix B,
B = L*L^H
transforms the problem to a standard eigenvalue problem, then calls syevj.
For example, the standard eigenvalue problem of type I is
M*Q = Q*Λ
where matrix M is symmetric
M = inv(L)*A*inv(L)^H
The residual is the result of syevj on matrix M, not A.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by sygvj_bufferSize().
If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = i (i > 0 and i<=n), B is not positive definite, the factorization of B could not be completed and no eigenvalues or eigenvectors were computed. If info = n+1, syevj does not converge under the given tolerance and maximum sweeps. In this case, the eigenvalues and eigenvectors are still computed because the non-convergence comes from an improper tolerance or maximum number of sweeps.
If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthogonal eigenvectors V.
Appendix E.4 provides a simple example of sygvj.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
itype | host | input | Specifies the problem type to be solved: itype=CUSOLVER_EIG_TYPE_1: A*x = (lambda)*B*x. itype=CUSOLVER_EIG_TYPE_2: A*B*x = (lambda)*x. itype=CUSOLVER_EIG_TYPE_3: B*A*x = (lambda)*x. |
jobz | host | input | specifies options to either compute eigenvalue only or compute eigen-pair: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute eigenvalues and eigenvectors. |
uplo | host | input | specifies which part of A and B are stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A and B are stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A and B are stored. |
n | host | input | number of rows (or columns) of matrix A and B. |
A | device | in/out | <type> array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR, and info = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda is not less than max(1,n). |
B | device | in/out | <type> array of dimension ldb * n. If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of B contains the upper triangular part of the matrix B. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of B contains the lower triangular part of the matrix B. On exit, if info is less than n, B is overwritten by triangular factor U or L from the Cholesky factorization of B. |
ldb | host | input | leading dimension of two-dimensional array used to store matrix B. ldb is not less than max(1,n). |
W | device | output | a real array of dimension n. The eigenvalues of (A, B) in ascending order, i.e., sorted so that W(i) <= W(i+1). |
work | device | in/out | working space, <type> array of size lwork. |
lwork | host | input | size of work, returned by sygvj_bufferSize. |
info | device | output | if info = 0, the operation is successful. if info = -i, the i-th parameter is wrong. if info = i (> 0), info indicates either B is not positive definite or syevj (called by sygvj) does not converge. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda<max(1,n), or ldb<max(1,n), or itype is not 1, 2 or 3, or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR, or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverDn<t>syevjBatched()
cusolverStatus_t cusolverDnSsyevjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const float *A, int lda, const float *W, int *lwork, syevjInfo_t params, int batchSize ); cusolverStatus_t cusolverDnDsyevjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const double *A, int lda, const double *W, int *lwork, syevjInfo_t params, int batchSize ); cusolverStatus_t cusolverDnCheevjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const float *W, int *lwork, syevjInfo_t params, int batchSize ); cusolverStatus_t cusolverDnZheevjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const double *W, int *lwork, syevjInfo_t params, int batchSize );
cusolverStatus_t cusolverDnSsyevjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, float *A, int lda, float *W, float *work, int lwork, int *info, syevjInfo_t params, int batchSize ); cusolverStatus_t cusolverDnDsyevjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, double *A, int lda, double *W, double *work, int lwork, int *info, syevjInfo_t params, int batchSize );
cusolverStatus_t cusolverDnCheevjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuComplex *A, int lda, float *W, cuComplex *work, int lwork, int *info, syevjInfo_t params, int batchSize ); cusolverStatus_t cusolverDnZheevjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, double *W, cuDoubleComplex *work, int lwork, int *info, syevjInfo_t params, int batchSize );
This function computes eigenvalues and eigenvectors of a sequence of symmetric (Hermitian) n×n matrices
Aj*Qj = Qj*Λj
where Λj is a real n×n diagonal matrix. Qj is an n×n unitary matrix. The diagonal elements of Λj are the eigenvalues of Aj, either in ascending order or unsorted.
syevjBatched performs syevj on each matrix. It requires that all matrices have the same size n, no greater than 32, and that they are packed contiguously,
A = (A0, A1, ...)
Each matrix is column-major with leading dimension lda, so the formula for random access is Ak(i,j) = A[i + lda*j + lda*n*k].
The parameter W also contains the eigenvalues of each matrix, packed contiguously,
W = (W0, W1, ...)
The formula for random access of W is Wk(j) = W[j + n*k].
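When the input matrices live in separate host arrays, they have to be copied into the contiguous layout before being uploaded. The sketch below shows one way to do that on the host, assuming matrix k starts at offset lda*n*k in A and its eigenvalues at offset n*k in W; pack_batch and batched_w_offset are hypothetical helpers, not cuSolver functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy batchSize separate n-by-n column-major matrices (each stored
 * densely with leading dimension n) into one contiguous buffer in which
 * matrix k starts at offset lda * n * k. Padding rows (lda > n) are zeroed. */
static void pack_batch(int n, int lda, int batchSize,
                       const double *const *mats, double *packed)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1) && batchSize >= 0);
    memset(packed, 0, sizeof(double) * (size_t)lda * n * batchSize);
    for (int k = 0; k < batchSize; ++k)
        for (int j = 0; j < n; ++j)
            memcpy(packed + (size_t)lda * n * k + (size_t)lda * j,
                   mats[k] + (size_t)n * j, sizeof(double) * (size_t)n);
}

/* Offset of the j-th eigenvalue of matrix k in the packed array W. */
static size_t batched_w_offset(int j, int k, int n)
{
    assert(j >= 0 && j < n && k >= 0);
    return (size_t)j + (size_t)n * k;
}
```

The packed host buffer can then be transferred to the device with a single cudaMemcpy() call instead of one copy per matrix.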
In addition to tolerance and maximum sweeps, syevjBatched can either sort the eigenvalues in ascending order (default) or leave them as-is (without sorting); the behavior is selected with the function cusolverDnXsyevjSetSortEig. If the user packs several tiny matrices into diagonal blocks of one matrix, the non-sorting option keeps the spectra of those tiny matrices separate.
syevjBatched cannot report the residual or the number of executed sweeps through cusolverDnXsyevjGetResidual and cusolverDnXsyevjGetSweeps. Any call to either function returns CUSOLVER_STATUS_NOT_SUPPORTED. The user needs to compute the residual explicitly.
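One way to compute a residual explicitly is to evaluate A*V - V*diag(W) on the host for each matrix of the batch. The sketch below assumes real data, that the caller kept a copy of the original full symmetric matrices (the solver overwrites A), and that A and V share the same packed layout; it returns the squared Frobenius norm to avoid a square root. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Frobenius norm of A*V - V*diag(W) for one real n-by-n problem.
 * A and V are column-major with leading dimensions lda and ldv; A must
 * hold the full symmetric matrix (both triangles). */
static double eig_residual_sq(int n, const double *A, int lda,
                              const double *V, int ldv, const double *W)
{
    double sum = 0.0;
    assert(n >= 0 && lda >= n && ldv >= n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double r = -V[i + ldv * j] * W[j];          /* -(V*diag(W))(i,j) */
            for (int k = 0; k < n; ++k)
                r += A[i + lda * k] * V[k + ldv * j];   /* +(A*V)(i,j) */
            sum += r * r;
        }
    }
    return sum;
}

/* Largest squared residual over a packed batch: matrix k of A and V starts
 * at offset lda * n * k, and its eigenvalues start at offset n * k in W. */
static double batched_max_residual_sq(int n, int lda, int batchSize,
                                      const double *A, const double *V,
                                      const double *W)
{
    double worst = 0.0;
    for (int k = 0; k < batchSize; ++k) {
        size_t off = (size_t)lda * n * k;
        double r = eig_residual_sq(n, A + off, lda, V + off, lda,
                                   W + (size_t)n * k);
        if (r > worst) worst = r;
    }
    return worst;
}
```

Comparing the result against the square of the desired tolerance gives a per-batch acceptance test that does not depend on the solver's own convergence report.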
The user has to provide working space pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevjBatched_bufferSize().
The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = n+1, syevjBatched did not converge on the i-th matrix under the given tolerance and maximum sweeps.
If jobz = CUSOLVER_EIG_MODE_VECTOR, Aj contains the orthonormal eigenvectors of the matrix Aj.
Appendix E.5 provides a simple example of syevjBatched.
parameter | Memory | In/out | Meaning |
handle | host | input | handle to the cuSolverDN library context. |
jobz | host | input | specifies options to either compute eigenvalue only or compute eigen-pair: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute eigenvalues and eigenvectors. |
uplo | host | input | specifies which part of Aj is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of Aj is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of Aj is stored. |
n | host | input | number of rows (or columns) of each matrix Aj. n is no greater than 32. |
A | device | in/out | <type> array of dimension lda * n * batchSize with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of Aj contains the upper triangular part of the matrix Aj. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of Aj contains the lower triangular part of the matrix Aj. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR, and info[j] = 0, Aj contains the orthonormal eigenvectors of the matrix Aj. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of Aj are destroyed. |
lda | host | input | leading dimension of two-dimensional array used to store matrix Aj. |
W | device | output | a real array of dimension n*batchSize. It stores the eigenvalues of Aj in ascending order or non-sorting order. |
work | device | in/out | <type> array of size lwork, workspace. |
lwork | host | input | size of work, returned by syevjBatched_bufferSize. |
info | device | output | an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = 0, the operation is successful. If info[i] = n+1, syevjBatched does not converge on the i-th matrix under the given tolerance and maximum sweeps. |
params | host | in/out | structure filled with parameters of Jacobi algorithm. |
batchSize | host | input | number of matrices. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, n>32, lda<max(1,n), batchSize<0, jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR, or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
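The packed batch arguments described in the table above can be navigated with a little index arithmetic. The following is a minimal host-side sketch, not part of the cuSolver API: it assumes column-major storage and that matrix Aj starts at offset j*lda*n inside A (an assumption consistent with the stated array dimension lda * n * batchSize). The helper names batched_elem_offset and count_unconverged are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) of matrix Aj inside the packed array A.
 * Assumption of this sketch: column-major storage, Aj starts at A + j*lda*n. */
size_t batched_elem_offset(int lda, int n, int j, int row, int col)
{
    assert(n > 0 && lda >= n && j >= 0);
    assert(row >= 0 && row < n && col >= 0 && col < n);
    return (size_t)j * (size_t)lda * (size_t)n
         + (size_t)col * (size_t)lda
         + (size_t)row;
}

/* Count the matrices whose info entry reports non-convergence (info[j] == n+1).
 * info must be a host copy of the device array. */
int count_unconverged(const int *info, int batchSize, int n)
{
    int count = 0;
    for (int j = 0; j < batchSize; ++j) {
        if (info[j] == n + 1) {
            ++count;
        }
    }
    return count;
}
```

Under the same assumption, the eigenvalues of Aj would occupy W[j*n] through W[j*n + n - 1].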
cuSolverSP: sparse LAPACK Function Reference
This chapter describes the API of cuSolverSP, which provides a subset of LAPACK functions for sparse matrices in CSR or CSC format.
Helper Function Reference
cusolverSpCreate()
cusolverStatus_t cusolverSpCreate(cusolverSpHandle_t *handle)
This function initializes the cuSolverSP library and creates a handle on the cuSolver context. It must be called before any other cuSolverSP API function is invoked. It allocates hardware resources necessary for accessing the GPU.
handle | the pointer to the handle to the cuSolverSP context. |
CUSOLVER_STATUS_SUCCESS | the initialization succeeded. |
CUSOLVER_STATUS_NOT_INITIALIZED | the CUDA Runtime initialization failed. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
cusolverSpDestroy()
cusolverStatus_t cusolverSpDestroy(cusolverSpHandle_t handle)
This function releases CPU-side resources used by the cuSolverSP library.
handle | the handle to the cuSolverSP context. |
CUSOLVER_STATUS_SUCCESS | the shutdown succeeded. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverSpSetStream()
cusolverStatus_t cusolverSpSetStream(cusolverSpHandle_t handle, cudaStream_t streamId)
This function sets the stream to be used by the cuSolverSP library to execute its routines.
handle | the handle to the cuSolverSP context. |
streamId | the stream to be used by the library. |
CUSOLVER_STATUS_SUCCESS | the stream was set successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverSpXcsrissym()
cusolverStatus_t cusolverSpXcsrissymHost(cusolverSpHandle_t handle, int m, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrEndPtrA, const int *csrColIndA, int *issym);
This function checks whether A has a symmetric pattern. The output parameter issym reports 1 if A is symmetric; otherwise, it reports 0.
The matrix A is an m×m sparse matrix that is defined in CSR storage format by the four arrays csrValA, csrRowPtrA, csrEndPtrA and csrColIndA. Only the three index arrays are passed to this function, because the check concerns the sparsity pattern.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL.
csrlsvlu and csrlsvqr do not accept non-general matrices. The user has to extend the matrix into its missing upper/lower part; otherwise the result is not as expected. The user can call csrissym to check whether the matrix has a symmetric pattern.
Remark 1: only CPU path is provided.
Remark 2: the user has to check the returned status to get valid information. The function converts A to CSC format and compares the CSR and CSC formats. If the CSC conversion fails because of insufficient resources, issym is undefined, and this state can only be detected through the return status code.
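To make the meaning of issym concrete, the following host-side sketch tests pattern symmetry directly on the CSR index arrays. It is an illustration only: the helper names are hypothetical, and it uses a plain per-entry lookup rather than the CSR-to-CSC comparison that the library performs.

```c
#include <assert.h>

/* Return 1 if entry (row, col) is stored in the CSR pattern, else 0.
 * rowPtr[i] and endPtr[i] delimit row i; base is the index base (0 or 1). */
int csr_has_entry(const int *rowPtr, const int *endPtr, const int *colInd,
                  int base, int row, int col)
{
    for (int k = rowPtr[row] - base; k < endPtr[row] - base; ++k) {
        if (colInd[k] - base == col) {
            return 1;
        }
    }
    return 0;
}

/* Return 1 if every stored (i,j) has a stored counterpart (j,i), else 0. */
int csr_pattern_is_symmetric(int m, const int *rowPtr, const int *endPtr,
                             const int *colInd, int base)
{
    assert(m > 0 && (base == 0 || base == 1));
    for (int i = 0; i < m; ++i) {
        for (int k = rowPtr[i] - base; k < endPtr[i] - base; ++k) {
            int j = colInd[k] - base;
            if (!csr_has_entry(rowPtr, endPtr, colInd, base, j, i)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The lookup is quadratic in the row length, which is acceptable for a sketch; a transpose-based comparison scales better for large matrices.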
parameter | MemorySpace | description |
handle | host | handle to the cuSolverSP library context. |
m | host | number of rows and columns of matrix A. |
nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | host | integer array of m elements that contains the start of every row. |
csrEndPtrA | host | integer array of m elements that contains the end of every row plus one. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |
parameter | MemorySpace | description |
issym | host | 1 if A is symmetric; 0 otherwise. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
High Level Function Reference
This section describes high level API of cuSolverSP, including linear solver, least-square solver and eigenvalue solver. The high-level API is designed for ease-of-use, so it allocates any required memory under the hood automatically. If the host or GPU system memory is not enough, an error is returned.
6.2.1. cusolverSp<t>csrlsvlu()
cusolverStatus_t cusolverSpScsrlsvlu[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, const float *b, float tol, int reorder, float *x, int *singularity);
cusolverStatus_t cusolverSpDcsrlsvlu[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, const double *b, double tol, int reorder, double *x, int *singularity);
cusolverStatus_t cusolverSpCcsrlsvlu[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuComplex *b, float tol, int reorder, cuComplex *x, int *singularity);
cusolverStatus_t cusolverSpZcsrlsvlu[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuDoubleComplex *b, double tol, int reorder, cuDoubleComplex *x, int *singularity);
This function solves the linear system $A x = b$.
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size n, and x is the solution vector of size n.
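As an illustration of the three-array CSR layout, the sketch below multiplies a CSR matrix by a vector on the host and recovers the nonzero count from the row pointer array. It assumes the usual convention that csrRowPtr holds n+1 entries (the start of each row plus one trailing end marker); the function names are hypothetical and nothing here calls cuSolver.

```c
#include <assert.h>

/* y = A*x for an n-by-n CSR matrix. csrRowPtr has n+1 entries;
 * base is the index base (0 or 1) recorded in the matrix descriptor. */
void csr_matvec(int n, const double *csrVal, const int *csrRowPtr,
                const int *csrColInd, int base, const double *x, double *y)
{
    assert(n > 0 && (base == 0 || base == 1));
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = csrRowPtr[i] - base; k < csrRowPtr[i + 1] - base; ++k) {
            sum += csrVal[k] * x[csrColInd[k] - base];
        }
        y[i] = sum;
    }
}

/* Number of stored nonzeros implied by the row pointer array;
 * the index base cancels out in the difference. */
int csr_nnz(int n, const int *csrRowPtr)
{
    return csrRowPtr[n] - csrRowPtr[0];
}
```

A routine like csr_matvec is also a convenient way to check a computed solution by forming the residual b - A*x on the host.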
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If matrix A is symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result would be wrong.
The linear system is solved by sparse LU with partial pivoting, $P A = L U$.
The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of LU factorization. The input parameter reorder enables symrcm if reorder is 1 and symamd if reorder is 2; otherwise, no reordering is performed.
If reorder is nonzero, csrlsvlu does $P A Q^T = L U$, where $Q = \mathrm{symrcm}(A + A^T)$.
If A is singular under the given tolerance (max(tol,0)), then some diagonal elements of U are zero, i.e. $|U(j,j)| < \mathrm{tol}$ for some $j$.
The output parameter singularity is the smallest index of such j. If A is non-singular, singularity is -1. The index is base-0, independent of the base index of A. For example, if the 2nd column of A is the same as the first column, then A is singular and singularity = 1, which means U(1,1)≈0.
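The meaning of singularity can be reproduced on a small dense matrix. The sketch below is a plain dense LU with partial pivoting, not the sparse algorithm used by csrlsvlu; the function name is hypothetical, and the comparison uses <= against max(tol,0) (an assumption of this sketch) so that an exactly zero pivot is caught when tol is 0.

```c
#include <assert.h>
#include <math.h>

/* In-place LU with partial pivoting on a dense row-major n-by-n matrix.
 * Returns the smallest 0-based j whose pivot satisfies |U(j,j)| <= max(tol,0),
 * or -1 if no such pivot is met. */
int dense_lu_singularity(int n, double *a, double tol)
{
    assert(n > 0);
    double eps = tol > 0.0 ? tol : 0.0;
    for (int j = 0; j < n; ++j) {
        /* pick the largest entry of column j on or below the diagonal */
        int piv = j;
        for (int i = j + 1; i < n; ++i) {
            if (fabs(a[i * n + j]) > fabs(a[piv * n + j])) {
                piv = i;
            }
        }
        if (fabs(a[piv * n + j]) <= eps) {
            return j;
        }
        /* swap rows j and piv */
        for (int c = 0; c < n; ++c) {
            double t = a[j * n + c];
            a[j * n + c] = a[piv * n + c];
            a[piv * n + c] = t;
        }
        /* eliminate below the pivot; multipliers are stored in place of L */
        for (int i = j + 1; i < n; ++i) {
            double l = a[i * n + j] / a[j * n + j];
            a[i * n + j] = l;
            for (int c = j + 1; c < n; ++c) {
                a[i * n + c] -= l * a[j * n + c];
            }
        }
    }
    return -1;
}
```

For a matrix whose second column repeats the first, the elimination of column 0 zeroes column 1 below the diagonal, so the function reports index 1, matching the example in the text.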
Remark 1: csrlsvlu performs traditional LU with partial pivoting; the pivot of the k-th column is determined dynamically based on the k-th column of the intermediate matrix. csrlsvlu follows Gilbert and Peierls's algorithm [4], which uses depth-first search and topological ordering to solve the triangular system (Davis also describes this algorithm in detail in his book [1]). Before performing LU factorization, csrlsvlu over-estimates the size of L and U, and allocates a buffer to contain the factors L and U. George and Ng [5] prove that the sparsity pattern of the Cholesky factor of $A^T A$ is a superset of the sparsity pattern of L and U. Furthermore, they propose an algorithm to find the sparsity pattern of the QR factorization, which is a superset of that of LU [6]. csrlsvlu uses QR factorization to estimate the size of LU in the analysis phase. The cost of the analysis phase lies mainly in determining the sparsity pattern of the Householder vectors in QR factorization. The idea in [7] of avoiding the computation of $A^T A$ is adopted. If system memory is insufficient to keep the sparsity pattern of QR, csrlsvlu returns CUSOLVER_STATUS_ALLOC_FAILED. If the matrix is not banded, it is better to enable reordering to avoid CUSOLVER_STATUS_ALLOC_FAILED.
Remark 2: approximate minimum degree ordering (symamd) is a well-known technique to reduce zero fill-in of QR factorization. However in most cases, symrcm still performs well.
Remark 3: only CPU (Host) path is provided.
Remark 4: multithreaded csrlsvlu is not available yet. If QR does not incur much zero fill-in, csrlsvqr would be faster than csrlsvlu.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA ( = csrRowPtrA(n) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA ( = csrRowPtrA(n) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
b | device | host | right hand side vector of size n. |
tol | host | host | tolerance to decide if singular or not. |
reorder | host | host | no reordering if reorder=0; symrcm if reorder=1; symamd if reorder=2. Reordering reduces zero fill-in. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | solution vector of size n, x = inv(A)*b. |
singularity | host | host | -1 if A is invertible. Otherwise, first index j such that U(j,j)≈0 |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.2.2. cusolverSp<t>csrlsvqr()
cusolverStatus_t cusolverSpScsrlsvqr[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, const float *b, float tol, int reorder, float *x, int *singularity);
cusolverStatus_t cusolverSpDcsrlsvqr[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, const double *b, double tol, int reorder, double *x, int *singularity);
cusolverStatus_t cusolverSpCcsrlsvqr[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuComplex *b, float tol, int reorder, cuComplex *x, int *singularity);
cusolverStatus_t cusolverSpZcsrlsvqr[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuDoubleComplex *b, double tol, int reorder, cuDoubleComplex *x, int *singularity);
This function solves the linear system $A x = b$.
A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the solution vector of size m.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If matrix A is symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result would be wrong.
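Extending a matrix that stores only its lower triangle into a full general matrix can be done on the host before calling the solver. The following sketch does this for a real symmetric matrix in base-0 CSR; the function name is hypothetical, the caller is assumed to provide output arrays of sufficient size, and a Hermitian input would additionally need the mirrored values conjugated.

```c
#include <assert.h>

/* Expand the lower triangle (base-0 CSR) of a real symmetric n-by-n matrix
 * into full CSR. fullRowPtr needs n+1 entries; fullVal and fullColInd need
 * room for 2*nnzLower minus the number of stored diagonal entries.
 * Returns the number of nonzeros of the full matrix. */
int csr_lower_to_full(int n, const double *val, const int *rowPtr,
                      const int *colInd, double *fullVal, int *fullRowPtr,
                      int *fullColInd)
{
    assert(n > 0);
    /* Pass 1: count the entries of every row of the full matrix. */
    for (int i = 0; i <= n; ++i) {
        fullRowPtr[i] = 0;
    }
    for (int i = 0; i < n; ++i) {
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colInd[k];
            assert(j <= i); /* input must be lower triangular */
            fullRowPtr[i + 1]++;
            if (j != i) {
                fullRowPtr[j + 1]++; /* mirrored entry (j,i) */
            }
        }
    }
    /* The prefix sum turns counts into row start offsets. */
    for (int i = 0; i < n; ++i) {
        fullRowPtr[i + 1] += fullRowPtr[i];
    }
    /* Pass 2: scatter entries, using fullRowPtr[i] as the cursor of row i. */
    for (int i = 0; i < n; ++i) {
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colInd[k];
            int dst = fullRowPtr[i]++;
            fullVal[dst] = val[k];
            fullColInd[dst] = j;
            if (j != i) {
                dst = fullRowPtr[j]++;
                fullVal[dst] = val[k];
                fullColInd[dst] = i;
            }
        }
    }
    /* Each cursor now holds its row end; shift back to recover row starts. */
    for (int i = n; i > 0; --i) {
        fullRowPtr[i] = fullRowPtr[i - 1];
    }
    fullRowPtr[0] = 0;
    return fullRowPtr[n];
}
```

If the column indices of every input row are sorted, the output rows come out sorted as well, because the mirrored entries of a row are appended after its lower-triangular entries in increasing column order.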
The linear system is solved by sparse QR factorization, $A = Q R$.
If A is singular under the given tolerance (max(tol,0)), then some diagonal elements of R are zero, i.e. $|R(j,j)| < \mathrm{tol}$ for some $j$.
The output parameter singularity is the smallest index of such j. If A is non-singular, singularity is -1. The singularity is base-0, independent of the base index of A. For example, if the 2nd column of A is the same as the first column, then A is singular and singularity = 1, which means R(1,1)≈0.
The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of the factorization. The input parameter reorder enables symrcm if reorder is 1 and symamd if reorder is 2; otherwise, no reordering is performed.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows and columns of matrix A. |
nnz | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
b | device | host | right hand side vector of size m. |
tol | host | host | tolerance to decide if singular or not. |
reorder | host | host | no effect. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | solution vector of size m, x = inv(A)*b. |
singularity | host | host | -1 if A is invertible. Otherwise, first index j such that R(j,j)≈0 |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.2.3. cusolverSp<t>csrlsvchol()
cusolverStatus_t cusolverSpScsrlsvchol[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float *csrVal, const int *csrRowPtr, const int *csrColInd, const float *b, float tol, int reorder, float *x, int *singularity);
cusolverStatus_t cusolverSpDcsrlsvchol[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double *csrVal, const int *csrRowPtr, const int *csrColInd, const double *b, double tol, int reorder, double *x, int *singularity);
cusolverStatus_t cusolverSpCcsrlsvchol[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuComplex *csrVal, const int *csrRowPtr, const int *csrColInd, const cuComplex *b, float tol, int reorder, cuComplex *x, int *singularity);
cusolverStatus_t cusolverSpZcsrlsvchol[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrVal, const int *csrRowPtr, const int *csrColInd, const cuDoubleComplex *b, double tol, int reorder, cuDoubleComplex *x, int *singularity);
This function solves the linear system $A x = b$.
A is an m×m symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the solution vector of size m.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL and the upper triangular part of A is ignored (if parameter reorder is zero). In other words, suppose input matrix A is decomposed as $A = L + D + U$, where L is lower triangular, D is diagonal and U is upper triangular. The function would ignore U and regard A as a symmetric matrix with the formula $A = L + D + L^H$. If parameter reorder is nonzero, the user has to extend A to a full matrix; otherwise the solution would be wrong.
The linear system is solved by sparse Cholesky factorization, $A = G G^H$,
where G is the Cholesky factor, a lower triangular matrix.
The output parameter singularity has two meanings:
- If A is not positive definite, there exists some integer k such that A(0:k, 0:k) is not positive definite. singularity is the minimum of such k.
- If A is positive definite but near singular under tolerance (max(tol,0)), i.e. there exists some integer k such that $G(k,k) \le \mathrm{tol}$, then singularity is the minimum of such k.
singularity is base-0. If A is positive definite and not near singular under tolerance, singularity is -1. If the user only wants to know whether A is positive definite, tol=0 is enough.
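Both meanings of singularity can be seen on a small dense matrix. The sketch below is a hypothetical host helper, not the library's sparse algorithm: it runs a square-root-free Cholesky variant ($A = L D L^T$) whose pivot $d_k$ equals $G(k,k)^2$, so a non-positive pivot signals a matrix that is not positive definite and a pivot with $d_k \le \max(\mathrm{tol},0)^2$ signals near singularity. Treating the threshold as inclusive is an assumption of this sketch.

```c
#include <assert.h>

/* Square-root-free Cholesky (A = L*D*L^T) on the lower triangle of a dense
 * row-major n-by-n symmetric matrix, overwritten in place. The pivot d_k
 * equals G(k,k)^2 of the ordinary Cholesky factor G.
 * Returns the smallest 0-based k with d_k <= max(tol,0)^2, or -1. */
int dense_chol_singularity(int n, double *a, double tol)
{
    assert(n > 0);
    double eps = tol > 0.0 ? tol : 0.0;
    for (int k = 0; k < n; ++k) {
        /* d_k = A(k,k) - sum_{j<k} L(k,j)^2 * d_j */
        double d = a[k * n + k];
        for (int j = 0; j < k; ++j) {
            d -= a[k * n + j] * a[k * n + j] * a[j * n + j];
        }
        if (d <= eps * eps) {
            return k; /* not positive definite, or near singular */
        }
        a[k * n + k] = d;
        /* L(i,k) = (A(i,k) - sum_{j<k} L(i,j) * d_j * L(k,j)) / d_k */
        for (int i = k + 1; i < n; ++i) {
            double s = a[i * n + k];
            for (int j = 0; j < k; ++j) {
                s -= a[i * n + j] * a[k * n + j] * a[j * n + j];
            }
            a[i * n + k] = s / d;
        }
    }
    return -1;
}
```

With tol = 0 the function returns a non-negative index only when some pivot is non-positive, which mirrors the statement that tol=0 suffices to test positive definiteness.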
The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of the factorization. The input parameter reorder enables symrcm if reorder is 1 and symamd if reorder is 2; otherwise, no reordering is performed.
Remark 1: the function works for in-place (x and b point to the same memory block) and out-of-place.
Remark 2: the function only works with 32-bit indices. If matrix G has so much zero fill-in that its number of nonzeros is bigger than $2^{31}$, then CUSOLVER_STATUS_ALLOC_FAILED is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows and columns of matrix A. |
nnz | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
b | device | host | right hand side vector of size m. |
tol | host | host | tolerance to decide singularity. |
reorder | host | host | no effect. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | solution vector of size m, x = inv(A)*b. |
singularity | host | host | -1 if A is symmetric positive definite. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.2.4. cusolverSp<t>csrlsqvqr()
cusolverStatus_t cusolverSpScsrlsqvqr[Host](cusolverSpHandle_t handle, int m, int n, int nnz, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, const float *b, float tol, int *rankA, float *x, int *p, float *min_norm);
cusolverStatus_t cusolverSpDcsrlsqvqr[Host](cusolverSpHandle_t handle, int m, int n, int nnz, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, const double *b, double tol, int *rankA, double *x, int *p, double *min_norm);
cusolverStatus_t cusolverSpCcsrlsqvqr[Host](cusolverSpHandle_t handle, int m, int n, int nnz, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuComplex *b, float tol, int *rankA, cuComplex *x, int *p, float *min_norm);
cusolverStatus_t cusolverSpZcsrlsqvqr[Host](cusolverSpHandle_t handle, int m, int n, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, const cuDoubleComplex *b, double tol, int *rankA, cuDoubleComplex *x, int *p, double *min_norm);
This function solves the following least-square problem: $x = \operatorname{argmin}_z \|A z - b\|$.
A is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the least-square solution vector of size n.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is square, symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result is wrong.
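This function only works if m is greater than or equal to n; in other words, A is a tall matrix.
The least-square problem is solved by sparse QR factorization with column pivoting, $A P^T = Q R$.
If A is of full rank (i.e. all columns of A are linearly independent), then matrix P is an identity. Suppose the rank of A is k, less than n; the permutation matrix P then reorders the columns of A in the following sense:
$$A P^T = \begin{pmatrix} A_1 & A_2 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} \\ & R_{22} \end{pmatrix}$$
where $R_{11}$ and A have the same rank, but $R_{22}$ is almost zero, i.e. every column of $A_2$ is a linear combination of $A_1$.
The input parameter tol decides numerical rank. The absolute value of every entry in $R_{22}$ is less than or equal to tolerance=max(tol,0).
The output parameter rankA denotes numerical rank of A.
Suppose $y = P x$ and $c = Q^H b$; the least-square problem can be reformulated as
$$\min \|A x - b\| = \min \|R y - c\|$$
or in matrix form
$$\begin{pmatrix} R_{11} & R_{12} \\ & R_{22} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}$$
The output parameter min_norm is $\|c_2\|$, which is the minimum value of the least-square problem.
If A is not of full rank, the above equation does not have a unique solution. The least-square problem is equivalent to
$$\min \|y\| \quad \text{subject to} \quad R_{11} y_1 + R_{12} y_2 = c_1$$
Or equivalently another least-square problem
$$\min \left\| \begin{pmatrix} R_{11}^{-1} R_{12} \\ I \end{pmatrix} y_2 - \begin{pmatrix} R_{11}^{-1} c_1 \\ O \end{pmatrix} \right\|$$
The output parameter x is $P^T y$, the solution of the least-square problem.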
The output parameter p is a vector of size n. It corresponds to a permutation matrix P. p(i)=j means (P*x)(i) = x(j). If A is of full rank, p=0:n-1.
Remark 1: p is always base 0, independent of base index of A.
Remark 2: only CPU (Host) path is provided.
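The convention p(i)=j, meaning (P*x)(i) = x(j), can be applied on the host with two short loops. The following sketch uses hypothetical helper names and real vectors; it is only meant to pin down the direction of the permutation.

```c
#include <assert.h>

/* y = P*x, where p[i] = j means (P*x)(i) = x(j). p is base-0. */
void apply_perm(int n, const int *p, const double *x, double *y)
{
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        y[i] = x[p[i]];
    }
}

/* x = P^T*y, the inverse operation: x(p[i]) = y(i). */
void apply_perm_transpose(int n, const int *p, const double *y, double *x)
{
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        x[p[i]] = y[i];
    }
}
```

With the same convention, column i of A*P^T is column p(i) of A, so p lists the columns of A in the order in which the pivoted factorization sees them.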
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolver library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
nnz | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
b | device | host | right hand side vector of size m. |
tol | host | host | tolerance to decide rank of A. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
rankA | host | host | numerical rank of A. |
x | device | host | solution vector of size n, x=pinv(A)*b. |
p | device | host | a vector of size n, which represents the permutation matrix P satisfying A*P^T=Q*R. |
min_norm | host | host | ||A*x-b||, x=pinv(A)*b. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnz<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.2.5. cusolverSp<t>csreigvsi()
cusolverStatus_t cusolverSpScsreigvsi[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, float mu0, const float *x0, int maxite, float tol, float *mu, float *x);
cusolverStatus_t cusolverSpDcsreigvsi[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, double mu0, const double *x0, int maxite, double tol, double *mu, double *x);
cusolverStatus_t cusolverSpCcsreigvsi[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuComplex mu0, const cuComplex *x0, int maxite, float tol, cuComplex *mu, cuComplex *x);
cusolverStatus_t cusolverSpZcsreigvsi[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuDoubleComplex mu0, const cuDoubleComplex *x0, int maxite, double tol, cuDoubleComplex *mu, cuDoubleComplex *x);
This function solves the simple eigenvalue problem $A x = \lambda x$ by the shift-inverse method.
A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. The output parameter x is the approximated eigenvector of size m.
The following shift-inverse method corrects eigenpair step-by-step until convergence.
It accepts several parameters:
mu0 is an initial guess of eigenvalue. The shift-inverse method will converge to the eigenvalue mu nearest mu0 if mu is a singleton. Otherwise, the shift-inverse method may not converge.
x0 is an initial eigenvector. If the user has no preference, just choose x0 randomly. x0 must be nonzero. It can be of non-unit length.
tol is the tolerance to decide convergence. If tol is less than zero, it would be treated as zero.
maxite is maximum number of iterations. It is useful when shift-inverse method does not converge because the tolerance is too small or the desired eigenvalue is not a singleton.
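How the parameters mu0, x0, tol and maxite interact can be sketched with a small dense shift-inverse iteration. This is an illustration under stated assumptions, not the library's implementation: it handles a real symmetric dense matrix of at most SI_MAX_N rows, solves each shifted system by Gaussian elimination, estimates the eigenvalue with a Rayleigh quotient, and scales the iterate by its largest entry instead of normalizing to unit length. All names are hypothetical.

```c
#include <assert.h>
#include <math.h>

#define SI_MAX_N 8 /* size cap of this sketch, not a library limit */

/* Solve M*y = b (row-major n-by-n M) by Gaussian elimination with partial
 * pivoting; M and b are overwritten. Returns -1 on an exactly zero pivot. */
int dense_solve(int n, double *M, double *b, double *y)
{
    for (int j = 0; j < n; ++j) {
        int piv = j;
        for (int i = j + 1; i < n; ++i) {
            if (fabs(M[i * n + j]) > fabs(M[piv * n + j])) piv = i;
        }
        if (M[piv * n + j] == 0.0) return -1;
        for (int c = 0; c < n; ++c) {
            double t = M[j * n + c];
            M[j * n + c] = M[piv * n + c];
            M[piv * n + c] = t;
        }
        double tb = b[j]; b[j] = b[piv]; b[piv] = tb;
        for (int i = j + 1; i < n; ++i) {
            double l = M[i * n + j] / M[j * n + j];
            for (int c = j; c < n; ++c) M[i * n + c] -= l * M[j * n + c];
            b[i] -= l * b[j];
        }
    }
    for (int i = n - 1; i >= 0; --i) { /* back substitution */
        double s = b[i];
        for (int c = i + 1; c < n; ++c) s -= M[i * n + c] * y[c];
        y[i] = s / M[i * n + i];
    }
    return 0;
}

/* Shift-inverse iteration on a real symmetric row-major matrix A.
 * x holds x0 on entry and the eigenvector estimate on exit; *mu receives
 * the eigenvalue estimate. Returns the iteration count on convergence,
 * 0 if maxite was reached, -1 if A - mu0*I is exactly singular. */
int shift_inverse(int n, const double *A, double mu0, double *x,
                  int maxite, double tol, double *mu)
{
    assert(n > 0 && n <= SI_MAX_N);
    if (tol < 0.0) tol = 0.0; /* negative tolerance is treated as zero */
    for (int it = 1; it <= maxite; ++it) {
        double M[SI_MAX_N * SI_MAX_N], b[SI_MAX_N], y[SI_MAX_N], Ay[SI_MAX_N];
        for (int i = 0; i < n; ++i) {
            b[i] = x[i];
            for (int j = 0; j < n; ++j) {
                M[i * n + j] = A[i * n + j] - (i == j ? mu0 : 0.0);
            }
        }
        if (dense_solve(n, M, b, y) != 0) return -1;

        double yy = 0.0, yAy = 0.0, ymax = 0.0, res2 = 0.0;
        for (int i = 0; i < n; ++i) {
            Ay[i] = 0.0;
            for (int j = 0; j < n; ++j) Ay[i] += A[i * n + j] * y[j];
            yy += y[i] * y[i];
            yAy += y[i] * Ay[i];
            if (fabs(y[i]) > ymax) ymax = fabs(y[i]);
        }
        *mu = yAy / yy; /* Rayleigh quotient */
        for (int i = 0; i < n; ++i) {
            double r = Ay[i] - (*mu) * y[i];
            res2 += r * r;
            x[i] = y[i] / ymax; /* keep the iterate well scaled */
        }
        /* ||A*y - mu*y|| <= tol * ||y||, compared in squared form */
        if (res2 <= tol * tol * yy) return it;
    }
    return 0;
}
```

The closer mu0 is to the wanted eigenvalue relative to the other eigenvalues, the fewer iterations are needed; a tolerance that is too small makes the loop run until maxite, which is the situation the maxite parameter guards against.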
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result is wrong.
Remark 1: [cu|h]solver[S|D]csreigvsi only allows mu0 as a real number. This works if A is symmetric. Otherwise, a non-real eigenvalue has a conjugate counterpart on the complex plane, and the shift-inverse method would not converge to such an eigenvalue even if the eigenvalue is a singleton. The user has to extend A to complex numbers and call [cu|h]solver[C|Z]csreigvsi with mu0 not on the real axis.
Remark 2: the tolerance tol should not be smaller than |mu0|*eps, where eps is machine epsilon. Otherwise, shift-inverse may not converge because of the small tolerance.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolver library context. |
m | host | host | number of rows and columns of matrix A. |
nnz | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
mu0 | host | host | initial guess of eigenvalue. |
x0 | device | host | initial guess of eigenvector, a vector of size m. |
maxite | host | host | maximum iterations in shift-inverse method. |
tol | host | host | tolerance for convergence. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
mu | device | host | approximated eigenvalue nearest mu0 under tolerance. |
x | device | host | approximated eigenvector of size m. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.2.6. cusolverSp<t>csreigs()
cusolverStatus_t cusolverSpScsreigs[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuComplex left_bottom_corner, cuComplex right_upper_corner, int *num_eigs);
cusolverStatus_t cusolverSpDcsreigs[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuDoubleComplex left_bottom_corner, cuDoubleComplex right_upper_corner, int *num_eigs);
cusolverStatus_t cusolverSpCcsreigs[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuComplex left_bottom_corner, cuComplex right_upper_corner, int *num_eigs);
cusolverStatus_t cusolverSpZcsreigs[Host](cusolverSpHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuDoubleComplex left_bottom_corner, cuDoubleComplex right_upper_corner, int *num_eigs);
This function computes the number of algebraic eigenvalues in a given box B by the contour integral
$$\text{number of algebraic eigenvalues in box } B = \frac{1}{2\pi i} \oint_C \frac{P'(z)}{P(z)}\,dz$$
where the closed line C is the boundary of the box B, a rectangle specified by two points: the left bottom corner (input parameter left_bottom_corner) and the right upper corner (input parameter right_upper_corner). P(z)=det(A - z*I) is the characteristic polynomial of A.
A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
The output parameter num_eigs is number of algebraic eigenvalues in the box B. This number may not be accurate due to several reasons:
1. the contour C is close to some eigenvalues or even passes through some eigenvalues.
2. the numerical integration is not accurate due to coarse grid size. The default resolution is 1200 grids along contour C uniformly.
Even though csreigs may not be accurate, it can still give the user an idea of how many eigenvalues lie in a region where the resolution of the Gershgorin disk theorem is poor. For example, the standard 3-point finite-difference stencil of the Laplacian operator gives a tridiagonal matrix, and the disk theorem only shows that "all eigenvalues are in the interval [0, 4*N^2]", where N is the number of grid points. In this case, csreigs is useful for any interval inside [0, 4*N^2].
Remark 1: if A is real symmetric or complex Hermitian, all eigenvalues are real. The user still needs to specify a box, not an interval. The height of the box can be much smaller than the width.
Remark 2: only CPU (Host) path is provided.
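The contour-integral count can be tried out on a 2×2 matrix, whose characteristic polynomial is known in closed form. The sketch below is a hypothetical host helper, not the csreigs algorithm itself: it integrates P'(z)/P(z) along the box boundary with the midpoint rule and carries the complex arithmetic out on explicit real and imaginary parts.

```c
#include <assert.h>

/* Count the eigenvalues of the 2x2 matrix [[a, b], [c, d]] inside the box
 * [x0, x1] x [y0, y1] of the complex plane by integrating P'(z)/P(z)
 * counterclockwise along the box boundary with the midpoint rule, where
 * P(z) = det(A - z*I) = z^2 - (a + d)*z + (a*d - b*c). */
int count_eigs_in_box_2x2(double a, double b, double c, double d,
                          double x0, double y0, double x1, double y1,
                          int points_per_edge)
{
    assert(x0 < x1 && y0 < y1 && points_per_edge > 0);
    const double two_pi = 2.0 * 3.14159265358979323846;
    const double tr = a + d;
    const double det = a * d - b * c;
    /* corners in counterclockwise order; the fifth entry closes the loop */
    const double cx[5] = { x0, x1, x1, x0, x0 };
    const double cy[5] = { y0, y0, y1, y1, y0 };
    double integral_im = 0.0; /* imaginary part of the contour integral */

    for (int e = 0; e < 4; ++e) {
        double dzr = (cx[e + 1] - cx[e]) / points_per_edge;
        double dzi = (cy[e + 1] - cy[e]) / points_per_edge;
        for (int k = 0; k < points_per_edge; ++k) {
            double zr = cx[e] + (k + 0.5) * dzr;
            double zi = cy[e] + (k + 0.5) * dzi;
            /* P(z) and P'(z) = 2z - (a + d), split into real/imaginary parts */
            double pr = zr * zr - zi * zi - tr * zr + det;
            double pim = 2.0 * zr * zi - tr * zi;
            double dr = 2.0 * zr - tr;
            double di = 2.0 * zi;
            /* f = P'(z) / P(z) */
            double den = pr * pr + pim * pim;
            double fr = (dr * pr + di * pim) / den;
            double fi = (di * pr - dr * pim) / den;
            integral_im += fr * dzi + fi * dzr; /* Im(f * dz) */
        }
    }
    /* (1/(2*pi*i)) * integral has real part Im(integral) / (2*pi) */
    double count = integral_im / two_pi;
    return (int)(count + 0.5); /* round to the nearest integer */
}
```

Calling it with 300 points per edge gives 1200 samples along the contour in total. The rounding step only works while the contour stays away from the eigenvalues; a contour that passes close to an eigenvalue makes the integrand nearly singular and the count unreliable, as noted in the list above.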
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows and columns of matrix A. |
nnz | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A. |
left_bottom_corner | host | host | left bottom corner of the box. |
right_upper_corner | host | host | right upper corner of the box. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
num_eigs | host | host | number of algebraic eigenvalues in a box. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.3. Low Level Function Reference
This section describes the low-level API of cuSolverSP, including symrcm and batched QR.
6.3.1. cusolverSpXcsrsymrcm()
cusolverStatus_t cusolverSpXcsrsymrcmHost(cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, int *p);
This function implements the Symmetric Reverse Cuthill-McKee permutation. It returns a permutation vector p such that A(p,p) concentrates its nonzeros near the diagonal. This is equivalent to symrcm in MATLAB; however, the result may not be the same because of different heuristics in the pseudoperipheral node finder. The cuSolverSP library implements symrcm based on the following two papers:
E. Cuthill and J. McKee, Reducing the Bandwidth of Sparse Symmetric Matrices, ACM '69 Proceedings of the 1969 24th National Conference, Pages 157-172
Alan George, Joseph W. H. Liu, An Implementation of a Pseudoperipheral Node Finder, ACM Transactions on Mathematical Software (TOMS) Volume 5 Issue 3, Sept. 1979, Pages 284-295
The output parameter p is an integer array of n elements. It represents a permutation array and is indexed using the base-0 convention. The permutation array p corresponds to a permutation matrix P, and satisfies the following relation:
$$A(p,p) = P A P^T$$
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Internally rcm works on A^T + A, so the user does not need to extend the matrix if the matrix is not symmetric.
Remark 1: only CPU (Host) path is provided.
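One way to check whether a permutation actually concentrates nonzeros near the diagonal is to compare the bandwidth of A with that of A(p,p). The helper below is a minimal sketch under the assumption of a base-0 CSR pattern and a base-0 permutation. The name permuted_bandwidth is hypothetical and the routine is not part of cuSolverSP.

```c
#include <assert.h>
#include <stdlib.h>

/* Bandwidth max|i - j| over the nonzeros of B = A(p,p), where A is an n x n
 * pattern in base-0 CSR format and p is a base-0 permutation.
 * Entry A(r,c) lands at B(pinv[r], pinv[c]), with pinv the inverse of p.
 * Returns -1 if the scratch allocation fails. */
int permuted_bandwidth(int n, const int *rowPtr, const int *colInd,
                       const int *p)
{
    assert(n > 0);
    int *pinv = malloc((size_t)n * sizeof *pinv);
    if (pinv == NULL) return -1;
    for (int i = 0; i < n; ++i) pinv[p[i]] = i;

    int bw = 0;
    for (int r = 0; r < n; ++r) {
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k) {
            const int d = abs(pinv[r] - pinv[colInd[k]]);
            if (d > bw) bw = d;
        }
    }
    free(pinv);
    return bw;
}
```

Calling it once with the identity permutation and once with the vector returned by csrsymrcm gives a quick before/after comparison.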
parameter | *Host MemSpace | description |
handle | host | handle to the cuSolverSP library context. |
n | host | number of rows and columns of matrix A. |
nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |
parameter | *Host MemSpace | description |
p | host | permutation vector of size n. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.3.2. cusolverSpXcsrsymmdq()
cusolverStatus_t cusolverSpXcsrsymmdqHost(cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, int *p);
This function implements Symmetric Minimum Degree Algorithm based on Quotient Graph. It returns a permutation vector p such that A(p,p) would have less zero fill-in during Cholesky factorization. The cuSolverSP library implements symmdq based on the following two papers:
Patrick R. Amestoy, Timothy A. Davis, Iain S. Duff, An Approximate Minimum Degree Ordering Algorithm, SIAM J. Matrix Analysis Applic. Vol 17, no 4, pp. 886-905, Dec. 1996.
Alan George, Joseph W. Liu, A Fast Implementation of the Minimum Degree Algorithm Using Quotient Graphs, ACM Transactions on Mathematical Software, Vol 6, No. 3, September 1980, pages 337-358.
The output parameter p is an integer array of n elements. It represents a permutation array with base-0 index. The permutation array p corresponds to a permutation matrix P, and satisfies the following relation:
$$A(p,p) = P A P^T$$
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Internally mdq works on A^T + A, so the user does not need to extend the matrix if the matrix is not symmetric.
Remark 1: only CPU (Host) path is provided.
parameter | *Host MemSpace | description |
handle | host | handle to the cuSolverSP library context. |
n | host | number of rows and columns of matrix A. |
nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |
parameter | *Host MemSpace | description |
p | host | permutation vector of size n. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.3.3. cusolverSpXcsrsymamd()
cusolverStatus_t cusolverSpXcsrsymamdHost(cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, int *p);
This function implements Symmetric Approximate Minimum Degree Algorithm based on Quotient Graph. It returns a permutation vector p such that A(p,p) would have less zero fill-in during Cholesky factorization. The cuSolverSP library implements symamd based on the following paper:
Patrick R. Amestoy, Timothy A. Davis, Iain S. Duff, An Approximate Minimum Degree Ordering Algorithm, SIAM J. Matrix Analysis Applic. Vol 17, no 4, pp. 886-905, Dec. 1996.
The output parameter p is an integer array of n elements. It represents a permutation array with base-0 index. The permutation array p corresponds to a permutation matrix P, and satisfies the following relation:
$$A(p,p) = P A P^T$$
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Internally amd works on A^T + A, so the user does not need to extend the matrix if the matrix is not symmetric.
Remark 1: only CPU (Host) path is provided.
parameter | *Host MemSpace | description |
handle | host | handle to the cuSolverSP library context. |
n | host | number of rows and columns of matrix A. |
nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |
parameter | *Host MemSpace | description |
p | host | permutation vector of size n. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.3.4. cusolverSpXcsrperm()
cusolverStatus_t cusolverSpXcsrperm_bufferSizeHost(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, int *csrRowPtrA, int *csrColIndA,
    const int *p, const int *q, size_t *bufferSizeInBytes);
cusolverStatus_t cusolverSpXcsrpermHost(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, int *csrRowPtrA, int *csrColIndA,
    const int *p, const int *q, int *map, void *pBuffer);
Given a left permutation vector p, which corresponds to the permutation matrix P, and a right permutation vector q, which corresponds to the permutation matrix Q, this function computes the permutation of matrix A by
$$B = P A Q^T$$
A is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA and csrColIndA.
The operation is in-place, i.e. the matrix A is overwritten by B.
The permutation vectors p and q are base-0. p performs the row permutation while q performs the column permutation. One can also use the MATLAB command B = A(p,q) to permute matrix A.
This function only computes the sparsity pattern of B. The user can use the parameter map to get csrValB as well. The parameter map is an input/output. If the user sets map=0:1:(nnzA-1) before calling csrperm, then csrValB=csrValA(map).
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is symmetric and only the lower/upper part is provided, the user has to extend it to the full matrix before passing it into this function.
This function requires a buffer whose size is returned by csrperm_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSOLVER_STATUS_INVALID_VALUE is returned.
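A plain malloc() does not promise 128-byte alignment, so the host buffer may need to be aligned by hand. The following is a minimal, portable sketch that over-allocates and rounds the address up. The helper name alloc_aligned_128 is hypothetical and is not part of cuSolverSP.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates at least `bytes` bytes and returns an address that is a
 * multiple of 128. *raw receives the pointer that must later be passed to
 * free(). Returns NULL (and sets *raw to NULL) if the allocation fails. */
void *alloc_aligned_128(size_t bytes, void **raw)
{
    assert(raw != NULL);
    const uintptr_t align = 128;
    *raw = NULL;
    if (bytes > SIZE_MAX - (align - 1)) return NULL; /* size would overflow */
    *raw = malloc(bytes + align - 1);
    if (*raw == NULL) return NULL;
    /* Round the address up to the next multiple of 128. */
    const uintptr_t addr = ((uintptr_t)*raw + align - 1) & ~(align - 1);
    return (void *)addr;
}
```

The aligned pointer would be passed as pBuffer, with bytes set to the value reported by csrperm_bufferSize(), and the raw pointer would be released with free() once the permutation is done.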
For example, if matrix A is
|
and the left permutation vector is p=(0,2,1) and the right permutation vector is q=(2,1,0), then B is
|
Remark 1: only CPU (Host) path is provided.
Remark 2: the user can combine csrsymrcm and csrperm to get A(p,p), which has less zero fill-in during QR factorization.
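The semantics of the permutation and of map can be illustrated with a small host-side reference routine. This is a sketch of what B = A(p,q) = P*A*Q^T means on a base-0 CSR pattern, not the library's algorithm: unlike csrperm it writes B to separate arrays instead of working in place, and it produces map directly, as if map had been initialized to 0:1:(nnzA-1). The name csr_permute and the caller-provided scratch array qinv are hypothetical.

```c
#include <assert.h>

/* Reference semantics of B = A(p,q) on a base-0 CSR pattern.
 * Inputs: m x n pattern (rowPtrA, colIndA), row permutation p (size m),
 * column permutation q (size n).
 * Outputs: rowPtrB (m+1 entries), colIndB (nnzA entries) and map (nnzA
 * entries) such that valB[k] = valA[map[k]].
 * qinv is caller-provided scratch of size n. */
void csr_permute(int m, int n, const int *rowPtrA, const int *colIndA,
                 const int *p, const int *q,
                 int *rowPtrB, int *colIndB, int *map, int *qinv)
{
    assert(m > 0 && n > 0);
    /* Column c of A becomes column qinv[c] of B. */
    for (int j = 0; j < n; ++j) qinv[q[j]] = j;

    int pos = 0;
    rowPtrB[0] = 0;
    for (int i = 0; i < m; ++i) {
        const int r = p[i];          /* row i of B is row p[i] of A */
        const int start = pos;
        for (int k = rowPtrA[r]; k < rowPtrA[r + 1]; ++k) {
            /* Insertion sort keeps the columns of the row ascending. */
            const int col = qinv[colIndA[k]];
            int t = pos++;
            while (t > start && colIndB[t - 1] > col) {
                colIndB[t] = colIndB[t - 1];
                map[t] = map[t - 1];
                --t;
            }
            colIndB[t] = col;
            map[t] = k;
        }
        rowPtrB[i + 1] = pos;
    }
}
```

As an illustration with assumed values, take a dense 3×3 matrix stored in CSR with entries 1 through 9 in row-major order, p=(0,2,1) and q=(2,1,0). Gathering valA through map then yields the rows (3,2,1), (9,8,7) and (6,5,4): rows 1 and 2 are swapped and the columns are reversed.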
parameter | *Host MemSpace | description |
handle | host | handle to the cuSolver library context. |
m | host | number of rows of matrix A. |
n | host | number of columns of matrix A. |
nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | host | integer array of m+1 elements that contains the start of every row and end of last row plus one of matrix A. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |
p | host | left permutation vector of size m. |
q | host | right permutation vector of size n. |
map | host | integer array of nnzA indices. If the user wants to get relationship between A and B, map must be set 0:1:(nnzA-1). |
pBuffer | host | buffer allocated by the user, the size is returned by csrperm_bufferSize(). |
parameter | *Host MemSpace | description |
csrRowPtrA | host | integer array of m+1 elements that contains the start of every row and end of last row plus one of matrix B. |
csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix B. |
map | host | integer array of nnzA indices that maps matrix A to matrix B. |
bufferSizeInBytes | host | number of bytes of the buffer. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.3.5. cusolverSpXcsrqrBatched()
cusolverStatus_t cusolverSpCreateCsrqrInfo(csrqrInfo_t *info);
cusolverStatus_t cusolverSpDestroyCsrqrInfo(csrqrInfo_t info);
cusolverStatus_t cusolverSpXcsrqrAnalysisBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA,
    csrqrInfo_t info);
cusolverStatus_t cusolverSpScsrqrBufferInfoBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    int batchSize, csrqrInfo_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpDcsrqrBufferInfoBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    int batchSize, csrqrInfo_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpCcsrqrBufferInfoBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    int batchSize, csrqrInfo_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpZcsrqrBufferInfoBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    int batchSize, csrqrInfo_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpScsrqrsvBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    const float *b, float *x, int batchSize, csrqrInfo_t info, void *pBuffer);
cusolverStatus_t cusolverSpDcsrqrsvBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    const double *b, double *x, int batchSize, csrqrInfo_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrqrsvBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    const cuComplex *b, cuComplex *x, int batchSize, csrqrInfo_t info, void *pBuffer);
cusolverStatus_t cusolverSpZcsrqrsvBatched(cusolverSpHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    const cuDoubleComplex *b, cuDoubleComplex *x, int batchSize, csrqrInfo_t info, void *pBuffer);
The batched sparse QR factorization is used to solve either a set of least-squares problems
$$x_j = \operatorname{argmin}_z \| A_j z - b_j \|, \quad j = 1, 2, \ldots, \text{batchSize}$$
or a set of linear systems
$$A_j x_j = b_j, \quad j = 1, 2, \ldots, \text{batchSize}$$
where each A_j is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA and csrColIndA.
The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is symmetric and only the lower/upper part is provided, the user has to extend it to the full matrix before passing it into this function.
The prerequisites for batched sparse QR are twofold. First, all matrices must have the same sparsity pattern. Second, no column pivoting is used in the least-squares problem, so the solution is valid only if A_j is of full rank for all j = 1,2,..., batchSize. Because all matrices have the same sparsity pattern, only one copy of csrRowPtrA and csrColIndA is used, but the array csrValA stores the coefficients of the matrices A_j one after another. In other words, csrValA[k*nnzA : (k+1)*nnzA] holds the values of A_k.
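The aggregated layout of csrValA can be made concrete with a small host-side helper that places the values of one matrix into its slot. This is a sketch for double precision. The name set_batch_values is hypothetical, and in a real program the aggregated host array would still have to be copied to device memory (for example with cudaMemcpy()) before the batched routines are called.

```c
#include <assert.h>
#include <string.h>

/* Copies the nnzA values of matrix A_k into its slot of the aggregated
 * host array, i.e. csrValA[k*nnzA] through csrValA[(k+1)*nnzA - 1].
 * k is the zero-based position of the matrix in the batch. */
void set_batch_values(double *csrValA, int nnzA, int k, const double *valAk)
{
    assert(nnzA > 0 && k >= 0);
    memcpy(csrValA + (size_t)k * (size_t)nnzA, valAk,
           (size_t)nnzA * sizeof *valAk);
}
```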
The batched QR uses the opaque data structure csrqrInfo to keep intermediate data, for example, matrix Q and matrix R of the QR factorization. The user needs to create csrqrInfo by calling cusolverSpCreateCsrqrInfo before calling any other function of the batched QR operation. The csrqrInfo does not release its internal data until cusolverSpDestroyCsrqrInfo is called.
There are three routines in batched sparse QR: cusolverSpXcsrqrAnalysisBatched, cusolverSp[S|D|C|Z]csrqrBufferInfoBatched and cusolverSp[S|D|C|Z]csrqrsvBatched.
First, cusolverSpXcsrqrAnalysisBatched is the analysis phase, used to analyze the sparsity pattern of matrix Q and matrix R of the QR factorization. Parallelism is also extracted during the analysis phase. Once the analysis phase is done, the size of the working space needed to perform QR is known. However, cusolverSpXcsrqrAnalysisBatched uses the CPU to analyze the structure of matrix A, and this may consume a lot of memory. If host memory is not sufficient to finish the analysis, CUSOLVER_STATUS_ALLOC_FAILED is returned. The memory required for analysis is proportional to the zero fill-in of the QR factorization. The user may need to perform some kind of reordering to minimize zero fill-in, for example, colamd or symrcm in MATLAB. The cuSolverSP library provides symrcm (cusolverSpXcsrsymrcm).
Second, the user needs to choose a proper batchSize and to prepare working space for sparse QR. There are two memory blocks used in batched sparse QR. One is the internal memory block used to store matrix Q and matrix R. The other is the working space used to perform numerical factorization. The size of the former is proportional to batchSize and is specified by the returned parameter internalDataInBytes of cusolverSp[S|D|C|Z]csrqrBufferInfoBatched, while the size of the latter is almost independent of batchSize and is specified by the returned parameter workspaceInBytes of cusolverSp[S|D|C|Z]csrqrBufferInfoBatched. The internal memory block is allocated implicitly during the first call of cusolverSp[S|D|C|Z]csrqrsvBatched. The user only needs to allocate the working space for cusolverSp[S|D|C|Z]csrqrsvBatched.
Instead of trying all batched matrices at once, the user can find the maximum batchSize by querying cusolverSp[S|D|C|Z]csrqrBufferInfoBatched. For example, the user can increase batchSize until the sum of internalDataInBytes and workspaceInBytes is greater than the size of available device memory.
Suppose that the user needs to solve 253 linear systems and the available device memory is 2GB. If cusolverSp[S|D|C|Z]csrqrsvBatched can only afford batchSize 100, the user has to call cusolverSp[S|D|C|Z]csrqrsvBatched three times to finish all of them. The user calls cusolverSp[S|D|C|Z]csrqrBufferInfoBatched with batchSize 100. The opaque info remembers this batchSize, and no subsequent call of cusolverSp[S|D|C|Z]csrqrsvBatched can exceed this value. In this example, the first two calls of cusolverSp[S|D|C|Z]csrqrsvBatched use batchSize 100, and the last call of cusolverSp[S|D|C|Z]csrqrsvBatched uses batchSize 53.
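The splitting in this example can be written down as a small planning routine. The sketch below only computes, for each call, the index of the first system and the number of systems. It does not call cuSolverSP, and the name plan_batches is hypothetical.

```c
#include <assert.h>

/* Splits `total` systems into calls of at most `maxBatch` systems each.
 * For call c, count[c] systems are solved starting at system first[c].
 * Returns the number of calls. `first` and `count` need room for
 * (total + maxBatch - 1) / maxBatch entries. */
int plan_batches(int total, int maxBatch, int *first, int *count)
{
    assert(total > 0 && maxBatch > 0);
    int calls = 0;
    for (int done = 0; done < total; done += maxBatch) {
        const int remaining = total - done;
        first[calls] = done;
        count[calls] = remaining < maxBatch ? remaining : maxBatch;
        ++calls;
    }
    return calls;
}
```

With total 253 and maxBatch 100 this yields three calls of sizes 100, 100 and 53. Given the aggregated storage described above, call c would presumably receive csrValA + first[c]*nnzA, together with the matching offsets into b and x, and count[c] as batchSize.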
Example: suppose that A0, A1, .., A9 have the same sparsity pattern, the following code solves 10 linear systems by batched sparse QR.
// Suppose that A0, A1, .., A9 are m x m sparse matrices represented in CSR format.
// Each matrix Aj has nnzA nonzeros and shares the same csrRowPtrA and csrColIndA.
// csrValA is the aggregation of A0, A1, ..., A9.
int m;                      // number of rows and columns of each Aj
int nnzA;                   // number of nonzeros of each Aj
int *csrRowPtrA;            // each Aj has the same csrRowPtrA
int *csrColIndA;            // each Aj has the same csrColIndA
double *csrValA;            // aggregation of A0,A1,...,A9
double *b;                  // aggregation of right-hand sides b0,b1,...,b9
double *x;                  // aggregation of solutions x0,x1,...,x9
const int batchSize = 10;   // 10 linear systems
cusolverSpHandle_t handle;  // handle to cusolver library
csrqrInfo_t info = NULL;
cusparseMatDescr_t descrA = NULL;
void *pBuffer = NULL;       // working space for numerical factorization
size_t internalDataInBytes = 0; // size of the internal data (Q and R)
size_t workspaceInBytes = 0;    // size of the working space
// step 1: create a descriptor
cusparseCreateMatDescr(&descrA);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE); // A is base-1
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL); // A is a general matrix
// step 2: create empty info structure
cusolverSpCreateCsrqrInfo(&info);
// step 3: symbolic analysis
cusolverSpXcsrqrAnalysisBatched(
    handle, m, m, nnzA,
    descrA, csrRowPtrA, csrColIndA, info);
// step 4: allocate working space for Aj*xj=bj
cusolverSpDcsrqrBufferInfoBatched(
    handle, m, m, nnzA,
    descrA, csrValA, csrRowPtrA, csrColIndA,
    batchSize, info,
    &internalDataInBytes, &workspaceInBytes);
cudaMalloc(&pBuffer, workspaceInBytes);
// step 5: solve Aj*xj = bj
cusolverSpDcsrqrsvBatched(
    handle, m, m, nnzA,
    descrA, csrValA, csrRowPtrA, csrColIndA,
    b, x, batchSize, info, pBuffer);
// step 6: destroy info
cusolverSpDestroyCsrqrInfo(info);
Please refer to Appendix B for detailed examples.
Remark 1: only GPU (device) path is provided.
parameter | cusolverSp MemSpace | description |
handle | host | handle to the cuSolverSP library context. |
m | host | number of rows of each matrix Aj. |
n | host | number of columns of each matrix Aj. |
nnzA | host | number of nonzeros of each matrix Aj. It is the size of csrColIndA. |
descrA | host | the descriptor of each matrix Aj. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | <type> array of nnzA*batchSize nonzero elements of matrices A0, A1, .... All matrices are aggregated one after another. |
csrRowPtrA | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | integer array of nnzA column indices of the nonzero elements of each matrix Aj. |
b | device | <type> array of m*batchSize of right-hand-side vectors b0, b1, .... All vectors are aggregated one after another. |
batchSize | host | number of systems to be solved. |
info | host | opaque structure for QR factorization. |
pBuffer | device | buffer allocated by the user, the size is returned by cusolverSpXcsrqrBufferInfoBatched(). |
parameter | cusolverSp MemSpace | description |
x | device | <type> array of m*batchSize of solution vectors x0, x1, .... All vectors are aggregated one after another. |
internalDataInBytes | host | number of bytes of the internal data. |
workspaceInBytes | host | number of bytes of the buffer in numerical factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4. CUDA 7.5 Preview
This section describes the new low-level APIs of cuSolverSP in CUDA 7.5. The low-level APIs include sparse LU, sparse Cholesky and sparse QR. The user has to include the header file cusolverSp_LOWLEVEL_PREVIEW.h.
LU, Cholesky and QR have the same flow, including
- analysis phase to find sparsity pattern of numerical factor.
- query size of buffer.
- numerical factorization.
- report singularity of numerical factorization.
- numerical solve to complete linear solver or least-square solver.
The user has to follow the above sequence to perform either a linear solver or a least-square solver.
6.4.1. cusolverSpXcsrlu()
The sparse LU factorization is used to factorize matrix A in the following form
$$P A Q^T = L U$$
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA and csrColIndA. P is a left permutation matrix that comes mainly from pivoting, and Q is a right permutation matrix that comes from postordering of the elimination tree. L is a lower triangular matrix with an implicit unit diagonal, while U is an upper triangular matrix.
If A is symmetric, the user has to extend it to a full matrix and set the matrix type to CUSPARSE_MATRIX_TYPE_GENERAL.
The low-level API does not reorder the matrix to minimize zero fill-in. The user can use cusolverSpXcsrsymrcm or cusolverSpXcsrsymamd to reorder the matrix to reduce zero fill-in.
cuSolverSP LU can be the first step of refactorization. Please refer to the SDK sample samples/7_CUDALibraries/cuSolverRf.
6.4.1.1. cusolverSpCreateCsrluInfo()
cusolverStatus_t cusolverSpCreateCsrluInfo[Host](csrluInfo[Host]_t *info);
cusolverStatus_t cusolverSpDestroyCsrluInfo[Host](csrluInfo[Host]_t info);
The function cusolverSpCreateCsrluInfo creates and initializes the opaque structure of LU to default values.
The function cusolverSpDestroyCsrluInfo releases any memory required by the structure.
Remark 1: only CPU path is provided.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | opaque structure for LU factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
6.4.1.2. cusolverSpXcsrluAnalysis()
cusolverStatus_t cusolverSpXcsrluAnalysis[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, csrluInfo[Host]_t info);
This function analyzes the sparsity pattern of matrix L and matrix U of the LU factorization. Pivoting is determined at runtime, so only a superset of the patterns of L and U can be found. After analysis, the size of the working space needed to perform LU can be retrieved from cusolverSpXcsrluBufferInfo.
The analysis phase needs working space to estimate sparsity pattern of L and U. If host memory is not sufficient to finish the analysis, CUSOLVER_STATUS_ALLOC_FAILED is returned.
Remark 1: only CPU path is provided.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | recording scheduling information used in numerical factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.1.3. cusolverSpXcsrluBufferInfo()
cusolverStatus_t cusolverSpScsrluBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpDcsrluBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpCcsrluBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpZcsrluBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
There are two memory blocks used in sparse LU. One is internal memory used to store matrix L and matrix U. The other is working space used to perform numerical factorization. The size of the former is specified by returned parameter internalDataInBytes; while the size of the latter is specified by returned parameter workspaceInBytes.
The first call of cusolverSpXcsrluFactor would allocate L and U whose size is bounded by internalDataInBytes. Once internal memory (of size internalDataInBytes bytes) is allocated by cusolverSpXcsrluFactor, the life time is the same as info. Such internal memory is different from working space of size workspaceInBytes bytes, whose life time starts at the beginning of the calling function and ends when the function returns.
Remark 1: only CPU path is provided.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
info | host | host | opaque structure for LU factorization. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
internalDataInBytes | host | host | number of bytes of the internal data. |
workspaceInBytes | host | host | number of bytes of the buffer in numerical factorization. |
info | host | host | recording internal parameters for buffer. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.1.4. cusolverSpXcsrluFactor()
cusolverStatus_t cusolverSpScsrluFactor[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, float pivot_threshold, void *pBuffer);
cusolverStatus_t cusolverSpDcsrluFactor[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, double pivot_threshold, void *pBuffer);
cusolverStatus_t cusolverSpCcsrluFactor[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, float pivot_threshold, void *pBuffer);
cusolverStatus_t cusolverSpZcsrluFactor[Host](cusolverSpHandle_t handle, int n, int nnzA,
    const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA,
    csrluInfo[Host]_t info, double pivot_threshold, void *pBuffer);
This function performs numerical factorization
The first call to cusolverSpXcsrluFactor would allocate space for L and U. If the memory is insufficient, CUSOLVER_STATUS_ALLOC_FAILED is returned. The numerical factor L and U are kept in structure info and can be used in cusolverSpXcsrluSolve.
The parameter pivot_threshold controls diagonal pivoting and takes a value between 0 and 1. If pivot_threshold is 0, no pivoting is performed; if pivot_threshold is 1, traditional pivoting is performed. Assuming that the first j-1 columns are done, A is updated, and ξ = max{|A(j:end,j)|} is the candidate of traditional pivoting, the diagonal A(j,j) is chosen as the pivot when |A(j,j)| ≥ pivot_threshold * ξ.
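The threshold rule can be illustrated with a minimal C sketch. It is not part of cuSolver: choose_pivot is a hypothetical helper, and the acceptance test |A(j,j)| ≥ pivot_threshold * ξ is an assumption that is consistent with the two limiting cases of pivot_threshold (0 means no pivoting, 1 means traditional pivoting).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not a cuSolver routine: decide which entry of the
 * current column supplies the pivot.
 *   col       - entries A(j:end, j) of the updated column; col[0] is A(j,j)
 *   len       - number of entries in col (must be >= 1)
 *   threshold - pivot_threshold, between 0 and 1
 * Returns 0 when the diagonal is kept, otherwise the offset of the
 * largest-magnitude entry (the traditional pivoting choice). */
size_t choose_pivot(const double *col, size_t len, double threshold)
{
    assert(col != NULL && len >= 1);
    assert(threshold >= 0.0 && threshold <= 1.0);

    /* xi = max |A(j:end, j)|, the candidate of traditional pivoting. */
    size_t arg_max = 0;
    double xi = fabs(col[0]);
    for (size_t i = 1; i < len; ++i) {
        if (fabs(col[i]) > xi) {
            xi = fabs(col[i]);
            arg_max = i;
        }
    }

    /* Assumed rule: keep the diagonal when |A(j,j)| >= threshold * xi. */
    if (fabs(col[0]) >= threshold * xi) {
        return 0;
    }
    return arg_max;
}
```

For the column {1.0, -10.0, 4.0}, a threshold of 0.05 keeps the diagonal (1 ≥ 0.5), while a threshold of 0.5 selects the entry at offset 1.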
Remark 1: only a CPU path is provided.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
info | host | host | opaque structure for LU factorization. |
pivot_threshold | host | host | a threshold to enable diagonal pivoting. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrluBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | containing numerical factor L and U. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.1.5. cusolverSpXcsrluZeroPivot()
cusolverStatus_t cusolverSpScsrluZeroPivot[Host](cusolverSpHandle_t handle, csrluInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpDcsrluZeroPivot[Host](cusolverSpHandle_t handle, csrluInfo[Host]_t info, double tol, int *position);
cusolverStatus_t cusolverSpCcsrluZeroPivot[Host](cusolverSpHandle_t handle, csrluInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpZcsrluZeroPivot[Host](cusolverSpHandle_t handle, csrluInfo[Host]_t info, double tol, int *position);
If A is singular under the given tolerance (max(tol,0)), then some diagonal elements of U are zero, i.e. |U(j,j)| < tol for some j.
The output parameter position is the smallest index of such j. If A is non-singular, position is -1. The index is base-0, independent of base index of A. For example, if 2nd column of A is the same as first column, then A is singular and position = 1 which means U(1,1)≈0.
The numerical factorization must be done before calling this function, otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
Remark 1: only a CPU path is provided.
Remark 2: This routine is not intended to prove that a matrix is singular or non-singular, but to show the need for pivoting. When the pivot threshold is set to 0.0 (no pivoting), this routine may return false positives, i.e., report a zero pivot when the matrix is not singular. When the pivot threshold is 1.0, the output of the zero-pivot routine can be trusted.
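The false positive described in Remark 2 can be reproduced with a minimal C sketch on a small dense matrix. first_zero_pivot is a hypothetical stand-in, not the cuSolver routine: it runs LU without pivoting (the pivot_threshold = 0 case) and reports the smallest base-0 index j with |U(j,j)| < tol. The strict comparison and the requirement tol > 0 are assumptions of this sketch.

```c
#include <assert.h>
#include <math.h>

#define N 3

/* Hypothetical stand-in for a zero-pivot query: LU without pivoting on a
 * small dense matrix, overwritten in place by the elimination.
 * Returns the smallest base-0 index j with |U(j,j)| < tol, or -1 if no
 * such j exists. */
int first_zero_pivot(double a[N][N], double tol)
{
    assert(tol > 0.0);  /* keeps the division below safe in this sketch */

    for (int j = 0; j < N; ++j) {
        if (fabs(a[j][j]) < tol) {
            return j;   /* U(j,j) is zero under the tolerance */
        }
        /* Eliminate column j below the diagonal. */
        for (int i = j + 1; i < N; ++i) {
            const double l = a[i][j] / a[j][j];
            for (int k = j; k < N; ++k) {
                a[i][k] -= l * a[j][k];
            }
        }
    }
    return -1;
}
```

A matrix whose 2nd column equals its 1st yields position 1. The non-singular permutation matrix {{0,1,0},{1,0,0},{0,0,1}} yields position 0, a false positive that row pivoting would avoid.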
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
info | host | host | opaque structure for LU factorization. |
tol | host | host | tolerance to determine singularity. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
position | host | host | -1 if A is non-singular; otherwise, the first column j for which U(j,j) is zero under the given tolerance. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
6.4.1.6. cusolverSpXcsrluSolve()
cusolverStatus_t cusolverSpScsrluSolve[Host](cusolverSpHandle_t handle, int n, const float *b, float *x, csrluInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrluSolve[Host](cusolverSpHandle_t handle, int n, const double *b, double *x, csrluInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrluSolve[Host](cusolverSpHandle_t handle, int n, const cuComplex *b, cuComplex *x, csrluInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrluSolve[Host](cusolverSpHandle_t handle, int n, const cuDoubleComplex *b, cuDoubleComplex *x, csrluInfo[Host]_t info, void *pBuffer);
This function solves the linear system A*x = b by forward and backward substitution, using the factors L and U kept in info.
The numerical factorization must be done before calling this function; otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
Remark 1: only a CPU path is provided.
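Forward and backward substitution can be sketched on a small dense system. This is an illustration only: lu_solve is a hypothetical helper, L is assumed to be unit lower triangular, matrices are stored row-major, and the permutations P and Q of the sparse factorization are ignored.

```c
#include <assert.h>

/* Hypothetical helper: solve L*U*x = b for dense n-by-n factors stored
 * row-major. L is unit lower triangular (its diagonal is never read) and
 * U is upper triangular with a nonzero diagonal. */
void lu_solve(int n, const double *l, const double *u,
              const double *b, double *x)
{
    assert(n > 0);

    /* Forward substitution: L*y = b, with y stored in x. */
    for (int i = 0; i < n; ++i) {
        x[i] = b[i];
        for (int k = 0; k < i; ++k) {
            x[i] -= l[i * n + k] * x[k];
        }
    }

    /* Backward substitution: U*x = y, overwriting y in place. */
    for (int i = n - 1; i >= 0; --i) {
        for (int k = i + 1; k < n; ++k) {
            x[i] -= u[i * n + k] * x[k];
        }
        assert(u[i * n + i] != 0.0);
        x[i] /= u[i * n + i];
    }
}
```

With L = {{1,0},{0.5,1}}, U = {{4,2},{0,2}} and b = {6,5}, the forward pass gives y = {6,2} and the backward pass gives x = {1,1}.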
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
b | device | host | <type> array of n elements holding the right-hand-side vector b. |
info | host | host | opaque structure for LU factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrluBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | <type> array of n elements holding the solution vector x. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
6.4.1.7. cusolverSpXcsrluExtract()
cusolverStatus_t cusolverSpXcsrluNnz[Host](cusolverSpHandle_t handle, int *nnzLRef, int *nnzURef, csrluInfo[Host]_t info);
cusolverStatus_t cusolverSpScsrluExtract[Host](cusolverSpHandle_t handle, int *P, int *Q, const cusparseMatDescr_t descrL, float *csrValL, int *csrRowPtrL, int *csrColIndL, const cusparseMatDescr_t descrU, float *csrValU, int *csrRowPtrU, int *csrColIndU, csrluInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrluExtract[Host](cusolverSpHandle_t handle, int *P, int *Q, const cusparseMatDescr_t descrL, double *csrValL, int *csrRowPtrL, int *csrColIndL, const cusparseMatDescr_t descrU, double *csrValU, int *csrRowPtrU, int *csrColIndU, csrluInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrluExtract[Host](cusolverSpHandle_t handle, int *P, int *Q, const cusparseMatDescr_t descrL, cuComplex *csrValL, int *csrRowPtrL, int *csrColIndL, const cusparseMatDescr_t descrU, cuComplex *csrValU, int *csrRowPtrU, int *csrColIndU, csrluInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrluExtract[Host](cusolverSpHandle_t handle, int *P, int *Q, const cusparseMatDescr_t descrL, cuDoubleComplex *csrValL, int *csrRowPtrL, int *csrColIndL, const cusparseMatDescr_t descrU, cuDoubleComplex *csrValU, int *csrRowPtrU, int *csrColIndU, csrluInfo[Host]_t info, void *pBuffer);
The function cusolverSpXcsrluExtract extracts information of LU factorization, including left permutation vector P, right permutation vector Q, lower triangular matrix L and upper triangular matrix U.
P, Q, L and U satisfy the relation P*A*Q^T = L*U.
First, the user obtains the number of nonzeros of L and U from cusolverSpXcsrluNnz; then allocates the CSR arrays of L and U; and finally retrieves the matrices L and U from cusolverSpXcsrluExtract.
The numerical factorization must be done before calling this function, otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
Remark 1: L has an implicit unit diagonal.
Remark 2: permutation vectors P and Q are base-0.
Remark 3: only a CPU path is provided.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
descrL | host | host | the descriptor of matrix L. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
descrU | host | host | the descriptor of matrix U. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
info | host | host | opaque structure for LU factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrluBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
nnzLRef | host | host | number of nonzeros of matrix L. |
nnzURef | host | host | number of nonzeros of matrix U. |
P | device | host | integer array of n elements holding the left permutation vector. |
Q | device | host | integer array of n elements holding the right permutation vector. |
csrValL | device | host | <type> array of nnzL nonzero elements of matrix L. |
csrRowPtrL | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one of matrix L. |
csrColIndL | device | host | integer array of nnzL column indices of the nonzero elements of matrix L. |
csrValU | device | host | <type> array of nnzU nonzero elements of matrix U. |
csrRowPtrU | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one of matrix U. |
csrColIndU | device | host | integer array of nnzU column indices of the nonzero elements of matrix U. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence or base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.2. cusolverSpXcsrqr()
The sparse QR factorization is used to factorize matrix A in the following form: A = Q*R.
A is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA and csrColIndA.
The QR factorization only works if m is not less than n.
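The three CSR arrays and the effect of the index base can be illustrated with a small lookup routine. This is a minimal sketch: csr_get is a hypothetical helper, not a cuSolver or cuSPARSE function, and the 3×2 example matrix in the comment is made up for illustration.

```c
#include <assert.h>

/* Example 3x2 matrix (m >= n):
 *     | 1 0 |
 * A = | 0 2 |
 *     | 3 4 |
 * base 0: csrVal = {1,2,3,4}, csrRowPtr = {0,1,2,4}, csrColInd = {0,1,0,1}
 * base 1: csrVal = {1,2,3,4}, csrRowPtr = {1,2,3,5}, csrColInd = {1,2,1,2}
 *
 * Hypothetical helper: return A(i,j), with i and j given base-0, from CSR
 * arrays whose row pointers and column indices use index base `base`. */
double csr_get(int m, int base, const double *csrVal,
               const int *csrRowPtr, const int *csrColInd, int i, int j)
{
    assert(base == 0 || base == 1);
    assert(i >= 0 && i < m);

    /* Row i occupies positions csrRowPtr[i] .. csrRowPtr[i+1]-1. */
    for (int p = csrRowPtr[i] - base; p < csrRowPtr[i + 1] - base; ++p) {
        if (csrColInd[p] - base == j) {
            return csrVal[p];
        }
    }
    return 0.0;  /* entry is not stored, so it is a structural zero */
}
```

The last entry of csrRowPtr minus the first entry equals nnzA for either index base, which is why the array has m+1 elements.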
The following three applications can take advantage of sparse QR.
1. linear solver: A*x = b
2. least-square solver: x = argmin ||A*z - b||
3. eigenvalue solver: A*x = λ*x
To cover the above three applications within the same flow, the factorization phase is split into two steps.
Step 1: shift the diagonal of A by μ, i.e. form A - μ*I.
This step is designed for the eigenvalue solver, mainly the shift-inverse power method. For the linear solver and the least-square solver, the user should set μ to zero.
Step 2: numerical factorization, A - μ*I = Q*R.
If A is not of full rank, cusolverSpXcsrqrZeroPivot reports the singularity.
6.4.2.1. cusolverSpCreateCsrqrInfo()
cusolverStatus_t cusolverSpCreateCsrqrInfo[Host](csrqrInfo[Host]_t *info); cusolverStatus_t cusolverSpDestroyCsrqrInfo[Host](csrqrInfo[Host]_t info);
The function cusolverSpCreateCsrqrInfo creates and initializes the opaque structure of QR to default values.
The function cusolverSpDestroyCsrqrInfo releases any memory required by the structure.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | opaque structure for QR factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
6.4.2.2. cusolverSpXcsrqrAnalysis()
cusolverStatus_t cusolverSpXcsrqrAnalysis[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, csrqrInfo[Host]_t info);
This function analyzes sparsity pattern of matrix H and matrix R of QR factorization. After analysis, the size of working space to perform QR can be retrieved from cusolverSpXcsrqrBufferInfo.
The analysis phase needs working space to find sparsity pattern of H and R. If host memory is not sufficient to finish the analysis, CUSOLVER_STATUS_ALLOC_FAILED is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | recording scheduling information used in numerical factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.2.3. cusolverSpXcsrqrBufferInfo()
cusolverStatus_t cusolverSpScsrqrBufferInfo[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrqrInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes); cusolverStatus_t cusolverSpDcsrqrBufferInfo[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrqrInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpCcsrqrBufferInfo[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrqrInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes); cusolverStatus_t cusolverSpZcsrqrBufferInfo[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrqrInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
There are two memory blocks used in sparse QR. One is internal memory used to store matrix H and matrix R. The other is working space used to perform numerical factorization. The size of the former is specified by returned parameter internalDataInBytes; while the size of the latter is specified by returned parameter workspaceInBytes.
The first call of cusolverSpXcsrqrSetup would allocate H and R whose size is bounded by internalDataInBytes. Once internal memory (of size internalDataInBytes bytes) is allocated by cusolverSpXcsrqrSetup, the life time is the same as info. Such internal memory is different from working space of size workspaceInBytes bytes, whose life time starts at the beginning of the calling function and ends when the function returns.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
info | host | host | opaque structure for QR factorization. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
internalDataInBytes | host | host | number of bytes of the internal data. |
workspaceInBytes | host | host | number of bytes of the buffer in numerical factorization. |
info | host | host | recording internal parameters for buffer. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.2.4. cusolverSpXcsrqrSetup()
cusolverStatus_t cusolverSpScsrqrSetup[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, float mu, csrqrInfo[Host]_t info); cusolverStatus_t cusolverSpDcsrqrSetup[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, double mu, csrqrInfo[Host]_t info);
cusolverStatus_t cusolverSpCcsrqrSetup[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuComplex mu, csrqrInfo[Host]_t info); cusolverStatus_t cusolverSpZcsrqrSetup[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, cuDoubleComplex mu, csrqrInfo[Host]_t info);
This function shifts the diagonal of A by the parameter mu such that we can factorize A - mu*I = Q*R.
For linear solver, the user just sets mu to zero. For eigenvalue solver, mu can be a value of shift in inverse-power method.
The first call to cusolverSpXcsrqrSetup would allocate space for H and R. If the memory is insufficient, CUSOLVER_STATUS_ALLOC_FAILED is returned. The numerical factor H and R are kept in structure info and can be used in cusolverSpXcsrqrSolve.
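For a square matrix, the diagonal shift can be pictured as an in-place update of the CSR values. This is a minimal sketch, not the library's implementation: csr_shift_diagonal is a hypothetical helper, it assumes base-0 indices, and it only updates diagonal entries that are already stored in the sparsity pattern.

```c
#include <assert.h>

/* Hypothetical helper: subtract mu from every stored diagonal entry of an
 * n-by-n CSR matrix with base-0 indices, i.e. form A - mu*I in place.
 * Returns the number of diagonal entries that were updated; a value
 * smaller than n means some diagonal entries are not stored, and this
 * sketch does not insert them. */
int csr_shift_diagonal(int n, double *csrVal, const int *csrRowPtr,
                       const int *csrColInd, double mu)
{
    assert(n > 0 && csrVal && csrRowPtr && csrColInd);

    int updated = 0;
    for (int row = 0; row < n; ++row) {
        for (int p = csrRowPtr[row]; p < csrRowPtr[row + 1]; ++p) {
            if (csrColInd[p] == row) {   /* found A(row,row) */
                csrVal[p] -= mu;
                ++updated;
                break;
            }
        }
    }
    return updated;
}
```

Setting mu to zero leaves the values unchanged, which corresponds to the linear-solver and least-square use cases.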
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
mu | host | host | value of shift. |
info | host | host | opaque structure for QR factorization. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | subtract mu from diagonal of A. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.2.5. cusolverSpXcsrqrFactor()
cusolverStatus_t cusolverSpScsrqrFactor[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, float *b, float *x, csrqrInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrqrFactor[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, double *b, double *x, csrqrInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrqrFactor[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, cuComplex *b, cuComplex *x, csrqrInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrqrFactor[Host](cusolverSpHandle_t handle, int m, int n, int nnzA, cuDoubleComplex *b, cuDoubleComplex *x, csrqrInfo[Host]_t info, void *pBuffer);
This function performs the numerical factorization A - μ*I = Q*R.
cusolverSpXcsrqrSetup subtracts μ from the diagonal of A. The numerical factors H and R are kept in the structure info and can be used in cusolverSpXcsrqrSolve.
If either x or b is NULL, only the factorization is done. The user then needs cusolverSpXcsrqrSolve to find the least-square solution.
If both x and b are non-NULL, QR factorization and solve are combined. b is overwritten by c and x is the least-square solution.
In this case, the user does not need cusolverSpXcsrqrSolve.
On the GPU path it is better to combine factorization and solve, because the solve phase is sequential.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
b | device | host | <type> array of m elements of right-hand-side vector. |
info | host | host | opaque structure for QR factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrqrBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | containing numerical factor H and R. |
x | device | host | <type> array of n elements holding the least-square solution if x and b are not NULL. |
b | device | host | overwritten by c if x and b are not NULL. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.2.6. cusolverSpXcsrqrZeroPivot()
cusolverStatus_t cusolverSpScsrqrZeroPivot[Host](cusolverSpHandle_t handle, csrqrInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpDcsrqrZeroPivot[Host](cusolverSpHandle_t handle, csrqrInfo[Host]_t info, double tol, int *position);
cusolverStatus_t cusolverSpCcsrqrZeroPivot[Host](cusolverSpHandle_t handle, csrqrInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpZcsrqrZeroPivot[Host](cusolverSpHandle_t handle, csrqrInfo[Host]_t info, double tol, int *position);
If A is not of full rank under the given tolerance (max(tol,0)), then some diagonal elements of R are zero, i.e. |R(j,j)| < tol for some j.
The output parameter position is the smallest index of such j. If A is of full rank, position is -1. The index is base-0, independent of base index of A. For example, if 2nd column of A is the same as first column, then A is rank deficient and position = 1 which means R(1,1)≈0.
The numerical factorization must be done before calling this function, otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
info | host | host | opaque structure for QR factorization. |
tol | host | host | tolerance to determine singularity. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
position | host | host | -1 if A is of full rank; otherwise, the first column j for which R(j,j) is zero under the given tolerance. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
6.4.2.7. cusolverSpXcsrqrSolve()
cusolverStatus_t cusolverSpScsrqrSolve[Host](cusolverSpHandle_t handle, int m, int n, float *b, float *x, csrqrInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrqrSolve[Host](cusolverSpHandle_t handle, int m, int n, double *b, double *x, csrqrInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrqrSolve[Host](cusolverSpHandle_t handle, int m, int n, cuComplex *b, cuComplex *x, csrqrInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrqrSolve[Host](cusolverSpHandle_t handle, int m, int n, cuDoubleComplex *b, cuDoubleComplex *x, csrqrInfo[Host]_t info, void *pBuffer);
This function solves the following least-square problem: x = argmin ||A*z - b||.
b is overwritten by c and x is the solution of least-square.
The numerical factorization must be done before calling this function, otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
Remark 1: matrix A is actually A - μ*I, the matrix shifted by cusolverSpXcsrqrSetup.
Remark 2:
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
m | host | host | number of rows of matrix A. |
n | host | host | number of columns of matrix A. |
b | device | host | <type> array of m elements holding the right-hand-side vector b. |
info | host | host | opaque structure for QR factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrqrBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | <type> array of n elements holding the solution vector x. |
b | device | host | overwritten by c. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
6.4.3. cusolverSpXcsrchol()
The sparse Cholesky factorization is used to factorize a symmetric positive definite matrix A in the following form: P*A*P^T = L*L^H.
A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA and csrColIndA. The low-level API only factors the lower triangular part of A. The upper triangular part is implicitly assumed to be the symmetric counterpart of the lower triangular part.
The low-level API does not reorder the matrix to minimize zero fill-in. The user can use cusolverSpXcsrsymrcm or cusolverSpXcsrsymamd to reorder the matrix to reduce zero fill-in. The permutation matrix P is the post-ordering of the elimination tree.
The Cholesky factor L is a lower triangular matrix which is denser than A. The diagonal of L is positive if A is positive definite. Otherwise, cusolverSpXcsrcholZeroPivot can report singularity.
To solve a linear system A*x = b, the user needs symbolic analysis from cusolverSpXcsrcholAnalysis, numerical factorization from cusolverSpXcsrcholFactor and forward/backward substitution from cusolverSpXcsrcholSolve.
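A worked 2×2 example may help to see why the diagonal of L is positive for a positive definite matrix and how a zero pivot arises otherwise. The matrices are made up for illustration, and the example assumes a real matrix (so the conjugate transpose reduces to the transpose) and no permutation.

```latex
A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}, \qquad
\ell_{11} = \sqrt{4} = 2, \quad
\ell_{21} = \frac{2}{\ell_{11}} = 1, \quad
\ell_{22} = \sqrt{3 - \ell_{21}^{2}} = \sqrt{2}
\;\Longrightarrow\;
L = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix}, \quad
L L^{T} = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix} = A .

% Replacing the (2,2) entry by 1 makes A singular:
\tilde{A} = \begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}, \qquad
\tilde{\ell}_{22} = \sqrt{1 - 1^{2}} = 0 .
```

In the second case the zero appears at base-0 index 1 of the diagonal, which is the kind of position a zero-pivot query is meant to report.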
6.4.3.1. cusolverSpCreateCsrcholInfo()
cusolverStatus_t cusolverSpCreateCsrcholInfo[Host](csrcholInfo[Host]_t *info); cusolverStatus_t cusolverSpDestroyCsrcholInfo[Host](csrcholInfo[Host]_t info);
The function cusolverSpCreateCsrcholInfo creates and initializes the opaque structure of Cholesky to default values.
The function cusolverSpDestroyCsrcholInfo releases any memory required by the structure.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | opaque structure for Cholesky factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
6.4.3.2. cusolverSpXcsrcholAnalysis()
cusolverStatus_t cusolverSpXcsrcholAnalysis[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info);
This function analyzes sparsity pattern of matrix L of Cholesky factorization. After analysis, the size of working space to perform Cholesky can be retrieved from cusolverSpXcsrcholBufferInfo.
The analysis phase needs working space to find sparsity pattern of L. If host memory is not sufficient to finish the analysis, CUSOLVER_STATUS_ALLOC_FAILED is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | recording scheduling information used in numerical factorization. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.3.3. cusolverSpXcsrcholBufferInfo()
cusolverStatus_t cusolverSpScsrcholBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes); cusolverStatus_t cusolverSpDcsrcholBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
cusolverStatus_t cusolverSpCcsrcholBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes); cusolverStatus_t cusolverSpZcsrcholBufferInfo[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, size_t *internalDataInBytes, size_t *workspaceInBytes);
There are two memory blocks used in sparse Cholesky. One is internal memory used to store matrix L. The other is working space used to perform numerical factorization. The size of the former is specified by returned parameter internalDataInBytes; while the size of the latter is specified by returned parameter workspaceInBytes.
The first call of cusolverSpXcsrcholFactor would allocate L whose size is bounded by internalDataInBytes. Once internal memory (of size internalDataInBytes bytes) is allocated by cusolverSpXcsrcholFactor, the life time is the same as info. Such internal memory is different from working space of size workspaceInBytes bytes, whose life time starts at the beginning of the calling function and ends when the function returns.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
info | host | host | opaque structure for Cholesky factorization. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
internalDataInBytes | host | host | number of bytes of the internal data. |
workspaceInBytes | host | host | number of bytes of the buffer in numerical factorization. |
info | host | host | recording internal parameters for buffer. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
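The constraints in the tables above (positive n and nnzA, index base 0 or 1, a row-pointer array of n+1 entries ending at "last row plus one") can be checked on the host before calling the library. The following is a minimal sketch under that reading of the tables; `csr_is_consistent` is a hypothetical helper and not part of cuSolver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side helper (not a cuSolver routine).
 * Returns 1 if the CSR arrays look consistent with the parameter
 * table, 0 otherwise. base is the index base (0 or 1). */
int csr_is_consistent(int n, int nnz, int base,
                      const int *rowPtr, const int *colInd)
{
    assert(rowPtr != NULL && colInd != NULL);
    if (n <= 0 || nnz <= 0) return 0;          /* n, nnzA must be positive */
    if (base != 0 && base != 1) return 0;      /* only base 0 or 1         */
    /* rowPtr holds the start of every row and the end of the last row + 1 */
    if (rowPtr[0] != base || rowPtr[n] != nnz + base) return 0;
    for (int i = 0; i < n; ++i)
        if (rowPtr[i] > rowPtr[i + 1]) return 0;   /* offsets never decrease */
    for (int k = 0; k < nnz; ++k)
        if (colInd[k] < base || colInd[k] >= n + base) return 0;
    return 1;
}
```

A check of this kind is cheap compared with the factorization and can make a CUSOLVER_STATUS_INVALID_VALUE return easier to diagnose.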
6.4.3.4. cusolverSpXcsrcholFactor()
cusolverStatus_t cusolverSpScsrcholFactor[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrcholFactor[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrcholFactor[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrcholFactor[Host](cusolverSpHandle_t handle, int n, int nnzA, const cusparseMatDescr_t descrA, const cuDoubleComplex *csrValA, const int *csrRowPtrA, const int *csrColIndA, csrcholInfo[Host]_t info, void *pBuffer);
This function performs numerical factorization A = L*L^H.
The first call to cusolverSpXcsrcholFactor would allocate space for L. If the memory is insufficient, CUSOLVER_STATUS_ALLOC_FAILED is returned. The numerical factor L is kept in structure info and can be used in cusolverSpXcsrcholSolve.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
nnzA | host | host | number of nonzeros of matrix A. |
descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
csrValA | device | host | <type> array of nnzA nonzero elements of matrix A. |
csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
csrColIndA | device | host | integer array of nnzA column indices of the nonzero elements. |
info | host | host | opaque structure for Cholesky factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrcholBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
info | host | host | containing numerical factor L. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |
6.4.3.5. cusolverSpXcsrcholZeroPivot()
cusolverStatus_t cusolverSpScsrcholZeroPivot[Host](cusolverSpHandle_t handle, csrcholInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpDcsrcholZeroPivot[Host](cusolverSpHandle_t handle, csrcholInfo[Host]_t info, double tol, int *position);
cusolverStatus_t cusolverSpCcsrcholZeroPivot[Host](cusolverSpHandle_t handle, csrcholInfo[Host]_t info, float tol, int *position); cusolverStatus_t cusolverSpZcsrcholZeroPivot[Host](cusolverSpHandle_t handle, csrcholInfo[Host]_t info, double tol, int *position);
If A is not positive definite, there exists some integer k such that A(0:k, 0:k) is not positive definite. The output parameter position is the minimum of such k.
If A is positive definite but near singular under tolerance (max(tol,0)), i.e. there exists some integer k such that L(k,k) <= tol, the output parameter position is the minimum of such k.
If A is non-singular, position is -1. The position is base-0, independent of base index of A.
The numerical factorization must be done before calling this function, otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
info | host | host | opaque structure for Cholesky factorization. |
tol | host | host | tolerance to determine singularity. |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
position | host | host | -1 if A is non-singular; otherwise, smallest k that A(0:k,0:k) is not positive definite under given tolerance. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
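The meaning of position can be illustrated with a small dense reference. This is only a sketch, not the library's sparse algorithm: it assumes a row-major symmetric matrix, assumes that near-singularity is judged by comparing the Cholesky diagonal L(k,k) with max(tol,0), and uses a root-free (LDL^T-style) recurrence so that the squared diagonal d = L(k,k)^2 is compared with tol^2. The name `dense_chol_zero_pivot` and the workspace W are hypothetical.

```c
#include <assert.h>

/* Dense reference sketch (hypothetical helper, not a cuSolver routine).
 * A: n-by-n symmetric matrix, row-major. W: caller workspace of n*n doubles.
 * Returns the smallest base-0 index k at which the leading block A(0:k,0:k)
 * fails to be positive definite under tolerance tol, or -1 if none. */
int dense_chol_zero_pivot(int n, const double *A, double *W, double tol)
{
    assert(n > 0 && A != 0 && W != 0);
    if (tol < 0.0) tol = 0.0;                  /* mirrors max(tol,0) */
    for (int k = 0; k < n; ++k) {
        /* d is the k-th pivot; it equals L(k,k)^2 of the Cholesky factor */
        double d = A[k * n + k];
        for (int j = 0; j < k; ++j)
            d -= W[k * n + j] * W[k * n + j] * W[j * n + j];
        /* d <= 0: not positive definite; d <= tol^2: near singular */
        if (d <= tol * tol) return k;
        W[k * n + k] = d;                      /* pivots live on the diagonal */
        for (int i = k + 1; i < n; ++i) {
            double s = A[i * n + k];
            for (int j = 0; j < k; ++j)
                s -= W[i * n + j] * W[k * n + j] * W[j * n + j];
            W[i * n + k] = s / d;              /* unit-lower multiplier */
        }
    }
    return -1;                                 /* non-singular under tol */
}
```

For example, the matrix [[4,2],[2,1]] is singular, so the sketch reports position 1 with tol = 0, while [[4,2],[2,3]] reports -1.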
6.4.3.6. cusolverSpXcsrcholSolve()
cusolverStatus_t cusolverSpScsrcholSolve[Host](cusolverSpHandle_t handle, int n, const float *b, float *x, csrcholInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpDcsrcholSolve[Host](cusolverSpHandle_t handle, int n, const double *b, double *x, csrcholInfo[Host]_t info, void *pBuffer);
cusolverStatus_t cusolverSpCcsrcholSolve[Host](cusolverSpHandle_t handle, int n, const cuComplex *b, cuComplex *x, csrcholInfo[Host]_t info, void *pBuffer); cusolverStatus_t cusolverSpZcsrcholSolve[Host](cusolverSpHandle_t handle, int n, const cuDoubleComplex *b, cuDoubleComplex *x, csrcholInfo[Host]_t info, void *pBuffer);
This function solves the linear system A*x = b by forward and backward substitution. Numerical factorization must be completed before calling this function; otherwise, CUSOLVER_STATUS_INVALID_VALUE is returned.
parameter | cusolverSp MemSpace | *Host MemSpace | description |
handle | host | host | handle to the cuSolverSP library context. |
n | host | host | number of rows and columns of matrix A. |
b | device | host | <type> array of n elements holding the right-hand-side vector b. |
info | host | host | opaque structure for Cholesky factorization. |
pBuffer | device | host | buffer allocated by the user, the size is returned by cusolverSpXcsrcholBufferInfo(). |
parameter | cusolverSp MemSpace | *Host MemSpace | description |
x | device | host | <type> array of n elements holding the solution vector x. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | invalid calling sequence. |
cuSolverRF: Refactorization Reference
This chapter describes the API of cuSolverRF, a library for fast refactorization.
cusolverRfAccessBundledFactors()
cusolverStatus_t cusolverRfAccessBundledFactors(/* Input */ cusolverRfHandle_t handle, /* Output (in the host memory) */ int* nnzM, /* Output (in the device memory) */ int** Mp, int** Mi, double** Mx);
This routine allows direct access to the lower L and upper U triangular factors stored in the cuSolverRF library handle. The factors are compressed into a single matrix M=(L-I)+U, where the unit diagonal of L is not stored. It is assumed that a prior call to cusolverRfRefactor() was made in order to generate these triangular factors.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
nnzM | host | output | the number of non-zero elements of matrix M. |
Mp | device | output | the array of offsets corresponding to the start of each row in the arrays Mi and Mx. This array also has an extra entry at the end that stores the number of non-zero elements in the matrix M. The array size is n+1. |
Mi | device | output | the array of column indices corresponding to the non-zero elements in the matrix M. It is assumed that this array is sorted by row and by column within each row. The array size is nnzM. |
Mx | device | output | the array of values corresponding to the non-zero elements in the matrix M. It is assumed that this array is sorted by row and by column within each row. The array size is nnzM. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
cusolverRfAnalyze()
cusolverStatus_t cusolverRfAnalyze(cusolverRfHandle_t handle);
This routine performs the appropriate analysis of parallelism available in the LU re-factorization depending upon the algorithm chosen by the user.
It is assumed that a prior call to cusolverRfSetup[Host|Device]() was made in order to create the internal data structures needed for the analysis.
This routine needs to be called only once for a single linear system A_i x_i = f_i.
parameter | MemSpace | In/out | Meaning |
handle | host | in/out | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverRfSetupDevice()
cusolverStatus_t cusolverRfSetupDevice(/* Input (in the device memory) */ int n, int nnzA, int* csrRowPtrA, int* csrColIndA, double* csrValA, int nnzL, int* csrRowPtrL, int* csrColIndL, double* csrValL, int nnzU, int* csrRowPtrU, int* csrColIndU, double* csrValU, int* P, int* Q, /* Output */ cusolverRfHandle_t handle);
This routine assembles the internal data structures of the cuSolverRF library. It is often the first routine to be called after the call to the cusolverRfCreate() routine.
This routine accepts as input (on the device) the original matrix A, the lower (L) and upper (U) triangular factors, as well as the left (P) and the right (Q) permutations resulting from the full LU factorization of the first (i=1) linear system A_i x_i = f_i.
The permutations P and Q represent the final composition of all the left and right reorderings applied to the original matrix A, respectively. However, these permutations are often associated with partial pivoting and reordering to minimize fill-in, respectively.
This routine needs to be called only once for a single linear system A_i x_i = f_i.
parameter | MemSpace | In/out | Meaning |
n | host | input | the number of rows (and columns) of matrix A. |
nnzA | host | input | the number of non-zero elements of matrix A. |
csrRowPtrA | device | input | the array of offsets corresponding to the start of each row in the arrays csrColIndA and csrValA. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. The array size is n+1. |
csrColIndA | device | input | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
csrValA | device | input | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
nnzL | host | input | the number of non-zero elements of matrix L. |
csrRowPtrL | device | input | the array of offsets corresponding to the start of each row in the arrays csrColIndL and csrValL. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix L. The array size is n+1. |
csrColIndL | device | input | the array of column indices corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
csrValL | device | input | the array of values corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
nnzU | host | input | the number of non-zero elements of matrix U. |
csrRowPtrU | device | input | the array of offsets corresponding to the start of each row in the arrays csrColIndU and csrValU. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix U. The array size is n+1. |
csrColIndU | device | input | the array of column indices corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
csrValU | device | input | the array of values corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
P | device | input | the left permutation (often associated with pivoting). The array size is n. |
Q | device | input | the right permutation (often associated with reordering). The array size is n. |
handle | host | output | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverRfSetupHost()
cusolverStatus_t cusolverRfSetupHost(/* Input (in the host memory) */ int n, int nnzA, int* h_csrRowPtrA, int* h_csrColIndA, double* h_csrValA, int nnzL, int* h_csrRowPtrL, int* h_csrColIndL, double* h_csrValL, int nnzU, int* h_csrRowPtrU, int* h_csrColIndU, double* h_csrValU, int* h_P, int* h_Q, /* Output */ cusolverRfHandle_t handle);
This routine assembles the internal data structures of the cuSolverRF library. It is often the first routine to be called after the call to the cusolverRfCreate() routine.
This routine accepts as input (on the host) the original matrix A, the lower (L) and upper (U) triangular factors, as well as the left (P) and the right (Q) permutations resulting from the full LU factorization of the first (i=1) linear system A_i x_i = f_i.
The permutations P and Q represent the final composition of all the left and right reorderings applied to the original matrix A, respectively. However, these permutations are often associated with partial pivoting and reordering to minimize fill-in, respectively.
This routine needs to be called only once for a single linear system A_i x_i = f_i.
parameter | MemSpace | In/out | Meaning |
n | host | input | the number of rows (and columns) of matrix A. |
nnzA | host | input | the number of non-zero elements of matrix A. |
h_csrRowPtrA | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndA and h_csrValA. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. The array size is n+1. |
h_csrColIndA | host | input | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
h_csrValA | host | input | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
nnzL | host | input | the number of non-zero elements of matrix L. |
h_csrRowPtrL | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndL and h_csrValL. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix L. The array size is n+1. |
h_csrColIndL | host | input | the array of column indices corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
h_csrValL | host | input | the array of values corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
nnzU | host | input | the number of non-zero elements of matrix U. |
h_csrRowPtrU | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndU and h_csrValU. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix U. The array size is n+1. |
h_csrColIndU | host | input | the array of column indices corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
h_csrValU | host | input | the array of values corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
h_P | host | input | the left permutation (often associated with pivoting). The array size is n. |
h_Q | host | input | the right permutation (often associated with reordering). The array size is n. |
handle | host | output | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
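Since h_P and h_Q each represent the composition of all reorderings applied on one side, it can help to see how two permutation arrays combine into one. The sketch below assumes a gather convention, y[i] = x[p[i]], which is an illustrative choice and not necessarily the convention cuSolverRF uses internally; `perm_apply` and `perm_compose` are hypothetical helpers.

```c
#include <assert.h>

/* Hypothetical helpers (not cuSolverRF routines).
 * Gather convention assumed here: applying p to x yields y[i] = x[p[i]]. */
void perm_apply(int n, const int *p, const double *x, double *y)
{
    assert(n > 0 && p != 0 && x != 0 && y != 0);
    for (int i = 0; i < n; ++i)
        y[i] = x[p[i]];
}

/* Builds the single permutation equivalent to applying `first`
 * and then `second`: out[i] = first[second[i]]. */
void perm_compose(int n, const int *first, const int *second, int *out)
{
    assert(n > 0 && first != 0 && second != 0 && out != 0);
    for (int i = 0; i < n; ++i)
        out[i] = first[second[i]];
}
```

Under this convention, composing a pivoting permutation with a fill-reducing reordering yields one array of size n, which matches the shape of the P and Q arguments.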
cusolverRfCreate()
cusolverStatus_t cusolverRfCreate(cusolverRfHandle_t *handle);
This routine initializes the cuSolverRF library. It allocates required resources and must be called prior to any other cuSolverRF library routine.
parameter | MemSpace | In/out | Meaning |
handle | host | output | the pointer to the cuSolverRF library handle. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverRfExtractBundledFactorsHost()
cusolverStatus_t cusolverRfExtractBundledFactorsHost(/* Input */ cusolverRfHandle_t handle, /* Output (in the host memory) */ int* h_nnzM, int** h_Mp, int** h_Mi, double** h_Mx);
This routine extracts lower (L) and upper (U) triangular factors from the cuSolverRF library handle into the host memory. The factors are compressed into a single matrix M=(L-I)+U, where the unit diagonal of L is not stored. It is assumed that a prior call to cusolverRfRefactor() was made in order to generate these triangular factors.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
h_nnzM | host | output | the number of non-zero elements of matrix M. |
h_Mp | host | output | the array of offsets corresponding to the start of each row in the arrays h_Mi and h_Mx. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix M. The array size is n+1. |
h_Mi | host | output | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzM. |
h_Mx | host | output | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzM. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
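The bundled layout M=(L-I)+U can be unpacked on the host into separate CSR factors. The following sketch assumes zero-based indices and column indices sorted within each row, as the table above states for the extracted arrays; `bundled_count` and `bundled_split` are hypothetical helpers, and the caller is assumed to allocate the output arrays after counting.

```c
#include <assert.h>

/* Hypothetical helpers (not cuSolverRF routines).
 * Counts the entries of L (unit diagonal restored) and of U inside M. */
void bundled_count(int n, const int *Mp, const int *Mi, int *nnzL, int *nnzU)
{
    assert(n > 0 && Mp != 0 && Mi != 0 && nnzL != 0 && nnzU != 0);
    *nnzL = n;                       /* one restored diagonal entry per row */
    *nnzU = 0;
    for (int i = 0; i < n; ++i)
        for (int k = Mp[i]; k < Mp[i + 1]; ++k)
            if (Mi[k] < i) ++*nnzL; else ++*nnzU;
}

/* Splits M into L (with explicit unit diagonal) and U, both in CSR. */
void bundled_split(int n, const int *Mp, const int *Mi, const double *Mx,
                   int *Lp, int *Li, double *Lx,
                   int *Up, int *Ui, double *Ux)
{
    int l = 0, u = 0;
    for (int i = 0; i < n; ++i) {
        Lp[i] = l;
        Up[i] = u;
        for (int k = Mp[i]; k < Mp[i + 1]; ++k) {
            if (Mi[k] < i) { Li[l] = Mi[k]; Lx[l++] = Mx[k]; }  /* strictly lower */
            else           { Ui[u] = Mi[k]; Ux[u++] = Mx[k]; }  /* diagonal and above */
        }
        Li[l] = i;                   /* unit diagonal that M does not store */
        Lx[l++] = 1.0;
    }
    Lp[n] = l;
    Up[n] = u;
}
```

cusolverRfExtractSplitFactorsHost() returns the factors already separated, so a helper like this is only of interest when the bundled form has been obtained.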
cusolverRfExtractSplitFactorsHost()
cusolverStatus_t cusolverRfExtractSplitFactorsHost(/* Input */ cusolverRfHandle_t handle, /* Output (in the host memory) */ int* h_nnzL, int** h_Lp, int** h_Li, double** h_Lx, int* h_nnzU, int** h_Up, int** h_Ui, double** h_Ux);
This routine extracts lower (L) and upper (U) triangular factors from the cuSolverRF library handle into the host memory. It is assumed that a prior call to the cusolverRfRefactor() was done in order to generate these triangular factors.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
h_nnzL | host | output | the number of non-zero elements of matrix L. |
h_Lp | host | output | the array of offsets corresponding to the start of each row in the arrays h_Li and h_Lx. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix L. The array size is n+1. |
h_Li | host | output | the array of column indices corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzL. |
h_Lx | host | output | the array of values corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzL. |
h_nnzU | host | output | the number of non-zero elements of matrix U. |
h_Up | host | output | the array of offsets corresponding to the start of each row in the arrays h_Ui and h_Ux. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix U. The array size is n+1. |
h_Ui | host | output | the array of column indices corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzU. |
h_Ux | host | output | the array of values corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is h_nnzU. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
cusolverRfDestroy()
cusolverStatus_t cusolverRfDestroy(cusolverRfHandle_t handle);
This routine shuts down the cuSolverRF library. It releases acquired resources and must be called after all the cuSolverRF library routines.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the cuSolverRF library handle. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfGetMatrixFormat()
cusolverStatus_t cusolverRfGetMatrixFormat(cusolverRfHandle_t handle, cusolverRfMatrixFormat_t *format, cusolverRfUnitDiagonal_t *diag);
This routine gets the matrix format used in the cusolverRfSetupDevice(), cusolverRfSetupHost(), cusolverRfResetValues(), cusolverRfExtractBundledFactorsHost() and cusolverRfExtractSplitFactorsHost() routines.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
format | host | output | the enumerated matrix format type. |
diag | host | output | the enumerated unit diagonal type. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfGetNumericProperties()
cusolverStatus_t cusolverRfGetNumericProperties(cusolverRfHandle_t handle, double *zero, double *boost);
This routine gets the numeric values used for checking for a "zero" pivot and for boosting it in the cusolverRfRefactor() and cusolverRfSolve() routines. The numeric boosting will be used only if boost > 0.0.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
zero | host | output | the value below which zero pivot is flagged. |
boost | host | output | the value which is substituted for zero pivot (if the latter is flagged). |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfGetNumericBoostReport()
cusolverStatus_t cusolverRfGetNumericBoostReport(cusolverRfHandle_t handle, cusolverRfNumericBoostReport_t *report);
This routine gets the report whether numeric boosting was used in the cusolverRfRefactor() and cusolverRfSolve() routines.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
report | host | output | the enumerated boosting report type. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfGetResetValuesFastMode()
cusolverStatus_t cusolverRfGetResetValuesFastMode(cusolverRfHandle_t handle, cusolverRfResetValuesFastMode_t *fastMode);
This routine gets the mode used in the cusolverRfResetValues routine.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
fastMode | host | output | the enumerated mode type. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfGetAlgs()
cusolverStatus_t cusolverRfGetAlgs(cusolverRfHandle_t handle, cusolverRfFactorization_t* fact_alg, cusolverRfTriangularSolve_t* solve_alg);
This routine gets the algorithm used for the refactorization in cusolverRfRefactor() and the triangular solve in cusolverRfSolve().
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
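fact_alg | host | output | the enumerated algorithm type used for the refactorization. |
solve_alg | host | output | the enumerated algorithm type used for the triangular solve. |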
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfRefactor()
cusolverStatus_t cusolverRfRefactor(cusolverRfHandle_t handle);
This routine performs the LU re-factorization A = L*U, exploiting the available parallelism on the GPU. It is assumed that a prior call to cusolverRfAnalyze() was made in order to find the available parallelism.
This routine may be called multiple times, once for each of the linear systems A_i x_i = f_i.
parameter | MemSpace | In/out | Meaning |
handle | host | in/out | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_ZERO_PIVOT | a zero pivot was encountered during the computation. |
cusolverRfResetValues()
cusolverStatus_t cusolverRfResetValues(/* Input (in the device memory) */ int n, int nnzA, int* csrRowPtrA, int* csrColIndA, double* csrValA, int* P, int* Q, /* Output */ cusolverRfHandle_t handle);
This routine updates internal data structures with the values of the new coefficient matrix. It is assumed that the arrays csrRowPtrA, csrColIndA, P and Q have not changed since the last call to the cusolverRfSetup[Host|Device] routine. This assumption reflects the fact that the sparsity pattern of the coefficient matrices, as well as the reordering to minimize fill-in and the pivoting, remain the same in the set of linear systems A_i x_i = f_i.
This routine may be called multiple times, once for each of the linear systems A_i x_i = f_i.
parameter | MemSpace | In/out | Meaning |
n | host | input | the number of rows (and columns) of matrix A. |
nnzA | host | input | the number of non-zero elements of matrix A. |
csrRowPtrA | device | input | the array of offsets corresponding to the start of each row in the arrays csrColIndA and csrValA. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. The array size is n+1. |
csrColIndA | device | input | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
csrValA | device | input | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
P | device | input | the left permutation (often associated with pivoting). The array size is n. |
Q | device | input | the right permutation (often associated with reordering). The array size is n. |
handle | host | output | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
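Because cusolverRfResetValues() assumes that the sparsity pattern is unchanged, an application that assembles each new matrix on the host may want to confirm this before copying the values to the device. The sketch below compares two host-side CSR patterns; `csr_same_pattern` is a hypothetical helper, and the check is an optional precaution rather than something the library requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side helper (not a cuSolverRF routine).
 * Returns 1 if two CSR matrices with n rows share the same row offsets
 * and column indices, 0 otherwise. Values are not compared. */
int csr_same_pattern(int n,
                     const int *rowPtrA, const int *colIndA,
                     const int *rowPtrB, const int *colIndB)
{
    assert(n > 0 && rowPtrA != NULL && colIndA != NULL);
    assert(rowPtrB != NULL && colIndB != NULL);
    /* the n+1 row offsets must match exactly */
    if (memcmp(rowPtrA, rowPtrB, (size_t)(n + 1) * sizeof(int)) != 0)
        return 0;
    /* the last offset minus the first gives nnz for either index base */
    int nnz = rowPtrA[n] - rowPtrA[0];
    return memcmp(colIndA, colIndB, (size_t)nnz * sizeof(int)) == 0;
}
```

If the patterns differ, the new system falls outside the refactorization assumption and a full factorization followed by a fresh setup would be needed instead.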
cusolverRfSetMatrixFormat()
cusolverStatus_t cusolverRfSetMatrixFormat(cusolverRfHandle_t handle, cusolverRfMatrixFormat_t format, cusolverRfUnitDiagonal_t diag);
This routine sets the matrix format used in the cusolverRfSetupDevice(), cusolverRfSetupHost(), cusolverRfResetValues(), cusolverRfExtractBundledFactorsHost() and cusolverRfExtractSplitFactorsHost() routines. It may be called once prior to cusolverRfSetupDevice() and cusolverRfSetupHost() routines.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
format | host | input | the enumerated matrix format type. |
diag | host | input | the enumerated unit diagonal type. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an enumerated mode parameter is wrong. |
cusolverRfSetNumericProperties()
cusolverStatus_t cusolverRfSetNumericProperties(cusolverRfHandle_t handle, double zero, double boost);
This routine sets the numeric values used for checking for a "zero" pivot and for boosting it in the cusolverRfRefactor() and cusolverRfSolve() routines. It may be called multiple times prior to the cusolverRfRefactor() and cusolverRfSolve() routines. The numeric boosting will be used only if boost > 0.0.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
zero | host | input | the value below which zero pivot is flagged. |
boost | host | input | the value which is substituted for zero pivot (if the latter is flagged). |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
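The interaction of zero and boost can be summarized in a few lines. This is a sketch of the rule as described above, not the library's implementation: it assumes the test is made on the magnitude of the pivot and that boost replaces the pivot as-is; `pivot_with_boost` is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical helper (not a cuSolverRF routine).
 * Flags a pivot whose magnitude is below `zero`, and substitutes `boost`
 * for it only when boost > 0.0. Returns the pivot value to use. */
double pivot_with_boost(double pivot, double zero, double boost,
                        int *flagged, int *boosted)
{
    assert(flagged != 0 && boosted != 0);
    double mag = pivot < 0.0 ? -pivot : pivot;   /* assumed: magnitude test */
    *flagged = (mag < zero);
    *boosted = (*flagged && boost > 0.0);        /* boosting needs boost > 0 */
    return *boosted ? boost : pivot;
}
```

With boost set to 0.0 a tiny pivot is still flagged but left unchanged, which corresponds to the case where numeric boosting is disabled.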
cusolverRfSetResetValuesFastMode()
cusolverStatus_t cusolverRfSetResetValuesFastMode(cusolverRfHandle_t handle, cusolverRfResetValuesFastMode_t fastMode);
This routine sets the mode used in the cusolverRfResetValues routine. The fast mode requires extra memory and is recommended only if very fast calls to cusolverRfResetValues() are needed. It may be called once prior to cusolverRfAnalyze() routine.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
fastMode | host | input | the enumerated mode type. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an enumerated mode parameter is wrong. |
cusolverRfSetAlgs()
cusolverStatus_t cusolverRfSetAlgs(cusolverRfHandle_t handle, cusolverRfFactorization_t fact_alg, cusolverRfTriangularSolve_t alg);
This routine sets the algorithm used for the refactorization in cusolverRfRefactor() and the triangular solve in cusolverRfSolve(). It may be called once prior to cusolverRfAnalyze() routine.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
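fact_alg | host | input | the enumerated algorithm type used for the refactorization. |
alg | host | input | the enumerated algorithm type used for the triangular solve. |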
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
cusolverRfSolve()
cusolverStatus_t cusolverRfSolve(/* Input (in the device memory) */ cusolverRfHandle_t handle, int *P, int *Q, int nrhs, double *Temp, int ldt, /* Input/Output (in the device memory) */ double *XF, /* Input */ int ldxf);
This routine performs the forward and backward solve with the lower and upper triangular factors L and U resulting from the LU re-factorization
$A \approx P^{T} L U Q^{T}$
which is assumed to have been computed by a prior call to the cusolverRfRefactor() routine.
The routine can solve linear systems with multiple right-hand-sides (rhs),
$A X = (P^{T} L U Q^{T}) X = F$, where $X \in R^{n \times nrhs}$ and $F \in R^{n \times nrhs}$,
even though currently only a single rhs is supported.
This routine may be called multiple times, once for each of the linear systems
$A_i x_i = f_i$
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
P | device | input | the left permutation (often associated with pivoting). The array size is n. |
Q | device | input | the right permutation (often associated with reordering). The array size is n. |
nrhs | host | input | the number of right-hand-sides to be solved. |
Temp | device | input | the dense matrix that contains temporary workspace (of size ldt*nrhs). |
ldt | host | input | the leading dimension of dense matrix Temp (ldt >= n). |
XF | device | in/out | the dense matrix that contains the right-hand-sides F and solutions X (of size ldxf*nrhs). |
ldxf | host | input | the leading dimension of dense matrix XF (ldxf >= n). |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
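The sizes ldt*nrhs and ldxf*nrhs follow from storing a dense matrix column by column with a leading dimension. The sketch below assumes column-major storage, as in LAPACK; the helper names dense_elems() and dense_at() are illustrative only and are not part of cuSolverRF.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements to allocate for an n-by-nrhs dense matrix stored
 * column-major with leading dimension ld (ld >= n). */
size_t dense_elems(int n, int ld, int nrhs)
{
    assert(n >= 0 && nrhs >= 0 && ld >= n);
    return (size_t)ld * (size_t)nrhs;
}

/* Linear offset of entry (row, col), both 0-based, in that layout. */
size_t dense_at(int row, int col, int ld)
{
    assert(row >= 0 && col >= 0 && row < ld);
    return (size_t)row + (size_t)col * (size_t)ld;
}
```

When the leading dimension equals n, the columns are packed without padding, so a single right-hand side occupies exactly n elements.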
cusolverRfBatchSetupHost()
cusolverStatus_t cusolverRfBatchSetupHost(/* Input (in the host memory) */ int batchSize, int n, int nnzA, int* h_csrRowPtrA, int* h_csrColIndA, double *h_csrValA_array[], int nnzL, int* h_csrRowPtrL, int* h_csrColIndL, double *h_csrValL, int nnzU, int* h_csrRowPtrU, int* h_csrColIndU, double *h_csrValU, int* h_P, int* h_Q, /* Output */ cusolverRfHandle_t handle);
This routine assembles the internal data structures of the cuSolverRF library for batched operation. It is called after the call to the cusolverRfCreate() routine, and before any other batched routines.
The batched operation assumes that the user has the following linear systems
$A_j x_j = b_j, \quad j = 1, 2, \ldots, batchSize$
where the matrices in the set share the same sparsity pattern and are similar enough that the factorization can be done with the same permutations P and Q. In other words, $A_j$, $j > 1$, is a small perturbation of $A_1$.
This routine accepts as input (on the host) the original matrix A (sparsity pattern and batched values), the lower (L) and upper (U) triangular factors, as well as the left (P) and the right (Q) permutations resulting from the full LU factorization of the first (i=1) linear system
$A_i x_i = f_i$
The permutations P and Q represent the final composition of all the left and right reorderings applied to the original matrix A, respectively. However, these permutations are often associated with partial pivoting and reordering to minimize fill-in, respectively.
Remark 1: the matrices A, L and U must be in CSR format and base-0.
Remark 2: for best performance, batchSize should be a multiple of 32 and greater than or equal to 32. The algorithm is memory-bound: once the bandwidth limit is reached, a larger batchSize brings no further improvement. In practice, a batchSize of 32 to 128 is often enough to obtain good performance, but in some cases a larger batchSize might be beneficial.
This routine needs to be called only once for a single linear system
$A_i x_i = f_i$
parameter | MemSpace | In/out | Meaning |
batchSize | host | input | the number of matrices in the batched mode. |
n | host | input | the number of rows (and columns) of matrix A. |
nnzA | host | input | the number of non-zero elements of matrix A. |
h_csrRowPtrA | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndA and h_csrValA. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. The array size is n+1. |
h_csrColIndA | host | input | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
h_csrValA_array | host | input | array of pointers of size batchSize, each pointer points to the array of values corresponding to the non-zero elements in the matrix. |
nnzL | host | input | the number of non-zero elements of matrix L. |
h_csrRowPtrL | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndL and h_csrValL. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix L. The array size is n+1. |
h_csrColIndL | host | input | the array of column indices corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
h_csrValL | host | input | the array of values corresponding to the non-zero elements in the matrix L. It is assumed that this array is sorted by row and by column within each row. The array size is nnzL. |
nnzU | host | input | the number of non-zero elements of matrix U. |
h_csrRowPtrU | host | input | the array of offsets corresponding to the start of each row in the arrays h_csrColIndU and h_csrValU. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix U. The array size is n+1. |
h_csrColIndU | host | input | the array of column indices corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
h_csrValU | host | input | the array of values corresponding to the non-zero elements in the matrix U. It is assumed that this array is sorted by row and by column within each row. The array size is nnzU. |
h_P | host | input | the left permutation (often associated with pivoting). The array size is n. |
h_Q | host | input | the right permutation (often associated with reordering). The array size is n. |
handle | host | output | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
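The array of pointers h_csrValA_array can be filled from one contiguous host buffer that stores the values of each matrix one after another, and a matrix count can be rounded up to the next multiple of 32 as suggested in Remark 2. The sketch below is a minimal host-only illustration; the helper names build_value_pointers() and round_up_to_32() are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Fill val_array[j] with a pointer to the values of matrix j inside the
 * contiguous buffer val_batch (batchSize blocks of nnzA doubles each). */
void build_value_pointers(double *val_batch, int nnzA, int batchSize,
                          double **val_array)
{
    assert(val_batch != NULL && val_array != NULL);
    assert(nnzA > 0 && batchSize > 0);
    for (int j = 0; j < batchSize; j++) {
        val_array[j] = val_batch + (size_t)nnzA * (size_t)j;
    }
}

/* Smallest multiple of 32 that is greater than or equal to count. */
int round_up_to_32(int count)
{
    assert(count > 0);
    return ((count + 31) / 32) * 32;
}
```

Rounding the batch size up only helps if the extra slots hold valid matrices with the same sparsity pattern, so whether padding is worthwhile depends on the application.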
cusolverRfBatchAnalyze()
cusolverStatus_t cusolverRfBatchAnalyze(cusolverRfHandle_t handle);
This routine performs the appropriate analysis of parallelism available in the batched LU re-factorization.
It is assumed that a prior call to the cusolverRfBatchSetup[Host]() was done in order to create internal data structures needed for the analysis.
This routine needs to be called only once for a single linear system
$A_i x_i = f_i$
parameter | MemSpace | In/out | Meaning |
handle | host | in/out | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_ALLOC_FAILED | an allocation of memory failed. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
cusolverRfBatchResetValues()
cusolverStatus_t cusolverRfBatchResetValues(/* Input (in the device memory) */ int batchSize, int n, int nnzA, int* csrRowPtrA, int* csrColIndA, double* csrValA_array[], int *P, int *Q, /* Output */ cusolverRfHandle_t handle);
This routine updates internal data structures with the values of the new coefficient matrix. It is assumed that the arrays csrRowPtrA, csrColIndA, P and Q have not changed since the last call to the cusolverRfBatchSetupHost() routine.
This assumption reflects the fact that the sparsity pattern of the coefficient matrices, as well as the reordering to minimize fill-in and the pivoting, remain the same in the set of linear systems
$A_j x_j = b_j, \quad j = 1, 2, \ldots, batchSize$
The input parameter csrValA_array is an array of pointers in device memory. csrValA_array(j) points to the matrix $A_j$, which is also in device memory.
parameter | MemSpace | In/out | Meaning |
batchSize | host | input | the number of matrices in batched mode. |
n | host | input | the number of rows (and columns) of matrix A. |
nnzA | host | input | the number of non-zero elements of matrix A. |
csrRowPtrA | device | input | the array of offsets corresponding to the start of each row in the arrays csrColIndA and csrValA. This array has also an extra entry at the end that stores the number of non-zero elements in the matrix. The array size is n+1. |
csrColIndA | device | input | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. The array size is nnzA. |
csrValA_array | device | input | array of pointers of size batchSize, each pointer points to the array of values corresponding to the non-zero elements in the matrix. |
P | device | input | the left permutation (often associated with pivoting). The array size is n. |
Q | device | input | the right permutation (often associated with reordering). The array size is n. |
handle | host | output | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
cusolverRfBatchRefactor()
cusolverStatus_t cusolverRfBatchRefactor(cusolverRfHandle_t handle);
This routine performs the LU re-factorization
$M_j = P A_j Q = L_j U_j$
exploiting the available parallelism on the GPU. It is assumed that a prior call to cusolverRfBatchAnalyze() was done in order to find the available parallelism.
Remark: cusolverRfBatchRefactor() does not report any failure of the LU refactorization. The user has to call cusolverRfBatchZeroPivot() to find out which matrix failed the LU refactorization.
parameter | MemSpace | In/out | Meaning |
handle | host | in/out | the handle to the cuSolverRF library. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
cusolverRfBatchSolve()
cusolverStatus_t cusolverRfBatchSolve(/* Input (in the device memory) */ cusolverRfHandle_t handle, int *P, int *Q, int nrhs, double *Temp, int ldt, /* Input/Output (in the device memory) */ double *XF_array[], /* Input */ int ldxf);
To solve $A_j x_j = b_j$, first we reformulate the equation as $M_j Q^{T} x_j = P b_j$, where $M_j = P A_j Q$. Then the refactorization $M_j = L_j U_j$ is done by cusolverRfBatchRefactor(). Finally, cusolverRfBatchSolve() takes over the remaining steps, including:
$z_j = P b_j$
$M_j y_j = z_j$
$x_j = Q y_j$
The input parameter XF_array is an array of pointers in device memory. XF_array(j) points to the matrix $x_j$, which is also in device memory.
Remark 1: only a single rhs is supported.
Remark 2: no singularity is reported during the backward solve. If some matrix failed the refactorization and has a zero diagonal entry, the backward solve computes NaN. The user has to call cusolverRfBatchZeroPivot() to check whether the refactorization was successful.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
P | device | input | the left permutation (often associated with pivoting). The array size is n. |
Q | device | input | the right permutation (often associated with reordering). The array size is n. |
nrhs | host | input | the number of right-hand-sides to be solved. |
Temp | device | input | the dense matrix that contains temporary workspace (of size ldt*nrhs). |
ldt | host | input | the leading dimension of dense matrix Temp (ldt >= n). |
XF_array | device | in/out | array of pointers of size batchSize, each pointer points to the dense matrix that contains the right-hand-sides F and solutions X (of size ldxf*nrhs). |
ldxf | host | input | the leading dimension of dense matrix XF (ldxf >= n). |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | an unsupported value or parameter was passed. |
CUSOLVER_STATUS_EXECUTION_FAILED | a kernel failed to launch on the GPU. |
CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
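The sequence "permute the right-hand side, solve with the triangular factors, permute the result" can be illustrated on the host with small dense matrices. The sketch below is only an illustration of the arithmetic, not the library's GPU code. The function name lu_solve_sketch() is hypothetical, L and U are assumed dense and row-major with M = L*U, and the permutation arrays are assumed to act as (P*b)[i] = b[P[i]] and (Q*y)[Q[i]] = y[i]. Unlike the library, the sketch returns an error on a zero diagonal instead of producing NaN.

```c
#include <assert.h>
#include <stdlib.h>

/* Solve A*x = b given M = P*A*Q = L*U (dense, row-major, n-by-n).
 * Returns 0 on success, -1 on allocation failure or zero diagonal. */
int lu_solve_sketch(int n, const int *P, const int *Q,
                    const double *L, const double *U,
                    const double *b, double *x)
{
    assert(n > 0 && P && Q && L && U && b && x);
    double *y = (double *)malloc((size_t)n * sizeof(double));
    if (y == NULL) return -1;

    /* z = P*b (assumed convention: z[i] = b[P[i]]) */
    for (int i = 0; i < n; i++) y[i] = b[P[i]];

    /* forward solve with L, in place */
    for (int i = 0; i < n; i++) {
        double s = y[i];
        for (int k = 0; k < i; k++) s -= L[i * n + k] * y[k];
        if (L[i * n + i] == 0.0) { free(y); return -1; }
        y[i] = s / L[i * n + i];
    }
    /* backward solve with U, in place */
    for (int i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (int k = i + 1; k < n; k++) s -= U[i * n + k] * y[k];
        if (U[i * n + i] == 0.0) { free(y); return -1; }
        y[i] = s / U[i * n + i];
    }
    /* x = Q*y (assumed convention: x[Q[i]] = y[i]) */
    for (int i = 0; i < n; i++) x[Q[i]] = y[i];

    free(y);
    return 0;
}
```

The permutation conventions are the part most likely to differ between solvers, so they should be checked against the host solver that produced P and Q before relying on a sketch like this for validation.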
cusolverRfBatchZeroPivot()
cusolverStatus_t cusolverRfBatchZeroPivot(/* Input */ cusolverRfHandle_t handle, /* Output (in the host memory) */ int *position);
Although the matrices $A_j$ are close to each other, this does not mean that the factorization $M_j = P A_j Q = L_j U_j$ exists for every j. The user can query which matrix failed the LU refactorization by checking the corresponding value in the position array. The output parameter position is an integer array of size batchSize.
The j-th component denotes the refactorization result of matrix $A_j$. If position(j) is -1, the LU refactorization of matrix $A_j$ is successful. If position(j) is k >= 0, matrix $A_j$ is not LU factorizable and its entry $U_j(k,k)$ is zero.
The return value of cusolverRfBatchZeroPivot() is CUSOLVER_STATUS_ZERO_PIVOT if there exists at least one $A_j$ which failed the LU refactorization. The user can redo the LU factorization to get new permutations P and Q if the error code CUSOLVER_STATUS_ZERO_PIVOT is returned.
parameter | MemSpace | In/out | Meaning |
handle | host | input | the handle to the cuSolverRF library. |
position | host | output | integer array of size batchSize. The value of position(j) reports singularity of matrix Aj, -1 if no structural/numerical zero, k >= 0 if Aj(k,k) is either structural zero or numerical zero. |
CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
CUSOLVER_STATUS_ZERO_PIVOT | a zero pivot was encountered during the computation. |
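Once the position array has been filled, it can be scanned on the host to find which matrices need a fresh LU factorization. The helper below is a minimal sketch; the name count_failed() is hypothetical and the function only interprets the values as described above (-1 for success, k >= 0 for a zero at (k,k)).

```c
#include <assert.h>
#include <stddef.h>

/* Count matrices whose refactorization failed, given the position array
 * reported for a batch. If first_failed is not NULL it receives the index
 * j of the first failure, or -1 if there is none. */
int count_failed(const int *position, int batchSize, int *first_failed)
{
    assert(position != NULL && batchSize >= 0);
    int failed = 0;
    int first = -1;
    for (int j = 0; j < batchSize; j++) {
        if (position[j] >= 0) {          /* k >= 0: zero found at (k,k) */
            if (first < 0) first = j;
            failed++;
        }
    }
    if (first_failed != NULL) *first_failed = first;
    return failed;
}
```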
A. cuSolverRF Examples
A.1. cuSolverRF In-memory Example
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cusolverRf.h"

#define TEST_PASSED 0
#define TEST_FAILED 1

int main (void){
    /* matrix A */
    int n;
    int nnzA;
    int *Ap=NULL;
    int *Ai=NULL;
    double *Ax=NULL;
    int *d_Ap=NULL;
    int *d_Ai=NULL;
    double *d_rAx=NULL;
    /* matrices L and U */
    int nnzL, nnzU;
    int *Lp=NULL;
    int *Li=NULL;
    double* Lx=NULL;
    int *Up=NULL;
    int *Ui=NULL;
    double* Ux=NULL;
    /* reordering matrices */
    int *P=NULL;
    int *Q=NULL;
    int * d_P=NULL;
    int * d_Q=NULL;
    /* solution and rhs */
    int nrhs; //# of rhs for each system (currently only =1 is supported)
    double *d_X=NULL;
    double *d_T=NULL;
    /* cuda */
    cudaError_t cudaStatus;
    /* cusolverRf */
    cusolverRfHandle_t gH=NULL;
    cusolverStatus_t status;
    /* host sparse direct solver */
    /* ... */
    /* other variables */
    int i, k; //loop index and number of linear systems
    int tnnzL, tnnzU;
    int *tLp=NULL;
    int *tLi=NULL;
    double *tLx=NULL;
    int *tUp=NULL;
    int *tUi=NULL;
    double *tUx=NULL;
    double t1, t2;

    /* ASSUMPTION: recall that we are solving a set of linear systems
       A_{i} x_{i} = f_{i} for i=0,...,k-1
       where the sparsity pattern of the coefficient matrices A_{i}
       as well as the reordering to minimize fill-in and the pivoting
       used during the LU factorization remain the same. */

    /* Step 1: solve the first linear system (i=0) on the host,
               using host sparse direct solver, which involves
               full LU factorization and solve. */
    /* ... */

    /* Step 2: interface to the library by extracting the following
               information from the first solve:
               a) triangular factors L and U
               b) pivoting and reordering permutations P and Q
               c) also, allocate all the necessary memory */
    /* ... */

    /* Step 3: use the library to solve subsequent (i=1,...,k-1) linear systems
       a) the library setup (called only once) */
    //create handle
    status = cusolverRfCreate(&gH);
    if (status != CUSOLVER_STATUS_SUCCESS){
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }

    //set fast mode
    status = cusolverRfSetResetValuesFastMode(gH,GLU_RESET_VALUES_FAST_MODE_ON);
    if (status != CUSOLVER_STATUS_SUCCESS){
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }

    //assemble internal data structures (you should use the coefficient matrix A
    //corresponding to the second (i=1) linear system in this call)
    t1 = cusolver_test_seconds();
    status = cusolverRfSetupHost(n, nnzA, Ap, Ai, Ax,
                                 nnzL, Lp, Li, Lx,
                                 nnzU, Up, Ui, Ux,
                                 P, Q, gH);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfSetupHost time = %f (s)\n", t2-t1);

    //analyze available parallelism
    t1 = cusolver_test_seconds();
    status = cusolverRfAnalyze(gH);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfAnalyze time = %f (s)\n", t2-t1);

    /* b) The library subsequent (i=1,...,k-1) LU re-factorization
          and solve (called multiple times). */
    for (i=1; i<k; i++){
        //LU re-factorization
        t1 = cusolver_test_seconds();
        status = cusolverRfRefactor(gH);
        cudaStatus = cudaDeviceSynchronize();
        t2 = cusolver_test_seconds();
        if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
            printf ("[cusolverRf status %d]\n",status);
            return TEST_FAILED;
        }
        printf("cusolverRfRefactor time = %f (s)\n", t2-t1);

        //forward and backward solve
        t1 = cusolver_test_seconds();
        status = cusolverRfSolve(gH, d_P, d_Q, nrhs, d_T, n, d_X, n);
        cudaStatus = cudaDeviceSynchronize();
        t2 = cusolver_test_seconds();
        if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
            printf ("[cusolverRf status %d]\n",status);
            return TEST_FAILED;
        }
        printf("cusolverRfSolve time = %f (s)\n", t2-t1);

        // extract the factors (if needed)
        status = cusolverRfExtractSplitFactorsHost(gH, &tnnzL, &tLp, &tLi, &tLx,
                                                   &tnnzU, &tUp, &tUi, &tUx);
        if(status != CUSOLVER_STATUS_SUCCESS){
            printf ("[cusolverRf status %d]\n",status);
            return TEST_FAILED;
        }
        /*
        //print
        int row, j;
        printf("printing L\n");
        for (row=0; row<n; row++){
            for (j=tLp[row]; j<tLp[row+1]; j++){
                printf("%d,%d,%f\n",row,tLi[j],tLx[j]);
            }
        }
        printf("printing U\n");
        for (row=0; row<n; row++){
            for (j=tUp[row]; j<tUp[row+1]; j++){
                printf("%d,%d,%f\n",row,tUi[j],tUx[j]);
            }
        }
        */

        /* perform any other operations based on the solution */
        /* ... */

        /* check if done */
        /* ... */

        /* proceed to solve the next linear system */
        // update the coefficient matrix using reset values
        // (assuming that the new linear system, in other words,
        // new values are already on the GPU in the array d_rAx)
        t1 = cusolver_test_seconds();
        status = cusolverRfResetValues(n,nnzA,d_Ap,d_Ai,d_rAx,d_P,d_Q,gH);
        cudaStatus = cudaDeviceSynchronize();
        t2 = cusolver_test_seconds();
        if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
            printf ("[cusolverRf status %d]\n",status);
            return TEST_FAILED;
        }
        printf("cusolverRfResetValues time = %f (s)\n", t2-t1);
    }

    /* free memory and exit */
    /* ... */
    return TEST_PASSED;
}
A.2. cuSolverRF-batch Example
This chapter provides an example in the C programming language of how to use the batched routines in the cuSolverRF library. We focus on solving the set of linear systems
$A_j x_j = f_j, \quad j = 0, \ldots, batchSize-1$
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cusolverRf.h"

#define TEST_PASSED 0
#define TEST_FAILED 1

int main (void){
    /* matrix A */
    int batchSize;
    int n;
    int nnzA;
    int *Ap=NULL;
    int *Ai=NULL;
    //array of pointers to the values of each matrix in the batch (of size
    //batchSize) on the host
    double **Ax_array=NULL;
    //For example, if Ax_batch is the array (of size batchSize*nnzA) containing
    //the values of each matrix in the batch written contiguously one matrix
    //after another on the host, then Ax_array[j] = &Ax_batch[nnzA*j];
    //for j=0,...,batchSize-1.
    double *Ax_batch=NULL;
    int *d_Ap=NULL;
    int *d_Ai=NULL;
    //array of pointers to the values of each matrix in the batch (of size
    //batchSize) on the device
    double **d_Ax_array=NULL;
    //For example, if d_Ax_batch is the array (of size batchSize*nnzA) containing
    //the values of each matrix in the batch written contiguously one matrix
    //after another on the device, then d_Ax_array[j] = &d_Ax_batch[nnzA*j];
    //for j=0,...,batchSize-1.
    double *d_Ax_batch=NULL;
    /* matrices L and U */
    int nnzL, nnzU;
    int *Lp=NULL;
    int *Li=NULL;
    double* Lx=NULL;
    int *Up=NULL;
    int *Ui=NULL;
    double* Ux=NULL;
    /* reordering matrices */
    int *P=NULL;
    int *Q=NULL;
    int *d_P=NULL;
    int *d_Q=NULL;
    /* solution and rhs */
    int nrhs; //# of rhs for each system (currently only =1 is supported)
    //temporary storage (of size 2*batchSize*n*nrhs)
    double *d_T=NULL;
    //array (of size batchSize*n*nrhs) containing the values of each rhs in
    //the batch written contiguously one rhs after another on the device
    double **d_X_array=NULL;
    //array (of size batchSize*n*nrhs) containing the values of each rhs in
    //the batch written contiguously one rhs after another on the host
    double **X_array=NULL;
    /* cuda */
    cudaError_t cudaStatus;
    /* cusolverRf */
    cusolverRfHandle_t gH=NULL;
    cusolverStatus_t status;
    /* host sparse direct solver */
    /* ... */
    /* other variables */
    double t1, t2;

    /* ASSUMPTION: recall that we are solving a batch of linear systems
       A_{j} x_{j} = f_{j} for j=0,...,batchSize-1
       where the sparsity pattern of the coefficient matrices A_{j}
       as well as the reordering to minimize fill-in and the pivoting
       used during the LU factorization remain the same. */

    /* Step 1: solve the first linear system (j=0) on the host,
               using host sparse direct solver, which involves
               full LU factorization and solve. */
    /* ... */

    /* Step 2: interface to the library by extracting the following
               information from the first solve:
               a) triangular factors L and U
               b) pivoting and reordering permutations P and Q
               c) also, allocate all the necessary memory */
    /* ... */

    /* Step 3: use the library to solve the remaining (j=1,...,batchSize-1)
               linear systems.
       a) the library setup (called only once) */
    //create handle
    status = cusolverRfCreate(&gH);
    if (status != CUSOLVER_STATUS_SUCCESS){
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }

    //assemble internal data structures
    t1 = cusolver_test_seconds();
    status = cusolverRfBatchSetupHost(batchSize, n, nnzA, Ap, Ai, Ax_array,
                                      nnzL, Lp, Li, Lx,
                                      nnzU, Up, Ui, Ux,
                                      P, Q, gH);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfBatchSetupHost time = %f (s)\n", t2-t1);

    //analyze available parallelism
    t1 = cusolver_test_seconds();
    status = cusolverRfBatchAnalyze(gH);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfBatchAnalyze time = %f (s)\n", t2-t1);

    /* b) The library subsequent (j=1,...,batchSize-1) LU re-factorization
          and solve (may be called multiple times). For the subsequent batches
          the values can be reset using the cusolverRfBatchResetValues() routine. */
    //LU re-factorization
    t1 = cusolver_test_seconds();
    status = cusolverRfBatchRefactor(gH);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfBatchRefactor time = %f (s)\n", t2-t1);

    //forward and backward solve
    t1 = cusolver_test_seconds();
    status = cusolverRfBatchSolve(gH, d_P, d_Q, nrhs, d_T, n, d_X_array, n);
    cudaStatus = cudaDeviceSynchronize();
    t2 = cusolver_test_seconds();
    if ((status != CUSOLVER_STATUS_SUCCESS) || (cudaStatus != cudaSuccess)) {
        printf ("[cusolverRf status %d]\n",status);
        return TEST_FAILED;
    }
    printf("cusolverRfBatchSolve time = %f (s)\n", t2-t1);

    /* free memory and exit */
    /* ... */
    return TEST_PASSED;
}
B. CSR QR Batch Examples
B.1. Batched Sparse QR example 1
This chapter provides a simple example in the C programming language of how to use batched sparse QR to solve a set of linear systems
$A_j x_j = b_j$
All matrices are small perturbations of
$A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0.1 & 0.1 & 0.1 & 4 \end{pmatrix}$
All right-hand side vectors are small perturbations of the Matlab vector 'ones(4,1)'.
We assume device memory is big enough to compute all matrices in one pass.
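The example stores A in CSR format with base-1 indices. As a cross-check of the csrRowPtrA and csrColIndA arrays used in the listing, the sketch below derives base-1 CSR arrays from a small dense row-major matrix on the host. The helper dense_to_csr_base1() is hypothetical and is not part of cuSolver or cuSPARSE.

```c
#include <assert.h>
#include <stddef.h>

/* Convert an m-by-n dense row-major matrix to CSR with base-1 indices.
 * rowPtr needs m+1 entries; colInd and val need room for all non-zeros.
 * Returns the number of non-zero elements. */
int dense_to_csr_base1(int m, int n, const double *dense,
                       int *rowPtr, int *colInd, double *val)
{
    assert(m > 0 && n > 0);
    assert(dense != NULL && rowPtr != NULL && colInd != NULL && val != NULL);
    int nnz = 0;
    for (int row = 0; row < m; row++) {
        rowPtr[row] = nnz + 1;              /* base-1 start of this row */
        for (int col = 0; col < n; col++) {
            double a = dense[row * n + col];
            if (a != 0.0) {
                colInd[nnz] = col + 1;      /* base-1 column index */
                val[nnz] = a;
                nnz++;
            }
        }
    }
    rowPtr[m] = nnz + 1;                    /* one past the last entry */
    return nnz;
}
```

For the 4-by-4 matrix A above this yields 7 non-zero elements, row pointers 1, 2, 3, 4, 8 and column indices 1, 2, 3, 1, 2, 3, 4.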
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cusolverSp.h>
#include <cuda_runtime_api.h>

int main(int argc, char*argv[])
{
    cusolverSpHandle_t cusolverH = NULL;
    // GPU does batch QR
    csrqrInfo_t info = NULL;
    cusparseMatDescr_t descrA = NULL;

    cusparseStatus_t cusparse_status = CUSPARSE_STATUS_SUCCESS;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    cudaError_t cudaStat1 = cudaSuccess;
    cudaError_t cudaStat2 = cudaSuccess;
    cudaError_t cudaStat3 = cudaSuccess;
    cudaError_t cudaStat4 = cudaSuccess;
    cudaError_t cudaStat5 = cudaSuccess;

    // GPU does batch QR
    // d_A is CSR format, d_csrValA is of size nnzA*batchSize
    // d_x is a matrix of size batchSize * m
    // d_b is a matrix of size batchSize * m
    int *d_csrRowPtrA = NULL;
    int *d_csrColIndA = NULL;
    double *d_csrValA = NULL;
    double *d_b = NULL; // batchSize * m
    double *d_x = NULL; // batchSize * m

    size_t size_qr = 0;
    size_t size_internal = 0;
    void *buffer_qr = NULL; // working space for numerical factorization

    /*      | 1                |
     *  A = |     2            |
     *      |         3        |
     *      | 0.1 0.1 0.1  4   |
     *  CSR of A is base-1
     *
     *  b = [1 1 1 1]
     */
    const int m = 4 ;
    const int nnzA = 7;
    const int csrRowPtrA[m+1] = { 1, 2, 3, 4, 8};
    const int csrColIndA[nnzA] = { 1, 2, 3, 1, 2, 3, 4};
    const double csrValA[nnzA] = { 1.0, 2.0, 3.0, 0.1, 0.1, 0.1, 4.0};
    const double b[m] = {1.0, 1.0, 1.0, 1.0};
    const int batchSize = 17;

    double *csrValABatch = (double*)malloc(sizeof(double)*nnzA*batchSize);
    double *bBatch = (double*)malloc(sizeof(double)*m*batchSize);
    double *xBatch = (double*)malloc(sizeof(double)*m*batchSize);
    assert( NULL != csrValABatch );
    assert( NULL != bBatch );
    assert( NULL != xBatch );

    // step 1: prepare Aj and bj on host
    // Aj is a small perturbation of A
    // bj is a small perturbation of b
    // csrValABatch = [A0, A1, A2, ...]
    // bBatch = [b0, b1, b2, ...]
    for(int colidx = 0 ; colidx < nnzA ; colidx++){
        double Areg = csrValA[colidx];
        for (int batchId = 0 ; batchId < batchSize ; batchId++){
            double eps = ((double)((rand() % 100) + 1)) * 1.e-4;
            csrValABatch[batchId*nnzA + colidx] = Areg + eps;
        }
    }

    for(int j = 0 ; j < m ; j++){
        double breg = b[j];
        for (int batchId = 0 ; batchId < batchSize ; batchId++){
            double eps = ((double)((rand() % 100) + 1)) * 1.e-4;
            bBatch[batchId*m + j] = breg + eps;
        }
    }

    // step 2: create cusolver handle, qr info and matrix descriptor
    cusolver_status = cusolverSpCreate(&cusolverH);
    assert (cusolver_status == CUSOLVER_STATUS_SUCCESS);

    cusparse_status = cusparseCreateMatDescr(&descrA);
    assert(cusparse_status == CUSPARSE_STATUS_SUCCESS);

    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE); // base-1

    cusolver_status = cusolverSpCreateCsrqrInfo(&info);
    assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

    // step 3: copy Aj and bj to device
    cudaStat1 = cudaMalloc ((void**)&d_csrValA , sizeof(double) * nnzA * batchSize);
    cudaStat2 = cudaMalloc ((void**)&d_csrColIndA, sizeof(int) * nnzA);
    cudaStat3 = cudaMalloc ((void**)&d_csrRowPtrA, sizeof(int) * (m+1));
    cudaStat4 = cudaMalloc ((void**)&d_b , sizeof(double) * m * batchSize);
    cudaStat5 = cudaMalloc ((void**)&d_x , sizeof(double) * m * batchSize);
    assert(cudaStat1 == cudaSuccess);
    assert(cudaStat2 == cudaSuccess);
    assert(cudaStat3 == cudaSuccess);
    assert(cudaStat4 == cudaSuccess);
    assert(cudaStat5 == cudaSuccess);

    cudaStat1 = cudaMemcpy(d_csrValA , csrValABatch, sizeof(double) * nnzA * batchSize, cudaMemcpyHostToDevice);
    cudaStat2 = cudaMemcpy(d_csrColIndA, csrColIndA, sizeof(int) * nnzA, cudaMemcpyHostToDevice);
    cudaStat3 = cudaMemcpy(d_csrRowPtrA, csrRowPtrA, sizeof(int) * (m+1), cudaMemcpyHostToDevice);
    cudaStat4 = cudaMemcpy(d_b, bBatch, sizeof(double) * m * batchSize, cudaMemcpyHostToDevice);
    assert(cudaStat1 == cudaSuccess);
    assert(cudaStat2 == cudaSuccess);
    assert(cudaStat3 == cudaSuccess);
    assert(cudaStat4 == cudaSuccess);

    // step 4: symbolic analysis
    cusolver_status = cusolverSpXcsrqrAnalysisBatched(
        cusolverH, m, m, nnzA,
        descrA, d_csrRowPtrA, d_csrColIndA,
        info);
    assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

    // step 5: prepare working space
    cusolver_status = cusolverSpDcsrqrBufferInfoBatched(
        cusolverH, m, m, nnzA,
        descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
        batchSize,
        info,
        &size_internal,
        &size_qr);
    assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

    printf("numerical factorization needs internal data %lld bytes\n",
        (long long)size_internal);
    printf("numerical factorization needs working space %lld bytes\n",
        (long long)size_qr);

    cudaStat1 = cudaMalloc((void**)&buffer_qr, size_qr);
    assert(cudaStat1 == cudaSuccess);

    // step 6: numerical factorization
    // assume device memory is big enough to compute all matrices.
    cusolver_status = cusolverSpDcsrqrsvBatched(
        cusolverH, m, m, nnzA,
        descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
        d_b, d_x,
        batchSize,
        info,
        buffer_qr);
    assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

    // step 7: check residual
    // xBatch = [x0, x1, x2, ...]
    cudaStat1 = cudaMemcpy(xBatch, d_x, sizeof(double)*m*batchSize, cudaMemcpyDeviceToHost);
    assert(cudaStat1 == cudaSuccess);

    const int baseA = (CUSPARSE_INDEX_BASE_ONE == cusparseGetMatIndexBase(descrA))? 1:0 ;

    for(int batchId = 0 ; batchId < batchSize; batchId++){
        // measure |bj - Aj*xj|
        double *csrValAj = csrValABatch + batchId * nnzA;
        double *xj = xBatch + batchId * m;
        double *bj = bBatch + batchId * m;
        // sup| bj - Aj*xj|
        double sup_res = 0;
        for(int row = 0 ; row < m ; row++){
            const int start = csrRowPtrA[row ] - baseA;
            const int end = csrRowPtrA[row+1] - baseA;
            double Ax = 0.0; // Aj(row,:)*xj
            for(int colidx = start ; colidx < end ; colidx++){
                const int col = csrColIndA[colidx] - baseA;
                const double Areg = csrValAj[colidx];
                const double xreg = xj[col];
                Ax = Ax + Areg * xreg;
            }
            double r = bj[row] - Ax;
            sup_res = (sup_res > fabs(r))? sup_res : fabs(r);
        }
        printf("batchId %d: sup|bj - Aj*xj| = %E \n", batchId, sup_res);
    }

    for(int batchId = 0 ; batchId < batchSize; batchId++){
        double *xj = xBatch + batchId * m;
        for(int row = 0 ; row < m ; row++){
            printf("x%d[%d] = %E\n", batchId, row, xj[row]);
        }
        printf("\n");
    }

    return 0;
}
B.2. Batched Sparse QR example 2
This is the same as example 1 (section B.1) except that we assume device memory is not large enough to hold all matrices at once, so we need to cut the 17 matrices into several chunks and compute each chunk by batched sparse QR.
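The chunking described above can be sketched on the host as follows. Given the total number of matrices and the largest batch that fits in device memory, the helper computes the starting index and size of each chunk. The name plan_chunks() is hypothetical; it only illustrates the index arithmetic and makes no cuSolver calls.

```c
#include <assert.h>
#include <stddef.h>

/* Split batchSize matrices into chunks of at most batchSizeMax.
 * offsets[c] and sizes[c] describe chunk c; both arrays need room for
 * (batchSize + batchSizeMax - 1) / batchSizeMax entries.
 * Returns the number of chunks. */
int plan_chunks(int batchSize, int batchSizeMax, int *offsets, int *sizes)
{
    assert(batchSize > 0 && batchSizeMax > 0);
    assert(offsets != NULL && sizes != NULL);
    int count = 0;
    for (int start = 0; start < batchSize; start += batchSizeMax) {
        int remaining = batchSize - start;
        offsets[count] = start;
        sizes[count] = (remaining < batchSizeMax) ? remaining : batchSizeMax;
        count++;
    }
    return count;
}
```

With the layout used in this example, the values of chunk c start at csrValABatch + offsets[c]*nnzA and its right-hand sides at bBatch + offsets[c]*m. For 17 matrices and a maximum batch of 8, the chunks have sizes 8, 8 and 1.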
#include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cusolverSp.h> #include <cuda_runtime_api.h> #define imin( x, y ) ((x)<(y))? (x) : (y) int main(int argc, char*argv[]) { cusolverSpHandle_t cusolverH = NULL; // GPU does batch QR csrqrInfo_t info = NULL; cusparseMatDescr_t descrA = NULL; cusparseStatus_t cusparse_status = CUSPARSE_STATUS_SUCCESS; cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; cudaError_t cudaStat5 = cudaSuccess; // GPU does batch QR // d_A is CSR format, d_csrValA is of size nnzA*batchSize // d_x is a matrix of size batchSize * m // d_b is a matrix of size batchSize * m int *d_csrRowPtrA = NULL; int *d_csrColIndA = NULL; double *d_csrValA = NULL; double *d_b = NULL; // batchSize * m double *d_x = NULL; // batchSize * m size_t size_qr = 0; size_t size_internal = 0; void *buffer_qr = NULL; // working space for numerical factorization /* | 1 | * A = | 2 | * | 3 | * | 0.1 0.1 0.1 4 | * CSR of A is based-1 * * b = [1 1 1 1] */
const int m = 4 ; const int nnzA = 7; const int csrRowPtrA[m+1] = { 1, 2, 3, 4, 8}; const int csrColIndA[nnzA] = { 1, 2, 3, 1, 2, 3, 4}; const double csrValA[nnzA] = { 1.0, 2.0, 3.0, 0.1, 0.1, 0.1, 4.0}; const double b[m] = {1.0, 1.0, 1.0, 1.0}; const int batchSize = 17; double *csrValABatch = (double*)malloc(sizeof(double)*nnzA*batchSize); double *bBatch = (double*)malloc(sizeof(double)*m*batchSize); double *xBatch = (double*)malloc(sizeof(double)*m*batchSize); assert( NULL != csrValABatch ); assert( NULL != bBatch ); assert( NULL != xBatch ); // step 1: prepare Aj and bj on host // Aj is a small perturbation of A // bj is a small perturbation of b // csrValABatch = [A0, A1, A2, ...] // bBatch = [b0, b1, b2, ...] for(int colidx = 0 ; colidx < nnzA ; colidx++){ double Areg = csrValA[colidx]; for (int batchId = 0 ; batchId < batchSize ; batchId++){ double eps = ((double)((rand() % 100) + 1)) * 1.e-4; csrValABatch[batchId*nnzA + colidx] = Areg + eps; } } for(int j = 0 ; j < m ; j++){ double breg = b[j]; for (int batchId = 0 ; batchId < batchSize ; batchId++){ double eps = ((double)((rand() % 100) + 1)) * 1.e-4; bBatch[batchId*m + j] = breg + eps; } } // step 2: create cusolver handle, qr info and matrix descriptor cusolver_status = cusolverSpCreate(&cusolverH); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cusparse_status = cusparseCreateMatDescr(&descrA); assert(cusparse_status == CUSPARSE_STATUS_SUCCESS); cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL); cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE); // base-1 cusolver_status = cusolverSpCreateCsrqrInfo(&info); assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);
    // step 3: copy Aj and bj to device
    cudaStat1 = cudaMalloc ((void**)&d_csrValA   , sizeof(double) * nnzA * batchSize);
    cudaStat2 = cudaMalloc ((void**)&d_csrColIndA, sizeof(int) * nnzA);
    cudaStat3 = cudaMalloc ((void**)&d_csrRowPtrA, sizeof(int) * (m+1));
    cudaStat4 = cudaMalloc ((void**)&d_b         , sizeof(double) * m * batchSize);
    cudaStat5 = cudaMalloc ((void**)&d_x         , sizeof(double) * m * batchSize);
    assert(cudaStat1 == cudaSuccess);
    assert(cudaStat2 == cudaSuccess);
    assert(cudaStat3 == cudaSuccess);
    assert(cudaStat4 == cudaSuccess);
    assert(cudaStat5 == cudaSuccess);

    // don't copy csrValABatch and bBatch here because device memory may not be big enough
    cudaStat1 = cudaMemcpy(d_csrColIndA, csrColIndA, sizeof(int) * nnzA, cudaMemcpyHostToDevice);
    cudaStat2 = cudaMemcpy(d_csrRowPtrA, csrRowPtrA, sizeof(int) * (m+1), cudaMemcpyHostToDevice);
    assert(cudaStat1 == cudaSuccess);
    assert(cudaStat2 == cudaSuccess);

    // step 4: symbolic analysis
    cusolver_status = cusolverSpXcsrqrAnalysisBatched(
        cusolverH, m, m, nnzA,
        descrA, d_csrRowPtrA, d_csrColIndA,
        info);
    assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

    // step 5: find "proper" batchSize
    // get available device memory
    size_t free_mem = 0;
    size_t total_mem = 0;
    cudaStat1 = cudaMemGetInfo( &free_mem, &total_mem );
    assert( cudaSuccess == cudaStat1 );

    int batchSizeMax = 2;
    while(batchSizeMax < batchSize){
        printf("batchSizeMax = %d\n", batchSizeMax);
        cusolver_status = cusolverSpDcsrqrBufferInfoBatched(
            cusolverH, m, m, nnzA,
            // d_csrValA is don't care
            descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
            batchSizeMax, // WARNING: use batchSizeMax
            info,
            &size_internal,
            &size_qr);
        assert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

        if ( (size_internal + size_qr) > free_mem ){
            // current batchSizeMax exceeds hardware limit, so cut it by half.
            batchSizeMax /= 2;
            break;
        }
        batchSizeMax *= 2; // double batchSizeMax and try it again.
    }
    // correct batchSizeMax such that it is not greater than batchSize.
batchSizeMax = imin(batchSizeMax, batchSize); printf("batchSizeMax = %d\n", batchSizeMax); // Assume device memory is not big enough, and batchSizeMax = 2 batchSizeMax = 2;
// step 6: prepare working space // [necessary] // Need to call cusolverDcsrqrBufferInfoBatched again with batchSizeMax // to fix batchSize used in numerical factorization. cusolver_status = cusolverSpDcsrqrBufferInfoBatched( cusolverH, m, m, nnzA, // d_csrValA is don't care descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA, batchSizeMax, // WARNING: use batchSizeMax info, &size_internal, &size_qr); assert(cusolver_status == CUSOLVER_STATUS_SUCCESS); printf("numerical factorization needs internal data %lld bytes\n", (long long)size_internal); printf("numerical factorization needs working space %lld bytes\n", (long long)size_qr); cudaStat1 = cudaMalloc((void**)&buffer_qr, size_qr); assert(cudaStat1 == cudaSuccess); // step 7: solve Aj*xj = bj for(int idx = 0 ; idx < batchSize; idx += batchSizeMax){ // current batchSize 'cur_batchSize' is the batchSize used in numerical factorization const int cur_batchSize = imin(batchSizeMax, batchSize - idx); printf("current batchSize = %d\n", cur_batchSize); // copy part of Aj and bj to device cudaStat1 = cudaMemcpy(d_csrValA, csrValABatch + idx*nnzA, sizeof(double) * nnzA * cur_batchSize, cudaMemcpyHostToDevice); cudaStat2 = cudaMemcpy(d_b, bBatch + idx*m, sizeof(double) * m * cur_batchSize, cudaMemcpyHostToDevice); assert(cudaStat1 == cudaSuccess); assert(cudaStat2 == cudaSuccess); // solve part of Aj*xj = bj cusolver_status = cusolverSpDcsrqrsvBatched( cusolverH, m, m, nnzA, descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA, d_b, d_x, cur_batchSize, // WARNING: use current batchSize info, buffer_qr); assert(cusolver_status == CUSOLVER_STATUS_SUCCESS); // copy part of xj back to host cudaStat1 = cudaMemcpy(xBatch + idx*m, d_x, sizeof(double) * m * cur_batchSize, cudaMemcpyDeviceToHost); assert(cudaStat1 == cudaSuccess); }
// step 7: check residual // xBatch = [x0, x1, x2, ...] const int baseA = (CUSPARSE_INDEX_BASE_ONE == cusparseGetMatIndexBase(descrA))? 1:0 ; for(int batchId = 0 ; batchId < batchSize; batchId++){ // measure |bj - Aj*xj| double *csrValAj = csrValABatch + batchId * nnzA; double *xj = xBatch + batchId * m; double *bj = bBatch + batchId * m; // sup| bj - Aj*xj| double sup_res = 0; for(int row = 0 ; row < m ; row++){ const int start = csrRowPtrA[row ] - baseA; const int end = csrRowPtrA[row+1] - baseA; double Ax = 0.0; // Aj(row,:)*xj for(int colidx = start ; colidx < end ; colidx++){ const int col = csrColIndA[colidx] - baseA; const double Areg = csrValAj[colidx]; const double xreg = xj[col]; Ax = Ax + Areg * xreg; } double r = bj[row] - Ax; sup_res = (sup_res > fabs(r))? sup_res : fabs(r); } printf("batchId %d: sup|bj - Aj*xj| = %E \n", batchId, sup_res); } for(int batchId = 0 ; batchId < batchSize; batchId++){ double *xj = xBatch + batchId * m; for(int row = 0 ; row < m ; row++){ printf("x%d[%d] = %E\n", batchId, row, xj[row]); } printf("\n"); } return 0; }
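The residual check at the end of the listing can also be isolated into a small host-side routine. The sketch below is not part of cuSolver; the function name `csr_residual_sup` is hypothetical. It computes sup|b - A*x| for one CSR matrix with the same index-base handling as the loop in the example.

```c
#include <assert.h>
#include <math.h>

/* sup-norm of b - A*x for an m x m matrix A stored in CSR format.
 * base is 0 for zero-based indices and 1 for one-based indices. */
double csr_residual_sup(int m,
                        const int *rowPtr, const int *colInd,
                        const double *val,
                        const double *x, const double *b,
                        int base)
{
    assert(m > 0 && rowPtr && colInd && val && x && b);
    assert(base == 0 || base == 1);
    double sup = 0.0;
    for (int row = 0; row < m; row++) {
        const int start = rowPtr[row]     - base;
        const int end   = rowPtr[row + 1] - base;
        double Ax = 0.0; /* A(row,:)*x */
        for (int k = start; k < end; k++) {
            Ax += val[k] * x[colInd[k] - base];
        }
        const double r = fabs(b[row] - Ax);
        if (r > sup) sup = r;
    }
    return sup;
}
```

For the unperturbed matrix A and b = [1 1 1 1] of this example, the exact solution is x = (1, 1/2, 1/3, (1 - 0.1*(1 + 1/2 + 1/3))/4), so the routine should return a value at rounding level for that x.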
C. QR Examples
C.1. QR Factorization Dense Linear Solver
This chapter provides a simple example in the C programming language of how to use a dense QR factorization to solve a linear system
$$Ax = b$$
where A is a 3x3 dense, nonsingular matrix:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 2 & 1 & 1 \end{pmatrix}$$
The following code uses three steps:
Step 1: A = Q*R by geqrf.
Step 2: B := Q^T*B by ormqr.
Step 3: solve R*X = B by trsm.
/*
 * How to compile (assume cuda is installed at /usr/local/cuda/)
 *   nvcc -c -I/usr/local/cuda/include ormqr_example.cpp
 *   g++ -fopenmp -o a.out ormqr_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

void printMatrix(int m, int n, const double*A, int lda, const char* name)
{
    for(int row = 0 ; row < m ; row++){
        for(int col = 0 ; col < n ; col++){
            double Areg = A[row + col*lda];
            printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg);
        }
    }
}

int main(int argc, char*argv[])
{
    cusolverDnHandle_t cusolverH = NULL;
    cublasHandle_t cublasH = NULL;
    cublasStatus_t cublas_status = CUBLAS_STATUS_SUCCESS;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    cudaError_t cudaStat1 = cudaSuccess;
    cudaError_t cudaStat2 = cudaSuccess;
    cudaError_t cudaStat3 = cudaSuccess;
    cudaError_t cudaStat4 = cudaSuccess;
    const int m = 3;
    const int lda = m;
    const int ldb = m;
    const int nrhs = 1; // number of right hand side vectors
/*      | 1 2 3 |
 *  A = | 4 5 6 |
 *      | 2 1 1 |
 *
 *  x = (1 1 1)'
 *  b = (6 15 4)'
 */
double A[lda*m] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0, 3.0, 6.0, 1.0}; // double X[ldb*nrhs] = { 1.0, 1.0, 1.0}; // exact solution double B[ldb*nrhs] = { 6.0, 15.0, 4.0}; double XC[ldb*nrhs]; // solution matrix from GPU double *d_A = NULL; // linear memory of GPU double *d_tau = NULL; // linear memory of GPU double *d_B = NULL; int *devInfo = NULL; // info in gpu (device copy) double *d_work = NULL; int lwork = 0; int info_gpu = 0; const double one = 1; printf("A = (matlab base-1)\n"); printMatrix(m, m, A, lda, "A"); printf("=====\n"); printf("B = (matlab base-1)\n"); printMatrix(m, nrhs, B, ldb, "B"); printf("=====\n"); // step 1: create cusolver/cublas handle cusolver_status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); cublas_status = cublasCreate(&cublasH); assert(CUBLAS_STATUS_SUCCESS == cublas_status); // step 2: copy A and B to device cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double) * lda * m); cudaStat2 = cudaMalloc ((void**)&d_tau, sizeof(double) * m); cudaStat3 = cudaMalloc ((void**)&d_B , sizeof(double) * ldb * nrhs); cudaStat4 = cudaMalloc ((void**)&devInfo, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double) * lda * m , cudaMemcpyHostToDevice); cudaStat2 = cudaMemcpy(d_B, B, sizeof(double) * ldb * nrhs, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2);
// step 3: query working space of geqrf and ormqr cusolver_status = cusolverDnDgeqrf_bufferSize( cusolverH, m, m, d_A, lda, &lwork); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); // step 4: compute QR factorization cusolver_status = cusolverDnDgeqrf( cusolverH, m, m, d_A, lda, d_tau, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1); // check if QR is good or not cudaStat1 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("after geqrf: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); // step 5: compute Q^T*B cusolver_status= cusolverDnDormqr( cusolverH, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, nrhs, m, d_A, lda, d_tau, d_B, ldb, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1);
// check if QR is good or not cudaStat1 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("after ormqr: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); // step 6: compute x = R \ Q^T*B cublas_status = cublasDtrsm( cublasH, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, m, nrhs, &one, d_A, lda, d_B, ldb); cudaStat1 = cudaDeviceSynchronize(); assert(CUBLAS_STATUS_SUCCESS == cublas_status); assert(cudaSuccess == cudaStat1); cudaStat1 = cudaMemcpy(XC, d_B, sizeof(double)*ldb*nrhs, cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("X = (matlab base-1)\n"); printMatrix(m, nrhs, XC, ldb, "X"); // free resources if (d_A ) cudaFree(d_A); if (d_tau ) cudaFree(d_tau); if (d_B ) cudaFree(d_B); if (devInfo) cudaFree(devInfo); if (d_work ) cudaFree(d_work); if (cublasH ) cublasDestroy(cublasH); if (cusolverH) cusolverDnDestroy(cusolverH); cudaDeviceReset(); return 0; }
C.2. Orthogonalization
This chapter provides a simple example in the C programming language of how to perform orthogonalization by QR factorization.
A is a 3x2 dense matrix:
$$A = \begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 2 & 1 \end{pmatrix}$$
The following code uses three steps:
Step 1: A = Q*R by geqrf.
Step 2: form Q by orgqr.
Step 3: check whether Q has orthonormal columns, that is, whether Q^T*Q = I.
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include orgqr_example.cpp * g++ -fopenmp -o a.out orgqr_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver * */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include <cuda_runtime.h> #include <cublas_v2.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cublasHandle_t cublasH = NULL; cublasStatus_t cublas_status = CUBLAS_STATUS_SUCCESS; cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; const int m = 3; const int n = 2; const int lda = m; /* | 1 2 | * A = | 4 5 | * | 2 1 | */
double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0}; double Q[lda*n]; // orthonormal columns double R[n*n]; // R = I - Q**T*Q double *d_A = NULL; double *d_tau = NULL; int *devInfo = NULL; double *d_work = NULL; double *d_R = NULL; int lwork_geqrf = 0; int lwork_orgqr = 0; int lwork = 0; int info_gpu = 0; const double h_one = 1; const double h_minus_one = -1; printf("A = (matlab base-1)\n"); printMatrix(m, n, A, lda, "A"); printf("=====\n"); // step 1: create cusolverDn/cublas handle cusolver_status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); cublas_status = cublasCreate(&cublasH); assert(CUBLAS_STATUS_SUCCESS == cublas_status); // step 2: copy A and B to device cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n); cudaStat2 = cudaMalloc ((void**)&d_tau, sizeof(double)*n); cudaStat3 = cudaMalloc ((void**)&devInfo, sizeof(int)); cudaStat4 = cudaMalloc ((void**)&d_R , sizeof(double)*n*n); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1);
// step 3: query working space of geqrf and orgqr cusolver_status = cusolverDnDgeqrf_bufferSize( cusolverH, m, n, d_A, lda, &lwork_geqrf); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cusolver_status = cusolverDnDorgqr_bufferSize( cusolverH, m, n, n, d_A, lda, &lwork_orgqr); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); // lwork = max(lwork_geqrf, lwork_orgqr) lwork = (lwork_geqrf > lwork_orgqr)? lwork_geqrf : lwork_orgqr; cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); // step 4: compute QR factorization cusolver_status = cusolverDnDgeqrf( cusolverH, m, n, d_A, lda, d_tau, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1); // check if QR is successful or not cudaStat1 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("after geqrf: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); // step 5: compute Q cusolver_status= cusolverDnDorgqr( cusolverH, m, n, n, d_A, lda, d_tau, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1);
// check if QR is good or not cudaStat1 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("after orgqr: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); cudaStat1 = cudaMemcpy(Q, d_A, sizeof(double)*lda*n, cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("Q = (matlab base-1)\n"); printMatrix(m, n, Q, lda, "Q"); // step 6: measure R = I - Q**T*Q memset(R, 0, sizeof(double)*n*n); for(int j = 0 ; j < n ; j++){ R[j + n*j] = 1.0; // R(j,j)=1 } cudaStat1 = cudaMemcpy(d_R, R, sizeof(double)*n*n, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); // R = -Q**T*Q + I cublas_status = cublasDgemm_v2( cublasH, CUBLAS_OP_T, // Q**T CUBLAS_OP_N, // Q n, // number of rows of R n, // number of columns of R m, // number of columns of Q**T &h_minus_one, /* host pointer */ d_A, // Q**T lda, d_A, // Q lda, &h_one, /* hostpointer */ d_R, n); assert(CUBLAS_STATUS_SUCCESS == cublas_status); double dR_nrm2 = 0.0; cublas_status = cublasDnrm2_v2( cublasH, n*n, d_R, 1, &dR_nrm2); assert(CUBLAS_STATUS_SUCCESS == cublas_status); printf("|I - Q**T*Q| = %E\n", dR_nrm2);
// free resources if (d_A ) cudaFree(d_A); if (d_tau ) cudaFree(d_tau); if (devInfo) cudaFree(devInfo); if (d_work ) cudaFree(d_work); if (d_R ) cudaFree(d_R); if (cublasH ) cublasDestroy(cublasH); if (cusolverH) cusolverDnDestroy(cusolverH); cudaDeviceReset(); return 0; }
D. LU Examples
D.1. LU Factorization
This chapter provides a simple example in the C programming language of how to use a dense LU factorization to solve a linear system
$$Ax = b$$
where A is a 3x3 dense, nonsingular matrix:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix}$$
The code uses getrf to compute the LU factorization and getrs to perform the forward and backward substitutions. The parameter pivot_on decides whether partial pivoting is performed.
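When pivoting is on, the pivot sequence can be turned into an explicit row permutation. The sketch below assumes the LAPACK convention for the pivot array (1-based, with row i interchanged with row Ipiv(i) at step i); the helper name `ipiv_to_perm` is hypothetical and not a cuSolver routine.

```c
#include <assert.h>

/* Convert a LAPACK-style pivot sequence (1-based) into a permutation.
 * On return, perm[i] (1-based) is the row of A found in row i+1 of P*A. */
void ipiv_to_perm(int m, const int *ipiv, int *perm)
{
    assert(m > 0 && ipiv && perm);
    for (int i = 0; i < m; i++) perm[i] = i + 1;
    for (int i = 0; i < m; i++) {
        const int j = ipiv[i] - 1;   /* row swapped with row i at step i */
        assert(i <= j && j < m);
        const int tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
}
```

Working the 3x3 matrix of this example by hand with partial pivoting gives the pivot sequence (3, 3, 3), which the helper maps to the row order (3, 1, 2). That agrees with the matrix P shown in the comment of the listing.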
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include getrf_example.cpp * g++ -fopenmp -o a.out getrf_example.o -L/usr/local/cuda/lib64 -lcusolver -lcudart */ #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; const int m = 3; const int lda = m; const int ldb = m; /* | 1 2 3 | * A = | 4 5 6 | * | 7 8 10 | * * without pivoting: A = L*U * | 1 0 0 | | 1 2 3 | * L = | 4 1 0 |, U = | 0 -3 -6 | * | 7 2 1 | | 0 0 1 | * * with pivoting: P*A = L*U * | 0 0 1 | * P = | 1 0 0 | * | 0 1 0 | * * | 1 0 0 | | 7 8 10 | * L = | 0.1429 1 0 |, U = | 0 0.8571 1.5714 | * | 0.5714 0.5 1 | | 0 0 -0.5 | */
    double A[lda*m] = { 1.0, 4.0, 7.0, 2.0, 5.0, 8.0, 3.0, 6.0, 10.0};
    double B[m] = { 1.0, 2.0, 3.0 };
    double X[m]; /* X = A\B */
    double LU[lda*m]; /* L and U */
    int Ipiv[m]; /* host copy of pivoting sequence */
    int info = 0; /* host copy of error info */

    double *d_A = NULL; /* device copy of A */
    double *d_B = NULL; /* device copy of B */
    int *d_Ipiv = NULL; /* pivoting sequence */
    int *d_info = NULL; /* error info */
    int lwork = 0; /* size of workspace */
    double *d_work = NULL; /* device workspace for getrf */

    const int pivot_on = 0;

    printf("example of getrf \n");

    if (pivot_on){
        printf("pivot is on : compute P*A = L*U \n");
    }else{
        printf("pivot is off: compute A = L*U (not numerically stable)\n");
    }

    printf("A = (matlab base-1)\n");
    printMatrix(m, m, A, lda, "A");
    printf("=====\n");

    printf("B = (matlab base-1)\n");
    printMatrix(m, 1, B, ldb, "B");
    printf("=====\n");

    /* step 1: create cusolver handle, bind a stream */
    status = cusolverDnCreate(&cusolverH);
    assert(CUSOLVER_STATUS_SUCCESS == status);

    cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    assert(cudaSuccess == cudaStat1);

    status = cusolverDnSetStream(cusolverH, stream);
    assert(CUSOLVER_STATUS_SUCCESS == status);

    /* step 2: copy A and B to device */
    cudaStat1 = cudaMalloc ((void**)&d_A, sizeof(double) * lda * m);
    cudaStat2 = cudaMalloc ((void**)&d_B, sizeof(double) * m);
    cudaStat3 = cudaMalloc ((void**)&d_Ipiv, sizeof(int) * m); /* was assigned to cudaStat2, which hid a failure of the d_B allocation */
    cudaStat4 = cudaMalloc ((void**)&d_info, sizeof(int));
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
    assert(cudaSuccess == cudaStat3);
    assert(cudaSuccess == cudaStat4);

    cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*m, cudaMemcpyHostToDevice);
    cudaStat2 = cudaMemcpy(d_B, B, sizeof(double)*m, cudaMemcpyHostToDevice);
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
/* step 3: query working space of getrf */ status = cusolverDnDgetrf_bufferSize( cusolverH, m, m, d_A, lda, &lwork); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); /* step 4: LU factorization */ if (pivot_on){ status = cusolverDnDgetrf( cusolverH, m, m, d_A, lda, d_work, d_Ipiv, d_info); }else{ status = cusolverDnDgetrf( cusolverH, m, m, d_A, lda, d_work, NULL, d_info); } cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == status); assert(cudaSuccess == cudaStat1); if (pivot_on){ cudaStat1 = cudaMemcpy(Ipiv , d_Ipiv, sizeof(int)*m, cudaMemcpyDeviceToHost); } cudaStat2 = cudaMemcpy(LU , d_A , sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); if ( 0 > info ){ printf("%d-th parameter is wrong \n", -info); exit(1); } if (pivot_on){ printf("pivoting sequence, matlab base-1\n"); for(int j = 0 ; j < m ; j++){ printf("Ipiv(%d) = %d\n", j+1, Ipiv[j]); } } printf("L and U = (matlab base-1)\n"); printMatrix(m, m, LU, lda, "LU"); printf("=====\n");
/* * step 5: solve A*X = B * | 1 | | -0.3333 | * B = | 2 |, X = | 0.6667 | * | 3 | | 0 | * */ if (pivot_on){ status = cusolverDnDgetrs( cusolverH, CUBLAS_OP_N, m, 1, /* nrhs */ d_A, lda, d_Ipiv, d_B, ldb, d_info); }else{ status = cusolverDnDgetrs( cusolverH, CUBLAS_OP_N, m, 1, /* nrhs */ d_A, lda, NULL, d_B, ldb, d_info); } cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == status); assert(cudaSuccess == cudaStat1); cudaStat1 = cudaMemcpy(X , d_B, sizeof(double)*m, cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); printf("X = (matlab base-1)\n"); printMatrix(m, 1, X, ldb, "X"); printf("=====\n"); /* free resources */ if (d_A ) cudaFree(d_A); if (d_B ) cudaFree(d_B); if (d_Ipiv ) cudaFree(d_Ipiv); if (d_info ) cudaFree(d_info); if (d_work ) cudaFree(d_work); if (cusolverH ) cusolverDnDestroy(cusolverH); if (stream ) cudaStreamDestroy(stream); cudaDeviceReset(); return 0; }
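The unpivoted factors listed in the comment of the example can be reproduced with a few lines of host code. The sketch below is a textbook in-place LU without pivoting, not the cuSolver implementation; the name `lu_nopivot` is hypothetical. It uses the usual packed layout: U on and above the diagonal, the multipliers of L below it, and the unit diagonal of L left implicit.

```c
#include <assert.h>

/* In-place LU factorization without pivoting of a column-major n x n
 * matrix. Returns 0 on success, or k+1 if pivot k (0-based) is exactly
 * zero, in which case the factorization cannot continue. */
int lu_nopivot(int n, double *A, int lda)
{
    assert(n > 0 && lda >= n && A);
    for (int k = 0; k < n; k++) {
        const double pivot = A[k + k*lda];
        if (pivot == 0.0) return k + 1;
        /* multipliers: column k of L */
        for (int i = k + 1; i < n; i++) A[i + k*lda] /= pivot;
        /* rank-1 update of the trailing submatrix */
        for (int j = k + 1; j < n; j++) {
            for (int i = k + 1; i < n; i++) {
                A[i + j*lda] -= A[i + k*lda] * A[k + j*lda];
            }
        }
    }
    return 0;
}
```

For the matrix of this example, the packed result holds the multipliers 4, 7 and 2 below the diagonal and the rows (1 2 3), (-3 -6), (1) of U, matching the L and U given in the comment. Skipping the pivot search is what makes this variant unstable in general, which is why the listing prints a warning when pivot_on is 0.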
E. Examples of Dense Eigenvalue Solver
E.1. Standard Symmetric Dense Eigenvalue Solver
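This chapter provides a simple example in the C programming language of how to use syevd to compute the spectrum of a dense symmetric system by
$$Ax = \lambda x$$
where A is a 3x3 dense symmetric matrix:
$$A = \begin{pmatrix} 3.5 & 0.5 & 0 \\ 0.5 & 3.5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$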
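The exact eigenvalues of this matrix are easy to confirm by hand: the eigenpairs are (2, (0 0 1)'), (3, (1 -1 0)') and (4, (1 1 0)'). A small host-side residual check, sketched below with the hypothetical name `eig_residual_sup`, can be used to validate any computed eigenpair in the same way.

```c
#include <assert.h>
#include <math.h>

/* sup-norm of A*v - lambda*v for a column-major m x m matrix A.
 * A value near zero means (lambda, v) is an eigenpair of A. */
double eig_residual_sup(int m, const double *A, int lda,
                        const double *v, double lambda)
{
    assert(m > 0 && lda >= m && A && v);
    double sup = 0.0;
    for (int row = 0; row < m; row++) {
        double Av = 0.0; /* A(row,:)*v */
        for (int col = 0; col < m; col++) Av += A[row + col*lda] * v[col];
        const double r = fabs(Av - lambda * v[row]);
        if (r > sup) sup = r;
    }
    return sup;
}
```

The eigenvectors returned by a solver are normalized and may differ in sign from the ones above, but the residual test applies to them unchanged.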
The following code uses syevd to compute eigenvalues and eigenvectors, then compares the computed eigenvalues to the exact eigenvalues {2,3,4}.
/*
 * How to compile (assume cuda is installed at /usr/local/cuda/)
 *   nvcc -c -I/usr/local/cuda/include syevd_example.cpp
 *   g++ -fopenmp -o a.out syevd_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // fabs, used in the eigenvalue check
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

void printMatrix(int m, int n, const double*A, int lda, const char* name)
{
    for(int row = 0 ; row < m ; row++){
        for(int col = 0 ; col < n ; col++){
            double Areg = A[row + col*lda];
            printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg);
        }
    }
}

int main(int argc, char*argv[])
{
    cusolverDnHandle_t cusolverH = NULL;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    cudaError_t cudaStat1 = cudaSuccess;
    cudaError_t cudaStat2 = cudaSuccess;
    cudaError_t cudaStat3 = cudaSuccess;
    const int m = 3;
    const int lda = m;
/*      | 3.5 0.5 0 |
 *  A = | 0.5 3.5 0 |
 *      | 0   0   2 |
 *
 */
    double A[lda*m] = { 3.5, 0.5, 0, 0.5, 3.5, 0, 0, 0, 2.0};
    double lambda[m] = { 2.0, 3.0, 4.0};

    double V[lda*m]; // eigenvectors
    double W[m]; // eigenvalues

    double *d_A = NULL;
    double *d_W = NULL;
    int *devInfo = NULL;
    double *d_work = NULL;
    int lwork = 0;

    int info_gpu = 0;

    printf("A = (matlab base-1)\n");
    printMatrix(m, m, A, lda, "A");
    printf("=====\n");
// step 1: create cusolver/cublas handle cusolver_status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); // step 2: copy A and B to device cudaStat1 = cudaMalloc ((void**)&d_A, sizeof(double) * lda * m); cudaStat2 = cudaMalloc ((void**)&d_W, sizeof(double) * m); cudaStat3 = cudaMalloc ((void**)&devInfo, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double) * lda * m, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); // step 3: query working space of syevd cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvalues and eigenvectors. cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER; cusolver_status = cusolverDnDsyevd_bufferSize( cusolverH, jobz, uplo, m, d_A, lda, d_W, &lwork); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); // step 4: compute spectrum cusolver_status = cusolverDnDsyevd( cusolverH, jobz, uplo, m, d_A, lda, d_W, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1); cudaStat1 = cudaMemcpy(W, d_W, sizeof(double)*m, cudaMemcpyDeviceToHost); cudaStat2 = cudaMemcpy(V, d_A, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3);
printf("after syevd: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); printf("eigenvalue = (matlab base-1), ascending order\n"); for(int i = 0 ; i < m ; i++){ printf("W[%d] = %E\n", i+1, W[i]); } printf("V = (matlab base-1)\n"); printMatrix(m, m, V, lda, "V"); printf("=====\n"); // step 4: check eigenvalues double lambda_sup = 0; for(int i = 0 ; i < m ; i++){ double error = fabs( lambda[i] - W[i]); lambda_sup = (lambda_sup > error)? lambda_sup : error; } printf("|lambda - W| = %E\n", lambda_sup); // free resources if (d_A ) cudaFree(d_A); if (d_W ) cudaFree(d_W); if (devInfo) cudaFree(devInfo); if (d_work ) cudaFree(d_work); if (cusolverH) cusolverDnDestroy(cusolverH); cudaDeviceReset(); return 0; }
E.2. Generalized Symmetric-Definite Dense Eigenvalue Solver
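This chapter provides a simple example in the C programming language of how to use sygvd to compute the spectrum of a pair of dense symmetric matrices (A,B) by
$$Ax = \lambda Bx$$
where A is a 3x3 dense symmetric matrix
$$A = \begin{pmatrix} 3.5 & 0.5 & 0 \\ 0.5 & 3.5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
and B is a 3x3 positive definite matrix
$$B = \begin{pmatrix} 10 & 2 & 3 \\ 2 & 10 & 5 \\ 3 & 5 & 10 \end{pmatrix}$$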
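A value lambda is a generalized eigenvalue of the pair (A,B) exactly when det(A - lambda*B) = 0. For 3x3 matrices this can be checked directly on the host. The sketch below is only a sanity check for small examples, not a solver; the names `det_pencil3` and `is_pencil_eigenvalue` are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Determinant of the 3 x 3 column-major matrix A - lambda*B,
 * computed by cofactor expansion along the first row. */
double det_pencil3(const double *A, const double *B, double lambda)
{
    assert(A && B);
    double M[9]; /* M(r,c) is stored at M[r + 3*c] */
    for (int i = 0; i < 9; i++) M[i] = A[i] - lambda * B[i];
    return M[0] * (M[4]*M[8] - M[7]*M[5])
         - M[3] * (M[1]*M[8] - M[7]*M[2])
         + M[6] * (M[1]*M[5] - M[4]*M[2]);
}

/* Returns 1 if |det(A - lambda*B)| <= tol, else 0. */
int is_pencil_eigenvalue(const double *A, const double *B,
                         double lambda, double tol)
{
    return fabs(det_pencil3(A, B, lambda)) <= tol;
}
```

For the matrices of this example, expanding the determinant gives 24 - 256*lambda + 768*lambda^2 - 680*lambda^3, whose roots are 0.6 and (9 ± sqrt(13))/34, in agreement with the reference eigenvalues used in the listing.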
The following code uses sygvd to compute eigenvalues and eigenvectors, then compares the computed eigenvalues to the exact eigenvalues {0.158660256604, 0.370751508101882, 0.6}.
/*
 * How to compile (assume cuda is installed at /usr/local/cuda/)
 *   nvcc -c -I/usr/local/cuda/include sygvd_example.cpp
 *   g++ -fopenmp -o a.out sygvd_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // fabs, used in the eigenvalue check
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

void printMatrix(int m, int n, const double*A, int lda, const char* name)
{
    for(int row = 0 ; row < m ; row++){
        for(int col = 0 ; col < n ; col++){
            double Areg = A[row + col*lda];
            printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg);
        }
    }
}

int main(int argc, char*argv[])
{
    cusolverDnHandle_t cusolverH = NULL;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    cudaError_t cudaStat1 = cudaSuccess;
    cudaError_t cudaStat2 = cudaSuccess;
    cudaError_t cudaStat3 = cudaSuccess;
    cudaError_t cudaStat4 = cudaSuccess;
    const int m = 3;
    const int lda = m;
/*
 *      | 3.5 0.5 0 |
 *  A = | 0.5 3.5 0 |
 *      | 0   0   2 |
 *
 *      | 10  2  3 |
 *  B = |  2 10  5 |
 *      |  3  5 10 |
 */
    double A[lda*m] = { 3.5, 0.5, 0, 0.5, 3.5, 0, 0, 0, 2.0};
    double B[lda*m] = { 10.0, 2.0, 3.0, 2.0, 10.0, 5.0, 3.0, 5.0, 10.0};
    double lambda[m] = { 0.158660256604, 0.370751508101882, 0.6};

    double V[lda*m]; // eigenvectors
    double W[m]; // eigenvalues

    double *d_A = NULL;
    double *d_B = NULL;
    double *d_W = NULL;
    int *devInfo = NULL;
    double *d_work = NULL;
    int lwork = 0;
    int info_gpu = 0;

    printf("A = (matlab base-1)\n");
    printMatrix(m, m, A, lda, "A");
    printf("=====\n");

    printf("B = (matlab base-1)\n");
    printMatrix(m, m, B, lda, "B");
    printf("=====\n");
// step 1: create cusolver/cublas handle cusolver_status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); // step 2: copy A and B to device cudaStat1 = cudaMalloc ((void**)&d_A, sizeof(double) * lda * m); cudaStat2 = cudaMalloc ((void**)&d_B, sizeof(double) * lda * m); cudaStat3 = cudaMalloc ((void**)&d_W, sizeof(double) * m); cudaStat4 = cudaMalloc ((void**)&devInfo, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double) * lda * m, cudaMemcpyHostToDevice); cudaStat2 = cudaMemcpy(d_B, B, sizeof(double) * lda * m, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); // step 3: query working space of sygvd cusolverEigType_t itype = CUSOLVER_EIG_TYPE_1; // A*x = (lambda)*B*x cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvalues and eigenvectors. cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER; cusolver_status = cusolverDnDsygvd_bufferSize( cusolverH, itype, jobz, uplo, m, d_A, lda, d_B, lda, d_W, &lwork); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); // step 4: compute spectrum of (A,B) cusolver_status = cusolverDnDsygvd( cusolverH, itype, jobz, uplo, m, d_A, lda, d_B, lda, d_W, d_work, lwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1);
cudaStat1 = cudaMemcpy(W, d_W, sizeof(double)*m, cudaMemcpyDeviceToHost); cudaStat2 = cudaMemcpy(V, d_A, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); printf("after sygvd: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); printf("eigenvalue = (matlab base-1), ascending order\n"); for(int i = 0 ; i < m ; i++){ printf("W[%d] = %E\n", i+1, W[i]); } printf("V = (matlab base-1)\n"); printMatrix(m, m, V, lda, "V"); printf("=====\n"); // step 4: check eigenvalues double lambda_sup = 0; for(int i = 0 ; i < m ; i++){ double error = fabs( lambda[i] - W[i]); lambda_sup = (lambda_sup > error)? lambda_sup : error; } printf("|lambda - W| = %E\n", lambda_sup); // free resources if (d_A ) cudaFree(d_A); if (d_B ) cudaFree(d_B); if (d_W ) cudaFree(d_W); if (devInfo) cudaFree(devInfo); if (d_work ) cudaFree(d_work); if (cusolverH) cusolverDnDestroy(cusolverH); cudaDeviceReset(); return 0; }
E.3. Standard Symmetric Dense Eigenvalue Solver (via Jacobi method)
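This chapter provides a simple example in the C programming language of how to use syevj to compute the spectrum of a dense symmetric system by
$$Ax = \lambda x$$
where A is a 3x3 dense symmetric matrix:
$$A = \begin{pmatrix} 3.5 & 0.5 & 0 \\ 0.5 & 3.5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$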
The following code uses syevj to compute eigenvalues and eigenvectors, then compares the computed eigenvalues to the exact eigenvalues {2,3,4}.
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include syevj_example.cpp * g++ -fopenmp -o syevj_example syevj_example.o -L/usr/local/cuda/lib64 -lcusolver -lcudart */ #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; syevjInfo_t syevj_params = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; const int m = 3; const int lda = m; /* | 3.5 0.5 0 | * A = | 0.5 3.5 0 | * | 0 0 2 | * */ double A[lda*m] = { 3.5, 0.5, 0, 0.5, 3.5, 0, 0, 0, 2.0}; double lambda[m] = { 2.0, 3.0, 4.0}; double V[lda*m]; /* eigenvectors */ double W[m]; /* eigenvalues */ double *d_A = NULL; /* device copy of A */ double *d_W = NULL; /* eigenvalues */ int *d_info = NULL; /* error info */ int lwork = 0; /* size of workspace */ double *d_work = NULL; /* device workspace for syevj */ int info = 0; /* host copy of error info */ /* configuration of syevj */ const double tol = 1.e-7; const int max_sweeps = 15; const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors. const cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
/* numerical results of syevj */ double residual = 0; int executed_sweeps = 0; printf("example of syevj \n"); printf("tol = %E, default value is machine zero \n", tol); printf("max. sweeps = %d, default value is 100\n", max_sweeps); printf("A = (matlab base-1)\n"); printMatrix(m, m, A, lda, "A"); printf("=====\n"); /* step 1: create cusolver handle, bind a stream */ status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); assert(cudaSuccess == cudaStat1); status = cusolverDnSetStream(cusolverH, stream); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 2: configuration of syevj */ status = cusolverDnCreateSyevjInfo(&syevj_params); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of tolerance is machine zero */ status = cusolverDnXsyevjSetTolerance( syevj_params, tol); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of max. sweeps is 100 */ status = cusolverDnXsyevjSetMaxSweeps( syevj_params, max_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 3: copy A to device */ cudaStat1 = cudaMalloc ((void**)&d_A, sizeof(double) * lda * m); cudaStat2 = cudaMalloc ((void**)&d_W, sizeof(double) * m); cudaStat3 = cudaMalloc ((void**)&d_info, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*m, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1);
/* step 4: query working space of syevj */ status = cusolverDnDsyevj_bufferSize( cusolverH, jobz, uplo, m, d_A, lda, d_W, &lwork, syevj_params); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); /* step 5: compute eigen-pair */ status = cusolverDnDsyevj( cusolverH, jobz, uplo, m, d_A, lda, d_W, d_work, lwork, d_info, syevj_params); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == status); assert(cudaSuccess == cudaStat1); cudaStat1 = cudaMemcpy(W, d_W, sizeof(double)*m, cudaMemcpyDeviceToHost); cudaStat2 = cudaMemcpy(V, d_A, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); if ( 0 == info ){ printf("syevj converges \n"); }else if ( 0 > info ){ printf("%d-th parameter is wrong \n", -info); exit(1); }else{ printf("WARNING: info = %d : syevj does not converge \n", info ); } printf("Eigenvalue = (matlab base-1), ascending order\n"); for(int i = 0 ; i < m ; i++){ printf("W[%d] = %E\n", i+1, W[i]); } printf("V = (matlab base-1)\n"); printMatrix(m, m, V, lda, "V"); printf("=====\n");
/* step 6: check eigenvalues */ double lambda_sup = 0; for(int i = 0 ; i < m ; i++){ double error = fabs( lambda[i] - W[i]); lambda_sup = (lambda_sup > error)? lambda_sup : error; } printf("|lambda - W| = %E\n", lambda_sup); status = cusolverDnXsyevjGetSweeps( cusolverH, syevj_params, &executed_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status); status = cusolverDnXsyevjGetResidual( cusolverH, syevj_params, &residual); assert(CUSOLVER_STATUS_SUCCESS == status); printf("residual |A - V*W*V**H|_F = %E \n", residual ); printf("number of executed sweeps = %d \n", executed_sweeps ); /* free resources */ if (d_A ) cudaFree(d_A); if (d_W ) cudaFree(d_W); if (d_info ) cudaFree(d_info); if (d_work ) cudaFree(d_work); if (cusolverH ) cusolverDnDestroy(cusolverH); if (stream ) cudaStreamDestroy(stream); if (syevj_params) cusolverDnDestroySyevjInfo(syevj_params); cudaDeviceReset(); return 0; }
E.4. Generalized Symmetric-Definite Dense Eigenvalue Solver (via Jacobi method)
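This chapter provides a simple example in the C programming language of how to use sygvj to compute the spectrum of a pair of dense symmetric matrices (A,B) by
\[ A x = \lambda B x \]
where A is a 3x3 dense symmetric matrix
\[ A = \begin{pmatrix} 3.5 & 0.5 & 0 \\ 0.5 & 3.5 & 0 \\ 0 & 0 & 2 \end{pmatrix} \]
and B is a 3x3 positive definite matrix
\[ B = \begin{pmatrix} 10 & 2 & 3 \\ 2 & 10 & 5 \\ 3 & 5 & 10 \end{pmatrix} \]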
The following code uses sygvj to compute eigenvalues and eigenvectors.
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include sygvj_example.cpp * g++ -fopenmp -o sygvj_example sygvj_example.o -L/usr/local/cuda/lib64 -lcusolver -lcudart */ #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; syevjInfo_t syevj_params = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; const int m = 3; const int lda = m; /* * | 3.5 0.5 0 | * A = | 0.5 3.5 0 | * | 0 0 2 | * * | 10 2 3 | * B = | 2 10 5 | * | 3 5 10 | */ double A[lda*m] = { 3.5, 0.5, 0, 0.5, 3.5, 0, 0, 0, 2.0}; double B[lda*m] = { 10.0, 2.0, 3.0, 2.0, 10.0, 5.0, 3.0, 5.0, 10.0}; double lambda[m] = { 0.158660256604, 0.370751508101882, 0.6}; double V[lda*m]; /* eigenvectors */ double W[m]; /* eigenvalues */ double *d_A = NULL; /* device copy of A */ double *d_B = NULL; /* device copy of B */ double *d_W = NULL; /* numerical eigenvalue */ int *d_info = NULL; /* error info */ int lwork = 0; /* size of workspace */ double *d_work = NULL; /* device workspace for sygvj */ int info = 0; /* host copy of error info */
/* configuration of sygvj */ const double tol = 1.e-7; const int max_sweeps = 15; const cusolverEigType_t itype = CUSOLVER_EIG_TYPE_1; // A*x = (lambda)*B*x const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors. const cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER; /* numerical results of syevj */ double residual = 0; int executed_sweeps = 0; printf("example of sygvj \n"); printf("tol = %E, default value is machine zero \n", tol); printf("max. sweeps = %d, default value is 100\n", max_sweeps); printf("A = (matlab base-1)\n"); printMatrix(m, m, A, lda, "A"); printf("=====\n"); printf("B = (matlab base-1)\n"); printMatrix(m, m, B, lda, "B"); printf("=====\n"); /* step 1: create cusolver handle, bind a stream */ status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); assert(cudaSuccess == cudaStat1); status = cusolverDnSetStream(cusolverH, stream); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 2: configuration of syevj */ status = cusolverDnCreateSyevjInfo(&syevj_params); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of tolerance is machine zero */ status = cusolverDnXsyevjSetTolerance( syevj_params, tol); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of max. sweeps is 100 */ status = cusolverDnXsyevjSetMaxSweeps( syevj_params, max_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status);
/* step 3: copy A and B to device */ cudaStat1 = cudaMalloc ((void**)&d_A, sizeof(double) * lda * m); cudaStat2 = cudaMalloc ((void**)&d_B, sizeof(double) * lda * m); cudaStat3 = cudaMalloc ((void**)&d_W, sizeof(double) * m); cudaStat4 = cudaMalloc ((void**)&d_info, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double) * lda * m, cudaMemcpyHostToDevice); cudaStat2 = cudaMemcpy(d_B, B, sizeof(double) * lda * m, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); /* step 4: query working space of sygvj */ status = cusolverDnDsygvj_bufferSize( cusolverH, itype, jobz, uplo, m, d_A, lda, d_B, lda, /* ldb */ d_W, &lwork, syevj_params); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); /* step 5: compute spectrum of (A,B) */ status = cusolverDnDsygvj( cusolverH, itype, jobz, uplo, m, d_A, lda, d_B, lda, /* ldb */ d_W, d_work, lwork, d_info, syevj_params); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == status); assert(cudaSuccess == cudaStat1); cudaStat1 = cudaMemcpy(W, d_W, sizeof(double)*m, cudaMemcpyDeviceToHost); cudaStat2 = cudaMemcpy(V, d_A, sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3);
    if ( 0 == info ){
        printf("sygvj converges \n");
    }else if ( 0 > info ){
        printf("Error: %d-th parameter is wrong \n", -info);
        exit(1);
    }else if ( m >= info ){ /* 1 <= info <= m */
        printf("Error: leading minor of order %d of B is not positive definite\n", info);
        exit(1);
    }else { /* info = m+1 */
        printf("WARNING: info = %d : sygvj does not converge \n", info );
    }

    printf("Eigenvalue = (matlab base-1), ascending order\n");
    for(int i = 0 ; i < m ; i++){
        printf("W[%d] = %E\n", i+1, W[i]);
    }
    printf("V = (matlab base-1)\n");
    printMatrix(m, m, V, lda, "V");
    printf("=====\n");

    /* step 6: check eigenvalues */
    double lambda_sup = 0;
    for(int i = 0 ; i < m ; i++){
        double error = fabs( lambda[i] - W[i]);
        lambda_sup = (lambda_sup > error)? lambda_sup : error;
    }
    printf("|lambda - W| = %E\n", lambda_sup);

    status = cusolverDnXsyevjGetSweeps( cusolverH, syevj_params, &executed_sweeps);
    assert(CUSOLVER_STATUS_SUCCESS == status);
    status = cusolverDnXsyevjGetResidual( cusolverH, syevj_params, &residual);
    assert(CUSOLVER_STATUS_SUCCESS == status);
    printf("residual |M - V*W*V**H|_F = %E \n", residual );
    printf("number of executed sweeps = %d \n", executed_sweeps );

    /* free resources */
    if (d_A ) cudaFree(d_A);
    if (d_B ) cudaFree(d_B);
    if (d_W ) cudaFree(d_W);
    if (d_info ) cudaFree(d_info);
    if (d_work ) cudaFree(d_work);
    if (cusolverH) cusolverDnDestroy(cusolverH);
    if (stream ) cudaStreamDestroy(stream);
    if (syevj_params) cusolverDnDestroySyevjInfo(syevj_params);
    cudaDeviceReset();
    return 0;
}
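The reference eigenvalues stored in lambda in the example above can be sanity-checked on the host without a GPU: a value lambda belongs to the spectrum of the pair (A,B) exactly when det(A - lambda*B) = 0. The sketch below evaluates that determinant for 3x3 column-major inputs. The helper names det3 and pencil_det3 are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Determinant of a 3-by-3 matrix stored column-major with leading
 * dimension 3, i.e. element (row, col) lives at M[row + 3*col]. */
static double det3(const double *M)
{
    return M[0] * (M[4]*M[8] - M[7]*M[5])
         - M[3] * (M[1]*M[8] - M[7]*M[2])
         + M[6] * (M[1]*M[5] - M[4]*M[2]);
}

/* Hypothetical helper: returns det(A - lambda*B) for 3-by-3 column-major
 * A and B. The value vanishes when lambda is an eigenvalue of the
 * generalized problem A*x = lambda*B*x. */
double pencil_det3(const double *A, const double *B, double lambda)
{
    assert(A != 0 && B != 0);
    double M[9];
    for (int i = 0; i < 9; i++) {
        M[i] = A[i] - lambda * B[i];
    }
    return det3(M);
}
```

Evaluating pencil_det3 at each entry of lambda with the A and B of this section should give values that are zero up to rounding. A determinant is not a robust way to compute eigenvalues in general, so this is only meant as a quick check of known values.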
E.5. Batch Eigenvalue Solver for Dense Symmetric Matrices
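This chapter provides a simple example in the C programming language of how to use syevjBatched to compute the spectrum of a sequence of dense symmetric matrices by
\[ A_j x = \lambda x \]
where A0 and A1 are 3x3 dense symmetric matrices
\[ A_0 = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \]
\[ A_1 = \begin{pmatrix} 3 & 4 & 0 \\ 4 & 7 & 0 \\ 0 & 0 & 0 \end{pmatrix} \]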
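The following code uses syevjBatched to compute the eigenvalues and eigenvectors of each matrix, so that
\[ A_j = V_j W_j V_j^{T} \]
The user can enable or disable sorting of the eigenvalues with the function cusolverDnXsyevjSetSortEig. This example disables sorting.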
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include batchsyevj_example.cpp * g++ -fopenmp -o batchsyevj_example batchsyevj_example.o -L/usr/local/cuda/lib64 -lcusolver -lcudart * */ #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; syevjInfo_t syevj_params = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; const int m = 3; // 1<= m <= 32 const int lda = m; const int batchSize = 2; /* * | 1 -1 0 | * A0 = | -1 2 0 | * | 0 0 0 | * * A0 = V0 * W0 * V0**T * * W0 = diag(0, 0.3820, 2.6180) * * | 3 4 0 | * A1 = | 4 7 0 | * | 0 0 0 | * * A1 = V1 * W1 * V1**T * * W1 = diag(0, 0.5279, 9.4721) * */
double A[lda*m*batchSize]; /* A = [A0 ; A1] */ double V[lda*m*batchSize]; /* V = [V0 ; V1] */ double W[m*batchSize]; /* W = [W0 ; W1] */ int info[batchSize]; /* info = [info0 ; info1] */ double *d_A = NULL; /* lda-by-m-by-batchSize */ double *d_W = NULL; /* m-by-batchSizee */ int* d_info = NULL; /* batchSize */ int lwork = 0; /* size of workspace */ double *d_work = NULL; /* device workspace for syevjBatched */ const double tol = 1.e-7; const int max_sweeps = 15; const int sort_eig = 0; /* don't sort eigenvalues */ const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; /* compute eigenvectors */ const cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER; /* residual and executed_sweeps are not supported on syevjBatched */ double residual = 0; int executed_sweeps = 0; double *A0 = A; double *A1 = A + lda*m; /* * | 1 -1 0 | * A0 = | -1 2 0 | * | 0 0 0 | * A0 is column-major */ A0[0 + 0*lda] = 1.0; A0[1 + 0*lda] = -1.0; A0[2 + 0*lda] = 0.0; A0[0 + 1*lda] = -1.0; A0[1 + 1*lda] = 2.0; A0[2 + 1*lda] = 0.0; A0[0 + 2*lda] = 0.0; A0[1 + 2*lda] = 0.0; A0[2 + 2*lda] = 0.0; /* * | 3 4 0 | * A1 = | 4 7 0 | * | 0 0 0 | * A1 is column-major */ A1[0 + 0*lda] = 3.0; A1[1 + 0*lda] = 4.0; A1[2 + 0*lda] = 0.0; A1[0 + 1*lda] = 4.0; A1[1 + 1*lda] = 7.0; A1[2 + 1*lda] = 0.0; A1[0 + 2*lda] = 0.0; A1[1 + 2*lda] = 0.0; A1[2 + 2*lda] = 0.0;
/* step 1: create cusolver handle, bind a stream */ status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); assert(cudaSuccess == cudaStat1); status = cusolverDnSetStream(cusolverH, stream); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 2: configuration of syevj */ status = cusolverDnCreateSyevjInfo(&syevj_params); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of tolerance is machine zero */ status = cusolverDnXsyevjSetTolerance( syevj_params, tol); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of max. sweeps is 100 */ status = cusolverDnXsyevjSetMaxSweeps( syevj_params, max_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status); /* disable sorting */ status = cusolverDnXsyevjSetSortEig( syevj_params, sort_eig); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 3: copy A to device */ cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double) * lda * m * batchSize); cudaStat2 = cudaMalloc ((void**)&d_W , sizeof(double) * m * batchSize); cudaStat3 = cudaMalloc ((void**)&d_info, sizeof(int ) * batchSize); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double) * lda * m * batchSize, cudaMemcpyHostToDevice); cudaStat2 = cudaDeviceSynchronize(); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); /* step 4: query working space of syevjBatched */ status = cusolverDnDsyevjBatched_bufferSize( cusolverH, jobz, uplo, m, d_A, lda, d_W, &lwork, syevj_params, batchSize ); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork); assert(cudaSuccess == cudaStat1);
    /* step 5: compute spectrum of A0 and A1 */
    status = cusolverDnDsyevjBatched(
        cusolverH, jobz, uplo, m,
        d_A, lda, d_W,
        d_work, lwork,
        d_info, syevj_params, batchSize );
    cudaStat1 = cudaDeviceSynchronize();
    assert(CUSOLVER_STATUS_SUCCESS == status);
    assert(cudaSuccess == cudaStat1);

    cudaStat1 = cudaMemcpy(V , d_A , sizeof(double) * lda * m * batchSize, cudaMemcpyDeviceToHost);
    cudaStat2 = cudaMemcpy(W , d_W , sizeof(double) * m * batchSize , cudaMemcpyDeviceToHost);
    cudaStat3 = cudaMemcpy(info, d_info, sizeof(int) * batchSize , cudaMemcpyDeviceToHost);
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
    assert(cudaSuccess == cudaStat3);

    for(int i = 0 ; i < batchSize ; i++){
        if ( 0 == info[i] ){
            printf("matrix %d: syevj converges \n", i);
        }else if ( 0 > info[i] ){
            /* only info[0] shows if some input parameter is wrong.
             * If so, the error is CUSOLVER_STATUS_INVALID_VALUE. */
            printf("Error: %d-th parameter is wrong \n", -info[i] );
            exit(1);
        }else { /* info = m+1 */
            /* if info[i] is not zero, Jacobi method does not converge at i-th matrix. */
            printf("WARNING: matrix %d, info = %d : syevj does not converge \n", i, info[i] );
        }
    }

    /* step 6: show eigenvalues and eigenvectors */
    double *W0 = W;
    double *W1 = W + m;
    printf("==== \n");
    for(int i = 0 ; i < m ; i++){
        printf("W0[%d] = %f\n", i, W0[i]);
    }
    printf("==== \n");
    for(int i = 0 ; i < m ; i++){
        printf("W1[%d] = %f\n", i, W1[i]);
    }
    printf("==== \n");

    double *V0 = V;
    double *V1 = V + lda*m;
    printf("V0 = (matlab base-1)\n");
    printMatrix(m, m, V0, lda, "V0");
    printf("V1 = (matlab base-1)\n");
    printMatrix(m, m, V1, lda, "V1");
    /*
     * The following two functions do not support the batched version.
     * The error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
     */
    status = cusolverDnXsyevjGetSweeps( cusolverH, syevj_params, &executed_sweeps);
    assert(CUSOLVER_STATUS_NOT_SUPPORTED == status);
    status = cusolverDnXsyevjGetResidual( cusolverH, syevj_params, &residual);
    assert(CUSOLVER_STATUS_NOT_SUPPORTED == status);

    /* free resources */
    if (d_A ) cudaFree(d_A);
    if (d_W ) cudaFree(d_W);
    if (d_info ) cudaFree(d_info);
    if (d_work ) cudaFree(d_work);
    if (cusolverH) cusolverDnDestroy(cusolverH);
    if (stream ) cudaStreamDestroy(stream);
    if (syevj_params) cusolverDnDestroySyevjInfo(syevj_params);
    cudaDeviceReset();
    return 0;
}
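The batched example above stores its matrices back to back in one column-major array, so matrix j starts at offset j*lda*m and element (row, col) of that matrix sits at row + col*lda. The following host-only sketch spells out that addressing and adds a cheap consistency check based on the fact that the eigenvalues of a matrix sum to its trace. The helper names batch_matrix and trace are hypothetical and are not part of cuSolver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pointer to the j-th matrix in a batch of
 * lda-by-n column-major matrices stored contiguously, one after another. */
double *batch_matrix(double *A, int lda, int n, int j)
{
    assert(A != NULL && lda > 0 && n > 0 && j >= 0);
    return A + (size_t)j * lda * n;
}

/* Hypothetical helper: trace of an n-by-n column-major matrix with
 * leading dimension lda. For a symmetric matrix this equals the sum
 * of its eigenvalues. */
double trace(const double *A, int lda, int n)
{
    assert(A != NULL && lda >= n);
    double t = 0.0;
    for (int i = 0; i < n; i++) {
        t += A[i + i*lda];
    }
    return t;
}
```

With the matrices of this section, trace gives 3 for A0 and 10 for A1, which agrees with the sums of the eigenvalues quoted in the code comments (0 + 0.3820 + 2.6180 and 0 + 0.5279 + 9.4721) to the four digits shown.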
F. Examples of Singular Value Decomposition
F.1. SVD with singular vectors
This chapter provides a simple example in the C programming language of how to perform singular value decomposition with gesvd.
A is a 3x2 dense matrix,
\[ A = \begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 2 & 1 \end{pmatrix} \]
The following code uses three steps:
Step 1: compute A = U*S*VT
Step 2: check the accuracy of the singular values
Step 3: measure the residual A-U*S*VT
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include svd_example.cpp * g++ -fopenmp -o a.out svd_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver * */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include <cuda_runtime.h> #include <cublas_v2.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cublasHandle_t cublasH = NULL; cublasStatus_t cublas_status = CUBLAS_STATUS_SUCCESS; cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; cudaError_t cudaStat5 = cudaSuccess; cudaError_t cudaStat6 = cudaSuccess; const int m = 3; const int n = 2; const int lda = m; /* | 1 2 | * A = | 4 5 | * | 2 1 | */ double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0}; double U[lda*m]; // m-by-m unitary matrix double VT[lda*n]; // n-by-n unitary matrix double S[n]; // singular value double S_exact[n] = {7.065283497082729, 1.040081297712078}; double *d_A = NULL; double *d_S = NULL; double *d_U = NULL; double *d_VT = NULL; int *devInfo = NULL; double *d_work = NULL; double *d_rwork = NULL; double *d_W = NULL; // W = S*VT int lwork = 0; int info_gpu = 0; const double h_one = 1; const double h_minus_one = -1;
printf("A = (matlab base-1)\n"); printMatrix(m, n, A, lda, "A"); printf("=====\n"); // step 1: create cusolverDn/cublas handle cusolver_status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); cublas_status = cublasCreate(&cublasH); assert(CUBLAS_STATUS_SUCCESS == cublas_status); // step 2: copy A and B to device cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n); cudaStat2 = cudaMalloc ((void**)&d_S , sizeof(double)*n); cudaStat3 = cudaMalloc ((void**)&d_U , sizeof(double)*lda*m); cudaStat4 = cudaMalloc ((void**)&d_VT , sizeof(double)*lda*n); cudaStat5 = cudaMalloc ((void**)&devInfo, sizeof(int)); cudaStat6 = cudaMalloc ((void**)&d_W , sizeof(double)*lda*n); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); assert(cudaSuccess == cudaStat5); assert(cudaSuccess == cudaStat6); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); // step 3: query working space of SVD cusolver_status = cusolverDnDgesvd_bufferSize( cusolverH, m, n, &lwork ); assert (cusolver_status == CUSOLVER_STATUS_SUCCESS); cudaStat1 = cudaMalloc((void**)&d_work , sizeof(double)*lwork); assert(cudaSuccess == cudaStat1); // step 4: compute SVD signed char jobu = 'A'; // all m columns of U signed char jobvt = 'A'; // all n columns of VT cusolver_status = cusolverDnDgesvd ( cusolverH, jobu, jobvt, m, n, d_A, lda, d_S, d_U, lda, // ldu d_VT, lda, // ldvt, d_work, lwork, d_rwork, devInfo); cudaStat1 = cudaDeviceSynchronize(); assert(CUSOLVER_STATUS_SUCCESS == cusolver_status); assert(cudaSuccess == cudaStat1);
cudaStat1 = cudaMemcpy(U , d_U , sizeof(double)*lda*m, cudaMemcpyDeviceToHost); cudaStat2 = cudaMemcpy(VT, d_VT, sizeof(double)*lda*n, cudaMemcpyDeviceToHost); cudaStat3 = cudaMemcpy(S , d_S , sizeof(double)*n , cudaMemcpyDeviceToHost); cudaStat4 = cudaMemcpy(&info_gpu, devInfo, sizeof(int), cudaMemcpyDeviceToHost); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); printf("after gesvd: info_gpu = %d\n", info_gpu); assert(0 == info_gpu); printf("=====\n"); printf("S = (matlab base-1)\n"); printMatrix(n, 1, S, lda, "S"); printf("=====\n"); printf("U = (matlab base-1)\n"); printMatrix(m, m, U, lda, "U"); printf("=====\n"); printf("VT = (matlab base-1)\n"); printMatrix(n, n, VT, lda, "VT"); printf("=====\n"); // step 5: measure error of singular value double ds_sup = 0; for(int j = 0; j < n; j++){ double err = fabs( S[j] - S_exact[j] ); ds_sup = (ds_sup > err)? ds_sup : err; } printf("|S - S_exact| = %E \n", ds_sup); // step 6: |A - U*S*VT| // W = S*VT cublas_status = cublasDdgmm( cublasH, CUBLAS_SIDE_LEFT, n, n, d_VT, lda, d_S, 1, d_W, lda); assert(CUBLAS_STATUS_SUCCESS == cublas_status);
// A := -U*W + A cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1); cublas_status = cublasDgemm_v2( cublasH, CUBLAS_OP_N, // U CUBLAS_OP_N, // W m, // number of rows of A n, // number of columns of A n, // number of columns of U &h_minus_one, /* host pointer */ d_U, // U lda, d_W, // W lda, &h_one, /* hostpointer */ d_A, lda); assert(CUBLAS_STATUS_SUCCESS == cublas_status); double dR_fro = 0.0; cublas_status = cublasDnrm2_v2( cublasH, lda*n, d_A, 1, &dR_fro); assert(CUBLAS_STATUS_SUCCESS == cublas_status); printf("|A - U*S*VT| = %E \n", dR_fro); // free resources if (d_A ) cudaFree(d_A); if (d_S ) cudaFree(d_S); if (d_U ) cudaFree(d_U); if (d_VT ) cudaFree(d_VT); if (devInfo) cudaFree(devInfo); if (d_work ) cudaFree(d_work); if (d_rwork) cudaFree(d_rwork); if (d_W ) cudaFree(d_W); if (cublasH ) cublasDestroy(cublasH); if (cusolverH) cusolverDnDestroy(cusolverH); cudaDeviceReset(); return 0; }
F.2. SVD with singular vectors (via Jacobi method)
This chapter provides a simple example in the C programming language of how to perform singular value decomposition using gesvdj.
A is a 3x2 dense matrix,
\[ A = \begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 2 & 1 \end{pmatrix} \]
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include gesvdj_example.cpp * g++ -fopenmp -o gesvdj_example gesvdj_example.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusolver */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %20.16E\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; gesvdjInfo_t gesvdj_params = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; cudaError_t cudaStat5 = cudaSuccess; const int m = 3; const int n = 2; const int lda = m; /* | 1 2 | * A = | 4 5 | * | 2 1 | */ double A[lda*n] = { 1.0, 4.0, 2.0, 2.0, 5.0, 1.0}; double U[lda*m]; /* m-by-m unitary matrix, left singular vectors */ double V[lda*n]; /* n-by-n unitary matrix, right singular vectors */ double S[n]; /* numerical singular value */ /* exact singular values */ double S_exact[n] = {7.065283497082729, 1.040081297712078}; double *d_A = NULL; /* device copy of A */ double *d_S = NULL; /* singular values */ double *d_U = NULL; /* left singular vectors */ double *d_V = NULL; /* right singular vectors */ int *d_info = NULL; /* error info */ int lwork = 0; /* size of workspace */ double *d_work = NULL; /* devie workspace for gesvdj */ int info = 0; /* host copy of error info */
/* configuration of gesvdj */ const double tol = 1.e-7; const int max_sweeps = 15; const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvectors. const int econ = 0 ; /* econ = 1 for economy size */ /* numerical results of gesvdj */ double residual = 0; int executed_sweeps = 0; printf("example of gesvdj \n"); printf("tol = %E, default value is machine zero \n", tol); printf("max. sweeps = %d, default value is 100\n", max_sweeps); printf("econ = %d \n", econ); printf("A = (matlab base-1)\n"); printMatrix(m, n, A, lda, "A"); printf("=====\n"); /* step 1: create cusolver handle, bind a stream */ status = cusolverDnCreate(&cusolverH); assert(CUSOLVER_STATUS_SUCCESS == status); cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); assert(cudaSuccess == cudaStat1); status = cusolverDnSetStream(cusolverH, stream); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 2: configuration of gesvdj */ status = cusolverDnCreateGesvdjInfo(&gesvdj_params); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of tolerance is machine zero */ status = cusolverDnXgesvdjSetTolerance( gesvdj_params, tol); assert(CUSOLVER_STATUS_SUCCESS == status); /* default value of max. sweeps is 100 */ status = cusolverDnXgesvdjSetMaxSweeps( gesvdj_params, max_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status); /* step 3: copy A and B to device */ cudaStat1 = cudaMalloc ((void**)&d_A , sizeof(double)*lda*n); cudaStat2 = cudaMalloc ((void**)&d_S , sizeof(double)*n); cudaStat3 = cudaMalloc ((void**)&d_U , sizeof(double)*lda*m); cudaStat4 = cudaMalloc ((void**)&d_V , sizeof(double)*lda*n); cudaStat5 = cudaMalloc ((void**)&d_info, sizeof(int)); assert(cudaSuccess == cudaStat1); assert(cudaSuccess == cudaStat2); assert(cudaSuccess == cudaStat3); assert(cudaSuccess == cudaStat4); assert(cudaSuccess == cudaStat5); cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n, cudaMemcpyHostToDevice); assert(cudaSuccess == cudaStat1);
    /* step 4: query workspace of SVD */
    status = cusolverDnDgesvdj_bufferSize(
        cusolverH,
        jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */
              /* CUSOLVER_EIG_MODE_VECTOR: compute singular values and singular vectors */
        econ, /* econ = 1 for economy size */
        m,    /* number of rows of A, 0 <= m */
        n,    /* number of columns of A, 0 <= n */
        d_A,  /* m-by-n */
        lda,  /* leading dimension of A */
        d_S,  /* min(m,n) */
              /* the singular values in descending order */
        d_U,  /* m-by-m if econ = 0 */
              /* m-by-min(m,n) if econ = 1 */
        lda,  /* leading dimension of U, ldu >= max(1,m) */
        d_V,  /* n-by-n if econ = 0 */
              /* n-by-min(m,n) if econ = 1 */
        lda,  /* leading dimension of V, ldv >= max(1,n) */
        &lwork,
        gesvdj_params);
    assert(CUSOLVER_STATUS_SUCCESS == status);

    cudaStat1 = cudaMalloc((void**)&d_work , sizeof(double)*lwork);
    assert(cudaSuccess == cudaStat1);

    /* step 5: compute SVD */
    status = cusolverDnDgesvdj(
        cusolverH,
        jobz, /* CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only */
              /* CUSOLVER_EIG_MODE_VECTOR: compute singular values and singular vectors */
        econ, /* econ = 1 for economy size */
        m,    /* number of rows of A, 0 <= m */
        n,    /* number of columns of A, 0 <= n */
        d_A,  /* m-by-n */
        lda,  /* leading dimension of A */
        d_S,  /* min(m,n) */
              /* the singular values in descending order */
        d_U,  /* m-by-m if econ = 0 */
              /* m-by-min(m,n) if econ = 1 */
        lda,  /* leading dimension of U, ldu >= max(1,m) */
        d_V,  /* n-by-n if econ = 0 */
              /* n-by-min(m,n) if econ = 1 */
        lda,  /* leading dimension of V, ldv >= max(1,n) */
        d_work,
        lwork,
        d_info,
        gesvdj_params);
    cudaStat1 = cudaDeviceSynchronize();
    assert(CUSOLVER_STATUS_SUCCESS == status);
    assert(cudaSuccess == cudaStat1);

    cudaStat1 = cudaMemcpy(U, d_U, sizeof(double)*lda*m, cudaMemcpyDeviceToHost);
    cudaStat2 = cudaMemcpy(V, d_V, sizeof(double)*lda*n, cudaMemcpyDeviceToHost);
    cudaStat3 = cudaMemcpy(S, d_S, sizeof(double)*n , cudaMemcpyDeviceToHost);
    cudaStat4 = cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    cudaStat5 = cudaDeviceSynchronize();
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
    assert(cudaSuccess == cudaStat3);
    assert(cudaSuccess == cudaStat4);
    assert(cudaSuccess == cudaStat5);
if ( 0 == info ){ printf("gesvdj converges \n"); }else if ( 0 > info ){ printf("%d-th parameter is wrong \n", -info); exit(1); }else{ printf("WARNING: info = %d : gesvdj does not converge \n", info ); } printf("S = singular values (matlab base-1)\n"); printMatrix(n, 1, S, lda, "S"); printf("=====\n"); printf("U = left singular vectors (matlab base-1)\n"); printMatrix(m, m, U, lda, "U"); printf("=====\n"); printf("V = right singular vectors (matlab base-1)\n"); printMatrix(n, n, V, lda, "V"); printf("=====\n"); /* step 6: measure error of singular value */ double ds_sup = 0; for(int j = 0; j < n; j++){ double err = fabs( S[j] - S_exact[j] ); ds_sup = (ds_sup > err)? ds_sup : err; } printf("|S - S_exact|_sup = %E \n", ds_sup); status = cusolverDnXgesvdjGetSweeps( cusolverH, gesvdj_params, &executed_sweeps); assert(CUSOLVER_STATUS_SUCCESS == status); status = cusolverDnXgesvdjGetResidual( cusolverH, gesvdj_params, &residual); assert(CUSOLVER_STATUS_SUCCESS == status); printf("residual |A - U*S*V**H|_F = %E \n", residual ); printf("number of executed sweeps = %d \n", executed_sweeps ); /* free resources */ if (d_A ) cudaFree(d_A); if (d_S ) cudaFree(d_S); if (d_U ) cudaFree(d_U); if (d_V ) cudaFree(d_V); if (d_info) cudaFree(d_info); if (d_work ) cudaFree(d_work); if (cusolverH) cusolverDnDestroy(cusolverH); if (stream ) cudaStreamDestroy(stream); if (gesvdj_params) cusolverDnDestroyGesvdjInfo(gesvdj_params); cudaDeviceReset(); return 0; }
F.3. Batch Dense SVD Solver
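This chapter provides a simple example in the C programming language of how to use gesvdjBatched to compute the SVD of a sequence of dense matrices
\[ A_j = U_j S_j V_j^{T} \]
where A0 and A1 are 3x2 dense matrices
\[ A_0 = \begin{pmatrix} 1 & -1 \\ -1 & 2 \\ 0 & 0 \end{pmatrix} \]
\[ A_1 = \begin{pmatrix} 3 & 4 \\ 4 & 7 \\ 0 & 0 \end{pmatrix} \]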
The following code uses gesvdjBatched to compute singular values and singular vectors.
The user can disable/enable sorting by the function cusolverDnXgesvdjSetSortEig.
/* * How to compile (assume cuda is installed at /usr/local/cuda/) * nvcc -c -I/usr/local/cuda/include gesvdjbatch_example.cpp * g++ -fopenmp -o gesvdjbatch_example gesvdjbatch_example.o -L/usr/local/cuda/lib64 -lcusolver -lcudart */ #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <cuda_runtime.h> #include <cusolverDn.h> void printMatrix(int m, int n, const double*A, int lda, const char* name) { for(int row = 0 ; row < m ; row++){ for(int col = 0 ; col < n ; col++){ double Areg = A[row + col*lda]; printf("%s(%d,%d) = %20.16E\n", name, row+1, col+1, Areg); } } } int main(int argc, char*argv[]) { cusolverDnHandle_t cusolverH = NULL; cudaStream_t stream = NULL; gesvdjInfo_t gesvdj_params = NULL; cusolverStatus_t status = CUSOLVER_STATUS_SUCCESS; cudaError_t cudaStat1 = cudaSuccess; cudaError_t cudaStat2 = cudaSuccess; cudaError_t cudaStat3 = cudaSuccess; cudaError_t cudaStat4 = cudaSuccess; cudaError_t cudaStat5 = cudaSuccess; const int m = 3; /* 1 <= m <= 32 */ const int n = 2; /* 1 <= n <= 32 */ const int lda = m; /* lda >= m */ const int ldu = m; /* ldu >= m */ const int ldv = n; /* ldv >= n */ const int batchSize = 2; const int minmn = (m < n)? m : n; /* min(m,n) */ /* * | 1 -1 | * A0 = | -1 2 | * | 0 0 | * * A0 = U0 * S0 * V0**T * S0 = diag(2.6180, 0.382) * * | 3 4 | * A1 = | 4 7 | * | 0 0 | * * A1 = U1 * S1 * V1**T * S1 = diag(9.4721, 0.5279) */
    double A[lda*n*batchSize]; /* A = [A0 ; A1] */
    double U[ldu*m*batchSize]; /* U = [U0 ; U1] */
    double V[ldv*n*batchSize]; /* V = [V0 ; V1] */
    double S[minmn*batchSize]; /* S = [S0 ; S1] */
    int info[batchSize];       /* info = [info0 ; info1] */

    double *d_A  = NULL; /* lda-by-n-by-batchSize */
    double *d_U  = NULL; /* ldu-by-m-by-batchSize */
    double *d_V  = NULL; /* ldv-by-n-by-batchSize */
    double *d_S  = NULL; /* minmn-by-batchSize */
    int* d_info  = NULL; /* batchSize */
    int lwork = 0;       /* size of workspace */
    double *d_work = NULL; /* device workspace for gesvdjBatched */

    const double tol = 1.e-7;
    const int max_sweeps = 15;
    const int sort_svd  = 0;   /* don't sort singular values */
    const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; /* compute singular vectors */

/* residual and executed_sweeps are not supported on gesvdjBatched */
    double residual = 0;
    int executed_sweeps = 0;

    double *A0 = A;
    double *A1 = A + lda*n; /* Aj is m-by-n */
/*
 *        |  1  -1 |
 *   A0 = | -1   2 |
 *        |  0   0 |
 *   A0 is column-major
 */
    A0[0 + 0*lda] =  1.0;
    A0[1 + 0*lda] = -1.0;
    A0[2 + 0*lda] =  0.0;

    A0[0 + 1*lda] = -1.0;
    A0[1 + 1*lda] =  2.0;
    A0[2 + 1*lda] =  0.0;
/*
 *        |  3   4 |
 *   A1 = |  4   7 |
 *        |  0   0 |
 *   A1 is column-major
 */
    A1[0 + 0*lda] = 3.0;
    A1[1 + 0*lda] = 4.0;
    A1[2 + 0*lda] = 0.0;

    A1[0 + 1*lda] = 4.0;
    A1[1 + 1*lda] = 7.0;
    A1[2 + 1*lda] = 0.0;

    printf("example of gesvdjBatched \n");
    printf("m = %d, n = %d \n", m, n);
    printf("tol = %E, default value is machine zero \n", tol);
    printf("max. sweeps = %d, default value is 100\n", max_sweeps);

    printf("A0 = (matlab base-1)\n");
    printMatrix(m, n, A0, lda, "A0");
    printf("A1 = (matlab base-1)\n");
    printMatrix(m, n, A1, lda, "A1");
    printf("=====\n");
/* step 1: create cusolver handle, bind a stream */
    status = cusolverDnCreate(&cusolverH);
    assert(CUSOLVER_STATUS_SUCCESS == status);

    cudaStat1 = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    assert(cudaSuccess == cudaStat1);

    status = cusolverDnSetStream(cusolverH, stream);
    assert(CUSOLVER_STATUS_SUCCESS == status);

/* step 2: configuration of gesvdj */
    status = cusolverDnCreateGesvdjInfo(&gesvdj_params);
    assert(CUSOLVER_STATUS_SUCCESS == status);

/* default value of tolerance is machine zero */
    status = cusolverDnXgesvdjSetTolerance(
        gesvdj_params,
        tol);
    assert(CUSOLVER_STATUS_SUCCESS == status);

/* default value of max. sweeps is 100 */
    status = cusolverDnXgesvdjSetMaxSweeps(
        gesvdj_params,
        max_sweeps);
    assert(CUSOLVER_STATUS_SUCCESS == status);

/* disable sorting */
    status = cusolverDnXgesvdjSetSortEig(
        gesvdj_params,
        sort_svd);
    assert(CUSOLVER_STATUS_SUCCESS == status);

/* step 3: copy A to device */
    cudaStat1 = cudaMalloc ((void**)&d_A   , sizeof(double)*lda*n*batchSize);
    cudaStat2 = cudaMalloc ((void**)&d_U   , sizeof(double)*ldu*m*batchSize);
    cudaStat3 = cudaMalloc ((void**)&d_V   , sizeof(double)*ldv*n*batchSize);
    cudaStat4 = cudaMalloc ((void**)&d_S   , sizeof(double)*minmn*batchSize);
    cudaStat5 = cudaMalloc ((void**)&d_info, sizeof(int   )*batchSize);
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
    assert(cudaSuccess == cudaStat3);
    assert(cudaSuccess == cudaStat4);
    assert(cudaSuccess == cudaStat5);

    cudaStat1 = cudaMemcpy(d_A, A, sizeof(double)*lda*n*batchSize, cudaMemcpyHostToDevice);
    cudaStat2 = cudaDeviceSynchronize();
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
/* step 4: query working space of gesvdjBatched */
    status = cusolverDnDgesvdjBatched_bufferSize(
        cusolverH,
        jobz,
        m,
        n,
        d_A,
        lda,
        d_S,
        d_U,
        ldu,
        d_V,
        ldv,
        &lwork,
        gesvdj_params,
        batchSize
    );
    assert(CUSOLVER_STATUS_SUCCESS == status);

    cudaStat1 = cudaMalloc((void**)&d_work, sizeof(double)*lwork);
    assert(cudaSuccess == cudaStat1);

/* step 5: compute singular values of A0 and A1 */
    status = cusolverDnDgesvdjBatched(
        cusolverH,
        jobz,
        m,
        n,
        d_A,
        lda,
        d_S,
        d_U,
        ldu,
        d_V,
        ldv,
        d_work,
        lwork,
        d_info,
        gesvdj_params,
        batchSize
    );
    cudaStat1 = cudaDeviceSynchronize();
    assert(CUSOLVER_STATUS_SUCCESS == status);
    assert(cudaSuccess == cudaStat1);

    /* info is an array, so it is passed directly as the host destination */
    cudaStat1 = cudaMemcpy(U   , d_U   , sizeof(double)*ldu*m*batchSize, cudaMemcpyDeviceToHost);
    cudaStat2 = cudaMemcpy(V   , d_V   , sizeof(double)*ldv*n*batchSize, cudaMemcpyDeviceToHost);
    cudaStat3 = cudaMemcpy(S   , d_S   , sizeof(double)*minmn*batchSize, cudaMemcpyDeviceToHost);
    cudaStat4 = cudaMemcpy(info, d_info, sizeof(int) * batchSize       , cudaMemcpyDeviceToHost);
    assert(cudaSuccess == cudaStat1);
    assert(cudaSuccess == cudaStat2);
    assert(cudaSuccess == cudaStat3);
    assert(cudaSuccess == cudaStat4);
    for(int i = 0 ; i < batchSize ; i++){
        if ( 0 == info[i] ){
            printf("matrix %d: gesvdj converges \n", i);
        }else if ( 0 > info[i] ){
/* only info[0] shows if some input parameter is wrong.
 * If so, the error is CUSOLVER_STATUS_INVALID_VALUE.
 */
            printf("Error: %d-th parameter is wrong \n", -info[i] );
            exit(1);
        }else { /* info = m+1 */
/* if info[i] is not zero, Jacobi method does not converge at i-th matrix. */
            printf("WARNING: matrix %d, info = %d : gesvdj does not converge \n", i, info[i] );
        }
    }

/* Step 6: show singular values and singular vectors */
    double *S0 = S;
    double *S1 = S + minmn;
    printf("==== \n");
    for(int i = 0 ; i < minmn ; i++){
        printf("S0(%d) = %20.16E\n", i+1, S0[i]);
    }
    printf("==== \n");
    for(int i = 0 ; i < minmn ; i++){
        printf("S1(%d) = %20.16E\n", i+1, S1[i]);
    }
    printf("==== \n");

    double *U0 = U;
    double *U1 = U + ldu*m; /* Uj is m-by-m */
    printf("U0 = (matlab base-1)\n");
    printMatrix(m, m, U0, ldu, "U0");
    printf("U1 = (matlab base-1)\n");
    printMatrix(m, m, U1, ldu, "U1");

    double *V0 = V;
    double *V1 = V + ldv*n; /* Vj is n-by-n */
    printf("V0 = (matlab base-1)\n");
    printMatrix(n, n, V0, ldv, "V0");
    printf("V1 = (matlab base-1)\n");
    printMatrix(n, n, V1, ldv, "V1");
/*
 * The following two functions do not support the batched version.
 * The error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
 */
    status = cusolverDnXgesvdjGetSweeps(
        cusolverH,
        gesvdj_params,
        &executed_sweeps);
    assert(CUSOLVER_STATUS_NOT_SUPPORTED == status);

    status = cusolverDnXgesvdjGetResidual(
        cusolverH,
        gesvdj_params,
        &residual);
    assert(CUSOLVER_STATUS_NOT_SUPPORTED == status);

/* free resources */
    if (d_A    ) cudaFree(d_A);
    if (d_U    ) cudaFree(d_U);
    if (d_V    ) cudaFree(d_V);
    if (d_S    ) cudaFree(d_S);
    if (d_info ) cudaFree(d_info);
    if (d_work ) cudaFree(d_work);

    if (cusolverH) cusolverDnDestroy(cusolverH);
    if (stream      ) cudaStreamDestroy(stream);
    if (gesvdj_params) cusolverDnDestroyGesvdjInfo(gesvdj_params);

    cudaDeviceReset();
    return 0;
}
Acknowledgements
NVIDIA would like to thank the following individuals and institutions for their contributions:
- CPU LAPACK routines from netlib, LAPACK 3.5.0 (http://www.netlib.org/lapack/)
The following is the license of LAPACK (modified BSD license).
Copyright (c) 1992-2013 The University of Tennessee and The University of Tennessee Research Foundation. All rights reserved.
Copyright (c) 2000-2013 The University of California Berkeley. All rights reserved.
Copyright (c) 2006-2013 The University of Colorado Denver. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
The copyright holders provide no reassurances that the source code provided does not infringe any patent, copyright, or any other intellectual property rights of third parties. The copyright holders disclaim any liability to any recipient for claims brought against recipient by any third party for infringement of that parties intellectual property rights.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
H. Bibliography
[1] Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
[2] E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, ACM '69 Proceedings of the 1969 24th national conference, Pages 157-172.
[3] Alan George, Joseph W. H. Liu, An Implementation of a Pseudoperipheral Node Finder, ACM Transactions on Mathematical Software (TOMS) Volume 5 Issue 3, Sept. 1979 Pages 284-295.
[4] J. R. Gilbert and T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 862-874.
[5] Alan George and Esmond Ng, An Implementation of Gaussian Elimination with Partial Pivoting for Sparse Systems, SIAM J. Sci. and Stat. Comput., 6(2), 390-409.
[6] Alan George and Esmond Ng, Symbolic Factorization for Sparse Gaussian Elimination with Partial Pivoting, SIAM J. Sci. and Stat. Comput., 8(6), 877-898.
[7] John R. Gilbert, Xiaoye S. Li, Esmond G. Ng, Barry W. Peyton, Computing Row and Column Counts for Sparse QR and LU Factorization, BIT 2001, Vol. 41, No. 4, pp. 693-711.
[8] Patrick R. Amestoy, Timothy A. Davis, Iain S. Duff, An Approximate Minimum Degree Ordering Algorithm, SIAM J. Matrix Analysis Applic. Vol 17, no 4, pp. 886-905, Dec. 1996.
[9] Alan George, Joseph W. Liu, A Fast Implementation of the Minimum Degree Algorithm Using Quotient Graphs, ACM Transactions on Mathematical Software, Vol 6, No. 3, September 1980, page 337-358.
[10] Alan George, Joseph W. Liu, Computer Solution of Large Sparse Positive Definite Systems, Englewood Cliffs, New Jersey: Prentice-Hall, 1981.
Notices
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.