HPL_pdpanrlT(3)              HPL Library Functions             HPL_pdpanrlT(3)

NAME
       HPL_pdpanrlT - Right-looking panel factorization.

SYNOPSIS
       #include "hpl.h"

       void HPL_pdpanrlT( HPL_T_panel * PANEL, const int M, const int N, const
       int ICOFF, double * WORK );

DESCRIPTION
       HPL_pdpanrlT factorizes  a panel of columns  that is a sub-array of a
       larger one-dimensional panel A using the Right-looking variant of the
       usual one-dimensional algorithm.  The lower triangular N0-by-N0 upper
       block of the panel is stored in transpose form.

       Bi-directional exchange is used to perform the swap and broadcast
       operations at once for one column in the panel. This results in
       fewer, slightly larger messages than usual. On P processes and
       assuming bi-directional links, the running time of this function
       can be approximated by (when N is equal to N0):

          N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
          N0^2 * ( M - N0/3 ) * gam2-3

       where M is the local number of rows of the panel, lat and bdwth are
       the latency and bandwidth of the network for double precision real
       words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
       rate of execution. The recursive algorithm makes it possible to come
       close to Level 3 BLAS performance in the panel factorization. On a
       large number of modern machines, however, this operation is latency
       bound, meaning that its cost can be estimated by the latency portion
       alone, N0 * log_2(P) * lat. Mono-directional links will double this
       communication cost.

       Note that one iteration of the main loop is unrolled. The local
       computation of the maximum absolute value of the next column is
       performed just after its update by the current column. This brings
       the current column through cache only once at each step. The current
       implementation does not perform any blocking for this sequence of
       BLAS operations; however, the design allows for plugging in an optimal
       (machine-specific) specialized BLAS-like kernel. This idea was
       suggested to us by Fred Gustavson, IBM T.J. Watson Research Center.
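
       The fusion of the update and the pivot search can be sketched as a
       single loop. This is an illustration only, not the HPL source: the
       function name update_and_locmax and its arguments are hypothetical,
       and a plain loop stands in for the BLAS calls used by the library.

```c
#include <math.h>

/* Update the next column with the current column of multipliers and, in
 * the same pass, locate its entry of largest absolute value.
 *
 *   m    : number of local rows below the pivot row
 *   cur  : current column (multipliers), length m
 *   u    : entry of the pivot row in the next column
 *   next : next column, length m, updated in place
 *
 * Returns the local index of the largest |next[i]| after the update
 * (0 if m <= 0). Ties are resolved in favor of the smallest index. */
int update_and_locmax(int m, const double *cur, double u, double *next)
{
    int imax = 0;
    double vmax = -1.0;

    for (int i = 0; i < m; i++) {
        next[i] -= cur[i] * u;       /* rank-1 update restricted to one column */
        double a = fabs(next[i]);    /* candidate for the next pivot */
        if (a > vmax) {
            vmax = a;
            imax = i;
        }
    }
    return imax;
}
```

       Because the search reuses each entry immediately after it is
       written, no second sweep over the column is needed to find the next
       local pivot candidate.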

ARGUMENTS
       PANEL   (local input/output)    HPL_T_panel *
               On entry,  PANEL  points to the data structure containing the
               panel information.

       M       (local input)           const int
               On entry,  M specifies the local number of rows of sub(A).

       N       (local input)           const int
               On entry,  N specifies the local number of columns of sub(A).

       ICOFF   (global input)          const int
               On entry, ICOFF specifies the row and column offset of sub(A)
               in A.

       WORK    (local workspace)       double *
               On entry, WORK is a work array of size at least 2*(4+2*N0).
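
       The minimum length of WORK follows directly from the bound given
       above. The helper below is a hypothetical sketch, not an HPL routine;
       it only evaluates 2*(4+2*N0) and assumes the length is counted in
       double precision words.

```c
#include <stddef.h>

/* Minimum number of doubles required in WORK for a panel of width n0,
 * following the documented bound 2*(4+2*N0). Assumes n0 >= 0. */
size_t panel_work_len(int n0)
{
    return 2u * (4u + 2u * (size_t)n0);
}
```

       For example, a panel of width N0 = 64 requires a work array of at
       least 264 doubles.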

SEE ALSO
       HPL_dlocmax(3), HPL_dlocswpN(3), HPL_dlocswpT(3), HPL_pdmxswp(3),
       HPL_pdpancrN(3), HPL_pdpancrT(3), HPL_pdpanllN(3), HPL_pdpanllT(3),
       HPL_pdpanrlN(3).

HPL 2.1                        October 26, 2012                HPL_pdpanrlT(3)