Algorithm Cultivation, Qi Refining Chapter: Qi Refining Level 19 - CFANZ Programming Community

Layouts and Tensors

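Tensors are mathematical objects represented by a multidimensional array of numeric elements in memory. They may be two-dimensional matrices, on which classical linear algebra computations are defined, or higher-dimensional objects frequently used to structure data in deep learning applications and frameworks. Layout functions map a logical index space onto memory, while TensorRef and TensorView objects reference tensors in memory indirectly. The following terms describe a tensor: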

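size (scalar): number of elements in a tensor
capacity (scalar): number of elements needed to store the tensor in memory (may be larger than size)
rank (scalar): number of logical dimensions of the tensor
extent (vector): size of each logical dimension of the tensor
A layout is a systematic design pattern for:
mapping a logical index space to physical offsets in memory
storing the dynamic state needed for that computation
defining a type system for partial specialization of other CUTLASS components
A layout satisfies the following concept: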

struct LayoutConcept {
  /// Logical rank of tensor
  static int const kRank;

  /// Rank of stride vector
  static int const kStrideRank;

  /// Index type used for coordinates
  struct Index;

  /// Long index type used for offsets
  struct LongIndex;

  /// Logical coordinate - satisfies Coord<kRank, ..>
  struct TensorCoord;

  /// Stride object - satisfies Coord<kStrideRank, ..>
  struct Stride;

  //
  // Methods
  //

  /// Constructor
  CUTLASS_HOST_DEVICE
  LayoutConcept();

  /// Ctor
  CUTLASS_HOST_DEVICE
  LayoutConcept(Stride stride);

  /// Helper returns a layout to a tightly packed tensor
  CUTLASS_HOST_DEVICE
  static LayoutConcept packed(TensorCoord const &extent);

  /// Function call operator returns the offset of a coordinate in linear memory. 
  /// Assumes coordinate has convention (row, column)
  CUTLASS_HOST_DEVICE
  LongIndex operator()(TensorCoord const &coord) const;

  /// Inverse of layout function, mapping linear offset to logical coordinate
  CUTLASS_HOST_DEVICE
  TensorCoord inverse(LongIndex offset) const;

  /// Returns the stride of the layout
  CUTLASS_HOST_DEVICE
  Stride stride() const;

  /// Returns a mutable reference to the stride of the layout
  CUTLASS_HOST_DEVICE
  Stride & stride();

  /// Compute the number of contiguous elements needed to store a tensor with the given size
  CUTLASS_HOST_DEVICE
  LongIndex capacity(TensorCoord const &extent) const;
};
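To make the concept more concrete, here is a minimal host-only sketch of a column-major layout written in plain C++ without any CUTLASS headers. The names `SketchCoord` and `SketchColumnMajor` are hypothetical and exist only for illustration; the real `layout::ColumnMajor` uses CUTLASS's own coordinate types and device annotations, so treat this as an approximation of the pattern rather than the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a rank-2 logical coordinate (row, column).
struct SketchCoord {
  int row;
  int column;
};

// Hypothetical column-major layout that follows the shape of LayoutConcept.
class SketchColumnMajor {
public:
  static int const kRank = 2;        // logical rank: (row, column)
  static int const kStrideRank = 1;  // one stride: the leading dimension

  typedef int Index;
  typedef std::int64_t LongIndex;

  explicit SketchColumnMajor(Index ldm = 0) : stride_(ldm) {}

  // Tightly packed layout: the leading dimension equals the row count.
  static SketchColumnMajor packed(SketchCoord const &extent) {
    return SketchColumnMajor(extent.row);
  }

  // Maps a logical coordinate to an offset in linear memory.
  LongIndex operator()(SketchCoord const &coord) const {
    return LongIndex(coord.row) + LongIndex(coord.column) * stride_;
  }

  // Maps a linear offset back to a logical coordinate.
  SketchCoord inverse(LongIndex offset) const {
    assert(stride_ > 0);
    SketchCoord coord = { int(offset % stride_), int(offset / stride_) };
    return coord;
  }

  Index stride() const { return stride_; }

  // Number of contiguous elements needed to store a tensor of this extent.
  LongIndex capacity(SketchCoord const &extent) const {
    return LongIndex(extent.column) * stride_;
  }

private:
  Index stride_;
};
```

With a leading dimension of 5, the coordinate (2, 3) maps to offset 2 + 3 * 5 = 17, and `inverse(17)` recovers (2, 3).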

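_Layout_ objects generalize the leading dimension of matrices typical in _BLAS_ implementations. For example, cuBLAS assumes a Fortran-style _column-major_ matrix layout and calls the stride between consecutive columns the matrix's "leading dimension":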

cublasGemmEx(
  ...
  ptr_A,      // pointer to first element of matrix A
  lda,        // leading dimension
  ...
);

An element at coordinate (row, column) then has offset row + lda * column, which is equivalent to CUTLASS's layout::ColumnMajor:

layout::ColumnMajor layout(lda); 
int offset = layout({row, column});     // returns row  + lda * column

A row-major layout is expressed the same way:

layout::RowMajor layout(lda); 

int offset = layout({row, column});     // returns lda * row + column

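In both cases, the _logical_ coordinate (row, column) is represented by the same object. This lets an algorithm be implemented as a generic template: locations within a tensor are always specified in logical space, and _Layout_ objects map them to physical offsets in memory.
Given the extent of a tightly packed tensor, a layout's static ::packed() method constructs the corresponding layout object. This method is needed when an algorithm must define a buffer of arbitrary layout. For example: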

typename ArbitraryLayout::TensorCoord extent = make_Coord(...);
typename ArbitraryLayout::TensorCoord coord;

ArbitraryLayout layout = ArbitraryLayout::packed(extent);

int offset = layout({coord});
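The two ideas above, one templated algorithm for every layout and `packed()` supplying the layout for a scratch buffer, can be sketched in plain C++ as follows. `ColMajorSketch`, `RowMajorSketch`, and `make_row_index_buffer` are hypothetical names invented for this illustration and are not CUTLASS types; the sketch only assumes that each layout offers a `packed()` factory and a call operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major layout: ld is the distance between columns.
struct ColMajorSketch {
  int ld;
  static ColMajorSketch packed(int rows, int /*columns*/) {
    ColMajorSketch layout = {rows};
    return layout;
  }
  long operator()(int row, int column) const { return row + long(ld) * column; }
};

// Hypothetical row-major layout: ld is the distance between rows.
struct RowMajorSketch {
  int ld;
  static RowMajorSketch packed(int /*rows*/, int columns) {
    RowMajorSketch layout = {columns};
    return layout;
  }
  long operator()(int row, int column) const { return long(ld) * row + column; }
};

// Generic algorithm: builds a tightly packed rows x columns buffer in the
// requested layout and stores each element's row index at its logical
// position. Only logical (row, column) coordinates appear in the loop.
template <typename Layout>
std::vector<int> make_row_index_buffer(int rows, int columns) {
  Layout layout = Layout::packed(rows, columns);
  std::vector<int> data(static_cast<std::size_t>(rows) * columns, -1);
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < columns; ++c) {
      long offset = layout(r, c);
      assert(offset >= 0 && offset < long(data.size()));
      data[offset] = r;
    }
  }
  return data;
}
```

Calling `make_row_index_buffer<ColMajorSketch>(2, 3)` and `make_row_index_buffer<RowMajorSketch>(2, 3)` runs the same loop, yet the row indices end up interleaved in the first buffer and grouped by row in the second.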

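A layout's capacity() method computes the number of memory locations needed to represent a tensor. This is useful when allocating memory, because a padded tensor may need more storage than a fully packed one: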

int lda = columns + padding;
MatrixCoord extent{rows, columns};

layout::RowMajor layout(lda);

auto capacity = layout.capacity(extent);    // returns rows * (columns + padding)
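As a worked example of the formula rows * (columns + padding), the sketch below sizes a host buffer for a padded row-major matrix. The helpers `row_major_capacity` and `allocate_padded` are hypothetical and written in plain C++; they mirror the arithmetic described above rather than call into CUTLASS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: storage needed by a row-major matrix whose rows are
// lda elements apart in memory (lda = columns + padding).
inline std::size_t row_major_capacity(int rows, int columns, int lda) {
  assert(rows >= 0 && lda >= columns);
  return std::size_t(rows) * std::size_t(lda);
}

// Hypothetical helper: allocates a zero-initialized buffer large enough for
// the padded matrix.
inline std::vector<float> allocate_padded(int rows, int columns, int padding) {
  return std::vector<float>(row_major_capacity(rows, columns, columns + padding), 0.0f);
}
```

For a 4 x 6 matrix with 2 elements of padding per row, the capacity is 4 * (6 + 2) = 32 elements, compared with 24 for the fully packed case.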

Accessing elements within a tensor

TensorRef
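TensorRef<class T, class Layout> holds a pointer to the start of a tensor together with a layout object, and is used to access the tensor's elements. It is a convenient object to pass to functions: when the stride has many elements, it keeps the argument list from exploding.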

int4_t *ptr = ...;
int ldm = ...;

int row = ...;
int column = ...;

layout::ColumnMajor layout(ldm);
TensorRef<int4_t, layout::ColumnMajor> ref(ptr, layout);

int4_t x = ref.at({row, column});     // loads a 4-bit signed integer from the tensor

ref.at({row, column}) = x * 2_s4;     // transforms this quantity and stores it back
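The pattern of bundling a pointer with a layout can be sketched in plain C++ as follows. `ColumnMajorSketch`, `TensorRefSketch`, and `double_element` are hypothetical names for this illustration, and the example uses `int` elements because sub-byte types such as `int4_t` need special handling that this sketch does not attempt.

```cpp
#include <cassert>

// Hypothetical column-major layout with leading dimension ldm.
struct ColumnMajorSketch {
  int ldm;
  long operator()(int row, int column) const { return row + long(ldm) * column; }
};

// Hypothetical reference object: a base pointer plus a layout.
template <typename T, typename Layout>
class TensorRefSketch {
public:
  TensorRefSketch(T *ptr, Layout layout) : ptr_(ptr), layout_(layout) {}

  // Returns a reference to the element at a logical coordinate.
  T &at(int row, int column) const {
    assert(ptr_ != nullptr);
    return ptr_[layout_(row, column)];
  }

private:
  T *ptr_;
  Layout layout_;
};

// Takes a single ref argument instead of a pointer and strides separately.
template <typename T, typename Layout>
void double_element(TensorRefSketch<T, Layout> ref, int row, int column) {
  ref.at(row, column) = ref.at(row, column) * 2;
}
```

Because the reference object carries its own layout, `double_element` needs no leading-dimension parameter, and the same function body works for any layout type that provides a call operator.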

TensorView
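Matrices and tensors used in linear algebra computations are always finite. TensorView<class T, class Layout> derives from TensorRef<> and adds an extent vector that describes the logical size of the tensor or matrix.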

int4_t *ptr = ...;
int ldm = ...;
MatrixCoord extent = ...;

int row = ...;
int column = ...;

layout::ColumnMajor layout(ldm);
TensorView<int4_t, layout::ColumnMajor> view(ptr, layout, extent);

MatrixCoord coord = {row, column};

if (view.contains(coord)) {     // verify coordinate is in bounds before performing access
  int4_t x = view.at(coord);
  view.at({row, column}) = x * 2_s4;
}

A TensorView<> can be constructed from a TensorRef<>:

layout::ColumnMajor layout(ldm);
TensorRef<int4_t, layout::ColumnMajor> ref(ptr, layout);
TensorView<int4_t, layout::ColumnMajor> view(ref, extent);    // construct TensorView from TensorRef and extent
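A bounds-checked view of this kind can be sketched in plain C++ by attaching an extent to a pointer-and-layout pair. `ColumnMajorSketch`, `TensorViewSketch`, and `try_double` are hypothetical names for this illustration, not CUTLASS types, and the sketch fixes the layout to column-major to stay short.

```cpp
#include <cassert>

// Hypothetical column-major layout with leading dimension ldm.
struct ColumnMajorSketch {
  int ldm;
  long operator()(int row, int column) const { return row + long(ldm) * column; }
};

// Hypothetical view: pointer, layout, and the logical extent of the matrix.
template <typename T>
class TensorViewSketch {
public:
  TensorViewSketch(T *ptr, ColumnMajorSketch layout, int rows, int columns)
      : ptr_(ptr), layout_(layout), rows_(rows), columns_(columns) {}

  // True when the coordinate lies inside the logical extent.
  bool contains(int row, int column) const {
    return row >= 0 && row < rows_ && column >= 0 && column < columns_;
  }

  // Returns a reference to an in-bounds element.
  T &at(int row, int column) const {
    assert(contains(row, column));
    return ptr_[layout_(row, column)];
  }

private:
  T *ptr_;
  ColumnMajorSketch layout_;
  int rows_;
  int columns_;
};

// Doubles an element only if the coordinate is in bounds; reports success.
template <typename T>
bool try_double(TensorViewSketch<T> view, int row, int column) {
  if (!view.contains(row, column)) {
    return false;
  }
  view.at(row, column) = view.at(row, column) * 2;
  return true;
}
```

Note that the extent and the leading dimension are independent: a 3 x 2 view with a leading dimension of 4 rejects row 3 even though offset 3 exists in the underlying buffer as padding.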

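Computations avoid becoming overdetermined by accepting a single problem-size component plus a TensorRef object for each operand, whose extent is implied as a precondition of the operation. By not storing extent quantities redundantly, CUTLASS minimizes the use of scarce resources such as constant memory, which is consistent with BLAS conventions.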

Available layout types

- `PitchLinear`: data layout defined by _contiguous_ and _strided_ dimensions. _contiguous_ refers to consecutive elements in memory, whereas _strided_ refers to data separated by a uniform stride.
-- Rank: 2
-- TensorCoord type: `PitchLinearCoord`
-- Shape type: `PitchLinearShape`
-- Stride rank: 1

- `ColumnMajor`: data layout defined by _rows_ and _columns_ dimensions. Can be mapped to `PitchLinear` by: (_contiguous_ = _rows_, _strided_ = _columns_)
-- Rank: 2
-- TensorCoord type: `MatrixCoord`
-- Shape type: `MatrixShape`
-- Stride rank: 1

- `RowMajor`: data layout defined by _rows_ and _columns_ dimensions. Can be mapped to `PitchLinear` by: (_contiguous_ = _columns_, _strided_ = _rows_)
-- Rank: 2
-- TensorCoord type: `MatrixCoord`
-- Shape type: `MatrixShape`
-- Stride rank: 1

- `ColumnMajorInterleaved<k>`: data layout defined by _rows_ and _columns_ dimensions. Data is packed into a 'column-major' arrangement of row vectors of fixed length.
-- Rank: 2
-- TensorCoord type: `MatrixCoord`
-- Shape type: `MatrixShape`
-- Stride rank: 1

- `RowMajorInterleaved<k>`: data layout defined by _rows_ and _columns_ dimensions. Data is packed into a 'row-major' arrangement of column vectors of fixed length.
-- Rank: 2
-- TensorCoord type: `MatrixCoord`
-- Shape type: `MatrixShape`
-- Stride rank: 1
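The mappings from `ColumnMajor` and `RowMajor` onto `PitchLinear` listed above can be sketched in plain C++ as follows. `PitchLinearSketchCoord`, `pitch_linear_offset`, `from_column_major`, and `from_row_major` are hypothetical names for this illustration; the sketch assumes a pitch-linear offset of contiguous + strided * stride, which matches the single-stride description in the list.

```cpp
#include <cassert>

// Hypothetical pitch-linear coordinate.
struct PitchLinearSketchCoord {
  int contiguous;  // index along the dimension that is consecutive in memory
  int strided;     // index along the dimension separated by a uniform stride
};

// Offset of a pitch-linear coordinate for a given stride (in elements).
inline long pitch_linear_offset(PitchLinearSketchCoord coord, int stride) {
  assert(stride > 0);
  return coord.contiguous + long(coord.strided) * stride;
}

// ColumnMajor: contiguous = row, strided = column.
inline PitchLinearSketchCoord from_column_major(int row, int column) {
  PitchLinearSketchCoord coord = {row, column};
  return coord;
}

// RowMajor: contiguous = column, strided = row.
inline PitchLinearSketchCoord from_row_major(int row, int column) {
  PitchLinearSketchCoord coord = {column, row};
  return coord;
}
```

With a stride of 5, the matrix coordinate (2, 3) lands at offset 2 + 3 * 5 = 17 under the column-major mapping and at 3 + 2 * 5 = 13 under the row-major mapping, which agrees with the row + lda * column and lda * row + column formulas given earlier.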

Tensor layouts:
- `TensorNHWC`:

Permuted Shared Memory Layouts:
- `TensorOpCongruous<ElementSize>`
- `TensorOpCrosswise<ElementSize>`