Medical image segmentation on GPUs – A comprehensive review
Segmentation of anatomical structures, from modalities like computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound, is a key enabling technology for medical applications such as diagnostics, planning and guidance. More efficient implementations are necessary, as most segmentation methods are computationally expensive, and the amount of medical imaging data is growing. The increased programmability of graphic processing units (GPUs) in recent years has enabled their use in several areas. GPUs can solve large data parallel problems at a higher speed than the traditional CPU, while being more affordable and energy efficient than distributed systems. Furthermore, using a GPU enables concurrent visualization and interactive segmentation, where the user can help the algorithm to achieve a satisfactory result. This review investigates the use of GPUs to accelerate medical image segmentation methods. A set of criteria for efficient use of GPUs is defined and each segmentation method is rated accordingly. In addition, references to relevant GPU implementations and insight into GPU optimization are provided and discussed. The review concludes that most segmentation methods may benefit from GPU processing due to the methods' data parallel structure and high thread count. However, factors such as synchronization, branch divergence and memory usage can limit the speedup. 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license.
1. Introduction
Image segmentation, also called labeling, is the process of dividing the individual elements of an image or volume into a set of groups, so that all elements in a group have a common property. In the medical domain, this common property is usually that elements belong to the same tissue type or organ. Segmentation of anatomical structures is a key enabling technology for medical applications such as diagnostics, planning and guidance. Medical images contain a lot of information, and often only one or two structures are of interest. Segmentation allows visualization of the structures of interest, removing unnecessary information. Segmentation also enables structure analysis such as calculating the volume of a tumor, and performing feature-based image-to-patient as well as image-to-image registration, which is an important part of image guided surgery. Fig. 1 illustrates segmentation of a volume containing blood vessels. The segmentation result, or label volume, is used to create a surface model of the blood vessels using the marching cubes algorithm (Lorensen and Cline, 1987).
Many segmentation methods are computationally expensive,
especially when run on large medical datasets. Segmentation of image data, acquired just before the operation as well as during the operation, has to be fast and accurate in order to be useful in
a clinical setting. Furthermore, the amount of data available for any given patient is steadily increasing (Scholl et al., 2010), making fast segmentation algorithms even more important.
Graphic processing units (GPUs) were originally created for rendering graphics. However, in the last ten years, GPUs have become popular for general-purpose high performance computation,
including medical image processing. This is most likely due to the increased programmability of these devices, combined with low cost and high performance.
Shi et al. (2012) recently presented a survey on GPU-based medical image computing techniques such as segmentation, registration and visualization. The authors provided several examples of the use of GPUs in these areas. However, only a few segmentation methods are mentioned, and few details on how different segmentation methods can benefit from GPU computing are provided. Pratx and Xing (2011) provided a review on GPU computing in medical physics, with a focus on the applications image reconstruction, dose calculation and treatment plan optimization, and image processing. A more extensive survey on medical image processing on GPUs was presented by Eklund et al. (2013). They investigated
GPU computing in several medical image processing areas such as
image registration, segmentation, denoising, filtering, interpolation
and reconstruction.
This review will focus exclusively on medical image segmentation, and thus provide more references and details as well as a comprehensive comparison of the different segmentation algorithms. The goals of this review are to:
1. Give the necessary background information regarding GPU computing, provide a framework for rating how suitable an algorithm is for GPU acceleration, and explain how segmentation methods can be optimized for GPUs (Section 2).
2. Explain and rate the most common segmentation methods using this framework, and provide a survey of how others have accelerated these segmentation methods using GPUs (Section 3).
2. GPU computing
This section explains the basics of GPUs, and their potential and limitations related to medical image segmentation. An overview of GPU computing, including examples of applications, can be found in Owens et al. (2008). This section may be skipped by readers with a good understanding of GPU computing.
Modern GPUs used for general-purpose computations have a highly data parallel architecture. They are composed of a number of cores, each of which has a number of functional units, such as arithmetic logic units (ALUs). One or more of these functional units are used to process each thread of execution, and these groups of functional units are called thread processors throughout this review. All thread processors in a core of a GPU perform the same instructions, as they share a control unit. This means that GPUs can perform the same instruction on each pixel of an image in parallel. The terminology used in the GPU domain is diverse, and the architecture of a GPU is complex and differs from one model and manufacturer to another. For instance, the two GPU manufacturers NVIDIA and AMD refer to the thread processors as CUDA cores and stream processors, respectively. Furthermore, the thread processors are called CUDA cores in the CUDA programming language and processing elements in OpenCL (Open Computing Language). Because of this diversity, an overview of the terminology used in this review and by OpenCL, AMD and NVIDIA/CUDA is collected in Table 1.
Thread processors are sometimes referred to as cores, giving the false impression that these cores are similar to the cores of a CPU. The main difference between a thread processor and a CPU core is that each CPU core can perform different instructions on different data in parallel. This is because each CPU core has a separate control unit. McCool (2008) defined a core as a processing element with an independent flow of control. Following these definitions, this review will refer to the group of thread processors that share
a control unit, as cores. GPUs are generally constructed to fit many thread processors on a chip, while CPUs are designed with advanced control units and large caches. At the time of writing,
high-end GPUs have several thousand thread processors and around 20 to 40 cores (Advanced Micro Devices, 2012). On the other hand, modern CPUs have around 4 to 12 cores. Fig. 2 shows
the general layout of a GPU and its memory hierarchy.
The first adopters of GPUs for general-purpose computing had to use frameworks and languages originally designed for graphics, such as OpenGL Shading Language (GLSL) and C for graphics (Cg).
As the popularity of GPU programming increased, general-purpose GPU (GPGPU) frameworks such as CUDA and OpenCL were introduced. As opposed to graphic frameworks, these do not require knowledge about the graphics pipeline, and are therefore better suited for general-purpose programming. OpenCL is an open standard for parallel programming on different devices, including GPUs, CPUs and field-programmable gate arrays (FPGAs). OpenCL
is supported by many processor manufacturers including AMD, NVIDIA and Intel, while CUDA can only be used with GPUs from NVIDIA.
Image processing libraries that provide GPU implementations of several low-level image processing algorithms are emerging. However, most libraries still lack high-level algorithms such as segmentation methods. Two of the largest image processing libraries, OpenCV and the Insight Toolkit (ITK), both provide a GPU module with support for basic image processing algorithms. A difference between the two toolkits is that OpenCV supports both CUDA and OpenCL, while ITK only supports OpenCL. Accelerated segmentation methods are so far limited to threshold-based segmentation in these libraries. Other GPU-based image processing libraries include NVIDIA Performance Primitives (NPP), ArrayFire, Intel Integrated Performance Primitives (IPP), CUVILIB and OpenCL Integrated Performance Primitives (OpenCLIPP). At the time of writing, these libraries mainly provide GPU accelerated low-level image processing routines.
Several aspects define the suitability of an algorithm towards a GPU implementation. In this review, five key factors have been identified: Data parallelism, thread count, branch divergence, memory usage and synchronization. The following sections will discuss each of these factors, and explain why they are important for an efficient GPU implementation. Furthermore, several levels
are defined for each factor (e.g. low, medium, high and none/dynamic), thereby creating a framework for rating to what extent an algorithm can benefit from GPU acceleration.
2.1. Data parallelism
An algorithm that can perform the same instructions on multiple data elements in parallel is said to be data parallel, and the set of instructions to be executed for each element is called a kernel. Task parallelism on the other hand, is a less restrictive type of
Fig. 2. General layout of a GPU and its memory hierarchy. The registers are private to each thread processor, the shared memory is private to each core, and the global, constant and texture memory is accessible from all thread processors. Note that the actual layout is much more complex and differs for each GPU.
parallelism in which algorithms execute different instructions in parallel. As previously discussed, an important characteristic of GPUs is the highly data parallel architecture. Hence, an algorithm
has to be data parallel in order to benefit from execution on a GPU. In comparison, task parallel algorithms are more suited for multi-core CPUs.
The degree of speedup achieved by parallelization is limited by the sequential fraction of the algorithm. According to Amdahl’s law (Amdahl, 1967), the maximum theoretical speedup of a program where 95% is executed in parallel is a factor of 20, regardless of the number of cores or thread processors being used. The reason for this is that the processing time for the serial part of the code will remain constant. However, in practice the speedup measured and reported in the literature is often much higher than the theoretical limit. There are many reasons for this, one is that the serial version of the program is not fully optimized. Another reason is that the parallel version of the program may use the memory cache more efficiently.
Lee et al. (2010) discussed how to make a fair comparison between a CPU and GPU program. Throughout this review the degree of parallelism in a segmentation method is rated as follows:
High: Almost the entire method is data parallel (75–100%).
Medium: More than half of the method is data parallel (50–75%).
Low: None or up to half of the method is data parallel (0–50%).
2.7. GPU optimization
This section provides some insight on how segmentation methods can be optimized for GPUs.
2.7.1. Grouping
As mentioned in the previous section, threads are scheduled and executed atomically on the GPUs in groups (AUE). GPUs also provide grouping at a higher level, enforced in software and not
in hardware like AUEs. These are called thread blocks in CUDA, and are referred to as work-groups in OpenCL. One benefit of these higher level work-groups is that they are able to access the same shared memory, and thus synchronize among themselves. The size of these work-groups can impact performance, and should be set properly according to guidelines provided by the GPU manufacturers (see Advanced Micro Devices, 2012; NVIDIA, 2013a).
2.7.2. Texture, constant and shared memory
In addition to global memory, GPUs often have three other memory types, which can be used to speed up memory access. These memory types are called texture, constant and shared (also called local) memory. They are cached in different ways on the GPU; however, the size of these caches on the GPU is small compared to that of the CPU. Fig. 2 shows how this memory hierarchy is typically organized on a GPU.
The GPU has a specialized memory system for images, called the texture system. The texture system specializes in fetching and caching data from 2D and 3D textures (NVIDIA, 2010; Advanced Micro Devices, 2012). It also has a fetch unit which can perform interpolation and data type conversion in hardware. Using the texture system to store images and volumes can improve performance. Most GPU texture systems support normalized 8 and 16-bit integers. With this format, the data is stored as 8 or 16-bit integers in textures. However, when requested, the texture fetch unit converts the integers to 32-bit floating point numbers with a normalized range. This decreases the memory usage, but also reduces accuracy, and may not be sufficient for all applications.
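As a hedged illustration of this texture system (not taken from any of the cited implementations), the following CUDA sketch stores an 8-bit image in a CUDA array and creates a texture object with cudaReadModeNormalizedFloat and linear filtering, so that the fetch unit returns hardware-interpolated floating point values in the normalized range [0, 1]. The image layout, names and missing error handling are simplifying assumptions.

```cuda
#include <cuda_runtime.h>

// Reads the 8-bit image through the texture object; the fetch unit performs the
// uchar-to-float conversion and the bilinear interpolation in hardware.
__global__ void sampleKernel(cudaTextureObject_t tex, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // +0.5f addresses the texel center (non-normalized coordinates).
    out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

cudaTextureObject_t createByteTexture(const unsigned char* hostImage, int w, int h) {
    // Copy the 8-bit image into a CUDA array, the texture system's native storage.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t array;
    cudaMallocArray(&array, &desc, w, h);
    cudaMemcpy2DToArray(array, 0, 0, hostImage, w * sizeof(unsigned char),
                        w * sizeof(unsigned char), h, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModeLinear;       // interpolation in hardware
    texDesc.readMode = cudaReadModeNormalizedFloat;  // stored as uchar, fetched as float

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}
```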
The constant memory is a cached read-only area of the global off-chip memory. This memory is useful for storing data that remains unchanged. However, the benefit of caching is only achieved when threads in an AUE read the same data elements (Advanced Micro Devices, 2012). On AMD and NVIDIA GPUs the constant cache is smaller than the cache used by the texture system (L1) (Advanced Micro Devices, 2012; NVIDIA, 2013a).
The shared memory is a user-controlled cache, also called a scratchpad or local memory. This memory is shared amongst all threads in a group and is local to each core (compute unit) of the GPU.
Generally, the GPU memory that is fastest to access is the registers, followed by shared memory, L1 cache, L2 cache, constant cache, global memory and finally host memory (via PCI-express) (Advanced Micro Devices, 2012). The number of registers per core is limited, and exceeding this limit causes register spill, which will reduce performance. To give an impression of the typical size of these memory spaces, the AMD Radeon HD7970 has a 128 kB constant cache for the entire GPU, and 64 kB of shared memory and 256 kB of registers for each core (Advanced Micro Devices, 2012).
Using as few bits as possible can also speed up processing considerably. Using 8 and 16-bit integers when the range is sufficient, instead of the default 32-bit, not only reduces the memory needed, but also the memory access latency.
Table 2
Comparison of how well the segmentation methods are suited for GPU computation. See Section 2 for details on how each method is rated for each criterion. The ratings are based on the most common parallel implementations, parameters and input.
2.7.3. Stream compaction
Some applications may only require a part of the dataset to be processed. This will lead to a branch in the kernel, where one execution path does processing while another does nothing. If threads in the same AUE follow both execution paths, a divergent branch
occurs and no time is saved. In these cases, it may be more efficient to remove the unnecessary elements in advance, thus removing the divergent branch. This is called stream compaction, and two such methods are parallel prefix sum (see Billeter et al. (2009) for an overview) and histogram pyramids by Ziegler et al. (2006).
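As a sketch of how such stream compaction can be applied (an illustration only, not one of the cited methods), the following CUDA/Thrust snippet compacts the indices of active voxels so that a later kernel can be launched with one thread per active element. Thrust's copy_if performs the underlying parallel prefix sum; the 0/1 mask layout and the names are assumptions.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>

// Predicate: keep only the indices of voxels marked as active (mask != 0).
struct IsActive {
    const unsigned char* mask;
    __host__ __device__ bool operator()(int idx) const { return mask[idx] != 0; }
};

// Returns a compacted list of active voxel indices, removing the divergent branch
// from the processing kernel that would otherwise skip inactive voxels.
thrust::device_vector<int> compactActiveVoxels(const thrust::device_vector<unsigned char>& mask) {
    int n = static_cast<int>(mask.size());
    thrust::device_vector<int> indices(n);
    thrust::sequence(indices.begin(), indices.end());   // 0, 1, 2, ..., n-1

    thrust::device_vector<int> active(n);
    IsActive pred{thrust::raw_pointer_cast(mask.data())};
    auto end = thrust::copy_if(indices.begin(), indices.end(), active.begin(), pred);
    active.resize(end - active.begin());                 // keep only the surviving indices
    return active;
}
```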
3. Segmentation methods
In this section, several commonly used image segmentation methods are presented and discussed in terms of GPU computing. All of these segmentation methods can be used on both 2D and 3D images, and the terms pixel and voxel are used interchangeably throughout the review.
3.1. Thresholding
Thresholding segments each voxel based on its intensity using one or more thresholds, as shown in Fig. 1. In its simplest form, the method performs a binary segmentation using a single threshold T:
$$S(\vec{x}) = \begin{cases} 1 & \text{if } I(\vec{x}) \geq T \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where $T$ is the threshold, $I(\vec{x})$ is the intensity of the volume at position $\vec{x}$ and $S(\vec{x})$ is the resulting label or class of the voxel at position $\vec{x}$. As seen in this equation, the method is completely data parallel, since each voxel can be classified independently of all others, and has no need for synchronization. The number of threads needed is equal to the total number of pixels or voxels. While the method contains a divergent branch (a branch where both paths are executed for some AUEs), its simplicity enables the branch to be reduced to a single instruction. The memory usage of the method is low, as only storage for the actual segmentation result is needed, which has the same size as the input image. No references on GPU implementations of this segmentation method are provided, as it is trivial to implement on the GPU. An example of a threshold kernel is provided in Algorithm 1. This example uses a single threshold T and a 2D thread ID.
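Since the listing of Algorithm 1 is not reproduced here, the following is a minimal CUDA sketch of such a threshold kernel, assuming a floating point input image in global memory, a single threshold T and a 2D thread ID as described above.

```cuda
// Binary thresholding: one thread per pixel, using a 2D thread ID.
// S(x) = 1 if I(x) >= T, otherwise 0.
__global__ void thresholdKernel(const float* image, unsigned char* result,
                                int width, int height, float T) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard against excess threads

    int idx = y * width + x;
    // The divergent branch reduces to a single compare/select instruction.
    result[idx] = (image[idx] >= T) ? 1 : 0;
}
```

A typical launch would use a 2D grid covering the image, e.g. dim3 block(16, 16) and dim3 grid((width + 15) / 16, (height + 15) / 16).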
It is important to note that this kernel is memory bound, because it performs one read and one write operation to global memory, which is slower than the comparison operation. The performance may be increased by minimizing the number of global memory accesses. This can be achieved by reading several pixels per thread in each read operation, while at the same time increasing the number of compute operations per memory operation.
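One way to realize this, assuming the number of pixels is a multiple of four, is to use CUDA's built-in vector types so that each thread loads four pixels with a single 128-bit transaction. This is only a sketch of the idea discussed above:

```cuda
// Each thread reads four pixels with one vectorized load (float4) and writes four
// results with one store, reducing the number of global memory transactions.
__global__ void thresholdKernel4(const float4* image, uchar4* result,
                                 int numPixels4, float T) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels4) return;             // numPixels4 = total number of pixels / 4

    float4 p = image[i];                     // one 128-bit read
    uchar4 s;
    s.x = (p.x >= T) ? 1 : 0;
    s.y = (p.y >= T) ? 1 : 0;
    s.z = (p.z >= T) ? 1 : 0;
    s.w = (p.w >= T) ? 1 : 0;
    result[i] = s;                           // one 32-bit write
}
```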
3.2. Region growing
Seeded region growing (Adams and Bischof, 1994) is another commonly used segmentation method. This method starts with a set of seed pixels known to be inside the object of interest. The seeds are either set manually using a graphical user interface or automatically using a priori knowledge. From these seeds, regions containing the object of interest will expand to the neighboring
pixels if they satisfy one or more predefined criteria. These criteria
compare the current pixel to the seed or the pixels already included, using attributes such as intensity, gradient or color. The region will continue to expand as long as there exist neighboring
pixels that satisfy the criteria. This method is similar to breadth first search and flood fill algorithms. Region growing is especially useful when the background and the region of interest have overlapping pixel intensities, and are separated spatially by some wall or region. One example is thorax CT, where the voxels of the airways and the parenchyma both have low intensities, and are separated by blood filled tissue with high intensity.
Region growing is a data parallel method, as all pixels along the border of the evolving segmentation region are checked using the same instructions. However, as the border expands, the number of threads changes. This is problematic because changing the number of threads typically involves restarting the kernel, and this requires reading all the values from global memory again. Nevertheless, the method can be executed on the GPU by having one thread for each pixel in the entire image in each iteration. Fig. 3 depicts how the data parallel version of region growing works when double buffering is used. This involves adding more work and introduces branch divergence, limiting the potential speedup over an optimized serial implementation. Furthermore, as this is an iterative method, global synchronization is needed, which also limits the speedup. The memory usage is low (2N), as only the input data and the segmentation result are needed.
An example of a region growing implementation is shown in
Algorithm 2. This is based on the parallel breadth first search algorithm by Harish and Narayanan (2007). Segmented voxels are marked with 1, queued voxels with 2 and all other voxels with 0, in a result data structure S which has the same size as the input image. The function $C(\vec{x})$ checks the growing criteria for voxel $\vec{x}$. In this algorithm,
texture memory can be used to speed up the global memory
access. However, this requires double buffering which increases
the memory usage. Shared memory may also be used by first reading global data to shared memory, then grow the region in the area
covered by the shared memory and finally write the result back to
global memory
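To make this concrete, the following is a rough CUDA sketch of one iteration of such a data parallel region growing kernel. It is not the exact Algorithm 2: it updates S in place instead of using the double buffering shown in Fig. 3, and it uses a simple intensity-based criterion C as a placeholder. The host is assumed to re-launch the kernel until the changed flag remains zero.

```cuda
#include <math.h>

// One iteration of parallel region growing with one thread per pixel.
// S: 0 = not visited, 1 = segmented, 2 = queued.
__global__ void growRegionKernel(const float* image, unsigned char* S,
                                 int width, int height,
                                 float seedValue, float tolerance, int* changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * width + x;

    if (S[idx] != 2) return;                 // only queued pixels do work (branch divergence)
    S[idx] = 1;                              // accept the queued pixel

    // Visit the 4-connected neighbors and queue those that satisfy the criterion C.
    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = {0, 0, -1, 1};
    for (int i = 0; i < 4; i++) {
        int nx = x + dx[i], ny = y + dy[i];
        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
        int nidx = ny * width + nx;
        // Example criterion C(x): intensity close to the seed intensity.
        if (S[nidx] == 0 && fabsf(image[nidx] - seedValue) <= tolerance) {
            S[nidx] = 2;                     // benign race: several threads may queue the same pixel
            *changed = 1;                    // host keeps iterating while changed != 0
        }
    }
}
```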
Schenke et al. (2005) implemented seeded region growing on the GPU using GLSL, but provided little description on the implementation. Pan et al. (2008) presented an implementation using CUDA and suggested increasing the number of seeds to make full use of the GPU. Sherbondy et al. (2003) presented a different type
Fig. 3. Illustration of parallel region growing with double buffering. The pixel labeled S is the seed pixel. The numbers indicate at which iteration the pixels in the red regions are added to the final segmentation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
of seeded region growing implemented on the GPU with GLSL,
which uses diffusion to evolve the segmentation. To reduce unnecessary computations due to branch divergence, their implementation uses a computational mask of active voxels which is updated in each iteration. Chen et al. (2006) presented an implementation of interactive region growing on a GPU. In this implementation, the user marks a region of interest in 2D, which is extruded to 3D. This region of interest is used to create a computational mask that constrains the segmentation. Their implementation also uses GLSL and they reported real-time speeds for medical 3D datasets.
3.3. Morphology
Morphological image processing is often combined with other segmentation algorithms such as thresholding, and is therefore included in this review. Examples of morphological techniques include filling holes and finding the centerline of segmented tubular structures. See Serra (1986) for a detailed introduction to mathematical morphology in computer vision. Morphological techniques examine each pixel using a mask called the structuring element. The value of each pixel is determined by the neighboring pixels that fall within the structuring element. The simplest morphological operations are dilation and erosion. For binary images, dilation adds all pixels covered by the structuring element to the segmentation if the pixel under the center of the structuring element is part of the segmentation, as shown in Fig. 4.
3.4. Watershed
The concept of watershed segmentation (Vincent and Soille, 1991) is based on viewing an image as a three dimensional object, where the third dimension is the height of each pixel. This height is
determined by the intensity value of the pixel, as shown in Fig. 5. In the resulting landscape, there are three types of points. These are determined by the analogy of how a drop of water falling on that specific point would move according to the topographic layout of the landscape:
1. Points that are local minima and where a drop of water would stay in this point.
2. Points at which a drop of water would move downwards into one specific local minimum.
3. Points at which a drop of water would move downwards into more than one local minimum.
The points belonging to type 2 are often called watersheds or catchment basins, and the points belonging to type 3 are often called divide lines or watershed lines.
Fig. 4. Morphological dilation using a 3 × 3 square structuring element (shown in red to the left). Since the center pixel is 1, all 0 valued pixels under the structuring element are flipped to 1. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The main idea of segmentation algorithms based on these concepts is to find the watershed lines. To find them, another analogy from this topographic landscape is used. Suppose that holes are created in all the points that are local minima, and that water flow through these holes. The watersheds in the topographic landscape will then be flooded at a constant rate. When two watersheds are about to merge, a dam is built between them. The height of the dam is increased at the same rate as the water level rises. This process is continued until the water reaches the highest point in the landscape, corresponding to the pixel with maximum intensity. The dams then correspond to the watershed lines
For a review of different implementations of watershed segmentation the reader is referred to Roerdink and Meijster (2001). They also investigated parallel implementations of the method, and concluded that parallelization is hard, because of its sequential nature. A parallel implementation is possible by transforming the landscape into a graph, subdividing the image, or flooding each local minimum in parallel. However, Roerdink and Meijster concluded that all of these methods lead to modest speedups. Performing watershed segmentation in a data parallel manner entails adding more work and branch divergence. Thus the speedup over an optimized serial implementation will not be high. This is evident in the literature, where speedups of only 2–7 times are reported.
Kauffmann and Piche (2008) presented a GPU implementation of watershed segmentation using the cellular automaton approach described in Algorithm 3. This method calculates the shortest path from each local minimum to all pixels using the Ford-Bellman algorithm. By creating a cost function where the cost of climbing in the landscape is infinite, the shortest path will always lead downwards. Pixels are then assigned the same segmentation label as their closest minimum. Using this approach, all the pixels in the image may be processed in parallel using the same instructions. The number of iterations needed to reach convergence depends on the longest path, and the branch divergence is high. The memory usage is 4N because of double buffering, and because the distance has to be stored for each pixel. Kauffmann and Piche reported a speedup of 2.5 times, and presented results for 3D images as well.
Algorithm 3. Parallel watershed segmentation using a cellular automaton (Kauffmann and Piche, 2008)
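The listing of Algorithm 3 is not reproduced here; the CUDA kernel below sketches one relaxation step of such a cellular automaton. The cost function (downhill moves cost the intensity drop, climbing costs infinity) is an illustrative assumption rather than the exact function of Kauffmann and Piche. dist is assumed to be initialized to zero at the local minima (each with a unique label) and to FLT_MAX elsewhere, and the kernel is re-launched until the changed flag stays zero.

```cuda
#include <float.h>

// One Ford-Bellman style relaxation step: each pixel checks whether it can reach a
// minimum more cheaply through one of its neighbors, and if so adopts that neighbor's
// label. Updates are done in place here; double buffering (the 4N memory usage
// mentioned above) avoids read/write races between neighboring threads.
__global__ void watershedStep(const float* image, float* dist, int* label,
                              int width, int height, int* changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * width + x;

    float bestDist = dist[idx];
    int bestLabel = label[idx];

    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = {0, 0, -1, 1};
    for (int i = 0; i < 4; i++) {
        int nx = x + dx[i], ny = y + dy[i];
        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
        int nidx = ny * width + nx;
        // Water may only flow downhill or stay level; climbing has infinite cost.
        float cost = (image[idx] >= image[nidx]) ? (image[idx] - image[nidx]) : FLT_MAX;
        if (cost < FLT_MAX && dist[nidx] + cost < bestDist) {
            bestDist = dist[nidx] + cost;
            bestLabel = label[nidx];
        }
    }
    if (bestDist < dist[idx]) {
        dist[idx] = bestDist;
        label[idx] = bestLabel;
        *changed = 1;
    }
}
```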
Fig. 5. Watershed segmentation. If the intensity values of the pixels of the image on the left are interpreted as heights, it creates the landscape on the right.
Pan et al. (2008) presented a CUDA implementation, using a multi-level watershed method. However, few implementation details and results were included. Vitor et al. (2009) created one GPU and one hybrid CPU-GPU implementation. They concluded that the hybrid approach was up to two times faster. Their method initially finds the lowest point from each pixel using a steepest descent traversal. The plateau pixels are then processed to find the nearest border. Finally, the pixels are labeled using a flood fill algorithm from each minimum, similar to seeded region growing. Körbes et al. (2009, 2011) presented an implementation based on the work of Vitor et al. (2009). They also compared performance to the cellular automaton approach by Kauffmann and Piche (2008), and concluded that their implementation was about six times faster than a sequential version. This parallel method also processes each pixel iteratively and suffers from branch divergence. Wagner et al. (2010) processed each intensity level in order, starting with the lowest intensity. The labels were merged in each iteration. Their implementation used CUDA, and was 5–7 times faster than a serial implementation on 3D images.
3.5. Active contours
Active contours, also known as snakes, were introduced by Kass et al. (1988). These contours move in an image while trying to minimize their energy, as shown in Fig. 6. They are defined parametrically as $v(s) = [x(s), y(s)]$, where $x(s)$ and $y(s)$ are the coordinates for part $s$ of the contour. The energy $E$ of the contour is composed of an internal energy $E_{int}$ and an external energy $E_{ext}$:

$$E = \int_0^1 E_{int}(v, s) + E_{ext}(v(s)) \, ds \quad (2)$$

The internal energy depends on the shape of the contour and can, for example, be defined as:
Fig. 6. Illustration of active contours. The image to the left is the input image, and the image to the right shows the gradient magnitude of the input image convolved with a Gaussian kernel. The red line superimposed on the right image is the active contour, which is driven towards the high gradient parts of that image, corresponding to the edges in the original image. The green line superimposed on both images show the contour of the lumen. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
$$E_{int}(v, s) = \frac{1}{2}\left(\alpha \, |v'(s)|^2 + \beta \, |v''(s)|^2\right) \quad (3)$$
where $\alpha$ and $\beta$ are parameters that control the tension and rigidity of the contour. The contour can be driven towards interesting features in the image by having an external energy with low values at the interesting features and high values elsewhere. There are several different choices of external energy. A popular choice is the negative magnitude of the image gradient, i.e. $E_{ext}(\vec{x}) = -|\nabla[G_\sigma * I(\vec{x})]|^2$, where $G_\sigma *$ denotes convolution with a Gaussian lowpass filter. This choice of energy drives the contour towards the edges in the image, as depicted in Fig. 6. The convolution and gradient calculation can be executed in parallel for each pixel, and optimized using texture or shared memory. A study on how to optimize image convolution for GPUs can be found in the technical report by Podlozhnyuk et al. (2007). Active contours can be divided into two processing steps. The first is calculating the external energy, and the second is evolving the contour. Both are data parallel operations. The number of threads for calculating the external energy is generally the same as the number of pixels, while the thread count for evolving the contour is lower.
A numerical solution that finds a contour minimizing the energy $E$ can be found by making the contour dynamic over time, $v(s, t)$:

$$\alpha v''(s, t) - \beta v^{(4)}(s, t) - \nabla E_{ext} = 0 \quad (4)$$

The Euler Eq. (4) can be solved on the GPU, as done by He and Kuester (2006) and Zheng and Zhang (2012). The thread count is equal to the number of sample points on the contour, which is much lower than the number of pixels in the image. Eidheim et al. (2005) concluded that evolving the active contour on the CPU was faster, as long as the number of points on the contour was below approximately 500. To evolve the contour, each point $s$ has to be extracted from the image using interpolation. Thus, active contours may benefit from using the texture memory, which can perform interpolation efficiently.
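As an illustration of how Eq. (4) can be stepped explicitly on the GPU (a sketch under simplifying assumptions, not one of the cited implementations), the CUDA kernel below updates each sample point of a closed contour with one thread, using finite differences along the contour and an external force field $\nabla E_{ext}$ stored in a float2 texture. The time step, the texture setup and the double buffering of the point arrays are assumptions.

```cuda
// One explicit Euler step of Eq. (4): v += dt * (alpha*v'' - beta*v'''' - gradE_ext).
__global__ void evolveContour(float2* points, const float2* oldPoints, int n,
                              cudaTextureObject_t externalForce,
                              float alpha, float beta, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Neighboring sample points with wrap-around (closed contour).
    float2 pm2 = oldPoints[(i - 2 + n) % n];
    float2 pm1 = oldPoints[(i - 1 + n) % n];
    float2 p   = oldPoints[i];
    float2 pp1 = oldPoints[(i + 1) % n];
    float2 pp2 = oldPoints[(i + 2) % n];

    // Second and fourth derivatives along the contour (central differences).
    float2 d2, d4;
    d2.x = pm1.x - 2.0f * p.x + pp1.x;
    d2.y = pm1.y - 2.0f * p.y + pp1.y;
    d4.x = pm2.x - 4.0f * pm1.x + 6.0f * p.x - 4.0f * pp1.x + pp2.x;
    d4.y = pm2.y - 4.0f * pm1.y + 6.0f * p.y - 4.0f * pp1.y + pp2.y;

    // External force sampled at the point position with hardware interpolation.
    float2 f = tex2D<float2>(externalForce, p.x + 0.5f, p.y + 0.5f);

    points[i].x = p.x + dt * (alpha * d2.x - beta * d4.x - f.x);
    points[i].y = p.y + dt * (alpha * d2.y - beta * d4.y - f.y);
}
```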
Several other formulations of active contours have been implemented on the GPU. Perrot et al. (2011) accelerated a type of active
contours that optimizes a generalized log-likelihood function on the GPU. They used a prefix sum algorithm to calculate sums of the image, and shared memory to improve memory access latency. Schmid et al. (2010) implemented a discrete deformable model with several thousand vertices on the GPU using CUDA. Their implementation also allows interactive and concurrent visualization by inserting the vertices into a vertex buffer object, and rendering it with OpenGL. Li et al. (2011) used active contours based on Fourier descriptors implemented on GPUs, for real-time contour
tracking in ultrasound video. Kamalakannan et al. (2009) presented a GPU implementation of statistical snakes, which compared the intensity value of each sample point to a seed point. Their implementation was used to assess stains on fabrics.
As shown by Xu and Prince (1998), several formulations of the external force field $\nabla E_{ext}$ may get stuck in local minima, especially in the presence of boundary concavities. Xu and Prince (1998) introduced a new external force field, gradient vector flow (GVF), which solves this problem. The GVF field is defined as the vector field $\vec{V}$ that minimizes the energy function $E$:

$$E(\vec{V}) = \int \mu |\nabla \vec{V}(\vec{x})|^2 + |\vec{V}(\vec{x}) - \vec{V}_0(\vec{x})|^2 \, |\vec{V}_0(\vec{x})|^2 \, d\vec{x} \quad (5)$$

where $\vec{V}_0$ is the initial vector field and $\mu$ is an application dependent constant. This equation can be solved with an iterative Euler method, as shown in Fig. 7. This differs from the other choices of external energy, which are generally not iterative. GVF is therefore more time consuming, as many iterations are needed to reach convergence. A parallel GPU implementation is possible, as
Fig. 7. Example of how gradient vector flow diffuses the gradients while preserving the large input gradients. The image to the far left is the input image. The next images depict the magnitude of the vector field after 0, 50 and 500 iterations. The bottom row shows the vector field of the zoomed area indicated by the small red square. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
each pixel can be processed independently in each iteration using Algorithm 4. This gives a high thread count and requires global synchronization at each iteration. There is no branch divergence in the calculations, but the memory usage is high, as the method creates a vector for each pixel and requires double buffering. The discrete Laplacian operator in Algorithm 4 is calculated as a stencil operation, which requires access to neighboring pixels. This calculation may benefit from the 2D/3D spatial caching of the texture system. Eidheim et al. (2005), He and Kuester (2006) and Zheng and Zhang (2012) all presented GPU implementations of GVF and active contours for 2D images using shader languages. A GPU implementation of 2D GVF written in CUDA was done by Alvarado et al. (2013). Smistad et al. (2012b) presented an optimized GPU implementation of GVF for 2D and 3D using OpenCL. This implementation uses both texture memory and a 16-bit storage format to reduce memory latency.
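A minimal CUDA sketch of one Euler iteration of GVF in 2D is given below, assuming a float2 vector field and double buffering as described above. Border handling is simplified, and the exact update in Algorithm 4 may differ.

```cuda
// One Euler iteration of gradient vector flow:
// V_new = V + dt * (mu * laplacian(V) - (V - V0) * |V0|^2), with V0 the initial field.
__global__ void gvfIteration(const float2* V, const float2* V0, float2* Vnew,
                             int width, int height, float mu, float dt) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;  // skip the border
    int idx = y * width + x;

    // Discrete Laplacian of the vector field (5-point stencil).
    float2 lap;
    lap.x = V[idx - 1].x + V[idx + 1].x + V[idx - width].x + V[idx + width].x - 4.0f * V[idx].x;
    lap.y = V[idx - 1].y + V[idx + 1].y + V[idx - width].y + V[idx + width].y - 4.0f * V[idx].y;

    // Squared magnitude of the initial field weights the data term.
    float mag2 = V0[idx].x * V0[idx].x + V0[idx].y * V0[idx].y;

    Vnew[idx].x = V[idx].x + dt * (mu * lap.x - (V[idx].x - V0[idx].x) * mag2);
    Vnew[idx].y = V[idx].y + dt * (mu * lap.y - (V[idx].y - V0[idx].y) * mag2);
}
```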
3.6. Level sets
Similar to active contours, level set methods perform segmentation by propagating a contour in the image (Sethian, 1999). The advantage of level sets compared to the methods in the previous
section is that it allows for splitting and merging of the contours without any additional processing. Contours in the level set method are represented by the level set function, which is one dimension higher than the contour. Hence, the level set function is a 3D surface when 2D images are being segmented, and a 4D hypersurface for 3D images. The level set function in 2D segmentation, $z = \phi(x, y, t)$, is defined as a function which returns the height $z$ from the position $x, y$ in the image plane to the level set surface at time $t$. The contour is defined implicitly as the zero level set, which is where the height from the plane to the surface is zero ($\phi(x, y, t) = 0$). This is where the
image plane and the surface intersect. To propagate the contour in the $x, y$ plane, the level set surface is moved in the $z$ direction, as shown in Fig. 8. How fast and in which direction a specific part of the contour moves is determined by how the level set surface bends and curves. The closer the surface is to being parallel with the image plane, the faster it propagates. When the level set surface is orthogonal to the image plane, the contour does not propagate at all. Assuming that each point on the contour moves in a direction normal to the contour with speed $F$, the contour can be evolved using the following PDE:
$$\frac{\partial \phi}{\partial t} = F \, |\nabla \phi| \quad (6)$$
The speed function F varies for different areas of the image I and can be designed to force the contour towards areas of interest and avoid other areas. In image segmentation, the speed function is usually determined by the intensity or gradient of the pixels, and the curvature of the level set function. A negative F makes the contour contract, while a positive F makes it expand. The level set method starts by setting an initial contour on the object of interest. This is done either manually or automatically using prior knowledge. Next, the level set function is initialized to the signed distance transform of the initial contour. Finally, the contour is updated until convergence
The PDE above can be solved using an iterative data parallel method and finite difference methods, as shown in Algorithm 5. The thread count is equal to the number of pixels in the image, as the level set function is updated iteratively for each pixel. Rumpf and Strzodka (2001) presented a GPU implementation as early as in 2001. Updating the level set function $\phi$ for voxels far away from the contour does not significantly affect the movement of the contour. This observation has led to two different optimization techniques, known as narrow band and sparse field. Both reduce the number of voxels updated in each iteration. The narrow band method updates $\phi$ only within a thin band around the contour, while the sparse field method updates $\phi$ only at the neighbor pixels of the contour. Although these methods reduce the number of threads considerably, they introduce branch divergence. All of these level set methods also require global synchronization after each iteration.
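The CUDA kernel below sketches one time step of such a level set update for a 2D image, assuming a precomputed per-pixel speed function F and double buffering. It uses simple central differences; a robust implementation, like those behind Algorithm 5 and the cited works, would add upwind differencing and typically a curvature term.

```cuda
#include <math.h>

// One explicit time step of d(phi)/dt = F * |grad phi| for every pixel.
__global__ void levelSetStep(const float* phi, float* phiNew, const float* speed,
                             int width, int height, float dt) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
    int idx = y * width + x;

    // Central differences for the spatial gradient of the level set function.
    float dx = 0.5f * (phi[idx + 1] - phi[idx - 1]);
    float dy = 0.5f * (phi[idx + width] - phi[idx - width]);
    float gradMag = sqrtf(dx * dx + dy * dy);

    // speed holds the per-pixel speed function F, e.g. derived from intensity and gradient.
    phiNew[idx] = phi[idx] + dt * speed[idx] * gradMag;
}
```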
Hong and Wang (2004) used shader programming to create a GPU implementation of level sets for 2D images, and reported a speedup of over 10 times that of a CPU implementation. Cates et al. (2004) presented an interactive application for level set segmentation of 3D images on the GPU. Lefohn et al. (2004) created a GPU implementation for volumes, which was 10–15 times faster than an optimized serial version. They used the narrow band optimization method and streamed only the relevant parts of the volume
Fig. 8. Illustration of level set segmentation. A level set (hyper) surface is moved through the image plane $x, y$ for each time step. The current contour of the segmentation is defined as the location where the height $h$ to the (hyper) surface is zero. This is also called the zero level set. In this example the level set segmentation is a circle that is gradually inflated over time.
to the GPU from the CPU. This was done because the GPU memory was too small to fit the entire volume at that time. Jeong et al. (2009) also used the narrow band method. However, they updated the active voxel set on the GPU using atomic operations. Roberts et al. (2010) presented an optimization technique similar to the sparse field method. They used prefix sum scan (see Billeter et al., 2009) to compact the buffers containing the coordinates of the active voxels on the GPU.
3.7. Atlas/registration-based
An atlas is a pre-segmented image or volume. Atlas-based segmentation methods use registration algorithms to find a one-to-one mapping between the atlas and the input image. This mapping is the segmentation result. Each pixel in the input image will have a corresponding pixel and segmentation class in the atlas.
Pham et al. (2000) argued that atlas-based segmentation is generally better suited for segmentation of structures that are stable in the population at study. This makes it easier to create a representative atlas. Still, atlas-based methods can be used as an initialization of other methods, when large variation or pathology (e.g. an MRI scan of a patient with a brain tumor) is present. In addition, atlas-based methods have the advantage that regions may be automatically classified, based on labels from the atlas.
Several registration methods exist, and are often divided into the two categories intensity- and feature-based methods. Intensity-based registration methods use the intensity values in the two images (or image and atlas), and a similarity measure to perform the registration. Feature-based registration methods first extract some common features from the images, and then register the images by matching these features. Mutual information and iterative closest point are the most common intensity- and feature-based registration methods respectively, and both are discussed in more detail below. For even more details on how to accelerate registration methods on the GPU, the reader is referred to Shams et al. (2010a) and Fluck et al. (2011).
3.7.1. Intensity-based registration - Mutual Information
Mutual information (MI) is a measure that can be used to assess how well one image is registered to another. This measure is based on the assumption that regions of similar intensity distribution in one image correspond to regions with similar intensity distribution in the other image (i.e. a dark region in one image can be similar to a bright region in another image). The MI measure $M$ is based on Shannon's entropy $H$ and is defined as:

$$M(A, B) = H(B) - H(B|A) \quad (7)$$

where $A$ and $B$ are two images. Shannon's entropy is defined as:

$$H(A) = \sum_{i \in A} p_i \log \frac{1}{p_i} \quad (8)$$

For images, $p_i$ is the probability that the current pixel $i$ in image $A$ has a specific gray value. The probability $p_i$ can be calculated from the histogram of the image. MI can be interpreted as the decrease in uncertainty of image $B$ when another image $A$ is presented. In other words, if the MI is high, the images are similar.
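As a sketch of how these quantities can be computed on the GPU (an illustration under assumptions, not one of the cited implementations), the CUDA code below builds a joint histogram with atomic increments and then evaluates M(A, B) = H(A) + H(B) − H(A, B) on the host. The bin count and the simple atomicAdd-based histogram are placeholders; the implementations discussed below use more elaborate schemes such as sort and count.

```cuda
#include <math.h>

// Joint histogram of two images quantized to `bins` gray levels: one thread per pixel.
__global__ void jointHistogram(const unsigned char* A, const unsigned char* B,
                               unsigned int* hist, int numPixels, int bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;
    atomicAdd(&hist[A[i] * bins + B[i]], 1u);
}

// Host-side evaluation of M(A,B) = H(A) + H(B) - H(A,B) from the joint histogram.
float mutualInformation(const unsigned int* hist, int bins, int numPixels) {
    float HA = 0.0f, HB = 0.0f, HAB = 0.0f;
    for (int a = 0; a < bins; a++) {
        unsigned int rowSum = 0, colSum = 0;
        for (int b = 0; b < bins; b++) {
            rowSum += hist[a * bins + b];              // marginal of A
            colSum += hist[b * bins + a];              // marginal of B
            float pab = hist[a * bins + b] / (float)numPixels;
            if (pab > 0.0f) HAB -= pab * logf(pab);
        }
        float pa = rowSum / (float)numPixels;
        float pb = colSum / (float)numPixels;
        if (pa > 0.0f) HA -= pa * logf(pa);
        if (pb > 0.0f) HB -= pb * logf(pb);
    }
    return HA + HB - HAB;
}
```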
To register two images using MI, one of the images is transformed to maximize the MI measure. The GPU texture memory has hardware support for interpolation, which is often needed for the image transformations. Different optimization techniques such as gradient descent and Powell's method can be used to find the transformation needed to maximize MI. For a detailed review of registration of medical images using MI, see Pluim et al. (2003). The calculation of the MI measure requires summation, which can be done in parallel using the prefix sum scan method. The histogram may be calculated in parallel using sort and count. The number of threads is high, but global synchronization is needed, as this is an iterative method. The optimization techniques gradient descent and Powell's method are not ideal for parallel execution because of their sequential nature (Fluck et al., 2011). Thus, several GPU-based registration methods run the optimization on the CPU, and the similarity measure on the GPU. Global optimization techniques such as evolutionary algorithms (EAs) are highly amenable to parallelization. However, EAs are generally more computationally
expensive, and may be slow even when run in parallel. Lin and Medioni (2008) and Shams and Barnes (2007) presented GPU implementations of the MI computation using CUDA. Shams et al. (2010b) improved their previous implementation by optimizing the histogram computations. This was done using a parallel bitonic sort and count method to avoid performing expensive synchronization and use of atomic counters. With these improvements they reported real-time registration of 3D images, and a 50 times speedup over a CPU version of MI.
3.7.2. Feature-based registration - Iterative closest point
Iterative closest point (ICP) is an algorithm for minimizing the difference between two sets of points. This algorithm was first used for registration by Besl and McKay (1992). In order to use this algorithm for registration, corresponding physical points have to be identified in both images. This can be done either manually or by using image processing techniques. The algorithm starts by finding the closest point in the second point set for each point in the first point set. The corresponding points are then used to calculate a transformation, which transforms one of the point sets closer to the other. Transformation parameters are usually estimated using a mean square cost function. This procedure is repeated as long as necessary, and is depicted in Fig. 9 for two lines.
Finding the closest points and transforming the corresponding points are both data parallel operations. The thread count is equal to the number of points, which is typically significantly lower than the number of pixels in the image. The memory usage is low, and
there is no branch divergence. However, global synchronization is needed at the end of each iteration. Langis et al. (2001) described a parallel implementation of ICP for clusters where the points were distributed on several nodes. The rigid transformation was computed in parallel using a quaternion-based least squares method. This resulted in an improved speedup due to increased parallelization and reduced communication among the nodes. Qiu et al. (2009) presented a GPU implementation of the ICP algorithm with 88 times speedup over a sequential CPU version.
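A brute-force CUDA sketch of the closest-point step is shown below: one thread per point in the first set searches the entire second set. This illustrates only the data parallel structure; practical implementations typically use spatial data structures to reduce the quadratic search cost, and the estimation of the rigid transformation from the correspondences is left to the host or a reduction kernel.

```cuda
// For each point in setA, find the index of the nearest point in setB.
__global__ void findClosestPoints(const float3* setA, int nA,
                                  const float3* setB, int nB,
                                  int* correspondence) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nA) return;

    float3 p = setA[i];
    float bestDist = 3.402823466e+38f;   // FLT_MAX
    int bestIdx = 0;
    for (int j = 0; j < nB; j++) {
        float dx = p.x - setB[j].x;
        float dy = p.y - setB[j].y;
        float dz = p.z - setB[j].z;
        float d = dx * dx + dy * dy + dz * dz;   // squared distance is sufficient
        if (d < bestDist) { bestDist = d; bestIdx = j; }
    }
    correspondence[i] = bestIdx;
}
```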
3.8. Statistical shape models
Several organs in the human body have similar shapes for different individuals. The shape of these organs may be modeled and segmented using a statistical shape model (SSM). This method creates a statistical model of an organ based on a set of pre-segmented images from several individuals. Segmentation is done by fitting the model to the new image data. The difference between SSMs and atlas models is that SSMs model the shape, while an atlas models the tissue distribution and location of each segmentation class in an image. Nevertheless, one type of SSM, called active appearance models, also uses intensity information in the image.
Heimann and Meinzer (2009) presented a review on image segmentation using SSMs. They argued that this method is more complex than other methods, but more robust to local image artifacts and noise. An SSM consists of a mean shape and modes of variations. Generally, shapes are represented as a set of landmark points called a point distribution model (PDM). These points have to be present in each training sample, and be located at the same anatomical positions. Setting the landmarks in the training samples can be done manually by an expert. However, this is time consuming, and not practical for large 3D shapes. Thus, automatic methods are often used instead.
Other methods parameterize all shapes to a common base domain, such as a circle for 2D and a sphere for 3D. Corresponding landmarks are then identified as those that are located at the same locations in the base domain. Nevertheless, the initial parameterization of the shapes may not be optimal, and re-parameterization may be needed. Minimum description length (MDL) (Davies et al., 2002) is an objective function that tries to create optimal landmarks on each shape. This can be used to guide the re-parameterization and give an optimal set of landmarks. Generally, establishing shape correspondence is one of the most challenging tasks of SSMs and one of the major factors influencing the overall result (Heimann and Meinzer, 2009). After the landmarks have been identified and placed in the same coordinate space, the mean shape and modes of variation can be computed. Assuming that the landmark points are arranged as a single vector $\vec{x}_i = \{(x_1, y_1, z_1), \ldots, (x_N, y_N, z_N)\}$ of coordinates for each training sample $i$, the mean shape $\vec{x}_{mean}$ can be calculated as the average location of each landmark over the $M$ training samples:

$$\vec{x}_{mean} = \frac{1}{M} \sum_{i=1}^{M} \vec{x}_i$$
In addition to the mean, a small set of modes which describes the shape variations is calculated. This is usually done with principal component analysis (PCA). Andrecut (2009) and Jošth et al. (2011) both presented a GPU implementation of PCA using CUDA. The amount of speedup depends on the number of landmark points, and they argued that more than a thousand landmark points are necessary. For large organs such as the liver, several thousand landmarks are often employed (Heimann et al., 2009). However, there might not be any benefit of GPU execution for small organs, where only a few hundred landmarks are used. Algorithm 6 describes a crude parallel PCA implementation. More details can be found in Andrecut (2009). The implementation is iterative and tests for convergence by comparing the absolute difference between the new and old eigenvalue estimates to a small tolerance parameter. The actual computations consist of several matrix operations such as multiplication, addition and transpose, all of which can be executed in parallel on the GPU. There are several GPU libraries that can be used to accelerate these matrix operations. A few examples are ViennaCL, MAGMA, cuBLAS and clBLAS.
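As a hedged illustration of how such libraries can offload the matrix arithmetic, the sketch below performs one power-iteration style update for the leading eigenvector of an n × n covariance matrix stored on the device in column-major order, using cuBLAS. It is not the GPU-PCA of Andrecut (2009), only an example of the kind of matrix operations involved; the function name, buffer layout and the assumption of a nonzero iterate are choices made for illustration.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// One power-iteration step for the leading eigenvector of an n x n
// covariance matrix C (device memory, column-major):
//   v <- C * v;  v <- v / ||v||;  lambda <- v^T * (C * v)
// d_C, d_v and d_tmp are device buffers allocated by the caller.
float powerIterationStep(cublasHandle_t handle, const float* d_C,
                         float* d_v, float* d_tmp, int n)
{
    const float one = 1.0f, zero = 0.0f;

    // tmp = C * v
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &one, d_C, n, d_v, 1, &zero, d_tmp, 1);

    // norm = ||tmp|| (assumed nonzero for this sketch)
    float norm = 0.0f;
    cublasSnrm2(handle, n, d_tmp, 1, &norm);

    // v = tmp / norm
    cublasScopy(handle, n, d_tmp, 1, d_v, 1);
    float invNorm = 1.0f / norm;
    cublasSscal(handle, n, &invNorm, d_v, 1);

    // Rayleigh quotient: lambda = v^T * (C * v)
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &one, d_C, n, d_v, 1, &zero, d_tmp, 1);
    float lambda = 0.0f;
    cublasSdot(handle, n, d_v, 1, d_tmp, 1, &lambda);
    return lambda; // converged when the change in lambda falls below the tolerance
}

A full PCA would repeat such updates until the eigenvalue change falls below the tolerance mentioned above, and deflate the matrix before extracting the next mode.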
After PCA has been performed, it is possible to approximate each valid shape using the first $c$ modes:

$$\vec{x} = \vec{x}_{mean} + \sum_{i=1}^{c} \vec{b}_i \vec{\phi}_i$$
Fig. 10. The active shape model algorithm locates borders in a line search from each landmark point on the statistical shape model. A displacement is calculated and the shape is moved, scaled, rotated and deformed to best fit the identified border points. This is repeated until convergence.
The calculations of the mean shape ~xmean and a specific shape ~x can also be run on the GPU. However, the achievable speedup depends on the number of landmark points, which as discussed above can be low. Nevertheless, the creation of the statistical shape model is ideally done only once in a training phase and is not performed for each new segmentation. It can therefore be done offline, and one can argue that the acceleration of the training phase is not as important as the actual segmentation step in the SSM method. After the SSM is built, an image is segmented using a search algorithm that tries to match the SSM to the image.
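For illustration, a minimal CUDA sketch of the shape reconstruction above, with one thread per coordinate summing the contribution of the c modes. The flat array layout (mode k stored at modes[k * numCoords + i]) and the kernel name are assumptions made here, not part of the cited implementations.

#include <cuda_runtime.h>

// Reconstruct a shape instance: shape[i] = meanShape[i] + sum_k b[k] * modes[k][i]
__global__ void reconstructShape(const float* meanShape,   // numCoords values
                                 const float* modes,       // c * numCoords values
                                 const float* b,           // c shape parameters
                                 float* shape,             // output, numCoords values
                                 int numCoords, int c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numCoords) return;
    float x = meanShape[i];
    // Add the weighted contribution of each mode of variation.
    for (int k = 0; k < c; ++k)
        x += b[k] * modes[k * numCoords + i];
    shape[i] = x;
}

Since the number of coordinates (3N for N landmarks) is often only in the hundreds or low thousands, the thread count, and thus the achievable speedup, is limited, as discussed above.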
Khallaghi et al. (2011) used a registration method based on the linear correlation of linear combination similarity metric. They implemented the registration part on the GPU, while the rest of the SSM method was implemented on the CPU. The registration process entailed simulation of an ultrasound image based on a CT image, and a B-spline deformable registration. They reported a speedup of 350 times in comparison to a CPU implementation. However, they provided few details on the implementation.
Active shape models (ASMs) (Cootes et al., 1995) is a local search algorithm that searches for contour points along the normal of each landmark point. This is depicted in Fig. 10. After a displacement for each landmark point has been calculated, the shape is moved, rotated and scaled. Finally, the shape parameters $\vec{b}_i$ are estimated. This is repeated until the shape change falls below a threshold, which requires global synchronization. ASM is a data parallel method with the thread count equal to the number of landmark points. The memory usage is low, as only the SSM has to be stored.
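A minimal sketch of the per-landmark line search, with one thread per landmark stepping along its normal and recording the displacement with the strongest gradient magnitude. The 2D row-major image, the fixed search range and the gradient-magnitude criterion are simplifying assumptions; an actual ASM search matches learned gray-level profiles instead.

#include <cuda_runtime.h>

// One thread per landmark: search +/- searchLength steps along the normal
// and store the displacement (in steps) of the strongest edge response.
__global__ void asmLineSearch(const float* gradientMagnitude, // 2D image, row-major
                              int width, int height,
                              const float2* landmarks,        // current landmark positions
                              const float2* normals,          // unit normals per landmark
                              int numLandmarks, int searchLength,
                              int* displacement)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numLandmarks) return;

    float2 p = landmarks[i];
    float2 n = normals[i];
    float best = -1.0f;
    int bestStep = 0;
    for (int s = -searchLength; s <= searchLength; ++s) {
        int x = (int)roundf(p.x + s * n.x);
        int y = (int)roundf(p.y + s * n.y);
        if (x < 0 || y < 0 || x >= width || y >= height)
            continue; // positions outside the image are a source of branch divergence
        float g = gradientMagnitude[y * width + x];
        if (g > best) { best = g; bestStep = s; }
    }
    displacement[i] = bestStep;
}

The displacements are then reduced, on the host or with a small reduction kernel, to estimate the pose and shape parameter update before the next iteration.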
Another search algorithm for SSMs is active appearance models (AAMs) (Cootes et al., 2001). AAMs use appearance models to drive the search. These appearance models are able to generate a synthetic image from the current shape. This synthetic image is superimposed on the input image, and used to calculate how well the current shape matches the input image. Finally, this measure is used to estimate the orientation, scale and shape parameters. As with ASM, this is done iteratively, and requires global synchronization. The synthesis of images is done by texture transformation, a task which GPUs excel at due to its data parallel nature and high thread count. Nevertheless, Heimann and Meinzer (2009) argue that AAM is rarely used on 3D images as the memory requirement of AAM is very high.
ASM and AAM have been popular for tracking faces in video. Ahlberg (2002) and Song et al. (2010) presented GPU implementations of AAM and ASM respectively for face tracking. Ahlberg (2002) used OpenGL for the texture mapping in the AAM search. Song et al. (2010) used the GPU for pre-processing operations such as edge enhancement and tone mapping, and for the ASM search.

3.9. Markov random fields and graph cuts
Markov random field (MRF) segmentation (Wang et al., 2013a) considers all the pixels in the image as nodes in a graph. All nodes are connected and each pixel has an edge to its neighbor pixels. Each node has a probability distribution associated with it, which consists of the probability of the pixel belonging to each class. These nodes have the Markov property, which states that the probability distribution of a node only depends on its closest neighbors. MRF segmentation amounts to finding the segmentation $S$ that maximizes the probability $P(S|I)$, where $I$ is the observed image to be segmented. $S$ can express several different segmentation classes for each pixel, which makes MRF segmentation ideal for multi-label segmentation. Using Bayes' formula this becomes:

$$P(S|I) = \frac{P(I|S)P(S)}{P(I)}$$
In this formula, $P(I|S)$ is the probability of observing an image $I$ given a segmentation $S$. $P(S)$ is the probability of a segmentation, and can be used to model how a segmentation result should look. $P(I)$ is considered to be a normalization constant, and is therefore ignored in the calculations. Structures of interest can be segmented by creating different expressions for $P(I|S)$ and $P(S)$.
There are several methods for maximizing the a posteriori distribution. One method is iterated conditional modes (ICM), which was introduced by Besag (1986). ICM starts with an initial segmentation $S$, and optimizes the local energy of each pixel deterministically. Thus, each pixel can be processed in parallel. This is repeated until convergence, which requires global synchronization. However, ICM is prone to getting stuck in local minima. Simulated annealing (SA) (Kirkpatrick et al., 1983) is another optimization method, which can avoid local minima. However, SA generally needs many more iterations to reach convergence. SA selects the class of each pixel stochastically based on a temperature parameter. This temperature is first initialized to a high value, and gradually lowered. This has the effect of allowing the segmentation $S$ to reach many states in the beginning. As the temperature is lowered, the segmentation is gradually restricted to minimum states. Both ICM and SA are iterative, and have a medium memory usage as double buffering is required. The thread count is equal to the number of pixels in the image. The branch divergence is low, as the number of instructions in the branches is low.
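A minimal sketch of one ICM iteration for a two-class MRF with an Ising-style smoothness prior: one thread per pixel evaluates the local energy of each label and writes the best label to a second buffer, illustrating the double buffering mentioned above. The precomputed per-pixel data costs and the parameter beta are assumptions made for illustration.

#include <cuda_runtime.h>
#include <cfloat>

// One ICM update: labelsOut[p] = argmin_l ( dataCost(p,l) + beta * disagreement with 4-neighbours )
__global__ void icmIteration(const float* dataCost,          // 2 * width * height, cost per label
                             const unsigned char* labelsIn,  // current labels (read only)
                             unsigned char* labelsOut,       // updated labels (write only)
                             int width, int height, float beta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int p = y * width + x;

    float bestEnergy = FLT_MAX;
    unsigned char bestLabel = 0;
    for (unsigned char l = 0; l < 2; ++l) {
        float energy = dataCost[l * width * height + p];
        // Smoothness prior: penalize disagreement with the 4-neighbourhood.
        if (x > 0          && labelsIn[p - 1]     != l) energy += beta;
        if (x < width - 1  && labelsIn[p + 1]     != l) energy += beta;
        if (y > 0          && labelsIn[p - width] != l) energy += beta;
        if (y < height - 1 && labelsIn[p + width] != l) energy += beta;
        if (energy < bestEnergy) { bestEnergy = energy; bestLabel = l; }
    }
    labelsOut[p] = bestLabel;
}

The input and output buffers are swapped between iterations, and the host checks for convergence, which corresponds to the global synchronization mentioned above.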
Griesser et al. (2005) presented a shader implementation of MRF segmentation, but provided few details of their implementation. Valero et al. (2011) implemented a GPU version of the ICM method in the ITK library. They achieved significant speedups, and mentioned optimizations such as using shared memory and loop unrolling. Jodoin (2006) presented an implementation of both SA and ICM using NVIDIA's Cg shader language. In both cases there is ample parallelism, as there is one thread for each pixel. The result from one iteration is stored in texture memory, so that the neighborhoods of each pixel can be read more efficiently during the next iteration. Walters et al. (2009) presented liver segmentation using ICM and CUDA. They used coalesced reads from global memory to increase performance, and experimented with different thread grouping configurations. Another GPU implementation of ICM based MRF segmentation was presented by Sui et al. (2012). As opposed to the other implementations mentioned here, they
did not process pixels with overlapping neighborhoods in parallel. Multiple passes are therefore required for each iteration, and larger images are required for sufficient parallelism. Modelling $P(S)$ and $P(I|S)$ can require several unknown parameters. These parameters can be estimated using the expectation–maximization (EM) algorithm. This algorithm is an iterative maximum-likelihood method. It requires calculation of the expectation of the conditional distribution $P(S|I)$, which is extremely complex (Zhang, 1992). However, a mean-field approximation can be used to make this calculation feasible (Zhang, 1992). Saito et al. (2012) presented a GPU implementation of MRF segmentation using the mean-field approximation and CUDA. However, they provided no details on the GPU implementation.
Graph cut (Boykov and Veksler, 2006) is another MRF segmentation method. This method also uses a graph where all the pixels in the image are nodes, and each pixel has an edge to its neighbor pixels. However, all pixels have an additional edge to two special nodes, called the source (S) and sink (T) nodes. This is depicted in Fig. 11. These edges are assigned a weight, so that background pixels have a large weight on the edge to one of these nodes and a small weight to the other, and vice versa for the foreground pixels. The weights of the edges between pixels are designed to be large between similar pixels, and small between dissimilar ones.
The segmentation is determined using a minimum cut graph algorithm. These algorithms partition the nodes of a graph into two sets. The graph is cut so that the sum of the weights of the cut edges is minimized. The result is a binary segmentation that is optimal in terms of the weights assigned to the edges. There are several algorithms for finding the minimum cut, and its dual problem maximum flow, where the graph is considered to be a flow network. Two examples are the push-relabel and Ford-Fulkerson algorithms.
Fig. 11. Illustration of graph cut segmentation of a 3 × 3 image. The image to be segmented is shown to the left, its graph representation on the right. The thickness of the edges indicates their weight.

The push-relabel method uses two operations, which both are executed for every node in the graph. With one thread for each node, the total number of threads is high. However, there is significant branch divergence, as these operations are only performed for a subset of the nodes during each iteration. The memory usage of this method is high because it has to store several attributes for each edge. Dixit et al. (2005) presented a GPU implementation of the push-relabel algorithm using shader programming. However, in their comparison with a serial implementation, the GPU implementation was slower unless some approximations were used. Hussein et al. (2007) presented an optimized GPU implementation
using CUDA, which was faster than two different serial implementations. Vineet and Narayanan (2008) presented a similar implementation where they improved the performance by using shared and texture memory to speed up memory access. The two previous implementations restrict the graph to a lattice. Garrett and Saito (2009) showed how a GPU implementation of push-relabel could be extended to arbitrary graphs by representing the vertices and edges in a linear array. An augmenting path is a path in the graph which has available capacity. The Ford-Fulkerson method solves the minimum cut and maximum flow problem by iteratively finding an augmenting path from the source to the sink node. Flow is sent through this path, and this is repeated until no more flow can be sent. This method is not as well suited for data parallel computation as the push-relabel algorithm. However, it is possible to run the method in parallel by splitting the graph and solving each sub-graph in parallel, as done by Liu and Sun (2010) and Strandmark and Kahl (2010).

3.10. Centerline extraction and segmentation of tubular structures
Blood vessels, airways, bones, neural pathways and intestines are all examples of important tubular structures in the human body. In addition to the segmentation, the extraction of the centerline of these structures is also important. The centerline is a line that goes through the center and provides a structural representation of the tubular structures (see Fig. 12). It is important in several applications such as registration of pre- and intraoperative data, which is a key component in image guided surgery. There are several methods for extracting tubular structures from medical images. A recent and extensive review on blood vessel extraction was done by Lesage et al. (2009), and an older one was done by Kirbas and Quek (2004). Two reviews on the segmentation of airways were done by Lo et al. (2009) and Sluimer et al. (2006).
Fig. 12. Centerline, displayed in red, of the airway tree. The centerline was extracted using tube detection filters from computed tomography data and the segmentation was created using a region growing algorithm with the centerline as seeds. All the processing was done on the GPU as explained in Smistad et al. (2013). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
A common approach for extracting tubular structures is to grow the segmentation iteratively from an initial point or region, for instance using methods such as region growing, active contours and level sets. The centerline can then be extracted from the binary segmentation using iterative morphological thinning, also called skeletonization. With this approach, voxels are removed from the segmentation in a specific order until the object cannot be made any thinner. This is an iterative, data parallel method with a thread count equal to the size of the volume. The method has some branch divergence, as only a subset of the voxels has to be checked in each iteration. Jimenez and Miras (2012) presented GPU and multi-core CPU implementations, using CUDA and OpenCL, of the thinning method by Palagyi and Kuba (1999).
Another approach is to use a distance transform or gradient vector flow (GVF), as done by Hassouna and Farag (2007). As explained previously, computation of GVF can be accelerated on the GPU (Eidheim et al., 2005; He and Kuester, 2006; Zheng and Zhang, 2012; Smistad et al., 2012b). Direct centerline extraction without a prior segmentation is also possible using methods such as shortest path and ridge traversal. Aylward and Bullitt (2002) presented a review of different centerline extraction methods. They proposed an improved ridge traversal method based on a set of ridge criteria, and different methods for handling noise. Bauer and Bischof (2008) showed how this method could be used together with GVF. However, ridge traversal is not a data parallel algorithm and is therefore not suited for GPU acceleration.
These methods usually need an initial estimation of candidate centerpoints or the direction of the tubular structure. Tube detection filters (TDFs) are used to detect tubular structures by calculating the probability of each voxel being inside a tubular structure. Most TDFs use gradient information, often in the form of an eigenanalysis of the Hessian matrix. Frangi et al. (1998) presented an enhancement and detection method for tubular structures based on the eigenvalues of this matrix. A similar vessel enhancement method was implemented on the GPU by Wang et al. (2013b) using CUDA. Krissian et al. (2000) created a model-based detection filter that fits a circle to the cross-sectional plane of the tubular structure. These TDFs are data parallel, and are computed for each voxel in the volume. No synchronization is needed, and the memory usage is low, as only one likelihood value has to be stored per voxel.
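As a simplified sketch of the per-voxel, data parallel nature of these filters, the kernel below computes the Hessian of a (pre-smoothed) volume with central differences; the eigenvalue analysis and the actual tube likelihood (e.g. the vesselness measure of Frangi et al., 1998) are omitted for brevity, and the separate output buffers are an assumption of this sketch.

#include <cuda_runtime.h>

// Compute the six unique entries of the 3x3 Hessian at each interior voxel
// of a pre-smoothed volume using central differences. One thread per voxel.
__global__ void hessianKernel(const float* vol, int nx, int ny, int nz,
                              float* hxx, float* hyy, float* hzz,
                              float* hxy, float* hxz, float* hyz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return; // boundary voxels are skipped in this sketch

    #define V(i, j, k) vol[((k) * ny + (j)) * nx + (i)]
    int p = (z * ny + y) * nx + x;
    float c = V(x, y, z);

    // Second order central differences (voxel spacing assumed to be 1)
    hxx[p] = V(x + 1, y, z) - 2.0f * c + V(x - 1, y, z);
    hyy[p] = V(x, y + 1, z) - 2.0f * c + V(x, y - 1, z);
    hzz[p] = V(x, y, z + 1) - 2.0f * c + V(x, y, z - 1);
    hxy[p] = 0.25f * (V(x + 1, y + 1, z) - V(x - 1, y + 1, z)
                    - V(x + 1, y - 1, z) + V(x - 1, y - 1, z));
    hxz[p] = 0.25f * (V(x + 1, y, z + 1) - V(x - 1, y, z + 1)
                    - V(x + 1, y, z - 1) + V(x - 1, y, z - 1));
    hyz[p] = 0.25f * (V(x, y + 1, z + 1) - V(x, y - 1, z + 1)
                    - V(x, y + 1, z - 1) + V(x, y - 1, z - 1));
    #undef V
}

In practice, the eigen-analysis and the likelihood would typically be computed in the same kernel, so that only the final likelihood value needs to be stored per voxel, in line with the low memory usage noted above.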
Erdt et al. (2008) performed the TDF and a region growing segmentation on the GPU and reported a 15 times faster computation of the gradients and up to 100 times faster TDF. Narayanaswamy et al. (2010) did vessel laminae segmentation with region growing and a hypothesis detection on the GPU and reported an 8 times speedup. Bauer et al. used GPU acceleration for the GVF computation in Bauer et al. (2009a), and the TDF calculation in Bauer et al. (2009b). However, they provided no description of the GPU implementations. Smistad et al. (2012a) presented an implementation of airway segmentation and centerline extraction. In this implementation, dataset cropping, GVF and TDF were executed on the GPU using OpenCL. This implementation was further developed in Smistad et al. (2013) to run completely on the GPU, and process other types of tubular structures such as blood vessels from different organs and modalities.
3.11. Segmentation of dynamic images – tracking
So far, only segmentation of single images, acquired at one specific time, has been discussed. However, medical image data acquired over time also exist. For instance, ultrasound devices capture several images per second. Real-time processing of such data requires streaming of the data directly to the GPU. The segmentation of structures in dynamic image data is often referred to as tracking. One way to do segmentation of dynamic images is to apply one of the segmentation methods discussed so far on each frame. However, this may not satisfy real-time constraints. Another approach is to use the segmentation of the previous frame to segment the next frame. The segmentation of the previous frame can be used for initialization, or to create some a priori knowledge for the next frame. Alternatively, more advanced statistical state estimation methods can be used, such as Kalman and particle filters. In this section, these two methods will be discussed further. An open source library for tracking called Open Tracking Library (OpenTL) (Panin, 2011) supports GPU processing, and implements both of these methods and others.
3.11.1. Kalman filter
The Kalman filter (Kalman, 1960) is an algorithm that tries to estimate a state using a series of noisy measurements over time. In image segmentation, the state may be a set of parameters describing the transformation of a shape, such as translation, rotation, scaling and deformation. Several types of measurements can be conducted. One type of measurement for object tracking is the offset from each point on the shape to the object's edges in the current image frame. These offsets are found by a line search along the normal in each point, similar to active shape models (ASMs). The measurement process is data parallel, and the thread count is equal to the number of line searches.
The algorithm itself consists of a set of matrix operations, and most of the matrices have sizes dependent on the number of state variables and measurements. Matrix operations such as multiplication, addition and inversion are all data parallel operations, and the thread count depends on the matrix size. There exist several linear algebra libraries for the GPU that can be used to accelerate such operations. A few examples are ViennaCL, MAGMA, cuBLAS and clBLAS.
Thus, segmentation of dynamic images using the Kalman filter is a data parallel operation, and the thread count is dependent on the number of measurements and state variables. These numbers can vary a lot from one application to another. However, they are much smaller than the number of voxels, so the thread count is medium. The memory usage is low, as only a few small matrices have to be stored. Some branch divergence may occur in the line searches, for instance if some of the points on the shape are outside of the image. However, the actual filter computations have little or no branch divergence.
Huang et al. (2011) presented a GPU implementation of the Kalman filter written in CUDA. They observed a very large speedup compared to a serial implementation. The number of state variables ranged from 250 to 4500 and measurements from 1000 to 7000.
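As a hedged sketch of how the matrix operations map onto such libraries, the following function performs the Kalman prediction step, x ← F x and P ← F P Fᵀ + Q, with cuBLAS on device-resident, column-major matrices for n state variables. It is only an illustration of offloading the linear algebra, not a description of the implementation by Huang et al. (2011); the function name and scratch-buffer layout are assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Kalman prediction step on the GPU:
//   x <- F * x
//   P <- F * P * F^T + Q
// F, P and Q are n x n, column-major, in device memory. d_tmpV (n values)
// and d_tmpM (n*n values) are scratch buffers allocated by the caller.
void kalmanPredict(cublasHandle_t handle, int n,
                   const float* d_F, const float* d_Q,
                   float* d_x, float* d_P,
                   float* d_tmpV, float* d_tmpM)
{
    const float one = 1.0f, zero = 0.0f;

    // tmpV = F * x, then copy the result back into x
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &one, d_F, n, d_x, 1, &zero, d_tmpV, 1);
    cublasScopy(handle, n, d_tmpV, 1, d_x, 1);

    // tmpM = F * P
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_F, n, d_P, n, &zero, d_tmpM, n);

    // P = Q, then P = tmpM * F^T + P  =>  P = F * P * F^T + Q
    cublasScopy(handle, n * n, d_Q, 1, d_P, 1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &one, d_tmpM, n, d_F, n, &one, d_P, n);
}

The update step is a similar sequence of multiplications, additions and one inversion, which is why the achievable speedup grows with the number of state variables and measurements.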
3.11.2. Particle filter
The particle filter method (Arulampalam et al., 2002) tries to estimate the posterior density of the state variables given the measurements. This is done by performing a Monte Carlo simulation with a large number of samples, also called particles. Each particle is a possible state for the next time step. The particles are assigned a weight, which determines how well it describes the posterior density. This is done by evaluating how well each particle matches the object in the next image. With a large number of particles this process can be computationally expensive. However, each particle can be processed in parallel, and an estimate of the next state can be determined by calculating a weighted sum of these particles. Thus, the method is highly data parallel. The thread count is equal to the number of particles. A high particle count generally gives better results, and a couple of thousand particles seems to be common (Montemayor et al., 2006; Brown and Capson, 2012). The
memory usage is dependent on how the weight calculation is implemented. For instance, Brown and Capson (2012) generated an image for each particle, and compared each of these synthetic images to the next image, which gave a high memory usage. The rest of the method uses little memory. The same applies for the branch divergence.
Several GPU implementations of particle filtering have been reported, and have primarily focused on accelerating the expensive weight calculation step. Montemayor et al. (2006) used Cg and achieved real-time speeds with up to 2048 particles on a stream of 2D images of size 320 × 240. Mateo Lozano and Otsuka (2008) and Lozano and Otsuka (2008) implemented face tracking on a stream of images of size 1024 × 768 using CUDA. Murphy-Chutorian and Trivedi (2008) and Lenz et al. (2008) did face tracking using GLSL. Brown and Capson (2012) created a GPU framework written in CUDA for tracking 3D models in a stream of 2D images. They used shared memory to accelerate the weight calculation process.
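A minimal sketch of the data parallel weight evaluation: one thread per particle scores its hypothesized 2D position by summing an edge response over a set of template offsets. The particle state (position only), the edge image and the exponential weighting are simplifying assumptions and not the weight functions used in the cited works.

#include <cuda_runtime.h>

// One thread per particle: evaluate how well the particle's hypothesized
// position matches the image and convert the score to an unnormalized weight.
__global__ void particleWeights(const float* edgeImage, int width, int height,
                                const float2* particles, int numParticles,
                                const float2* templateOffsets, int numOffsets,
                                float* weights)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float2 p = particles[i];
    float score = 0.0f;
    for (int k = 0; k < numOffsets; ++k) {
        int x = (int)(p.x + templateOffsets[k].x);
        int y = (int)(p.y + templateOffsets[k].y);
        if (x >= 0 && y >= 0 && x < width && y < height)
            score += edgeImage[y * width + x];
    }
    // Higher score -> higher weight; normalization over all particles is done
    // afterwards, for instance with a parallel reduction or prefix sum.
    weights[i] = expf(score / numOffsets);
}

The weighted sum that estimates the next state, as well as the resampling step, can likewise be implemented with standard parallel reduction and prefix sum primitives.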
4. Discussion
In the preceding sections, GPU acceleration for medical image segmentation has been reviewed. To conclude the survey, a discussion of the main findings and some predictions regarding the future of image segmentation on GPUs are presented.
4.1. Current state of the art
The main findings of this review are summarized in Table 2. In this table, all the segmentation methods discussed in this paper are listed, and rated using the framework introduced in Section 2.

In general, most segmentation and image processing methods process each pixel using the same instructions, and data from a small neighborhood around the pixel. Thus, the thread count is usually high. Typical sizes of medical datasets are 512 × 512 for images and 512³ for volumes, which amount to over 262 thousand pixels and more than 134 million voxels respectively. However, as seen in this review, some segmentation methods do not process each pixel. Examples include active contours, which move a contour consisting of a set of points, and statistical shape models, which model shapes using a set of landmark points. For these methods, it may only be beneficial to use GPUs when the number of points is in the thousands.
Most segmentation methods are also iterative, in the sense that they run the same kernel several times. This requires global synchronization, which at present is not possible to do efficiently from inside a kernel. The iterative processing often requires double buffering, because global memory writes are not coherent within one kernel execution. When using textures, double buffering is currently required, as a texture can either be read from or written to within a kernel, but not both. Double buffering doubles the amount of memory used, which can be problematic for some methods, such as 3D gradient vector flow.
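The double buffering pattern mentioned above typically takes the form of a host-side ping-pong loop, where each kernel launch also acts as the global synchronization point; the kernel body below is a placeholder update rule and the function names are assumptions made for illustration.

#include <cuda_runtime.h>
#include <utility>  // std::swap

__global__ void iterationKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * in[i];  // placeholder update rule
}

// Host-side ping-pong loop: the result of one iteration becomes the input of
// the next by swapping the two device buffers between kernel launches.
void runIterations(float* d_bufferA, float* d_bufferB, int n, int iterations)
{
    float* d_in = d_bufferA;
    float* d_out = d_bufferB;
    for (int it = 0; it < iterations; ++it) {
        iterationKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();   // wait for the iteration to finish (global synchronization)
        std::swap(d_in, d_out);    // double buffering: swap input and output
    }
}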
Branch divergence is also a challenge for several methods, because not all pixels have to be processed. This is the case for segmentation methods such as region growing and the narrow-band level set. The performance penalty of branch divergence can be reduced using stream compaction. However, this comes at a cost, and does not improve performance if it has to be used in every iteration, as is the case for region growing. Some GPU implementations may not provide a large speedup over an optimized serial version, because the parallel implementation entails performing more work. This is true for methods such as region growing and watershed. With region growing, the total
number of pixels processed in each iteration is much higher in the data parallel GPU implementation than in the serial one. Hadwiger et al. (2004) presented a report on the state of the art of GPU-based segmentation in 2004. In contrast to today, there were very few GPU-based segmentation implementations at that time, with the level set (Rumpf and Strzodka, 2001; Lefohn et al., 2004) being one of the exceptions. They concluded that branch divergence and memory management present challenges for GPU implementations.
4.2. Software predictions
General purpose GPU frameworks such as OpenCL and CUDA have attracted many users in recent years. Their popularity is likely to increase, as they simplify GPU programming compared to shader programming. OpenCL enables efficient use of both GPUs and CPUs, and more hybrid solutions that use the GPU for the data parallel parts and the CPU for the less parallel parts are likely to appear. The challenge for such hybrid solutions is efficient data sharing. At the time of writing, sharing has to be done explicitly through memory transfers over the PCI express bus. However, this seems to be an issue that both of the major GPU manufacturers want to improve, and it is discussed in more detail in the next section. Existing GPU libraries provide commonly used data structures and algorithms such as heap, sort, stream compaction and reduction. As more algorithms and image data are processed on the GPU, libraries and frameworks that aid in writing image processing algorithms, as well as scheduling, memory management and streaming of dynamic image data, are likely to become more important.
4.3. Hardware predictions
The two main GPU manufacturers, NVIDIA and AMD, provide some details of the future development of their GPUs. However, these details are subject to change. In general, the trend in GPU development has been to increase the number of thread processors, the clock speed and the amount of on-board memory. This allows more data to be processed faster in parallel. NVIDIA recently launched their new Kepler architecture, which provides dynamic parallelism, allowing threads to schedule new threads. However, the nesting depth is currently limited to 24 (NVIDIA, 2012). Dynamic parallelism might prove to be useful in segmentation methods that solve PDEs, such as level sets and GVF, by enabling fine grid computations on some image areas and coarse grid computations on other parts. Their current roadmap (NVIDIA, 2013b) suggests that their focus for the two next milestones (Maxwell and Volta) will be on memory. Unified virtual memory will allow CPUs and GPUs to share memory more seamlessly. Further down the road, they plan to pile memory modules atop one another and place them on the same silicon substrate as the GPU core itself. This technology is called stacked DRAM, and can supposedly give GPUs access to up to one terabyte per second of bandwidth. AMD plans to focus on heterogeneous computing through their Heterogeneous System Architecture (HSA) initiative (Advanced Micro Devices, 2013). They state that current CPUs and GPUs have been designed as separate processing elements and do not work together efficiently. Their plan is to rethink processor design to unify these two processor types, and to give applications a unified address space.
Intel recently released another type of processor called the Intel Xeon Phi Coprocessor (Intel, 2014). These processors have a large number of cores (60), large cache (30 MB) and a lot of on-board memory (16 GB). However, in contrast to GPUs, they have fewer thread processors (240). Still, the large cache, memory bandwidth and size may make these processors interesting also for medical image segmentation.
5. Conclusions
In this review, the most common medical image segmentation algorithms have been discussed, and rated according to how suited they are for graphic processing units (GPUs). Through this comparison, it is shown that most segmentation methods are data parallel with a high number of threads, which makes them well suited for GPU acceleration. However, factors such as synchronization, branch divergence and memory usage can limit the speedup over serial execution. To reduce the impact of these limiting factors, several GPU optimization techniques have been discussed.