Research Article

Petri net modeling and simulation of pipelined redistributions for a deadlock-free system

Article: 1057427 | Received 03 Oct 2014, Accepted 15 May 2015, Published online: 18 Jun 2015

Abstract

The growing use of multiprocessing systems has given rise to the necessity of modeling, verifying, and evaluating their performance in order to fully exploit the hardware. The Petri net (PN) formalism is a suitable tool for modeling parallel systems due to its basic characteristics, such as parallelism and synchronization. In addition, the PN formalism allows more details of the real system to be incorporated into the model. Examples of such details include contention for shared resources (like memory) and the identification of blocked processes (a definition of blocked processes is given in the Introduction). In this paper, PNs are considered as a modeling framework to verify and study the performance of parallel pipelined communications. The main strength of pipelines is that, when organized properly, they overlap the computation, communication, and read/write costs incurred in parallel communications. Most of the well-known pipelined schemes have been evaluated by theoretical analysis, queueing networks, and simulations. Usually, the factors taken into account are scheduling, message classification, and buffer spacing. To the best of our knowledge, there is no work in the literature that uses PNs as a modeling tool for the verification of pipeline-based schemes. Apart from verification, a more accurate and complete model should also consider other factors, such as contentions and blocked processes, which have a high impact on the performance of a parallel system. The PN model presented in this paper accurately captures the behavior of the pipeline-based parallel communication system. The model considers synchronization, message scheduling, and message classification, while it is proven to be free of deadlocks and contentions. Also, the model is characterized by symmetry, so it can be used for large and complex systems.

Public Interest Statement

This paper presents a Petri net-based model used to verify and evaluate the performance of pipelined parallel distributions. It precisely captures the behavior of a pipeline-based parallel communication system. The model considers message scheduling and message classification, while it is deadlock and contention free. Because it is symmetric, it can easily be used for larger systems with only minor changes.

The model is composed of two subnetworks: the reading subnetwork and the pipeline execution subnetwork. The reading subnetwork implements the task of preparing the necessary pipeline tasks. The pipeline execution subnetwork implements the actual communication by executing the series of tasks. The model (and thus the real system) maintains its safety by preventing a pipeline from starting until all pipeline tasks have been formed. Also, it avoids deadlocks by preventing a new communication from starting before all the other segments have been assigned their proper tasks.

1. Introduction and related work

The problem of data distribution between several processors is very important, as it affects the efficiency of parallel algorithms. As a parallel program is being executed, it may require a different distribution of data between the processors, that is, a redistribution.

The principal issues that need to be considered to model a pipeline-based communication are (1) total redistribution cost, (2) message scheduling, (3) message classification, (4) load balancing, (5) contentions, and (6) blocked processes.

(1)

Total redistribution cost: It is the total cost to redistribute data between several processors. It is composed of the index computation overhead, the total communication overhead, and the R/W (or I/O) overhead. The index computation overhead refers to computing the target processor and the memory positions where each data element will be located. The total communication overhead incurs during data distribution between parallel processors. The R/W overhead refers to the time consumed by R/W operations.

(2)

Message scheduling: The term refers to the organization of message exchanges into structured communication steps, such that the total redistribution cost is minimized.

(3)

Message classification: The term refers to dividing the data to be redistributed into homogeneous (in terms of communication cost) groups to effectively organize their redistribution.

(4)

Load balancing: The ability to have all communication links equally or nearly equally loaded, to avoid network congestion.

(5)

Contention: This refers to transfer processes that conflict by distributing data to the same target processor at the same time.

(6)

Blocked processes: A process is blocked if it waits for the execution of another process that never occurs.

Several pipeline-based techniques have been proposed in the literature (King, Chou, & Ni, 1990; Preud’homme, Sopena, Thomas, & Folliot, 2012; Rodrigues, Wheeler, & Kogge, 2008). Pipeline techniques have been widely studied (Caron & Desprez, 2005; Jayachandran & Abdelzaher, 2007; Kashif, Gholamian, Pelizzoni, Patel, & Fischmeister, 2013; Kuntraruk, Pottenger, & Ross, 2005; Souravlas & Roumeliotis, 2004, 2014, among others) and used to schedule interprocessor communication. In Caron and Desprez (2005), pipelining is combined with out-of-core (OOC) techniques to overlap the computation, communication, and R/W overheads. Scheduling is based on dividing memory into three memory blocks and then loading three blocks of data into memory. These blocks are three times smaller than what would have to be loaded without this segmentation. The communication costs overlap because, during each period, the first memory block performs I/O, computation is performed on the data stored in the second block, and data from the third block are distributed to a nearby processor. No actual message classification occurs: data are just fragmented into three pieces and transferred. Based on this scheduling, no contentions should occur, since at any given time each memory block receives exactly one data block. The communication model does not take into account the issue of blocked processes. Assuming that there is proper synchronization of the pipeline segments (so that no process can block the execution of another process), the communication tasks pass from one segment to another, and delays are introduced on the writing processes to avoid idle times. A similar approach is presented in Jayachandran and Abdelzaher (2007), but it imposes a restriction on the usage of the pipelines: every task should finish before proceeding to the next pipeline segment.

Table 1. Pipeline-based communication models in the literature and the factors they incorporate

In Kashif et al. (2013), the worst-case response times of real-time applications on multiprocessor systems are computed. The proposed technique schedules a simple pipelined communication operation for data distribution. The model consists of a set of processing resources interconnected with pipelined communication resources (CRs). Data transmitted on the CRs travel through the first segment, followed by the next, and so on. Simultaneous transmission of data on segments is allowed. This means that while data are being transmitted on a later segment, new data can be transmitted on an earlier segment in parallel. However, if this situation is not handled carefully, it can lead to blocked processes. Pipeline segments should first be carefully released before handling new transmissions.

The work in Kuntraruk et al. (2005) addresses the problem of developing a resource estimation model for applications executed within a parallel pipeline model of execution. The model estimates the computation and communication complexities for parallel pipelined applications. It includes two components: the ones that execute pipelined application tasks and the ones that perform merge operations. In the beginning, there are P tasks processed on P processors, one task per processor. Afterwards, at every step, the number of tasks is halved, since half of the processors merge data received from the previous step. The model pays no attention to conflicts that could easily arise when different data volumes are carried over the network of processors. Also, there is no concern about possible blocked processes.

An interesting approach for modeling pipeline parallelism is given in Navarro et al. (2009). The authors develop a series of analytical models based on queueing theory for several parallel pipeline templates, which are modeled as closed or open queueing systems. Specifically, each pipeline segment is treated as an M/M/c_i/N/K queue (for closed systems) or as an M/M/c_i queue (for open systems). The models assure load balancing over the network. Since the proposed models are based on queues, contention between messages that try to enter a queue is not avoided. Things deteriorate if there is not much space in the queues. Necessarily, these models (unlike most of the models described) have to take the limited memory (buffer space) into account. Simulations on real systems were used to verify that the queueing pipeline models capture the behavior of parallel systems faithfully. A related approach in Thies et al. (2007) marks the pipeline segments and tries to track the data communications performed in each of them. Also, an open-system model is proposed by Liao et al. (2004), with the same factors taken into account. These latter approaches do not guarantee load balancing between processor sets.

Many other pipeline-based parallel communication models have been presented in the literature (King et al., 1990; Preud’homme et al., 2012; Rodrigues et al., 2008; Zhang & Deng, 2002). Generally, the models presented are basically concerned with maintaining some load balancing on the network during the pipelined distributions, with little or no attention paid to the problem of contentions and blocked processes. All these models also use simulation as the verification and performance study tool. Table 1 summarizes the factors addressed by the pipeline communication models discussed in this section. Note that all of the models involve some communication scheduling and load balancing (the messages distributed are of the same size) and none of them considers blocked processes. Also, most of the models assume that buffer space is enough to handle the distributed data and do not include a straightforward contention-preventing mechanism.

This paper introduces a Petri net (PN) model for modeling and simulating pipelined, deadlock-free parallel communications. PNs are used to examine the sequence of executed tasks (Granda, Drake, & Gregorio, 1992). The study of the process sequence is very important to avoid faults such as deadlocks (some very general ideas about modeling pipelined parallel communication with PNs can be found in Zhao, Liu, Dou, and Yang (2012)). A deadlock-free scheme enhances the performance of any communication system. Generally, deadlocks occur when processes stay blocked forever (waiting for an event caused by another process that never occurs), and in such cases the whole system probably needs to be restarted. A block cyclic redistribution scheme can suffer from deadlock situations, since each target block is formed during runtime and after a series of interrelated processes, which are described in Section 4.

The rest of the paper is organized as follows: Section 2 presents the background of the model, which is based on block cyclic(r) to block cyclic(s) distributions. Section 3 describes the pipelined communication, which is modeled via PN in Section 4. Section 5 presents simulations of the complete PN model of Section 4, for three different communication scenarios. Section 6 concludes this paper.

2. Background

The model presented in this work has its mathematical background in the well-known block cyclic redistribution problem, so this section briefly introduces the required definitions.

Definition 1

Data array is an array of size M used to represent the redistributed data. An array element is an element of the redistributed data, indexed by i. Indexing begins from zero; thus, i ∈ [0, M-1].

Definition 2

A processor grid can be represented by a two-dimensional (2D) table called the communication grid Π: Π = {(p, q) ∈ [0, P-1] × [0, Q-1]}. Obviously, p is the source processor index and q is the destination processor index, while P and Q represent the total number of sending and receiving processors, respectively.

Definition 3

Data distributed in a block cyclic fashion are divided into data blocks. If each data block has r elements, then, provided that r divides M, the data array is divided into M_b = M/r blocks. If r does not divide M, then M_b = ⌊M/r⌋ + 1. We use the variable l as a block index that relates data blocks to the processors of the communication grid in a cyclic manner; therefore, l lies in [0, M_b/P), that is, [0, M/(Pr)) (since M_b = M/r). Finally, the variable x indexes the local position of an element inside a block, so 0 ≤ x < r.

Definition 4

The source distribution R(i, p, l, x) is the mapping of a data array element with index i to a processor index p, a block index l, and a local position x inside the block, where i = (lP + p)r + x.
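As an illustration of Definition 4, the mapping i = (lP + p)r + x can be inverted with simple integer arithmetic. The following Python sketch (function names are illustrative, not from the paper) recovers (p, l, x) from a global index i:

```python
# A minimal sketch of Definition 4: map a global element index i to its
# source coordinates (p, l, x) under a cyclic(r) distribution on P processors.

def source_coords(i: int, P: int, r: int) -> tuple[int, int, int]:
    """Invert i = (l*P + p)*r + x with 0 <= x < r and 0 <= p < P."""
    x = i % r          # local position inside the block
    block = i // r     # global block number
    p = block % P      # owning processor (blocks are dealt cyclically)
    l = block // P     # block index local to processor p
    assert (l * P + p) * r + x == i
    return p, l, x

# Example: with P = 9 and r = 4, element i = 37 lives on processor p = 0,
# block l = 1, local position x = 1, since (1*9 + 0)*4 + 1 = 37.
print(source_coords(37, P=9, r=4))   # -> (0, 1, 1)
```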

Definition 5

Consider an element that is distributed cyclic(s) on Q processors. The number of blocks created is M_b = M/s, where s is the block size. The variable m relates data blocks to processors, and its bounds lie in [0, M_b/Q), that is, [0, M/(Qs)) (because M_b = M/s). The target distribution R(j, q, m, y) is defined similarly to the source distribution; the parameters (j, q, m, y) have the same meaning as (i, p, l, x) of the source distribution. We can derive an equation for the distribution of element j: j = (mQ + q)s + y.

Table 2. Definitions of variables in this paper

Definition 6

Suppose that a data array is redistributed from cyclic(r) on P processors to cyclic(s) on Q processors. In this case, changes will occur for all elements as far as their processor, block, and local position indices are concerned. These changes are described by R(i, p, l, x) = R(j, q, m, y), or:

(lP + p)r + x = (mQ + q)s + y    (1)

This linear Diophantine equation is subject to the following restrictions: 0 ≤ p < P, 0 ≤ q < Q, 0 ≤ l < L/(Pr), 0 ≤ m < L/(Qs), 0 ≤ x < r, and 0 ≤ y < s, where L is the least common multiple of Pr and Qs, that is, L = lcm(Pr, Qs).

Definition 7

The cost of transferring a message from a sending processor p to a receiving processor q is called the communication cost, C(p, q). To compute the communication cost for a processor pair (p, q), one needs to find the number of quadruples (l, m, x, y) that satisfy Equation 1, given the number of sending (P) and receiving (Q) processors and the block sizes of the source (r) and target (s) distributions.
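The cost of Definition 7 can be computed by brute force over one period of L = lcm(Pr, Qs) elements. The sketch below is an illustration, not the paper's implementation; the two printed values agree with the worked example of Section 3, where pair (1, 0) transfers one element and pair (3, 0) transfers two.

```python
from math import lcm

# A brute-force sketch of Definition 7: count the quadruples (l, m, x, y)
# that satisfy Equation 1 within the bounds of Definition 6.

def communication_cost(p, q, P, Q, r, s):
    L = lcm(P * r, Q * s)
    count = 0
    for l in range(L // (P * r)):
        for x in range(r):
            i = (l * P + p) * r + x          # left-hand side of Equation 1
            m, y = divmod(i - q * s, Q * s)  # the only candidate (m, y)
            if i >= q * s and y < s and m < L // (Q * s):
                count += 1
    return count

# Running example of Section 3: P = Q = 9, r = 4, s = 5.
print(communication_cost(1, 0, 9, 9, 4, 5))  # 1: pair (1, 0) costs one unit
print(communication_cost(3, 0, 9, 9, 4, 5))  # 2: pair (3, 0) costs two units
```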

Definition 8

Consider the following function:

f(p, q) = (pr - qs) mod g,    (2)

where g = gcd(Pr, Qs) is the greatest common divisor of Pr and Qs. A pair of processors (p, q) belongs to a communication class (Desprez, Dongarra, Petitet, Randriamaro, & Robert, 1998a) k if:

f(p, q) = k, or (pr - qs) mod g = k    (3)

As Equation 3 indicates, every communicating processor pair belongs to the class given by (pr - qs) mod g. The number of existing classes is at most g. Table 2 summarizes the variables used in this paper.
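Equation 3 translates directly into code. The following sketch (illustrative names, our own) groups all processor pairs into classes for the running example P = Q = 9, r = 4, s = 5, and confirms that there are at most g = 9 classes, each holding PQ/g = 9 pairs:

```python
from collections import defaultdict
from math import gcd

# A sketch of Definition 8: group processor pairs into communication classes.

def build_classes(P, Q, r, s):
    g = gcd(P * r, Q * s)
    classes = defaultdict(list)
    for p in range(P):
        for q in range(Q):
            k = (p * r - q * s) % g      # Equation 3
            classes[k].append((p, q))
    return classes

classes = build_classes(9, 9, 4, 5)      # g = gcd(36, 45) = 9
print(len(classes))                      # 9: at most g classes
print((4, 3) in classes[1])              # True: (4*4 - 3*5) mod 9 = 1
print({k: len(v) for k, v in classes.items()})  # each class has PQ/g = 9 pairs
```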

3. Pipelined communication

This section presents the pipelined interprocessor communication. Each pipeline includes a number of tasks responsible for the communication between carefully selected processor pairs. The main properties of the pipeline operations and their tasks are:

(1)

Each pipeline task handles the transmission of data between processor pairs that have the same communication cost.

(2)

A pipeline operation cannot include more than one task that handles message transmissions of a given cost.

(3)

The time required for the execution of a task equals the communication cost of the processor pairs it includes.

(4)

The time required for the execution of a pipeline operation equals the execution time of its longest task.

(5)

All tasks are scheduled in such a way that receiving processors get one message at a time, so that congestion on the receiving ports is avoided.

(6)

The pipeline will include a number of segments (the role of segments in the communication will be explained in Section 3.3) equal to the number of different costs that exist in the scheme.

(7)

The time the processors remain idle is minimized.

The pipelined data distribution is composed of three stages: (1) generating the pipeline tasks, (2) reading messages from memory, and (3) transferring the messages and writing them to the target processors’ memory. In the next sections, details for each stage are presented.

3.1. Stage 1: Generating the pipeline tasks

The pipeline tasks must be scheduled in such a way that receiving processors get one message at a time. To satisfy this requirement, each task must include a number of distributions of the same cost to different destination processors. Therefore, classes are used to group all the communicating processor pairs with respect to their communication cost. A processor pair lies in class b(k) if k = (pr - qs) mod g. The class processor table (CPT) records the class of each processor pair and the communication cost of that class. Consider a redistribution with P = Q = 9, r = 4, and s = 5. In this case, g = 9. The CPT for this redistribution example is shown in Table 3. For example, if (p, q) = (4, 3), then pr - qs = 16 - 15 = 1. Thus, (pr - qs) mod g = 1 mod 9 = 1. This means that the processor pair (4, 3) belongs to class k = 1. The cost of communication for each class is computed as the number of quadruples (x, y, l, m) that satisfy Equation 1, for a given pair (p, q).

Table 3. CPT for P = Q = 9, r = 4, and s = 5

Having defined the classes, it remains to: (1) find the number of pipeline operations and the number of their tasks, (2) define an upper bound on the number of processor pairs selected from each class for a pipeline task, and (3) define the number of classes from which processor pairs are selected, so that each pipelined communication has a minimum of Q transmissions (one message for each destination processor).

To minimize the time the processors remain idle, each pipeline must be scheduled to have a maximum number of tasks; in other words, to transfer as much data as possible with a single pipeline operation. If the number of different communication costs found in all classes is d, then a pipeline operation has at most d tasks and can satisfy up to dQ message transmissions, without contentions.

Theorem 1 (for the proof, see Desprez et al., 1998a) is used to define an upper bound for the number of processor pairs in class b(k) that will be added to a pipeline task. Initially, we set s′ = gcd(s, P) and r′ = gcd(r, Q). Since s′ divides P and r′ divides Q, there exist integers P′ and Q′ such that P = P′s′ and Q = Q′r′. Also, we set g_0 = gcd(P′, Q′).

Theorem 1

Each class includes exactly PQ/g = P′Q′/g_0 processor pairs. Theorem 1 leads to the following corollaries:

(1)

The number of sending requests to a destination inside a class is P′/g_0.

(2)

There are exactly Q′ different destinations inside each class; thus, a pipeline task can satisfy no more than Q′ communications between processor pairs of a class, because more would cause contentions.

To define the number of classes from which processor pairs are selected for a minimum of Q transmissions for a pipeline operation, Proposition 1 will be used.

Proposition 1

For a pipeline operation with a minimum number of Q communications, the communicating processor pairs must be selected from r′ different classes.

The minimum number of message exchanges for a pipeline operation corresponds to “one message for each destination processor”, that is, Q messages in total. A pipeline task can satisfy at most Q′ communications from one class, otherwise contentions will occur. From the relationship Q = Q′r′, one can easily conclude that the processor pairs must be selected from r′ classes to complete Q transmissions. The generation of the pipeline operations and their tasks can be described in a series of well-defined steps as follows:

Step 1: Solve Equation 3 for all processor pairs (p, q) to define the processor classes and create the CPT.

Step 2: Find the total communication cost for each b(k) by computing the number of quadruples (l, m, x, y) that satisfy Equation 1. Since each class includes messages of the same cost, only one computation is needed for a pair (p, q); all other processor pairs in the same class have the same cost. Afterwards, determine d, the number of different costs that exist for this distribution.

Step 3: Start from the class b(k) for which the communication cost C_b(k) is minimum and get its Q′ processor pairs. If the pairs selected from b(k) can form a task of Q transmissions, that is, if Q′ = Q, move to Step 4. Otherwise, check if there is a class of the same cost as b(k) from which to add up to Q - Q′ pairs. In either case, the processor pairs that task Ti includes must be such that all destination processor indices differ: Ti = {(pλ0, qμ0), (pλ1, qμ1), …, (pλP′, qμQ′)}, where μ0 ≠ μ1 ≠ … ≠ μQ′.

Step 4: Find the class b(k) with the next communication cost and repeat Step 3. Tasks with the same communication cost are not allowed in the same pipeline operation. Once a pipeline includes dQ message exchanges, it is completed. Go to Step 5.

Step 5: Check the value of d to find the number of different costs for the rest of the processor pairs and use Steps 3 and 4 to create the next pipeline operation.

Step 6: When all processor pairs have been added to a pipeline operation, terminate; if not, return to Step 1.

Consider the redistribution for P = Q = 9, r = 4, and s = 5. In this case, g = 9. The CPT is shown in Table 3. Also (see the last column of Table 3), d = 4, since there are four different communication costs in the scheme, varying from 1 to 4 time units. According to Step 3, we get Q′ = 9 processor pairs from r′ = 1 class to create a task of Q = 9 transmissions. We can select pairs from either of the two classes b(4) and b(6), since they have the same communication cost of one time unit. Suppose that we select from class b(4). These processor pairs form the first task T0 of the first pipeline operation.

According to Step 4, the processor pairs of class b(6) cannot be used in any of the tasks for this pipeline, because the same communication cost of one unit would appear twice for all destinations. For the same reason, the classes b(2) and b(8), b(0) and b(1), and b(3) and b(7) are mutually exclusive. The tasks of the first pipeline include processor pairs from b(0), b(2), b(3), and b(4). In Step 5, the value of d is checked to find the number of different costs in the remaining classes b(1), b(6), b(7), and b(8). We have d = 4. Therefore, Steps 3 and 4 are used to create the second pipeline (the two pipeline operations are shown in Table 4).
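Steps 1-6 can be condensed into a short greedy routine. The sketch below reuses build_classes and communication_cost from the earlier sketches; it assumes Q′ = Q (each class alone fills a task, as in the running example), skips classes whose pairs exchange no data (cost 0), and breaks ties between equal-cost classes by class index, a choice the paper leaves open to the scheduler.

```python
# A hedged sketch of the pipeline generation steps of Section 3.1.

def generate_pipelines(P, Q, r, s):
    classes = build_classes(P, Q, r, s)
    # all pairs in a class have the same cost, so one computation suffices
    cost_of = {k: communication_cost(*pairs[0], P, Q, r, s)
               for k, pairs in classes.items()}
    # cheapest classes first; classes that exchange no data are skipped
    pending = [k for k in sorted(classes, key=lambda k: (cost_of[k], k))
               if cost_of[k] > 0]
    pipelines = []
    while pending:
        operation, used_costs = [], set()
        for k in list(pending):
            if cost_of[k] not in used_costs:   # Step 4: one task per cost
                operation.append((k, classes[k]))
                used_costs.add(cost_of[k])
                pending.remove(k)
        pipelines.append(operation)
    return pipelines

for n, op in enumerate(generate_pipelines(9, 9, 4, 5), 1):
    print(f"pipeline {n}:", [(f"b({k})", len(pairs)) for k, pairs in op])
# two operations of four tasks each, matching Table 4
```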

Table 4. Pipeline operations and their tasks for P = Q = 9, r = 4, and s = 5

Once the pipeline operations and their tasks are scheduled, the messages must be read from local processor memories and prepared for distribution. This stage is described in the next section.

3.2. Initialization—reading messages from memory

This stage involves computing the local memory positions where the data to be distributed reside. Using the terms of Table 2 and Equation 1, one can describe the reading stage as follows: the reading stage computes the local positions x of the data elements to be redistributed. These elements reside in block l of the source processors’ (p) memory. All this information can easily be obtained when Equation 1 is solved. As an example, consider the transfer of data blocks towards processor q = 0 in a redistribution problem with parameters P = Q = 9, r = 4, and s = 5. Table 5 gives the solutions of Equation 1 when q = 0 and p ∈ [0, 8].

Table 5. Solutions of Equation 1 for P = Q = 9, r = 4, and s = 5

Suppose that we want to know the positions of the data elements scheduled to be distributed from source processor p = 7 to target processor q = 0. As shown in Table 5, these elements reside in block l = 0, and their local positions inside the block are defined by x, that is, 0, 1, 2, and 3. The upper part of Figure 1 shows all the elements that will move to q = 0 and their initial positions in the source processors. These positions are computed from Equation 1, as shown in Table 5. Once the initialization computations are done, the pipelines are ready for execution.
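The reading stage is again a direct enumeration of Equation 1. The sketch below (illustrative, reusing the brute-force approach of the cost sketch in Section 2) lists, for a given target q, the source coordinates (p, l, x) of every element together with the target coordinates (m, y) it will occupy, reproducing the content of Table 5 for q = 0.

```python
from math import lcm

# A sketch of the reading stage (Section 3.2): enumerate the solutions of
# Equation 1 for one target processor q.

def read_positions(q, P, Q, r, s):
    L = lcm(P * r, Q * s)
    rows = []
    for p in range(P):
        for l in range(L // (P * r)):
            for x in range(r):
                i = (l * P + p) * r + x          # source side of Equation 1
                m, y = divmod(i - q * s, Q * s)  # candidate target side
                if i >= q * s and y < s and m < L // (Q * s):
                    rows.append((p, l, x, m, y))
    return rows

for p, l, x, m, y in read_positions(0, 9, 9, 4, 5):
    print(f"p={p} l={l} x={x} -> m={m} y={y}")
# e.g. (p, q) = (3, 0): two elements at l=1, x=0 and 1 land in m=1, y=3 and 4
```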

Figure 1. Reading from source processors’ memories and writing to target processors’ memories.

3.3. Transferring the messages and writing to the target processors’ memory

When the pipelines execute, they generate a number of communications between several processors. It is important to note that pipeline operations are executed sequentially (one after the other), but their tasks are parallel. Figure 2 shows the execution of the two pipeline operations for the redistribution with parameters P = Q = 9, r = 4, and s = 5. Each pipeline operation is composed of four tasks (T0-T3).

Figure 2. R/W for P = Q = 9, r = 4, and s = 5.

The horizontal axis displays the time in time units, while the vertical axis gives the pipeline segment. The role of a segment is to handle the distribution process performed by a pipeline task Ti. When a task is handled by the nth out of d segments, it is scheduled to be the nth to complete its distribution job. For example, in Figure 2, there are four segments that handle four transferring tasks. As tasks move “downwards” from segment 4 to segment 1, they approach their completion. Apparently, in this figure, T0 finishes first, as it is the “cheapest” task (one time unit, see Table 3). The time required for the execution of a task equals the communication cost of the processor pairs it includes. From the previous discussion in Section 3.1, it is obvious that the pipeline tasks are completed at different times. Since each task cannot contain more than one message to a specific destination, contentions at the receiving processors’ ports are avoided. To make this clearer, consider the four tasks shown in Table 4. All tasks handle messages to the same target nodes; however, congestion is avoided, since these tasks complete at different times.

Each of the tasks performs a partial transferring job, that is, it “adds” elements to data blocks at a certain time. Now, let us examine how these tasks are executed during communication to processor q = 0. Suppose that communication starts at time t = 0. By the end of time t = 1, the first task T0 is complete (see Figure 2). This means that one element from source p = 1 (note in Table 4 that p = 1 sends data to q = 0 during the execution of T0) is transferred to its new location. By the end of time t = 2, two more elements are added from p = 3. The first pipeline operation completes at t = 4 time units. In the very same manner, the second pipeline operation starts execution at time t = 5. At each time unit, a task completes and adds elements to q = 0. At t = 8, the distribution to q = 0 is complete. Similarly, all destination processors receive their data blocks during the same period of time.

When pipeline execution completes, the distributed elements become parts of newly formed blocks, indexed by m, in the memories of the target processors q. Their new local position inside the blocks is defined by y. For example, the lower part of Figure 1 shows the newly formed blocks in the memory of q = 0. It is clear that each new block has five elements. The block m = 0 is formed by four elements transferred from p = 0 and one element transferred from p = 1. The local position of these five data elements in the newly formed block is given by y, that is, 0, 1, 2, 3 (for the elements from p = 0) and 4 (for the one element from p = 1). The lower part of Figure 1 shows the new positions of the elements distributed to q = 0 in their new blocks m = 0, 1, 2, and 3. Figure 3 shows the formation of these blocks over time, during the execution of the pipeline tasks included in the two pipeline operations shown in Table 4. Assuming that communication starts at time t = 0, the first task, T0, is completed at t = 1. Therefore, the target processor q = 0 will have received one data element from the sending processor p = 1 (see task T0 of the first pipeline in Table 4). According to the results in Table 5, for (p, q) = (1, 0), this element is stored in position y = 4 of the target block m = 0 (recall that block position indexes start from 0, so this element occupies the last position of the block). Similarly, at t = 2, task T1 is completed. Therefore, the target processor q = 0 will have received two data elements from the sending processor p = 3 (see task T1 of the first pipeline in Table 4). According to the results in Table 5, for (p, q) = (3, 0), these elements are stored in positions y = 3 and y = 4 of the target block m = 1.

Figure 3. Memory writing over time for P = Q = 9, r = 4, and s = 5.

4. The PN pipelined communication model

As described in Section 1, the pipeline-based communication models in the literature do not consider the problem of deadlocks (blocked processes), and they seldom take contentions into account. The goal here is twofold: (1) use the PN model as a tool to verify that the communication model of Section 3 is deadlock- and conflict-free, and (2) obtain performance metrics for three different distribution scenarios from the PN model via discrete event simulation.

The system model under consideration comprises a number of pipeline tasks in a single pipeline operation that execute as described in Sections 3.1-3.3. A distribution problem can include a variable number of operations with a variable number of tasks. Since all pipeline operations follow the same pattern (execute a number of parallel tasks), it is important to design a symmetric model, so that it can be applied to a variable number of tasks with minor changes. From a modeling perspective, there are three occurrences of interest: (1) generation of the pipeline tasks, (2) execution of the pipelines, and (3) handling of the pipeline segments. This section presents a fully symmetric PN model for these occurrences and performs deadlock and safeness analyses to verify that the models (and consequently the pipeline communication schedule) do not suffer from blocked processes or contentions. Notice that occurrences (2) and (3) are closely related, so they are presented in the same subsection. Before that, some PN preliminaries are required.

4.1. PN preliminaries

This subsection briefly presents the basic PN notation required in this work. A PN is a set of two different types of nodes: places (pictured as circles) and transitions (pictured as bars). Places and transitions are connected via directed arcs from places to transitions and vice versa. If an arc is directed from node A to node B, then A is an input to B. The state of a PN changes when it is executed. The execution is controlled by the tokens placed inside nodes. When the PN is executed, a number of tokens are removed from their current places and put in other places. The distribution of tokens over the places defines the state of the net; this distribution is referred to as the marking of the PN. When a system is initialized, it must have an initial state, or initial marking. In this paper, every marking (initial or later) is denoted by μi = (Pj, …, Pk), where i is the marking index (i = 1 denotes the initial marking) and the list contains every place that holds a token in this marking.

Figure 4. (a) A PN example and (b) reachability tree

A change of a PN’s state (that is, a movement of tokens) occurs when a transition is enabled to fire, which is the case when all of its input places have a token. To describe the fact that a transition’s firing changed the marking from μi to μi+1, the notation μi →tj μi+1 is used, where tj is the fired transition. As an example, consider the PN of Figure 4(a). The initial marking is μ1 = (P3, P4), and the only enabled transition is t4 (t2 is not enabled because, although P2 has a token, its input place P5 has no token). When t4 fires, the token is removed from P4 and two new tokens are placed, one in P2 and one in P4, resulting in μ2 = (P2, P3, P4).

A very important issue with PNs is that a sequence of firings can result in a marking μ where no transition is enabled. This drives the model into a deadlock (a process being blocked by another process). This means that there are a number of unwanted states that can lead to a deadlock. Spotting these states is a very important modeling issue, because it uncovers the sequences of events that can lead the model, and consequently the real system, to deadlock. The tool used to analyze the PN model for deadlocks is called the reachability tree. A reachability tree is a set of nodes that represents all possible markings of a net (if this set is finite) caused by the firing of transitions. Figure 4(b) shows the reachability tree for the example of Figure 4(a). The transition that causes each new marking is written near the arrows. Parentheses above a place show the number of tokens in that place; if no parentheses are included, the number of tokens is 1. In the case of Figure 4(b), no place stores more than one token, but this is not always the case.

Peterson (1977) introduced two basic rules for reachability analysis: (1) if a newly formed marking is equal to an existing one on the path from the root (which is the initial marking) to this new marking, then this marking is a terminal node; this means that if a new marking is equal to a previous one, then all markings reachable from it have already been added to the reachability tree; and (2) if a newly formed marking y is greater than a previous marking x, then all firings possible from marking x are also possible from marking y. In this case, the components of y that are greater than the corresponding components of x are replaced by the symbol ω, which represents a value arbitrarily large compared to any natural number. The sequence of firings that leads from x to y can be repeated endlessly, each time increasing the number of tokens in the corresponding places Pi; the notation Pi(ω) denotes an arbitrarily large number of tokens in Pi. From Peterson’s rules, we derive Proposition 2, which gives a condition under which a model does not reach a deadlock.

Proposition 2

A PN model will never reach a deadlock if for all markings that are roots to subtrees formed on the reachability tree, it is possible to find a terminal node or a marking greater than the root.

In a PN model, the places represent conditions, while the transitions represent events. The presence of a token in a place shows that a condition is true. Since a condition is either true or false, there is no point in having more than one token in a place. Thus, especially when modeling hardware, one of the most important characteristics of PNs is safeness. A place of a PN is safe if it never holds more than one token; safeness is therefore violated when a sequence of firings puts two or more tokens in a place. In a parallel pipelined communication model, safeness ensures that there are no conflicting processes. Two or more processes are in conflict if, at the same time, they try to read data blocks from the same source processors or to distribute and write data to the same target processors. The reachability analysis can show whether multiple tokens are put in one place. Deadlock and safeness analyses are provided for the proposed models in the next subsections.
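To make this analysis executable, the following minimal sketch (our own illustration, not tooling from the paper) implements ordinary place/transition nets with unit arc weights, which suffices for the nets in this section. Markings are multisets of places; exploration is a breadth-first walk over the reachable markings that records deadlocks (markings with no enabled transition) and safeness violations (a place holding more than one token). The ω rule is omitted, since the nets analyzed here are bounded.

```python
from collections import Counter, deque

def freeze(marking):
    """Hashable snapshot of a marking."""
    return frozenset(marking.items())

class PetriNet:
    def __init__(self, transitions):
        # transitions: {name: (input_places, output_places)}, unit arc weights
        self.transitions = {t: (tuple(i), tuple(o))
                            for t, (i, o) in transitions.items()}

    def enabled(self, marking):
        """Transitions whose input places all hold a token."""
        return [t for t, (ins, _) in self.transitions.items()
                if all(marking[p] >= 1 for p in ins)]

    def fire(self, marking, t):
        """Consume one token per input arc, produce one per output arc."""
        ins, outs = self.transitions[t]
        new = Counter(marking)
        for p in ins:
            new[p] -= 1
        for p in outs:
            new[p] += 1
        return Counter({p: n for p, n in new.items() if n > 0})

    def explore(self, initial):
        """Breadth-first reachability: (all markings, deadlocks, unsafe)."""
        start = Counter(initial)
        seen, queue = {freeze(start)}, deque([start])
        deadlocks, unsafe = [], []
        while queue:
            m = queue.popleft()
            if max(m.values(), default=0) > 1:
                unsafe.append(m)              # safeness violated
            ts = self.enabled(m)
            if not ts:
                deadlocks.append(m)           # no transition enabled
            for t in ts:
                nxt = self.fire(m, t)
                if freeze(nxt) not in seen:
                    seen.add(freeze(nxt))
                    queue.append(nxt)
        return seen, deadlocks, unsafe
```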

4.2. Modeling the pipeline generation

The background and the steps required to generate the pipelines were presented in Section 3.1. Figure 5(a) presents the PN model and Figure 5(b) shows the pipeline generation in pseudo-code form. The part in the square is the reading subnetwork. Outside the reading subnetwork, there are two places, PG and PE, and two transitions, tG and tE, whose roles are described in the following. The places and transitions of the model are described as follows:

Places

PG := Generation request
PE := Transfer request (pipeline execution)
P1 := Lowest cost class
P2 := Q messages
P3 := While condition
P4 := If condition
P5 := Unread messages from lowest cost class
P6 := System available to generate new operation

Transitions

tG := Get lowest cost class
t1 := Read Q - Q′ messages
t2 := Check condition Q′ < Q
t3 := Assign FALSE to condition
t4 := Assign TRUE to condition
t5 := Terminate reading
t6 := Get next lowest cost class
t7 := Read unread messages from a same-cost class

Figure 5. Pipeline generation model.

The pipeline generation model of Figure 5 is a repeating process, initialized by a generation request, provided that no pipeline generation is in process. The initial marking, μ1 = (P6), indicates that the system is available to generate a new pipeline operation. Once PG gets a token, tG is enabled to fire and cause a series of firings. Firing of tG causes one token from PG and one from P6 to be removed and one token to be placed in P1 (the lowest cost class is defined). This produces μ2 = (P1) and enables t1. Once t1 fires, a token is removed from P1 and one token is placed in P2 (Q messages generated). The new marking is μ3 = (P2), and t2 is enabled. Then, a while condition is checked to decide whether more messages are required from the same class. When t2 fires, a token is placed in P3 (marking μ4 = (P3)). This enables two transitions (t3 and t4), but only one can fire. It should be stressed that a PN does not have a mechanism to decide which of the enabled transitions fires; this depends on the designer of the model. When t4 is selected, there is a cyclic execution of firings, which repeats until Q′ = Q, repeatedly creating the markings μ5 = (P5), μ6 = (P2), and μ7 = (P3). When t3 is selected, a token is placed in P4 (marking μ8 = (P4)). As with the while condition, the system now has to check whether there is a class of a specific cost not yet added. If so, t6 fires to cause μ9 = (P1), meaning that the sequence of firings described is repeated. If not, t5 fires to cause μ9 = (P6, PE), meaning the system has terminated the pipeline generation and is available to generate a new one. Also, a transfer request is activated (the pipelines can be executed, as described in Section 4.3).

At this point, an anti-paradigm is necessary to stress the importance of modeling. Assume that the arc from P6 to tG is removed. Also, assume that a pipeline generation is in process (say, the while statement is being checked, that is, a token is in P3). If a second generation request is made, the system will have to start a new operation, starting with a new lowest cost class; thus, a token is placed in P1. Then, a sequence of firings t3, t6 will result in two tokens stored in P1. The system is not safe, and it can start a pipeline generation with two different distribution parameters (lowest cost classes). This is a conflict, and the problem can be resolved only by restarting the system. To avoid this, tG is enabled only if there is a request (token in PG) and the system is available (token in P6).
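Assuming the arcs are exactly those described above (the published Figure 5(a) is the authoritative source), the reading subnetwork can be fed to the explorer from the preceding sketch. The run confirms both the safeness of the net and the anti-paradigm: removing the arc from P6 to tG lets two concurrent generation requests pile two tokens into P1.

```python
# The reading subnetwork of Figure 5(a), reconstructed from the firing
# sequence described above. Reuses the PetriNet class from Section 4.1.

reading = PetriNet({
    "tG": (["PG", "P6"], ["P1"]),  # get lowest cost class
    "t1": (["P1"], ["P2"]),        # read messages
    "t2": (["P2"], ["P3"]),        # check the while condition
    "t4": (["P3"], ["P5"]),        # condition TRUE: more messages needed
    "t7": (["P5"], ["P2"]),        # read unread same-cost messages
    "t3": (["P3"], ["P4"]),        # condition FALSE
    "t6": (["P4"], ["P1"]),        # another cost class remains: repeat
    "t5": (["P4"], ["P6", "PE"]),  # terminate reading, request transfer
})
_, deadlocks, unsafe = reading.explore({"PG": 1, "P6": 1})
print(unsafe)     # []: the net is safe
print(deadlocks)  # only (P6, PE): generation done, waiting for a new request

# Anti-paradigm: drop the arc from P6 to tG and issue two generation
# requests; the explorer now reaches markings with two tokens in P1.
bad = PetriNet({**reading.transitions, "tG": (["PG"], ["P1"])})
_, _, unsafe = bad.explore({"PG": 2, "P6": 1})
print(any(m["P1"] > 1 for m in unsafe))  # True: conflicting generations
```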

4.3. Execution of the pipelines and handling of the segments

Once the reading stage is completed, the actual communication can start. To execute the pipelined communication, the system control requires the following sequence of events:

  • Output of segment 1 (lowest segment that handles the lowest cost task) is ready. Thus, the lowest cost messages are to arrive at the target nodes.

  • Messages from segment 1 are written to the memories of the target nodes.

  • Output of segment 2 is ready. The task handled by segment 2 now “moves to” the lowest segment, segment 1.

  • Segment 2 is free (ready to accept a task from the upper segment 3), and segment 1 now handles the communication previously handled by segment 2 (these are the next messages to arrive at the target nodes).

  • Output of segment 3 is ready. The task handled by segment 3 moves to segment 2.

  • Segment 3 is free.

  • Output of segment 4 is ready. The task handled by segment 4 moves to segment 3.

  • The same pattern repeats until all d segments “move down” to segment 1 and complete communication.

As described in the previous subsection, once place PE receives a token, the execution stage can start. This is done by firing tE. When tE fires, one token is removed from PE and a token is put in every place indicating that the output of a pipeline segment is ready. Since a distribution problem has d different costs (thus, d pipeline tasks), the model requires a maximum of d segments; thus, d places receive a token when tE fires. Figure 6(a) shows the model of the pipeline execution subnetwork (the place and transition numbering continues from the reading subnetwork), and Figure 6(b) shows the marking that arises from firing tE. This marking is μ1 = (P7, P13, P14, P18, …, Pm+1), where Pm+1 is the place showing that the output of segment d is ready. Clearly, the model suggests that segments correspond to “circles” of places and transitions, which are clearly formed in Figure 6(a). Transfer between segments is implemented by firing the dual-input transitions. Also, note that the circles are symmetric, the only exception being the last segment, where one node is missing, for the simple reason that there is no higher segment to move down into segment d. The dashed arrows indicate that the pattern repeats up to the last segment d. The places and transitions of the model are described as follows:

Places

P7 := Output of segment 1 is ready
P8 := Segment 1 busy
P9 := Segment 2 in segment 1
P10 := Lowest cost communication done
P11 := Segment 2 empty
P12 := Segment 3 in segment 2
P13 := Segment 2 output ready
P14 := Segment 3 output ready
P15 := Segment 4 in segment 3
P16 := Segment 3 empty
P17 := Segment 4 empty
P18 := Segment 4 output ready
Pm := Segment d empty
Pm+1 := Segment d output ready
Pm+2 := Lower segment communication continues

Transitions

t8 := Unpack messages
t9 := Write to receiving processors’ memory
t10 := Reset segment 2
t11 := “Move” segment 2 to segment 1
t12 := “Move” segment 3 to segment 2
t13 := Reset segment 3
t14 := Reset segment 4
t15 := “Move” segment 4 to segment 3
t16 := “Move” segment 5 to segment 4
t17 := Reset segment 5
tn := Reset segment d
tn+1 := “Move” segment d to segment d-1
tn+2 := Proceed communication in lower segment

Figure 6. Pipeline execution model.

The proposed model is fully symmetric, so it can easily be used for any number of states. We now perform a test for the small distribution problem with parameters P = Q = 9, r = 4, and s = 5 presented in Section 3. This test helps us fully understand the execution of the model. Based on the CPT (Table 3), there are two pipeline operations, each including four tasks, T0-T3 (see Table 4). The “movement” of these tasks across the pipeline segments is shown in Figure 2. The model for this distribution includes four circles formed by places and transitions; it is shown in Figure 7(a).

Figure 7. Pipeline execution model for P = Q = 9, r = 4, and s = 5.

Initially, four places hold a token, so μ1 = (P7, P13, P14, P18). Thus, the only enabled transition is t9. Once t9 fires, a token is removed from P7 and placed in P10, resulting in (P10, P13, P14, P18). From the real system’s point of view, this describes the completion of the communications performed by task T0 (event 1) (T0 is handled by segment 1, see Figure 2). Now, t11 is the only enabled transition. When it fires, it produces μ2 = (P9, P14, P18): tokens are removed from P10 and P13, and one token is placed in P9. In the real system, task T1 is assigned to segment 1 (event 2) (because segment 2 moves to segment 1). Next, only transition t10 can fire, resulting in μ3 = (P8, P11, P14, P18). We now have two events: segment 2 is empty (event 3) and segment 1 is busy with T1 (event 4). At this point, there is a synchronization issue: if t8 were enabled, the same pattern would be executed continuously. In the first of this series of executions, segment 1 would deliver some messages to the target processors, but afterwards there would be no messages to deliver, because it would have to wait for tasks from segment 2, which do not come, since segment 2 is empty, waiting for the upper segments. Practically, this means that the system stays idle for as long as this repeating execution continues; if this sequence does not change, the system reaches a deadlock and the transmission has to start from scratch. Also, if t12 does not fire in the meantime, more than one token will be placed in P11, rendering the system unsafe. The solution we give with this model is to enable t8 only after t16 fires, which, for the real system, means that segment 1 can finish another distribution task only when all the remaining tasks have been assigned to segments (“moved downwards”, as can be seen in Figure 2). When t16 fires, a token is put in place P19, and this enables t8. However, t16 can fire only after t10, t12, t13, t14, and t15 have fired, so that all the tasks have been properly assigned to segments. That is, the model indicates that there must be synchronization between segments (and consequently between the tasks they include) to avoid deadlocks.

Continuing the description, t12 is the only enabled transition. Once it fires, it produces μ4 = (P8, P12, P18). This indicates that task T2 is assigned to segment 2 (event 5) (segment 3 moves to segment 2). Now, t13 is enabled. Once it fires, the new marking is μ5 = (P8, P13, P16, P18). There are now two events: the output of segment 2 is ready (event 6) and segment 3 is empty (event 7). Event 6 means that T2 is ready to move to segment 1 after the completion of T1, which is executed there. From marking μ5, it is obvious that only t15 can fire, resulting in μ6 = (P8, P13, P15). For the real system, task T3 is assigned to segment 3 (event 8) (because segment 4 moves to segment 3). Now, t14 is enabled. When it fires, the new marking is μ7 = (P8, P13, P14, P17). The two real system events are: the output of segment 3 is ready (event 9) (meaning that when T1 finishes in segment 1 and T2 is pushed from segment 2 to segment 1, T3 will move to segment 2) and segment 4 is empty (event 10). Now t16 is enabled. When it fires, the new marking is μ8 = (P8, P13, P14, P18, P19). With a token in P19, transition t8 is enabled again, because all segments have moved, as explained previously. When this firing occurs, there is one event: completion of the communications performed by task T1 (event 11). Also, the new marking produced is μ1, meaning that the same sequence of firings repeats (event 12). Therefore, another transfer can be completed in the lowest segment 1, and the segments will move downwards again, repeating the same execution cycle. The sequence of events produced by the model’s execution is listed next (the events written in boldface in the analysis above).

(1)

Completion of communications performed by task T0

(2)

Task T1 is assigned to segment 1

(3)

Segment 2 is empty

(4)

Segment 1 is now busy with T1

(5)

Task T2 is assigned to segment 2

(6)

Output of segment 2 is ready

(7)

Segment 3 is empty

(8)

Task T3 is assigned to segment 3

(9)

Output of segment 3 is ready

(10)

Segment 4 is empty

(11)

Completion of communications performed by task T1

(12)

Same pattern repeats

Based on the above analysis, it is easy to obtain the reachability tree of Figure 7(b). Obviously, the PN model has no deadlocks, since there is no sequence of firings that leaves the net with no enabled transition, and the pattern is proven to repeat itself. As an anti-paradigm, suppose that at least one of the places P7, P13, P14, and P18, say P13, is not initially marked. When t9 fires, a token will move to P10, but since P13 has no token, t11 is disabled, causing a deadlock. For the real system, this means that segment 2 cannot move down to segment 1, simply because its output is not, and never will be, ready. This will halt the system, causing unwanted effects (e.g. a communication restart).
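The same explorer can check the d = 4 execution net. The transition set below is reconstructed from the firing sequence just described (again a sketch; Figure 7(a) is authoritative). The run confirms that the net cycles forever without deadlocks or safeness violations, and that leaving P13 unmarked deadlocks the net after t9 fires, exactly as in the anti-paradigm above.

```python
# The pipeline execution net of Figure 7(a) for d = 4, reconstructed from
# the markings described above. Reuses the PetriNet class from Section 4.1.

execution = PetriNet({
    "t9":  (["P7"], ["P10"]),          # write lowest-cost output to memory
    "t11": (["P10", "P13"], ["P9"]),   # "move" segment 2 to segment 1
    "t10": (["P9"], ["P8", "P11"]),    # reset segment 2; segment 1 busy
    "t12": (["P11", "P14"], ["P12"]),  # "move" segment 3 to segment 2
    "t13": (["P12"], ["P13", "P16"]),  # reset segment 3
    "t15": (["P16", "P18"], ["P15"]),  # "move" segment 4 to segment 3
    "t14": (["P15"], ["P14", "P17"]),  # reset segment 4
    "t16": (["P17"], ["P18", "P19"]),  # all segments have moved: enable t8
    "t8":  (["P8", "P19"], ["P7"]),    # unpack the next task's messages
})
ok = {"P7": 1, "P13": 1, "P14": 1, "P18": 1}
_, deadlocks, unsafe = execution.explore(ok)
print(len(deadlocks), len(unsafe))     # 0 0: the firing cycle repeats forever

# Anti-paradigm: leave P13 unmarked and the net deadlocks after t9 fires.
_, deadlocks, _ = execution.explore({"P7": 1, "P14": 1, "P18": 1})
print(len(deadlocks) > 0)              # True
```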

4.4. Advantages of the proposed model

At this point, it is important to summarize some of the advantages of the proposed model.

(1)

Optimal in terms of complexity: The complexity of the model is O(n), where n is the number of states examined. Clearly, this is the best that can be achieved, as any other method would also have to examine every state, so its complexity would be at least O(n).

(2)

Symmetry: As stated before, the model can be used for any redistribution problem (that is, any number of states) due to its symmetry. Thus, generally, there are no applicability limitations.

(3)

Precision: The model includes all the main processes involved in a redistribution problem (as described in Section 3): generation of the pipeline tasks, initialization and reading messages from memory, transferring, and writing back to memory. Thus, it is precise and models the real problem with high accuracy.

Remark: When the second cycle is executed, segment 4 will be found empty (from the firing of t14 in the first cycle). This means that segment 4 can now be used to hold the lowest cost task of the next pipeline communication. In this case, when t15 fires, the lowest cost task of the next pipeline can move to segment 3. Similarly, when the third cycle executes, segment 3 will be found empty (from the firing of t13 in the second cycle). So, the lowest cost task of the second pipeline can move from segment 4 to segment 3, while the next lowest cost task can enter segment 4. However, it must be pointed out that the communication model is not designed to take advantage of the empty segments in an organized manner. An idea would be to create and pipeline groups of classes to ensure some kind of transfer homogeneity. This is a subject of future research.

5. Experiments

In this section, the accuracy of the model is verified via simulations. Based on the PN model, a small pipeline simulator (PPN simulator) was implemented for the purposes of this work. The simulator simply executes the well-defined sequence of processes described in Section 3.3. The simulations are not restricted by the assumption, imposed for the sake of the model, that there is adequate bandwidth and enough buffer space. Instead, different bandwidth sizes are used, and simulations were performed considering that the buffers may or may not have enough space for the data volume carried over the network. Different scenarios are studied in terms of buffer space, bandwidth, and data volume. The interprocessor communication is performed via all-to-all links. Each link has a bandwidth of B MB/s. Considering that each node sends a number of messages at each communication step, the bandwidth available to a node is:

bandwidth_p = (B × ℓ)/h    (4)

where B is the link bandwidth in MB/s, ℓ is the number of links to a node p, and h is the average number of intermediate hops required for the communication between two processors. Assuming that there is only one link per node, and since h depends on the logarithm of P, Equation 4 becomes:

bandwidth_p = B × 1/log2(P)    (5)

To run the simulations, the theoretical block size s of the target distribution must be converted to real sizes. By multiplying s by an assumed vector size, the size of the message transferred between two processors at a step is produced. For example, if the target distribution block size is s = 5 and the vector size is 1 MB, then processor p will send a message of 5 MB to processor q. Since the bandwidth is B MB/s, it is easy to estimate the time required for the transmission. In the following, the vector sizes are variable. In half of the simulations run, the available buffer space was taken to be, on average, u% less than the maximum bandwidth. For the remaining half of the simulations, the buffers were considered to have enough space to accommodate the data.
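For concreteness, the timing used in the experiments can be sketched as follows, with Equation 5 giving the per-node bandwidth. Parameter names mirror this section; the PPN simulator itself is not reproduced here.

```python
import math

# A sketch of the timing model of Section 5: per-node bandwidth (Equation 5)
# and the transfer time of one message of s vector units.

def node_bandwidth(B: float, P: int) -> float:
    """Equation 5: one link per node, h = log2(P) hops on average."""
    return B / math.log2(P)                   # MB/s

def transfer_time(vector_mb: float, s: int, B: float, P: int) -> float:
    message_mb = vector_mb * s                # e.g. s = 5, 1 MB -> 5 MB
    return message_mb / node_bandwidth(B, P)  # seconds

print(node_bandwidth(50, 8))      # ~16.66 MB/s, as in Scenario 1
print(transfer_time(1.0, 5, 50, 8))
```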

5.1. Scenario 1

The parameters and their values for the first scenario are as follows:

(1)

r = 7 and s = 11

(2)

Number of processors: P = 8.

(3)

Vector size: ranging from 128 KB to 2 MB

(4)

Real data volumes distributed: 1408 KB–22 MB

(5)

B = 50, so the bandwidth available is 50/log(8)

(6)

u is 0 or 40 (when u = 0, the buffer space is sufficient)

Figure 8(a) shows the simulation results. The bandwidth available to a processor p is computed as bandwidth_p = 50/log(8) ≈ 16.66 MB/s. The vector size varies from 128 KB to 2 MB. The upper line gives the results on the basis that the buffers have, on average, 40% less capacity than the maximum bandwidth (u = 40), in this case 10 MB. The lower line assumes that the buffers have enough space to accommodate the messages (u = 0). For vector sizes less than 128 KB, the buffers can accommodate the arriving messages, because each sending processor p sends a maximum of 128 × 11 = 1408 KB to a receiver q. Since the maximum number of sending processors towards q is 8, q can receive a maximum of approximately 8 × 1408 KB ≈ 11 MB. As the vector size increases, the gap between the two lines increases.

Figure 8. Experiments for various parameters.

5.2. Scenario 2

In the second scenario, the vector size ranges from 32 to 512 KB and the number of processors is P = Q = 16. The parameters and their values for the second scenario are as follows:

(1)

r = 7 and s = 11

(2)

Number of processors: P = Q = 16.

(3)

Vector size: ranging from 32 to 512 KB

(4)

Real data volumes distributed: 352 KB–5.5 MB

(5)

B = 50, so the bandwidth available is 50/log(16)

(6)

u is 0 or 30

Figure 8(b) shows the simulation results. The bandwidth available to a processor p is computed as bandwidth_p = 50/log(16) ≈ 8.33 MB/s. The upper line gives the results on the basis that the buffers have, on average, 30% less capacity than the maximum bandwidth (u = 30), in this case 5.8 MB. The lower line assumes that the buffers have enough space to accommodate the messages (u = 0). For vector sizes less than 32 KB, the buffers can accommodate the arriving messages, because each sending processor p sends a maximum of 32 × 11 = 352 KB to a receiver q. Since the maximum number of sending processors towards q is 16, q can receive a maximum of approximately 16 × 352 KB ≈ 5.5 MB. As the vector size increases, the gap between the two lines increases. In this scenario, u decreases by 10% compared to the first scenario. This means that the memory can accommodate higher percentages of the total data volume (recall that as the number of processors increases, the available bandwidth drops). Thus, the two lines converge more than in the first scenario.

5.3. Scenario 3

In the third scenario, the vector size ranges from 8 to 512 KB and the number of processors is P = Q = 64. The parameters and their values for the third scenario are as follows:

(1)

r = 7 and s = 11

(2)

Number of processors: P = Q = 64.

(3)

Vector size: ranging from 8 to 512 KB

(4)

Real data volumes distributed: 88 KB–5.5 MB

(5)

B = 50, so the bandwidth available is 50/log(64)

(6)

u is 0 or 10

Figure 8(c) shows the simulation results. The bandwidth available to a processor p is computed as bandwidth_p = 50/log(64) ≈ 6.25 MB/s. The upper line gives the results on the basis that the buffers have, on average, 10% less capacity than the maximum bandwidth (u = 10), in this case 5.6 MB. The lower line assumes that the buffers have enough space to accommodate the messages (u = 0). For vector sizes less than 8 KB, the buffers can accommodate the arriving messages, because each sending processor p sends a maximum of 8 × 11 = 88 KB to a receiver q. Since the maximum number of sending processors towards q is 64, q can receive a maximum of approximately 64 × 88 KB ≈ 5.5 MB. As the vector size increases, the gap between the two lines increases. Again, notice that u decreases by another 20% compared to the second scenario, meaning that the memory can accommodate higher percentages of the total data volume. Thus, the two lines are closer than in the second scenario.

As an observation, one can state that the results the model produces for the three scenarios corroborate each other. In the first scenario, the local processors’ memories could not store a good percentage of the data carried over the network. As u decreases and the number of processors increases (thus reducing the bandwidth and the data volumes distributed), the two lines shown in each of the three graphs converge. The consistency of the results verifies the validity of the model.

6. Conclusions and future research

This paper presents a PN-based model used to verify and evaluate the performance of pipelined parallel distributions. It precisely captures the behavior of a pipeline-based parallel communication system. The model considers message scheduling and message classification, while it is deadlock and contention free. Because it is symmetric, it can easily be used for larger systems with only minor changes. This is one of its biggest strengths.

Future work can include the study of pipelined systems on specific topologies. It is interesting to check whether the proposed model (perhaps with minor changes) can be used to study the performance of a data distribution over a torus or mesh network. Also, as already mentioned in the remark of Section 4, an implementation that can take advantage of any empty segments that occur during the distribution is of particular interest. An idea would be the introduction of superclasses (groups of classes).

Additional information

Funding

The authors received no direct funding for this research.

Notes on contributors

Stavros I. Souravlas

Stavros Souravlas and Manos Roumeliotis are faculty members of the Department of Applied Informatics at the University of Macedonia. They are the authors of three books in the fields of digital logic design, digital systems modeling and simulation with VHDL, and simulation techniques. Also, Stavros Souravlas and Manos Roumeliotis are members of the Computer and Network Systems Technologies (CNST) group at the University of Macedonia. This group conducts research in the fields of digital systems design, parallel and distributed processing, communication networks, hardware description languages, and modeling and simulation of computer and network systems. This work is part of a general research project run by the CNST group on the possible faults that occur when applying parallel algorithms on a multiprocessor network.

References

  • Caron, E., & Desprez, F. (2005). Out-of-core and pipeline techniques for wavefront algorithms. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05). Denver, CO.
  • Chung, Y.-C., Hsu, C.-H., & Bai, S.-W. (1998). A basic-cycle calculation technique for efficient dynamic data redistribution. IEEE Transactions on Parallel and Distributed Systems, 9, 359–377.
  • Desprez, F., Dongarra, J., Petitet, A., Randriamaro, C., & Robert, Y. (1998a). Scheduling block-cyclic array redistribution. IEEE Transactions on Parallel and Distributed Systems, 9, 192–205.
  • Desprez, F., Dongarra, J., Petitet, A., Randriamaro, C., & Robert, Y. (1998b). More on scheduling block-cyclic array redistribution. In Proceedings of the 4th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, volume 1511 of Lecture Notes in Computer Science (pp. 275–287). Pittsburgh, PA: Springer-Verlag.
  • Granda, M., Drake, J. M., & Gregorio, J. A. (1992). Performance evaluation of parallel systems by using unbounded generalized stochastic Petri nets. IEEE Transactions on Software Engineering, 18, 55–70.
  • Jayachandran, P., & Abdelzaher, T. (2007). A delay composition theorem for real-time pipelines. In Proceedings of the 19th Euromicro Conference on Real-Time Systems, ECRTS ’07 (pp. 29–38). Washington, DC: IEEE Computer Society.
  • Kashif, H., Gholamian, S., Pelizzoni, R., Patel, H. D., & Fischmeister, S. (2013, April). ORTAP: An offset-based response time analysis for a pipelined communication resource model. In Proceedings of the 19th Real-Time and Embedded Technology and Applications Symposium (RTAS). Philadelphia, PA.
  • Kaushik, S. D., Huang, C. H., Johnson, R. W. & Sadayappan, P. (1994, July). An approach to communication-efficient data redistribution. In Proceedings of the 8th ACM International Conference on Supercomputing. Manchester.
  • King, C. T., Chou, W. H., & Ni, L. M. (1990). Pipelined data-parallel algorithms: Part I: Concept and modeling. IEEE Transactions on Parallel and Distributed Systems, 1, 470–485.
  • Kuntraruk, J., Pottenger, W. M., & Ross, A. M. (2005). Application resource requirement estimation in a parallel pipeline model of execution. IEEE Transactions on Parallel and Distributed Systems, 16, 1154–1165.
  • Liao, W.-K., Choudhary, A., Weiner, D., & Varshney, P. (2004). Performance evaluation of a parallel pipeline computational model for space-time adaptive processing. Journal of Supercomputing, 31, 137–160.
  • Navarro, A., Asenjo, R., & Tabik, S. (2009, September). Analytical modeling of pipeline parallelism. In Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques. Raleigh, NC.
  • Peterson, J. L. (1977). Petri nets. ACM Computing Surveys (CSUR), 9, 223–252.
  • Preud’homme, T., Sopena, J., Thomas, G., & Folliot, B. (2012). An improvement of openMP pipeline parallelism with the BatchQueue algorithm. In Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS). Singapore.
  • Rodrigues, A., Wheeler, K., & Kogge, P. (2008). Fine-grained message pipelining for improved MPI performance. In Proceedings of IEEE Conference on Cluster Computing. Barcelona.
  • Souravlas, S. I., & Roumeliotis, M. (2004). A pipeline technique for dynamic data transfer on a multiprocessor grid. International Journal of Parallel Programming, 32, 361–388.
  • Souravlas, S. I., & Roumeliotis, M. (2014, March). Verification & performance evaluation of parallel pipelined communications using Petri nets. In IEEE, UKSIM-AMSS, 16th International Conference on Computer Modeling and Simulation (pp. 398–403). Cambridge.
  • Thies, W., Chandrasekhar, V., & Amarasinghe, S. (2007). A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In MICRO’07 (pp. 356–369). Chicago, IL.
  • Zhang, P., & Deng, Y. (2002). Design and analysis of pipelined broadcast algorithms for the all-port interlaced bypass torus networks. IEEE Transactions on Parallel and Distributed Systems, 23, 2245–2253.
  • Zhao, Z., Liu, X., Dou, W., & Yang, K. (2012, October 19--22). Research on data parallel and scheduling mechanism based on Petri nets. In 11th International Symposium on Distributed Computing and Applications to Business, Engineering & Science (DCABES) (pp. 36–39). Guilin.