

Applied Mathematics & Information Sciences An International Journal

# Study on Last-Level Cache Management Strategy of the Chip Multi-Processor

Li Lei\*, An Huiyao and Zhang Peng

School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Received: 9 Aug. 2014, Revised: 10 Nov. 2014, Accepted: 11 Nov. 2014 Published online: 1 Apr. 2015

Abstract: Threads of fair and effective allocation of shared resources limited is a key problem for chip multiprocessors. As the processor core growth in the scale, multi thread for the shared resource limited system competition will become more intense, the performance of the system will also be more significant. In order to alleviate this problem, a fair and effective multi thread shared resources allocation scheduling algorithm is important. In all kinds of shared resources, the largest effect on the system performance is the shared cache and DRAM system. There are essential differences between the last level cache and a cache. The goal of a cache design is to provide fast data processor which requires high access speed. However, the object of the last level cache is to save data in the chip as much as possible, and the access speed requirements are not too high, it is more subject to the plate number of available transistors. Management level cache LRU strategy and its approximate algorithm are not applicable to the large capacity last level cache for traditional. It may cause destructive interference between threads, cache thrashing of stream media program lead, which will lead to a decline in the performance of processor. This paper focuses on the analysis of some hot problems of the last level cache management in the process of the large capacity of multi-core platform sharing, and puts forward the corresponding costs less but the solution became larger to improve the system performance.

Keywords: Chip multiprocessor, LLC, Cache partitioning, Memory access.

### **1** Introduction

With the development of integrated circuit technology, chip multiprocessors (CMP) began to occupy the market [1]. In a CMP on the concurrent execution, CMP multiple cores/threads are commonly shared last level cache (LLC), DRAM controller, memory bus bandwidth, pre-fetch unit and other resources. The system of shared resources are always limited, threads need to compete for the right to the use of shared resources. Different threads of the program characteristics are different, and the thread for the shared resource capacity is usually associated with this [2]. If there is no scheduling mechanism, some threads may take up most or all system resources of the other threads, requesting not to service. If there exists a scheduling strategy which can in multiple parallel threads of fair and effective allocation of resources, it can alleviate the negative effect on the shared resources, which can improve the system performance [3]. With the development of high performance microprocessor, the memory wall problem becomes one of the key factors restricting the microprocessor performance. How to solve the memory wall problem became to the processor in the design process, especially the one of the hot research problems in the design of multi core processor. The gap between the access speed data processing speed and the main memory of the processor led to the processor computing ability because of data delay and greatly restriction [4,5]. In a multi-core processor, due to multiple processor cores also compete for resources, the memory wall problem becomes more prominent, so that the design of cache and cache management strategies limits the performance of multi-core processor to a great extent.

There are many methods to improve the performance of microprocessor; the industry improved the performance steadily by improving the manufacturing process of the microprocessor. Academia by exploring the level parallelism (ILP) and thread level parallelism (TLP) of the two different ways to improve processor performance [6]. The pipeline technique is adopted to increase the performance of microprocessors by improving instruction level parallelism which is a very

<sup>\*</sup> Corresponding author e-mail: lileibt@126.com

effective method. Pipeline technology to an instruction is divided into several levels, each processor in a cycle of executive level, thus to execute multiple instructions in parallel in the same period, which can improve the overall performance of microprocessor. The depth of pipelining is an improved version of assembly line techniques, and this technique further refined water series and reduce the water series operation, so that the processor's frequency increase [7,8,9,10]. The data correlation program itself will not only make the pipeline design with the increasing number of complicated, but also increase the possibility of conflict between different pipelines, resulting in a growing pipeline delay. At the same time, the power consumption of the processor frequency increases sharply, which makes the heat dissipation becomes a difficult problem. Furthermore, the increase will make the line processor frequency fall, which lead to high frequency but low processor [11, 12, 13]. At the same time, the verification with the increase in the number of line becomes more and more difficult. Therefore, the performance of a processor cannot be improved desirable only by improving the processor frequency [14].

With the development of technology, the microprocessor data become faster than access to main memory speed. At present, the microprocessor to access the main memory latency has reached hundreds of clock cycles, and the delay in the future will continue to increase [15]. Therefore, the main memory access becomes one of the key factors that severely limit the performance of microprocessors. Practice shows that the cache can significantly reduce the negative influence brought by the storage wall. But lack of the last level cache or trigger for main memory access, hundreds of cycles lead to processor pause. Therefore, the last level cache hit rate around the whole performance microprocessor [16]. The microprocessor is also commonly used in the traditional sense of the LRU method or the approximate strategy to manage the last level cache, ignore the last level cache characteristics. However, the traditional LRU strategy and without effective use of the last level cache resources cause large number of cache misses, and eventually led to the decline in overall performance microprocessor. If design a suitable management strategy according to the last level cache characteristics, the microprocessor performance will be greatly improved [17].

This study expatiates on the current cache multi-core processors and cache management strategy. This research proposes a new method to eliminate Cache management strategy of low reuse block, predicts the access interval, and validate them by multi-core simulator and multi load, and analyze the performance and overhead. A divide and conquer thread aware Cache management strategy, and validate them by multi-core simulator and multi load, and analyze the performance and overhead. This study introduces a design called low reuse block elimination and re visit interval prediction management strategy. According to the characteristic that last level cache block low reuse resources occupied longer, the strategy adopted by the historical information predict low reuse block and its priority is out of data on a cache aware last level cache. And by improving the access interval prediction technique, it can predict low reuse potential and the first elimination. Experiments show that for multi core processor, this strategy weighted speedup increased an average of 9.1% than the LRU.

# 2 the relevant thoery of cache

# 2.1 The cache hierarchy design

The design of the current mainstream high performance processor adopts multi-level cache structure, which can increase cache layer, and its capacity is increasing, but the access delay is more and more big [18]. The level two cache architecture are the most common processor design process, there are very few processor with three levels cache architecture. The on-chip memory space is limited, and the cost is relatively expensive, therefore, the existing on-chip cache capacity is relatively small [19].

There are many limiting factors in the design process of multi-core processor structure. On one hand the speed gap between processor and memory is growing; on the other hand, the limited bandwidth will increasingly make the program performance depends on the on-chip memory hierarchy [20]. Therefore, designers will design the last level cache design large enough to save more data on chip. Level two cache architecture is widely used in design the on-chip cache. However, the level two cache architecture and management strategy faces many challenges, such as nano-control line delay which due to the difference in level two cache structure caused by the non-uniform cache access (NUCA) [21].

The design of two level caches contain two different design schemes which are private level two cache and shared level two cache. Figure 1 shows the private level two cache structure.

# 2.2 Multiple kernel shared memory architecture

This paper used the shared memory approach for data management. Figure 2 showed the schematic diagram of the multiple kernel shared memory structure. From the map, it can clearly see that it is connected with a bus with four modules, namely the data transmission module (ARM and FIFO), arbitration module (ARBITRATOR) module to judge, state (CHECK) as well as external storage module (SRAM).

The data transmission module is composed of two parts; one part is ARM and the control of DRAM and FLASH, the other part is the FPGA internal FIFO. Using ARM as the kernel of the system, it will store their required data and information which is controlled by



Sharing second-level cache architecture

Fig. 1: The private level two cache structure.



Fig. 2: Schematic diagram of the multiple kernel shared memory structure.

DRAM and FLASH. In memory management, this two parts as a class Cache to be used, stored some commonly used data to improve the efficiency of data processing. When ARM needs memory on chip for read and write operations, it should be realized by FPGA in FIFO.

# 2.3 Cooperative cache and means of communication

This paper used the weighted speedup and throughput fairly to measure the performance of each multi load. The weighted speedup metric is to reduce the system execution time, and the harmonic mean of fairness is the balance between performance and fairness. Application of *i* assuming that  $IPC_i$  is a multi-channel test load of IPC, total, IPC says *i* keep the shared Cache when IPC.

These evaluation functions are as follows:

ŀ

$$IPC_{sum} = \sum_{i=1}^{n} IPC_i \tag{1}$$

663

$$WeightedSpeedup = \sum_{i=1}^{n} (IPC/IPC_{total,i})$$
(2)

$$Fairness_{ipc} = N / \sum_{i=1}^{n} (IPC_{total,i} / IPC_i)$$
(3)

The negative effect on the performance of on-chip wire delay is the main reason for private design. The utilization efficiency of the private space for Cache design is low. Thus, in the optimization of the private design basis for the design of the Cache usually focus on how to improve space and Cache utilization. Cooperative Cache is an optimization strategy for private based design of the representative, its basic structure was shown in Figure 3:

| Р       | L1   | Р  | L1 | Р       | L1 |
|---------|------|----|----|---------|----|
| L2      |      | L2 |    | L2      |    |
| L1<br>P | L2 - |    | E  | L1<br>P | L2 |
| L2      |      |    |    | L2      |    |
| Р       | L1   | Р  | L1 | Р       | L1 |

Fig. 3: Basic structure of Cooperative Cache.

And other private design based strategy is that in the mechanism of CC, except L1 Cache, each processor core and a close from their own local L2 Cache, which can get delayed hits like private Cache. At the same time each processor local L2Cache through a set of Chinese style consistency engine (CCE). The private Cache cooperation, aggregated into a shared Cache, dirty blocks are expelled from private Cache which can be placed adjacent to the processor's private L2 Cache. It can occupy a space equivalent to the shared Cache as large volume, reduce the failure rate of Cache, and reduce the off chip memory access. According to dynamic load, the number of control cooperation, can be in private and shared reasonably choosing between two Cache extreme cases which is to adapt to the dynamic behavior of the load.

#### 2.4 Cache access pattern classification

Recent researches have shown that different load visit to Cache in different period of storage behavior is different, and specific Cache access mode can be divided into the following four:

Cache friendly application access patterns: this kind of access patterns in a short period of time to repeat visits to

a section of the same data block; in the access mode the data block access to the time interval is short [22,23].

Bumpy access mode: this mode of periodic access to a fixed length block of data, but the data block length is greater than the number of data blocks that contain Cache in size, so in this access mode Cache hit rate is low.

Streaming access mode: in this access mode when a block of data is accessed, it will no longer be access; in the access mode cache hit rate is very low.

Hybrid access mode: hybrid access model is a complex of the above several access modes [24].

In order to better explain the above different access mode,  $A_i$  said in the cache group the *i* cache line address,  $(A_1, A_2, \dots, A_{k-1}, A_k), (A_1, A_2, \dots, A_{k-1}, A_k)$  said the temporary access to cache block sequence, (A1, A2, ...  $A_{k-1}$ ,  $A_k$ <sup>N</sup> said on the sequence of continuous access to N. When k cache size, the accessed data are all in the chip, at this time, the access mode is cache friendly application access patterns, and when k > cache size, load working set is larger than the number of data blocks that contain the size of the cache, in the LRU management strategy will be accessed because the data turbulence caused 0 hit rate. When  $k = \infty$ , processor accessed data will never be accessed again, this time for streaming access mode. (A<sub>1</sub>, A<sub>2</sub>, ...  $P_t(B_1, B_2, ..., B_{k-1}, B_k)^M$ , ...  $A_{k-1}$ ,  $A_k$ ) said the hybrid access model, where  $P_t$ represents a probability of  $(B_1, B_2, \dots, B_{k-1}, B_k)^M$  access sequence, where  $M \ge 1$ ,  $l \le cache size$ ,  $k = \infty$ .

# **3** The Optimization of Cache management strategy

# 3.1 CMP stage cache optimization technology classification

The above some Cache optimization technologies such as target application, design, optimization and so on have carried on the simple classification, as shown in **Table 1**:

It can be seen from Table 1, as private design basis for the design of the Cache due to its fast hit access speed, the optimization is focus on how to improve the Cache space utilization rate. Because sharing design has high space utilization rate, the optimization to share design is a design basis of strategy which tend to focus on two points:

1) How to reduce the influence of on-chip wire delays to improve the hit access speed;

2) How to solve a Cache space problems caused by multiple threads in CMP system sharing; these problems include how to reduce running time of total thread, and how to ensure fairness, priority support, such as QoS.

| Table 1: Target | application, | design, | optimization | technology | of |
|-----------------|--------------|---------|--------------|------------|----|
| Cache           |              |         |              |            |    |

| Target<br>Application                                                     | Design<br>basis   | Optimization<br>objective                                       | cache optimization<br>technology                                                                                                                          |
|---------------------------------------------------------------------------|-------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Single-<br>threaded<br>applications<br>Multi-<br>threaded<br>applications | Private<br>design | To reduce the failure rate                                      | Cooperative<br>Cache(CC, [8])<br>Dynamic spillover<br>receiver(DSR, [9])<br>Distributed cooperation<br>Cache [20]                                         |
|                                                                           | Shared<br>design  | Improve the hit access speed                                    | NUCA [21]<br>NuRAPID [22]<br>CMP-NuRAPID [3]<br>Expel block<br>copy(VR, [4])                                                                              |
|                                                                           |                   | Improve the hit<br>access speed<br>To reduce                    | Expel block<br>migration(VM, [5])<br>Selective replication                                                                                                |
| Multi-<br>threaded<br>applications                                        | Shared<br>design  | To reduce<br>the failure rate                                   | (ASR, [2])<br>Static optimal<br>partition [7]<br>Path Partition [8]<br>Utilization of<br>Cache Partition(UCP, [2])<br>ASP-NUCA [3]<br>Elastic cooperation |
|                                                                           |                   | Fairness                                                        | Cache(ECC, [14])<br>Fair Cache [15]<br>Real property<br>equity Cache [9]                                                                                  |
|                                                                           |                   | QoS, Priority<br>support<br>Fairness, QoS,<br>Priority support, | CQoS [15]<br>Cooperative partition                                                                                                                        |
|                                                                           |                   | To reduce the failure rate                                      | (CCP, [8])                                                                                                                                                |

# 3.2 Multi core interconnection structure and data elimination strategy

When ARM writes operations, ARM writes after the data is ready and written directly for that piece of FIFO prepared in FPGA. When a packet data in FPGA, ARM are written request and header information; FPGA received the information, advanced it into the arbitration module, judging whether it satisfies the arbitration the results, if not meet, you will wait, if meet, it will enter the state of judging module; if the condition judgment module does not meet the state you would wait satisfy you to write data to the SRAM; when a packet of data was written to the ARM interrupt, notify the already finished, ARM revoke write request, this is the end of write operation. The writing process was shown in Figure 4 (a).

Read operation and write operation are very similar. Both of them contain the following steps: sending a read request first, when meet the arbitration and judgment module conditions, it will read the data into the FIFO from the SRAM, and transmits the read interrupt, after ARM received read interrupt, ARM began to read data, when the data finished reading ARM will the request for revocation. The read operation is shown in Figure 4 (b).

In this circuit, the complex is arbitration and judgment module. The two modules can realize many methods. In order to enhance the reliability of the circuit and improve the speed, we use slot way to design the circuit. Because the data packet size is determined, it can determine a





Fig. 4: The writing and reading process.

packet of data transmission time, and this time as a time slot, so you can make the circuit efficiency operation.

The architecture of a complete NoC is composed of several NoC nodes. Each node is connected to a resource and four adjacent. Generally speaking, the node is composed of three parts including the exchanger (Switch), network interface (NI or RNI) and cyber source (Resource). Among them, the switch structure is more complex which has strong function, and it can also be called as a router (Router). Figure 5 shows the overall structure of a NoC diagram. Each switch is connected with a resource and four adjacent switches; and each resource is connected to a switch through a RNI. The resource can be a processor core, memory, a FPGA, a custom hardware module, or any other IP that can be inserted into the slot and can and network interface (intellectual property) module. In terms of current technology is relatively new, especially in the multi processor system more complex, resource is a processor core. Switch and switch, switch and resource are connected by a pair of input and output channels. The channel is composed of two unidirectional point-to-point links.

# *3.3 The load characteristics of multi-core platform*

Different loads in the different period of the behavior of memory access behavior are not the same; the type of load of concrete can be divided into the following three types: the efficient use of load, inefficient use of load and load saturation effect. Inefficient use of load action point and the stream media program are similar. Its working set is larger than the capacity of the Cache, even if all the resources are allocated to the load, which cannot improve processor performance to a large extent neither. Very few of the utility and saturated loads even for cache resources can also show good performance. The efficient use of load



Fig. 5: The overall structure of a NoC diagram.

is between these two kinds of load, its performance can reach and resources are to a year-on-year growth trend.

Figure 6 gives some of the test cases SPEC2000 and SPEC2006 reached on different Cache allocation scheme under the lack of shared last level rate. Among them, the loss rate is in a 1MB 16 way set associative access to two levels Cache 4 Nuclear simulation platform.



**Fig. 6:** The test cases SPEC2000 and SPEC2006 reached on different Cache allocation scheme.

From Figure 6, we can see the efficient use of load, inefficient use of load and saturation utility load differently loss behavior. As can be seen from the way, applu is an inefficient use of load, from Cache resources among its 2 Road 16 Road, its performance is not improved much remained stable in a high loss rate. Ammp is a saturated with load, even given it a little Cache resource and its absence rates remain low. While the twolf is a high performance with load, the performance increases with the growth of resources allocated to it. As can be seen from the graph, from its assigned resources to the 2 road 16 road, the loss rate in gradually decreasing, performance improving.

If several of the efficient use of a multichannel load combinations of load, or inefficient use of load and the efficient use combine together to form a multi load combination load, the interfere with how different load can be eliminated, and the problem must be solved because its working set is larger than the capacity of Cache can occur Cache bumps.

### *3.4 The divide and conquer strategy framework*

When new data is loaded into the LLC, it will be placed in the group of set-tail, the NRU bit is set to  $2^M - 2$ , the whether-used bit will be set to 0.

Poisson distribution, using the formula:

$$P(X=x) = \frac{\lambda^x}{x!} e^{-\lambda} \quad (x=0,1,\ldots,\lambda>0)$$
(4)

Type: *x* — arriving customer number;

 $\lambda$  — A unit of time average – "customer" reached number.

Probability density:

$$P(T=t) = (\mu - \lambda)e^{-(\mu - \lambda)t} \quad (t \ge 0, \mu > \lambda) \quad (5)$$

T — Business hours;

 $\mu$  — The average service unit of time the number of customers;

 $\lambda$  — Average unit in the arrival time of the number of customers.

When the data was hit, there are two kinds of strategies. The one is hit priority (HP); the other is the frequency (FP). In the first case hit, hit a block NRU was set to 0; while in the frequency priority cases, hit a block NRU is based on the original 1; in these two cases, hit block is placed into the group of set-tail bi.

Figure 7 is a level two contains the Cache system structure, a structure with 2 ways set associative L1 Cache and 4 ways set associative LLC. In the beginning, there is no data of L1 Cache and LLC, the chart is access from C data. The arrows represent the direction is from the cache group of set-head to set-tail direction, which also indicated that the LRU strategy of LRU chain, from MRU to LRU direction.

# 3.5 *The cache verification design and simulation platform*

Network on chip is composed of computer network development. However, compared with the computer network, the use of network on chip resource has a very high demand, because it directly affects the network on chip area and power consumption. Therefore, the design requirements of the on-chip network are to improve resource utilization as much as possible. In the routing node structure of the classic, the size of the cache for NoC resources plays a decisive role.

For this system, it needs to use the queuing model to determine the size of the cache. Each FIFO can be



Next

Fig. 7: A level two contains the Cache system structure.

considered as a queue. Because the topology and routing algorithm are different, there is a great difference in the modeling method, no matter what kind of routing structure based on queuing model for modeling and calculation in the calculation of the cache size.

The theoretical basis is the queuing theory queuing model. Queuing theory, stochastic service system theory are mathematical theories to study the relationship between the "service" system, "service" and "demand", which is the operations research in probability theory, an important branch of based.

# **4 Experimental Results**

### 4.1 The performance speedup

Figure 8 shows a DRRIP, on a single core case DIP and DELRRIP three Cache management strategy throughput speedup all data based on the traditional LRU strategy is severed as a benchmark; in the given graph of all DELRRIP are obviously beyond LRU, throughput speedup the ensemble mean value is 2.72%; at the same time, compared with DIP and DRRIP strategy, DELRRIP LLC increased by 1.76% and 1.24% respectively.

Figure 9 shows the TADIP, TADRRIP and TADELRRIP three cache management strategy throughput speedup, all data based on the traditional LRU strategy is severed as a benchmark; in the given graph, the 10 load TADELRRIP obviously beyond LRU, throughput speedup the ensemble mean value is 12.85%; at the same time compared with TADIP and TADRRIP strategy TADELRRIP, the average performance of LLC increased by 5.98% and 5.36% respectively.



Fig. 8: DRRIP, DIP and DELRRIP cache management strategy throughput speedup.



Fig. 9: TADIP, TADRRIP and TADELRRIP cache management strategy throughput speedup.

### 4.2 The size of the cache under Tours structure

Each load in the share of the group can manage their own data blocks, which can eliminate the inter thread interference. Figure 10 gives the main frame of TADC strategy; the simulation platform is a 16- set associative shared two level Cache4 simulation platform. TADC will be the first 16 Cache which were divided into 4 groups. Each group contains 4 Cache block, and then the 4 groups are mapped to different application loads. From the load point of view, each of the divided group can be regarded as a 4 way set associative Cache. In this case, TADC S1 S2, S3 and S4 were mapped to load P1, P2, P3 and P4. If the P1 is a high performance with load, and P3 is an inefficient use of the load, they were in the S1 and S3 data processing block their own, so P3 cannot interfere with P1. Thus, it can eliminate the problem of interference thread.

The average waiting in the queue number or average queue length:

$$L_g = \frac{\lambda^2}{\mu(\mu - \lambda)} \tag{6}$$

In the system the average waiting queue length:

$$L = \frac{\lambda}{\mu - \lambda} = L_q \frac{\lambda}{\mu} \tag{7}$$

Idle rate:

$$1 - \rho = 1 - \frac{\lambda}{\mu} \tag{8}$$

$$P_n = \left(\frac{\lambda}{\mu}\right)^n \left(\frac{1-\lambda}{\mu}\right) \tag{9}$$

TADC strategy mainly consists of three sub modules. This section first introduces divide and conquer strategy, and then describes thread behavior perception algorithm, finally will elaborates on how to divide and conquer strategy and threading behavior perception strategy together, and how to form a complete strategy to manage the shared last level cache. Described methods, examples of this section will give a TADC to better describe the strategy.



Fig. 10: The main frame of TADC strategy.

Assume that each queue is isomorphic, and the injection rate of the same, can draw type:

$$\lambda^2 + \mu L_q \lambda - L_q \mu^2 = 0 \tag{10}$$

According to the formula of square root, giving invalid root, which can draw:

$$\lambda = \frac{\mu(\sqrt{L_q^2 + 4L_q} - L_q)}{2} \tag{11}$$

#### 4.3 Sensitivity analysis of cache capacity

LRU and TADERRIP strategies under the average every one thousand instruction number (MPKI), where LLC capacity are 1MB, 2MB, 4MB and 8MB, and in the LLC volume change of circumstances, LLC is always 16 way set associative. As can be seen from the graph, the TADELRRIP control strategy, LLC capacity increased from 1MB to 2MB, 2MB to 4MB and 4MB to 8MB; during these processes, MPKI increased 39.23%, 43.81% and 48.31% respectively. Therefore, TADELRRIP has good expansion capacity. TADELRRIP hardware overhead was shown in Table 2.

#### 4.4 Experimental results and analysis

4.4.1 Threshold analysis

Thresholdlow and Thresholdsat values are based on experimental experience. They are 1024 and 64,

667

Table 2: TADELRRIP hardware overhead

| Overhead source                                                | figure |
|----------------------------------------------------------------|--------|
| The increase additional domain in each row of two-level cache  | 3bits  |
| The total number of rows of cache                              | 32K    |
| Total overhead additional domain in two-level cache            | 12KB   |
| The saturated counter overhead                                 | 36bits |
| The increase additional domain in each row of one- level cache | 1bit   |
| The total number of rows of cache                              | 1KB    |
| Total overhead additional domain in one-level cache            | 0.5KB  |
| Increased LLC area percentage of TADELRRIP                     | 0.58%  |
| Increased area percentage of TADELRRIP in one-level cache      | 0.2%   |

respectively. We will analyze the quantitative performance based on the two values.



Fig. 11: The effect of different Thresholdlow value on the performance of Cache.

Figure 11 shows the effect of different Thresholdlow values on the performance of Cache. It can be seen from the figure that the value of Thresholdlow changes from 512 to 2048, only the processor in 1024 near the time of peak performance, so the Thresholdlow value of 1024 is reasonable.

Figure 11 shows the effect of different Thresholdsat values on the performance of Cache. It can be seen from the figure that the value of Thresholdsat changes from 32 to 128, only the processor in 128 near the time of peak performance, so the Thresholdsat value of 128 is reasonable.

### 4.4.2 Throughput analysis

Figure 12 shows the TADC, DCB and TADIP three Cache management strategy throughput speedup, all data based on the traditional LRU strategy is severed as a benchmark; given by the graph of the 10 load TADC obviously beyond the LRU, for the 2 core processor, throughput speedup the ensemble mean value is 7.48%; at the same time compared with DCB and TADIP strategy, the average performance of TADC LLC increased by 4.46% and 3.87% respectively. For the 4 core processor, throughput speedup the ensemble mean value is 2.95%; at the same time compared with DCB and TADIP strategy, the average performanceTADC LLC can be increased by 1.77% and 1.37% respectively.



**Fig. 12:** The TADC, DCB and TADIP three Cache management strategy throughput speedup.

#### 4.4.3 Scalability analysis

Figure 13 shows a LRU and TADC two kinds of strategies under the average throughput of 4 Nuclear multiprogramming speedup, where LLC capacity are 1MB, 2MB and 4MB. And in the LLC volume change of circumstances, LLC is always 16 way set associative. As can be seen from the graph, the TADC control strategy, LLC capacity increased from 1MB to 2MB and 2MB increased to 4MB; process throughput increased 10.10% and 20.19%, respectively. In the control strategy of TADC, LLC capacity increased from 1MB to 2MB and 2MB increased to 4MB, process throughput increased 13.97% and 24.98%, respectively. Therefore, TADC has good expansion capacity.

Total IPC Throughput For Quad-Core Workloads (Normalized to 1MB LRU)



Fig. 13: LRU and TADC two kinds of strategies under the average throughput.

#### **5** Conclusions

Chip multi processor final Cache optimization technology is the current hot spot in microprocessor research. Aiming at the problems of design of multi processor Cache



system at present, researchers have proposed many advanced methods and techniques. This paper expounds the basic ideas and features of these methods and technologies, and carried on the comparative analysis to them. In the future, all kinds of stage Cache optimization and design techniques mentioned in this paper will be an important technology which worth all of Cache system CMP and reference by designers. Because the processor computing faster than main memory access speed is two orders of magnitude, it will cause the processor due to the lack of timely access to data in main storage and long time waiting. With the development of semiconductor and multi-core technology, the memory wall problem will be more and more obvious. The cache mechanism to reduce costly for main memory access is generally used in modern microprocessor design. But lack of the last level cache access behavior will cause the main memory, the processor stall hundreds of clock cycles to obtain the required data.

### References

- Tiejun Wang, Proakis J.G., Masry, E., Zeidler J.R. "Performance degradation of OFDM systems due to Doppler spreading", IEEE Transactions on Wireless Communications, 5, 1422 – 1432, 2006.
- [2] T. Fusco and M. Tanda. "ML frequency offset and carrier phase estimation in OFDM systems with noncircular transmissions," in Proc. EUSIPCO 2004, 897–900, 2004.
- [3] Changyong Shin, Robert W. Jr., Edward J. Powers. "Blind Channel Estimation for MIMO-OFDM Systems", IEEE Transactions on Vehicular Technology, 56, 670-685, 2007.
- [4] C. Dubuc, D. Starks, T. Creasy, and Y. Hou. "A MIMO-OFDM prototype for next-generation wireless WANs", IEEE Commun. Mag., 42, 82–87, 2004.
- [5] A. van Zelst and T. C. W. Schenk. "Implementation of a MIMO OFDM-based wireless LAN system," IEEE Trans. Signal Process., 52, 483–494, 2004.
- [6] G. L. Stüber, J. R. Barry, S. W. Mclaughlin, Y. Li, M. A. Ingram, and T. G. Pratt. "Broadband MIMO-OFDM wireless communications", Proc. IEEE, 92, 271–294, 2004.
- [7] Wang Xin, Tan Zhen-hui, Chen Xia. "Doppler diversity receiver for broadband wireless OFDM system under high-speed mobile environments", IEEE International Symposium on Microwave, Antenna, Propagation and EMC Technologies for Wireless Communications 2005, 2, 1444 – 1447, 2005.
- [8] Chen Donghua and Qiu Hongbing. Doubly selective channel estimation for subband and OFDMA using basis expansion models.
- [9] Q. H. Spencer, A. L. Swindlehurst, M. Haardt. "Zero-Forcing Methods for Downlink Spatial Multiplexing in Multiuser MIMO Channel", IEEE Transactions on Signal Processing, 52, 461-471, May 2004.
- [10] M. Sadek, A. Tarighat, A. H. Sayed, "A Leakagebased Precoding Scheme for Downlink multi-user MIMO Channels", IEEE Transactions on Wireless Communications, 26, 1505-1515, 2008.

- [11] A. Tarighat, M. Sadek, A. H. Sayed, "A multi User Beamforming Scheme for Downlink MIMO Channels based on Maximizing Signal-to-Leakage Ratios", IEEE International Conference on Acoustics, Speech, and Signal Processing, 1129-1132, 2005.
- [12] J.van de Beek, O. Edfors, M. Sandell, S. Wilson, P. Borjesson, "On Channel Estimation in OFDM System", in Proceedings of the IEEE Vehicular Technology Conference, 815-819, 1995.
- [13] K.Wong, R. Cheng, K. B. Letaeif, R. D. Murch, "Adaptive antennas at the mobile and base stations in an OFDM/TDMA system", IEEE Transactions on Communications, 49, 195-206, 2001.
- [14] M. Sadek, A. Tarighat, A. H. Sayed, "Active Antenna Selection in multi-user MIMO Communications," IEEE Transactions on Signal Processing, 55, 1498-1510, 2007.
- [15] Zhu Yazhou, Zheng Guoxin, Rui Yun, Li Mingqi, "A Novel Distributed Precoding Scheme Based on THP for Downlink Multi-Cell Multi-User OFDMA Wireless Systems", IJACT: International Journal of Advancements in Computing Technology, 5, 213-220, 2011.
- [16] Raore Soungalo, Li Renfa and Zeng Fanzi, "Evaluating and Improving Wireless Local Area Networks Performance", IJACT: International Journal of Advancements in Computing Technology, 3, 156-164, 2011.
- [17] 3GPP TS 36.300, "Evolved Universal Terrestrial Radio Access (E-UTRA) and Evolved UTRA (E-UTRA)", Dec. 2008. V8.0.0.
- [18] L Tong, G Xu, B Hassibi, and T Kailath. Blind channel estimation based on second-order statistics: a frequencydomain approach. IEEE Trans. Inform. Theory, **41**. Jan 1995. 329-334.
- [19] Y Zhao, A Huang. A novel channel estimation methods for OFDM mobile communication systems based on pilot signals and transform domain processing [A]. inPro. IEEE 47th Vehicular Technology Conference [C], Phoenix, USA, 1997, 2089-2093.
- [20] Theodore S. Rappaport etc. "Wireless Communications Principles and Practice. Publishing House of Electronics Industry. Mar. 2003.
- [21] Sinern Estimation Coleri, Mustafa Ergen, Anuj Puri, Ahmad Bahai. A Study of Channel in OFDM Systems. IEEE VTC, Vancouver, Canada. September, 2002.
- [22] D. B. Van, O. Edfors, and M. Sandle. On channel estimation in OFDM systems, Proc. IEEE Vehic.Tech.Conf, 1999. 815-819.



### Li Lei

was born in Jiangsu province, China in 1980. She is currently an Ph.D Candidate of the School of Electronics Engineering and Computer Science at Peking University, China. Her research interests include SoC Design and information security.



### An Huiyao



was born in Hunan province, China in 1972. He is currently professor associate an of the School of Electronics Engineering and Computer Science at Peking University, China. His research interests include wireless communications, mobile



### Zhang Peng

was born in Hebei province, China in 1972. His research interests include Information Security.

networks, and information security.