How does the processor cache handle large memory objects?

Scenario

  • Cache size (L1) (CS): 32kB
  • Line size (LS): 64B
  • Associativity (A): 8
  • Set size (SS): 512B (A * LS)
  • Sets (S): 64 (CS / SS)
  • The object read/written (O) is larger than LS
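For concreteness, the derived values above follow from the three basic parameters. A minimal C sketch of the arithmetic (the macro names are mine, chosen for illustration):

 #include <stdio.h>

 /* Scenario parameters (hypothetical names, not from any real API). */
 #define CS (32 * 1024)  /* L1 cache size: 32kB   */
 #define LS 64           /* line size: 64B        */
 #define A  8            /* associativity: 8 ways */

 int main(void)
 {
     int ss = A * LS;    /* set size: 8 * 64 = 512B */
     int s  = CS / ss;   /* sets: 32768 / 512 = 64  */
     printf("SS = %dB, S = %d\n", ss, s);
     return 0;
 }
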
Assumptions (correct me if any of these are wrong):

  • Virtual memory blocks (4kB (SS * A), denoted B) are mapped to sets in a modulo fashion. In other words, addresses 0x0 : 0xFFF (block index (BI) 0) map to 0, 0x1000 : 0x1FFF (BI 1) map to 1, etc.
  • A read/write request (non-temporal reads/writes are not used) for a given address A needs its BI determined and is then mapped to the assigned set. For example, A = 0x4600A would have BI = 70, so this BI maps to set 6 (BI % S).
  • A correct (no offset) read/write of object O requires alignment to LS for caching.

Questions

  • Will O be stored contiguously in the cache, or can it occupy (for example) free slots 0, 4 and 5 instead of 0, 1 and 2?
  • What is the cost (penalty) of fetching the partitioned O from the cache? Suppose O is not split across several B.
  • Same question as above, but in the case where O spans two B, so two sets are used.
  • What happens if O is larger than SS (512B)? Will data be moved to L2 and then staged into L1? Will other sets be used?
  • What if L2 (and L3 for that matter) is too small for all the data?


1 answer


Virtual memory blocks (4kB (SS * A), denoted B) are mapped to sets in a modulo fashion. In other words, addresses 0x0 : 0xFFF (block index (BI) 0) map to 0, 0x1000 : 0x1FFF (BI 1) map to 1, etc.

Transfer between the L1 cache and the memory hierarchy: The unit of transfer between the L1 cache and the next level of the memory hierarchy is a line-sized (LS-byte) block. That is, from your L1 cache's point of view, memory is structured as 64-byte (LS-byte) blocks.

Correspondence between memory blocks and cache sets: Consecutive blocks of memory map to cache lines in consecutive sets. Therefore, block 0 (addresses 0x0000 : 0x003F) maps to a cache line in set 0, block 1 (addresses 0x0040 : 0x007F) maps to a cache line in set 1, and so on.


A read/write request (non-temporal reads/writes are not used) for a given address A needs its BI determined and is then mapped to the designated set. For example, A = 0x4600A would have BI = 70, so this BI maps to set 6 (BI % S).

The correct procedure for finding the block index (BI) and the set index (SI) is as follows:

 BI = A >> log2(LS) = 0x4600A >> 6 = 0x1180
 SI = BI & (S-1) = 0x1180 & 0x3F = 0x0000
 (when S is a power of two, BI & (S-1) = BI mod S)
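As a runnable version of the same decomposition (a sketch in C using this scenario's parameters; the function name is mine):

 #include <stdint.h>
 #include <stdio.h>

 #define LS_BITS 6   /* log2(64): bits of the line offset */
 #define S       64  /* number of sets                    */

 /* Split an address into block index, set index and line offset. */
 static void decompose(uint64_t a)
 {
     uint64_t bi  = a >> LS_BITS;               /* block (line) index         */
     uint64_t si  = bi & (S - 1);               /* set index (S a power of 2) */
     uint64_t off = a & ((1u << LS_BITS) - 1);  /* byte offset within a line  */
     printf("A = 0x%jx -> BI = 0x%jx, SI = %ju, offset = %ju\n",
            (uintmax_t)a, (uintmax_t)bi, (uintmax_t)si, (uintmax_t)off);
 }

 int main(void)
 {
     decompose(0x4600A);  /* prints BI = 0x1180, SI = 0, offset = 10 */
     return 0;
 }
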


A correct (no offset) read/write of object O requires alignment to LS for caching.



That is not required. O does not have to be block-aligned.


Q1. Will O be stored contiguously in the cache, or can it occupy (for example) free slots 0, 4 and 5 instead of 0, 1 and 2?

The blocks of O will be stored in consecutive sets, at cache-line granularity (sets k, k+1, ..., S-1, 0, 1, ...).
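To make the wraparound concrete, here is a small sketch (the 640B object and its starting address are my own example): reading the object line by line touches sets 60..63 and then wraps to sets 0..5.

 #include <stdio.h>

 #define LS 64
 #define S  64

 int main(void)
 {
     /* Hypothetical 640B (10-line) object whose first line lands in
        set 60, so the set index wraps: 60, 61, 62, 63, 0, 1, ..., 5. */
     unsigned long start = 60UL * LS;
     for (unsigned long a = start; a < start + 640; a += LS)
         printf("line at 0x%lx -> set %lu\n", a, (a / LS) % S);
     return 0;
 }
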

Q2. What is the cost (penalty) of fetching a partitioned O from the cache? Suppose O is not split across several B.

Q3. Same question as above, but in the case where O spans two B, so two sets are used.

I assume you are interested in the cost of the processor reading the words of O from the cache. Assuming O is accessed sequentially, the number of cache hits will equal the number of referenced words. I think the cost is independent of whether the blocks reside in the same set or in different sets (at least in a multiported cache).
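One way to put numbers on that is the following sketch, under my own assumptions (cold cache, 8-byte words, sequential access, one miss on the first touch of each line):

 #include <stdio.h>

 #define LS        64  /* line size in bytes         */
 #define WORD_SIZE 8   /* assumed word size in bytes */

 int main(void)
 {
     unsigned size   = 1024;                 /* hypothetical object size   */
     unsigned words  = size / WORD_SIZE;     /* 128 word references        */
     unsigned misses = (size + LS - 1) / LS; /* first touch per line: 16   */
     unsigned hits   = words - misses;       /* remaining 112 accesses hit */
     printf("references = %u, misses = %u, hits = %u\n", words, misses, hits);
     return 0;
 }
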

Q4. What happens if O is larger than SS (512B)? Will data be moved to L2 and then staged into L1? Will other sets be used?

Q5. What if L2 (and L3 for that matter) is too small for all the data?

If a block must be allocated in a set that has no free cache lines, then a block to evict (the victim block) must be selected. The replacement policy chooses the victim block according to an algorithm (LRU, pLRU, random).
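For illustration, a minimal LRU victim choice over one 8-way set (a sketch; real hardware usually implements a cheaper pseudo-LRU):

 #include <stdio.h>

 #define WAYS 8

 /* Per-way state for one set: age[w] grows since way w was last used. */
 struct set { int valid[WAYS]; unsigned age[WAYS]; };

 /* Pick the way to evict: any invalid way first, else the oldest (LRU). */
 static int pick_victim(const struct set *s)
 {
     int victim = 0;
     for (int w = 0; w < WAYS; w++) {
         if (!s->valid[w]) return w;  /* free line: no eviction needed */
         if (s->age[w] > s->age[victim]) victim = w;
     }
     return victim;
 }

 int main(void)
 {
     struct set s;
     for (int w = 0; w < WAYS; w++) { s.valid[w] = 1; s.age[w] = (unsigned)(WAYS - w); }
     printf("victim way = %d\n", pick_victim(&s));  /* way 0 is oldest here */
     return 0;
 }

So an O larger than what a given level can hold simply streams through that level line by line, with earlier lines evicted as later ones are filled.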
