| 1 | NTSC PPU timing |
| 2 | by Samus Aran (livingmonolith@hotmail.com) |
| 3 | date: Sept. 25th, Y2K |
| 4 | |
| 5 | This weekend, I set up an experiment with my NTSC NES MB & my PC so's I could |
| 6 | RE the PPU's timing. What I did was (using a PC interface) analyse the |
| 7 | changes that occur on the PPU's address and data pins on every rising & |
| 8 | falling edge of the PPU's clock. I was not planning on removing the PPU from |
| 9 | the motherboard (yet), so basically I just kept everything intact (minus the |
| 10 | stuff I added onto the MB so I could monitor the PPU's signals), and popped |
| 11 | in a game, so that it would initialize the PPU for me (I used DK classics, |
| 12 | since it was only taking something like 4 frames before it was turning on the |
| 13 | background/sprites). |
| 14 | |
| 15 | The only change I made was taking out the 21 MHz clock generator circuitry. |
| 16 | To replace the clock signal, I connected a port controlled latch to the |
| 17 | NES's main clock line instead. Now, by writing a 0 or a 1 out to a PC ISA |
| 18 | port of my choice (I was using $104), I was able to control the 21 MHz |
| 19 | clockline of the NES. After I would create a rise or a fall on the NES's |
| 20 | clock line, I would then read in the data that appeared on the PPU's address |
| 21 | and data pins, which included monitoring what PPU registers the game |
| 22 | read/wrote to (& the data that was read/written). |
| 23 | |
| 24 | My findings: |
| 25 | |
| 26 | - The PPU makes NO external access to name or character tables, unless the |
| 27 | background or sprites are enabled. This means that the PPU's address and |
| 28 | data busses are dead while in this state. |
| 29 | |
| 30 | - Because the PPU's palette RAM is internal to it, the PPU has multiport |
| 31 | access to it, and therefore, instant access to it at all times (this is why |
| 32 | reading palette RAM via $2007 does not require a throw-away read). This is |
| 33 | why the PPU never puts the palette address on its bus while a scanline is |
| 34 | being rendered; it's simply unnecessary. Additionally, when the |
| 35 | programmer accesses palette RAM via $2006/7, the palette address accessed |
| 36 | actually does show up on the PPU's external address bus, but the PPU's /R & |
| 37 | /W flags are not activated. This is required to prevent writing over name |
| 38 | table data falling under the appropriate mirrored area. I don't know why |
| 39 | Nintendo didn't just devote an exclusive area for palette RAM, like it did |
| 40 | for sprite RAM. |
| 41 | |
| 42 | - Sprite DMA is 6144 clock cycles long (or in CPU clock cycles, 6144/12 = 512). |
| 43 | 256 individual transfers are made from CPU memory to a temp register inside |
| 44 | the CPU, then from the CPU's temp reg, to $2004. |
| 45 | |
| 46 | - One scanline is EXACTLY 1364 cycles long. In comparison to the CPU's |
| 47 | speed, one scanline is 1364/12 (about 113.67) CPU cycles long. |
| 48 | |
| 49 | - One frame is EXACTLY 357368 cycles long, or EXACTLY 262 scanlines long. |
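The cycle counts above can be cross-checked with a little arithmetic. This is only a sketch: the constant and function names are made up for illustration, and the divide-by-12 ratio between the 21 MHz clock and the CPU clock is taken from the figures quoted in the findings.

```python
from fractions import Fraction

# Figures quoted in the findings above (all in 21 MHz clock cycles).
MASTER_PER_CPU = 12        # clock cycles per CPU cycle
SCANLINE_CYCLES = 1364     # clock cycles per scanline
SCANLINES_PER_FRAME = 262  # scanlines per frame
SPRITE_DMA_CYCLES = 6144   # clock cycles per sprite DMA

def to_cpu_cycles(master_cycles):
    """Convert clock cycles to CPU cycles as an exact fraction."""
    return Fraction(master_cycles, MASTER_PER_CPU)

frame_cycles = SCANLINE_CYCLES * SCANLINES_PER_FRAME

dma_cpu = to_cpu_cycles(SPRITE_DMA_CYCLES)      # a whole number of CPU cycles
scanline_cpu = to_cpu_cycles(SCANLINE_CYCLES)   # not a whole number
frame_cpu = to_cpu_cycles(frame_cycles)         # not a whole number either
```

Keeping the results as fractions makes it obvious that a scanline (341/3 CPU cycles) does not line up with a whole number of CPU cycles, while a sprite DMA (512 CPU cycles) does.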
| 50 | |
| 51 | |
| 52 | Sequence of pixel rendering |
| 53 | --------------------------- |
| 54 | |
| 55 | External PPU memory is accessed every 8 clock cycles by the PPU when it's |
| 56 | drawing the background. Therefore, the PPU will typically access external |
| 57 | memory 170 times per scanline. After the 170th fetch, the PPU does nothing |
| 58 | for 4 clock cycles (except in the case of a 1360 clock cycle scanline (more |
| 59 | on this later)), which brings the scanline to 1364 cycles (170 * 8 + 4). |
| 60 | |
| 61 | accesses |
| 62 | -------- |
| 63 | |
| 64 | 1 thru 128: |
| 65 | |
| 66 | 1. Fetch 1 name table byte |
| 67 | 2. Fetch 1 attribute table byte |
| 68 | 3. Fetch 2 pattern table bitmap bytes |
| 69 | |
| 70 | This process is repeated 32 times (32 tiles in a scanline). |
| 71 | |
| 72 | This is when the PPU retrieves the appropriate data from PPU memory for |
| 73 | rendering the background. The first background tile fetched here is actually |
| 74 | the 3rd to be drawn on the screen (the background data for the first 2 tiles |
| 75 | to be rendered on the next scanline are fetched at the end of the scanline |
| 76 | prior to this one). |
| 77 | |
| 78 | In one complete cycle of fetches (4 fetches, or 32 cycles), the PPU renders |
| 79 | or draws 8 pixels on the screen. However, this does not suggest that the PPU |
| 80 | is always drawing on-screen results while background data is being fetched. |
| 81 | There is a delay inside the PPU from when the first background tile is |
| 82 | fetched, and when the first pixel to be displayed on the screen is rendered. |
| 83 | It is important to be aware of this delay, since it specifically relates to |
| 84 | the "sprite 0 hit" flag's timing. I currently do not know what the delay |
| 85 | time is (as far as clock cycles go). |
| 86 | |
| 87 | Note that the PPU fetches a nametable byte for every 8 horizontal pixels |
| 88 | it draws. It should be understood that with some custom cartridge hardware, |
| 89 | the PPU's color area could be increased (more about this at the end of this |
| 90 | document). |
| 91 | |
| 92 | It is also during this time that the PPU evaluates the "Y coordinate" |
| 93 | entries of all 64 sprites (starting with sprite 0) in sprite RAM, to see if |
| 94 | the sprites are within range (to be drawn on the screen) FOR THE NEXT |
| 95 | SCANLINE. The data for each sprite entry found to be in range (that is, the |
| 96 | sprite's nametable and x coordinate bytes, attribute (5 bits) and fine y |
| 97 | scroll (3 or 4 bits, depending on bit 5 of $2000 ("sprite size")) bits) |
| 98 | accumulates into a part of PPU memory called the "sprite temporary |
| 99 | memory", which is big enough to hold the data for up to 8 sprites. If 8 |
| 100 | sprites have accumulated into the temporary memory and the PPU is still |
| 101 | finding more sprites in range for drawing on the next scanline, then the |
| 102 | sprite data is ignored (not loaded into the sprite temporary memory), and |
| 103 | the PPU raises a flag (bit 5 of $2002) indicating that it is going to be |
| 104 | dropping sprites for the next scanline. |
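The sprite evaluation described above could be sketched roughly as follows. Treat it as an illustration under assumptions, not a description of the real circuit: the function name, the 4-bytes-per-sprite layout with the Y coordinate in byte 0, and the exact in-range comparison are all assumed here; only the scan order, the 8 sprite limit and the dropped-sprite flag come from the text.

```python
def evaluate_sprites(sprite_ram, next_scanline, sprite_height=8):
    """Pick the sprites that are in range for the next scanline.

    sprite_ram    -- 256 bytes; ASSUMED layout: 4 bytes per sprite, Y first
    next_scanline -- the scanline the sprites are being evaluated for
    sprite_height -- 8 or 16, per the "sprite size" bit of $2000

    Returns (entries, dropped): entries is a list of (sprite index, row
    within the sprite) for at most 8 sprites, and dropped models bit 5
    of $2002.
    """
    entries = []
    dropped = False
    for index in range(64):              # start with sprite 0
        y = sprite_ram[index * 4]
        row = next_scanline - y          # ASSUMED range test
        if 0 <= row < sprite_height:
            if len(entries) < 8:
                entries.append((index, row))
            else:
                dropped = True           # temporary memory is already full
    return entries, dropped
```

The row value stands in for the "fine y scroll" bits: 3 bits are enough for 8 pixel tall sprites, 4 bits for 16 pixel tall ones.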
| 105 | |
| 106 | 129 thru 160: |
| 107 | |
| 108 | 1. Fetch 2 garbage name table bytes |
| 109 | 2. Fetch 2 pattern table bitmap bytes for applicable sprites ON THE NEXT |
| 110 | SCANLINE |
| 111 | |
| 112 | This process is repeated 8 times. |
| 113 | |
| 114 | This is the period of time when the PPU retrieves the appropriate pattern |
| 115 | table data for the sprites to be drawn on the next scanline. Where the PPU |
| 116 | fetches pattern table data for an individual sprite depends on the nametable |
| 117 | byte, and fine y scroll bits of a single sprite entry in the sprite |
| 118 | temporary memory, and bits 3 and 5 of $2000 ("sprite pattern table select" |
| 119 | and "sprite size" bits, respectively). The fetched pattern table data (which |
| 120 | is 2 bytes), plus the associated 5 attribute bits, and the x coordinate |
| 121 | byte in sprite temporary memory are then loaded into a part of the PPU |
| 122 | called the "sprite buffer memory". This memory area again, is large enough |
| 123 | to hold the contents for 8 sprites. The makeup of one sprite memory cell |
| 124 | here is composed of 2 8-bit shift registers (the fetched pattern table data |
| 125 | is loaded in here, where it will be serialized at the appropriate time), a |
| 126 | 5-bit latch (which holds the attribute data for a sprite), and an 8-bit down |
| 127 | counter (this is where the x coordinate is loaded). The counter is |
| 128 | decremented every time the PPU draws a pixel on screen, and when the counter |
| 129 | reaches 0, the pattern table data in the shift registers will start to |
| 130 | serialize, and be drawn on the screen. |
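One cell of the sprite buffer memory described above can be modelled in a few lines. This is a minimal sketch: the class and method names are invented, and shifting out the most significant bit first is an assumption (the text only says that the data is serialized).

```python
class SpriteCell:
    """One sprite buffer cell: 2 shift registers, a latch, a down counter."""

    def __init__(self, bitmap_a, bitmap_b, attributes, x):
        self.shift_a = bitmap_a & 0xFF    # 1st fetched pattern table byte
        self.shift_b = bitmap_b & 0xFF    # 2nd fetched pattern table byte
        self.attributes = attributes & 0x1F   # 5-bit attribute latch
        self.counter = x & 0xFF           # 8-bit down counter (x coordinate)

    def clock(self):
        """Advance by one drawn pixel and return this cell's 2-bit output."""
        if self.counter > 0:
            self.counter -= 1             # still counting down to the sprite
            return 0
        # Counter is at 0: serialize one bit from each register (MSB first,
        # which is an assumption).
        bit_a = (self.shift_a >> 7) & 1
        bit_b = (self.shift_b >> 7) & 1
        self.shift_a = (self.shift_a << 1) & 0xFF
        self.shift_b = (self.shift_b << 1) & 0xFF
        return (bit_b << 1) | bit_a
```

Clocking a cell loaded with x = 2 gives two empty pixels before the first bitmap bits come out, which matches the down counter behaviour described in the text.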
| 131 | |
| 132 | Even if no sprites exist on the next scanline, a pattern table fetch takes |
| 133 | place. |
| 134 | |
| 135 | Although the fetched name table data is thrown away, I still can't make |
| 136 | much sense out of the name table address accesses the PPU makes during this |
| 137 | time. However, the address does seem to relate to the first name table tile |
| 138 | to be rendered on the screen. |
| 139 | |
| 140 | It should also be noted that because this fetch is required for sprites on |
| 141 | the next line, it is necessary for a garbage scanline to exist prior to the |
| 142 | very first scanline to be actually rendered, so that sprite RAM entries can |
| 143 | be evaluated, and the appropriate bitmap data retrieved. |
| 144 | |
| 145 | Finally, it would appear to me that the PPU's 8 sprite/scanline |
| 146 | bottleneck exists clearly because the PPU could only find the time in one |
| 147 | scanline to fetch the pattern bitmaps for 8 sprites. However, why the PPU |
| 148 | doesn't attempt to access pattern table data in the time when it fetches 2 |
| 149 | garbage name table bytes is a good question. |
| 150 | |
| 151 | 161 thru 168: |
| 152 | |
| 153 | 1. Fetch 1 name table byte |
| 154 | 2. Fetch 1 attribute table byte |
| 155 | 3. Fetch 2 pattern table bitmap bytes |
| 156 | |
| 157 | This process is repeated 2 times. |
| 158 | |
| 159 | It is during this time that the PPU fetches the applicable background |
| 160 | data for the first and second tiles to be rendered on the screen for the |
| 161 | next scanline. The rest of the tiles (from the 3rd on) are fetched at the |
| 162 | beginning of the following scanline (accesses 1 thru 128). |
| 163 | |
| 164 | 169 thru 170: |
| 165 | |
| 166 | 1. Fetch 1 name table byte |
| 167 | |
| 168 | This process is repeated 2 times. |
| 169 | |
| 170 | I'm unclear on the reason why this particular access to memory is made. |
| 171 | The nametable address that is accessed 2 times in a row here is also the |
| 172 | same nametable address that points to the 3rd tile to be rendered on the |
| 173 | screen (or basically, the first nametable address that will be accessed when |
| 174 | the PPU is fetching background data on the next scanline). |
| 175 | |
| 176 | |
| 177 | After memory access 170, the PPU simply rests for 4 cycles (or the |
| 178 | equivalent of half a memory access cycle) before repeating the whole |
| 179 | pixel/scanline rendering process. If the scanline being rendered is the very |
| 180 | first one on every second frame, then this delay simply doesn't exist. |
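The access sequence of a scanline can be summarized as a lookup from access number to what is being fetched. The sketch below only restates the list above; the function name, the phase labels and the 0-based starting cycle are illustrative assumptions.

```python
CYCLES_PER_ACCESS = 8   # external memory is accessed every 8 clock cycles

TILE_FETCHES = ("name table", "attribute table",
                "pattern table bitmap (1st)", "pattern table bitmap (2nd)")
SPRITE_FETCHES = ("garbage name table", "garbage name table",
                  "pattern table bitmap (1st)", "pattern table bitmap (2nd)")

def classify_access(n):
    """Return (phase, fetch, start_cycle) for memory access n (1..170)."""
    if not 1 <= n <= 170:
        raise ValueError("a scanline has accesses 1 thru 170")
    start_cycle = (n - 1) * CYCLES_PER_ACCESS
    if n <= 128:     # 32 background tiles, starting with the 3rd on screen
        return ("background", TILE_FETCHES[(n - 1) % 4], start_cycle)
    if n <= 160:     # bitmaps for up to 8 sprites on the next scanline
        return ("sprites", SPRITE_FETCHES[(n - 129) % 4], start_cycle)
    if n <= 168:     # first 2 background tiles of the next scanline
        return ("next line tiles", TILE_FETCHES[(n - 161) % 4], start_cycle)
    return ("trailing", "name table", start_cycle)   # purpose unclear

# 170 accesses plus the 4 cycle rest make up a normal scanline.
NORMAL_SCANLINE_CYCLES = 170 * CYCLES_PER_ACCESS + 4
```

For example, classify_access(131) falls in the sprite phase and is the first of the two pattern table bitmap fetches for the first sprite slot.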
| 181 | |
| 182 | |
| 183 | Sequence of line rendering |
| 184 | -------------------------- |
| 185 | |
| 186 | 1. Starting at the instant the VINT flag is pulled down (when an NMI is |
| 187 | generated), 20 scanlines make up the period of time on the PPU which I like |
| 188 | to call the VINT period. During this time, the PPU makes NO access to its |
| 189 | external memory (i.e. name / pattern tables, etc.). |
| 190 | |
| 191 | 2. After 20 scanlines worth of time go by (since the VINT flag was set), |
| 192 | the PPU starts to render scanlines. Now, the first scanline it renders is a |
| 193 | dummy one; although it will access its external memory in the same sequence |
| 194 | it would for drawing a valid scanline, the fetched background data is thrown |
| 195 | away, and the name table addresses that the PPU accesses are unexplained |
| 196 | (for now). |
| 197 | |
| 198 | IMPORTANT! this is the only scanline that has variable length. On every |
| 199 | second rendered frame, this scanline is only 1360 cycles. Otherwise it's |
| 200 | 1364. |
| 201 | |
| 202 | 3. After rendering 1 dummy scanline, the PPU starts to render the actual |
| 203 | data to be displayed on the screen. This is done for 240 scanlines, of |
| 204 | course. |
| 205 | |
| 206 | 4. After the very last rendered scanline finishes, the PPU does nothing for |
| 207 | 1 scanline (i.e. makes no external memory accesses). When this scanline |
| 208 | finishes, the VINT flag is set, and the process of drawing lines starts all |
| 209 | over again. |
| 210 | |
| 211 | This makes a total of 262 scanlines. Although one scanline is slightly |
| 212 | shorter on every second rendered frame (4 cycles), I don't know if this |
| 213 | feature is necessary to implement in emulators, since it only makes 1/3 of a |
| 214 | CPU cycle difference per frame (and there's NO way that a game could take |
| 215 | into account 1/3 of a CPU cycle). |
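The line sequence above maps onto a simple per-frame schedule. This is a sketch only: the line numbering (0 is the first line of the VINT period), the label strings and the boolean parameter are assumptions chosen for illustration, and which frames get the short dummy line is left to the caller.

```python
def line_kind(line):
    """Classify scanline 0..261, counted from the start of the VINT period."""
    if not 0 <= line < 262:
        raise ValueError("a frame has 262 scanlines")
    if line < 20:
        return "vint"      # no external memory accesses
    if line == 20:
        return "dummy"     # fetches are made, but the data is thrown away
    if line < 261:
        return "render"    # the 240 displayed scanlines
    return "idle"          # 1 scanline with no external memory accesses

def frame_cycles(short_dummy_line):
    """Total clock cycles in a frame; the dummy line may be 4 cycles short."""
    total = 0
    for line in range(262):
        if short_dummy_line and line_kind(line) == "dummy":
            total += 1360
        else:
            total += 1364
    return total

long_frame = frame_cycles(False)
short_frame = frame_cycles(True)
# Averaged over a long and a short frame, in CPU cycles (12 clocks each).
average_cpu_cycles = (long_frame + short_frame) / 2 / 12
```

Averaged over two consecutive frames, the length comes out to 29780.5 CPU cycles, which is where the 1/3 cycle difference per frame shows up.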
| 216 | |
| 217 | |
| 218 | Food for thought |
| 219 | ---------------- |
| 220 | |
| 221 | What's important to remember about the NES's 2C02 or picture processing unit |
| 222 | (hereon PPU) is that all screen data is fetched & drawn on a real-time |
| 223 | basis. For example, let's consider how the PPU draws background tiles. |
| 224 | |
| 225 | We know that one name table byte is associated with an 8x8 cluster of pixels |
| 226 | (and therefore, 16 bytes worth of pattern bitmap data, plus 2 attribute |
| 227 | bits). Therefore, it would make sense for the PPU to only have to fetch a |
| 228 | name table byte once for each 8x8 pixel array it draws (one tile), and 1 |
| 229 | attribute byte fetch for every 4x4 tile matrix that it draws. However, since |
| 230 | the PPU always draws one complete scanline before drawing the next, the PPU |
| 231 | will actually fetch the same name table byte 8 times, once each scanline at |
| 232 | the appropriate x coordinate. Since these name table address access reads |
| 233 | are redundant, with some custom cartridge hardware, it would be possible to |
| 234 | make the PPU appear as if it had background tiles as small as 8x1 pixels! |
| 235 | |
| 236 | Additionally, an attribute table byte is fetched from name table RAM once |
| 237 | per 2 fetched pattern bitmap bytes (or, every 8 pixels worth of pattern |
| 238 | bitmap data). This is useful information to keep in mind, for with some |
| 239 | custom cartridge hardware, this would allow the NES's PPU to appear to have |
| 240 | an effective color area as small as 8x1 pixels (!), where each group of 8 |
| 241 | pixels is limited to 4 exclusive colors, which is *a lot* better |
| 242 | than the PPU's default color area of 16x16 pixels. |
| 243 | |
| 244 | So basically, what I'm getting at here, is that the PPU has absolutely NO |
| 245 | memory whatsoever of what it rendered last scanline, and therefore all data |
| 246 | must be processed/evaluated again, whether it's name table accesses, |
| 247 | attribute table accesses, or even its internal sprite RAM accesses. |
| 248 | |
| 249 | What's good, and what's bad about the way the PPU draws its pictures: |
| 250 | |
| 251 | What's good about it is that it makes the PPU a hell of a lot more versatile, |
| 252 | provided you have the appropriate hardware to assist in the improvement of |
| 253 | the PPU's background drawing techniques (MMC5 comes to mind). Also, by doing |
| 254 | background rendering in real time, the PPU complexity is less, and fewer |
| 255 | internal temporary registers are required. |
| 256 | |
| 257 | What's bad about it is that it eats up memory bandwidth like it's going out |
| 258 | of style. When the PPU is rendering scanlines, the PPU is accessing the VRAM |
| 259 | every chance it gets, which takes away from the time that the programmer |
| 260 | gets to access the VRAM. In contrast, if redundantly loaded data (like |
| 261 | attribute bytes) were kept in internal PPU RAM, this would allow some time |
| 262 | for the PPU to allow access to its VRAM. |
| 263 | |
| 264 | All in all though, Nintendo engineered quite a cost effective, versatile |
| 265 | graphic processor. Now, if only they brought the 4 expansion pins on the PPU |
| 266 | out of the deck! |