1 | NTSC PPU timing |
2 | by Samus Aran (livingmonolith@hotmail.com) |
3 | date: Sept. 25th, Y2K |
4 | |
5 | This weekend, I set up an experiment with my NTSC NES MB & my PC so's I could |
6 | RE the PPU's timing. What I did was (using a PC interface) analyse the |
7 | changes that occur on the PPU's address and data pins on every rising & |
8 | falling edge of the PPU's clock. I was not planning on removing the PPU from |
9 | the motherboard (yet), so basically I just kept everything intact (minus the |
10 | stuff I added onto the MB so I could monitor the PPU's signals), and popped |
11 | in a game, so that it would initialize the PPU for me (I used DK classics, |
12 | since it was only taking something like 4 frames before it was turning on the |
13 | background/sprites). |
14 | |
15 | The only change I made was taking out the 21 MHz clock generator circuitry. |
16 | To replace the clock signal, I connected a port controlled latch to the |
17 | NES's main clock line instead. Now, by writing a 0 or a 1 out to a PC ISA |
18 | port of my choice (I was using $104), I was able to control the 21 MHz |
19 | clockline of the NES. After I would create a rise or a fall on the NES's |
20 | clock line, I would then read in the data that appeared on the PPU's address |
21 | and data pins, which included monitoring what PPU registers the game |
22 | read/wrote to (& the data that was read/written). |
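
A minimal sketch of the single-stepping loop described above, in Python. The
names write_clock and read_pins are hypothetical stand-ins for the write to
the ISA port and the readback of the PPU's address/data pins; the starting
clock level (low) is also an assumption. Fake hardware is used so the sketch
runs without the rig.

```python
def single_step(write_clock, read_pins, edges):
    """Toggle the clock line `edges` times, sampling the pins after each edge.

    write_clock(level) and read_pins() are hypothetical callables standing in
    for the port write (port $104 in the text) and the pin readback.
    """
    samples = []
    level = 0                      # assumed starting level of the clock line
    for _ in range(edges):
        level ^= 1                 # alternate between a rise and a fall
        write_clock(level)         # create the edge on the NES clock line
        samples.append((level, read_pins()))  # then read what the PPU shows
    return samples


# Stand-in hardware: record the clock writes, return a counter as "pin data".
clock_log = []
fake_pins = iter(range(100))
trace = single_step(clock_log.append, lambda: next(fake_pins), 4)
print(trace)
```

Each entry in the trace pairs the clock level just written with whatever was
sampled afterwards, which mirrors the "create an edge, then read" order of the
experiment.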
23 | |
24 | My findings: |
25 | |
26 | - The PPU makes NO external access to name or character tables, unless the |
27 | background or sprites are enabled. This means that the PPU's address and |
28 | data busses are dead while in this state. |
29 | |
30 | - Because the PPU's palette RAM is internal to it, the PPU has multiport |
31 | access to it, and therefore, instant access to it at all times (this is why |
32 | reading palette RAM via $2007 does not require a throw-away read). This is |
33 | also why the PPU never puts the palette address on its bus while a scanline |
34 | is being rendered; it's simply unnecessary. Additionally, when the |
35 | programmer accesses palette RAM via $2006/7, the palette address accessed |
36 | actually does show up on the PPU's external address bus, but the PPU's /R & |
37 | /W lines are not activated. This is required to prevent writing over name |
38 | table data falling under the appropriate mirrored area. I don't know why |
39 | Nintendo didn't just devote an exclusive area for palette RAM, like it did |
40 | for sprite RAM. |
41 | |
42 | - Sprite DMA is 6144 clock cycles long (or in CPU clock cycles, 6144/12 = 512). |
43 | 256 individual transfers are made from CPU memory to a temp register inside |
44 | the CPU, then from the CPU's temp reg, to $2004. |
45 | |
46 | - One scanline is EXACTLY 1364 cycles long. In comparison to the CPU's |
47 | speed, one scanline is 1364/12 (about 113.67) CPU cycles long. |
48 | |
49 | - One frame is EXACTLY 357368 cycles long, or EXACTLY 262 scanlines long. |
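
The cycle counts in the findings above can be cross-checked with a few lines
of Python. This is only arithmetic on the figures quoted in the list (12
master clocks per CPU cycle, 1364 clocks per scanline, 262 scanlines, 6144
clocks of sprite DMA); the constant names are mine.

```python
from fractions import Fraction

MASTER_PER_CPU = 12          # master clocks per CPU cycle (the /12 in the text)
SCANLINE_CLOCKS = 1364       # master clocks per scanline
SCANLINES_PER_FRAME = 262
SPRITE_DMA_CLOCKS = 6144     # master clocks for one sprite DMA

frame_clocks = SCANLINE_CLOCKS * SCANLINES_PER_FRAME
cpu_per_scanline = Fraction(SCANLINE_CLOCKS, MASTER_PER_CPU)
cpu_per_frame = Fraction(frame_clocks, MASTER_PER_CPU)
dma_cpu_cycles = SPRITE_DMA_CLOCKS // MASTER_PER_CPU

print(frame_clocks)        # 357368
print(cpu_per_scanline)    # 341/3, i.e. 113 and 2/3 CPU cycles
print(cpu_per_frame)       # 89342/3, i.e. 29780 and 2/3 CPU cycles
print(dma_cpu_cycles)      # 512
```

Keeping the per-scanline figure as an exact fraction avoids accumulating
rounding error when counting CPU cycles across many scanlines.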
50 | |
51 | |
52 | Sequence of pixel rendering |
53 | --------------------------- |
54 | |
55 | External PPU memory is accessed every 8 clock cycles by the PPU when it's |
56 | drawing the background. Therefore, the PPU will typically access external |
57 | memory 170 times per scanline. After the 170th fetch, the PPU does nothing |
58 | for 4 clock cycles (except in the case of a 1360 clock cycle scanline (more |
59 | on this later)), making the scanline 1364 cycles long. |
60 | |
61 | accesses |
62 | -------- |
63 | |
64 | 1 thru 128: |
65 | |
66 | 1. Fetch 1 name table byte |
67 | 2. Fetch 1 attribute table byte |
68 | 3. Fetch 2 pattern table bitmap bytes |
69 | |
70 | This process is repeated 32 times (32 tiles in a scanline). |
71 | |
72 | This is when the PPU retrieves the appropriate data from PPU memory for |
73 | rendering the background. The first background tile fetched here is actually |
74 | the 3rd to be drawn on the screen (the background data for the first 2 tiles |
75 | to be rendered on the next scanline are fetched at the end of the scanline |
76 | prior to this one). |
77 | |
78 | In one complete cycle of fetches (4 fetches, or 32 cycles), the PPU renders |
79 | or draws 8 pixels on the screen. However, this does not suggest that the PPU |
80 | is always drawing on-screen results while background data is being fetched. |
81 | There is a delay inside the PPU from when the first background tile is |
82 | fetched, and when the first pixel to be displayed on the screen is rendered. |
83 | It is important to be aware of this delay, since it specifically relates to |
84 | the "sprite 0 hit" flag's timing. I currently do not know what the delay |
85 | time is (as far as clock cycles go). |
86 | |
87 | Note that the PPU fetches a nametable byte for every 8 horizontal pixels |
88 | it draws. It should be understood that with some custom cartridge hardware, |
89 | the PPU's color area could be increased (more about this at the end of this |
90 | document). |
91 | |
92 | It is also during this time that the PPU evaluates the "Y coordinate" |
93 | entries of all 64 sprites (starting with sprite 0) in sprite RAM, to see if |
94 | the sprites are within range (to be drawn on the screen) FOR THE NEXT |
95 | SCANLINE. For sprite entries found to be in range, the sprite's data (that |
96 | is, its nametable and x coordinate bytes, attribute bits (5), and fine y |
97 | scroll bits (3 or 4, depending on bit 5 of $2000 ("sprite size"))) |
98 | accumulates into a part of PPU memory called the "sprite temporary |
99 | memory", which is big enough to hold the data for up to 8 sprites. If 8 |
100 | sprites have accumulated into the temporary memory and the PPU is still |
101 | finding more sprites in range for drawing on the next scanline, then the |
102 | extra sprite data is ignored (not loaded into the sprite temporary memory), |
103 | and the PPU raises a flag (bit 5 of $2002) indicating that it is going to be |
104 | dropping sprites for the next scanline. |
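
The in-range evaluation described above can be sketched as follows. This is a
simplified model, not a description of the PPU's internal logic: the exact
range comparison (next_line - y) is my assumption, and the (y, tile, attr, x)
tuple layout and function name are hypothetical.

```python
def evaluate_sprites(sprite_ram, next_line, sprite_height=8):
    """Pick the sprites that fall on `next_line`, starting with sprite 0.

    sprite_ram: list of 64 (y, tile, attr, x) tuples (layout is illustrative).
    sprite_height: 8 or 16, as selected by bit 5 of $2000.
    Returns (temp_memory, dropped_flag); dropped_flag models bit 5 of $2002.
    """
    temp_memory = []
    dropped_flag = False
    for y, tile, attr, x in sprite_ram:
        fine_y = next_line - y             # assumed form of the range check
        if 0 <= fine_y < sprite_height:
            if len(temp_memory) < 8:
                temp_memory.append((tile, attr, x, fine_y))
            else:
                dropped_flag = True        # a 9th in-range sprite is ignored
    return temp_memory, dropped_flag


# Nine sprites on the same line: eight are kept, the ninth raises the flag.
sprite_ram = [(10, i, 0, i * 8) for i in range(9)] + [(200, 0, 0, 0)] * 55
kept, dropped = evaluate_sprites(sprite_ram, next_line=12)
print(len(kept), dropped)
```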
105 | |
106 | 129 thru 160: |
107 | |
108 | 1. Fetch 2 garbage name table bytes |
109 | 2. Fetch 2 pattern table bitmap bytes for applicable sprites ON THE NEXT |
110 | SCANLINE |
111 | |
112 | This process is repeated 8 times. |
113 | |
114 | This is the period of time when the PPU retrieves the appropriate pattern |
115 | table data for the sprites to be drawn on the next scanline. Where the PPU |
116 | fetches pattern table data for an individual sprite depends on the nametable |
117 | byte, and fine y scroll bits of a single sprite entry in the sprite |
118 | temporary memory, and bits 3 and 5 of $2000 ("sprite pattern table select" |
119 | and "sprite size" bits, respectively). The fetched pattern table data (which |
120 | is 2 bytes), plus the associated 5 attribute bits, and the x coordinate |
121 | byte in sprite temporary memory are then loaded into a part of the PPU |
122 | called the "sprite buffer memory". This memory area again, is large enough |
123 | to hold the contents for 8 sprites. The makeup of one sprite memory cell |
124 | here is composed of 2 8-bit shift registers (the fetched pattern table data |
125 | is loaded in here, where it will be serialized at the appropriate time), a |
126 | 5-bit latch (which holds the attribute data for a sprite), and an 8-bit down |
127 | counter (this is where the x coordinate is loaded). The counter is |
128 | decremented every time the PPU draws a pixel on screen, and when the counter |
129 | reaches 0, the pattern table data in the shift registers will start to |
130 | serialize, and be drawn on the screen. |
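
One sprite buffer memory cell, as described above, could be modelled like
this. It is a sketch under assumptions: the class and method names are
hypothetical, the shift direction (bit 7 out first) is assumed, and
horizontal flipping is left out.

```python
class SpriteCell:
    """Two 8-bit shift registers, a 5-bit latch and an 8-bit down counter."""

    def __init__(self, pattern_lo, pattern_hi, attr_bits, x):
        self.shift_lo = pattern_lo & 0xFF   # first fetched pattern byte
        self.shift_hi = pattern_hi & 0xFF   # second fetched pattern byte
        self.attr = attr_bits & 0x1F        # 5-bit attribute latch
        self.counter = x & 0xFF             # x coordinate, counted down

    def clock(self):
        """Advance one pixel; return a 2-bit value, or None while waiting."""
        if self.counter > 0:
            self.counter -= 1               # still left of the sprite
            return None
        bit0 = (self.shift_lo >> 7) & 1     # assumed: MSB is shifted out first
        bit1 = (self.shift_hi >> 7) & 1
        self.shift_lo = (self.shift_lo << 1) & 0xFF
        self.shift_hi = (self.shift_hi << 1) & 0xFF
        return (bit1 << 1) | bit0


cell = SpriteCell(0b10000000, 0b11000000, attr_bits=0, x=2)
pixels = [cell.clock() for _ in range(5)]
print(pixels)   # [None, None, 3, 2, 0]
```

With x = 2 the cell stays silent for two pixels, then starts serializing its
pattern data, which matches the counter behaviour described in the text.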
131 | |
132 | Even if no sprites exist on the next scanline, a pattern table fetch takes |
133 | place. |
134 | |
135 | Although the fetched name table data is thrown away, I still can't make |
136 | much sense out of the name table address accesses the PPU makes during this |
137 | time. However, the address does seem to relate to the first name table tile |
138 | to be rendered on the screen. |
139 | |
140 | It should also be noted that because this fetch is required for sprites on |
141 | the next line, it is necessary for a garbage scanline to exist prior to the |
142 | very first scanline to be actually rendered, so that sprite RAM entries can |
143 | be evaluated, and the appropriate bitmap data retrieved. |
144 | |
145 | Finally, it would appear to me that the PPU's 8 sprite/scanline |
146 | bottleneck exists clearly because the PPU could only find the time in one |
147 | scanline to fetch the pattern bitmaps for 8 sprites. However, why the PPU |
148 | doesn't attempt to access pattern table data in the time when it fetches 2 |
149 | garbage name table bytes is a good question. |
150 | |
151 | 161 thru 168: |
152 | |
153 | 1. Fetch 1 name table byte |
154 | 2. Fetch 1 attribute table byte |
155 | 3. Fetch 2 pattern table bitmap bytes |
156 | |
157 | This process is repeated 2 times. |
158 | |
159 | It is during this time that the PPU fetches the applicable background |
160 | data for the first and second tiles to be rendered on the screen for the |
161 | next scanline. The rest of the tiles (the 3rd onward) are fetched at the |
162 | beginning of the following scanline (accesses 1 thru 128). |
163 | |
164 | 169 thru 170: |
165 | |
166 | 1. Fetch 1 name table byte |
167 | |
168 | This process is repeated 2 times. |
169 | |
170 | I'm unclear on the reason why this particular access to memory is made. |
171 | The nametable address that is accessed 2 times in a row here is also the |
172 | same nametable address that points to the 3rd tile to be rendered on the |
173 | screen (or basically, the first nametable address that will be accessed when |
174 | the PPU is fetching background data on the next scanline). |
175 | |
176 | |
177 | After memory access 170, the PPU simply rests for 4 cycles (or the |
178 | equivalent of half a memory access cycle) before repeating the whole |
179 | pixel/scanline rendering process. If the scanline being rendered is the very |
180 | first one on every second frame, then this delay simply doesn't exist. |
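
The access sequence above can be summarised as a small lookup, sketched here
in Python. The function names and label strings are mine; the numbers (170
accesses of 8 clocks each, plus a 4-clock rest) come from the text.

```python
CLOCKS_PER_ACCESS = 8
ACCESSES_PER_LINE = 170
IDLE_CLOCKS = 4

BG_GROUP = ("name table", "attribute table",
            "pattern bitmap 0", "pattern bitmap 1")
SPRITE_GROUP = ("garbage name table", "garbage name table",
                "sprite pattern bitmap 0", "sprite pattern bitmap 1")


def classify_access(n):
    """Describe memory access number n (1..170) within one scanline."""
    if not 1 <= n <= ACCESSES_PER_LINE:
        raise ValueError("access number must be in 1..170")
    if n <= 128:                                   # 32 background tiles
        return "background: " + BG_GROUP[(n - 1) % 4]
    if n <= 160:                                   # 8 sprites, next scanline
        return "next-line sprites: " + SPRITE_GROUP[(n - 129) % 4]
    if n <= 168:                                   # first 2 tiles, next line
        return "next-line background: " + BG_GROUP[(n - 161) % 4]
    return "extra name table"                      # accesses 169 and 170


def scanline_clocks(short=False):
    """Total clocks in a scanline; `short` drops the 4-clock rest."""
    return ACCESSES_PER_LINE * CLOCKS_PER_ACCESS + (0 if short else IDLE_CLOCKS)


print(classify_access(1), "|", classify_access(131), "|", classify_access(170))
print(scanline_clocks(), scanline_clocks(short=True))   # 1364 1360
```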
181 | |
182 | |
183 | Sequence of line rendering |
184 | -------------------------- |
185 | |
186 | 1. Starting at the instant the VINT flag is pulled down (when an NMI is |
187 | generated), 20 scanlines make up the period of time on the PPU which I like |
188 | to call the VINT period. During this time, the PPU makes NO access to its |
189 | external memory (i.e. name / pattern tables, etc.). |
190 | |
191 | 2. After 20 scanlines worth of time go by (since the VINT flag was set), |
192 | the PPU starts to render scanlines. Now, the first scanline it renders is a |
193 | dummy one; although it will access its external memory in the same sequence |
194 | it would for drawing a valid scanline, the fetched background data is thrown |
195 | away, and the name table addresses that the PPU accesses are unexplained |
196 | (for now). |
197 | |
198 | IMPORTANT! This is the only scanline that has variable length. On every |
199 | second rendered frame, this scanline is only 1360 cycles. Otherwise it's |
200 | 1364. |
201 | |
202 | 3. After rendering 1 dummy scanline, the PPU starts to render the actual |
203 | data to be displayed on the screen. This is done for 240 scanlines, of |
204 | course. |
205 | |
206 | 4. After the very last rendered scanline finishes, the PPU does nothing for |
207 | 1 scanline (i.e. makes no external memory accesses). When this scanline |
208 | finishes, the VINT flag is set, and the process of drawing lines starts all |
209 | over again. |
210 | |
211 | This makes a total of 262 scanlines. Although one scanline is slightly |
212 | shorter on every second rendered frame (4 cycles), I don't know if this |
213 | feature is necessary to implement in emulators, since it only makes 1/3 of a |
214 | CPU cycle difference per frame (and there's NO way that a game could take |
215 | into account 1/3 of a CPU cycle). |
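
The line sequence above adds up as follows. This is a sketch only; which
frames (odd or even) get the shorter dummy scanline is an assumption made for
illustration, and the names are mine.

```python
from fractions import Fraction

LINE_SEQUENCE = [
    ("VINT period", 20),
    ("dummy scanline", 1),
    ("rendered scanlines", 240),
    ("idle scanline", 1),
]
SCANLINE_CLOCKS = 1364
SHORT_LINE_SAVING = 4        # the dummy scanline is 1360 on every second frame


def frame_clocks(frame_index):
    """Return (scanlines, clocks) for a frame; odd frames assumed short."""
    scanlines = sum(count for _, count in LINE_SEQUENCE)
    clocks = scanlines * SCANLINE_CLOCKS
    if frame_index % 2 == 1:             # assumption: odd frames are short
        clocks -= SHORT_LINE_SAVING
    return scanlines, clocks


print(frame_clocks(0))   # (262, 357368)
print(frame_clocks(1))   # (262, 357364)
print(Fraction(SHORT_LINE_SAVING, 12))   # 1/3 of a CPU cycle
```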
216 | |
217 | |
218 | Food for thought |
219 | ---------------- |
220 | |
221 | What's important to remember about the NES's 2C02 or picture processing unit |
222 | (hereon PPU) is that all screen data is fetched & drawn on a real-time |
223 | basis. For example, let's consider how the PPU draws background tiles. |
224 | |
225 | We know that one name table byte is associated with an 8x8 cluster of pixels |
226 | (and therefore, 16 bytes worth of pattern bitmap data, plus 2 attribute |
227 | bits). Therefore, it would make sense for the PPU to only have to fetch a |
228 | name table byte once for each 8x8 pixel array it draws (one tile), and 1 |
229 | attribute byte fetch for every 4x4 tile matrix that it draws. However, since |
230 | the PPU always draws one complete scanline before drawing the next, the PPU |
231 | will actually fetch the same name table byte 8 times, once each scanline at |
232 | the appropriate x coordinate. Since these name table address access reads |
233 | are redundant, with some custom cartridge hardware, it would be possible to |
234 | make the PPU appear as if it had background tiles as small as 8x1 pixels! |
235 | |
236 | Additionally, an attribute table byte is fetched from name table RAM once |
237 | per 2 fetched pattern bitmap bytes (or, every 8 pixels worth of pattern |
238 | bitmap data). This is useful information to keep in mind, for with some |
239 | custom cartridge hardware, this would allow the NES's PPU to appear to have |
240 | an effective color area as small as 8x1 pixels (!), where each group of 8 |
241 | pixels is limited to having 4 exclusive colors, which is *a lot* better |
242 | than the PPU's default color area of 16x16 pixels. |
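
To make the redundancy concrete, here is a small sketch of which name table
and attribute table addresses get fetched for a given pixel position. It
assumes no scrolling and the first name table at $2000 (with its attribute
table at $23C0); the function names are mine.

```python
NAME_TABLE_BASE = 0x2000   # first name table (assumed, no scrolling)
ATTR_TABLE_BASE = 0x23C0   # attribute table: last 64 bytes of that name table


def name_table_addr(x, y):
    """Address of the name table byte covering pixel (x, y): one per 8x8."""
    return NAME_TABLE_BASE + (y // 8) * 32 + (x // 8)


def attribute_addr(x, y):
    """Address of the attribute byte covering pixel (x, y): one per 32x32."""
    return ATTR_TABLE_BASE + (y // 32) * 8 + (x // 32)


# The same name table address is fetched on all 8 scanlines of a tile row,
# and the same attribute address on all 32 scanlines of a 4x4 tile block.
nt_reads = {name_table_addr(0, y) for y in range(8)}
at_reads = {attribute_addr(0, y) for y in range(32)}
print([hex(a) for a in nt_reads], [hex(a) for a in at_reads])
```

Because each of those repeated reads appears on the external bus, cartridge
hardware watching the address lines could in principle return different data
on each scanline, which is the trick the paragraphs above describe.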
243 | |
244 | So basically, what I'm getting at here, is that the PPU has absolutely NO |
245 | memory whatsoever of what it rendered last scanline, and therefore all data |
246 | must be processed/evaluated again, whether it's name table accesses, |
247 | attribute table accesses, or even its internal sprite RAM accesses. |
248 | |
249 | What's good, and what's bad about the way the PPU draws its pictures: |
250 | |
251 | What's good about it is that it makes the PPU a hell of a lot more versatile, |
252 | provided you have the appropriate hardware to assist in the improvement of |
253 | the PPU's background drawing techniques (MMC5 comes to mind). Also, by doing |
254 | background rendering in real time, the PPU complexity is less, and fewer |
255 | internal temporary registers are required. |
256 | |
257 | What's bad about it is that it eats up memory bandwidth like it's going out |
258 | of style. When the PPU is rendering scanlines, the PPU is accessing the VRAM |
259 | every chance it gets, which takes away from the time that the programmer |
260 | gets to access the VRAM. In contrast, if redundantly loaded data (like |
261 | attribute bytes) were kept in internal PPU RAM, this would allow some time |
262 | for the PPU to allow access to its VRAM. |
263 | |
264 | All in all though, Nintendo engineered quite a cost effective, versatile |
265 | graphic processor. Now, if only they brought the 4 expansion pins on the PPU |
266 | out of the deck! |