In the previous article, you learned the internal architecture of the Z80Cpu class that implements the CPU emulation in SpectNetIde. The CPU has more than 1300 instructions, and Z80Cpu must take care of each of them. In this post, you will learn the implementation details behind a few Z80 instructions.

Documentation and Tests

When designing the emulation architecture, I took care to make it easily testable. The current SpectNetIde project tests each instruction separately; most instructions have more than one unit test case. In the next article, I will show you how I implemented those tests.

Besides testing, I intended to create the source code so that you can immediately understand the specification of a particular instruction—without jumping to the Z80 reference documentation.

I added the reference documentation to the XML comments of each instruction method, as this sample (ADD A,B) shows:

The documentation starts with a short description of the operation. Some Z80 instructions do not modify the flags at all, while others do. After the explanation, I describe how a specific instruction handles the flags.
The T-States value indicates the number of clock cycles the instruction takes to execute. The contention breakdown entry describes how a particular instruction behaves on the ZX Spectrum in a contended situation. Later, in the article that covers memory and I/O contention, I will tell you how to decode the content of that field. Right now, just ignore it.
Just for a short recap, here is the list of Z80 flags:

Flag: Description
C (Bit 0): Carry flag. It is set or cleared depending on the operation performed. For ALU operations, it signals a carry (e.g., ADD) or a borrow (e.g., SUB). For bit shift and rotate operations, it stores the least or most significant bit shifted out by the operation. For the logical instructions AND, OR, and XOR, the Carry flag is reset.
N (Bit 1): Add/Subtract flag. This flag is used by the Decimal Adjust Accumulator instruction (DAA) to distinguish between the ADD and SUB instructions. For ADD instructions, N is cleared to 0. For SUB instructions, N is set to 1.
P/V (Bit 2): Parity/Overflow flag. This flag is set to a specific state depending on the operation being performed. For arithmetic operations, this flag indicates an overflow condition when the result in the Accumulator is greater than the maximum possible number (+127) or is less than the minimum possible number (–128). This overflow condition is determined by examining the sign bits of the operands.
H (Bit 4): Half Carry flag. This flag is set or cleared depending on the carry and borrow status between bits 3 and 4 of an 8-bit arithmetic operation. This flag is used by the Decimal Adjust Accumulator (DAA) instruction to correct the result of a packed BCD add or subtract operation.
Z (Bit 6): Zero flag. It is set if the result generated by the execution of certain instructions is 0; otherwise, it is reset.
S (Bit 7): Sign flag. It stores the state of the most-significant bit of the Accumulator (bit 7).

Note: There are two undocumented flags, Bit 3 and Bit 5 of the F register. These flags cannot be read directly. They store the 3rd and 5th bit of the result for every operation that changes any flag. In the emulator, I use the names R3 and R5 for these flags.
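To make the bit layout concrete, here is a minimal Python sketch (not the C# code of SpectNetIde) that defines one mask per flag, using the bit positions from the list above. The constant names and the describe_flags() helper are hypothetical, chosen only for this illustration.

```python
# Bit masks for the Z80 F register, following the bit positions listed above.
# All names here are hypothetical; they are not taken from SpectNetIde.
FLAG_C = 0x01    # bit 0: Carry
FLAG_N = 0x02    # bit 1: Add/Subtract
FLAG_PV = 0x04   # bit 2: Parity/Overflow
FLAG_R3 = 0x08   # bit 3: undocumented
FLAG_H = 0x10    # bit 4: Half Carry
FLAG_R5 = 0x20   # bit 5: undocumented
FLAG_Z = 0x40    # bit 6: Zero
FLAG_S = 0x80    # bit 7: Sign


def describe_flags(f):
    """Return the names of the flags set in the 8-bit value f, from bit 7 down."""
    names = [("S", FLAG_S), ("Z", FLAG_Z), ("R5", FLAG_R5), ("H", FLAG_H),
             ("R3", FLAG_R3), ("P/V", FLAG_PV), ("N", FLAG_N), ("C", FLAG_C)]
    return [name for name, mask in names if f & mask]
```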

You probably remember that the Z80Cpu class uses jump tables to invoke actions associated with operation codes. In this post, I will show and explain the methods behind these actions.

Simple Instructions

Many Z80 instructions are simple. They work with registers, load them from the memory, or store them. Here, I show a few of them.

NOP

The simplest is the NOP (No Operation) instruction. The CPU executes its M1 cycle without any further processing. Thus, the NOP instruction does not even have a dedicated action method. The ExecuteCpuCycle() method does this job with these lines:

8-Bit Register-To-Register Load

The Z80 CPU has 49 operations to move data from one of the seven 8-bit registers to another one. This example shows the LD B,C operation, which could hardly be implemented more simply:

All remaining 8-bit-register-to-8-bit-register operations use the same approach with a single line transfer code.

Loading Value to an 8-Bit Register

The CPU has instructions to move 8-bit literal values from the code to an 8-bit register, as the code of the LD E,N operation shows:

Here, N is an 8-bit value that follows the opcode in the memory. By the time the code invokes the LdEN() method, PC points to N in the memory.

Because a memory read operation takes 3 T-states, the ClockP3() method adjusts the Tacts counter.

Note: At first sight, it does not matter where we put ClockP3() in the code because wherever we put it, it always increases Tacts by 3. Well, it is not so. Because of memory contention, we need to add it after the memory read operation. The reason behind this approach is that the ReadMemory() operation may adjust the counter. The amount of this adjustment is a function of two inputs: the current Tacts value and the memory address. Moving ClockP3() before the ReadMemory() call might result in a different clock adjustment.
You will read more details later in the article about memory and I/O contention.
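The ordering issue can be illustrated with a small Python sketch (not the C# code of SpectNetIde). The ToyCpu class and its contention rule are invented for this example: a read from 0x4000-0x7FFF on an even tact costs one extra tact. Real ZX Spectrum contention follows a different pattern; only the dependence on the current tact count and on the address matters here.

```python
class ToyCpu:
    """Hypothetical, heavily simplified CPU used only to show call ordering."""

    def __init__(self):
        self.memory = bytearray(0x10000)
        self.pc = 0
        self.e = 0
        self.tacts = 0

    def read_memory(self, addr):
        # Invented contention rule: the delay depends on the current tact
        # count and on the address being read.
        if 0x4000 <= addr <= 0x7FFF and self.tacts % 2 == 0:
            self.tacts += 1
        return self.memory[addr]

    def clock_p3(self):
        self.tacts += 3

    def ld_e_n(self):
        # Read first, then account for the 3 T-states of the memory read.
        self.e = self.read_memory(self.pc)
        self.clock_p3()
        self.pc = (self.pc + 1) & 0xFFFF

    def ld_e_n_clock_first(self):
        # Same steps in the other order: the read now sees a different
        # tact count, so the contention delay may differ.
        self.clock_p3()
        self.e = self.read_memory(self.pc)
        self.pc = (self.pc + 1) & 0xFFFF
```

Starting at tact 4 with PC at 0x4001, ld_e_n() ends at tact 8 under this toy rule, while ld_e_n_clock_first() ends at tact 7, although both load the same value into E.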

Loading Value to a 16-Bit Register

The 16-bit value loading operation follows the same logic as its 8-bit version pair. This code shows the internals of the LD DE,NN instruction, where NN is a 16-bit value stored in LSB/MSB order right after the opcode.

You can see that the code carries out two read operations one after the other. The method stores the result of the first read in the E register (LSB) and the second in D (MSB). As with the previous operation, the two ClockP3() calls cannot be replaced with a single ClockP6(), as that would not handle memory contention correctly.

Loading an 8-Bit Register from Memory

The following code executes the LD A,(BC) operation. It works exactly as you imagine. Nonetheless, the code sets up the value of the internal WZ register:

You may ask why we set the value of an internal register if it’s not available from program code. Well, the Z80 CPU has some officially undocumented behavior (e.g., the BIT instruction), where the value of WZ influences how the undocumented R3 and R5 flags are calculated. The interesting thing is that though we read from the address pointed to by BC, WZ is set to BC+1. I won’t explain why, as it would take too long. It is the internal behavior of the Z80 CPU.
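As a rough Python sketch (not the C# code of SpectNetIde), the operation described above boils down to one read plus the WZ update. The function name and the way the state is passed in and returned are assumptions made for this illustration; clock handling is left out.

```python
def ld_a_from_bc(memory, b, c):
    """Sketch of LD A,(BC): return the new A value and the new WZ value."""
    bc = (b << 8) | c
    a = memory[bc]
    # WZ is set to BC+1 (wrapping at 16 bits), as the text describes.
    wz = (bc + 1) & 0xFFFF
    return a, wz
```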

Loading a 16-Bit Register from Memory

As the code of the LD HL,(NN) instruction shows, we need four read operations to get the value of a 16-bit register from an actual memory address:

The first two reads collect the address of the memory; the subsequent two reads obtain the LSB and MSB of the register’s new value from the memory.
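The four reads can be sketched in Python as follows (this is an illustration, not the C# code of SpectNetIde). The function signature is hypothetical, memory contention is ignored, and the returned tact count covers only the four reads that follow the opcode fetch.

```python
def ld_hl_from_nn(memory, pc):
    """Sketch of LD HL,(NN) with PC pointing at NN.

    Returns (h, l, new_pc, tacts_spent_on_reads).
    """
    tacts = 0
    # Reads 1 and 2: the 16-bit address NN, stored LSB first.
    addr_lo = memory[pc]
    tacts += 3
    addr_hi = memory[(pc + 1) & 0xFFFF]
    tacts += 3
    addr = (addr_hi << 8) | addr_lo
    # Reads 3 and 4: the new register value, L first, then H.
    l = memory[addr]
    tacts += 3
    h = memory[(addr + 1) & 0xFFFF]
    tacts += 3
    return h, l, (pc + 2) & 0xFFFF, tacts
```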

Storing an 8-Bit Register’s Value into Memory

The Z80 has instructions that write to a 16-bit memory address, such as LD (NN),A. This instruction first obtains the memory address, and then stores the value of A to that address:

The code is straightforward, except the snippet that sets the value of WZ. As you can see, it happens in two steps (the data bus can handle eight bits in a single transfer). WZh signifies the upper eight bits of WZ.

Note: From now on, I will explain the reason for using WZ only when I intend to point out something significant. Otherwise, I just skip the explanation.

Storing a 16-Bit Register’s Value into Memory

I guess the implementation of the LD (NN),HL instruction is straightforward:

More about Reading and Writing Memory

The Z80Cpu class outsources memory handling functionality to an abstract IMemoryDevice:

This is an essential implementation detail. Doing so, the emulator can easily handle features such as paging, banking, handling ROM, and so on. This indirection makes testing easier, too.
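The benefit of this indirection can be sketched in Python (not the C# code of SpectNetIde). The MemoryDevice interface below is hypothetical and much smaller than a real IMemoryDevice would be; the Spectrum48Memory class assumes the 48K layout, in which the first 16 KB (0x0000-0x3FFF) is ROM.

```python
from abc import ABC, abstractmethod


class MemoryDevice(ABC):
    """Hypothetical minimal memory abstraction the CPU would talk to."""

    @abstractmethod
    def read(self, addr):
        ...

    @abstractmethod
    def write(self, addr, value):
        ...


class Spectrum48Memory(MemoryDevice):
    """Flat 64 KB memory whose lowest 16 KB behaves as ROM."""

    ROM_END = 0x3FFF

    def __init__(self):
        self._bytes = bytearray(0x10000)

    def read(self, addr):
        return self._bytes[addr & 0xFFFF]

    def write(self, addr, value):
        addr &= 0xFFFF
        # Writes into the ROM area are silently ignored.
        if addr > self.ROM_END:
            self._bytes[addr] = value & 0xFF
```

The CPU code only calls read() and write(), so a test can pass in a plain RAM-only implementation, and a paged or banked model can be swapped in without touching the instruction methods.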

ALU Operations

Implementing ALU operations seems trivial at first sight. How could adding or subtracting two 8-bit or 16-bit values be difficult? When emulating these operations, the challenge is to manage flag changes efficiently, because these instructions change most of the flags.

Incrementing a 16-Bit Register’s Value

Incrementing a 16-bit register keeps all flags unaffected. Thus, as the code of INC HL shows, the implementation is as simple as you expect:

Incrementing an 8-Bit Register’s Value

Unlike a 16-bit increment instruction, an 8-bit operation changes flag values. The code of INC D shows how it goes:

According to the comment, this instruction may change seven flags out of eight (let’s not forget about the two undocumented flags, R3 and R5). It leaves only C unaffected. For performance reasons, I do not set these flags individually, but use a predefined table, s_IncOpFlags, to obtain the value of flags after the increment operation. Observe how the implementation keeps the Carry flag untouched.

Because we work with eight bits of data, we can easily pre-calculate the flag values. The implementation contains ALU helper tables within the Z80Cpu class. When the CPU instance is constructed, the InitializeAluTables() method prepares these tables. This is the code snippet that calculates the contents of s_IncOpFlags:

Note: The FlagSetMask class contains constant values to mask out a particular flag from the F register. There’s another class, FlagResetMask with constant values to reset the specific flag while keeping others from F.
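The table-driven approach can be sketched in Python as follows (this is not the C# code of SpectNetIde). The table is assumed to be indexed by the register value before the increment; the real s_IncOpFlags layout may differ. The flag rules are the standard ones for the 8-bit INC: H is set on a carry out of bit 3, P/V is set when 0x7F overflows to 0x80, N is reset, and C is preserved.

```python
FLAG_C, FLAG_PV, FLAG_R3 = 0x01, 0x04, 0x08
FLAG_H, FLAG_R5, FLAG_Z, FLAG_S = 0x10, 0x20, 0x40, 0x80


def build_inc_flags():
    """Pre-calculate the flags of an 8-bit INC for every input value."""
    table = []
    for old in range(256):
        new = (old + 1) & 0xFF
        # S, R5 and R3 are copies of bits 7, 5 and 3 of the result.
        flags = new & (FLAG_S | FLAG_R5 | FLAG_R3)
        if new == 0:
            flags |= FLAG_Z
        if (old & 0x0F) == 0x0F:
            flags |= FLAG_H      # carry from bit 3 into bit 4
        if old == 0x7F:
            flags |= FLAG_PV     # +127 overflows to -128
        table.append(flags)      # N stays 0; C is merged in by the caller
    return table


INC_FLAGS = build_inc_flags()


def inc8(value, f):
    """Return (new_value, new_f); only the Carry bit of the old F survives."""
    return (value + 1) & 0xFF, INC_FLAGS[value] | (f & FLAG_C)
```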

Adding Two 8-Bit Registers

Using ALU helper tables is a good technique, but as you can see from the implementation details of the ADD A,H instruction, the price of the performance is increased usage of memory:

Here, the s_AdcFlags table contains 2 * 256 * 256 entries: we need to combine 256 different values of A with 256 potential values of H. Besides, we have two additions: ADD, which ignores the carry flag, and ADC, which utilizes carry. This is how I calculate s_AdcFlags entries:

For the sake of completeness, here is the code of ADC A,E. You can observe how the carry flag is woven into the operation:
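A Python sketch of such a table (not the C# code of SpectNetIde) might look like this. The index layout, carry * 0x10000 + A * 0x100 + operand, is an assumption made for the illustration; the flag rules are the standard ones for 8-bit addition.

```python
FLAG_C, FLAG_PV, FLAG_R3 = 0x01, 0x04, 0x08
FLAG_H, FLAG_R5, FLAG_Z, FLAG_S = 0x10, 0x20, 0x40, 0x80


def build_adc_flags():
    """Flags for A + operand + carry, for every combination of inputs."""
    table = bytearray(2 * 256 * 256)   # one byte per entry: 128 KB
    for carry in range(2):
        for a in range(256):
            for operand in range(256):
                total = a + operand + carry
                result = total & 0xFF
                flags = result & (FLAG_S | FLAG_R5 | FLAG_R3)
                if result == 0:
                    flags |= FLAG_Z
                if (a & 0x0F) + (operand & 0x0F) + carry > 0x0F:
                    flags |= FLAG_H
                # Overflow: both inputs have the same sign, the result differs.
                if (a ^ result) & (operand ^ result) & 0x80:
                    flags |= FLAG_PV
                if total > 0xFF:
                    flags |= FLAG_C
                table[carry * 0x10000 + a * 0x100 + operand] = flags
    return table


ADC_FLAGS = build_adc_flags()


def add8(a, operand):
    """ADD ignores the incoming carry: it always uses the carry=0 half."""
    return (a + operand) & 0xFF, ADC_FLAGS[a * 0x100 + operand]


def adc8(a, operand, f):
    """ADC selects the table half according to the current Carry flag."""
    carry = f & FLAG_C
    result = (a + operand + carry) & 0xFF
    return result, ADC_FLAGS[carry * 0x10000 + a * 0x100 + operand]
```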

Adding Two 16-Bit Registers

Today, when we have gigabytes of memory in our computers (and even in mobile devices), storing 128 Kbytes of pre-calculated data seems to be a good tradeoff for the performance gain. However, to execute ALU operations with two 16-bit numbers, we would have to store 8 Gbytes of data in such a helper table. Clearly, that would not be a viable tradeoff. We need to calculate the flag values at run time. To demonstrate this, here is the code behind the ADC HL,DE operation:

The first thing you observe is that the method’s name is not ADCHL_DE, but ADCHL_QQ. It is not a misnaming. QQ indicates that this operation works with any of the BC, DE, HL, and SP registers; this method implements all of the ADC HL,BC, ADC HL,DE, ADC HL,HL, and ADC HL,SP instructions.

These are extended operations with the ED opcode prefix. Bits 4 and 5 of the second opcode byte name the second operand register, the value of which is queried with this code line:

The Registers class provides an indexer property to access 16-bit registers (and another indexer to get or set 8-bit registers). The _registers[qq] expression gets the value of the register specified by the second opcode byte.
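The run-time calculation can be sketched in Python as follows (not the C# code of SpectNetIde). The register file is modeled as a plain list ordered BC, DE, HL, SP, which is an assumption of this example; the flag rules are the standard ones for the 16-bit ADC, where H reflects the carry out of bit 11 and R3/R5 come from the high byte of the result.

```python
FLAG_C, FLAG_PV, FLAG_R3 = 0x01, 0x04, 0x08
FLAG_H, FLAG_R5, FLAG_Z, FLAG_S = 0x10, 0x20, 0x40, 0x80

BC, DE, HL, SP = 0, 1, 2, 3   # hypothetical register indexes


def adc_hl_qq(opcode, regs, f):
    """Sketch of ADC HL,QQ; opcode is the byte after the ED prefix.

    regs is a list [BC, DE, HL, SP]; HL is updated in place and the
    new F value is returned.
    """
    qq = (opcode >> 4) & 0x03          # bits 4 and 5 select the operand
    hl = regs[HL]
    operand = regs[qq]
    carry = f & FLAG_C
    total = hl + operand + carry
    result = total & 0xFFFF
    flags = (result >> 8) & (FLAG_S | FLAG_R5 | FLAG_R3)
    if result == 0:
        flags |= FLAG_Z
    if (hl & 0x0FFF) + (operand & 0x0FFF) + carry > 0x0FFF:
        flags |= FLAG_H
    if (hl ^ result) & (operand ^ result) & 0x8000:
        flags |= FLAG_PV
    if total > 0xFFFF:
        flags |= FLAG_C
    regs[HL] = result
    return flags
```

With the second opcode byte 0x5A (ADC HL,DE), the expression (opcode >> 4) & 0x03 yields 1, which selects DE in this layout.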

Bit Test Instructions

The BIT N,Q instruction, which tests if the Nth bit of the Q 8-bit register is set, uses opcode indirection for both N and Q:

As the code (and its comment) shows, bits 0, 1, and 2 name Q; bits 3, 4, and 5 specify N.

The BITN_Q method itself carries out all 64 bit-test operations that you can execute with the CB prefix. Instead of calculating flag values at run time, I could also create a helper table with 8 * 256 entries (8 entries for N, 256 entries for each 8-bit value) to accelerate the calculation. This method is a good candidate for such a performance refactoring.
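The opcode decoding and the flag calculation can be sketched in Python like this (not the C# code of SpectNetIde). The function names are hypothetical. The sketch covers register operands only and assumes the commonly described undocumented behavior, where R3 and R5 are copied from the tested register and P/V mirrors Z.

```python
FLAG_C, FLAG_PV, FLAG_R3 = 0x01, 0x04, 0x08
FLAG_H, FLAG_R5, FLAG_Z, FLAG_S = 0x10, 0x20, 0x40, 0x80


def decode_bit_opcode(opcode):
    """Split the byte after the CB prefix into (n, q)."""
    n = (opcode >> 3) & 0x07   # bits 3-5: bit number to test
    q = opcode & 0x07          # bits 0-2: register index
    return n, q


def bit_flags(n, value, f):
    """Flags after BIT n on an 8-bit register value; Carry is preserved."""
    flags = FLAG_H | (f & FLAG_C) | (value & (FLAG_R3 | FLAG_R5))
    if (value & (1 << n)) == 0:
        flags |= FLAG_Z | FLAG_PV
    elif n == 7:
        flags |= FLAG_S        # a set bit 7 also shows up in the Sign flag
    return flags
```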

I could use a single method body to implement the 8-bit-register-to-8-bit-register load operations. However, I created 64 separate methods. I did so to avoid two indirections (getting the value of the source register and setting the value of the destination register) for the sake of performance. So, such a transfer operation (e.g. LD D,B) is this simple:

Shift and Rotate Instructions

I used helper tables for shift and rotate instructions with pre-calculated flag values. Here is the implementation of the SLA D operation:

The RR L operation copies the previous carry flag into bit 7. In this implementation, I use two helper tables, according to the value of the carry:
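A Python sketch of this table-based approach (not the C# code of SpectNetIde) is shown below. The table names are hypothetical. The flag rules are the standard ones for the CB-prefixed shift and rotate instructions: S, Z, R5, and R3 come from the result, P/V holds parity, H and N are reset, and C receives the bit shifted out.

```python
FLAG_C, FLAG_PV, FLAG_R3 = 0x01, 0x04, 0x08
FLAG_R5, FLAG_Z, FLAG_S = 0x20, 0x40, 0x80


def shift_flags(result, carry_out):
    """Flags shared by the shift and rotate results (H and N stay 0)."""
    flags = result & (FLAG_S | FLAG_R5 | FLAG_R3)
    if result == 0:
        flags |= FLAG_Z
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV       # even parity
    if carry_out:
        flags |= FLAG_C
    return flags


# One table for SLA, two for RR (one per incoming carry value).
SLA_FLAGS = [shift_flags((v << 1) & 0xFF, v & 0x80) for v in range(256)]
RR_FLAGS_NO_CARRY = [shift_flags(v >> 1, v & 0x01) for v in range(256)]
RR_FLAGS_CARRY = [shift_flags((v >> 1) | 0x80, v & 0x01) for v in range(256)]


def sla(value):
    return (value << 1) & 0xFF, SLA_FLAGS[value]


def rr(value, f):
    # The previous carry moves into bit 7; bit 0 becomes the new carry.
    if f & FLAG_C:
        return (value >> 1) | 0x80, RR_FLAGS_CARRY[value]
    return value >> 1, RR_FLAGS_NO_CARRY[value]
```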

Logical Operations

The Z80 CPU provides logical operations between A and the other 8-bit registers, such as OR, AND, XOR. Their implementation uses the same helper table, s_AluOpFlags, as the implementation of OR C and AND C shows:
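The shared table can be sketched in Python as follows (not the C# code of SpectNetIde). It assumes the table is indexed by the result of the operation, which is a guess about s_AluOpFlags rather than a documented fact. The flag rules are the standard ones: S, Z, R5, and R3 come from the result, P/V holds parity, N and C are reset, and AND additionally sets H.

```python
FLAG_PV, FLAG_R3, FLAG_H = 0x04, 0x08, 0x10
FLAG_R5, FLAG_Z, FLAG_S = 0x20, 0x40, 0x80


def logic_flags(result):
    flags = result & (FLAG_S | FLAG_R5 | FLAG_R3)
    if result == 0:
        flags |= FLAG_Z
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV       # even parity
    return flags               # N and C are always reset


# One 256-entry table, indexed by the result, serves OR, XOR and AND.
ALU_OP_FLAGS = [logic_flags(r) for r in range(256)]


def or8(a, value):
    result = a | value
    return result, ALU_OP_FLAGS[result]


def xor8(a, value):
    result = a ^ value
    return result, ALU_OP_FLAGS[result]


def and8(a, value):
    result = a & value
    return result, ALU_OP_FLAGS[result] | FLAG_H   # AND also sets H
```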

The DAA Instruction

Believe it or not, the DAA instruction is probably the most complicated Z80 instruction, though its implementation does not reflect this fact:

As the table embedded into the comment suggests, the tough job is to create the s_DaaResults helper table. The current implementation of the calculation method is about a hundred lines of code.
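To give a feel for what such a table has to encode, here is a Python sketch of the commonly described DAA algorithm, calculated directly for a single input (this is not the SpectNetIde calculation method, and not its table layout). The function name is hypothetical.

```python
FLAG_C, FLAG_N, FLAG_PV, FLAG_R3 = 0x01, 0x02, 0x04, 0x08
FLAG_H, FLAG_R5, FLAG_Z, FLAG_S = 0x10, 0x20, 0x40, 0x80


def daa(a, f):
    """Return (new_a, new_f) after decimal adjustment of A."""
    carry = f & FLAG_C
    half = f & FLAG_H
    sub = f & FLAG_N
    low = a & 0x0F
    correction = 0
    if half or low > 9:
        correction |= 0x06     # fix the low BCD digit
    if carry or a > 0x99:
        correction |= 0x60     # fix the high BCD digit
    if sub:
        result = (a - correction) & 0xFF
    else:
        result = (a + correction) & 0xFF
    flags = (result & (FLAG_S | FLAG_R5 | FLAG_R3)) | sub   # N is preserved
    if result == 0:
        flags |= FLAG_Z
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV       # even parity
    if carry or a > 0x99:
        flags |= FLAG_C
    if sub:
        if half and low < 6:
            flags |= FLAG_H
    elif low > 9:
        flags |= FLAG_H
    return result, flags
```

For example, adding the BCD values 0x15 and 0x27 leaves 0x3C in A with H and C clear; daa(0x3C, 0x00) adjusts it to 0x42.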

Many other instructions are worth mentioning. In the next post, you will learn implementation details about interrupt handling, I/O, and block transfer instructions.
