Lab 8. Optional parts

Everything in this file is optional and worth extra credit, up to 10 points. Earlier optional parts must be completed in order to receive credit for later optional parts. There is no partial credit for any of the extra credit parts. Make sure you’ve completed everything else before attempting these.

Handling signed zeros and NaN (1 pt)

In this part, you’ll add support for NaN inputs as well as signed zeros in your add() function.

When either of the arguments to add() is NaN, return NaN. There are many different NaN values. For this part, it won’t matter which one you return. For the final optional part, we’ll revisit this and return the “correct” NaN.

The description in the required part of the lab said that if one of the arguments to add() is zero, you should return the other argument. That isn’t quite correct if both x and y are zero. In that case, we need to return a zero, but $+0$ or $-0$? The rule here is simple: if both zeros have the same sign, return a zero with that sign. If they have opposite signs, then return positive zero. Finally, adding (finite) values $x$ and $-x$ returns positive zero.

We can express these two rules in the following equations.

$$(+0) + (+0) = +0 \qquad (-0) + (-0) = -0 \qquad (+0) + (-0) = +0$$

$$x + (-x) = +0 \quad \text{for finite } x$$
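Here’s a hedged sketch of how these cases might look at the top of add(). The helper names (is_nan(), is_zero(), sign()) and the constants F16::NAN and F16::ZERO are assumptions; adapt them to whatever your F16 type actually provides.

fn add(x: F16, y: F16) -> F16 {
    // NaN propagates; for this part any NaN bit pattern is acceptable.
    if x.is_nan() || y.is_nan() {
        return F16::NAN;
    }
    // Both zero: same sign keeps that sign, opposite signs give +0.
    if x.is_zero() && y.is_zero() {
        return if x.sign() == y.sign() { x } else { F16::ZERO };
    }
    // Exactly one zero: return the other argument, as in the required part.
    if x.is_zero() {
        return y;
    }
    if y.is_zero() {
        return x;
    }
    // ... the rest of the addition from the required part ...
    todo!()
}

(The $x + (-x)$ case needs no special handling here: the significands cancel in the general path, and you return positive zero when the sum is exactly 0.)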

You can test your code by running cargo test --test nanzero -- --ignored (note all the -- there, they’re all required).

$ cargo test --test nanzero -- --ignored
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/nanzero.rs (target/debug/deps/nanzero-eb399c667b6b43b2)

running 3 tests
test test_add_nan ... ok
test test_result_zero ... ok
test test_add_zero ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Handling infinity and overflow (1 pt)

Handling infinite inputs is easy. If $x$ is a finite number, then

$$x + (+\infty) = +\infty \qquad x + (-\infty) = -\infty$$

Likewise, two infinities with the same sign add to that same infinity, while $(+\infty) + (-\infty) = \mathrm{NaN}$. And of course, adding any infinity to a NaN remains NaN.

If the result of adding two finite numbers has an exponent value that’s greater than 15, then return either $+\infty$ or $-\infty$, depending on the sign of the result. (This is floating point overflow.)
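As a sketch (again with assumed helper names; F16::INFINITY and F16::NEG_INFINITY are placeholder constants), the two pieces might look like this:

// Infinite inputs, checked after the NaN cases from the previous part.
if x.is_infinite() || y.is_infinite() {
    return if x.is_infinite() && y.is_infinite() && x.sign() != y.sign() {
        F16::NAN // (+inf) + (-inf) has no meaningful value
    } else if x.is_infinite() {
        x // a finite y (or a same-signed infinity) can't change x
    } else {
        y
    };
}

// Overflow, checked after normalizing the finite sum (z_sign/z_exp assumed).
if z_exp > 15 {
    return if z_sign { F16::NEG_INFINITY } else { F16::INFINITY };
}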

You can test your code by running cargo test --test infinity -- --ignored (note all the -- there, they’re all required).

$ cargo test --test infinity -- --ignored
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/infinity.rs (target/debug/deps/infinity-b12975c711da657e)

running 2 tests
test test_add_infinity ... ok
test test_infinite_result ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Handling subnormal numbers and underflow (3 pts)

Nonzero numbers between $-2^{-14}$ and $2^{-14}$ cannot be represented by a normal half-precision floating point number. Some of these numbers can be represented as subnormal numbers. As shown in the table at the beginning of this lab, subnormal numbers have an exponent field of 0 and the fraction bits hold the entire significand (i.e., there is no hidden bit).

For example, the smallest positive number that’s representable in half-precision floating point has representation 0 00000 0000000001. This corresponds to the number $0.0000000001_2 \times 2^{-14} = 2^{-24}$.

Note that being subnormal is a statement about the representation of the number. It’s not an intrinsic quality of the number. For example, the subnormal number shown above is only subnormal in half-precision floating point. It’s a normal single- or double-precision floating point number. E.g., its single-precision representation is 0 01100111 00000000000000000000000, which is normal (because the exponent field is strictly between 0 and 255).

To handle subnormal numbers, I recommend making the following changes to your add() implementation. First, add a helper function that takes a finite F16 and returns a (bool, i32, i32) triple similar to F16::to_parts(), except you want to return the unbiased exponent and the actual significand (including the hidden bit for normal numbers). Now you can implement the addition in terms of these three components of each number in a unified manner. I.e., you don’t need a bunch of cases to handle adding (finite) numbers together. We’ll adjust this helper function in the next part.
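For concreteness, here’s a hedged sketch of such a helper, assuming a hypothetical bits() accessor that returns the raw u16 (adapt to your actual representation):

// Decompose a finite F16 into (sign, unbiased exponent, full significand).
fn to_real_parts(x: F16) -> (bool, i32, i32) {
    let bits = x.bits();
    let sign = bits >> 15 == 1;
    let exp_field = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x3ff) as i32;
    if exp_field == 0 {
        (sign, -14, frac) // subnormal: fixed exponent -14, no hidden bit
    } else {
        (sign, exp_field - 15, frac | 0x400) // normal: unbias, add hidden bit
    }
}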

Example

Let’s add a normal number $x = -1.0001100000_2 \times 2^{-14}$ and a subnormal number $y = 0.1101000000_2 \times 2^{-14}$. (Note that $y$ is subnormal because if we were to normalize it, we’d have $y = 1.101_2 \times 2^{-15}$ and the exponent, $-15$, is smaller than the minimum exponent we can represent with a normal number, $-14$.)

These numbers have representation x = 1_00001_0001100000 and y = 0_00000_1101000000.

The helper function described above would return (true, -14, 0b100_01100000) for x and (false, -14, 0b11_01000000) for y. Notice that the exponent for x is $-14$ because x’s exponent field is 1 and we get $1 - 15 = -14$ when we subtract the bias value. The exponent for y is also $-14$ because y’s exponent field is 0, which indicates it’s a subnormal number. Note further that the significand for x is the fraction bits plus the hidden bit whereas the significand for y is just the fraction bits.
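If you’ve written a helper like the sketch above, you can sanity-check it directly against this example (F16::from_bits is an assumed bit-pattern constructor):

assert_eq!(to_real_parts(F16::from_bits(0b1_00001_0001100000)),
           (true, -14, 0b100_01100000));
assert_eq!(to_real_parts(F16::from_bits(0b0_00000_1101000000)),
           (false, -14, 0b11_01000000));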

Once you have the sign, exponent, and significand of both values, add them together exactly as you did in the required part of the lab.

If the result is 0, return a 0. If the normalized result has an exponent greater than 15, return an infinity. If the normalized result has an exponent in the range $-14$ to $15$ (inclusive), return a normal number. Otherwise, the exponent is smaller than $-14$. Shift the result’s significand to the right the number of bits required to make the exponent $-14$. For example, if the normalized result had an exponent of $-17$, then shift the significand to the right by 3 bits. Finally, construct an F16 with the appropriate sign, an exponent field of 0, and the shifted significand as the fraction bits. Note that if the final shifting step shifts all of the 1 bits out of the number, then the result will be a (signed) zero. (This is floating point underflow.)
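In code, the subnormal/underflow tail might look like this sketch (z_sign, z_exp, z_sig are the normalized result; F16::from_fields is a hypothetical constructor taking the sign, exponent field, and fraction bits):

// Result too small for a normal number: shift into subnormal position.
if z_exp < -14 {
    let shift = -14 - z_exp;
    // If every 1 bit is shifted out, frac is 0: the result underflows to zero.
    let frac = if shift >= 32 { 0 } else { z_sig >> shift };
    return F16::from_fields(z_sign, 0, frac as u16);
}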

Example

Continuing the example from above, we have

x_sign = true
x_exp = -14
x_sig = 0b100_01100000 (leading zeros omitted)

y_sign = false
y_exp = -14
y_sig = 0b11_01000000

Since x_exp = y_exp, we don’t need to shift the value with the smaller exponent. We can now add the significands together giving

z_sign = true
z_exp = -14
z_sig = 0b1_00100000

Normalizing this requires shifting the significand left by 2 bits.

z_sign = true
z_exp = -16
z_sig = 0b100_10000000

Since the exponent $-16 < -14$, the result is a subnormal number. We need to shift the value back to the right by 2 bits to make the exponent be $-14$.

z_sign = true
z_exp = -14
z_sig = 0b1_00100000

(In this example, this step simply undid the previous normalization step. That won’t happen in general and you won’t know you need to do this until after you’ve normalized the result, z, and found its exponent is less than $-14$.)

The final value is 1_00000_0100100000.

Test your code by running cargo test --test subnormal -- --ignored.

$ cargo test --test subnormal -- --ignored
   Compiling fp v0.1.0 (/Users/steve/teaching/210/rust-labs/fp)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.23s
     Running tests/subnormal.rs (target/debug/deps/subnormal-750204f0c5dd8830)

running 1 test
test test_add_subnormals ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Accurate sums with correct rounding (4 pts)

One of the steps of the addition algorithm is to shift the input with the smaller exponent to the right so that both inputs have the same exponent.

Shifting to the right loses bits and sometimes these bits make a difference. Let’s look at an example.

Example

Let $x = -1.0000000001_2 \times 2^2$ and $y = 1.0_2 \times 2^3$. In decimal, this is adding $-4.00390625 + 8$, which is $3.99609375$. This gives

x_sign = true
x_exp = 2
x_sig = 0b100_00000001

y_sign = false
y_exp = 3
y_sig = 0b100_00000000

We shift x to the right by 1.

x_sign = true
x_exp = 3
x_sig = 0b10_00000000 (we lost the least significant 1 bit)

Adding the significands (taking the sign into account) gives

z_sign = false
z_exp = 3
z_sig = 0b10_00000000

Normalizing this requires shifting left by 1.

z_sign = false
z_exp = 2
z_sig = 0b100_00000000

or $1.0_2 \times 2^2 = 4$. This is not correct because there exists a half-precision floating point number that is closer to the actual value, namely $1.1111111110_2 \times 2^1 = 3.99609375$. In fact, this is the exact value of the sum.

To get the correct value, we need to preserve some of the bits that we shifted off. The Patterson and Hennessy textbook, section 3.5, contains a subsection titled “Accurate Arithmetic” which describes a clever approach using just three additional bits, called “guard,” “round,” and “sticky,” to represent the intermediate values in the computation.

An alternative approach is to just use more bits. In the optional section on subnormals above, I suggested a helper function which turns an F16 into the triple (sign, real_exponent, significand) and I suggested using an i32 for the significand. Let’s make use of that here.

Change your helper function to shift all of the significands left by 16 and subtract 16 from the exponents.
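In terms of the sketch from the subnormal part, the change is tiny (names still hypothetical):

// Widened helper: 16 extra low-order bits of precision for intermediate sums.
fn to_wide_parts(x: F16) -> (bool, i32, i32) {
    let (sign, exp, sig) = to_real_parts(x);
    (sign, exp - 16, sig << 16)
}

The value represented is unchanged because $\mathrm{sig} \times 2^{\mathrm{exp}} = (\mathrm{sig} \times 2^{16}) \times 2^{\mathrm{exp} - 16}$.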

Let’s redo the calculation to see how that impacts our results.

Example

Redoing the previous example with the change made to the helper function gives

x_sign = true
x_exp = -14
x_sig = 0b100_00000001_00000000_00000000

y_sign = false
y_exp = -13
y_sig = 0b100_00000000_00000000_00000000

Shifting x to the right by 1 to give it the same exponent as y, gives

x_sign = true
x_exp = -13
x_sig = 0b10_00000000_10000000_00000000

Adding the significands gives

z_sign = false
z_exp = -13
z_sig = 0b1_11111111_10000000_00000000

Normalizing requires shifting right by 14 (because we need the most significant 1 to be in bit 10 as usual)

z_sign = false
z_exp = 1
z_sig = 0b111_11111110

This is the value we want: $1.1111111110_2 \times 2^1 = 3.99609375$!

So are we done? Not quite! There’s one more complicated step we have to perform: rounding.

Floating point rounding is similar to how we round real numbers to integers in arithmetic: fractional values greater than 0.5 round up and fractional values smaller than 0.5 round down. The same is true for floating point numbers except we’re going to be looking at rounding the normalized significand of the result.

Specifically, if the normalization step requires shifting to the right to put the leading 1 in bit 10, then whether we need to round depends on the bits that are shifted off of the significand. Let’s look at an example.

Example

Let $x = 0.0000001101_2 \times 2^{-14}$ and $y = -1.1110010001_2 \times 2^{-1}$. (Note that the first number is subnormal, although the same rounding rules apply to any numbers.)

x_sign = false
x_exp = -30
x_sig = 0b1101_00000000_00000000

y_sign = true
y_exp = -17
y_sig = 0b111_10010001_00000000_00000000

Denormalizing requires shifting x right by 13 bits which gives

x_sign = false
x_exp = -17
x_sig = 0b1101000

Adding the significands with signs gives

z_sign = true
z_exp = -17
z_sig = 0b111_10010000_11111111_10011000

Normalizing requires shifting right by 16 in this case which gives

z_sign = true
z_exp = -1
z_sig = 0b111_10010000
remainder = 0b11111111_10011000

where remainder holds the bits we shifted off the end of the significand. The way to think about this is that we have a fractional significand. In this case, that would be 0b111_10010000.11111111_10011000 (this is just z_sig.remainder). We want to round the significand, and the question is: is the remainder greater than half, which in binary is 0b10000000_00000000? Clearly 0b11111111_10011000 > 0b10000000_00000000, so we round the significand up to z_sig = 0b111_10010001.

Since z_exp is in the range for normal numbers, the result is $-1.1110010001_2 \times 2^{-1}$.

Let’s look at another example. Specifically, let’s look at what happens if the remainder is exactly half. In this scenario, the standard says that we should “round to even,” meaning that if the remainder is exactly half and z_sig (before rounding) is an even integer, we round down, and if z_sig is an odd integer, we round up.

Example

Let $x = 1.0011111100_2 \times 2^3$ and $y = 1.1101011111_2 \times 2^3$ (two normal numbers).

x_sign = false
x_exp = -13
x_sig = 0b100_11111100_00000000_00000000

y_sign = false
y_exp = -13
y_sig = 0b111_01011111_00000000_00000000

Both have the same exponent so we don’t need to shift. Adding the significands yields

z_sign = false
z_exp = -13
z_sig = 0b1100_01011011_00000000_00000000

Normalizing requires shifting right by 17 which gives

z_sign = false
z_exp = 4
z_sig = 0b110_00101101
remainder = 0b1_00000000_00000000

In this case, z_sig is odd and the 17-bit remainder is exactly half (again, we should think of this as 0b110_00101101.1_00000000_00000000). The rule says we should round up, giving a significand of z_sig = 0b110_00101110 with a final value of $1.1000101110_2 \times 2^4 = 24.71875$.

The final thing to consider is what happens if we round up (either because the remainder was greater than half or because it was exactly half and rounding up goes toward the even number) and this causes our significand to become denormalized, i.e., the carry pushes the leading 1 into bit 11. In this case, we normalize by shifting the significand right by 1 and incrementing the exponent by 1. One final example should make this clear.

Example

Let $x = 1.0001101100_2 \times 2^1$ and $y = -1.1011000011_2 \times 2^{-3}$.

x_sign = false
x_exp = -15
x_sig = 0b100_01101100_00000000_00000000

y_sign = true
y_exp = -19
y_sig = 0b110_11000011_00000000_00000000

Denormalize y.

y_sign = true
y_exp = -15
y_sig = 0b1101100_00110000_00000000

Add.

z_sign = false
z_exp = -15
z_sig = 0b11_11111111_11010000_00000000

Normalize.

z_sign = false
z_exp = 0
z_sig = 0b111_11111111
remainder = 0b1010000_00000000

Now we see the issue. The normalized z_sig is eleven 1s, the largest possible 11-bit significand. But at the same time, the remainder is greater than half (0b1010000_00000000 > 0b1000000_00000000), so we round up. This gives a denormalized significand of 0b1000_00000000. So we shift right by one and increment the exponent by one, giving

z_sign = false
z_exp = 1
z_sig = 0b100_00000000

which is $1.0000000000_2 \times 2^1 = 2$.

(If we do standard binary arithmetic on these numbers, we have $x + y = 1.9996337890625$, which is indeed very close to $2$.)
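Pulling the rounding rules together, here’s a hedged sketch of the normalize-and-round step. All names (wide_sig, shift, z_exp, z_sig) are placeholders, and it assumes shift >= 1, i.e., normalization actually required a right shift:

// Round to nearest, ties to even.
let mut z_sig = wide_sig >> shift;             // 11-bit significand
let remainder = wide_sig & ((1 << shift) - 1); // the bits shifted off
let half = 1 << (shift - 1);                   // exactly half in remainder units
if remainder > half || (remainder == half && (z_sig & 1) == 1) {
    z_sig += 1; // round up (ties go to the even significand)
}
if z_sig == 0b1000_00000000 {
    // Rounding carried into bit 11: renormalize.
    z_sig >>= 1;
    z_exp += 1;
}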

To test, run cargo test --test rounding -- --ignored.

$ cargo test --test rounding -- --ignored
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/rounding.rs (target/debug/deps/rounding-e3ad8ea17f5a2a42)

running 1 test
test test_rounding ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Comparing to a real implementation (1 pt)

If you made it this far, congratulations! There’s only one thing left to do: compare the results of your implementation to a real f16 implementation. There are only $2^{16} = 65536$ possible f16 values, so we can try all $2^{32}$ possible sums with your implementation and with a real one and make sure everything is bit-for-bit correct.
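Conceptually, the comparison is just a doubly nested loop over all bit patterns. Here’s a hedged sketch (the real test lives in tests/exact.rs; F16::from_bits and to_bits are assumed conversions for the lab’s type, and the nightly-only primitive f16 provides its own from_bits/to_bits):

// Exhaustively compare our add() against the nightly primitive f16.
#[test]
fn add_matches_real_f16() {
    for a in 0..=u16::MAX {
        for b in 0..=u16::MAX {
            let ours = add(F16::from_bits(a), F16::from_bits(b)).to_bits();
            let real = (f16::from_bits(a) + f16::from_bits(b)).to_bits();
            assert_eq!(ours, real, "mismatch at {a:#06x} + {b:#06x}");
        }
    }
}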

To be able to use f16 in Rust, we need to use the “nightly” version of the compiler. This won’t work on the lab machines, which don’t have nightly installed. You’ll have to run these final tests on your own computer with Rust installed via rustup.

The command to run these tests is a little convoluted: cargo +nightly test --features real-f16 --release --test exact.

$ cargo +nightly test --features real-f16 --release --test exact
    Finished `release` profile [optimized] target(s) in 0.03s
     Running tests/exact.rs (target/release/deps/exact-05dde320acdb2f26)

running 1 test
test test_add_exact ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 25.26s