Lab 8. Floating Point

Due: Sunday, April 27 at 23:59

In this lab, you will implement part of the IEEE 754 Standard for Floating-Point Arithmetic. In particular, you will implement half-precision (16-bit) floating point addition.

This lab has opportunities for extra credit.

In addition to the implementation, there are questions to answer.

Preliminaries

You may work on this lab in groups of up to two people.

Click on the assignment link.

Once you have accepted the assignment, you can clone the repository on your computer by following the instructions and begin working. If you’re working with a partner, the first person to accept the assignment will create the team name. The second person will select which team to join. DO NOT JOIN THE WRONG TEAM!

Be sure to ask any questions on Ed.

Background

Half-precision floating point numbers are similar to the 32-bit single-precision and 64-bit double-precision floating point numbers we talked about in class. Click on the link in the previous sentence to read more about half-precision floating point numbers.

Just like single- and double-precision floating point numbers, half-precision floating point numbers are represented with three fields: one sign bit, five bits of biased exponent, and 10 fraction bits. The half-precision format represents 5 categories of numbers based on the values stored in the exponent and fraction fields. These categories are described in the table below, with $s$ being the value of the sign bit, $e$ the value of the (biased) exponent field, and $f$ the value of the fraction field.

Category  | Exponent                | Fraction | Represented number
----------|-------------------------|----------|------------------------------------
Normal    | $00001_2$ to $11110_2$  | anything | $(-1)^s \times 1.f \times 2^{e-15}$
Subnormal | $00000_2$               | nonzero  | $(-1)^s \times 0.f \times 2^{-14}$
Zero      | $00000_2$               | zero     | $(-1)^s \times 0$
Infinity  | $11111_2$               | zero     | $(-1)^s \times \infty$
NaN       | $11111_2$               | nonzero  | Not a Number

As you can see from the table above, normal numbers are stored with an exponent bias of 15. For example, the number $-37.25 = -1.0010101_2 \times 2^5$, so it would be represented with $s = 1$, $e = 5 + 15 = 20 = 10100_2$, and $f = 0010101000_2$, yielding the binary representation 1 10100 0010101000. You can play around with other values at float.exposed.
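As a quick sanity check of that encoding, the fields can be assembled with shifts and ORs (a sketch; the shift amounts 15 and 10 come from the field widths above):

let bits: u16 = (1 << 15) | (20 << 10) | 0b0010101000;
assert_eq!(bits, 0b1_10100_0010101000);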

The smallest, positive, normal number has representation 0 00001 0000000000, which corresponds to the value $1.0 \times 2^{-14} \approx 6.1 \times 10^{-5}$. Smaller positive values are represented as subnormal numbers. As you can see from the table above, these are represented with an exponent field of 0 and a nonzero fraction field, and the number they represent does not have a “hidden bit.” Instead, the entire significand is stored in the fraction field, which is why the “Represented number” column shows $0.f$ whereas the normal numbers have $1.f$.

Program Specification

In the assignment repo, you’ll find a Rust project containing a single source file, src/lib.rs, and several testing files in the tests directory.

Inside src/lib.rs, you’ll find a Rust struct named F16 which contains a single u16 field. Each F16 holds a single half-precision floating point value as a 16-bit unsigned integer.

To create an F16 with a given bit representation, you can write F16(representation). For example, to create an F16 that represents $-0$, write F16(0b1_00000_0000000000). (Rust lets you separate the digits of a number using _ as in that example. You could equivalently write this as F16(0b1000000000000000) or, in hex, as F16(0x8000).)

Several functions are provided for you: F16::from_f32() and F16::to_f32() convert from and to f32, Rust’s single-precision floating point number type. You must not use these functions in your code; they are provided to support testing and printing. You may print an x: F16 normally with println!("{x}"), which converts x to an f32 and prints that, or using the “debug” representation, println!("{x:?}"), which prints in scientific (binary) notation.

You must use bit masking and shifting to manipulate bits. In particular, you are not allowed to convert your integers to strings.
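For example, the three fields can be pulled out of a raw u16 with masks and shifts like the following (a sketch using the bit pattern from the Background section; the variable names are illustrative):

let bits: u16 = 0b1_10100_0010101000;   // -37.25 from the example above
let sign = (bits >> 15) & 1;            // 1 sign bit in bit 15
let exponent = (bits >> 10) & 0b11111;  // 5 exponent bits in bits 14..10
let fraction = bits & 0b11_1111_1111;   // 10 fraction bits in bits 9..0
assert_eq!((sign, exponent, fraction), (1, 0b10100, 0b0010101000));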

Number representation

Your first task is to implement all of the functions in src/lib.rs that contain todo!("Implement me"). These functions are described briefly below, with more detailed comments appearing in src/lib.rs.

  • (3 pts) fn from_parts(negative: bool, exponent: u16, fraction: u16) -> Self constructs an F16 from the three fields: sign, exponent, and fraction. You should construct the 16-bit representation of the number and return it as an F16. For normal numbers, the exponent argument will already be biased.

  • (3 pts) fn to_parts(self) -> (bool, u16, u16) returns a triple containing the sign bit (as a bool), the (biased) exponent, and the fraction bits. This is a counterpart to F16::from_parts() described above.

You should start by implementing these two functions; all of the other functions you write will use one or both of them. After you write these, run the tests to make sure the test_round_trip_through_parts test passes. (See below for how to run the tests.)
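For reference, here is a minimal sketch of the packing that from_parts() needs to perform, assuming the field layout from the table above (pack is an illustrative free function, not the starter code; error checking is omitted):

fn pack(negative: bool, exponent: u16, fraction: u16) -> u16 {
    // sign in bit 15, biased exponent in bits 14..10, fraction in bits 9..0
    ((negative as u16) << 15) | ((exponent & 0x1F) << 10) | (fraction & 0x3FF)
}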

  • (1 pt) fn is_zero(self) -> bool returns true if self represents $\pm 0$.
  • (1 pt) fn is_sign_positive(self) -> bool returns true if self’s sign bit is 0.
  • (1 pt) fn is_sign_negative(self) -> bool returns true if self’s sign bit is 1.
  • (1 pt) fn is_infinite(self) -> bool returns true if self represents $\pm\infty$.
  • (1 pt) fn is_nan(self) -> bool returns true if self represents NaN.
  • (1 pt) fn is_normal(self) -> bool returns true if self represents a normal number.
  • (1 pt) fn is_subnormal(self) -> bool returns true if self represents a subnormal number.

For each of the preceding functions, you should use self.to_parts() to determine if the function should return true or false.
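For example, one of these predicates might look like the following sketch, which assumes the starter code’s F16 and your to_parts() (per the category table, infinity has an all-ones exponent field and a zero fraction field; in the lab this would be a method rather than a free function):

fn is_infinite(x: F16) -> bool {
    let (_sign, exponent, fraction) = x.to_parts();
    exponent == 0b11111 && fraction == 0
}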

After you implement these functions, all of the tests in tests/representation.rs should pass. (See below for how to run the tests.)

Once all of the tests in tests/representation.rs pass, you can move on to implementing floating point addition. The first part, described below, is required. The remaining implementation parts, described on the next page, are optional and worth extra credit.

Implementing addition for zero and normal numbers

The fn add(self, rhs: Self) -> Self::Output function takes two F16 values, self and rhs (right-hand side), which correspond to the operands of + in self + rhs. For ease of reference, the starter code assigns these values to variables x and y, and this description will refer to x and y. The add function needs to return an F16 that corresponds to the sum of x and y.

Remember, you are not allowed to convert x and y to f32 and add them that way. You must implement all of the floating point addition using integer operations only.

For this required portion, you only have to support adding normal floating point numbers or zeros (positive or negative). This will only be tested by adding normal numbers (or 0) together when the result will be a normal number (or 0).

You should implement the following algorithm. (A code sketch of these steps appears after the list.)

  1. If either x or y is 0 (you should use your is_zero() function), return the other value.
  2. Otherwise, x and y are both normal numbers which should be added together; the result will be a normal number or zero.
    1. Use x.to_parts() and y.to_parts() to get the 3 fields of each number and unbias the exponents.
    2. Construct the significands for x and y by ORing the HIDDEN_BIT (a constant provided for you) with the corresponding fraction fields.
    3. Shift the significand of whichever of x or y has the smaller exponent to the right by the difference of the two exponents. At this point, we have two denormalized numbers which have the same (larger) exponent.
    4. Give the significands the sign of their corresponding floating point value. I.e., if x is negative, then make x’s significand negative to perform the addition. Similarly for y. The result of this addition is the significand of the final result (including its sign).
    5. If the result’s significand is negative, remember this fact and make it positive.
    6. If the result’s significand is zero, return $+0$.
    7. Otherwise, normalize the result by shifting the significand left or right as required such that there is a 1 in the position of the HIDDEN_BIT and only 0s in the more significant bits. Remember to update the exponent when you shift.
    8. Return the resulting F16 via F16::from_parts(), passing the sign flag, the exponent (after biasing), and the fraction bits (which you get from masking off the hidden bit using a bit operation).
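Here is a minimal sketch of those steps operating on raw u16 bit patterns (the name add_f16_bits and all of the constants are illustrative, not part of the starter code). It uses a 32-bit signed significand and ignores rounding, overflow to infinity, and subnormal results, none of which the required tests exercise:

fn add_f16_bits(x: u16, y: u16) -> u16 {
    const HIDDEN_BIT: i32 = 1 << 10; // just above the 10 fraction bits

    // Step 1: if either operand is +/-0, return the other one.
    if (x & 0x7FFF) == 0 { return y; }
    if (y & 0x7FFF) == 0 { return x; }

    // Step 2.1: split into fields and unbias the exponents.
    let (x_sign, y_sign) = ((x >> 15) & 1, (y >> 15) & 1);
    let (x_exp, y_exp) = (((x >> 10) & 0x1F) as i32 - 15, ((y >> 10) & 0x1F) as i32 - 15);

    // Step 2.2: build the significands by ORing in the hidden bit.
    let mut x_sig = (x & 0x3FF) as i32 | HIDDEN_BIT;
    let mut y_sig = (y & 0x3FF) as i32 | HIDDEN_BIT;

    // Step 2.3: align the number with the smaller exponent to the larger
    // exponent, flushing to 0 if the shift would be too large (see Hints).
    let mut exp = x_exp.max(y_exp);
    let diff = (x_exp - y_exp).unsigned_abs();
    if x_exp < y_exp {
        x_sig = if diff >= 32 { 0 } else { x_sig >> diff };
    } else {
        y_sig = if diff >= 32 { 0 } else { y_sig >> diff };
    }

    // Step 2.4: signed addition of the significands.
    let x_signed = if x_sign == 1 { -x_sig } else { x_sig };
    let y_signed = if y_sign == 1 { -y_sig } else { y_sig };
    let mut z_sig = x_signed + y_signed;

    // Steps 2.5 and 2.6: record the sign and work with the magnitude;
    // a zero significand means the operands cancelled exactly.
    let z_sign = z_sig < 0;
    z_sig = z_sig.abs();
    if z_sig == 0 { return 0; } // +0

    // Step 2.7: normalize so the leading 1 is at the hidden-bit position.
    // A normalized 32-bit significand has 21 leading zeros (see Hints).
    let lz = (z_sig as u32).leading_zeros() as i32;
    if lz < 21 {
        z_sig >>= 21 - lz; // truncates; the required tests need no rounding
        exp += 21 - lz;
    } else {
        z_sig <<= lz - 21;
        exp -= lz - 21;
    }

    // Step 2.8: rebias the exponent and mask off the hidden bit.
    ((z_sign as u16) << 15) | (((exp + 15) as u16) << 10) | ((z_sig & 0x3FF) as u16)
}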

Example

Let’s add $33.5$ and $-0.125$. After extracting the fields, unbiasing the exponents, and ORing the HIDDEN_BIT, we have

x_sign = false
x_exp = 5
x_sig = 0b00000100_00110000

y_sign = true
y_exp = -3
y_sig = 0b00000100_00000000

Since y_exp < x_exp, we shift y_sig right by $5 - (-3) = 8$, giving

x_sign = false
x_exp = 5
x_sig = 0b00000100_00110000

y_sign = true
y_exp = 5 (note that this was incremented by 8 because we shifted right by 8)
y_sig = 0b00000000_00000100

Next, we add x_sig and -y_sig (it’s negative because y_sign = true) giving

z_exp = 5
z_sig = 0b00000100_00101100

Since z_sig > 0, we set z_sign = false and leave z_sig alone.

Next, we need to shift z_sig left or right until it has a 1 in the position of the HIDDEN_BIT and 0s in the more significant bits. In this case, we lucked out and z_sig already satisfies that. Therefore, the final result is 0b0_10100_0000101100.

Example

Let’s add $-12.8125$ and $12.75$.

x_sign = true
x_exp = 3
x_sig = 0b00000110_01101000

y_sign = false
y_exp = 3
y_sig = 0b00000110_01100000

Since x_exp = y_exp, we don’t need to shift either.

Next, we add -x_sig and y_sig giving

z_exp = 3
z_sig = 0b11111111_11111000

Since z_sig < 0, we set

z_sign = true
z_exp = 3
z_sig = 0b00000000_00001000

Now we need to shift z_sig left by 7 to put the leading 1 in the correct position.

z_sign = true
z_exp = -4
z_sig = 0b00000100_00000000

Therefore, the final result is 0b1_01011_0000000000.
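Once your add works, you can check this worked example directly (a sketch that assumes F16 values can be added with + and compared with ==, as the provided tests do):

// -12.8125 + 12.75 == -0.0625, using the bit patterns derived above
let x = F16(0b1_10010_1001101000);
let y = F16(0b0_10010_1001100000);
assert_eq!(x + y, F16(0b1_01011_0000000000));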

Once you finish this part, all of the tests in tests/add.rs should pass. See below for how to run the tests.

Questions

Create a README.md with both partners’ names and answer the following questions.

  1. (1 pt) We saw in class that single-precision floating point numbers can exactly represent all integers in the range $[-2^{24}, 2^{24}]$ but no larger range, meaning that $2^{24} + 1$ is not representable as a single-precision floating point number. Try that out for yourself at float.exposed. Double-precision floating point numbers can exactly represent all integers in the range $[-2^{53}, 2^{53}]$ but no larger range. (Note that this doesn’t mean that larger integers aren’t representable. For example, $2^{24} + 2$ is representable in single-precision floating point even though $2^{24} + 1$ is not.)

    What is the range of integers that are all exactly representable in half-precision floating point? (There are several ways to answer this question. One way is to use your F16 implementation: start with x = 1.0 and then add 1.0 repeatedly to x until the value of x stops changing. Another way is to play with float.exposed.)

  2. (1 pt) Somewhat confusingly, there’s a different 16-bit floating point format: bfloat16. It consists of 1 sign bit, 8 exponent bits, and 7 fraction bits. This stands in contrast to the half-precision floating point format (sometimes called binary16) that you implemented. What is the range of integers that are all exactly representable in bfloat16?

  3. (3 pts) Generalize your answers to questions 1 and 2 as follows. Given a floating point format with 1 sign bit, $E$ exponent bits, and $F$ fraction bits, what is the range of integers that are all exactly representable in this format?

  4. (1 pt) A common programmer mistake is to use floating point numbers to represent currency. E.g., a novice programmer might represent $10.25 as the floating point number $10.25$. To show why this is a bad idea, consider starting with a balance of $0.00 and then adding some fixed number of cents like $0.25 one hundred times. The balance should then be that number of dollars: $25.00 in the case of a 25¢ increment.

    Select an increment in cents (i.e., in the range $[1, 99]$) and show that adding it to an initial balance of $0.00 one hundred times does not give the correct number of dollars when the balance and increment are represented as single-precision floating point numbers (f32). (You may find this Rust playground helpful; see also the sketch after these questions.)

  5. (2 pts) Explain why the value you selected in question 4 led to this result with single-precision floating point numbers. Would moving to double-precision floating point numbers solve this problem? If yes, explain how it solves it. If no, explain why it doesn’t solve it.

  6. (2 pts) Give an alternative representation of currency one could use that would not suffer from the problem illustrated in question 4. You may assume that you don’t need to represent fractional cents. (E.g., $10.25 needs to be representable but $10.123 does not.)
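For question 4, a hypothetical harness along the following lines may be useful. The 1¢ increment is only an example; part of the exercise is choosing an increment that misbehaves, and some increments (such as 25¢) are exact in binary:

fn main() {
    let increment: f32 = 0.01; // 1 cent as an f32; try other values
    let mut balance: f32 = 0.0;
    for _ in 0..100 {
        balance += increment;
    }
    // Run this to see how far the accumulated balance drifts from the
    // exact answer of $1.00.
    println!("balance = {balance}, exact? {}", balance == 1.0);
}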

Add the README.md to your repository, commit, and push it.

At this point, you should have completed the required implementation and answered all the questions. Congratulations, you’ve completed the lab! Now go take a look at the optional implementation tasks.

Tests

Tests have been provided for your convenience in the tests directory. By default, running cargo test will run all of the (non-ignored) test functions found in the files in this directory. We can run the tests in a specific file with cargo test --test <test-file-name>.

$ cargo test --test representation
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/representation.rs (target/debug/deps/representation-790de78c03535650)

running 12 tests
test test_is_infinite ... ok
test test_is_nan ... ok
test test_is_normal ... ok
test test_is_sign_negative ... ok
test test_is_sign_positive ... ok
test test_is_subnormal ... ok
test test_is_zero ... ok
test test_to_f32 ... ok
test test_zero ... ok
test test_round_trip_through_parts ... ok
test test_all_to_f32 ... ok
test test_round_trip_through_f32 ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

We can also run a specific test in the file by supplying the name of the test itself.

$ cargo test --test add test_add_zero
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/add.rs (target/debug/deps/add-05adf8e33ec38607)

running 1 test
test test_add_zero ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.00s

The optional parts have more complicated testing requirements. Tests for the optional parts are ignored by default and you’ll have to follow the instructions on the next page to run them.

Hints

  • When implementing add, you’ll need to shift the significand of the number with the smaller exponent so that the two numbers have the same exponent. However, if the number of bits you need to shift by is greater than or equal to the number of bits in the datatype (16 for a u16, 32 for a u32), then you need to set the result to 0 explicitly. If you don’t, you’ll get a panic with a message similar to “attempt to shift right with overflow.”

  • When normalizing significands, it’s easy to figure out how far left or right you need to shift the significand by computing the number of leading zeros in the significand. To that end, you can use the u16::leading_zeros() method. If you’re using a 16-bit type for the significand, then a normalized significand should have $16 - 11 = 5$ leading zeros. If you’re using a 32-bit type, then there should be $32 - 11 = 21$ leading zeros.
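Two small illustrations of these hints (the values here are made up for demonstration):

// Guarded shift: shifting a u32 right by >= 32 bits panics in debug
// builds, so flush the result to zero explicitly instead.
let diff: u32 = 40; // e.g., a large exponent difference
let sig: u32 = 0b100_0011_0000;
let aligned = if diff >= 32 { 0 } else { sig >> diff };
assert_eq!(aligned, 0);

// leading_zeros() measures how far a significand is from normalized: a
// normalized 16-bit significand (hidden bit in bit 10) has 5 leading zeros.
let normalized: u16 = 0b0000_0100_0010_1100;
assert_eq!(normalized.leading_zeros(), 5);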

Submission

Submit the lab by committing your code and pushing it to your GitHub repository.