Lab 8. Floating Point
Due: Sunday, April 27 at 23:59
In this lab, you will implement part of the IEEE 754 Standard for Floating-Point Arithmetic. In particular, you will implement half-precision (16-bit) floating point addition.
This lab has opportunities for extra credit.
In addition to the implementation, there are questions to answer.
Preliminaries
You may work on this lab in groups of up to two people.
Click on the assignment link.
Once you have accepted the assignment, you can clone the repository on your computer by following the instructions and begin working. If you’re working with a partner, the first person to accept the repository will create the team name. The second person will select which team to join. DO NOT JOIN THE WRONG TEAM.
Be sure to ask any questions on Ed.
Background
Half-precision floating point numbers are similar to the 32-bit single-precision and 64-bit double-precision floating point numbers we talked about in class. Click on the link in the previous sentence to read more about half-precision floating point numbers.

Just like single- and double-precision floating point numbers, half-precision floating point numbers are represented with three fields: one sign bit, five bits of biased exponent, and 10 fraction bits. The half-precision format represents 5 categories of numbers based on the values stored in the exponent and fraction fields. These categories are described in the table below, with $s$ being the value of the sign bit, $e$ the value of the (biased) exponent field, and $f$ the value of the fraction field.
| Category | Exponent | Fraction | Represented number |
|---|---|---|---|
| Normal | $1 \le e \le 30$ | anything | $(-1)^s \times 1.f \times 2^{e-15}$ |
| Subnormal | $0$ | nonzero | $(-1)^s \times 0.f \times 2^{-14}$ |
| Zero | $0$ | $0$ | $(-1)^s \times 0$ |
| Infinity | $31$ | $0$ | $(-1)^s \times \infty$ |
| NaN | $31$ | nonzero | Not a Number |
As you can see from the table above, normal numbers are stored with an exponent bias of 15. For example, the number $-37.25 = -1.0010101_2 \times 2^5$, so it would be represented with $s = 1$, $e = 5 + 15 = 20 = 10100_2$, and $f = 0010101000_2$, yielding the binary representation `1 10100 0010101000`. Click here to play around with other values.
The smallest positive normal number has representation `0 00001 0000000000`, which corresponds to the value $1.0_2 \times 2^{1-15} = 2^{-14}$. Smaller positive values are represented as subnormal numbers. As you can see from the table above, these are represented with an exponent field of 0 and a nonzero fraction field, and the number they represent does not have a “hidden bit.” Instead, the entire significand is stored in the fraction field, which is why the “Represented number” column shows $0.f \times 2^{-14}$ whereas the normal numbers have $1.f \times 2^{e-15}$.
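To make the layout concrete, here is a small standalone sketch (not part of the starter code) that pulls the three fields out of the example above using shifts and masks; the variable names are illustrative.

```rust
fn main() {
    let bits: u16 = 0b1_10100_0010101000; // the example above, -37.25

    let sign = (bits >> 15) & 0x1;      // 1 sign bit
    let exponent = (bits >> 10) & 0x1f; // 5 exponent bits, biased by 15
    let fraction = bits & 0x3ff;        // 10 fraction bits

    // For a normal number like this one, the represented value is
    // (-1)^sign * 1.fraction * 2^(exponent - 15).
    println!("sign = {sign}, biased exponent = {exponent}, fraction = {fraction:#012b}");
}
```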
Program Specification
In the assignment repo, you’ll find a Rust project containing a single source file, `src/lib.rs`, and several testing files in the `tests` directory.

Inside `src/lib.rs`, you’ll find a Rust struct named `F16` which contains a single `u16` field. Each `F16` holds a single half-precision floating point value as a 16-bit unsigned integer.
To create an `F16` with a given bit representation, you can write `F16(representation)`. For example, to create an `F16` that represents $-0$, write `F16(0b1_00000_0000000000)`. (Rust lets you separate your numbers using `_` as in that example. You could equivalently write this as `F16(0b1000000000000000)`. Of course, you could also specify it in hex: `F16(0x8000)`.)
Several functions are provided for you: `F16::from_f32()` and `F16::to_f32()` convert from and to `f32`, Rust’s single-precision floating point number type. You must not use these functions in your code. They are provided to support testing and printing. You may print an `x: F16` normally, `println!("{x}")`, which converts `x` to an `f32` and prints that, or you may print it using the “debug” representation, `println!("{x:?}")`, which will print in scientific (binary) notation.

You must use bit masking and shifting to manipulate bits. In particular, you are not allowed to convert your integers to strings.
Number representation
Your first task is to implement all of the functions in `src/lib.rs` that contain `todo!("Implement me")`. These functions are described briefly below, with more comments appearing in `src/lib.rs`.
- (3 pts) `fn from_parts(negative: bool, exponent: u16, fraction: u16) -> Self` constructs an `F16` from the three fields: sign, exponent, and fraction. You should construct the 16-bit representation of the number and return it as an `F16`. For normal numbers, the `exponent` argument will already be biased.
- (3 pts) `fn to_parts(self) -> (bool, u16, u16)` returns a triple containing the sign bit (as a `bool`), the (biased) exponent, and the fraction bits. This is the counterpart to `F16::from_parts()` described above.
You should start by implementing these two functions; all of the other functions you write will use one or both of them. After you write these, run the tests to make sure the `test_round_trip_through_parts` test passes. (See below for how to run the tests.)
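If it helps to see the shape of the bit manipulation involved, here is a rough sketch of what these two functions might look like. It assumes `F16` is a tuple struct whose `u16` field is accessible as `self.0`; it is not a drop-in solution, so check the comments in `src/lib.rs` for the exact invariants.

```rust
// Inside `impl F16 { ... }`. A sketch only: assumes `exponent` is already
// biased and `fraction` fits in 10 bits.
fn from_parts(negative: bool, exponent: u16, fraction: u16) -> Self {
    let sign = if negative { 1u16 } else { 0 };
    // Pack sign into bit 15, exponent into bits 14..=10, fraction into bits 9..=0.
    F16((sign << 15) | ((exponent & 0x1f) << 10) | (fraction & 0x3ff))
}

fn to_parts(self) -> (bool, u16, u16) {
    let bits = self.0;
    // Unpack the same three fields.
    ((bits >> 15) & 1 == 1, (bits >> 10) & 0x1f, bits & 0x3ff)
}
```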
- (1 pt) `fn is_zero(self) -> bool` returns `true` if `self` represents $\pm 0$.
- (1 pt) `fn is_sign_positive(self) -> bool` returns `true` if `self`’s sign bit is 0.
- (1 pt) `fn is_sign_negative(self) -> bool` returns `true` if `self`’s sign bit is 1.
- (1 pt) `fn is_infinite(self) -> bool` returns `true` if `self` represents $\pm\infty$.
- (1 pt) `fn is_nan(self) -> bool` returns `true` if `self` represents `NaN`.
- (1 pt) `fn is_normal(self) -> bool` returns `true` if `self` represents a normal number.
- (1 pt) `fn is_subnormal(self) -> bool` returns `true` if `self` represents a subnormal number.
For each of the preceding functions, you should use `self.to_parts()` to determine whether the function should return `true` or `false`.
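For example, a predicate might look like the following sketch, which reads the Zero row of the table above straight off of `to_parts()` (illustrative only):

```rust
// Inside `impl F16 { ... }`. Zero is the category with exponent field 0
// and fraction field 0, regardless of the sign bit.
fn is_zero(self) -> bool {
    let (_sign, exponent, fraction) = self.to_parts();
    exponent == 0 && fraction == 0
}
```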
After you implement these functions, all of the tests in `tests/representation.rs` should pass. (See below for how to run the tests.)

Once all of the tests in `tests/representation.rs` pass, you can move on to implementing floating point addition. The first part, described below, is required. The remaining implementation parts, described on the next page, are optional and worth extra credit.
Implementing addition for zero and normal numbers
The `fn add(self, rhs: Self) -> Self::Output` function takes two `F16` values, `self` and `rhs` (right-hand side), which correspond to the arguments to `+` in `self + rhs`. For ease of reference, the starter code assigns these values to variables `x` and `y`, and this description will refer to `x` and `y`. The `add` function needs to return an `F16` that corresponds to the sum of `x` and `y`.
Remember, you are not allowed to convert `x` and `y` to `f32` and add them that way. You must implement all of the floating point addition using integer operations only.
For this required portion, you only have to support adding normal floating point numbers or zeros (positive or negative). Your code will only be tested by adding normal numbers (or 0) together in cases where the result is a normal number (or 0).
You should implement the following algorithm.
- If either `x` or `y` is 0 (you should use your `is_zero()` function), return the other value.
- Otherwise, `x` and `y` are both normal numbers which should be added together, and the result will be a normal number or zero.
  - Use `x.to_parts()` and `y.to_parts()` to get the 3 fields of each number and unbias the exponents.
  - Construct the significands for `x` and `y` by ORing the `HIDDEN_BIT` (a constant provided for you) with the corresponding fraction fields.
  - Shift the significand of whichever of `x` or `y` has the smaller exponent to the right by the difference of the two exponents. At this point, we have two denormalized numbers which have the same (larger) exponent. (A sketch of this step appears after this list.)
  - Give the significands the sign of their corresponding floating point value. I.e., if `x` is negative, then make `x`’s significand negative to perform the addition. Similarly for `y`. The result of this addition is the significand of the final result (including its sign).
  - If the result’s significand is negative, remember this fact and make it positive.
  - If the result’s significand is zero, return $0$.
  - Otherwise, normalize the result by shifting the significand left or right as required such that there is a 1 in the position of the `HIDDEN_BIT` and only 0s in the more significant bits. Remember to update the exponent when you shift.
  - Return the resulting `F16` via `F16::from_parts()`, passing the sign flag, the exponent (after biasing), and the fraction bits (which you get by masking off the hidden bit using a bit operation).
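The shift-and-sign portion of this algorithm is the fiddliest part, so here is a minimal sketch of it. It assumes the exponents have already been unbiased into `i32` values and that the significands (with `HIDDEN_BIT` ORed in) are also held as `i32`; the function name and signature are illustrative, not from the starter code.

```rust
// Illustrative sketch only. x_exp/y_exp are unbiased exponents; x_sig/y_sig
// are significands with the hidden bit ORed in.
fn align_and_add(x_sign: bool, x_exp: i32, mut x_sig: i32,
                 y_sign: bool, y_exp: i32, mut y_sig: i32) -> (i32, i32) {
    // Shift the significand of the number with the smaller exponent right by
    // the exponent difference. Shifting an i32 by >= 32 bits panics, so
    // saturate to 0 in that case (see the Hints section).
    let shift = (x_exp - y_exp).unsigned_abs();
    if x_exp < y_exp {
        x_sig = if shift >= 32 { 0 } else { x_sig >> shift };
    } else {
        y_sig = if shift >= 32 { 0 } else { y_sig >> shift };
    }
    // Both numbers now share the larger exponent.
    let z_exp = x_exp.max(y_exp);
    // Give each significand the sign of its floating point value, then add.
    let z_sig = (if x_sign { -x_sig } else { x_sig })
        + (if y_sign { -y_sig } else { y_sig });
    (z_exp, z_sig)
}
```

From here, the remaining steps (recording the sign, normalizing, and re-biasing) operate on `z_exp` and `z_sig`, as the worked examples below illustrate.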
Let’s add and . After extracting the fields, unbiasing
the exponents, and ORing the HIDDEN_BIT
, we have
x_sign = false
x_exp = 5
x_sig = 0b00000100_00110000
y_sign = true
y_exp = -3
y_sig = 0b00000100_00000000
Since `y_exp < x_exp`, we shift `y_sig` right by $5 - (-3) = 8$, giving

```
x_sign = false
x_exp = 5
x_sig = 0b00000100_00110000
y_sign = true
y_exp = 5 (note that this was incremented by 8 because we shifted right by 8)
y_sig = 0b00000000_00000100
```
Next, we add `x_sig` and `-y_sig` (it’s negative because `y_sign = true`), giving

```
z_exp = 5
z_sig = 0b00000100_00101100
```
Since `z_sig > 0`, we set `z_sign = false` and leave `z_sig` alone.

Next, we need to shift `z_sig` left or right until it has a 1 in the position of the `HIDDEN_BIT` and 0s in the more significant bits. In this case, we lucked out and `z_sig` already satisfies that. Therefore, the final result is `0b0_10100_0000101100`.
Let’s add and .
x_sign = true
x_exp = 3
x_sig = 0b00000110_01101000
y_sign = false
y_exp = 3
y_sig = 0b00000110_01100000
Since `x_exp = y_exp`, we don’t need to shift either.

Next, we add `-x_sig` and `y_sig`, giving

```
z_exp = 3
z_sig = 0b11111111_11111000
```
Since `z_sig < 0`, we set

```
z_sign = true
z_exp = 3
z_sig = 0b00000000_00001000
```
Now we need to shift `z_sig` left by 7 to put the leading 1 in the correct position:

```
z_sign = true
z_exp = -4
z_sig = 0b00000100_00000000
```

Therefore, the final result is `0b1_01011_0000000000`.
Once you finish this part, all of the tests in `tests/add.rs` should pass. (See below for how to run the tests.)
Questions
Create a `README.md` with both partners’ names and answer the following questions.
- (1 pt) We saw in class that single-precision floating point numbers can exactly represent all integers in the range $[-2^{24}, 2^{24}]$ but no larger range, meaning that $2^{24} + 1$ is not representable as a single-precision floating point number. Try that out for yourself here. Double-precision floating point numbers can exactly represent all integers in the range $[-2^{53}, 2^{53}]$ but no larger range. (Note that this doesn’t mean that larger integers aren’t representable. For example, $2^{24} + 2$ is representable in single-precision floating point even though $2^{24} + 1$ is not.)

  What is the range of integers that are all exactly representable in half-precision floating point? (There are several ways to answer this question. One way is to use your `F16` implementation: start with `x = 1.0` and then add `1.0` repeatedly to `x` until the value of `x` stops changing; a sketch of this loop appears after these questions. Another way is to play with float.exposed.)

- (1 pt) Somewhat confusingly, there’s a different 16-bit floating point format: bfloat16. It consists of 1 sign bit, 8 exponent bits, and 7 fraction bits. This stands in contrast to the half-precision floating point format (sometimes called binary16) that you implemented. What is the range of integers that are all exactly representable in `bfloat16`?

- (3 pts) Generalize your answers to questions 1 and 2 as follows. Given a floating point format with 1 sign bit, $e$ exponent bits, and $f$ fraction bits, what is the range of integers that are all exactly representable in this format?
- (1 pt) A common programmer mistake is to use floating point numbers to represent currency. E.g., a novice programmer might represent $10.25 as the floating point number 10.25. To show why this is a bad idea, consider starting with a balance of $0.00 and then adding some fixed number of cents, like $0.25, one hundred times. The balance should end up being that number of dollars, $25.00 in the case of a 25¢ increment.

  Select an increment in cents (i.e., in the range $[1, 99]$) and show that adding it to an initial balance of $0.00 one hundred times does not give the correct number of dollars when the balance and increment are represented as single-precision floating point numbers (`f32`). (You may find this Rust playground helpful.)

- (2 pts) Explain why the value you selected in question 4 led to this result with single-precision floating point numbers. Would moving to double-precision floating point numbers solve this problem? If yes, explain how it solves it. If no, explain why it doesn’t solve it.

- (2 pts) Give an alternative representation of currency one could use that would not suffer from the problem illustrated in question 4. You may assume that you don’t need to represent fractional cents. (E.g., $10.25 needs to be representable but $10.123 does not.)
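For question 1, one way to run the experiment described above is a loop like the following sketch. It assumes your `F16` is `Copy` and implements `Add` (the required part above), and it leans on the provided conversion functions, which is fine here since this is measurement code rather than your implementation:

```rust
// Hypothetical probe: repeatedly add 1.0 until the value stops changing.
fn main() {
    let one = F16::from_f32(1.0);
    let mut x = F16::from_f32(1.0);
    loop {
        let next = x + one;
        // Compare via the provided to_f32() in case F16 doesn't implement PartialEq.
        if next.to_f32() == x.to_f32() {
            break;
        }
        x = next;
    }
    println!("adding 1.0 stops changing at {x}");
}
```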
Add the `README.md` to your repository, commit, and push it.
At this point, you should have completed the required implementation and answered all the questions. Congratulations, you’ve completed the lab! Now go take a look at the optional implementation tasks.
Tests
Tests have been provided for your convenience in the `tests` directory. By default, running `cargo test` will run all of the (non-ignored) test functions found in the files in this directory. We can run the tests in a specific file with `cargo test --test <test-file-name>`.
```
$ cargo test --test representation
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/representation.rs (target/debug/deps/representation-790de78c03535650)

running 12 tests
test test_is_infinite ... ok
test test_is_nan ... ok
test test_is_normal ... ok
test test_is_sign_negative ... ok
test test_is_sign_positive ... ok
test test_is_subnormal ... ok
test test_is_zero ... ok
test test_to_f32 ... ok
test test_zero ... ok
test test_round_trip_through_parts ... ok
test test_all_to_f32 ... ok
test test_round_trip_through_f32 ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
```
We can also run a specific test in the file by supplying the name of the test itself.
```
$ cargo test --test add test_add_zero
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running tests/add.rs (target/debug/deps/add-05adf8e33ec38607)

running 1 test
test test_add_zero ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.00s
```
The optional parts have more complicated testing requirements. Tests for the optional parts are ignored by default and you’ll have to follow the instructions on the next page to run them.
Hints
- When implementing `add`, you’ll need to shift the significand of the number with the smaller magnitude so that the two numbers have the same exponent. However, if the number of bits you need to shift by is greater than or equal to the number of bits in the datatype (16 for a `u16`, 32 for a `u32`), then you need to set the result to 0 explicitly. If you don’t, you’ll get a panic with a message similar to “attempt to shift right with overflow.”
- When normalizing significands, it’s easy to figure out how far left or right you need to shift the significand by computing the number of leading zeros in the significand. To that end, you can use the `u16::leading_zeros()` method. If you’re using a 16-bit type for the significand, then there should be $5$ leading zeros after normalization. If you’re using a 32-bit type, then there should be $21$ leading zeros.
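As a concrete illustration of the second hint, the normalization shift might look like this sketch. It assumes a 16-bit significand (`z_sig: u16`, known to be nonzero at this point) whose hidden bit belongs at bit 10, and an `i32` exponent `z_exp`; the names are illustrative.

```rust
// A normalized 16-bit significand has its top set bit at position 10,
// i.e., exactly 5 leading zeros.
let lz = z_sig.leading_zeros();
if lz < 5 {
    z_sig >>= 5 - lz;         // significand too big: shift right...
    z_exp += (5 - lz) as i32; // ...and increase the exponent to compensate
} else {
    z_sig <<= lz - 5;         // significand too small: shift left...
    z_exp -= (lz - 5) as i32; // ...and decrease the exponent to compensate
}
```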
Submission
Submit the lab by committing your code and pushing it to your GitHub repository.