Part 1. Parsing input (40 points)
At a high level, your shell will (1) read a line of input from stdin; (2) parse the input into tokens; (3) construct a pipeline from the tokens; and (4) run the whole pipeline.
In this part, you will parse input into a sequence of tokens.
Initially, a token will be either a pipe character |
or a word. For example,
the input echo 'Why hello there!'| wc -l -c
consists of the following
tokens.
[Word("echo"), Word("Why hello there!"), Pipe, Word("wc"), Word("-l"), Word("-c")]
Notice how the |
turns into a pipe token rather than a word token. In
contrast, if the pipe is quoted, it is a word as in this example echo '|'
.
[Word("echo"), Word("|")]
This is a perfect use case for an enum.
Your task
Create a new binary project in the assignment repo using
$ cargo init --bin --vcs git --name osh .
Define the Result
type in main.rs
the same way you did for the past few labs.
Create a new file src/input.rs
and add the input
module to main.rs
.
Inside input.rs
, add a use
line for Result
.
use crate::Result;
Create a new, public enum called InputToken
which you will use to represent
either a word or a pipe.
Create a new function
#![allow(unused)] fn main() { type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>; enum InputToken { Word(String), Pipe }; pub fn split_input(input: &str) -> Result<Vec<InputToken>> { todo!() } }
which on success will return a Vec
of InputToken
s.
To parse the input into tokens, you’ll want to iterate over the input string
one character at a time which you can do using the .chars()
method which
returns an iterator over characters.
What makes this tricky is that how you handle any given character depends on what other characters have been read. For example, a space character can separate two words or it could be part of a quoted word. Similarly, a pipe character might be part of a word (if it is quoted) or it might be a separator in the pipeline (see the two examples at the top of this page).
Your split_input()
function should support unquoted words, single quoted words, and
double quoted words or a mix. Here’s an example input line
echo unquoted_word 'single "quoted" word' "double 'quoted' word" 'mixing different 'quoting" styles"
and its parsing
[Word("echo"), Word("unquoted_word"), Word("single \"quoted\" word"), Word("double 'quoted' word"), Word("mixing different quoting styles")]
Note that the unquoted quotation marks are removed from each word. (The single
quoted word containing a double quoted inner word is shown with a backslash
before the inner quotes. This is an artifact of the Debug
representation,
not actual characters.)
To parse the input, I recommend creating a state machine which examines its input one character at a time. A state machine works by maintaining a current state and reading the input in a loop. For each character read, use the state to determine what to do with the character and what state to move to next.
My implementation used the following four states and a mutable State
variable which is initially State::Blank
and represents the current state of
parsing.
#![allow(unused)] fn main() { #[derive(Clone, Copy, Debug, PartialEq, Eq)] enum State { Blank, Unquoted, SingleQuoted, DoubleQuoted, } let mut state = State::Blank; }
The Blank
state indicates that the state machine is not currently parsing a
word. Initially, the current state is Blank
, but after reading an unquoted
white space character (a blank, if you will), the state will change back to
Blank
.
The other three states indicate that the state machine is in the middle of parsing an unquoted word, a single quoted word, or a double quoted word.
This is a graphical representation of the state machine.
Each circle represents one of the four states. The arrows indicate how the
state machine’s current state changes between them. For example,
starting in the Blank
state parsing the input foo 'bar'
, the state changes
to Unquoted
upon reading the f
. It stays in Unquoted
until the space
when it transitions back to Blank
. Upon reading the '
, it transitions to
SingleQuoted
where it stays until it reaches the last '
which causes it to
transition to Unquoted
.
Note that in state SingleQuoted
, a single quote character causes a transition to state Unquoted
. The DoubleQuoted
state behaves similarly with a double quote. It doesn’t transition to the Blank
state because a closing quote doesn’t end the word. For example, the input foo' 'bar
is the same as 'foo bar'
: both of them produce
a Word("foo bar")
token.
The bulk of the state machine is implemented in a loop
containing a match
.
for ch in input.chars() {
match (state, ch) {
_ -> todo!()
}
}
Rust will ensure that we add cases for every possible state and input character combination.
For example, here’s how my solution handles characters in the Blank
state.
let mut current_word = String::new();
let mut result = Vec::new();
for ch in input.chars() {
match (state, ch) {
(State::Blank, '|') => result.push(InputToken::Pipe),
(State::Blank, _) if ch.is_ascii_whitespace() => (),
(State::Blank, '\'') => state = State::SingleQuoted,
(State::Blank, '"') => state = State::DoubleQuoted,
(State::Blank, _) => {
current_word.push(ch);
state = State::Unquoted;
}
// Handle the other states
}
}
In the Blank
state, if a |
is read, it pushes an InputToken::Pipe
token
into a result
vector.
If the character is ASCII whitespace, then it’s ignored. Notice that we can add conditions to the match patterns.
match (state, ch) {
(State::Blank, _) if ch.is_ascii_whitespace() => (),
}
This will match if state
is State::Blank
and ch.is_ascii_whitespace()
returns true
. (The _
wildcard will match any character.)
If the character is a single or double quote, the state changes to the appropriate state.
In every other case, the character is pushed onto the current_word
and
changes the state to Unquoted
.
The Unquoted
state is similar to Blank
except that a |
or ASCII
whitespace ends the current_word
(so it should be pushed onto result
). If
it’s a |
, then an InputToken::Pipe
should also be pushed. The state should be
set to Blank
.
The SingleQuoted
and DoubleQuoted
are similar to each other. They should
keep collecting characters in the word until the matching single or double
quote character is read in which case the state should be changed to
Unquoted
.
After all characters have been read, if the state is Unquoted
, the current
word should be pushed onto the result vector. If the state is one of the
quoted states, return an error. Otherwise, return the vector of tokens.
For a complete example, see below.
Implement a main()
function in main.rs
that reads a line of input from
stdin
, removes the newline, calls split_input()
on it, and prints the
debug representation.
Write some unit tests in input.rs
to test your code. Here are two sample
unit tests to get you started.
#[cfg(test)]
mod test {
use super::*;
#[test]
fn simple_pipeline() {
let input = "echo hello | wc -c";
let tokens = split_input(input).unwrap();
assert_eq!(
tokens,
vec![
InputToken::Word("echo".to_string()),
InputToken::Word("hello".to_string()),
InputToken::Pipe,
InputToken::Word("wc".to_string()),
InputToken::Word("-c".to_string()),
]
);
}
#[test]
fn mixed_quotes() {
// This is a "raw" string. It has delimiters r#" and "#.
// Note that we can use " inside of it.
let input = r#"'mixing different 'quoting" styles""#;
let tokens = split_input(input).unwrap();
assert_eq!(
tokens,
vec![
InputToken::Word("mixing different quoting styles".to_string()),
]
);
}
}
Remember that you can add additional unit tests by adding functions to the
test
module. Each function that you want to run as a test needs to be
annotated with #[test]
. Note the use of raw
strings
in the second test.
To run all unit tests, run $ cargo test
.
Example parsing
The table below shows how current_word
and result
change as a result of
parsing the input echo 'one two' three" four"| tr a-z A-Z
.
The first (body) row of the table indicates that the state machine starts in
the Blank
state and the first character is e
. This causes the e
to be
pushed into current_word
and result
remains empty. The state changed to
the Unquoted
state which you can see in the second row.
The second row indicates that the state machine is in the Unquoted
state
and the next letter is c
. This causes the c
to be pushed into
current_word
and result
remains empty. The state remains Unquoted
.
The final row of the table shows that after reading all of the input
characters, the state machine is in the Unquoted
state and the
current_word
from the row above has been pushed into the result
vector.
State | ch | current_word | result |
---|---|---|---|
Blank | 'e' | "e" | [] |
Unquoted | 'c' | "ec" | [] |
Unquoted | 'h' | "ech" | [] |
Unquoted | 'o' | "echo" | [] |
Unquoted | ' ' | "" | [Word("echo")] |
Blank | '\'' | "" | [Word("echo")] |
SingleQuoted | 'o' | "o" | [Word("echo")] |
SingleQuoted | 'n' | "on" | [Word("echo")] |
SingleQuoted | 'e' | "one" | [Word("echo")] |
SingleQuoted | ' ' | "one " | [Word("echo")] |
SingleQuoted | 't' | "one t" | [Word("echo")] |
SingleQuoted | 'w' | "one tw" | [Word("echo")] |
SingleQuoted | 'o' | "one two" | [Word("echo")] |
SingleQuoted | '\'' | "one two" | [Word("echo")] |
Unquoted | ' ' | "" | [Word("echo"), Word("one two")] |
Blank | 't' | "t" | [Word("echo"), Word("one two")] |
Unquoted | 'h' | "th" | [Word("echo"), Word("one two")] |
Unquoted | 'r' | "thr" | [Word("echo"), Word("one two")] |
Unquoted | 'e' | "thre" | [Word("echo"), Word("one two")] |
Unquoted | 'e' | "three" | [Word("echo"), Word("one two")] |
Unquoted | '"' | "three" | [Word("echo"), Word("one two")] |
DoubleQuoted | ' ' | "three " | [Word("echo"), Word("one two")] |
DoubleQuoted | 'f' | "three f" | [Word("echo"), Word("one two")] |
DoubleQuoted | 'o' | "three fo" | [Word("echo"), Word("one two")] |
DoubleQuoted | 'u' | "three fou" | [Word("echo"), Word("one two")] |
DoubleQuoted | 'r' | "three four" | [Word("echo"), Word("one two")] |
DoubleQuoted | '"' | "three four" | [Word("echo"), Word("one two")] |
Unquoted | ' ' | "" | [Word("echo"), Word("one two"), Word("three four")] |
Blank | '|' | "" | [Word("echo"), Word("one two"), Word("three four"), Pipe] |
Blank | ' ' | "" | [Word("echo"), Word("one two"), Word("three four"), Pipe] |
Blank | 't' | "t" | [Word("echo"), Word("one two"), Word("three four"), Pipe] |
Unquoted | 'r' | "tr" | [Word("echo"), Word("one two"), Word("three four"), Pipe] |
Unquoted | ' ' | "" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")] |
Blank | 'a' | "a" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")] |
Unquoted | '-' | "a-" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")] |
Unquoted | 'z' | "a-z" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")] |
Unquoted | ' ' | "" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")] |
Blank | 'A' | "A" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")] |
Unquoted | '-' | "A-" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")] |
Unquoted | 'Z' | "A-Z" | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")] |
Unquoted | [Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z"), Word("A-Z")] |
Once you finish the rest of the lab, this input should print out ONE TWO THREE FOUR