Part 1. Parsing input - CSCI 241 Systems Programming Labs

Part 1. Parsing input (40 points)

At a high level, your shell will (1) read a line of input from stdin; (2) parse the input into tokens; (3) construct a pipeline from the tokens; and (4) run the whole pipeline.

In this part, you will parse input into a sequence of tokens.

Initially, a token will be either a pipe character | or a word. For example, the input echo 'Why hello there!'| wc -l -c consists of the following tokens.

[Word("echo"), Word("Why    hello there!"), Pipe, Word("wc"), Word("-l"), Word("-c")]

Notice how the | turns into a pipe token rather than a word token. In contrast, if the pipe is quoted, it is a word as in this example echo '|'.

[Word("echo"), Word("|")]

This is a perfect use case for an enum.

Your task

Create a new binary project in the assignment repo using

$ cargo init --bin --vcs git --name osh .

Define the Result type in main.rs the same way you did for the past few labs.

Create a new file src/input.rs and add the input module to main.rs.

Inside input.rs, add a use line for Result.

use crate::Result;

Create a new, public enum called InputToken which you will use to represent either a word or a pipe.

Create a new function

#![allow(unused)]
fn main() {
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;
enum InputToken { Word(String), Pipe };
pub fn split_input(input: &str) -> Result<Vec<InputToken>> {
    todo!()
}
}

which on success will return a Vec of InputTokens.

To parse the input into tokens, you’ll want to iterate over the input string one character at a time. You can do this using the .chars() method which returns an iterator over characters.

What makes this tricky is that how you handle any given character depends on what other characters have been read. For example, a space character can separate two words or it could be part of a quoted word. Similarly, a pipe character might be part of a word (if it is quoted) or it might be a separator in the pipeline (see the two examples at the top of this page).

Your split_input() function should support unquoted words, single quoted words, and double quoted words or a mix. Here’s an example input line

echo unquoted_word 'single "quoted" word' "double 'quoted' word"  'mixing different 'quoting" styles"

and its parsing

[Word("echo"), Word("unquoted_word"), Word("single \"quoted\" word"), Word("double 'quoted' word"), Word("mixing different quoting styles")]

Note that the unquoted quotation marks are removed from each word. (The single quoted word containing a double quoted inner word is shown with a backslash before the inner quotes. This is an artifact of the Debug representation, not actual characters.)

Your function should return a custom error message if the input has an unclosed quote string (e.g., echo "hello). See the hint below for instructions on how to create a custom error.

Hint

To parse the input, I recommend creating a state machine which examines its input one character at a time. A state machine works by maintaining a current state and reading the input in a loop. For each character read, use the current state to determine what to do with the character and what state to move to next.

My implementation used the following four states and a mutable State variable to represent the current state of parsing (which is initially State::Blank).

#![allow(unused)]
fn main() {
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Blank,
    Unquoted,
    SingleQuoted,
    DoubleQuoted,
}
let mut state = State::Blank;
}

The Blank state indicates that the state machine is not currently parsing a word. Initially, the current state is Blank, and it will return to being Blank after reading an unquoted white space character (a blank, if you will).

The other three states indicate that the state machine is in the middle of parsing an unquoted word, a single quoted word, or a double quoted word.

This is a graphical representation of the state machine.

Each circle represents one of the four states. The arrows indicate how the state machine’s current state changes. For example, take the input foo 'bar'. The state begins at Blank. Upon reading the f, the state changes to Unquoted. It stays in Unquoted for the rest of the word foo, until the space when the state transitions back to Blank. Upon reading the ', it transitions to SingleQuoted where it stays until it reaches the last ' which causes it to transition to Unquoted.

Note that in the SingleQuoted state, a single quote character causes a transition to the Unquoted state. The DoubleQuoted state behaves similarly with a double quote. It doesn’t transition to the Blank state because a closing quote doesn’t end the word. For example, the input foo' 'bar is the same as 'foo bar': both of them produce a Word("foo bar") token.

The bulk of the state machine is implemented in a loop containing a match.

for ch in input.chars() {
    match (state, ch) {
        _ -> todo!()
    }
}

Rust will ensure that we add cases for every possible state and input character combination.

For example, here’s how my solution handles characters in the Blank state.

let mut current_word = String::new();
let mut result = Vec::new();
for ch in input.chars() {
    match (state, ch) {
        (State::Blank, '|') => result.push(InputToken::Pipe),

        (State::Blank, _) if ch.is_ascii_whitespace() => (),
        (State::Blank, '\'') => state = State::SingleQuoted,
        (State::Blank, '"') => state = State::DoubleQuoted,
        (State::Blank, _) => {
            current_word.push(ch);
            state = State::Unquoted;
        }
        // Handle the other states
    }
}

In the Blank state, if a | is read, it pushes an InputToken::Pipe token into a result vector.

If the character is ASCII whitespace, then it’s ignored. Notice that we can add conditions to the match patterns.

match (state, ch) {
    (State::Blank, _) if ch.is_ascii_whitespace() => (),
}

This will match if state is State::Blank and ch.is_ascii_whitespace() returns true. (The _ wildcard will match any character.)

If the character is a single or double quote, the state changes to the appropriate state.

In every other case, the character is pushed onto the current_word and changes the state to Unquoted.

The Unquoted state is similar to Blank except that a | or ASCII whitespace ends the current_word (so it should be pushed onto result). If it’s a |, then an InputToken::Pipe should also be pushed. The state should be set to Blank.

The SingleQuoted and DoubleQuoted are similar to each other. They should keep collecting characters in the word until the matching single or double quote character is read, in which case the state should be changed to Unquoted.

After all characters have been read, if the state is Unquoted, the current word should be pushed onto the result vector. If the state is one of the quoted states, return an error. Otherwise, return the vector of tokens.

Note that to return an error with a custom error message, you can use return Err("custom message here".into());.

For a complete example, see below.

Implement a main() function in main.rs that reads a line of input from stdin, removes the newline, calls split_input() on it, and prints the debug representation. You may find it easiest to do this with a main() function that calls a separate run() function, as you did in the last lab.

Write some unit tests in input.rs to test your code. You should write at least three additional tests, beyond what is provided here. Here are two sample unit tests to get you started.

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn simple_pipeline() {
        let input = "echo hello | wc -c";
        let tokens = split_input(input).unwrap();
        assert_eq!(
            tokens,
            vec![
                InputToken::Word("echo".to_string()),
                InputToken::Word("hello".to_string()),
                InputToken::Pipe,
                InputToken::Word("wc".to_string()),
                InputToken::Word("-c".to_string()),
            ]
        );
    }

    #[test]
    fn mixed_quotes() {
        // This is a "raw" string. It has delimiters r#" and "#.
        // Note that we can use " inside of it.
        let input = r#"'mixing different 'quoting" styles""#;
        let tokens = split_input(input).unwrap();
        assert_eq!(
            tokens,
            vec![
                InputToken::Word("mixing different quoting styles".to_string()),
            ]
        );
    }
}

Remember that you can add additional unit tests by adding functions to the test module. Each function that you want to run as a test needs to be annotated with #[test]. Note the use of raw strings in the second test.

To run all unit tests, run $ cargo test.

Example parsing

The table below shows how current_word and result change as a result of parsing the input echo 'one two' three" four"| tr a-z A-Z.

The first (body) row of the table indicates that the state machine starts in the Blank state and the first character is e. This causes the e to be pushed into current_word and result remains empty. The state changed to the Unquoted state which you can see in the second row.

The second row indicates that the state machine is in the Unquoted state and the next letter is c. This causes the c to be pushed into current_word and result remains empty. The state remains Unquoted.

The final row of the table shows that after reading all of the input characters, the state machine is in the Unquoted state and the current_word from the row above has been pushed into the result vector.

State	ch	current_word	result
`Blank`	`'e'`	`"e"`	`[]`
`Unquoted`	`'c'`	`"ec"`	`[]`
`Unquoted`	`'h'`	`"ech"`	`[]`
`Unquoted`	`'o'`	`"echo"`	`[]`
`Unquoted`	`' '`	`""`	`[Word("echo")]`
`Blank`	`'\''`	`""`	`[Word("echo")]`
`SingleQuoted`	`'o'`	`"o"`	`[Word("echo")]`
`SingleQuoted`	`'n'`	`"on"`	`[Word("echo")]`
`SingleQuoted`	`'e'`	`"one"`	`[Word("echo")]`
`SingleQuoted`	`' '`	`"one "`	`[Word("echo")]`
`SingleQuoted`	`'t'`	`"one t"`	`[Word("echo")]`
`SingleQuoted`	`'w'`	`"one tw"`	`[Word("echo")]`
`SingleQuoted`	`'o'`	`"one two"`	`[Word("echo")]`
`SingleQuoted`	`'\''`	`"one two"`	`[Word("echo")]`
`Unquoted`	`' '`	`""`	`[Word("echo"), Word("one two")]`
`Blank`	`'t'`	`"t"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`'h'`	`"th"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`'r'`	`"thr"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`'e'`	`"thre"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`'e'`	`"three"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`'"'`	`"three"`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`' '`	`"three "`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`'f'`	`"three f"`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`'o'`	`"three fo"`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`'u'`	`"three fou"`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`'r'`	`"three four"`	`[Word("echo"), Word("one two")]`
`DoubleQuoted`	`'"'`	`"three four"`	`[Word("echo"), Word("one two")]`
`Unquoted`	`' '`	`""`	`[Word("echo"), Word("one two"), Word("three four")]`
`Blank`	`'\|'`	`""`	`[Word("echo"), Word("one two"), Word("three four"), Pipe]`
`Blank`	`' '`	`""`	`[Word("echo"), Word("one two"), Word("three four"), Pipe]`
`Blank`	`'t'`	`"t"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe]`
`Unquoted`	`'r'`	`"tr"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe]`
`Unquoted`	`' '`	`""`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")]`
`Blank`	`'a'`	`"a"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")]`
`Unquoted`	`'-'`	`"a-"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")]`
`Unquoted`	`'z'`	`"a-z"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr")]`
`Unquoted`	`' '`	`""`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")]`
`Blank`	`'A'`	`"A"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")]`
`Unquoted`	`'-'`	`"A-"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")]`
`Unquoted`	`'Z'`	`"A-Z"`	`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z")]`
`Unquoted`			`[Word("echo"), Word("one two"), Word("three four"), Pipe, Word("tr"), Word("a-z"), Word("A-Z")]`

Once you finish the rest of the lab, this input should print out ONE TWO THREE FOUR