Part 2. Reading the index - CSCI 241 Systems Programming Labs

Part 2. Reading the index (5 points)

In this part of the lab, you’re going to read in the index file and create an instance of an Entry structure for each entry in the index.

There are many common formats for exchanging data. One of the simplest is tab-separated values (TSV). In this format, each line in the file represents a record with multiple fields, and each field is separated from the previous one by a tab character. (In code, we write a tab character using the escape sequence \t, just as \n is a newline character.)

For example,

first field	second field	third field
XXX	YYY	ZZZ

shows a file with two records, each of which has three fields.
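
To make the format concrete, here is a minimal sketch that splits tab-separated text into records and fields. The function name parse_tsv is made up for this illustration and isn't part of the lab.

```rust
/// Split tab-separated text into records (one per line) and fields.
/// `parse_tsv` is a hypothetical helper used only for this illustration.
fn parse_tsv(text: &str) -> Vec<Vec<&str>> {
    text.lines()
        .map(|line| line.split('\t').collect())
        .collect()
}

fn main() {
    let text = "first field\tsecond field\tthird field\nXXX\tYYY\tZZZ\n";
    let records = parse_tsv(text);
    println!("{} records", records.len());
    for record in &records {
        println!("{} fields: {:?}", record.len(), record);
    }
}
```

Note that two adjacent tabs produce an empty string for the field between them, which matters later when a field such as the author is missing.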

The Project Gutenberg index is a tab-separated values file. Each line of the index file has the format

authors<tab>title<tab>URL<tab>resource 1<tab>...<tab>resource n

where <tab> stands for a tab character. The first field is the authors, the second is the title, and the third is the URL for the entry on Project Gutenberg’s website. The remaining fields represent individual resources associated with the entry (resources include .epub files, text files, and audio files).

Here’s one example:

Roosevelt, Wyn; Schneider, S.	The Frontier Boys in the Sierras; Or, The Lost Mine	https://www.gutenberg.org/ebooks/32253	text https://www.gutenberg.org/ebooks/32253.html.images	text https://www.gutenberg.org/files/32253/32253-h/32253-h.htm	text https://www.gutenberg.org/files/32253/32253-h.zip	epub https://www.gutenberg.org/ebooks/32253.epub3.images	epub https://www.gutenberg.org/ebooks/32253.epub.images	epub https://www.gutenberg.org/ebooks/32253.epub.noimages	text https://www.gutenberg.org/ebooks/32253.txt.utf-8	text https://www.gutenberg.org/files/32253/32253-8.txt	text https://www.gutenberg.org/files/32253/32253-8.zip	text https://www.gutenberg.org/files/32253/32253.txt	text https://www.gutenberg.org/files/32253/32253.zip

Example

Given a line of text from the index file, it’s easy to split it up into its parts by splitting the string on a tab character.

fn main() {
let line = "Roosevelt, Wyn; Schneider, S.	The Frontier Boys in the Sierras; Or, The Lost Mine	https://www.gutenberg.org/ebooks/32253	text https://www.gutenberg.org/ebooks/32253.html.images	text https://www.gutenberg.org/files/32253/32253-h/32253-h.htm	text https://www.gutenberg.org/files/32253/32253-h.zip	epub https://www.gutenberg.org/ebooks/32253.epub3.images	epub https://www.gutenberg.org/ebooks/32253.epub.images	epub https://www.gutenberg.org/ebooks/32253.epub.noimages	text https://www.gutenberg.org/ebooks/32253.txt.utf-8	text https://www.gutenberg.org/files/32253/32253-8.txt	text https://www.gutenberg.org/files/32253/32253-8.zip	text https://www.gutenberg.org/files/32253/32253.txt	text https://www.gutenberg.org/files/32253/32253.zip";
// Assuming we're reading through the file a line at a time and `line`
// holds the current line of text, we can split the line into parts.
let parts: Vec<&str> = line.split('\t').collect();
println!("Author: {}", parts[0]);
println!("Title: {}", parts[1]);
println!("URL: {}", parts[2]);
println!("{} resources", parts.len() - 3);
}

Click Run.

You’re going to represent each entry using an instance of an Entry structure. So you’re going to have to define the structure.

Each entry has an author field, a title field, a URL field, and zero or more resources associated with it. The author field can be empty, which means that Project Gutenberg hasn’t recorded the author or the author is unknown. You’ll have to handle this case. Let’s ignore the resources for the moment; we’ll return to them in a later part.

Since the author field can be empty, let’s model it with an Option<String>, where None represents the no-author case and Some(authors) is the case where there are authors. The other fields should not be empty. This suggests that we define our Entry structure like this:

#[derive(Debug, Clone)]
struct Entry {
    author: Option<String>,
    title: String,
    // Other fields here.
}
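
One way to turn a possibly empty field into an Option<String> is sketched below. The helper name non_empty is hypothetical, and this Entry has only the two fields shown above; your own structure will have more.

```rust
#[derive(Debug, Clone)]
struct Entry {
    author: Option<String>,
    title: String,
}

/// Hypothetical helper: map an empty field to None, anything else to Some.
fn non_empty(field: &str) -> Option<String> {
    if field.is_empty() {
        None
    } else {
        Some(field.to_string())
    }
}

fn main() {
    let with_author = Entry {
        author: non_empty("Roosevelt, Wyn; Schneider, S."),
        title: "The Frontier Boys in the Sierras; Or, The Lost Mine".to_string(),
    };
    let without_author = Entry {
        author: non_empty(""),
        title: "Baseball ABC".to_string(),
    };

    for entry in [with_author, without_author] {
        // Matching on the Option forces us to handle the no-author case.
        match &entry.author {
            Some(author) => println!("{} by {}", entry.title, author),
            None => println!("{} (no author recorded)", entry.title),
        }
        // {:?} prints the Debug representation generated by #[derive(Debug)].
        println!("{entry:?}");
    }
}
```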

Your task

First, you’ll need to download the index file and decompress it inside your assignment repository:

$ gunzip pgindex.txt.gz

Your repo should look like this:

repo
├── Cargo.toml
├── pgindex.txt
└── src
    └── main.rs

Do not commit pgindex.txt to your repository: at 46 MB, it’s quite large for a text file.

In main.rs, define a new Entry structure that has fields for author, title, and URL.

In your run() function, open the file pgindex.txt and wrap it in a BufReader. (If you forget how BufReader works, review Lab 4.)

Then read from the reader line by line (the .lines() method is extremely useful here) and parse each line into an Entry. Finally, print out the debug representation of the Entry.

use std::fs::File;
use std::io::{BufRead, BufReader};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

#[derive(Debug)]
struct Entry {
    author: Option<String>,
    title: String,
    // Other fields
}

fn run() -> Result<()> {
    let reader = BufReader::new(File::open("pgindex.txt")?);

    for line in reader.lines() {
        // Each item is an io::Result<String>; unwrap it with ? and borrow it as a &str.
        let line: &str = &line?;
        // TODO: Parse the entry here. Set the author field to None if the author string is empty.
        let author = Some("TODO: get this from line".to_string());
        let title = "TODO: get this from line".to_string();
        // Other fields
        let entry = Entry {
            author,
            title,
            // ...
        };

        println!("{entry:?}");
    }
    Ok(())
}
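
While developing, it can help to exercise a line-reading loop on a small in-memory string rather than the full index. The sketch below is only an illustration: count_fields is a hypothetical helper, the sample data and example.org URLs are made up, and std::io::Cursor stands in for the opened file.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical helper: return the number of tab-separated fields on each line.
/// It accepts any BufRead, so a BufReader<File> or an in-memory Cursor both work.
fn count_fields<R: BufRead>(reader: R) -> io::Result<Vec<usize>> {
    let mut counts = Vec::new();
    for line in reader.lines() {
        // Each item is an io::Result<String>; `?` propagates a read error.
        let line = line?;
        counts.push(line.split('\t').count());
    }
    Ok(counts)
}

fn main() -> io::Result<()> {
    // Made-up sample data: the first line has an empty author field.
    let sample = "\tBaseball ABC\thttps://example.org/1\n\
                  Doe, Jane\tA Title\thttps://example.org/2\ttext https://example.org/2.txt\n";
    let counts = count_fields(Cursor::new(sample))?;
    println!("{counts:?}");
    Ok(())
}
```

Because the helper is generic over BufRead, the same loop body would work unchanged on a BufReader wrapping the real file.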

Tip

If you run your code, you’re going to get a lot of output. Far more than you want. I recommend that while you’re testing this part, you replace reader.lines() in the above code with reader.lines().take(5) which will only run the body of the loop for the first 5 lines of the file.

Just don’t forget to remove the .take(5) later.
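
To see what .take(n) does in isolation, here is a small sketch that reads from an in-memory string. The helper name first_lines and the sample text are assumptions made for this illustration.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical helper: collect at most `limit` lines from a reader.
fn first_lines<R: BufRead>(reader: R, limit: usize) -> std::io::Result<Vec<String>> {
    // take(limit) stops the iterator early; collect gathers the lines
    // and returns the first read error, if any.
    reader.lines().take(limit).collect()
}

fn main() -> std::io::Result<()> {
    let sample = "one\ntwo\nthree\nfour\n";
    for line in first_lines(Cursor::new(sample), 2)? {
        println!("{line}");
    }
    Ok(())
}
```

If the input has fewer lines than the limit, take simply yields all of them.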

When this all works, you should see this output when you run your code.

Entry { author: None, title: "\"Contemptible\", by \"Casualty\"", url: "https://www.gutenberg.org/ebooks/18103" }
Entry { author: None, title: "A Handbook of the Boer War With General Map of South Africa and 18 Sketch Maps and Plans", url: "https://www.gutenberg.org/ebooks/15699" }
Entry { author: None, title: "A Jolly by Josh", url: "https://www.gutenberg.org/ebooks/17499" }
Entry { author: None, title: "Baseball ABC", url: "https://www.gutenberg.org/ebooks/19169" }
Entry { author: None, title: "Bibelen, Det nye Testamente", url: "https://www.gutenberg.org/ebooks/2143" }

Note that the first few entries in the index don’t have authors, so their author field prints as None.