Part 2. Reading the index (5 points)
In this part of the lab, you’re going to read in the index file and create an
instance of an Entry structure for each entry in the index.
There are many common formats for exchanging data. One of the simplest is
tab-separated values. In this format, each line in the file holds one
multi-field record. Each field is separated from the previous one in the line
by a tab character (we specify a tab character in code using the escape
sequence \t, similar to how \n is a newline character).
For example,
first field second field third field
XXX YYY ZZZ
shows a file with two records, each of which has three fields.
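As a minimal sketch of how such a file breaks down into records and fields, the two-record example above can be split first on newlines and then on tabs. The function name parse_tsv is made up for this illustration and is not part of the lab.

```rust
/// Hypothetical helper: splits tab-separated text into records,
/// where each record is a list of fields.
fn parse_tsv(text: &str) -> Vec<Vec<&str>> {
    text.lines()
        .map(|line| line.split('\t').collect())
        .collect()
}

fn main() {
    // The two-record example from above, with each tab written as `\t`.
    let text = "first field\tsecond field\tthird field\nXXX\tYYY\tZZZ\n";
    for (i, record) in parse_tsv(text).iter().enumerate() {
        println!("record {i}: {record:?}");
    }
}
```

Note that the spaces inside "first field" stay part of the field; only tab characters separate fields.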
The Project Gutenberg index is a tab-separated values file. Each line of the index file has the format
authors<tab>title<tab>URL<tab>resource 1<tab>...<tab>resource n
where <tab> means there’s a tab character. The first field is the authors,
the second field is the title, and the third is the URL for the entry on Project
Gutenberg’s website. The remaining fields represent individual resources
associated with the entry (resources include .epub files, text files, and
audio files).
Here’s one example,
Roosevelt, Wyn; Schneider, S. The Frontier Boys in the Sierras; Or, The Lost Mine https://www.gutenberg.org/ebooks/32253 text https://www.gutenberg.org/ebooks/32253.html.images text https://www.gutenberg.org/files/32253/32253-h/32253-h.htm text https://www.gutenberg.org/files/32253/32253-h.zip epub https://www.gutenberg.org/ebooks/32253.epub3.images epub https://www.gutenberg.org/ebooks/32253.epub.images epub https://www.gutenberg.org/ebooks/32253.epub.noimages text https://www.gutenberg.org/ebooks/32253.txt.utf-8 text https://www.gutenberg.org/files/32253/32253-8.txt text https://www.gutenberg.org/files/32253/32253-8.zip text https://www.gutenberg.org/files/32253/32253.txt text https://www.gutenberg.org/files/32253/32253.zip
Given a line of text from the index file, it’s easy to split it up into its parts by splitting the string on a tab character.
fn main() {
    // The tab characters between fields are written as `\t`. A trailing `\`
    // continues the string literal on the next line.
    let line = "Roosevelt, Wyn; Schneider, S.\t\
                The Frontier Boys in the Sierras; Or, The Lost Mine\t\
                https://www.gutenberg.org/ebooks/32253\t\
                text https://www.gutenberg.org/ebooks/32253.html.images\t\
                text https://www.gutenberg.org/files/32253/32253-h/32253-h.htm\t\
                text https://www.gutenberg.org/files/32253/32253-h.zip\t\
                epub https://www.gutenberg.org/ebooks/32253.epub3.images\t\
                epub https://www.gutenberg.org/ebooks/32253.epub.images\t\
                epub https://www.gutenberg.org/ebooks/32253.epub.noimages\t\
                text https://www.gutenberg.org/ebooks/32253.txt.utf-8\t\
                text https://www.gutenberg.org/files/32253/32253-8.txt\t\
                text https://www.gutenberg.org/files/32253/32253-8.zip\t\
                text https://www.gutenberg.org/files/32253/32253.txt\t\
                text https://www.gutenberg.org/files/32253/32253.zip";

    // Assuming we're reading through the file a line at a time and `line`
    // holds the current line of text, we can split the line into parts.
    let parts: Vec<&str> = line.split('\t').collect();
    println!("Author: {}", parts[0]);
    println!("Title: {}", parts[1]);
    println!("URL: {}", parts[2]);
    println!("{} resources", parts.len() - 3);
}
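Indexing into the split parts panics if a line has fewer than three fields. One possible alternative, shown here only as a sketch, is to pull fields off the split iterator one at a time and return None when a field is missing. The helper name leading_fields and the sample lines are invented for this illustration; they do not come from the lab or the real index.

```rust
/// Hypothetical helper: returns the first three tab-separated fields of
/// `line`, or `None` if the line has fewer than three fields.
fn leading_fields(line: &str) -> Option<(&str, &str, &str)> {
    let mut fields = line.split('\t');
    let authors = fields.next()?;
    let title = fields.next()?;
    let url = fields.next()?;
    Some((authors, title, url))
}

fn main() {
    // Made-up lines for illustration; they are not taken from the real index.
    let good = "Some Author\tSome Title\thttps://example.org/1\ttext https://example.org/1.txt";
    let short = "only one field";
    println!("{:?}", leading_fields(good));
    println!("{:?}", leading_fields(short));
}
```

An empty authors field still counts as a field here, so a line that starts with a tab yields an empty string for the authors rather than None.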
You’re going to represent each entry using an instance of an Entry
structure, so you’re going to have to define that structure.
Each entry has an author field, a title field, a URL field and 0 or more resources associated with it. In fact, it’s possible for the author field to be empty. In this case, Project Gutenberg hasn’t recorded the author or the author is unknown. You’ll have to deal with this. Let’s ignore the resources for the moment. We’ll return to them in a later part.
Since the author field can be empty, let’s model that with an
Option<String> so that None refers to the no author case and
Some(authors) is the case where there are authors. The other fields should
not be empty. This suggests that we define our Entry structure like this.
#[derive(Debug, Clone)]
struct Entry {
    author: Option<String>,
    title: String,
    // Other fields here.
}
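One way to turn a raw author field into the Option<String> described above is to check whether the field is empty. This is only a sketch of the idea; the helper name author_from_field is hypothetical, and you may prefer to do the check inline.

```rust
/// Hypothetical helper: maps an empty author field to `None` and any
/// other value to `Some(owned copy of the field)`.
fn author_from_field(field: &str) -> Option<String> {
    if field.is_empty() {
        None
    } else {
        Some(field.to_string())
    }
}

fn main() {
    println!("{:?}", author_from_field(""));
    println!("{:?}", author_from_field("Roosevelt, Wyn; Schneider, S."));
}
```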
Your task
First, you’ll need to download the index file and decompress it inside your assignment repository:
$ gunzip pgindex.txt.gz
Your repo should look like this:
repo
├── Cargo.toml
├── pgindex.txt
└── src
    └── main.rs
Do not add pgindex.txt to your repository. At 46 MB, it’s pretty large for a text file.
In main.rs, define a new Entry structure that has fields for author, title, and URL.
In your run() function, open the file pgindex.txt and wrap it in a
BufReader. (If you forget how BufReader works, review Lab 4.)
Then read from the reader line by line (the .lines() method is extremely
useful here) and parse each line into an Entry. Finally, print out the debug
representation of the Entry.
use std::fs::File;
use std::io::{BufRead, BufReader};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

#[derive(Debug)]
struct Entry {
    author: Option<String>,
    title: String,
}

fn run() -> Result<()> {
    let reader = BufReader::new(File::open("pgindex.txt")?);
    for line in reader.lines() {
        // Turn the line from a Result<String> into a &str.
        let line: &str = &line?;

        // TODO: Parse the entry here. You will need to set the author
        // field to None if the author string is empty.
        let author = Some("TODO: get this from line".to_string());
        let title = "TODO: get this from line".to_string();
        // Other fields

        let entry = Entry {
            author,
            title,
            // ...
        };
        println!("{entry:?}");
    }
    Ok(())
}
If you run your code, you’re going to get a lot of output, far more than you want. While you’re testing this part, I recommend replacing reader.lines()
in the above code with reader.lines().take(5),
which will only run the body of the loop for the first 5 lines of the file.
Just don’t forget to remove the .take(5) later.
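To see how .take() limits a line iterator without needing pgindex.txt, here is a small sketch that reads from an in-memory string instead of a file. The helper name first_lines and the sample data are assumptions made for this illustration only.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Hypothetical helper: reads at most `limit` lines from any buffered
/// reader, stopping early on the first I/O error.
fn first_lines<R: BufRead>(reader: R, limit: usize) -> std::io::Result<Vec<String>> {
    reader.lines().take(limit).collect()
}

fn main() -> std::io::Result<()> {
    // An in-memory stand-in for the index file.
    let data = "one\ntwo\nthree\nfour\n";
    let reader = BufReader::new(Cursor::new(data));
    for line in first_lines(reader, 2)? {
        println!("{line}");
    }
    Ok(())
}
```

Because .take(2) stops the iterator after two items, the remaining lines are never read.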
When this all works, you should see this output when you run your code.
Entry { author: None, title: "\"Contemptible\", by \"Casualty\"", url: "https://www.gutenberg.org/ebooks/18103" }
Entry { author: None, title: "A Handbook of the Boer War With General Map of South Africa and 18 Sketch Maps and Plans", url: "https://www.gutenberg.org/ebooks/15699" }
Entry { author: None, title: "A Jolly by Josh", url: "https://www.gutenberg.org/ebooks/17499" }
Entry { author: None, title: "Baseball ABC", url: "https://www.gutenberg.org/ebooks/19169" }
Entry { author: None, title: "Bibelen, Det nye Testamente", url: "https://www.gutenberg.org/ebooks/2143" }
Note that the first few entries in the index don’t have authors, so the author field prints as None.