Use files

Recently I’ve found myself in the common situation where I have to open a file and make something with its content. This is a basic task and should present no difficulties to a programmer; I happened, however, to get stuck for a while in search of a good way to carry out this apparently simple task. In particular, I have run into the recent addition of regular expression processing to the STL. Spoiler alert: I was not particularly happy of what I’ve found along the way.

Parse a graph data file

A file with data for a graph can be structured in several ways. I usually try to use the format I’ve learnt by reading Skiena’s book on algorithms and data structures, with a first line containing the number of vertices and of edges, and each subsequent line containing an edge in the form of a source and a destination node. A comment line can be made by starting the line itself with a pound sign.

For example:

# Directed graph, 6 vertices, 8 edges
6 8
1 2
2 1
2 3
2 4
3 6
# 4 is source to no destination
5 4
6 4

designates a graph in the form of the following figure:

Graph from data file

Using streams

The first way I’ve found to correctly read and parse the file is based on character-by-character reading:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Using std::istream::get()
int main() {
if (argc != 2) {
cerr << "Usage: " << argv[0] << " FILENAME\n";
return EXIT_FAILURE;
}

ifstream f(argv[1]);
if (!f.good()) {
cerr << "Problems opening file " << argv[1] << "!\n";
return EXIT_FAILURE;
}
bool first_line = true;
int V = 0;
int E = 0;
int src = 0;
int dest = 0;
char c;
stringstream line;

// This test returns true if the stream is available
// for additional reading
while (f.get(c)) {
if (c == '#') {
f.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
} else if (c == '\n') {
// line ended
if (first_line) {
line >> V >> E;
cout << "V: " << V << ", E: " << E << '\n';
first_line = false;
} else {
line >> src >> dest;
add_edge(src, dest);
}
line.clear();
} else {
line << c;
}
}

return EXIT_SUCCESS;
}

Using regex

The second way I’ve found uses regular expressions. Let’s have a look at the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
int main() {
int V = 0;
int E = 0;
bool first_line = true;
if (argc != 2) {
cerr << "Usage: " << argv[0] << " FILENAME\n";
return EXIT_FAILURE;
}

ifstream file(argv[1]);
if (!file.good()) {
cerr << "Problems opening file " << argv[1] << "!\n";
return EXIT_FAILURE;
}

std::regex re("(.+) (.+)");

while(file.good()) {
string line;
if (!getline(file, line))
break;
std::smatch result;
if (std::regex_search(line, result, std::regex("^#"))) {
continue;
} else {
if (!std::regex_search(line, result, re)) {
cout << "No match for '" << line << "'\n";
continue;
} else {
if (!first_line) {
int src = std::stoi(result[1]);
int dest = std::stoi(result[2]);
process_edge(src, dest);
} else {
first_line = false;
// read number of vertices and edges
V = std::stoi(result[1]);
E = std::stoi(result[2]);
cout << "V : " << V << ", E: " << E << '\n';
}
}
}
}
}

This code only works correctly with the clang compiler, using the clang implementation of the STL (option -stdlib=libc++ on Ubuntu). The detail of what happens is the following:

1
std::regex re("(.+) (.+)");

This builds the regular expression. I want to match digits in this case, but when I tried with the escape sequence \d instead of the generic dot (“match any character”), I got a warning about unknown escape sequence, even specifying explicitly what is expected to be the default regex grammar, ECMAScript: it can be used as a second parameter to the constructor of the regex:

1
std::regex re("(.+) (\d+)", std::regex_constants::ECMAScript);

The second interesting bit is the search:

1
2
3
4
5
// ...
if (std::regex_search(line, result, std::regex("^#"))) {
// ...
if (!std::regex_search(line, result, re)) {
// ...

This second one, for example, looks for the regex re in the line, and stores the matching groups (delimited by the parentheses in the regular expression) in a sub_match object, result, that we can later use to initialize the integers. Note the use of the new (in the C++11 sense) functions stoi, that solve the basic problem of converting a numeric string into an integer number.

In this case, as a greater benefit, we can match comments with a pound sign in any point of the line, provided it is the very first character of the line. Compare with what we do in the first example, where we only match the pound if it is in the first column.

Conclusion

We have had a look at how to perform a very basic task, namely, opening a file and reading its content line by line, parsing the lines as they come. The support of regular expressions is still very basic (using gcc was impossible), and clang proves to be ahead of the competition. I am an avid user of regex, so I hope that the support will become flawless in a very small time.