EP 128 · Parsing and Performance · Dec 7, 2020 ·Members

Video #128: Parsing and Performance: Combinators

smart_display

Loading stream…

Video #128: Parsing and Performance: Combinators

Episode: Video #128 Date: Dec 7, 2020 Access: Members Only 🔒 URL: https://www.pointfree.co/episodes/ep128-parsing-and-performance-combinators

Episode thumbnail

Description

We convert some of our substring parsers to work on lower levels of String abstractions, and unlock huge performance gains. Even better, thanks to our generalized parser we can even piece together multiple parsers that work on different abstraction levels, maximizing performance in the process.

Video

Cloudflare Stream video ID: 2b35e77662c1e08b768a086b8b3a882f Local file: video_128_parsing-and-performance-combinators.mp4 *(download with --video 128)*

References

Transcript

0:05

So there really are quite substantial performance gains to be had by dropping to lower and lower abstractions levels. The biggest gain is just by using substring over string, but even using unicode scalars and UTF-8 are big enough to consider using it if possible.

0:22

Now how do we apply what we’ve learned to parsers?

0:27

Well, so far all of our string parsers have been defined on Substring , and we’ve used lots of string APIs such as removeFirst , prefix and range subscripting. As we have just seen in very clear terms, these operations can be a little slow on Substring because of the extra work that must be done to properly handle traversing over grapheme clusters and normalized characters. The time differences may not seem huge, measured in just a few microseconds, but if you are parsing a multi-megabyte file that can really add up.

1:00

So, let’s see what kind of performance gains can be had by switching some of our parsers to work with UTF-8 instead of Substring . Benchmarking a simple parser

1:13

Let’s start by getting a benchmark suite into place. We’ll create a new file Benchmark-IntParsing.swift for us to put some integer parsing benchmarks, starting with a benchmark for our int parser combinator. import Benchmark import Foundation let intParserSuite = BenchmarkSuite(name: "Int Parsing") { suite in let string = "1234567890" suite.benchmark("Substring") { } }

1:37

We’ll need to copy over all of our parser code to get access to it, so we’ll create a new Parsing.swift file.

1:59

And now we can finish filling in the benchmark: // Benchmark-IntParsing.swift suite.benchmark("Substring") { var string = string[...] _ = Parser.int.run(&string) } // main.swift Benchmark.main( [ // copyingSuite, intParserSuite, ] ) running Int Parsing: Substring... done! (1896.02 ms) name time std iterations ------------------------------------------------------- Int Parsing.Substring 2451.000 ns ± 164.48 % 544265 Program ended with exit code: 0

2:46

So it seems to take about 2 microseconds to parse a single integer from a string. That seems pretty fast, but we have nothing to compare it to. Let’s add a benchmark for just using Int ’s initializer that converts strings to integers: suite.benchmark("Int.init") { _ = Int(string) } running Int Parsing: Int.init... done! (98.27 ms) running Int Parsing: Substring... done! (1699.20 ms) name time std iterations ------------------------------------------------------- Int Parsing.Int.init 46.000 ns ± 504.19 % 1000000 Int Parsing.Substring 2271.000 ns ± 91.28 % 518657 Program ended with exit code: 0

3:17

Wow ok, so that function takes a fraction of a microsecond to do its job. It looks to be about 45 times faster than our parser, but really once you factor in the 30 nanoseconds a test takes to run in the first place it’s more like 100 times faster.

3:35

This is all kind of expected. All it is doing is converting the entire string into an integer, where our parser is doing extra work to scan the maximum number of characters off the front of string to make a valid integer.

3:50

Before moving one we want to do one quick thing. Right now we are performing the work we want to benchmark and then completely discarding it: _ = Int(string) … _ = Parser.int.run(&string)

3:59

This may be ok to do now, but in the future as we refactor our parsers to make them more performant we may accidentally break them. If we do break a parser, then benchmarks like these will keep humming along and give us misleading numbers. We may think we’ve uncovered some magical way to unlock a 10x performance boost to a parser, when in reality we just broke it.

4:20

So we’d like to know as early as possible when we break a parser. To do this we will add preconditions to our benchmarks as a lightweight way to verify that the data extracted by the parser matches what we expect: precondition(Int(string) == 1234567890) precondition(Parser.int.run(&string) == 1234567890)

4:53

This will keep us in check and we can have more certainty that our benchmarks are measuring the things we think they are.

5:06

With that said, the benchmark for Int ’s initializer is good to keep in mind the cost of parsing in general, but it still doesn’t tell us how our parser compares to other forms of parsing. We’ve previously mentioned that Apple’s Foundation framework ships with a few tools for parsing, and the most general purpose tool is known as Scanner . You create a scanner for your input string, and then the scanner exposes a bunch of methods for you to consume bits off the beginning of the string and turn it into integers, doubles and more. Let’s create a benchmark for a scanner grabbing an integer from the beginning of a string: import Foundation … suite.benchmark("Scanner") { let scanner = Scanner(string: string) precondition(scanner.scanInt() == 1234567890) }

5:59

In order for us to be able to use this API we need to specify in Package.swift that we want to build for a newer version of macOS: platforms: [.macOS(.v10_15)],

6:18

And now we can run benchmarks: running Int Parsing: Int.init... done! (127.85 ms) running Int Parsing: Substring... done! (1767.38 ms) running Int Parsing: Scanner... done! (865.51 ms) name time std iterations ------------------------------------------------------- Int Parsing.Int.init 67.000 ns ± 355.16 % 1000000 Int Parsing.Substring 2345.000 ns ± 93.27 % 516334 Int Parsing.Scanner 638.000 ns ± 174.43 % 1000000 Program ended with exit code: 0

6:22

So it seems that the scanner is about 4 times faster than our parser. That’s a bit of a bummer. We would hope that we can beat the performance of Apple’s APIs. But even worse, this benchmark isn’t very fair, and this is what makes making good benchmarks so difficult. Right now our scanner benchmark is not only measuring how long it takes for it to scan an integer, but also the amount of time it takes to create a scanner. That is unfair because creating a scanner is a one time cost, not something you do every time you want to parse.

6:59

Google’s swift-benchmark library exposes the ability to do work before a benchmark is run that does not count towards the total time run. Whenever we call these .benchmark methods on the suite, under the hood it’s constructing a concrete type that conforms to a protocol and “registering” it with the suite: suite.register(benchmark: <#AnyBenchmark#>)

7:19

The .benchmark methods we have been using are just conveniences to not have to construct a whole new type that conforms to this protocol. Unfortunately, that convenience method doesn’t expose the set up functionality, so we need to conform to the protocol with a custom type. I’m going to paste in a simple wrapper struct: struct Benchmarking: AnyBenchmark { var name: String private var run_: (_ state: inout BenchmarkState) throws -> Void private var setUp_: () -> Void init( name: String, setUp: @escaping () -> Void = {}, run: @escaping (_ state: inout BenchmarkState) throws -> Void ) { self.name = name self.setUp_ = setUp self.run_ = run } func setUp() { self.setUp_() } func run(_ state: inout BenchmarkState) throws { try self.run_(&state) } // No-op var settings: [BenchmarkSetting] { [] } func tearDown() {} }

7:55

With this new Benchmarking type we can now benchmark the scanner code in a fairer manner by moving the creation of the scanner into the set up block: var scanner: Scanner! suite.register( benchmark: Benchmarking( name: "Scanner", setUp: { scanner = Scanner(string: string) }, run: { _ in precondition(scanner.scanInt() == 1234567890) } ) ) running Int Parsing: Int.init... done! (94.60 ms) running Int Parsing: Substring... done! (1688.96 ms) running Int Parsing: Scanner... done! (773.89 ms) name time std iterations ------------------------------------------------------- Int Parsing.Int.init 59.000 ns ± 594.36 % 1000000 Int Parsing.Substring 1160.000 ns ± 96.74 % 633706 Int Parsing.Scanner 275.000 ns ± 271.38 % 1000000 Program ended with exit code: 0

9:06

And now scanner is even faster. About 10 times faster than our parser! That’s no good. We want to be faster than Apple’s APIs without sacrificing composability.

9:21

Fortunately, there are some obvious things we can do to bridge the gap. For one thing, we’ve already seen that scanning characters on Substring is expensive because it has to deal with grapheme clusters and character normalization. We saw that if we just drop down one level of abstraction, to unicode scalars, we can get a bit more performance without losing too much cohesion in how the code is written.

9:41

So, let’s try porting our int parser that is currently defined on Substring to Substring.UnicodeScalarView . We can simply copy and paste the int parser we’ve already defined, and make a few changes. First we should constrain the extension to Substring.UnicodeScalarView as its Input : extension Parser where Input == Substring.UnicodeScalarView, Output == Int { static let int = Self { input in … } }

10:00

The beginning of this parser can basically do exactly what the other int parser does, which is to find the maximum prefix of the input string consisting of only numeric characters, as well as potentially starts with a plus or minus sign: var isFirstCharacter = true let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == "-" || c == "+") && isFirstCharacter || c.isNumber }

10:05

However, one problem here is that c does not have an isNumber property. That’s a property on the Character type, which is the collection element of Substring , but here c is now a Unicode.Scalar , which is completely different.

10:28

We couldn’t find an easy way to determine if a unicode scalar is a numeric digit or not, but one thing we can do is to simply check if the character is contained in the range between the scalars “0” and “9”: || ("0"..."9").contains(c) Correction Jasdev Singh pointed out that there is a unicode scalar helper for this. We can check the scalar’s properties to see if it has a numericType , which will provide similar behavior to the substring-based parser: || c.properties.numericType != nil

10:47

Next we try to convert this prefix we computed into an integer, and if it fails we can fail the parser: guard let output = Int(intPrefix) else { return nil }

11:00

This doesn’t compile yet because apparently you cannot construct an integer from a UnicodeScalarView . We can work around this by first converting the scalar view into a string, and then turning that into a string: guard let output = Int(String(intPrefix)) else { return nil }

11:06

And that’s all that needs to change! So this is a pretty straightforward port of the int parser to UnicodeScalarView . We can definitely see that it’s a little less straightforward to write against scalar views because we have to care about more low level details, like the fact that we can’t turn a scalar view into an integer directly, and we sometimes lose helpers that exist on higher level abstractions, like isNumber . But for the most part it looks basically the same as when trying to parse an integer from the beginning of a substring, especially since Unicode.Scalar conforms to a literal protocol that allows us to do things like use the “0” and “9” literals.

11:38

That’s cool and all, but the thing we really care about is how fast it is. Let’s write a new benchmark: suite.benchmark("UnicodeScalar") { var string = string[...].unicodeScalars _ = Parser.int.run(&string) } running Int Parsing: Int.init... done! (111.94 ms) running Int Parsing: Substring... done! (1554.03 ms) running Int Parsing: UnicodeScalar... done! (372.46 ms) running Int Parsing: Scanner... done! (880.55 ms) name time std iterations ----------------------------------------------------------- Int Parsing.Int.init 55.000 ns ± 182.58 % 1000000 Int Parsing.Substring 2297.000 ns ± 84.24 % 441184 Int Parsing.UnicodeScalar 259.000 ns ± 241.28 % 1000000 Int Parsing.Scanner 292.000 ns ± 150.68 % 1000000 Program ended with exit code: 0

12:07

Very interesting! Parsing integers with our new unicode scalar parser is now basically the same speed as scanner parsing from Apple’s Foundation. It was a pretty small change but came with a 10x improvement in performance. Pretty incredible.

12:23

But it gets better. We previously saw that there is another string abstraction that sits even lower than unicode scalars, and that’s UTF8View . What if we wrote our integer parser on UTF8View s instead of unicode scalars? We hope that it would have better performance, but it’s also going to come with some extra costs on its own. See, unicode scalars and substrings are doing a lot of work for us to hide the complexities of strings, and UTF8View lays almost everything bare for us to see. So we expect an integer parser on UTF8View s to be a little more complicated.

12:56

Let’s start by defining a new parser inside the Parser type when the input is a Substring.UTF8View . We’ll copy and paste things once again, but this time we’ll constrain the extension to UTF8View : extension Parser where Input == Substring.UTF8View, Output == Int { static let int = Self { input in … } }

13:08

We can begin much like we do with the unicode scalar parser and substring parser by trying to find the larger prefix of a number at the beginning of the input string: var isFirstCharacter = true let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == "-" || c == "+") && isFirstCharacter || ("0"..."9").contains(c) }

13:12

However, we know longer have the luxury of checking if our character is equal to these literal strings. Unicode.Scalar conforms to ExpressibleByUnicodeScalarLiteral , but UTF8.CodeUnit does not. We have to figure what is the code unit for these characters and type them in explicitly. We are lucky that these are all simple ASCII characters, which means they consist of only a single code unit. A more complex character could consist of many code units, and even do so in a non-unique way.

13:46

So, for example, we could look up an ASCII table to find out that the minus sign’s code unit is 45, the plus sign is 43, zero is 48 and 9 is 57, and so what we really need to do is this: let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == 45 || c == 43) && isFirstCharacter || (48...57).contains(c) }

14:06

That’s quite a bit more obscure than what we had before. Luckily we can clean this up slightly because UTF8.CodeUnit has an initializer that takes an ASCII character let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return ( c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+") ) && isFirstCharacter || ( UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9") ) .contains(c) }

14:36

It’s not clear from this code but UTF8.CodeUnit is actually just a type alias for UInt8 , so UTF8View s are nothing more than collections of UInt8 s.

14:46

This has made our code a little bit messier, but at least we don’t have to memorize the ASCII table to understand what this code is doing, and in our experience these helpers don’t introduce a measurable cost to performance.

14:56

Next we want to trying to convert this sequence of code units into an integer, and fail the parser if it doesn’t convert. We can bundle the UTF-8 view back into a substring and pass it along: guard let output = Int(String(intPrefix)) else { return nil }

15:04

Unfortunately this doesn’t work. There is indeed a string initializer for a sequence of code units, but it’s failable. Substring , however, has a non-failable initializer that takes a UTF-8 view, so we can just nest the initializers to get an Int : guard let output = Int(String(Substring(intPrefix))) else { return nil }

15:13

So we’re seeing a pattern of what it’s like to write algorithms against these lower level string representations. The lower we go the more obscure the algorithm gets and the more we have to worry about nitty gritty details such as code units and scalars.

15:27

But was it worth it? Do we get any performance benefits from this? Let’s write a benchmark: suite.benchmark("UTF8") { var string = string[...].utf8 _ = Parser.int.run(&string) } running Int Parsing: Int.init... done! (105.61 ms) running Int Parsing: Substring... done! (1533.45 ms) running Int Parsing: UnicodeScalar... done! (364.50 ms) running Int Parsing: UTF8... done! (200.39 ms) running Int Parsing: Scanner... done! (868.21 ms) name time std iterations ----------------------------------------------------------- Int Parsing.Int.init 53.000 ns ± 414.80 % 1000000 Int Parsing.Substring 2381.000 ns ± 80.62 % 413786 Int Parsing.UnicodeScalar 254.000 ns ± 248.81 % 1000000 Int Parsing.UTF8 134.000 ns ± 507.62 % 1000000 Int Parsing.Scanner 283.000 ns ± 272.39 % 1000000 Program ended with exit code: 0

15:48

Wow! We are now more than twice as fast as Foundation’s Scanner , and we are only about 4 times slower than Int ’s initializer, which doesn’t do nearly as much work as we do in our parser.

16:04

So this is pretty incredible. Just by dropping down to UTF-8 we have improved the performance of our int parser on substrings by nearly 25 times. And it may be hard to believe, but it’s even possible to close this last remaining gap between our UTF8View integer parser and Swift’s Int initializer. There’s a little bit of trickery we can do to get access to the raw storage inside the string, and parsing it is even more efficient than using UTF8View directly. We’re not going to show that now, though. Instead we’ll save that as an exercise for the viewer.

16:37

But, we now have a bit of a mess on our hands. We have 3 different integer parsers working on 3 different abstraction levels of strings: extension Parser where Input == Substring, Output == Int { static let int = Self { input in … } } extension Parser where Input == Substring.UnicodeScalarView, Output == Int { static let int = Self { input in … } } extension Parser where Input == Substring.UTF8View, Output == Int { static let int = Self { input in … } }

17:14

Is this the right way to approach integer parsing? Should we maintain 3 different parsers forever? Or should we get rid of the the substring and unicode scalar parser and just keep the UTF-8 parser? Or is there a completely different approach that we are missing? Benchmarking a complex parser

17:30

In order to answer these questions we need to spend some time with more real world parsing problems to see exactly how we want to interact with UTF-8 on strings. We are going to bring back the marathon race parser we previously made in order to demonstrate parsing complex string formats.

18:03

All of this code is mostly the same as what we have done before:

18:07

We have a few structs that represent the first-class data types that we want to parse into. We have a big string that holds all of the marathon races we want to parse. And then we have a bunch of parsers to attack this problem in small chunks and then we glue them all together to form the big marathon races parser.

18:36

The only real change we’ve made is that we have made a bunch of these helper parsers private so that they don’t leak out of the file. This will be important in a moment because we will soon be making a UTF-8 version of this parser in order to understand the differences between the two styles.

18:48

But before doing that let’s write a benchmark for this parser. We’ll create a new file, Benchmark-Race.swift , and run the parser on the big string: // Benchmark-Race.swift import Benchmark let raceSuite = BenchmarkSuite(name: "Race") { suite in suite.benchmark("Substring") { var input = upcomingRaces[...] precondition(races.run(&input)?.count == 3) } } // main.swift import Benchmark Benchmark.main([ // copyingSuite, // intParserSuite, raceSuite, ]) running Race: Substring... done! (1636.13 ms) name time std iterations -------------------------------------------------- Race.Substring 249101.000 ns ± 24.27 % 4792 Program ended with exit code: 0

19:35

So it takes about 0.2 milliseconds to parse that string. I guess that seems fast, but perhaps it can be made even faster.

19:44

Right now all of the race parsers are working on Substring , which means they incur a bit of an extra cost when traversing over characters. As we’ve said before, this is due to the fact that characters are variable width, and so extra work needs to be done to move the substring’s startIndex forward since it has to figure out how many code units to jump ahead.

20:06

However, if we look closely at what the parser is doing we never really need the full power of Substring . There isn’t a single spot where we need to deal with UTF-8 normalization. There’s no handling of accented characters, emoji or any of the other exotic things UTF-8 gives us access to. Correction JP Simard brought to our attention that we are overstating the simplicity of the situation here. Although it is true that our parsers do not currently need the full power of the Substring API, it doesn’t mean we are fully in the clear to operate on code units. In particular, the \n code unit can combine with the carriage return code unit \r to form a new grapheme cluster that represents a newline. So, if we only parse \n s we run the risk of leaving a bunch of errant \r s in the parser’s output. This is a subtlety that is important to keep in mind. We will discuss this issue more in the next episodes of this series.

20:26

For example, when parsing the multiplier sign of the north/south coordinate we check if a character is equal to either “N” or “S”, but those are just basic ASCII characters: private let northSouth = Parser.char.flatMap { $0 == "N" ? .always(1.0) : $0 == "S" ? .always(-1) : .never }

20:39

Substring isn’t doing us any favors in that equality check. We could have just as easily checked the UTF-8 code points against the numeric ASCII value of “N” and “S”.

20:47

And even the places where we are using non-ASCII characters, we are doing so with characters that are still quite simple. For example, when parsing latitude coordinates we need to parse the exact match of a degree symbol: private let latitude = Parser.double .skip("° ") .take(northSouth) .map(*)

21:01

But that is the only character for that symbol, as far as we know. So we don’t have to worry about UTF-8 normalization kicking in to make sure that multiple representations of the same character get identified together.

21:14

We have something similar for currencies, but again there is only one character for these symbols: private let currency = Parser<Substring, Currency>.oneOf( Parser.prefix("€").map { Currency.eur }, Parser.prefix("£").map { .gbp }, Parser.prefix("$").map { .usd } )

21:24

And then outside of those examples we do some really simple matching on newlines, commas and dashes, all of which are ASCII characters.

21:31

We are incurring the cost of Substring , which we know can be significant, even though none of the parsing we are doing requires the full power of Substring . So, let’s try write a new parser that works on lower-level abstraction to see if we get any performance improvements. We’ll jump straight to UTF-8. We’ll copy and paste the races parser and its helper parsers to a new file.

22:05

And we’ll rename races to racesUtf8 so that it doesn’t conflict with the other races parser: let racesUtf8 = … The helper parsers do not need to be renamed because they are just private to this file.

22:12

Let’s convert these parsers to UTF-8 one by one, starts with the northSouth parser: private let northSouth = Parser.char.flatMap { $0 == "N" ? .always(1.0) : $0 == "S" ? .always(-1) : .never }

22:17

Here we are using the char parser to peel off the first character of the substring so that we can then transform it into either 1.0 or -1.0 if the character is “N” or “S”, and otherwise we fail with nil .

22:36

In order for this to work with UTF-8 we need the concept of peeling off the first code unit from a UTF8View so that we can check if it equals the ASCII characters “N” or “S”. Sounds like we need to cook up a more general version of the char parser, one that works on any kind of collection. It will have the same implementation as the char parser, but it be loosely constrained to any type of collection whose SubSequence associated type is equal to itself: extension Parser where Input: Collection, Input.SubSequence == Input, Output == Input.Element { static var first: Self { Self { input in guard !input.isEmpty else { return nil } return input.removeFirst() } } }

24:25

Heck we could now even alias char parser to first : extension Parser where Input == Substring, Output == Character { static let char = first // static let char = Self { input in // guard !input.isEmpty else { return nil } // return input.removeFirst() // } }

24:35

Now we might hope we can just replace char with first and be done with it: private let northSouth = Parser.first.flatMap { $0 == "N" ? .always(1.0) : $0 == "S" ? .always(-1) : .never }

24:42

But unfortunately first is now so generic that Swift has no idea what kind of collection we are wanting to peel off the first element of. This is something we mentioned before and it’s just the reality of working with ultra-general parsers. You just sometimes need to be more explicit with the types. In this case we are parsing a UTF8View into a UTF8.CodeUnit , which is just a type alias for UInt8 : private let northSouth = Parser<Substring.UTF8View, UTF8.CodeUnit> .first .flatMap { $0 == "N" ? .always(1.0) : $0 == "S" ? .always(-1) : .never } Binary operator ‘==’ cannot be applied to operands of type ‘UTF8.CodeUnit’ (aka ‘UInt8’) and ‘String’

25:22

It’s also not quite right to compare a code unit directly to a string literal like this, after all code units are just UInt8 s. But we can easily construct code units from ascii characters: private let northSouth = Parser<Substring.UTF8View, UTF8.CodeUnit> .first .flatMap { $0 == .init(ascii: "N") ? .always(1.0) : $0 == .init(ascii: "S") ? .always(-1) : .never }

25:49

So that wasn’t so bad. Things did get a little noisier, but if you squint you can basically see all the same shapes. And hopefully the payoff of doing this work is a nice bump in performance of the parser.

26:01

Next we have the eastWest parser which can similarly be updated like the northSouth parser: private let eastWest = Parser<Substring.UTF8View, UTF8.CodeUnit> .first .flatMap { $0 == .init(ascii: "E") ? .always(1.0) : $0 == .init(ascii: "W") ? .always(-1) : .never }

26:15

Then we have the latitude parser, which parses a double, then a degree sign, then the sign multiplier for north/south, and finally multiplies them together: private let latitude = Parser.double .skip("° ") .take(northSouth) .map(*)

26:31

The first reason this isn’t working is because the double parser still works on substrings. We haven’t generalized it to work on UTF-8 views like we did with the int parser. Rather than convert in from scratch I’ll paste in a version that works: extension Parser where Input == Substring.UTF8View, Output == Double { static let double = Self { input in let original = input let sign: Double if input.first == .init(ascii: "-") { sign = -1 input.removeFirst() } else if input.first == .init(ascii: "+") { sign = 1 input.removeFirst() } else { sign = 1 } var decimalCount = 0 let prefix = input.prefix { c in if c == .init(ascii: ".") { decimalCount += 1 } return (.init(ascii: "0") ... .init(ascii: "9")).contains(c) || (c == .init(ascii: ".") && decimalCount <= 1) } guard let match = Double(String(Substring(prefix))) else { input = original return nil } input.removeFirst(prefix.count) return match * sign } } It looks a lot like the substring version with a few minor changes.

27:03

The next problem is where we attempt to skip over the degrees symbol: .skip("° ")

27:12

This isn’t working because we are using the fact that parsers are ExpressibleByStringLiteral to turn the literal "° " into a parser that parses that exact string from the beginning of the input. Unfortunately that parser, too, works on Substring , not UTF8View . We could theoretically update the conformance to work on UTF8View instead of Substring , but that would be a dangerous thing to do since you could write a parser like this— // .skip("café") —to parse café off the front of the string, but because we would be using raw UTF-8 code units under the hood we would only be matching this one specific version of café. The other version of café that uses the combining character would not be recognized, and could lead to subtle bugs that are hard to track down.

28:24

Instead of updating the ExpressibleByStringLiteral conformance we are just going to be explicit with how we want to parse these characters off the front of the input. We have the .prefix parser combinator for doing precisely that, and it is suitably generic so that it will work for UTF8View in much the same way it did for Substring : .skip(.prefix("° "[...].utf8))

28:41

We did have to add a bit of extra noise for being explicit that we are dealing with Substring.UTF8View , but it’s not so bad.

28:50

We can use this same idea to also fix the longitude parser: private let longitude = Parser.double .skip(.prefix("° "[...].utf8)) .take(eastWest) .map(*)

28:54

Next we have the coordinate parser: private let zeroOrMoreSpaces = Parser<Substring, Void> .prefix(" ") .zeroOrMore() private let coord = latitude .skip(",") .skip(zeroOrMoreSpaces) .take(longitude) .map(Coordinate.init) This isn’t compiling because of the same string literal problem, but also because the zeroOrMoreSpaces parser deals with substrings instead of UTF8View s. Both are easy to fix: private let zeroOrMoreSpaces = Parser<Substring.UTF8View, Void> .prefix(" "[...].utf8).zeroOrMore() private let coord = latitude .skip(.prefix(","[...].utf8)) .skip(zeroOrMoreSpaces) .take(longitude) .map(Coordinate.init)

29:25

Next we have the currency parser, which just needs to force the .prefix parsers to operate on UTF8View s: private let currency = Parser<Substring.UTF8View, Currency>.oneOf( Parser.prefix("€"[...].utf8).map { Currency.eur }, Parser.prefix("£"[...].utf8).map { .gbp }, Parser.prefix("$"[...].utf8).map { .usd } )

29:35

Then we have the locationName parser, which just wants to consume everything until a comma. We can try to update that: private let locationName = Parser .prefix(while: { $0 != .init(ascii: ",") })

29:49

The problem with this parser is that it works on Substring instead of UTF8View , and in fact the prefix(while:) combinator is just far more constrained than it needs to be. Currently it only works on substrings: extension Parser where Input == Substring, Output == Substring { static func prefix(while p: @escaping (Character) -> Bool) -> Self { Self { input in let output = input.prefix(while: p) input.removeFirst(output.count) return output } } }

30:08

But the work happening inside the parser would really work for any collection whose SubSequence associated type is Self . So let’s generalize this parser: extension Parser where Input: Collection, Input.SubSequence == Input, Output == Input { static func prefix( while p: @escaping (Input.Element) -> Bool ) -> Self { Self { input in let output = input.prefix(while: p) input.removeFirst(output.count) return output } } }

30:48

And with this generalized prefix(while:) parser we can now properly generalize the locationName parser: private let locationName = Parser<Substring.UTF8View, Substring.UTF8View> .prefix(while: { $0 != .init(ascii: ",") })

30:57

Next we have the race parser, which pieces together lots of things: private let race = locationName.map(String.init) .skip(",") .skip(zeroOrMoreSpaces) .take(money) .skip("\n") .take(coord.zeroOrMore(separatedBy: "\n")) .map(Race.init(location:entranceFee:path:))

31:00

Not only do we need to convert all of the string literal parsers, but we also will no longer be able to so easily .map on locationName to turn its output into a string. This is because String ’s initializer for UTF8View s is failable, and so we have to do an extra step to create the string in a non-failable way: private let race = locationName.map { String(Substring($0)) } .skip(.prefix(","[...].utf8)) .skip(zeroOrMoreSpaces) .take(money) .skip(.prefix("\n"[...].utf8)) .take(coord.zeroOrMore(separatedBy: .prefix("\n"[...].utf8))) .map(Race.init(location:entranceFee:path:))

31:36

And then finally the races parser can be fixed with yet another .prefix combinator: let racesUtf8 = race .zeroOrMore(separatedBy: .prefix("\n---\n"[...].utf8)

31:45

OK, that was a pretty straightforward conversion, and we even were able to generalize a few parsers along the way, so that’s cool. Things definitely did get a little noisier, and that’s a bummer, but we didn’t have to change anything too substantial in our parsers in order to support operating on the much lower level of UTF-8 instead of Substring . In fact, the overall shape of these parsers is identical to our original parsers. It’s pretty cool that we can write parsers that look the same as when we work on the higher abstraction yet technically they are working on the lower abstraction.

32:17

Beyond aesthetics of the code, the thing that really matters is how well this new UTF-8 parser performs, so let’s create a new benchmark: suite.benchmark("UTF8") { var input = upcomingRaces[...].utf8 precondition(racesUtf8.run(&input)?.count == 3) } running Race: Substring... done! (1726.93 ms) running Race: UTF8... done! (2354.03 ms) name time std iterations -------------------------------------------------- Race.Substring 255560.000 ns ± 26.80 % 4789 Race.UTF8 78497.000 ns ± 29.57 % 15934 Program ended with exit code: 0

32:51

Nice! We have gotten more than a 300% speed up in our parsing code without really doing anything substantial at all. We are still writing our parsers mostly in the same style as before, just with a few extra decorations to force ourselves to use UTF8View s, but the performance has substantially improved.

33:09

This is pretty cool. We were able to get a huge performance boost without fundamentally changing our code. Sure we had to be a little more explicit with some UTF-8 stuff, but the shape of our code is the same. Typically when you have code that is written in a nice, readable style but has performance issues, the way to boost the performance is rewrite it in a less clear, but more efficient style. Here we are not having to do that at all. We can continue to use parsers just as we always do. Parsing multiple abstraction levels

33:46

But it gets better. Not only can we massively improve the performance of our parsers by working on the lower level UTF-8 representation of strings, but we can even mix together parsers that work on different abstraction levels of strings. For example, while we may be able to do the vast majority of our parsing on UTF8View s, there may still be times you really do need to parse Substring s because you care about the extra work it does for normalization.

34:14

Let’s explore this idea by adding something to our marathon race parser. Right now we parse the location name as everything up until the comma because we allow basically any location. Instead let’s say there are only a few locations we recognize, and so we will represent that as an enum: enum City { case berlin case london case newYork }

34:50

Then we can parse locations into this enum in the same way we do for currencies: private let city = Parser<Substring, City>.oneOf( Parser.prefix("Berlin").map { .berlin }, Parser.prefix("London").map { .london }, Parser.prefix("New York").map { .newYork } )

35:10

We can update our Location struct to use the City type instead of a free form string: struct Race { let location: City let entranceFee: Money let path: [Coordinate] } And then we can update the race parser to use the city parser rather than the locationName parser: private let race = city // locationName.map(String.init) …

35:23

This gets the substring parser in compiling order, but we also have to do something similar for the UTF-8 parser: private let city = Parser<Substring.UTF8View, City>.oneOf( Parser.prefix("Berlin"[...].utf8).map { .berlin }, Parser.prefix("London"[...].utf8).map { .london }, Parser.prefix("New York"[...].utf8).map { .newYork } ) private let race = city // locationName.map { String(Substring($0)) } …

35:52

Running the benchmarks we get something strange…a crash! It looks like one of our precondition s has failed. Well, this is a really good thing. Turns out we accidentally introduced a bug in our parser. We are matching against the string “New York”, but really it was supposed to be “New York City”: private let city = Parser<Substring, City>.oneOf( Parser.prefix("Berlin").map { City.berlin }, Parser.prefix("London").map { City.london }, Parser.prefix("New York City").map { City.newYork } ) … private let city = Parser<Substring.UTF8View, City>.oneOf( Parser.prefix("Berlin"[...].utf8).map { City.berlin }, Parser.prefix("London"[...].utf8).map { City.london }, Parser.prefix("New York City"[...].utf8).map { City.newYork } )

36:29

That was causing the parsing to fail pretty much immediately, and therefore would have greatly changed our benchmark numbers. This shows just how tricky benchmarks can be. If we didn’t have these preconditions in place we would not have been benchmarking the thing we thought we were, and we would be coming to some very faulty conclusions.

37:01

Now that that is fixed we can run our benchmarks and see that times haven’t changed much: running Race: Substring... done! (1763.90 ms) running Race: UTF8... done! (2370.90 ms) name time std iterations -------------------------------------------------- Race.Substring 265929.000 ns ± 25.69 % 4896 Race.UTF8 78822.000 ns ± 30.17 % 16341 Program ended with exit code: 0

4:42

The UTF-8 parser is still roughly 3 times faster than the substring parser.

37:22

But now let’s make things a bit more complicated. Let’s introduce a new city that we want to parse: case sanJose

37:33

And let’s add some San José data to our upcomingRaces string. We’ll copy-paste New York data for now to keep things easy: San José, $900 40.60248° N, 74.06433° W 40.61807° N, 74.02966° W 40.64953° N, 74.00929° W 40.67884° N, 73.98198° W 40.69894° N, 73.95701° W 40.72791° N, 73.95314° W 40.74882° N, 73.94221° W 40.75740° N, 73.95309° W 40.76149° N, 73.96142° W 40.77111° N, 73.95362° W 40.80260° N, 73.93061° W 40.80409° N, 73.92893° W 40.81432° N, 73.93292° W 40.80325° N, 73.94472° W 40.77392° N, 73.96917° W 40.77293° N, 73.97671° W ---

37:52

In order to parser this we have to add a new .prefix parser to our city parser: Parser.prefix("San José").map { City.sanJose }

38:01

And similarly to our UTF-8 parser: Parser.prefix("San José"[...].utf8).map { City.sanJose }

38:12

Now when we run things we get a crash because we expect our parser to find 4 races so let’s update the precondition: suite.benchmark("Substring") { var input = upcomingRaces[...] precondition(races.run(&input)?.count == 4) } suite.benchmark("UTF8") { var input = upcomingRaces[...].utf8 precondition(racesUtf8.run(&input)?.count == 4) } running Race: Substring... done! (1798.23 ms) running Race: UTF8... done! (1426.58 ms) name time std iterations -------------------------------------------------- Race.Substring 322557.000 ns ± 25.47 % 3796 Race.UTF8 97515.000 ns ± 27.61 % 12183 Program ended with exit code: 0

38:28

The times got a little slower, and that’s expected since we are doing more work, but still the UTF-8 parser is about 3 times faster than the substring parser.

38:44

Now let’s make things interesting. As we’ve mentioned before, “San José” can be typed out in two seemingly different ways even though visually they look the same. We can use the combining acute character: let upcomingRaces = """ San José, $900 … """

39:17

Visually it looks like nothing changed, and as far as Substring is concerned nothing did change, but let’s run benchmarks just to make sure everything is ok… Thread 1: Swift runtime failure: precondition failure

39:23

The precondition in our UTF-8 benchmark failed. This is because the UTF-8 parser is not correctly parsing “San José”. In the UTF-8 city parser we hard coded one version of “San José”: Parser.prefix("San José"[...].utf8).map { City.sanJose } This is using the “e” with an acute accent, which is technically different from an “e” combined with an acute accent as far as UTF-8 code units is concerned.

39:48

This is another great example of why it’s important to not just benchmark code but to also have some kind of check that verify the code is doing what you expect. Without this precondition we would have assumed that everything works just fine, but we would not actually be measuring what we think we are.

40:16

To make this more correct we would need to also look out for the other kind of accented “e”: Parser.prefix("San José"[...].utf8).map { City.sanJose }, Parser.prefix("San José"[...].utf8).map { City.sanJose } And we just gotta hope we fully understand the ins and outs of UTF-8 normalization so that we can guarantee there aren’t any other combinations we are missing.

40:32

But at least we can now run benchmarks and they both complete: running Race: Substring... done! (1882.06 ms) running Race: UTF8... done! (1413.70 ms) name time std iterations -------------------------------------------------- Race.Substring 356369.000 ns ± 20.98 % 3888 Race.UTF8 100221.000 ns ± 28.35 % 11643 Program ended with exit code: 0

40:38

And the times haven’t changed much. Parsing across abstraction levels

40:47

Fortunately we can improve this situation. I don’t know about you, but I do not have any confidence that I am going to know all of the ins out outs of UTF-8 normalization. I’d rather let Substring pull that weight while I worry about the parsing domain I’m actually familiar with, which is the marathon race format.

41:22

So what I’d like to do is be able to use the city parser that works on Substring s, but somehow push it down to work on the lower abstraction level of UTF8View . In general, can we implement a computed property that allows us to transform any Substring parser into a UTF8View parser? extension Parser where Input == Substring { var utf8: Parser<Substring.UTF8View, Output> { .init { input in } } }

42:09

In this closure we have a UTF8View and self is a parser that works on Substring s. We can easily create a substring from the UTF8View so that we can run the parser on it: var substring = Substring(input) let output = self.run(&substring)

42:39

Which is exactly what we need to return: return output

42:43

At this point the substring value has been mutated a bit, but what we really want to mutate is the input value. To do that we just have to convert substring back to a UTF8View and overwrite input with it: input = substring.utf8

43:01

Some of this code may worry you. It seems like we are creating a whole new Substring , just to mutate it, pluck out the UTF8View and then completely discard the substring. Surely this is going to be inefficient right?

43:14

Well, remember that Substring s and UTF8View s are really just lightweight views into a shared, base storage, so there’s a chance that creating these things is basically free. But let’s not leave this up to conjecture, let’s prove it with a benchmark.

43:27

We are going to change the name of the UTF-8 city parser: private let _city = Parser<Substring.UTF8View, City>.oneOf( Parser.prefix("Berlin"[...].utf8).map { City.berlin }, Parser.prefix("London"[...].utf8).map { City.london }, Parser.prefix("New York City"[...].utf8).map { City.newYork }, Parser.prefix("San José"[...].utf8).map { City.sanJose }, Parser.prefix("San José"[...].utf8).map { City.sanJose } )

43:39

And we are going to remove the private declaration on the substring city parser so that we can use it, instead: let city = Parser<Substring, City>.oneOf( Parser.prefix("Berlin").map { City.berlin }, Parser.prefix("London").map { City.london }, Parser.prefix("New York City").map { City.newYork }, Parser.prefix("San José").map { City.sanJose } )

43:52

Finally we are going to take the substring city parser, convert it to a UTF-8 parser, and then use it in order to build our UTF-8 race parser: private let race = city.utf8 …

44:05

Running benchmarks: running Race: Substring... done! (1717.38 ms) running Race: UTF8... done! (1442.32 ms) name time std iterations -------------------------------------------------- Race.Substring 355127.000 ns ± 27.88 % 3213 Race.UTF8 106166.000 ns ± 29.16 % 10979 Program ended with exit code: 0

44:14

Incredible, the UTF-8 parser only got very slightly slower, about 5% slower, but at least we are now fully UTF-8 compliant. We can now let Substring handle all the messy details of how to normalize “San José” from all of its various forms, while still allowing ourselves to parse everything else as raw UTF-8.

44:32

We can even hyper-localize the placement of the UTF-8 parser by saying that we will parse all of the cities as UTF-8 except for San José: private let city = Parser<Substring.UTF8View, City>.oneOf( Parser.prefix("Berlin"[...].utf8).map { City.berlin }, Parser.prefix("London"[...].utf8).map { City.london }, Parser.prefix("New York City"[...].utf8).map { City.newYork }, Parser.prefix("San José").utf8.map { .sanJose} ) … private let race = _city

45:08

In this case this doesn’t shave much time off our parser, but it’s great to see that we can push the substring parser even deeper.

45:26

We just want to take a moment to take in just how awesome these few lines of code are: private let city = Parser<Substring.UTF8View, City>.oneOf( Parser.prefix("Berlin"[...].utf8).map { City.berlin }, Parser.prefix("London"[...].utf8).map { City.london }, Parser.prefix("New York City"[...].utf8).map { City.newYork }, Parser.prefix("San José").utf8.map { .sanJose} )

45:32

We are mixing and matching parsers that work on two very different abstraction levels. The city parser works on both Substring s and UTF8View s and so benefits from all of the powers of an abstraction that understands UTF-8 normalization, which allows us to correctly parse accents, emojis and more. And then everything else in this file works on the more low level, and more performant, raw UTF-8, where we get no such power but also we don’t need any of that power. All we’re doing is checking for some newlines and comments, and that work is easy to do in raw UTF-8.

46:05

This means we get to stay in the low level UTF-8 parsing for as long as possible, where things are very performant, but when we need to we can travel up to the high level to take advantage of Substring , all the while still leaving ourselves open to plugging everything together into one big parser. Never are we in a situation where our parsers can’t fit together. That’s the power of having a fully generalized parser. Benchmarking an even more complex parser

46:28

Let’s quickly show one more performance benefit to using UTF-8 over substrings. There’s another complex parser we developed in previous episodes, one that can parse the logs spit about by Xcode when running tests so that we can extract out just the test failures and passes, and we even pretty printed them so that it was easier to grok.

46:56

Let’s paste in the original parser .

47:06

And one we’ve updated for UTF-8, along with more generalized versions of prefix(upTo:) and `prefix(through:).

47:58

Let’s also paste in a big set of logs . This is what gets spit out when we run tests on the codebase that powers this very site, which is an open source project on GitHub.

48:21

And finally let’s create a new benchmark: // Benchmark-XcodeLogs.swift import Benchmark let xcodeLogsSuite = BenchmarkSuite(name: "Xcode Logs") { suite in suite.benchmark("Substring") { var input = xcodeLogs[...] precondition(testResults.run(&input)?.count == 283) } } // main.swift import Benchmark Benchmark.main([ // copyingSuite, // intParserSuite, // raceSuite, xcodeLogsSuite, ]) running Xcode Logs: Substring... done! (1678.19 ms) name time std iterations ----------------------------------------------------------- Xcode Logs.Substring 142074164.500 ns ± 16.07 % 10 Program ended with exit code: 0

48:54

So this is quite a bit slower than our other parsers, but it’s also doing a lot more work and on a much bigger input. It takes about 0.14 seconds to run.

49:07

So how does the UTF-8 parser fare? Let’s write a benchmark for it: suite.benchmark("UTF8") { var input = xcodeLogs[...].utf8 precondition(testResultsUtf8.run(&input)?.count == 283) } running Xcode Logs: Substring... done! (1482.43 ms) running Xcode Logs: UTF8... done! (1406.15 ms) name time std iterations ----------------------------------------------------------- Xcode Logs.Substring 144003426.000 ns ± 12.16 % 9 Xcode Logs.UTF8 11921011.000 ns ± 7.92 % 102 Program ended with exit code: 0

49:29

Pretty incredible. More than 10 times faster than the substring parser. This really does show that the performance gains to be had by working on UTF-8 can be truly substantial. This is the difference of being able to process more than 50 megabytes of logs per second, or being able to process a measly 4 megabytes of logs per second. Next time: even more performance?

49:50

So this is pretty huge. We have found that our parser type is so general and so composable that we are not only able to parse at the low UTF-8 level in order to get huge performance gains, but we can also fluidly move between abstraction levels allowing us to choose the right balance of correctness and speed. This is absolutely incredible, and honestly it’s not something we’ve really ever seen in other parsing libraries.

50:15

And as hard as it may be to believe, it gets even better. There is even one more change we can make to our parser library that unlocks the final level of performance, and this will bring the performance of our parsers within a very slim margin of hand-rolled, ad hoc, imperative parsers, which are widely believed to be the fast way to write parsers even if they are a bit messy.

50:41

To see where this last performance gain could be hiding, let’s run one of our benchmarks in the time profile instrument and see what it exposes to us…next time! References swift-benchmark Google • Mar 13, 2020 A Swift library for benchmarking code snippets, similar to google/benchmark. http://github.com/google/swift-benchmark UTF-8 Michael Ilseman • Mar 20, 2019 Swift 5 made a fundamental change to the String API, making the preferred encoding UTF-8 instead of UTF-16. This brings many usability and performance improves to Swift strings. https://swift.org/blog/utf8-string/ Strings in Swift 4 Ole Begemann • Nov 27, 2017 An excerpt from the Advanced Swift that provides a deep discussion of the low-level representations of Swift strings. Although it pre-dates the transition of strings to UTF-8 in Swift 5 it is still a factually correct accounting of how to work with code units in strings. https://oleb.net/blog/2017/11/swift-4-strings/ Improve performance of Collection.removeFirst(_:) where Self == SubSequence Stephen Celis • Jul 28, 2020 While researching the string APIs for this episode we stumbled upon a massive inefficiency in how Swift implements removeFirst on certain collections. This PR fixes the problem and turns the method from an O(n) operation (where n is the length of the array) to an O(k) operation (where k is the number of elements being removed). https://github.com/apple/swift/pull/32451 Downloads Sample code 0128-parsing-performance-pt2 Point-Free A hub for advanced Swift programming. Brought to you by Brandon Williams and Stephen Celis . Content Become a member The Point-Free Way Beta previews Gifts Videos Collections Free clips Blog More About Us Community Slack Mastodon Twitter BlueSky GitHub Contact Us Privacy Policy © 2026 Point-Free, Inc. All rights are reserved for the videos and transcripts on this site. All other content is licensed under CC BY-NC-SA 4.0 , and the underlying source code to run this site is licensed under the MIT License .