Saturday, August 13, 2016

Load balancing: Rancher vs Swarm

Rancher has a load balancer built in (HAProxy). Let's compare its performance with that of Docker Swarm's. I will use 3 identical nodes:

  • 192GB RAM
  • 28-core i5 Xeon
  • 1GBit LAN
  • CentOS 7
  • Docker 1.12
  • Rancher 1.1.2
I will benchmark against a hello world HTTP server written in Scala with Akka-HTTP and Spray JSON serialization (I don't think it matters though); the sources are on GitHub. I will use the Apache AB benchmark tool.

As a baseline, I exposed the web server port outside the container and ran the following command:

ab -n 100000 -c 20 -k

It shows 22400 requests per second. I'm not sure whether that's a great result for Akka-HTTP, considering that some services written in C can handle hundreds of thousands of requests per second, but that's not the main topic of this blog post. (I also ran the test with 100 concurrent connections (-c 100), and it shows ~50k req/sec. I don't know if this number is good enough either :) )

Now I created a Rancher so-called "stack" containing our web server and a built-in load balancer:

Now I run the same benchmark against the load balancer, increasing the number of akka-http containers one by one:

Containers Req/sec
1 755
2 1490
3 2200
4 3110
5 3990
6 4560
7 4745
8 4828

OK, it looks like HAProxy introduces a lot of overhead. Let's see how well Swarm's internal load balancer handles the same load. After initializing Swarm on all nodes, create a service:

docker service create --name akka-http --publish 32020:29001/tcp private-repository:5000/akkahttp1:1.11

Check that the container is running:

$ docker service ps akka-http
0yb99vo9btmx3t1wluvd0fgo6  akka-http.1  xxxx  Running        Running 3 minutes ago

OK, great. Now I will scale our service up one container at a time, running the same benchmark as I go:

docker service scale akka-http=2
docker service scale akka-http=3
docker service scale akka-http=8


Containers Req/sec
1 19200
2 19700
3 18700
4 18700
5 18300
6 18800
7 17900
8 18300

Much better! The Swarm LB does introduce a little overhead, but it's totally tolerable. What's more, if I run the benchmark against the node where the single container is running, the Swarm LB shows exactly the same performance as the directly exposed web server (22400 req/sec in my case).
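To put numbers on the overhead, the throughput figures above can be expressed as a share of the 22400 req/sec baseline. A small Rust sketch (the function name `percent_of_baseline` is made up for illustration; the figures are copied from the two tables above):

```rust
/// Throughput as a percentage of the directly exposed baseline.
fn percent_of_baseline(req_per_sec: f64, baseline: f64) -> f64 {
    req_per_sec / baseline * 100.0
}

fn main() {
    // Directly exposed web server, ab with -c 20.
    let baseline = 22_400.0;
    // (label, requests per second), taken from the tables above.
    let results = [
        ("Rancher, 1 container", 755.0),
        ("Rancher, 8 containers", 4_828.0),
        ("Swarm, 1 container", 19_200.0),
        ("Swarm, 8 containers", 18_300.0),
    ];
    for &(label, rps) in &results {
        println!("{}: {:.1}% of baseline", label, percent_of_baseline(rps, baseline));
    }
}
```

Even with 8 containers the Rancher load balancer stays well under a quarter of the baseline, while Swarm stays above 80% at every container count measured.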

To make this blog post less boring, I added a nice picture :)

I also ran JMeter (8 parallel threads). Direct: 6700 r/s, Swarm: 6500 (1 container) - 5600 (8 containers), Rancher: 835 (1 container) - 2400 (8 containers), which is roughly in line with the AB results.

Saturday, June 11, 2016

Running computational intensive code outside of Hopac scheduler

Hopac uses a bounded pool of worker threads whose size equals the number of CPU cores (by default). The danger of this design is that all the threads can end up busy doing CPU-intensive work, so no other Hopac jobs can proceed. A good solution is to run such CPU-bound computations on the standard .NET thread pool, freeing the Hopac pool for more intelligent work. I found a nice piece of code in one of the older Hopac GitHub discussions which schedules an ordinary function on the ThreadPool and represents the result as a Hopac job.

Here is a test with explanations:

Friday, May 27, 2016

Upcoming F# struct tuples: are they always faster?

Don Syme has been working on struct tuples for the F# language. Let's see if they are more performant than "old" (heap-allocated) tuples in a simple scenario: returning a tuple from a function. The code is very simple:

Decompiled code in Release configuration:

All we need to change to switch to struct tuples is to add the "struct" keyword in front of the constructor and the pattern match:

Decompiled code in Release configuration:

I don't know about you, but I was surprised by those results. The performance is roughly the same. GC is not a bottleneck, as no objects were promoted to generation 1.


  • Using struct tuples as a faster or "GC-friendly" alternative to return multiple values from functions does not make sense.
  • Building in release mode erases heap-allocated tuples, but not struct tuples.
  • Building in release mode inlines the "foo" function, which makes the code 10x faster.
  • You can fearlessly allocate tens of millions of short-lived objects per second; performance will be great.

Sunday, May 22, 2016

Hash maps: Rust, F#, D, Go, Scala

Let's compare performance of hashmap implementation in Rust, .NET, D (LDC) and Go.
As you can see, Rust is about 17% slower on insertions and about 21% slower on lookups.


As @GolDDranks suggested on Twitter, since Rust 1.7 it's possible to use custom hashers in HashMap. Let's try it:
Yes, it's significantly faster: insertions are only 5% slower than in the .NET implementation, and lookups are 32% *faster*! Great.
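For readers who have not plugged a custom hasher into `HashMap` before, here is a minimal sketch using only the standard library. Which hasher was used in the benchmark is not stated here; the 64-bit FNV-1a below is an assumption chosen because it is short, and the names `Fnv1a`, `FnvHashMap` and `build_map` are made up for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// 64-bit FNV-1a: a simple, fast, non-cryptographic hash (assumed for illustration).
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A HashMap that hashes keys with Fnv1a instead of the default SipHash.
type FnvHashMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Fills a map with n entries, mapping each key i to i * 2.
fn build_map(n: u64) -> FnvHashMap<u64, u64> {
    let mut map = FnvHashMap::default();
    for i in 0..n {
        map.insert(i, i * 2);
    }
    map
}

fn main() {
    let map = build_map(1_000);
    println!("{:?}", map.get(&21));
}
```

The default hasher is designed to resist hash-flooding attacks, which costs speed on small keys such as integers; swapping it out trades that protection for throughput.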

Update: D added

LDC x64 on windows
It's very slow at insertions and quite fast on lookups.

Update: Go added

Update: Scala added

Compared to Scala, all the other languages look equally fast :) What's worse, Scala loaded all four CPU cores at almost 100% during the test, while the others used roughly a single core. My guess is that the JVM allocated so many objects (each Int is an object, BTW) that 3/4 of the CPU time was spent on garbage collection. However, I'm a Scala/JVM noob, so I could simply have written the whole benchmark the wrong way. Scala developers, please review the code and explain why it's so slow (full IDEA/SBT project is here). Thanks!

Wednesday, May 4, 2016

Akka.NET Streams vs Hopac vs AsyncSeq

Akka.NET Streams is a port of its Scala/Java counterpart, intended to execute complex data processing graphs, optionally in parallel and even distributed. Its semantics are quite different from Hopac's and it's wrong to compare them feature by feature, but it's still interesting to benchmark them in a scenario which both of them support well: read the lines of a file asynchronously, filter them by a regex with a controlled degree of parallelism, then normalize the lines with a simple string manipulation algorithm, also in parallel, then count the number of lines.

First, Akka.NET:

Note that I have to use the empty string to indicate that the regular expression does not match. I should use `option` of course (just like I do in the Hopac snippet below), but Akka.NET Streams is strict about what its combinators like `Map` or `Filter` are allowed to return. In particular, you cannot return `null`: doing so makes Akka.NET unhappy and it will throw an exception at you. In F#, expressions like `fun x -> printfn "%O" x` and `fun x -> None` return `()` and `None` respectively, both of which are represented as `null` at runtime, so you have to be very careful when `Map`ping and `Filter`ing (and using any of the combinators, actually) with side-effecting functions or functions returning `Option`s (just do not do either).

Now, Hopac:

And finally AsyncSeq:

Number of allocations is roughly identical for Hopac and Akka, but it's an order of magnitude larger for AsyncSeq.


  • Use Hopac if you need the best performance available on .NET, or if you need to implement arbitrarily complex concurrent scenarios.
  • Akka.NET is quite fast and has a full blown graph definition DSL, so it's great for implementing complex stream processing, which can run on a cluster of nodes. However, it has a typical "fluent" C#-targeted API, so it's necessary to write a thin layer over it in order to make it usable from F#.
  • AsyncSeq has the most "F# friendly" API - it's just a combination of two computation expressions which every F# programmer knows: Async and Seq.

Update 6 May, 2016

Marc Piechura suggested a way to exclude materialization phase from the benchmark, here is the modified code:

It turns out it takes Akka.NET about 3 seconds to materialize the graph.

Thanks to Vesa Karvonen for help with fixing the Hopac version and to Lev Gorodinski for fixing AsyncSeq performance (initially it was awfully slow).

Thursday, September 24, 2015

Regular expressions: Rust vs F# vs Scala

Let's implement the following task: read first 10M lines from a text file of the following format:

then find all lines containing Microsoft namespace in them, and format the type names the usual way, like "Microsoft.Win32.IAssemblyEnum".

First, F#:

Now Rust:

After several launches the file was cached by the OS and both implementations became non IO-bound. F# one took 29 seconds and 31MB of RAM at peak; Rust - 11 seconds and 18MB.

The Rust code is twice as long as the F# one, but it handles all possible errors explicitly - no surprises at runtime at all. The F# code may throw some exceptions (who knows what kind? Nobody). It's possible to wrap all calls to the .NET framework with `Choice.attempt (fun _ -> ...)`, then define custom Error types for the regex related code, for the IO code and a top-level one, but the code would be even longer than Rust's, hard to read, and it would still give no guarantee that we catch all possible exceptions.

Update 4 Jan 2016: Scala added:

Ok, it turns out that regex performance may depend on whether it's case sensitive or not. What's worse, I tested F# with case insensitive pattern, but Rust - for case sensitive. Anyway, as I've upgraded my machine recently (i5-750 => i7-4790K), I've rerun F# and Rust versions in both the regex modes and added Scala to the mix. First, case sensitive mode:
  • F# (F# 4.0, .NET 4.6.1) - 4.8 secs
  • Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.5 secs
  • Rust (1.7.0-nightly (bfb4212ee 2016-01-01)) - 5.9 secs
Now, case insensitive:
  • F# (F# 4.0, .NET 4.6.1) - 15.5 secs
  • Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.2 secs
  • Rust (1.7.0-nightly (bfb4212ee 2016-01-01)) - 6.1 secs

Although case sensitive patterns perform roughly the same on all the platforms, it's quite surprising that Rust is not the winner.

Scala is faster in case insensitive mode (?), Rust is slightly slower, and now the question: what's wrong with the .NET implementation?.. It is more than 3 times slower than in case sensitive mode and than the others.

Update 4 Jan 2016: D added.

  • regex - 10.6 s (DMD), 7.8 s (LDC)
  • ctRegex! - 6.9 s (DMD), 6.6 s (LDC)

Update 6 Jan 2016: Elixir added:

It takes 56 seconds to finish.

Update 6 Jan 2016: Haskell added:

It takes 20 seconds.

Update 7 Jan 2016: Nemerle added:

It takes 3.8 seconds (case sensitive) and 7.1 seconds (case insensitive).

Update 8 Jan 2016: Nemerle PEG added:

It takes 4.1 seconds.

All results so far (seconds):

              Case sensitive   Case insensitive
F#                  4.80            15.50
Scala               3.50             3.20
Rust                5.90             6.10
DMD                 6.90
LDC                 6.60
Elixir             56.00
Haskell            20.00
Nemerle             3.80             7.10
Nemerle PEG         4.10             4.20

Saturday, July 18, 2015

Elixir: first look

I don't have a clear impression of the Elixir language yet. I don't like that it has Ruby-like syntax, but I do like that it has the pipe operator and macros. So, Fibonacci:

It executes in about 13 seconds, which is on par with Erlang (even faster, for an unknown reason), no surprises here.

  • D (GDC) - 0.990
  • C# - 1.26
  • D (DMD) - 1.3
  • C++ - 1.33
  • F# - 1.38
  • Nemerle - 1.45
  • Rust - 1.66
  • Go - 2.38
  • Haskell - 2.8
  • Clojure - 9
  • Elixir - 13
  • Erlang - 17
  • Ruby - 60
  • Python - 120
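The code behind these numbers is not reproduced above. Purely as an illustration, and assuming the benchmark is the classic naive recursive Fibonacci (the exact workload and input are not stated here), a Rust version might look like the sketch below. The input 30 is an arbitrary choice, so the timing it prints is not comparable with the list above.

```rust
use std::time::Instant;

/// Naive doubly recursive Fibonacci. Deliberately slow: it mostly
/// measures function call overhead, which is the point of such benchmarks.
fn fib(n: u32) -> u64 {
    if n < 2 {
        u64::from(n)
    } else {
        fib(n - 1) + fib(n - 2)
    }
}

fn main() {
    let started = Instant::now();
    let result = fib(30); // arbitrary input, chosen only for illustration
    println!("fib(30) = {}, elapsed {:?}", result, started.elapsed());
}
```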

Monday, June 22, 2015

SHA1 compile time checked literals: F# vs Nemerle vs D

I've always been interested in metaprogramming. Sooner or later, I start to feel constrained within a language without it. F# is a really nice language, but I'm afraid I'd have got bored with it if it didn't have Type Providers, for example. Why is metaprogramming so important? Because it allows changing a language without cracking the compiler. It allows making things which seemed impossible to implement.

I deal with cryptographic hashes a lot at work, nothing rocket science, just MD5, SHA-1 and so on. And I write tons of tests where such hashes are used in the form of string literals, like this:

The problem with this code is that the compiler cannot guarantee that the hex string in the last line represents a valid SHA-1. If it does not, the test will fail at runtime for a reason it's not intended to.

OK, now we can formulate our task: provide a language construct that enforces, at compile time, that a string literal is a valid SHA-1 hexadecimal string. We will explore how much work is required to implement such a simple feature in F#, Nemerle and D. It's also interesting how good the development workflow is for each of these languages - IDE integration, error reporting and testing cycle.
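To make the task concrete before looking at the three languages, here is what such a check can look like in Rust, which is not part of this comparison. This is only a sketch: `is_sha1_hex` and `sha1_hex!` are hypothetical names, and unlike the solutions discussed below it only validates the literal and returns it as a string, it does not produce a byte array.

```rust
/// Returns true when `s` is exactly 40 hexadecimal characters,
/// i.e. the textual form of a 20-byte SHA-1 digest.
const fn is_sha1_hex(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 40 {
        return false;
    }
    let mut i = 0;
    while i < bytes.len() {
        if !matches!(bytes[i], b'0'..=b'9' | b'a'..=b'f' | b'A'..=b'F') {
            return false;
        }
        i += 1;
    }
    true
}

/// Hypothetical macro: the build fails if the literal is not valid SHA-1 hex,
/// because the assertion is evaluated while compiling the unnamed constant.
macro_rules! sha1_hex {
    ($lit:literal) => {{
        const _: () = assert!(is_sha1_hex($lit));
        $lit
    }};
}

fn main() {
    // SHA-1 of the empty input.
    let digest: &str = sha1_hex!("da39a3ee5e6b4b0d3255bfef95601890afd80709");
    println!("{}", digest);
}
```

Changing a single character of the literal to a non-hex one, or dropping one, turns the assertion into a compilation error instead of a failing test.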


Using Type Providers is the only way to check (at compile time) that a string is a valid hex one and that its length is exactly 40 characters (SHA-1 is a 20-byte hash). Actually, I've written this type provider before. The interesting part looks like this:

It includes caching, and the `HexParser` module is not shown, but those details are not important here. It's simple and it generates a Value property which directly returns a byte array created at compile time.

Error reporting:


Nemerle has full-fledged macros, which are strictly more powerful than F#'s Type Providers. Let's see if they allow solving the task in an elegant way:

Error reporting:


The code does not use any unusual stuff and does not manipulate the AST. Just plain D code. Very elegant. Note that the template is defined in the same file as its usage. Contrast this with F# and Nemerle, where you have to place your Type Provider / macros into a dedicated assembly.

Error reporting:

The error is located in the template itself, not at the instantiation point though.


I added 1000 usages of the TP, macro and template and measured compilation time.

  • F# - 5 seconds
  • Nemerle - 2 seconds
  • D - the compiler crashes with "Error: out of memory" after 1 minute of work.

Saturday, June 20, 2015

Fib: C++, C# and GDC

As a reference implementation, I added C++ one:

Its execution time is 1.33 seconds, which surprisingly is not the best result so far.
A C# version:

Also, I compiled this D code with GDC compiler and it executed in 990 ms, which is the best result:

  • D (GDC) - 0.990
  • C# - 1.26
  • D (DMD) - 1.3
  • C++ - 1.33
  • F# - 1.38
  • Nemerle - 1.45
  • Rust - 1.66
  • Go - 2.38
  • Haskell - 2.8
  • Clojure - 9
  • Erlang - 17
  • Ruby - 60
  • Python - 120

Unfortunately, I have not managed to compile the D code with LDC compiler, it returns the following error:

Building: DFib (Release)
Performing main compilation...
Current dictionary: d:\git\DFib\DFib
D:\ldc2-0.15.2-beta1-win64-msvc\bin\ldc2.exe -O3 -release "main.d"   "-od=obj\Release" "-of=d:\git\DFib\DFib\bin\Release\DFib.exe"
LINK : fatal error LNK1181: cannot open input file 'kernel32.lib'
Error: C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\link.exe failed with status: 1181
Exit code 1181

Saturday, May 16, 2015

Composing custom error types in F#

I strongly believe that we should keep code as referentially transparent as possible. Unfortunately, the F# language does not encourage programmers to use the Either monad to deal with errors. The common practice in the community is the exception-based approach that is common in the rest of the (imperative) .NET world. In my experience, almost all bugs found in production are caused by unhandled exceptions.

The problem

In our project we've used the Either monad for error handling with great success for about two years. ExtCore is a great library which makes dealing with Either, Reader, State and other monads and their combinations really easy. Consider typical error handling code, which makes use of the Choice computation expression from ExtCore:

The code is a bit hairy because of explicit error mapping. We could introduce an operator as a synonym for Choice.mapError, like <!>, after which the code could become a bit cleaner:

(actually, this is the approach we use in our team).

Rust composable errors

I was completely happy until today, when I read the Error Handling in Rust article and found out how elegantly errors are composed using the From trait. By implementing it for an error type, you let the try! macro convert lower-level errors to it automatically, which eliminates error mapping completely. I encourage you to read that article because it explains good error handling in general and is totally applicable to F#.
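To make the mechanism concrete, here is a minimal Rust sketch using only the standard library. The error types and the `read_setting` / `parse_port` functions are hypothetical, invented for illustration. It uses the `?` operator, the successor of the `try!` macro, which performs the same `From` conversion.

```rust
use std::num::ParseIntError;

/// Hypothetical low-level error from a configuration lookup.
#[derive(Debug)]
enum ConfigError {
    Missing(String),
}

/// Higher-level error with one case per lower-level error it can absorb.
#[derive(Debug)]
enum AppError {
    Config(ConfigError),
    Parse(ParseIntError),
}

impl From<ConfigError> for AppError {
    fn from(e: ConfigError) -> Self {
        AppError::Config(e)
    }
}

impl From<ParseIntError> for AppError {
    fn from(e: ParseIntError) -> Self {
        AppError::Parse(e)
    }
}

/// Stand-in for a real lookup: knows two settings, fails on anything else.
fn read_setting(name: &str) -> Result<String, ConfigError> {
    match name {
        "port" => Ok("8080".to_string()),
        "broken" => Ok("not-a-number".to_string()),
        other => Err(ConfigError::Missing(other.to_string())),
    }
}

/// No explicit error mapping: `?` converts each error via `From`.
fn parse_port(name: &str) -> Result<u16, AppError> {
    let raw = read_setting(name)?; // ConfigError -> AppError
    let port = raw.parse::<u16>()?; // ParseIntError -> AppError
    Ok(port)
}

fn main() {
    println!("{:?}", parse_port("port"));
    println!("{:?}", parse_port("broken"));
    println!("{:?}", parse_port("missing"));
}
```

The body of `parse_port` mentions no concrete error case at all; the `From` implementations carry the whole mapping.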

Porting to F#

Unfortunately, there's no static interface implementation in either F# or .NET, so we cannot just introduce IError with a static member From: 'a -> 'this, like we can in Rust. But in F# we can use statically resolved type parameters to get the result we need. The idea is that each "higher level" error type defines a bunch of static methods, each of which converts some lower level error type to one of the error type cases:

Now we can write a generic function which can create any higher level error type, which defines From methods:

Now we can rewrite our processFile function without explicit mapping to concrete error cases:

Great. But it's still not as clean as it could be. The remaining bit is to modify the Choice computation expression builder so that it does the same implicit conversion in its Bind method (it's ChoiceBuilder from ExtCore as is, but without the For and While methods):

The CE now requires all errors to be convertible to its main error type, including the error type itself, so we have to add one more From static method to the Error type, and we can finally remove all the noise from our processFile function:

Monday, May 4, 2015

Go: fib

Go code is relatively low-level since it does not have "foreach over range" syntax construct:

The results are not that impressive for a systems language: 2.38 seconds. It ranks below Rust but above Haskell:
  • C# - 1.26
  • D (DMD) - 1.3
  • F# - 1.38
  • Nemerle - 1.45
  • Rust - 1.66
  • Go - 2.38
  • Haskell - 2.8
  • Clojure - 9
  • Erlang - 17
  • Ruby - 60
  • Python - 120

Saturday, April 11, 2015

Computing cryptography hashes: Rust, F#, D and Scala

Let's compare how fast Rust, D and F# (.NET actually) are at computing cryptographic hashes, namely MD5, SHA1, SHA256 and SHA512. We're going to use the rust-crypto crate:

  • MD5 - 3.39s 
  • SHA1 - 2.89s 
  • SHA256 - 6.97s
  • SHA512 - 4.47s

Now the F# code:

Results (.NET 4.5, VS 2013, F# 3.1):
  • MD5CryptoServiceProvider - 2.32s (32% faster)
  • SHA1CryptoServiceProvider - 2.92s (1% slower)
  • SHA256Managed - 16.50s (236% slower)
  • SHA256CryptoServiceProvider - 11.50s (164% slower)
  • SHA256Cng - 11.71s (168% slower)
  • SHA512Managed - 61.04s (1365% slower)
  • SHA512CryptoServiceProvider - 21.88s (489% slower)
  • SHA512Cng - 22.19s (496% slower)
(.NET 4.6, VS 2015, F# 4.0):

  • MD5CryptoServiceProvider - 2.55s
  • SHA1CryptoServiceProvider - 2.89s
  • SHA256Managed - 17.01s
  • SHA256CryptoServiceProvider - 8.74s
  • SHA256Cng - 8.75s
  • SHA512Managed - 23.42s
  • SHA512CryptoServiceProvider - 5.81s
  • SHA512Cng - 5.79s


  • MD5 - 16.05s (470% slower)
  • SHA1 - 2.35s (19% faster)
  • SHA256 - 47.96s (690% slower (!))
  • SHA512 - 61.47s (1375% slower (!))

  • MD5 - 2.18s (55% faster)
  • SHA1 - 2.88s (same)
  • SHA256 - 6.79s (3% faster)
  • SHA512 - 4.6s (3% slower)

  • MD5 - 2.43s (29% faster)
  • SHA1 - 2.84s (2% faster)
  • SHA256 - 12.62s (45% slower)
  • SHA512 - 8.56s (48% slower)


  • MD5 - 4.2s (23% slower)
  • SHA1 - 6.09s (110% slower)
  • SHA256 - 9.96s (42% slower)
  • SHA512 - 7.32s (63% slower)
Interesting things:

  • Rust and D (LDC2) show very close results. D is significantly faster on MD5, so it's the winner!
  • D (DMD) has very bad performance on all algorithms except SHA1, where it wins.
  • SHA512Managed .NET class is very slow. Do not use it.