## Binary comprehensions in Erlang

In set theory there is a convenient notation for defining new sets, called set comprehension. For example when we have a set $A$ we can define a collection of singleton sets with elements from the set $A$ as $\{ \{x\}:x\in A\}$. Sometimes vertical bar is used instead of the colon, and in Isabelle/ZF a single dot is used (something like $\{ \{x\}. x\in A\}$ parses successfully in Isabelle/ZF). In some programming languages a similar notation is used for lists. For example in Python one can write

[ [x] for x in [1,2,3] ]


to get a list of singleton lists and in Haskell

[ [x] | x <- [1,2,3] ]

gives the same.

Erlang also has a syntax for list comprehension, very similar to Haskell’s:

[ [X] || X <- [1,2,3] ]


The part X <- [1,2,3] above is called the generator expression.

Erlang has also something unique (as far as I know): binary comprehensions. This is a concept similar to the list comprehensions, but the dummy variable bound by the notation (x in the examples above) can range over binaries rather than lists. I found this very convenient when I was implementing Erlang interface to the  KDB+ IPC protocol.

The documentation for Erlang binary comprehensions is rather sparse. There is a chapter on it in the Learn you some Erlang book, but it’s rather short. The author sends the reader to a white paper by Per Gustafsson and Konstantinos Sagonas for details. The problem is that the white paper describes a proposal for syntax rather than what was actually implemented. As a result none of the examples of the “Examples of binary comprehensions” section in that paper are parsed by the Erlang version I am using ( erlang:system_info(version) returns “6.1” for me). This post aims at providing some examples of usage of binary comprehensions, in addition to the ones provided in the Learn you some Erlang book.

## Lists from binaries

The simplest example is

1> [ X || <<X>> <= <<1,2,3>> ].
[1,2,3]


This is the same as what

binary:bin_to_list(<<1,2,3>>).


does.

The  <<X>>  syntax always goes together with the  <= , so if you use the  <=  token to indicate you want to range over a binary, you always need to surround your dummy variable (or a pattern, see below) with  <<>> .

Now suppose we want to interpret a binary as representing a list of 4-byte integers, written in big endian. Then, instead of the variable X we write a pattern for matching a single integer.

2> [ X || <<X:32/big-signed-integer>> <= <<0,0,0,1,0,0,0,2>>].
[1,2]


If the (bit) length the of binary is not a multiple of the length of the pattern, the largest possible multiple is taken

3> [ X || <<X:32/big-signed-integer>> <= <<0,0,0,1,0,0,0,2,0,0>>].
[1,2]


An interesting thing about the bit syntax in Erlang is that the size in the bit segment specification does not have to be a literal constant – it can be a variable or an expression. So for example one can write a function for interpreting a binary as a sequence of integers of different lengths. Below we parse the same binary as sequence of 32-bit integers or as a sequence of 64-bit integers using the same function.

4> F = fun(Length,Bin) -> [ X || <<X:Length/big-signed-integer>> <= Bin ] end.
#Fun<erl_eval.12.90072148>
5> F(32,<<0,0,0,0,0,0,0,5>>).
[0,5]
6> F(64,<<0,0,0,0,0,0,0,5>>).



It seems that none of the other elements of the bit segment specification can treated this way. For example we can not specify endianess in a variable

7> G = fun(Endianess,Bin) -> [ X || <<X:32/(Endianess)-signed-integer>> <= Bin ] end.
* 1: syntax error before: '('


We can use any segment with known size to specify the dummy variable(s) in the generator. For example suppose we have a binary that we want to interpret as a sequence of 5-byte structures. Each structure starts with a byte we don’t care about and the rest is a 4-byte integer. We can write the redundant byte in the pattern as an underscore:

7> [ X || <<_,X:32/big-signed-integer>> <= <<5,0,0,0,1,6,0,0,0,2>> ].
[1,2]


If we do need that byte we can bind a variable to it and return a list of pairs rather than discard those additional bytes:

8> [ {X,C} || <<C,X:32/big-signed-integer>> <= <<5,0,0,0,1,6,0,0,0,2>> ].
[{1,5},{2,6}]


## Binaries from lists

The simplest example of this case is

9> << <<X>> || X <- [1,2,3] >>.
<<1,2,3>>


This is the same as calling binary:list_to_bin([1,2,3]). Note that if we use the binary delimiters (<<>>) to indicate a comprehension that results in a binary, then we have to use those delimiters around the X variable (or the expression we put there) as well. This is similar to the situation where we have to surround <<X>> whenever we indicate with <= that we want our variable to range over a binary. To me this syntax seems a bit noisy since we have to repeat the same information twice.
Suppose we want to convert a list of integers to a binary that contains each of them in the 4-byte, little endian format. We can write this as follows:

10> << <<X:32/little-signed-integer>> || X <- [1,2,3] >>.
<<1,0,0,0,2,0,0,0,3,0,0,0>>


One can use this type of binary comprehensions to concatenate a list of binaries:

11> << <<X/binary>> || X <- [<<1,2>>,<<3>>] >>.
<<1,2,3>>


We can use pattern matching in the generator (just as in the list comprehensions) to define the dummy variables. For example if we have a list of pairs of integers and we want to encode this list by encoding each pair as 8-byte binary with 4-byte encodings of the elements of the pair, then we can do this:

12> << <<A:32/big-signed-integer,B:32/big-signed-integer>> || {A,B} <- [{1,2},{3,4}] >>.
<<0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4>>


Now for a little bit more complicated example suppose we have a list of binaries representing some packets. We want to discard packets that start from the 0 byte and swap the first two bytes in the remaining ones, then concatenate the results and get one binary. Using binary comprehensions we can do this as follows:

13> Packets = [<<1,2,3,4>>,<<0,1,2>>,<<5,6,7>>].
[<<1,2,3,4>>,<<0,1,2>>,<<5,6,7>>]
14> << <<B1,B0,Rest/binary>> || <<B0,B1,Rest/binary>> <- Packets, B0 =/= 0>>.
<<2,1,3,4,6,5,7>>


Here the <<B0,B1,Rest/binary>> part defines variables B0,B1,Rest (integer,integer,binary, resp.) and pattern matches to the first byte, second byte and the rest for each packet. The expression B0 =/= 0 acts as a filter discarding the packets with the first byte zero. The <<B1,B0,Rest/binary>> expression does the swapping.
Another way to do the same thing is to write

15> << <<B1/binary,B0/binary,Rest/binary>> || <<B0:1/binary,B1:1/binary,Rest/binary>> <- Packets, B0 =/= <<0>> >>.
<<2,1,3,4,6,5,7>>


Here the variables B0 and B1 are bound to 1-byte binaries rather than integers, so (maybe) we avoid converting bytes to integers. I am curious if this is faster.

## Binaries from binaries

The last case is where the generator ranges over a binary and the result is also concatenated into a binary. The simplest example is just the identity:

16> << <<X>> || <<X>> <= <<1,2,3>> >>.
<<1,2,3>>


Here is how we can convert encoding of 4-byte little endian integers into 8-byte big endian integers

17> << <<X:64/big-signed-integer>> || <<X:32/little-signed-integer>> <= <<1,0,0,0,2,0,0,0>> >>.
<<0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2>>