Binary comprehensions in Erlang

In set theory there is a convenient notation for defining new sets, called set comprehension. For example when we have a set A we can define a collection of singleton sets with elements from the set  A as \{ \{x\}:x\in A\}. Sometimes vertical bar is used instead of the colon, and in Isabelle/ZF a single dot is used (something like \{ \{x\}. x\in A\} parses successfully in Isabelle/ZF). In some programming languages a similar notation is used for lists. For example in Python one can write

[ [x] for x in [1,2,3] ]

to get a list of singleton lists and in Haskell

[ [x] | x <- [1,2,3] ]

gives the same.

Erlang also has a syntax for list comprehension, very similar to Haskell’s:

[ [X] || X <- [1,2,3] ]

The part X <- [1,2,3] above is called the generator expression.

Erlang has also something unique (as far as I know): binary comprehensions. This is a concept similar to the list comprehensions, but the dummy variable bound by the notation (x in the examples above) can range over binaries rather than lists. I found this very convenient when I was implementing Erlang interface to the  KDB+ IPC protocol.

The documentation for Erlang binary comprehensions is rather sparse. There is a chapter on it in the Learn you some Erlang book, but it’s rather short. The author sends the reader to a white paper by Per Gustafsson and Konstantinos Sagonas for details. The problem is that the white paper describes a proposal for syntax rather than what was actually implemented. As a result none of the examples of the “Examples of binary comprehensions” section in that paper are parsed by the Erlang version I am using ( erlang:system_info(version) returns “6.1” for me). This post aims at providing some examples of usage of binary comprehensions, in addition to the ones provided in the Learn you some Erlang book.

Lists from binaries

The simplest example is

1> [ X || <<X>> <= <<1,2,3>> ].
[1,2,3]

This is the same as what

binary:bin_to_list(<<1,2,3>>).

does.

The <<X>> syntax always goes together with the <= , so if you use the <= token to indicate you want to range over a binary, you always need to surround your dummy variable (or a pattern, see below) with <<>> .

Now suppose we want to interpret a binary as representing a list of 4-byte integers, written in big endian. Then, instead of the variable X we write a pattern for matching a single integer.

2> [ X || <<X:32/big-signed-integer>> <= <<0,0,0,1,0,0,0,2>>].
[1,2]

If the (bit) length the of binary is not a multiple of the length of the pattern, the largest possible multiple is taken

3> [ X || <<X:32/big-signed-integer>> <= <<0,0,0,1,0,0,0,2,0,0>>].
[1,2]

An interesting thing about the bit syntax in Erlang is that the size in the bit segment specification does not have to be a literal constant – it can be a variable or an expression. So for example one can write a function for interpreting a binary as a sequence of integers of different lengths. Below we parse the same binary as sequence of 32-bit integers or as a sequence of 64-bit integers using the same function.

4> F = fun(Length,Bin) -> [ X || <<X:Length/big-signed-integer>> <= Bin ] end.
#Fun<erl_eval.12.90072148>
5> F(32,<<0,0,0,0,0,0,0,5>>).
[0,5]
6> F(64,<<0,0,0,0,0,0,0,5>>).
[5]

It seems that none of the other elements of the bit segment specification can treated this way. For example we can not specify endianess in a variable

7> G = fun(Endianess,Bin) -> [ X || <<X:32/(Endianess)-signed-integer>> <= Bin ] end.
* 1: syntax error before: '('

We can use any segment with known size to specify the dummy variable(s) in the generator. For example suppose we have a binary that we want to interpret as a sequence of 5-byte structures. Each structure starts with a byte we don’t care about and the rest is a 4-byte integer. We can write the redundant byte in the pattern as an underscore:

7> [ X || <<_,X:32/big-signed-integer>> <= <<5,0,0,0,1,6,0,0,0,2>> ].
[1,2]

If we do need that byte we can bind a variable to it and return a list of pairs rather than discard those additional bytes:

8> [ {X,C} || <<C,X:32/big-signed-integer>> <= <<5,0,0,0,1,6,0,0,0,2>> ].
[{1,5},{2,6}]

Binaries from lists

The simplest example of this case is

9> << <<X>> || X <- [1,2,3] >>.
<<1,2,3>>

This is the same as calling binary:list_to_bin([1,2,3]). Note that if we use the binary delimiters (<<>>) to indicate a comprehension that results in a binary, then we have to use those delimiters around the X variable (or the expression we put there) as well. This is similar to the situation where we have to surround <<X>> whenever we indicate with <= that we want our variable to range over a binary. To me this syntax seems a bit noisy since we have to repeat the same information twice.
Suppose we want to convert a list of integers to a binary that contains each of them in the 4-byte, little endian format. We can write this as follows:

10> << <<X:32/little-signed-integer>> || X <- [1,2,3] >>.
<<1,0,0,0,2,0,0,0,3,0,0,0>>

One can use this type of binary comprehensions to concatenate a list of binaries:

11> << <<X/binary>> || X <- [<<1,2>>,<<3>>] >>.
<<1,2,3>>

We can use pattern matching in the generator (just as in the list comprehensions) to define the dummy variables. For example if we have a list of pairs of integers and we want to encode this list by encoding each pair as 8-byte binary with 4-byte encodings of the elements of the pair, then we can do this:

12> << <<A:32/big-signed-integer,B:32/big-signed-integer>> || {A,B} <- [{1,2},{3,4}] >>.
<<0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4>>

Now for a little bit more complicated example suppose we have a list of binaries representing some packets. We want to discard packets that start from the 0 byte and swap the first two bytes in the remaining ones, then concatenate the results and get one binary. Using binary comprehensions we can do this as follows:

13> Packets = [<<1,2,3,4>>,<<0,1,2>>,<<5,6,7>>].
[<<1,2,3,4>>,<<0,1,2>>,<<5,6,7>>]
14> << <<B1,B0,Rest/binary>> || <<B0,B1,Rest/binary>> <- Packets, B0 =/= 0>>.
<<2,1,3,4,6,5,7>>

Here the <<B0,B1,Rest/binary>> part defines variables B0,B1,Rest (integer,integer,binary, resp.) and pattern matches to the first byte, second byte and the rest for each packet. The expression B0 =/= 0 acts as a filter discarding the packets with the first byte zero. The <<B1,B0,Rest/binary>> expression does the swapping.
Another way to do the same thing is to write

15> << <<B1/binary,B0/binary,Rest/binary>> || <<B0:1/binary,B1:1/binary,Rest/binary>> <- Packets, B0 =/= <<0>> >>.
<<2,1,3,4,6,5,7>>

Here the variables B0 and B1 are bound to 1-byte binaries rather than integers, so (maybe) we avoid converting bytes to integers. I am curious if this is faster.

Binaries from binaries

The last case is where the generator ranges over a binary and the result is also concatenated into a binary. The simplest example is just the identity:

16> << <<X>> || <<X>> <= <<1,2,3>> >>.
<<1,2,3>>

Here is how we can convert encoding of 4-byte little endian integers into 8-byte big endian integers

17> << <<X:64/big-signed-integer>> || <<X:32/little-signed-integer>> <= <<1,0,0,0,2,0,0,0>> >>.
<<0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2>>
Advertisements

Tags:

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: