What is the best way to pass and update a collection of data in Julia?
Introduction
Julia offers many different collections for storing data. In this post, I am interested in showing which type of collection is most efficient for storing/updating values and passing to other functions in the context of solving a backwards recursion dynamic programming model. I do not use the most obvious choice — arrays — because they are too restrictive in most use cases since they do not allow mixing of data types (e.g. real numbers and strings).
I compare the performance of four different collection types:
Mutable struct
DataFrame
(usingDataFrames.jl
)Dict
NamedTuple
The end result is that, for my use case below, named tuples have the best performance, followed closely by mutable structs. Dictionaries are an order of magnitude slower, and data frames are two orders of magnitude slower.
All the examples are tested under Julia 1.7.3. See here for a similar post that compares immutable and mutable arrays using the Setfield.jl
package.
The code
Here is the first part of the code which loads packages and defines the objects to be used. Each container has one element, x
, which will be updated many, many times in downstream function calls.
using Statistics, BenchmarkTools, DataFrames, RandomRandom.seed!(1234)mutable struct allParmsM
x::Float64
end# fill in the mutable struct with the values
allpM = allParmsM(
5.784 # x
)# create a data frame
allp_df = DataFrame(x=5.784)# create a dict
allp_dict = Dict("x"=>5.784)# create a named tuple
allp_nt = (x=5.784,)
The next chunk of code creates functions for updating x
in each container. We update x
with a draw from a uniform random number and then return the square root of that number.
Note well that named tuples are actually immutable, but that can be worked around by using the code mynt = merge(mynt, (fieldn=new_value,))
. (Thanks to this StackOverflow answer for pointing that out!) Note also that the Setfield.jl
package can be used for mutating immutable structs to achieve performance gains.
# mutable struct
function readin_struct(st)
st.x = rand()
return sqrt(st.x)
end# data frame
function readin_df(stdf)
stdf[!,:x] .= rand()
return sqrt(stdf[!,:x][1])
end# dict
function readin_dict(stdict)
stdict["x"] = rand()
return sqrt(stdict["x"])
end# named tuple
function readin_nt(stnt)
stnt = merge(stnt, (x=rand(),))
return sqrt(stnt.x)
end
Now I create functions that call the above functions many, many times as one would in a nested set of loops in a backwards recursion dynamic programming problem, for example. Each function takes as its inputs an integer (the number of times to repeat the call) and a collection (struct, data frame, dictionary, or named tuple). Each function outputs the mean of the vector of square-rooted uniform random numbers.
function looper_struct(J::Int64,st)
out = zeros(J)
for j=1:J
out[j] = readin_struct(st)
end
return mean(out)
endfunction looper_df(J::Int64,st)
out = zeros(J)
for j=1:J
out[j] = readin_df(st)
end
return mean(out)
endfunction looper_dict(J::Int64,st)
out = zeros(J)
for j=1:J
out[j] = readin_dict(st)
end
return mean(out)
endfunction looper_nt(J::Int64,st)
out = zeros(J)
for j=1:J
out[j] = readin_nt(st)
end
return mean(out)
end
Benchmarking
Now I benchmark the above functions to compare performance requirements. I set the number of replications to be 1 million as a way to evaluate performance at a realistic scale.
# force compilation and collect garbage
looper_struct(1,allpM); looper_df(1,allp_df); looper_dict(1,allp_dict); looper_nt(1,allp_nt); GC.gc();# Do the benchmark timingprintln("\nmutable struct")
@btime looper_struct(1_000_000,allpM)println("\ndict")
@btime looper_dict(1_000_000,allp_dict)println("\ndata frame")
@btime looper_df(1_000_000,allp_df)println("\nnamed tuple")
@btime looper_nt(1_000_000,allp_nt)
The result
As you can see below, named tuples have the best performance, while dictionaries and DataFrames perform much worse.
julia> include("btiming.jl")mutable struct
4.343 ms (3 allocations: 7.63 MiB)dict
42.174 ms (3 allocations: 7.63 MiB)data frame
751.915 ms (11999492 allocations: 419.61 MiB)named tuple
3.478 ms (3 allocations: 7.63 MiB)