Background
Sometimes we may generate or retrieve a list, set or even dict when creating collection of things that we will be testing against. Theoretically a set, frozenset or dictionary should be the fastest (and equivalent) forms of storage for this operation. However, what I’ve found is that on some systems the set is faster and on others dict is faster. This difference maybe very important if your writing real-time or close to real-time software. So where am I going with this?
Big-O Notation – Advertised Complexity
Python has published the expected time complexities of their collection types. I’ve copied the ones for the in operator below. These Big-O numbers are exactly what you would expect since everything but a list is implemented using a hashing algorithm. It should be noted, however, that the speed of the set, frozenset, and dict can be compromised if the objects stored do not implement a good hashing algorithm.
Type | Average | Worst |
---|---|---|
list | O(n) | |
set | O(1) | O(n) |
frozenset | O(1) | O(n) |
dict | O(1) | O(n) |
What I Found
Going back to my statement above, I found that on certain machines, python sets were faster and on some machines python dicts where faster. I cannot replicate sets being faster in all cases directly so I tried to replicate it with a RHEL 7.1 machine on AWS. Given that I was at an optimal case for the collection (no collisions), I would have thought that set, frozenset, and dict at least performed on par with each other. I was surprised to find with the default python interpreter my tests showed that python dicts are actually faster. So, I reran the tests with the corresponding version of PyPy and found that the expected results hold true and set and frozenset operate at virtually the same speed as dicts. I suspect the primary reasons for the differences are the compiler used to create the python binaries. It was interesting however that PyPy performed as expected on all systems.
The Data
I ran the benchmarks on OSX, Ubuntu 14.04, and RHEL 7.1 (Courtesy of AWS Free Tier); Though, I opted not to record the RHEL results as they are similar to the Ubuntu results.
Benchmarks | Fastest | % Difference | |
---|---|---|---|
OSX | |||
Python | |||
list | 5.47 | 150.641 | |
set | 0.85 | 9.877 | |
frozenset | 0.85 | 9.877 | |
dict | 0.77 | 0.77 | 0.000 |
PyPy | |||
list | 0.34 | 89.362 | |
set | 0.13 | 0.13 | 0.000 |
frozenset | 0.13 | 0.13 | 0.000 |
dict | 0.13 | 0.13 | 0.000 |
Ubuntu | |||
Python | |||
list | 6.07 | 123.733 | |
set | 1.44 | 0.697 | |
frozenset | 1.49 | 4.110 | |
dict | 1.43 | 1.43 | 0.000 |
PyPy | |||
list | 0.78 | 102.913 | |
set | 0.25 | 0.25 | 0.000 |
frozenset | 0.25 | 0.25 | 0.000 |
dict | 0.26 | 3.922 |
Recommendations
If you have a need to create a collection to test for existence like in this example; favor set, frozenset or dict whichever makes sense for your situation. If you are working with a list your given and you want to speedup the system, you can consider changing the list to a set.
The Code
I’ve uploaded all the code to github. It is available here: https://github.com/chaddotson/container-membership-benchmark/.