There were several moments over the last few weeks when I heard people discuss differences between Python lists and dicts and one of the first ones mentioned was that lists are ordered and dicts are not.
Well, not anymore. Quoting the docs referenced above:
Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6.
So if you want to discuss fundamental differences you can pretty much only point out that dict values are accessible by keys, which could be of any immutable type, while list values are indexed with integers. That's it :-)
How and why
A plain hash table holds both keys and values in a pseudo random order determined by hashes calculated from keys. It is also sparse, with unoccupied holes in a pre-allocated array:
Since version 3.6 CPython holds keys and values in a separate dense array, while the hash table itself only holds indexes into that:
Since the entries array is populated sequentially, it naturally ensures the order.
As far as I know the initial reason for this change was saving space by sharing hash tables of multiple dicts with the same set of keys (which in Python means instances of the same class, for example). I don't know the exact politics of how this useful property progressed from an implementation detail to a guaranteed behavior.
dict[int] vs. list
Does it mean that a dict with int keys is the same as list? There's still a few practical differences.
One obvious difference is the API. You'd have to always mention indexes explicitly when working with a dict[int], there's no such thing as .append(value)
or .pop()
with no argument. Frankly, I can't see any point in trying to make it work :-)
Also, dicts are bigger, by an order of magnitude:
In [79]: getsizeof(['' for i in range(100000)])
Out[79]: 824472
In [80]: getsizeof({i: '' for i in range(100000)})
Out[80]: 5242984
Surprisingly, I couldn't find any consistent difference in speed in neither insertion of new values in a list and a dict[int], nor in traversing them. My gut feeling tells me it's mostly due to the fact that a hash value of an int is the int itself, so there's no time wasted on hashing.
Not sets though!
Yep, to my surprise sets are still unordered:
In [3]: list({'one': True, 'two': True, 'three': True, 'four': True, 'five': True})
Out[3]: ['one', 'two', 'three', 'four', 'five']
In [4]: list({'one', 'two', 'three', 'four', 'five'})
Out[4]: ['three', 'five', 'four', 'one', 'two']
Until this moment I thought of sets as basically thin wrappers around actual dicts operating on their keys and neglecting values. Turns out, they have their own implementation.
Comments: 38 (noteworthy: 4)
Interesting! So it brings Python on par with PHP.
PHP (and Facebook Hack) is the only language I was aware of that has the ordered maps/dicts in the standard container toolkit.
When I was exposed to Hack, my initial instincts were not to rely on this feature, but when I got used to it I actually started to appreciate it. This helped to improve my code quite a few times!
Does any other popular language have that?
Mitya, Ruby has had ordered dicts for quite a while too.
Noteworthy comment
There is a great talk on this topic: https://youtu.be/p33CVV29OG8
Python has had ordered dicts for a long time (in "containers.ordereddict" I think). The change described above, in Python 3.6 (from 2016) merely extends that "ordered" behavior to regular dicts as well.
Mitya: C++
std::map
is ordered. If you want one that's unordered you have to usestd::unordered_map
. C# has OrderedDictionary. Java has SortedMap.It's not uncommon.
@hoistbypetard as far as I know,
std::map
is ordered by keys, not by insertion order. It probably doesn't matter often in practice, but still is something that's good to know!@Ivan Sagalaev: That matches my recollection. I glossed over the "insertion order" part. Probably because that's never been important to me, where consistent ordering is sometimes important (or at least useful).
Nice analysis overall, btw.
Welcome back to blogging! Nice to read you.
@Ruslan Keba thanks! :-) I'm not promising anything though, last year I managed 2 posts overall. I have way too many drafts sitting in my queue.
Brandon Rhodes gave a good talk at PyCon 2017 about the evolution of Python dictionaries: https://www.youtube.com/watch?v=66P5FMkWoVU.
Just testing, sorry. This is a good interface, the markdown is nice. How do you deal with spam (like mine)? I would love to have something like this for my blog. Also, are the commends db-backed, or do you do some static-site generation stuff with it?
@hoistbypetard In java you have LinkedHashMap to keep insertion order. SortedMap is used to order by the actual key itself.
I think sets in Python will never be ordered due to the fact that while
>>> list([1,2,3]) == list([1,3,2]) False >>> set([1,2,3]) == set([1,3,2]) True
that is, it must conform to the mathematical definition of equal sets:
Tcl has ordered dicts by default as well.
One application where ordered dicts are really useful is when parsing JSON files. If you want read a JSON file, edit something and then write it back, ordered dicts ensure that the object attributes in the edited file are in the same order as in the original file.
Noteworthy comment
@hoistbypetard @Ivan Sagalaev C++ std::map is a tree, not a hash table. So order by key comes naturally. Entirely different data structure.
There may be no
.pop()
but there is a.popitem()
which does the same in guaranteed LIFO from Python 3.7.JavaScript Maps are also sorted in insertion order.
Mitya, in Tcl dictionaries have been ordered from the start. They are also immutable, something I wish was the default in other scripting languages.
@hoistbypetard std::map is ordered by value of the key and kept in insertion order. E.g. If you insert the following key/value pairs (in order left to right): {3, "three"}, {1, "one"}, {2, "two"}.
- std::map key order: { 1, 2, 3}
- Python 3.7 dict key order: {3, 1, 2}
I deal with spam by manually approving all comments :-) There was a more complicated system in place at one point, but it relied on other services (Akismet) and was more brittle than I wanted to deal with. So I eventually simplified everything since my blog doesn't get a lot of traffic these days.
Also yes, it's a custom DB-backed software. Write me an email if you like more details!
Thanks for the nice article!
You wrote: Yep, to my surprise sets are still unordered!
I just want to point out that it is should not be a surprise that the set type is still unordered. The behavior of the set type is based on the mathematical idea that two sets are equivalent if they have the same elements. I suspect that the basic set type characteristics will never change.
I just noticed Mauro made the same point I did a few comments back. He included a definition of set that makes the point mor clearly than I did.
Mitya, Ruby has had ordered hashes as the default since 2013.
Well yes. Let's draw the correct conclusion from this. Sets can be ordered by an implementation, while raising no conflict with the mathematical definition. Perhaps you had in mind the equality operator with sets. Yes, the behavior of that operator should not change: set(1,2,3) should be equal to set(1,3,2) and also to set(3,1,2) etc. Were a language or library to dish the set elements out in a certain order, that would not change this rule. The ordering can be fixed, or random; either way, we still mathematically have a set.
whoops, i take that back, since 2007. https://www.ruby-lang.org/en/news/2007/12/25/ruby-1-9-0-released/
Noteworthy comment
You can read the discussions at these two threads:
https://mail.python.org/pipermail/python-dev/2017-November/150142.html
https://mail.python.org/pipermail/python-dev/2017-December/151263.html
I thought unordered hashes/maps/dicts had security implications. I come from Perl where the language went in the other direction, from ordered to unordered for purposes of security hardening. Have these security considerations changes or are they not relevant in Python?
https://stackoverflow.com/questions/30340027/why-do-hash-keys-have-different-order-when-printing
The security implications are around collisions inside the hash table. Python's solution to ordering things doesn't change that and underlying everything they still have the same kind of randomness added to the hash function that Perl 5 uses.
Sets don't need any ordering. They are collections, but there is no concept of order for sets.
People commenting that sets cannot be ordered are, I think, missing the point. The same argument could be made for dicts, which are just a set with a value attached to each key.
Sets could retain all their existing behavior, preserving any parallels with mathematical sets , except when calling 'pop' or iterating over members, it provides them in the order they were inserted.
Java has https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html for a long time in its core.
I am very happy with this change, it is a game-changer for Python. In the past, it was difficult to use classes to specify interfaces, because the order of elements in a class was not defined. This made it very difficult to e.g. write a OEM mapper for databases, like SQLAlchemy. They had to use complex meta-programming tricks to find out in which order elements are defined in a database table class. I once wrote an OEM myself as an exercise, and trust me, it is not pretty.
With release 3.6, that problem has become moot. I love it. The new type-declaration system is an added super-bonus. I now often use dataclasses to define complex protocols, with an absolute minimum of overhead, automatic generation of parsers and serialisers, etc. I love it !
I was eagerly waiting for release 3.6, and imported a pre-release of dataclasses, just because these features are such a game-changer.
Hey very interesting post and good to know, but I used your example with timers to verify the claims about the similar insertion/traversal times. I examined it for 3 Operations: Lookups, Insertion and Traversals. While lists may be faster for traversal (iteration), dicts and sets are way faster for the other two Ops. I found that lists might match equal speeds (runtimes) for inserting elements at the end (, for which they do not need to move memory and simply call append()). But to be fair, I opted for inserting the element in the middle of the iterables. Please try the following code yourself and let me know if i'm doing something wrong here :) :
# speedtest dicts vs lists import time d = {i: '' for i in range(10000000)} s = {i for i in range(10000000)} l = ['' for i in range(10000000)] elem = 999999 #membership testing 1 - lookups: speed in Decreasing order: dicts, sets, lists start = time.perf_counter() if elem in s: print('True') end = time.perf_counter() print('time1 sets:' + str(end-start)) start = time.perf_counter() if elem in d: print('True') end = time.perf_counter() print('time1 dicts:' + str(end-start)) start = time.perf_counter() if elem in l: print('True') end = time.perf_counter() print('time1 lists:' + str(end-start)) # ----------------------------------------- # inserting elements 2 --- speed in Decreasing order: dicts, sets, lists start = time.perf_counter() s.add(elem) end = time.perf_counter() print('time2 sets:' + str(end-start)) start = time.perf_counter() d[elem] = '' end = time.perf_counter() print('time2 dicts:' + str(end-start)) start = time.perf_counter() l.insert(elem,1) end = time.perf_counter() print('time2 lists:' + str(end-start)) # ----------------------------------------- # traversing elements 3 --- speed in Decreasing order: lists, dicts, sets start = time.perf_counter() for item in s: pass end = time.perf_counter() print('time3 sets:' + str(end-start)) start = time.perf_counter() for item in d: pass end = time.perf_counter() print('time3 dicts:' + str(end-start)) start = time.perf_counter() for item in l: pass end = time.perf_counter() print('time3 lists:' + str(end-start))
Thank you for considering, noticing and acknowledging! ;)
Because of the suboptimal readability, I reposted my code on pastebin: https://pastebin.com/4aMY6eWP
Many
list
vsdict
articles on the web seem to be obsolete now. I was also struggling to find any advantage of lists over dicts. I am grateful for Ivan's post and comments. For me the difference boils down to:dict
(10x in case of 100k integers)PS: easy
list
->dict
>>> l = ['a', 'b'] >>> d = dict(enumerate(l)) >>> d {0: 'a', 1: 'b'}
Thnx for the article! I thought they did it like the c++ map with red-black tree. It's good that they kept the hash based implementation.
Noteworthy comment
There is still an important difference between the dict and OrderedDict class, which is that == comparisons for dicts are still order independent, while OrderedDicts are only == to each other if their orders match.
This is both good and weird. The good is that the behavior of the equality operator was not changed as a result of this update and still "makes sense" for how people normally think of dictionary semantics. The weird is that dicts now have extra hidden infomation (the order) that is not taken into account by the equality operator, and thus two different dicts that are technically == to each other, could produce different results when passed to a function that iterates over them.