Here's a toy problem. Given a corpus of phone numbers for different countries determine a most prevalent display format in each country and use it to re-format an arbitrary phone number for its country. For example, if most US numbers in our data corpus are written like xxx-xxx-xxxx
then the string (206) 1234567
should be converted to 206-123-4567
.
For simplicity, let's assume that all numbers are local so we don't have to deal with the complexity of international prefixes.
Conversion
The actual conversion is not what this post is about, but for completeness here's a pair of complementary functions, one determining a pattern, and the other formatting a phone number according to it:
import re
def get_pattern(phone):
pattern = re.sub(r'\s+', ' ', phone) # normalize whitespace
return re.sub(r'\d', '{}', pattern) # replace digits with placeholders
def format(pattern, phone):
return pattern.format(*re.findall(r'\d', phone))
(Error handling omitted for clarity.)
Straightforward solution
The usual approach is to use a dict-like structure that would hold patterns as keys and count how often they appear in the source database, so the result is going to look something like this:
{
'US': {
'({}{}{}) {}{}{}-{}{}{}{}': 1,
'{}{}{}-{}{}{}-{}{}{}{}': 2,
},
'GB': {
'{}{}{} {}{}{} {}{}{}{}': 1,
},
'RU': {
'({}{}{}) {}{}{}-{}{}-{}{}': 1,
},
}
(Don't mind the numbers, it's a small database!)
As far as I have seen, people usually turn to defaultdict for something like this. It automatically initializes values of absent keys on first access which lets you simply do result[key] += 1
every time. In our case we need a more complicated thing: a defaultdict
for countries which would initialize keys with nested defaultdict
s for counts. Overall it might look like this:
from collections import defaultdict
stats = defaultdict(lambda: defaultdict(int)) # nested defaultdicts
for country, phone in DB:
pattern = get_pattern(phone)
stats[country][pattern] += 1
# Replace each country stats with the most common pattern
for country, patterns in stats.items():
pattern = max(patterns.items(), key=lambda i: i[1])[0]
stats[country] = pattern
Counter
Here's a rule of thumb: whenever you see defaultdict(int)
you can replace it with a Counter. It can do everything a defaultdict
can, plus a few convenient tricks, like the .most_common()
method which we can use instead of the ungainly max()
expression from the previous example.
from collections import defaultdict, Counter
stats = defaultdict(Counter)
for country, phone in DB:
pattern = get_pattern(phone)
stats[country][pattern] += 1
for country, patterns in stats.items():
stats[country] = patterns.most_common(1)[0][0] # most_common would return
# a list [(pattern, count)], so
# we need this [0][0] here
Flat and functional
Let me offer you another solution, which uses Counter
magic even more. I am also going to replace for-loops with a functional approach, so it won't be so embarrassingly imperative :-)
Here's the code, followed by an explanation:
from collections import Counter
def determine_patterns(db):
stats = Counter((country, get_pattern(phone)) for country, phone in db)
stats = reversed(stats.most_common())
return {country: pattern for (country, pattern), _ in stats}
Look ma no loops!
Let me explain all that's happening here.
-
As it happens,
Counter
can consume a flat sequence of things and count them in one go:Counter(['a', 'b', 'a', 'c']) => {'a': 2, 'b': 1, 'c': 1}
For this to work in our case we'll have to replace the nested data structure with a flat one. This can be done by simply gluing keys together in a tuple, replacing
stats[country][pattern]
with equivalentstats[(country, pattern)]
(you can do it for any number of keys). -
The expression
(country, get_pattern(phone)) for country, phone in db
looks like a list comprehension without square brackets. It's a generator expression which you can iterate over without constructing an actual list in memory. It is usually enclosed in parentheses( .. )
but when it is a sole argument in a function call you can omit them and avoid ugly doubling likeCounter(( .. ))
. -
stats.most_common()
with no arguments returns the entire contents of the structure as a sorted list of pairs(key, count)
:[ (('US', '{}{}{}-{}{}{}-{}{}{}{}'), 2), (('GB', '{}{}{} {}{}{} {}{}{}{}'), 1), (('US', '({}{}{}) {}{}{}-{}{}{}{}'), 1), (('RU', '({}{}{}) {}{}{}-{}{}-{}{}'), 1), ]
It's sorted by count, with no regard for countries, but we don't care as you'll see later, as well as why we need to reverse it.
-
The last expression is a dict comprehension which walks over the sorted list, destructuring its nested pairs into individual variables using
(country, pattern), _
which mimics the structure of one pair._
is a conventional name for things we don't use,count
in our case. So the sequence we actually fold into a final dict looks like this, with countries becoming keys and patterns becoming values:('US', '{}{}{}-{}{}{}-{}{}{}{}'), ('GB', '{}{}{} {}{}{} {}{}{}{}'), ('US', '({}{}{}) {}{}{}-{}{}{}{}'), ('RU', '({}{}{}) {}{}{}-{}{}-{}{}'),
-
Here's the important thing: subsequent values with the same key replace the ones already in the dict. This is why we need to reverse the list, so the patterns with bigger counts replace least common ones as we go along. This is also why we don't care about countries being intermixed, as they won't affect each other.
Admittedly, the last bit is the most unobvious here.
Is it better?
I don't really care, you can judge for yourself :-) I only wanted to demonstrate collections.Counter
in action!
Working code as a bonus: phone-counter.py.
Comments: 1 (noteworthy: 1)
Noteworthy comment
Just ¢1 more. Unpacking of
most_common()
result could be done in self-documented manner, so no comment needed: