Here's a toy problem. Given a corpus of phone numbers for different countries determine a most prevalent display format in each country and use it to re-format an arbitrary phone number for its country. For example, if most US numbers in our data corpus are written like xxx-xxx-xxxx then the string (206) 1234567 should be converted to 206-123-4567.

For simplicity, let's assume that all numbers are local so we don't have to deal with the complexity of international prefixes.


The actual conversion is not what this post is about, but for completeness here's a pair of complementary functions, one determining a pattern, and the other formatting a phone number according to it:

import re

def get_pattern(phone):
    pattern = re.sub(r'\s+', ' ', phone)  # normalize whitespace
    return re.sub(r'\d', '{}', pattern)   # replace digits with placeholders

def format(pattern, phone):
    return pattern.format(*re.findall(r'\d', phone))

(Error handling omitted for clarity.)

Straightforward solution

The usual approach is to use a dict-like structure that would hold patterns as keys and count how often they appear in the source database, so the result is going to look something like this:

    'US': {
        '({}{}{}) {}{}{}-{}{}{}{}': 1, 
        '{}{}{}-{}{}{}-{}{}{}{}': 2,
    'GB': {
        '{}{}{} {}{}{} {}{}{}{}': 1,
    'RU': {
        '({}{}{}) {}{}{}-{}{}-{}{}': 1,

(Don't mind the numbers, it's a small database!)

As far as I have seen, people usually turn to defaultdict for something like this. It automatically initializes values of absent keys on first access which lets you simply do result[key] += 1 every time. In our case we need a more complicated thing: a defaultdict for countries which would initialize keys with nested defaultdicts for counts. Overall it might look like this:

from collections import defaultdict

stats = defaultdict(lambda: defaultdict(int))  # nested defaultdicts

for country, phone in DB:
    pattern = get_pattern(phone)
    stats[country][pattern] += 1

# Replace each country stats with the most common pattern
for country, patterns in stats.items():
    pattern = max(patterns.items(), key=lambda i: i[1])[0]
    stats[country] = pattern


Here's a rule of thumb: whenever you see defaultdict(int) you can replace it with a Counter. It can do everything a defaultdict can, plus a few convenient tricks, like the .most_common() method which we can use instead of the ungainly max() expression from the previous example.

from collections import defaultdict, Counter

stats = defaultdict(Counter)

for country, phone in DB:
    pattern = get_pattern(phone)
    stats[country][pattern] += 1

for country, patterns in stats.items():
    stats[country] = patterns.most_common(1)[0][0]  # most_common would return 
                                                    # a list [(pattern, count)], so
                                                    # we need this [0][0] here

Flat and functional

Let me offer you another solution, which uses Counter magic even more. I am also going to replace for-loops with a functional approach, so it won't be so embarrassingly imperative :-)

Here's the code, followed by an explanation:

from collections import Counter

def determine_patterns(db):
    stats = Counter((country, get_pattern(phone)) for country, phone in db)
    stats = reversed(stats.most_common())
    return {country: pattern for (country, pattern), _ in stats}

Look ma no loops!

Let me explain all that's happening here.

Is it better?

I don't really care, you can judge for yourself :-) I only wanted to demonstrate collections.Counter in action!

Working code as a bonus:

Comments: 1 (noteworthy: 1)

  1. Denis

    Noteworthy comment

    Just ¢1 more. Unpacking of most_common() result could be done in self-documented manner, so no comment needed:

    [(pattern, count)] = patterns.most_common(1)
    stats[country] = pattern

Add comment