Published on

Field Note: How to assert for unique items between two lists

Authors
  • avatar
    Name
    Eyji Koike Cuff
    Twitter

Have you ever had two lists of things and you wanted to know the exact items that were different between them?

Well, I did. For the context, I was looking to find the differences between two lists of objects, one coming from a graphql query, and another from an API after that graphql list was ingested.

Just a quick reminder before we follow that many other ways of doing this exist specially in a language like Python.


πŸ’­ The instinctive thought

Whenever you see problems with lists you might instinctively think of loops or related things. That's what one of the first things we learn to do when introduced to another language. That is pretty normal.

Take this example program that illustrates the problem:

my_list_of_ids = ['1', '2', '3']
api_list = session.request_api_ids() # yields ['1', '2']

def get_ids_missing(graphql_list: list[str], api_list: list[str]):
    # complexity O(n * m)
    missing_ids = []
    for id in my_list_of_ids:
        if id not in api_list: 
            missing_ids.append(id)


missing_ids = get_missing_ids(my_list_of_ids, api_list) # this will be ['3']

I bet now you can spot the problem. We are doing a linear search. The complexity is O(n * m) where n is the quantity of items in the first list, and m is the quantity of items in the second list. This means the loop makes us go through all the elements in both lists, and you can imagine that the longer they are, more time we will take to finish this.

Doing basic math, if we have 1000 items on each list, we will at the end have 1000*1000 = 1000000 operations.

⚑️ But there is another way!


πŸ• The simpler time complexity solution

There is another data structure that can helps us speed that up and it's called set.

Why are we using set?

Because a Python set is a collection of immutable (Strings, Numbers, Tuples) and unique elements. Internally, set is basically a hashtable where Python creates a hash from the item you're passing to it. For example, internally it would look like:

set = {
    "hashed_1": 1,
    "hashed_2": 2,
}

And here is the cool thing, since the hash provided by python is a mathematical representation of the item you're passing to it, you will never have duplicate values. This means that, let's say the hash for 5 (the integer) results in 12345, it's hashkey will always be 12345. So if you have add another 5, at the time Python computes its hash you use the address 12345 again. It will always take you to the same one!

Basically Python expects that:

a == b β†’ then hash(a) == hash(b)

That's why only immutable types are hashable.

Ok, but how using the set will improve the speed?

By transforming your lists into sets, you basically give access to every item with O(1) complexity! And sets have very cool properties to take advantage of that:

set_a = set([1,2,3,4]) # Complexity O(n)
set_b = set([1,2]) # Complexity O(m)

difference_method = set_a.symmetrical_difference(set_b) # also 3, 4 Complexity O(n)

Symmetric difference works like an XOR: it gives you items that are in one set or the other, but not in both.

Sure, I see. But what about time?

Now we have 3 operations, O(n), O(m), and O(n). If you say that n=1000, and m=1000, now you will have a sum o complexities, not a multiplication, resulting in 3000 operations.

I DON'T BELIEVE YOU!

Well, try the following (I promise is not a virus)

import random
import timeit

# Setup: generate two lists with overlap
setup_code = """
import random

# Lists with 100,000 elements each and 50,000 different items
list_a = [str(i) for i in range(100000)] 
list_b = [str(i) for i in range(50000, 150000)]
"""

# Linear search version: O(n * m), takes about 50s
linear_search_code = """
difference = [item for item in list_a if item not in list_b]
print(len(difference)) 
"""

# Set symmetric difference: O(n + m), takes about 0.04s
set_diff_code = """
set_a = set(list_a)
set_b = set(list_b)
difference = set_a - set_b  
print(len(difference)) 
"""

print("⏱️ Linear search:")
print(timeit.timeit(stmt=linear_search_code, setup=setup_code, number=1))

print("\n⚑ Symmetric difference:")
print(timeit.timeit(stmt=set_diff_code, setup=setup_code, number=1))


πŸ“ Final Thought

With this small example, I hope you see how exploring all the tools a language offers can make your life β€” and code β€” easier.