How to optimize Python code for resource-constrained environments
Author: Eyji Koike Cuff
Last week I ran into a funny issue. A Lambda function was freezing mid-execution, and the whole thing would just time out. Looking at the logs, the issue was clear.
Max Memory Used: 512 MB
The memory was maxed out.
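For context, that figure comes from the REPORT entry Lambda writes to CloudWatch at the end of every invocation. A representative line (request id and durations here are purely illustrative) looks roughly like this:
REPORT RequestId: 3f6c0000-0000-0000-0000-000000000000  Duration: 900012.45 ms  Billed Duration: 900000 ms  Memory Size: 512 MB  Max Memory Used: 512 MB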
🐛 The issue
The function being triggered deals with data expansion: it queries the entire dataset and expands those records to generate mock, back-dated ones, increasing the data size three times over!
Here is a flowchart to help you understand the data lifecycle:

In the iteration step, we were simply looping over the records and creating a deep copy of the whole dataset list on every pass, which used far more memory than needed. Take the following snippet as an example:
from datetime import datetime
from copy import deepcopy
import random
import string
import tracemalloc

from pydantic import BaseModel


# Define the Pydantic model
class HeavyModel(BaseModel):
    id: int
    name: str
    created_at: datetime
    payload: str
    metadata: dict


def generate_payload(size=10240):  # ~10 KB of random text per record
    return ''.join(random.choices(string.ascii_letters + string.digits, k=size))


def generate_metadata():
    return {
        'tags': [f"tag{random.randint(1, 100)}" for _ in range(10)],
        'attributes': {f"key_{i}": random.random() for i in range(20)},
        'flags': [random.choice([True, False]) for _ in range(10)],
    }


def generate_original_data(count=40000):
    now = datetime.now()
    return [
        HeavyModel(
            id=i,
            name=f"Item {i}",
            created_at=now,
            payload=generate_payload(),
            metadata=generate_metadata()
        ) for i in range(count)
    ]


def generate_backdated_data_by_list_deepcopy(original_data, years=3):
    results = []
    for year_offset in range(years + 1):
        # Deep-copies the ENTIRE dataset once per year and keeps every copy alive
        year_copy = deepcopy(original_data)
        for model in year_copy:
            model.created_at = model.created_at.replace(
                year=model.created_at.year - year_offset
            )
        results.extend(year_copy)
    return results


def print_memory_usage_tracemalloc(note=""):
    current, peak = tracemalloc.get_traced_memory()
    current_mb = current / 1024 / 1024
    peak_mb = peak / 1024 / 1024
    print(f"[MEMORY]{note}: Current={current_mb:.2f} MB | Peak={peak_mb:.2f} MB")


# Start tracking with tracemalloc
tracemalloc.start()

print_memory_usage_tracemalloc("Before data generation")
original = generate_original_data()
print_memory_usage_tracemalloc("After original data generation")
all_data = generate_backdated_data_by_list_deepcopy(original)
print_memory_usage_tracemalloc("After backdated data generation")

# Stop tracking
tracemalloc.stop()

"""
You should see something similar to:
[MEMORY]Before data generation: Current=0.00 MB | Peak=0.00 MB
[MEMORY]After original data generation: Current=563.27 MB | Peak=563.34 MB
[MEMORY]After backdated data generation: Current=878.22 MB | Peak=896.34 MB
"""
All that generated data is kept in memory at the same time, which is unnecessary: the original list alone measures roughly 563 MB, and generating the back-dated copies pushes the peak close to 900 MB. That would mean increasing the Lambda's memory allocation, even though we never need all of it at once.
✅ The solution
The fix is to use generators to yield data only as we need it. Generator functions in Python (and in any other language that offers them) are powerful tools for lazy evaluation: they reduce memory and processing overhead by producing values just in time. Therefore, we can adapt our dataset function to return batches of 4000 records like this:
def generate_lazy_dataset_pages(page_size=4000, total=40000):
    now = datetime.now()
    for batch in range(0, total, page_size):
        yield [
            HeavyModel(
                id=i,
                name=f"Item {i}",
                created_at=now,
                payload=generate_payload(),
                metadata=generate_metadata()
            ) for i in range(batch, batch + page_size)
        ]
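In the real Lambda the records come from a query rather than from mock generation, but the shape is the same: page through the source lazily instead of loading everything up front. A minimal sketch, assuming a hypothetical fetch_page(offset, limit) helper that wraps the actual dataset query:

def lazy_query_pages(page_size=4000):
    # fetch_page is a hypothetical stand-in for the real dataset query;
    # it is assumed to return at most page_size records starting at offset,
    # and an empty list once the dataset is exhausted.
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        if not page:
            break
        yield page
        offset += page_size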
Our back-dated items function is also refactored into a generator that yields a batch of 4000 back-dated records per year, so at most one page of copies is alive at any moment:
def generate_backdated_lazy(paged_data_gen, years=3):
    for page in paged_data_gen:
        for year_offset in range(years + 1):
            year_batch = []
            for model in page:
                model_copy = deepcopy(model)
                model_copy.created_at = model_copy.created_at.replace(
                    year=model_copy.created_at.year - year_offset
                )
                year_batch.append(model_copy)
            yield year_batch
And to simulate consuming those items, we would do:

paged_data = generate_lazy_dataset_pages()
backdated_batches = generate_backdated_lazy(paged_data)

for year_batch in backdated_batches:
    for record in year_batch:
        _ = record.model_dump()
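In production, the model_dump() call is where each record would actually leave the function, for example as an SQS message. A rough sketch of what that consumption could look like with boto3, assuming the generators above (the queue URL is a placeholder, and SQS caps a batch at 10 entries):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

paged_data = generate_lazy_dataset_pages()
for year_batch in generate_backdated_lazy(paged_data):
    records = [r.model_dump(mode="json") for r in year_batch]
    # SQS allows at most 10 entries per batch request, so chunk each page
    for start in range(0, len(records), 10):
        chunk = records[start:start + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(i), "MessageBody": json.dumps(rec)}
                for i, rec in enumerate(chunk)
            ],
        )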
With the full-fledged script below, we can compare the memory usage of both approaches:
~/PycharmProjects $ python test.py
[Before eager processing] Current = 0.00 MB | Peak = 0.00 MB
[After eager processing] Current = 878.22 MB | Peak = 896.34 MB
[Before lazy processing] Current = 0.00 MB | Peak = 0.00 MB
[After lazy processing] Current = 55.62 MB | Peak = 120.70 MB
from datetime import datetime
from copy import deepcopy
import random
import string
import tracemalloc

from pydantic import BaseModel


class HeavyModel(BaseModel):
    id: int
    name: str
    created_at: datetime
    payload: str
    metadata: dict


def generate_payload(size=10_240):
    return ''.join(random.choices(string.ascii_letters + string.digits, k=size))


def generate_metadata():
    return {
        'tags': [f"tag{random.randint(1, 100)}" for _ in range(10)],
        'attributes': {f"key_{i}": random.random() for i in range(20)},
        'flags': [random.choice([True, False]) for _ in range(10)],
    }


# -------- EAGER VERSION --------
def generate_original_data_eager(count=40000):
    now = datetime.now()
    return [
        HeavyModel(
            id=i,
            name=f"Item {i}",
            created_at=now,
            payload=generate_payload(),
            metadata=generate_metadata()
        ) for i in range(count)
    ]


def generate_backdated_data_eager(original_data, years=3):
    results = []
    for year_offset in range(years + 1):
        year_copy = deepcopy(original_data)
        for model in year_copy:
            model.created_at = model.created_at.replace(
                year=model.created_at.year - year_offset
            )
        results.extend(year_copy)
    return results


# -------- LAZY VERSION --------
def generate_lazy_dataset_pages(page_size=4000, total=40000):
    now = datetime.now()
    for batch in range(0, total, page_size):
        yield [
            HeavyModel(
                id=i,
                name=f"Item {i}",
                created_at=now,
                payload=generate_payload(),
                metadata=generate_metadata()
            ) for i in range(batch, batch + page_size)
        ]


def generate_backdated_lazy(paged_data_gen, years=3):
    for page in paged_data_gen:
        for year_offset in range(years + 1):
            year_batch = []
            for model in page:
                model_copy = deepcopy(model)
                model_copy.created_at = model_copy.created_at.replace(
                    year=model_copy.created_at.year - year_offset
                )
                year_batch.append(model_copy)
            yield year_batch


def print_memory(note=""):
    current, peak = tracemalloc.get_traced_memory()
    current_mb = current / 1024 / 1024
    peak_mb = peak / 1024 / 1024
    print(f"[{note}] Current = {current_mb:.2f} MB | Peak = {peak_mb:.2f} MB")


# -------- Run EAGER version --------
tracemalloc.start()
print_memory("Before eager processing")
original_eager = generate_original_data_eager()
backdated_eager = generate_backdated_data_eager(original_eager)
print_memory("After eager processing")
tracemalloc.stop()

# -------- Run LAZY version --------
tracemalloc.start()
print_memory("Before lazy processing")
paged_data = generate_lazy_dataset_pages()
backdated_batches = generate_backdated_lazy(paged_data)
for year_batch in backdated_batches:
    for record in year_batch:
        _ = record.model_dump()  # Simulate sending to SQS
print_memory("After lazy processing")
tracemalloc.stop()
"""
You will see something like:
[Before eager processing] Current = 0.00 MB | Peak = 0.00 MB
[After eager processing] Current = 878.22 MB | Peak = 896.34 MB
[Before lazy processing] Current = 0.00 MB | Peak = 0.00 MB
[After lazy processing] Current = 0.11 MB | Peak = 0.20 MB
"""
💡 Conclusion
As products grow, we will face resource constraints on infrastructure. Of course we can scale up, particularly with tools like Infrastructure as Code (IaC), but that does not mean we cannot improve older implementations by leveraging what the language already offers. Every programming language keeps evolving as new utilities get added to it. Here, getting the basics right and leveraging Python generators cut peak memory usage to roughly an eighth, avoiding the need to upgrade the infrastructure and increase costs.