TOP 8 Python Libraries You Should Know: Based on experience



Python is the powerhouse programming language of data roles, and one of the reasons it's so popular is its rich ecosystem of libraries. The core idea behind these libraries is to save you time on building things so you can stay completely focused on delivering features. So here is my top 8, the libraries you should know, based on my programming journey (ETL, Big Data, Streaming, Backend).

itertools - Iteration Solutions LLC

This module is really helpful when your work, or part of your solution, involves loops, iterators, or any kind of combinatorial logic. I call these the four horsemen of iteration:

  • itertools.cycle: Creates an infinite loop over an iterable.
  • itertools.permutations: Generates all possible permutations of a sequence.
  • itertools.combinations: Generates all possible combinations of a sequence.
  • itertools.groupby: Groups elements in an iterable based on a key function.

Why use this lib instead of building your own implementation? Well, first... it's already there; it's memory efficient (everything is an iterator, so it doesn't load the whole thing into memory); and it's dependable because it's part of the standard library (it ships with your Python environment).

# Example: Generate all pairs of combinations
import itertools

data = [1, 2, 3]
combinations = list(itertools.combinations(data, 2))
print(combinations)  
# Output: [(1, 2), (1, 3), (2, 3)]
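
The example above covers combinations; here is a quick sketch of groupby and cycle as well (the sample records are made up for illustration):

# Example: groupby and cycle
import itertools

# groupby expects the iterable to already be sorted by the key
records = [("etl", 1), ("etl", 2), ("stream", 3)]
for key, group in itertools.groupby(records, key=lambda r: r[0]):
    print(key, [value for _, value in group])
# Output: etl [1, 2]
#         stream [3]

# cycle loops forever, so cap it with islice
colors = itertools.cycle(["red", "green", "blue"])
print(list(itertools.islice(colors, 5)))
# Output: ['red', 'green', 'blue', 'red', 'green']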

collections - Extended features for... I don't want to say it.

The collections module gives you some utilities that extend Python's core data types like dict, list, and tuple.

The most important ones, as shown in the sketch after this list, are:

  • defaultdict: A dictionary that supplies a default value for missing keys instead of raising a KeyError. (For validated defaults, there is another option: Pydantic models.)
  • Counter: A dictionary subclass for counting hashable objects.
  • deque: A double-ended queue for fast appends and pops from both ends.
  • namedtuple: Creates a tuple subclass with named fields.
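
Here is a quick sketch of all four in action (the sample data is made up for illustration):

# Example: the four collections utilities
from collections import defaultdict, Counter, deque, namedtuple

# defaultdict: missing keys get a default value instead of a KeyError
groups = defaultdict(list)
groups["etl"].append("job_1")

# Counter: count hashable objects
counts = Counter(["spark", "kafka", "spark"])
print(counts.most_common(1))  # Output: [('spark', 2)]

# deque: O(1) appends and pops from both ends
window = deque(maxlen=3)
for event in range(5):
    window.append(event)  # old items fall off the left end
print(window)  # Output: deque([2, 3, 4], maxlen=3)

# namedtuple: a tuple subclass with named fields
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)
print(p.x, p.y)  # Output: 1 2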

functools - Functional programming extension

The functools module is perfect for functional programming enthusiasts. It provides higher-order functions that act on or return other functions. I started using it because I was working with Scala and Python simultaneously on Spark. Three functions I have found very helpful:

  • functools.lru_cache: Memoization decorator that caches function results.
from functools import lru_cache

@lru_cache(maxsize=128)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# Usage example
print(fibonacci(10))  # First call: calculates everything
print(fibonacci(10))  # Second call: returns cached result instantly
print(fibonacci.cache_info())  # Shows cache statistics
  • functools.partial: Freezes some function arguments, creating a new function.
from functools import partial

# Basic example
def greet(greeting, name):
    return f"{greeting}, {name}!"

# Create a new function with 'Hello' frozen as the greeting
say_hello = partial(greet, "Hello")
print(say_hello("Alice"))  # Outputs: Hello, Alice!
print(say_hello("Bob"))    # Outputs: Hello, Bob!

# Practical example with file handling
def save_log(filename, mode, content):
    with open(filename, mode) as f:
        f.write(content)

# Create a function that always appends to a specific log file
append_to_app_log = partial(save_log, "app.log", "a")
append_to_app_log("Error occurred\n")  # Just pass the content
append_to_app_log("Operation completed\n")
  • functools.reduce: Applies a function cumulatively to items in a sequence.
from functools import reduce

# Let's say we want to add all numbers in a list
numbers = [1, 2, 3, 4, 5]

# Using reduce to sum all numbers
sum_result = reduce(lambda x, y: x + y, numbers)
print(sum_result)  # Outputs: 15

# To understand how it works, here's what happens step by step:
# 1. First iteration:  1 + 2 = 3
# 2. Second iteration: 3 + 3 = 6
# 3. Third iteration:  6 + 4 = 10
# 4. Fourth iteration: 10 + 5 = 15

pandas - Data manipulation and analysis

Pandas is essential for any data engineer, providing high-performance, easy-to-use data structures like DataFrames.
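
A minimal sketch of the typical filter-group-aggregate flow (the sample data is made up for illustration):

# Example: filter, group, and aggregate a DataFrame
import pandas as pd

df = pd.DataFrame({
    "city": ["Lima", "Cusco", "Lima"],
    "sales": [100, 50, 200],
})

# Keep rows with sales > 60, then total sales per city
totals = df[df["sales"] > 60].groupby("city")["sales"].sum()
print(totals)
# Output: city
#         Lima    300
#         Name: sales, dtype: int64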

SQLAlchemy - SQL toolkit and ORM

SQLAlchemy bridges the gap between Python and databases, allowing you to work with databases using Pythonic code.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker, declarative_base  # declarative_base moved to sqlalchemy.orm in 1.4+

# Create engine and session
engine = create_engine('postgresql://user:password@localhost:5432/mydatabase')
Session = sessionmaker(bind=engine)
session = Session()

# Declarative mapping: each class maps to a table
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

# Query data with Pythonic filters instead of raw SQL
users = session.query(User).filter(User.name.like('%Smith%')).all()

Apache Airflow - Workflow automation

While not just a library but a platform, Airflow is built on Python and essential for orchestrating complex data pipelines.

I wanted to include it because when part of your work involves event-driven data processing, RPAs, small pipelines, API integrations, or any batch of no more than ~1 GB, it makes a significant difference. The main benefit is organizing your pipeline as a trackable, visualizable dependency graph. Not only that: when a task fails, you can retry it and then re-run, in order, every downstream task that depends on it.
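
Here is a minimal sketch, assuming Airflow 2.4+ (where schedule replaced schedule_interval); the DAG id, task ids, and the extract/load callables are hypothetical placeholders:

# Example: a two-task DAG where load only runs after extract succeeds
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")

def load():
    print("writing data to the target")

with DAG(
    dag_id="tiny_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # the graph dependency: extract -> load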

PySpark - Distributed data processing

For big data processing, PySpark allows you to write Spark applications in Python. Consider this tool when your data grows beyond ~1 GB and can be chunked. Spark uses parallelism and a MapReduce-style approach to make data transformations faster. It is a very good tool: it will help you with really big batch files, or with lots of tiny ones (and with streaming pipelines down the road).
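
A minimal sketch of a distributed aggregation (the file path and column names are hypothetical placeholders):

# Example: read a big CSV in parallel and aggregate it
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# The read is distributed across the cluster
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Transformations are lazy; nothing runs until an action is called
daily_counts = df.groupBy("event_date").agg(F.count("*").alias("events"))

daily_counts.show()  # action: triggers the distributed computation
spark.stop()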

Apache Kafka Python (confluent-kafka) - Stream processing

For real-time data streaming and processing, confluent-kafka provides a high-performance Python client for Apache Kafka:
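
A minimal producer/consumer sketch (the broker address, topic name, and group id are hypothetical placeholders):

# Example: produce one message and read it back
from confluent_kafka import Producer, Consumer

# Produce a message
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key="user-1", value="signup")
producer.flush()  # block until delivery

# Consume messages
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

msg = consumer.poll(5.0)  # wait up to 5 seconds for a message
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())  # keys and values arrive as bytes
consumer.close()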