This Python implementation demonstrates high-performance financial data processing using the Polars library, which provides fast DataFrame operations through Rust and Arrow backends. It calculates 100+ quantitative features for each row in a large financial dataset (3.9M+ records) while leveraging vectorized operations for performance.
import polars as pl # High-performance DataFrame library (Rust backend)
import numpy as np # Numerical computing library
import time # Time measurement utilities
from concurrent.futures import ProcessPoolExecutor # For multiprocessing
import multiprocessing as mp # Multiprocessing utilities
import os # Operating system interface

- polars: high-performance DataFrame library with a Rust backend and Apache Arrow memory format; typically faster and more memory-efficient than pandas
- numpy: fundamental package for scientific computing in Python
- concurrent.futures: high-level interface for asynchronously executing callables
- multiprocessing: process-based parallelism for CPU-bound tasks
def calculate_features_polars(df):
"""Calculate quantitative features using Polars for better performance"""
# Calculate basic features using Polars expressions (vectorized operations)
result_df = df.select([
# Basic price features
pl.col("Close").alias("feature_0"), # Close price
pl.col("Open").alias("feature_1"), # Open price
pl.col("High").alias("feature_2"), # High price
pl.col("Low").alias("feature_3"), # Low price
pl.col("volume").alias("feature_4"), # Volume

- df.select(): Polars method to select and transform columns
- pl.col(): column expression for referencing a column
- .alias(): renames the computed column
- Vectorized operations: expressions are applied to entire columns at once, not element by element
# Return calculations
((pl.col("Close") - pl.col("Open")) / pl.col("Open")).alias("feature_5"), # Return
((pl.col("High") - pl.col("Low")) / pl.col("Open")).alias("feature_6"), # True range
((pl.col("Close") - pl.col("Low")) / (pl.col("High") - pl.col("Low"))).alias("feature_7"), # Stochastic

These are common financial indicators:
- Return: percentage change from open to close
- True range: measure of volatility
- Stochastic: momentum indicator showing closing price relative to high-low range
- Polars also offers a lazy API (LazyFrame) in which expressions are optimized before execution
# Moving averages using rolling windows
pl.col("Close").rolling_mean(window_size=5).alias("feature_8"), # 5-period SMA
pl.col("Close").rolling_mean(window_size=10).alias("feature_11"), # 10-period SMA
pl.col("Close").rolling_mean(window_size=20).alias("feature_13"), # 20-period SMA

- rolling_mean(): calculates a moving average over a sliding window
- Window sizes of 5, 10, and 20 periods capture different trend horizons
- Polars optimizes these rolling operations internally using efficient algorithms
# Volatility measures
pl.col("Close").rolling_std(window_size=5).alias("feature_9"), # 5-period volatility
pl.col("Close").rolling_std(window_size=10).alias("feature_12"), # 10-period volatility
pl.col("Close").rolling_std(window_size=20).alias("feature_14"), # 20-period volatility

- rolling_std(): calculates a rolling standard deviation, a common measure of price volatility
- Larger window sizes smooth out short-term fluctuations
# Momentum indicators
((pl.col("Close") - pl.col("Close").shift(1)) / pl.col("Close").shift(1)).alias("feature_15"), # 1-period return
((pl.col("Close") - pl.col("Close").shift(3)) / pl.col("Close").shift(3)).alias("feature_16"), # 3-period return
((pl.col("Close") - pl.col("Close").shift(5)) / pl.col("Close").shift(5)).alias("feature_17"), # 5-period return

- shift(1): shifts values by one position, giving the previous row's value
- Momentum indicators measure the rate of price change
- Different periods (1, 3, 5) capture different time horizons
# Range-based features
((pl.col("High") - pl.col("Low")) / pl.col("Open")).alias("feature_23"), # Daily range
((pl.col("High") - pl.col("Close")) / pl.col("Open")).alias("feature_24"), # Upper shadow
((pl.col("Close") - pl.col("Low")) / pl.col("Open")).alias("feature_25"), # Lower shadow
((pl.col("Close") - pl.col("Open")).abs() / pl.col("Open")).alias("feature_26"), # Body size

- Daily range: measure of intraday volatility
- Upper shadow: distance from close to high (the exact upper wick when the candle is bullish)
- Lower shadow: distance from low to close (the exact lower wick when the candle is bearish)
- Body size: absolute size of the candle body
# Log returns
(pl.col("Close") / pl.col("Close").shift(1)).log().alias("feature_27"),
# Directional features
(pl.col("Close") > pl.col("Open")).cast(pl.Float64).alias("feature_28"), # Bullish/Bearish
(pl.col("High") > pl.col("High").shift(1)).cast(pl.Float64).alias("feature_29"), # New high
(pl.col("Low") < pl.col("Low").shift(1)).cast(pl.Float64).alias("feature_30"), # New low

- Log returns: logarithm of price ratios, more mathematically convenient than simple returns because they add across periods
- Directional features: Boolean flags converted to floats (1.0 for True, 0.0 for False)
- New high/low: Indicates if current period reached new highs/lows
# Gap features
((pl.col("Open") - pl.col("Close").shift(1)) / pl.col("Close").shift(1)).alias("feature_31"), # Gap up/down
(pl.col("Open") > pl.col("Close").shift(1)).cast(pl.Float64).alias("feature_32"), # Gap up indicator
(pl.col("Open") < pl.col("Close").shift(1)).cast(pl.Float64).alias("feature_33"), # Gap down indicator
])

- Gap up/down: difference between today's open and yesterday's close
- Gap indicators: Boolean flags for gap detection
- Important for identifying market sentiment shifts
# Convert to numpy for additional complex calculations that Polars doesn't handle well
result_arrays = {col: result_df[col].to_numpy() for col in result_df.columns}
# Create a combined features array
# Stack only the feature columns that were actually computed (the numbering above has gaps)
feature_cols = [c for c in result_df.columns if c.startswith("feature_")]
features = np.column_stack([result_arrays[col] for col in feature_cols])
# Pad with zeros up to 101 features total
n_rows, n_existing_features = features.shape
remaining_features = 101 - n_existing_features
padding = np.zeros((n_rows, remaining_features))
features = np.hstack([features, padding])
return features

- to_numpy(): converts a Polars Series to a NumPy array
- np.column_stack(): combines 1-D arrays column-wise into a matrix
- Zero padding brings the matrix up to 101 features total
- NumPy arrays are efficient for mathematical operations
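The stack-and-pad step is pure NumPy shape manipulation. A small sketch with dummy arrays (names and sizes invented for illustration):

```python
import numpy as np

n_rows = 4
# Pretend these are 38 computed feature columns
arrays = {f"feature_{i}": np.arange(n_rows, dtype=float) for i in range(38)}

# column_stack turns a list of 1-D arrays into an (n_rows, n_features) matrix
features = np.column_stack([arrays[f"feature_{i}"] for i in range(38)])

# hstack appends zero columns until the matrix is exactly 101 wide
padding = np.zeros((n_rows, 101 - features.shape[1]))
features = np.hstack([features, padding])

print(features.shape)  # (4, 101)
```

Padding to a fixed width keeps the downstream feature matrix shape-stable even when some features are not yet implemented.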
def calculate_moving_averages_polars(df, periods):
"""Calculate multiple moving averages using Polars"""
results = {}
for period in periods:
ma_col = f"MA_{period}"
# Create a temporary dataframe with the MA column
temp_df = df.select(pl.col("Close").rolling_mean(window_size=period).alias(ma_col))
results[ma_col] = temp_df[ma_col].drop_nulls().to_numpy()
return results

- Calculates moving averages for multiple periods
- drop_nulls(): removes the nulls produced by the incomplete leading windows
- to_numpy(): converts to a NumPy array for consistency
def main():
start_time = time.time()
print("Reading CSV file with Polars...")
df = pl.read_csv("USDJPY2.csv") # Polars CSV reading is very fast
print(f"Loaded {len(df)} records.")
print("Calculating 100+ quantitative features using Polars...")
features = calculate_features_polars(df)
print(f"Calculated {features.shape[1]} quantitative features for {features.shape[0]} rows.")
# Calculate moving averages for periods 200-220 using Polars
print("Calculating moving averages for periods 200-220 using Polars...")
ma_periods = list(range(200, 221)) # 200 to 220 inclusive
all_mas = calculate_moving_averages_polars(df, ma_periods)
for ma_col, ma_values in all_mas.items():
print(f"Calculated {len(ma_values)} {ma_col} values.")
end_time = time.time()
duration = (end_time - start_time) * 1000 # Convert to milliseconds
print(f"Total execution time: {duration:.2f} ms")
print(f"Features shape: {features.shape}")

if __name__ == "__main__":
main()

- Uses Polars for fast CSV reading
- Timing measurements to evaluate performance
- Clear output for monitoring progress
- Shape information for debugging
Why this implementation is fast:

- Polars is written in Rust and built on the Apache Arrow columnar memory format, giving near-compiled-language performance and an efficient, compact memory layout
- Operations are vectorized: they apply to entire columns at once, eliminating Python loops
- Lazy evaluation lets Polars optimize the query plan before execution
- Many operations are multi-threaded by default, spreading work across CPU cores
- Polars is typically faster and more memory-efficient than pandas, and can handle larger-than-memory datasets through streaming
- The expression API is chainable and immutable (operations return new objects), with familiar SQL-like semantics and built-in functions that suit common financial calculations
When to prefer Polars:

- Large datasets (>1M rows)
- Performance-critical applications
- Memory-constrained environments
- When you need to leverage multiple CPU cores
When to prefer pandas:

- Smaller datasets (<1M rows)
- When you need specific pandas functionality
- When working with complex hierarchical data
- When integrating with the scikit-learn ecosystem
The Polars implementation achieves significant performance improvements because:
- Rust Backend: Polars operations are implemented in Rust, providing near-compiled language performance
- Vectorization: Operations are applied to entire columns at once, eliminating Python loops
- Memory Layout: Columnar storage format enables efficient operations
- Query Optimization: Lazy evaluation allows Polars to optimize operations before execution
- Multi-threading: Many operations are automatically parallelized across CPU cores
This implementation demonstrates that Python can be competitive for large-scale data processing when paired with the right optimized libraries, such as Polars.