ERPL for Data Science: Building ML Pipelines with SAP Data
Summary
Discover how ERPL transforms SAP data integration for data science workflows. This comprehensive guide covers everything from data extraction to machine learning model deployment, showing how ERPL enables data scientists to work directly with SAP data using familiar tools and frameworks.
Introduction
Data science teams often struggle with SAP data integration: ETL pipelines are complex, extracts arrive late, and real-time data is hard to reach. ERPL removes these hurdles by providing direct, SQL-based access to SAP data through DuckDB, so data scientists can focus on analysis rather than data engineering.
Why ERPL for Data Science?
Traditional Challenges
- Complex ETL: Multiple tools and processes required
- Data Latency: Delayed access to updated data
- Limited Flexibility: Rigid data structures and formats
- High Costs: Expensive licensing and infrastructure
- Skill Gaps: Need for specialized SAP knowledge
ERPL Advantages
- Direct Access: Real-time SAP data access
- SQL Interface: Familiar query language
- High Performance: DuckDB's optimized execution
- Cost Effective: Single extension license
- Easy Integration: Works with popular data science tools (a minimal setup sketch follows below)
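Before running any of the examples below, the ERPL extension has to be loaded into your DuckDB session. The following Python sketch shows one way to do this; the extension name, the need for the unsigned-extensions flag, and the exact install procedure are assumptions here, so follow the ERPL installation documentation for your environment.
import duckdb

# Open (or create) a local DuckDB database for the analysis.
# Third-party extensions may require the unsigned-extensions flag (assumption).
conn = duckdb.connect("sap_analysis.db", config={"allow_unsigned_extensions": "true"})

# Load the ERPL extension so functions like sap_connect and sap_read_table
# become available (assumes ERPL was installed per its documentation).
conn.load_extension("erpl")
Once the extension is loaded, every SQL example in this guide can be executed through this connection.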
Data Science Workflow with ERPL
1. Data Extraction and Exploration
-- Connect to SAP and explore data
SELECT sap_connect(
host = 'sap-server.com',
system_number = '00',
client = '100',
user = 'data_scientist',
password = 'password'
);
-- Explore available tables
SELECT * FROM sap_show_tables(TABNAME='*CUSTOMER*');
-- Get table metadata
SELECT * FROM sap_show_table_columns(TABNAME='KNA1');
-- Extract customer data for analysis
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
LIMIT 1000;
2. Data Quality Assessment
-- Assess data quality
WITH customer_data AS (
SELECT * FROM sap_read_table('KNA1')
)
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT KUNNR) as unique_customers,
COUNT(CASE WHEN NAME1 IS NULL THEN 1 END) as missing_names,
COUNT(CASE WHEN LAND1 IS NULL THEN 1 END) as missing_countries,
COUNT(CASE WHEN ORT01 IS NULL THEN 1 END) as missing_cities
FROM customer_data;
3. Feature Engineering
-- Create features for machine learning
CREATE VIEW customer_features AS
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region,
-- Geographic features
CASE
WHEN LAND1 IN ('DE', 'AT', 'CH') THEN 'DACH'
WHEN LAND1 IN ('US', 'CA') THEN 'NORTH_AMERICA'
WHEN LAND1 IN ('GB', 'FR', 'IT', 'ES') THEN 'WESTERN_EUROPE'
ELSE 'OTHER'
END as geographic_region,
-- Text features
LENGTH(NAME1) as name_length,
CASE
WHEN NAME1 LIKE '%GmbH%' THEN 'COMPANY'
WHEN NAME1 LIKE '%AG%' THEN 'COMPANY'
ELSE 'INDIVIDUAL'
END as customer_type
FROM sap_read_table('KNA1');
Python Integration
1. Data Extraction with DuckDB
import duckdb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Connect to DuckDB
conn = duckdb.connect('sap_analysis.db')
# Extract customer data
customer_data = conn.execute("""
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
""").fetchdf()
print(f"Extracted {len(customer_data)} customer records")
print(customer_data.head())
2. Sales Data Analysis
# Extract sales data (order headers joined with line items, since material and
# quantity live on the item table VBAP rather than on the header table VBAK)
sales_data = conn.execute("""
SELECT
h.KUNNR as customer_id,
i.MATNR as material,
i.NETWR as net_value,
i.KWMENG as quantity,
h.ERDAT as order_date
FROM sap_read_table('VBAK') h
JOIN sap_read_table('VBAP') i ON h.VBELN = i.VBELN
WHERE h.ERDAT >= '2024-01-01'
""").fetchdf()
# Analyze sales patterns
sales_summary = sales_data.groupby('customer_id').agg({
'net_value': ['sum', 'mean', 'count'],
'quantity': 'sum'
}).round(2)
print("Sales Summary by Customer:")
print(sales_summary.head())
3. Customer Segmentation
# Create customer segmentation features
segmentation_features = conn.execute("""
WITH customer_sales AS (
SELECT
KUNNR as customer_id,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
COUNT(*) as order_count,
MAX(ERDAT) as last_order_date
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2024-01-01'
GROUP BY KUNNR
)
SELECT
cs.customer_id,
cs.total_sales,
cs.avg_order_value,
cs.order_count,
cs.last_order_date,
-- RFM Features
CASE
WHEN cs.total_sales > 100000 THEN 'HIGH_VALUE'
WHEN cs.total_sales > 10000 THEN 'MEDIUM_VALUE'
ELSE 'LOW_VALUE'
END as value_segment,
CASE
WHEN cs.order_count > 50 THEN 'FREQUENT'
WHEN cs.order_count > 10 THEN 'REGULAR'
ELSE 'OCCASIONAL'
END as frequency_segment
FROM customer_sales cs
""").fetchdf()
print("Customer Segmentation Features:")
print(segmentation_features.head())
Machine Learning Pipeline
1. Churn Prediction Model
# Prepare data for churn prediction
churn_data = conn.execute("""
WITH customer_metrics AS (
SELECT
KUNNR as customer_id,
COUNT(*) as total_orders,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
MAX(ERDAT) as last_order_date,
MIN(ERDAT) as first_order_date,
-- Calculate days since last order
DATEDIFF('day', MAX(ERDAT), CURRENT_DATE) as days_since_last_order
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
)
SELECT
customer_id,
total_orders,
total_sales,
avg_order_value,
days_since_last_order,
-- Create churn label (customers with no orders in last 90 days)
CASE
WHEN days_since_last_order > 90 THEN 1
ELSE 0
END as churn_label
FROM customer_metrics
WHERE total_orders >= 5 -- Only customers with sufficient history
""").fetchdf()
# Prepare features for ML (note: days_since_last_order also defines the churn label,
# so for a production model you would drop it or derive the label from a later period)
features = ['total_orders', 'total_sales', 'avg_order_value', 'days_since_last_order']
X = churn_data[features]
y = churn_data['churn_label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print("Churn Prediction Model Performance:")
print(classification_report(y_test, y_pred))
2. Sales Forecasting
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Prepare time series data
time_series_data = conn.execute("""
SELECT
DATE(ERDAT) as order_date,
SUM(NETWR) as daily_sales,
COUNT(*) as daily_orders
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2024-01-01'
GROUP BY DATE(ERDAT)
ORDER BY order_date
""").fetchdf()
# Create features for forecasting
time_series_data['day_of_week'] = pd.to_datetime(time_series_data['order_date']).dt.dayofweek
time_series_data['month'] = pd.to_datetime(time_series_data['order_date']).dt.month
time_series_data['lag_1'] = time_series_data['daily_sales'].shift(1)
time_series_data['lag_7'] = time_series_data['daily_sales'].shift(7)
# Remove rows with NaN values
time_series_data = time_series_data.dropna()
# Prepare features
features = ['day_of_week', 'month', 'lag_1', 'lag_7']
X = time_series_data[features]
y = time_series_data['daily_sales']
# Split data
split_point = int(len(X) * 0.8)
X_train, X_test = X[:split_point], X[split_point:]
y_train, y_test = y[:split_point], y[split_point:]
# Train forecasting model
forecast_model = LinearRegression()
forecast_model.fit(X_train, y_train)
# Make predictions
y_pred = forecast_model.predict(X_test)
# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Sales Forecasting Model Performance:")
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test.values, label='Actual', alpha=0.7)
plt.plot(y_test.index, y_pred, label='Predicted', alpha=0.7)
plt.title('Sales Forecasting Results')
plt.xlabel('Time')
plt.ylabel('Daily Sales')
plt.legend()
plt.show()
R Integration
1. Data Extraction with R
library(DBI)
library(dplyr)
library(ggplot2)
library(randomForest)
# Connect to DuckDB
con <- dbConnect(duckdb::duckdb(), "sap_analysis.db")
# Extract customer data
customer_data <- dbGetQuery(con, "
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
")
# Extract sales data (order headers joined with line items, since material and
# quantity live on VBAP rather than VBAK)
sales_data <- dbGetQuery(con, "
SELECT
h.KUNNR as customer_id,
i.MATNR as material,
i.NETWR as net_value,
i.KWMENG as quantity,
h.ERDAT as order_date
FROM sap_read_table('VBAK') h
JOIN sap_read_table('VBAP') i ON h.VBELN = i.VBELN
WHERE h.ERDAT >= '2024-01-01'
")
print(paste("Extracted", nrow(customer_data), "customer records"))
print(paste("Extracted", nrow(sales_data), "sales records"))
2. Customer Analysis
# Customer analysis
customer_analysis <- sales_data %>%
group_by(customer_id) %>%
summarise(
total_sales = sum(net_value),
avg_order_value = mean(net_value),
order_count = n(),
last_order_date = max(order_date)
) %>%
mutate(
value_segment = case_when(
total_sales > 100000 ~ "HIGH_VALUE",
total_sales > 10000 ~ "MEDIUM_VALUE",
TRUE ~ "LOW_VALUE"
),
frequency_segment = case_when(
order_count > 50 ~ "FREQUENT",
order_count > 10 ~ "REGULAR",
TRUE ~ "OCCASIONAL"
)
)
# Visualize customer segments
ggplot(customer_analysis, aes(x = value_segment, y = frequency_segment)) +
geom_count() +
labs(title = "Customer Segmentation",
x = "Value Segment",
y = "Frequency Segment") +
theme_minimal()
3. Predictive Modeling
# Prepare data for modeling (churn_risk is derived directly from order_count,
# so treat this as an illustrative segmentation rather than a leak-free model;
# it must be a factor for randomForest to run classification)
model_data <- customer_analysis %>%
mutate(
churn_risk = factor(case_when(
order_count < 5 ~ "HIGH",
order_count < 15 ~ "MEDIUM",
TRUE ~ "LOW"
))
)
# Train random forest model
set.seed(42)
model <- randomForest(
churn_risk ~ total_sales + avg_order_value + order_count,
data = model_data,
ntree = 100
)
# Model importance
importance_df <- data.frame(
feature = rownames(importance(model)),
importance = importance(model)[, 1]
)
ggplot(importance_df, aes(x = reorder(feature, importance), y = importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Feature Importance for Churn Prediction",
x = "Feature",
y = "Importance") +
theme_minimal()
Advanced Analytics
1. Market Basket Analysis
-- Extract order items for market basket analysis
CREATE VIEW order_items AS
SELECT
VBELN as sales_document,
POSNR as item_number,
MATNR as material,
KWMENG as quantity,
NETWR as net_value
FROM sap_read_table('VBAP')
WHERE ERDAT >= '2024-01-01';
-- Find frequently co-purchased items
WITH item_pairs AS (
SELECT
o1.material as item1,
o2.material as item2,
COUNT(*) as co_purchase_count
FROM order_items o1
JOIN order_items o2 ON o1.sales_document = o2.sales_document
WHERE o1.material < o2.material -- Avoid duplicates
GROUP BY o1.material, o2.material
HAVING COUNT(*) >= 10 -- Minimum co-purchase threshold
)
SELECT
item1,
item2,
co_purchase_count
FROM item_pairs
ORDER BY co_purchase_count DESC
LIMIT 20;
2. Customer Lifetime Value
-- Calculate customer lifetime value
WITH customer_lifetime_value AS (
SELECT
KUNNR as customer_id,
COUNT(*) as total_orders,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
MIN(ERDAT) as first_order_date,
MAX(ERDAT) as last_order_date,
DATEDIFF('day', MIN(ERDAT), MAX(ERDAT)) as customer_lifespan_days,
-- Simple annualized CLV; NULLIF guards against a zero-day lifespan
SUM(NETWR) * (365.0 / NULLIF(DATEDIFF('day', MIN(ERDAT), MAX(ERDAT)), 0)) as annual_clv
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
HAVING COUNT(*) >= 3 -- Customers with at least 3 orders
)
SELECT
customer_id,
total_orders,
total_sales,
avg_order_value,
customer_lifespan_days,
annual_clv,
CASE
WHEN annual_clv > 50000 THEN 'HIGH_VALUE'
WHEN annual_clv > 10000 THEN 'MEDIUM_VALUE'
ELSE 'LOW_VALUE'
END as clv_segment
FROM customer_lifetime_value
ORDER BY annual_clv DESC;
Model Deployment
1. Model Serving with FastAPI
from fastapi import FastAPI
import duckdb
import pickle
import pandas as pd
# Load trained model
with open('churn_model.pkl', 'rb') as f:
    model = pickle.load(f)
# Initialize FastAPI app
app = FastAPI()
# Connect to DuckDB
conn = duckdb.connect('sap_analysis.db')
@app.post("/predict_churn")
async def predict_churn(customer_id: str):
    # Get customer features
    features = conn.execute("""
        SELECT
            total_orders,
            total_sales,
            avg_order_value,
            days_since_last_order
        FROM customer_features
        WHERE customer_id = ?
    """, [customer_id]).fetchdf()
    if features.empty:
        return {"error": "Customer not found"}
    # Make prediction
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0][1]
    return {
        "customer_id": customer_id,
        "churn_prediction": int(prediction),
        "churn_probability": float(probability)
    }
@app.get("/customer_segments")
async def get_customer_segments():
    segments = conn.execute("""
        SELECT
            clv_segment,
            COUNT(*) as customer_count,
            AVG(annual_clv) as avg_clv
        FROM customer_lifetime_value
        GROUP BY clv_segment
    """).fetchdf()
    return segments.to_dict('records')
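Both endpoints assume that customer_features (holding the churn-model inputs) and customer_lifetime_value exist as persistent relations in sap_analysis.db, which nothing earlier in this guide creates under those names. A minimal preparation sketch, assuming it is acceptable to materialize the earlier queries as tables; the table names and the replacement of the earlier geographic customer_features view are assumptions, not part of ERPL itself.
# One-off preparation step (a sketch reusing the churn-metric query shown earlier).
# Note: this replaces the geographic customer_features view from the feature-engineering
# step with a table holding the inputs the prediction endpoint expects.
conn.execute("DROP VIEW IF EXISTS customer_features")
conn.execute("""
CREATE OR REPLACE TABLE customer_features AS
SELECT
    KUNNR AS customer_id,
    COUNT(*) AS total_orders,
    SUM(NETWR) AS total_sales,
    AVG(NETWR) AS avg_order_value,
    DATEDIFF('day', MAX(ERDAT), CURRENT_DATE) AS days_since_last_order
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
""")
# The /customer_segments endpoint also needs customer_lifetime_value: wrap the CLV
# query from the Advanced Analytics section in CREATE OR REPLACE TABLE the same way.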
2. Automated Model Retraining
import schedule
import time
from datetime import datetime
def retrain_models():
    """Retrain models with latest data"""
    print(f"Retraining models at {datetime.now()}")
    # Extract latest data
    latest_data = conn.execute("""
        SELECT * FROM sap_read_table('VBAK')
        WHERE ERDAT >= CURRENT_DATE - INTERVAL '30 days'
    """).fetchdf()
    # Retrain churn model
    # ... training code ...
    # Save updated model
    with open('churn_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    print("Models retrained successfully")
# Schedule retraining
schedule.every().day.at("02:00").do(retrain_models)
# Run scheduler
while True:
    schedule.run_pending()
    time.sleep(60)
Best Practices
1. Data Quality
- Validation: Implement data quality checks (see the sketch after this list)
- Cleaning: Handle missing values and outliers
- Monitoring: Track data quality metrics
- Documentation: Document data transformations
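A minimal data-quality check in Python, assuming the customer_data frame from the extraction step; the thresholds and the helper name are purely illustrative.
# Basic data-quality report on an extracted DataFrame (illustrative thresholds).
def quality_report(df, key_column="customer_id", max_null_ratio=0.05):
    report = {
        "rows": len(df),
        "duplicate_keys": int(df[key_column].duplicated().sum()),
        "null_ratio_per_column": df.isna().mean().round(3).to_dict(),
    }
    # Flag columns whose share of missing values exceeds the threshold
    report["columns_over_threshold"] = [
        col for col, ratio in report["null_ratio_per_column"].items()
        if ratio > max_null_ratio
    ]
    return report

print(quality_report(customer_data))
Running such a report after every extraction makes quality regressions visible before they reach a model.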
2. Model Performance
- Validation: Use cross-validation for model evaluation (sketched below)
- Monitoring: Track model performance over time
- Retraining: Implement automated retraining
- A/B Testing: Test model improvements
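For the churn model above, cross-validation takes only a few lines with scikit-learn; this sketch reuses the X and y frames from the churn-prediction step and uses an illustrative F1 scoring choice.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# 5-fold stratified cross-validation of the churn classifier
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=cv, scoring="f1",
)
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")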
3. Scalability
- Incremental Processing: Process data in batches (see the example after this list)
- Caching: Cache frequently used data
- Parallel Processing: Use multiple cores for training
- Resource Management: Monitor memory and CPU usage
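One simple way to process order data incrementally is to pull it month by month instead of in a single query. A sketch, assuming the conn connection from earlier and that ERDAT can be compared against dates as in the previous examples.
import pandas as pd

# Pull order headers one month at a time and append the results (illustrative pattern).
months = pd.date_range("2024-01-01", "2024-07-01", freq="MS")
batches = []
for start, end in zip(months[:-1], months[1:]):
    batch = conn.execute(
        """
        SELECT KUNNR, NETWR, ERDAT
        FROM sap_read_table('VBAK')
        WHERE ERDAT >= ? AND ERDAT < ?
        """,
        [start.date(), end.date()],
    ).fetchdf()
    batches.append(batch)

orders = pd.concat(batches, ignore_index=True)
print(f"Loaded {len(orders)} rows in {len(batches)} monthly batches")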
4. Security
- Access Control: Implement proper authentication
- Data Encryption: Encrypt sensitive data
- Audit Logging: Log all model predictions (a logging sketch follows below)
- Compliance: Ensure regulatory compliance
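Audit logging of predictions can be as simple as writing one structured line per request. A minimal sketch for the FastAPI endpoint above; the file name, helper name, and logged fields are illustrative.
import json
import logging
from datetime import datetime, timezone

# Write one JSON line per prediction to an audit log file (illustrative setup).
audit_logger = logging.getLogger("churn_audit")
audit_logger.setLevel(logging.INFO)
audit_logger.addHandler(logging.FileHandler("churn_predictions_audit.log"))

def log_prediction(customer_id: str, prediction: int, probability: float) -> None:
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "churn_prediction": prediction,
        "churn_probability": probability,
    }))
Calling log_prediction from predict_churn before returning the response keeps a record of every scored customer.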
Conclusion
ERPL transforms SAP data integration for data science by providing:
- Direct Access: Real-time SAP data without ETL complexity
- High Performance: DuckDB's optimized execution engine
- Easy Integration: Works seamlessly with Python, R, and other tools
- Cost Effective: Single extension license vs. multiple tools
The combination of ERPL's SAP connectivity and DuckDB's analytical capabilities creates a powerful platform for data science workflows.
Next Steps
Ready to start your data science journey with ERPL? Here's how to get started:
- Install ERPL: Get started with installation
- Explore Examples: Review key tasks
- Build Your Pipeline: Use the provided examples as templates
- Contact Us: Get help with your specific requirements
Additional Resources
- ERPL Documentation
- DuckDB Documentation
- Scikit-learn Documentation
- R Documentation
- Contact Our Team for personalized guidance
