ERPL for Data Science: Building ML Pipelines with SAP Data
Summary
Discover how ERPL transforms SAP data integration for data science workflows. This comprehensive guide covers everything from data extraction to machine learning model deployment, showing how ERPL enables data scientists to work directly with SAP data using familiar tools and frameworks.
Introduction
Data science teams often struggle with SAP data integration: ETL pipelines are complex, extracts arrive late, and real-time data is hard to reach. ERPL removes these hurdles by providing direct, SQL-based access to SAP data through DuckDB, so data scientists can focus on analysis rather than data engineering.
Why ERPL for Data Science?
Traditional Challenges
- Complex ETL: Multiple tools and processes required
- Data Latency: Delayed access to updated data
- Limited Flexibility: Rigid data structures and formats
- High Costs: Expensive licensing and infrastructure
- Skill Gaps: Need for specialized SAP knowledge
ERPL Advantages
- Direct Access: Real-time SAP data access
- SQL Interface: Familiar query language
- High Performance: DuckDB's optimized execution
- Cost Effective: Single extension license
- Easy Integration: Works with popular data science tools (a minimal setup sketch follows below)
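Before running any of the examples below, the ERPL extension has to be loaded into your DuckDB session. The following Python sketch shows one way to do this; the extension name, the need for the unsigned-extensions flag, and the exact install procedure are assumptions here, so follow the ERPL installation documentation for your environment.
import duckdb

# Open (or create) a local DuckDB database for the analysis.
# Third-party extensions may require the unsigned-extensions flag (assumption).
conn = duckdb.connect("sap_analysis.db", config={"allow_unsigned_extensions": "true"})

# Load the ERPL extension so functions like sap_connect and sap_read_table
# become available (assumes ERPL was installed per its documentation).
conn.load_extension("erpl")
Once the extension is loaded, every SQL example in this guide can be executed through this connection.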
Data Science Workflow with ERPL
1. Data Extraction and Exploration
-- Connect to SAP and explore data
SELECT sap_connect(
host = 'sap-server.com',
system_number = '00',
client = '100',
user = 'data_scientist',
password = 'password'
);
-- Explore available tables
SELECT * FROM sap_show_tables(TABNAME='*CUSTOMER*');
-- Get table metadata
SELECT * FROM sap_show_table_columns(TABNAME='KNA1');
-- Extract customer data for analysis
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
LIMIT 1000;
2. Data Quality Assessment
-- Assess data quality
WITH customer_data AS (
SELECT * FROM sap_read_table('KNA1')
)
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT KUNNR) as unique_customers,
COUNT(CASE WHEN NAME1 IS NULL THEN 1 END) as missing_names,
COUNT(CASE WHEN LAND1 IS NULL THEN 1 END) as missing_countries,
COUNT(CASE WHEN ORT01 IS NULL THEN 1 END) as missing_cities
FROM customer_data;
3. Feature Engineering
-- Create features for machine learning
CREATE VIEW customer_features AS
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region,
-- Geographic features
CASE
WHEN LAND1 IN ('DE', 'AT', 'CH') THEN 'DACH'
WHEN LAND1 IN ('US', 'CA') THEN 'NORTH_AMERICA'
WHEN LAND1 IN ('GB', 'FR', 'IT', 'ES') THEN 'WESTERN_EUROPE'
ELSE 'OTHER'
END as geographic_region,
-- Text features
LENGTH(NAME1) as name_length,
CASE
WHEN NAME1 LIKE '%GmbH%' THEN 'COMPANY'
WHEN NAME1 LIKE '%AG%' THEN 'COMPANY'
ELSE 'INDIVIDUAL'
END as customer_type
FROM sap_read_table('KNA1');
Python Integration
1. Data Extraction with DuckDB
import duckdb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Connect to DuckDB
conn = duckdb.connect('sap_analysis.db')
# Extract customer data
customer_data = conn.execute("""
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
""").fetchdf()
print(f"Extracted {len(customer_data)} customer records")
print(customer_data.head())
2. Sales Data Analysis
# Extract sales data (order headers joined with line items, since material and
# quantity live on the item table VBAP rather than on the header table VBAK)
sales_data = conn.execute("""
SELECT
h.KUNNR as customer_id,
i.MATNR as material,
i.NETWR as net_value,
i.KWMENG as quantity,
h.ERDAT as order_date
FROM sap_read_table('VBAK') h
JOIN sap_read_table('VBAP') i ON h.VBELN = i.VBELN
WHERE h.ERDAT >= '2024-01-01'
""").fetchdf()
# Analyze sales patterns
sales_summary = sales_data.groupby('customer_id').agg({
'net_value': ['sum', 'mean', 'count'],
'quantity': 'sum'
}).round(2)
print("Sales Summary by Customer:")
print(sales_summary.head())
3. Customer Segmentation
# Create customer segmentation features
segmentation_features = conn.execute("""
WITH customer_sales AS (
SELECT
KUNNR as customer_id,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
COUNT(*) as order_count,
MAX(ERDAT) as last_order_date
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2024-01-01'
GROUP BY KUNNR
)
SELECT
cs.customer_id,
cs.total_sales,
cs.avg_order_value,
cs.order_count,
cs.last_order_date,
-- RFM Features
CASE
WHEN cs.total_sales > 100000 THEN 'HIGH_VALUE'
WHEN cs.total_sales > 10000 THEN 'MEDIUM_VALUE'
ELSE 'LOW_VALUE'
END as value_segment,
CASE
WHEN cs.order_count > 50 THEN 'FREQUENT'
WHEN cs.order_count > 10 THEN 'REGULAR'
ELSE 'OCCASIONAL'
END as frequency_segment
FROM customer_sales cs
""").fetchdf()
print("Customer Segmentation Features:")
print(segmentation_features.head())
Machine Learning Pipeline
1. Churn Prediction Model
# Prepare data for churn prediction
churn_data = conn.execute("""
WITH customer_metrics AS (
SELECT
KUNNR as customer_id,
COUNT(*) as total_orders,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
MAX(ERDAT) as last_order_date,
MIN(ERDAT) as first_order_date,
-- Calculate days since last order
DATEDIFF('day', MAX(ERDAT), CURRENT_DATE) as days_since_last_order
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
)
SELECT
customer_id,
total_orders,
total_sales,
avg_order_value,
days_since_last_order,
-- Create churn label (customers with no orders in last 90 days)
CASE
WHEN days_since_last_order > 90 THEN 1
ELSE 0
END as churn_label
FROM customer_metrics
WHERE total_orders >= 5 -- Only customers with sufficient history
""").fetchdf()
# Prepare features for ML (note: days_since_last_order also defines the churn label,
# so for a production model you would drop it or derive the label from a later period)
features = ['total_orders', 'total_sales', 'avg_order_value', 'days_since_last_order']
X = churn_data[features]
y = churn_data['churn_label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print("Churn Prediction Model Performance:")
print(classification_report(y_test, y_pred))
2. Sales Forecasting
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Prepare time series data
time_series_data = conn.execute("""
SELECT
DATE(ERDAT) as order_date,
SUM(NETWR) as daily_sales,
COUNT(*) as daily_orders
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2024-01-01'
GROUP BY DATE(ERDAT)
ORDER BY order_date
""").fetchdf()
# Create features for forecasting
time_series_data['day_of_week'] = pd.to_datetime(time_series_data['order_date']).dt.dayofweek
time_series_data['month'] = pd.to_datetime(time_series_data['order_date']).dt.month
time_series_data['lag_1'] = time_series_data['daily_sales'].shift(1)
time_series_data['lag_7'] = time_series_data['daily_sales'].shift(7)
# Remove rows with NaN values
time_series_data = time_series_data.dropna()
# Prepare features
features = ['day_of_week', 'month', 'lag_1', 'lag_7']
X = time_series_data[features]
y = time_series_data['daily_sales']
# Split data
split_point = int(len(X) * 0.8)
X_train, X_test = X[:split_point], X[split_point:]
y_train, y_test = y[:split_point], y[split_point:]
# Train forecasting model
forecast_model = LinearRegression()
forecast_model.fit(X_train, y_train)
# Make predictions
y_pred = forecast_model.predict(X_test)
# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Sales Forecasting Model Performance:")
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test.values, label='Actual', alpha=0.7)
plt.plot(y_test.index, y_pred, label='Predicted', alpha=0.7)
plt.title('Sales Forecasting Results')
plt.xlabel('Time')
plt.ylabel('Daily Sales')
plt.legend()
plt.show()
R Integration
1. Data Extraction with R
library(DBI)
library(dplyr)
library(ggplot2)
library(randomForest)
# Connect to DuckDB
con <- dbConnect(duckdb::duckdb(), "sap_analysis.db")
# Extract customer data
customer_data <- dbGetQuery(con, "
SELECT
KUNNR as customer_id,
NAME1 as customer_name,
LAND1 as country,
ORT01 as city,
REGIO as region
FROM sap_read_table('KNA1')
WHERE LAND1 = 'DE'
")
# Extract sales data (order headers joined with line items, since material and
# quantity live on VBAP rather than VBAK)
sales_data <- dbGetQuery(con, "
SELECT
h.KUNNR as customer_id,
i.MATNR as material,
i.NETWR as net_value,
i.KWMENG as quantity,
h.ERDAT as order_date
FROM sap_read_table('VBAK') h
JOIN sap_read_table('VBAP') i ON h.VBELN = i.VBELN
WHERE h.ERDAT >= '2024-01-01'
")
print(paste("Extracted", nrow(customer_data), "customer records"))
print(paste("Extracted", nrow(sales_data), "sales records"))
2. Customer Analysis
# Customer analysis
customer_analysis <- sales_data %>%
group_by(customer_id) %>%
summarise(
total_sales = sum(net_value),
avg_order_value = mean(net_value),
order_count = n(),
last_order_date = max(order_date)
) %>%
mutate(
value_segment = case_when(
total_sales > 100000 ~ "HIGH_VALUE",
total_sales > 10000 ~ "MEDIUM_VALUE",
TRUE ~ "LOW_VALUE"
),
frequency_segment = case_when(
order_count > 50 ~ "FREQUENT",
order_count > 10 ~ "REGULAR",
TRUE ~ "OCCASIONAL"
)
)
# Visualize customer segments
ggplot(customer_analysis, aes(x = value_segment, y = frequency_segment)) +
geom_count() +
labs(title = "Customer Segmentation",
x = "Value Segment",
y = "Frequency Segment") +
theme_minimal()
3. Predictive Modeling
# Prepare data for modeling (churn_risk is derived directly from order_count,
# so treat this as an illustrative segmentation rather than a leak-free model;
# it must be a factor for randomForest to run classification)
model_data <- customer_analysis %>%
mutate(
churn_risk = factor(case_when(
order_count < 5 ~ "HIGH",
order_count < 15 ~ "MEDIUM",
TRUE ~ "LOW"
))
)
# Train random forest model
set.seed(42)
model <- randomForest(
churn_risk ~ total_sales + avg_order_value + order_count,
data = model_data,
ntree = 100
)
# Model importance
importance_df <- data.frame(
feature = rownames(importance(model)),
importance = importance(model)[, 1]
)
ggplot(importance_df, aes(x = reorder(feature, importance), y = importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Feature Importance for Churn Prediction",
x = "Feature",
y = "Importance") +
theme_minimal()
Advanced Analytics
1. Market Basket Analysis
-- Extract order items for market basket analysis
CREATE VIEW order_items AS
SELECT
VBELN as sales_document,
POSNR as item_number,
MATNR as material,
KWMENG as quantity,
NETWR as net_value
FROM sap_read_table('VBAP')
WHERE ERDAT >= '2024-01-01';
-- Find frequently co-purchased items
WITH item_pairs AS (
SELECT
o1.material as item1,
o2.material as item2,
COUNT(*) as co_purchase_count
FROM order_items o1
JOIN order_items o2 ON o1.sales_document = o2.sales_document
WHERE o1.material < o2.material -- Avoid duplicates
GROUP BY o1.material, o2.material
HAVING COUNT(*) >= 10 -- Minimum co-purchase threshold
)
SELECT
item1,
item2,
co_purchase_count
FROM item_pairs
ORDER BY co_purchase_count DESC
LIMIT 20;
2. Customer Lifetime Value
-- Calculate customer lifetime value
WITH customer_lifetime_value AS (
SELECT
KUNNR as customer_id,
COUNT(*) as total_orders,
SUM(NETWR) as total_sales,
AVG(NETWR) as avg_order_value,
MIN(ERDAT) as first_order_date,
MAX(ERDAT) as last_order_date,
DATEDIFF('day', MIN(ERDAT), MAX(ERDAT)) as customer_lifespan_days,
-- Simple annualized CLV; NULLIF guards against a zero-day lifespan
SUM(NETWR) * (365.0 / NULLIF(DATEDIFF('day', MIN(ERDAT), MAX(ERDAT)), 0)) as annual_clv
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
HAVING COUNT(*) >= 3 -- Customers with at least 3 orders
)
SELECT
customer_id,
total_orders,
total_sales,
avg_order_value,
customer_lifespan_days,
annual_clv,
CASE
WHEN annual_clv > 50000 THEN 'HIGH_VALUE'
WHEN annual_clv > 10000 THEN 'MEDIUM_VALUE'
ELSE 'LOW_VALUE'
END as clv_segment
FROM customer_lifetime_value
ORDER BY annual_clv DESC;
Model Deployment
1. Model Serving with FastAPI
from fastapi import FastAPI
import duckdb
import pickle
import pandas as pd
# Load trained model
with open('churn_model.pkl', 'rb') as f:
    model = pickle.load(f)
# Initialize FastAPI app
app = FastAPI()
# Connect to DuckDB
conn = duckdb.connect('sap_analysis.db')
@app.post("/predict_churn")
async def predict_churn(customer_id: str):
    # Get customer features
    features = conn.execute("""
        SELECT
            total_orders,
            total_sales,
            avg_order_value,
            days_since_last_order
        FROM customer_features
        WHERE customer_id = ?
    """, [customer_id]).fetchdf()
    if features.empty:
        return {"error": "Customer not found"}
    # Make prediction
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0][1]
    return {
        "customer_id": customer_id,
        "churn_prediction": int(prediction),
        "churn_probability": float(probability)
    }
@app.get("/customer_segments")
async def get_customer_segments():
    segments = conn.execute("""
        SELECT
            clv_segment,
            COUNT(*) as customer_count,
            AVG(annual_clv) as avg_clv
        FROM customer_lifetime_value
        GROUP BY clv_segment
    """).fetchdf()
    return segments.to_dict('records')
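Both endpoints assume that customer_features (holding the churn-model inputs) and customer_lifetime_value exist as persistent relations in sap_analysis.db, which nothing earlier in this guide creates under those names. A minimal preparation sketch, assuming it is acceptable to materialize the earlier queries as tables; the table names and the replacement of the earlier geographic customer_features view are assumptions, not part of ERPL itself.
# One-off preparation step (a sketch reusing the churn-metric query shown earlier).
# Note: this replaces the geographic customer_features view from the feature-engineering
# step with a table holding the inputs the prediction endpoint expects.
conn.execute("DROP VIEW IF EXISTS customer_features")
conn.execute("""
CREATE OR REPLACE TABLE customer_features AS
SELECT
    KUNNR AS customer_id,
    COUNT(*) AS total_orders,
    SUM(NETWR) AS total_sales,
    AVG(NETWR) AS avg_order_value,
    DATEDIFF('day', MAX(ERDAT), CURRENT_DATE) AS days_since_last_order
FROM sap_read_table('VBAK')
WHERE ERDAT >= '2023-01-01'
GROUP BY KUNNR
""")
# The /customer_segments endpoint also needs customer_lifetime_value: wrap the CLV
# query from the Advanced Analytics section in CREATE OR REPLACE TABLE the same way.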
2. Automated Model Retraining
import schedule
import time
from datetime import datetime
def retrain_models():
    """Retrain models with latest data"""
    print(f"Retraining models at {datetime.now()}")
    # Extract latest data
    latest_data = conn.execute("""
        SELECT * FROM sap_read_table('VBAK')
        WHERE ERDAT >= CURRENT_DATE - INTERVAL '30 days'
    """).fetchdf()
    # Retrain churn model
    # ... training code ...
    # Save updated model
    with open('churn_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    print("Models retrained successfully")
# Schedule retraining
schedule.every().day.at("02:00").do(retrain_models)
# Run scheduler
while True:
    schedule.run_pending()
    time.sleep(60)
Best Practices
1. Data Quality
- Validation: Implement data quality checks (see the sketch after this list)
- Cleaning: Handle missing values and outliers
- Monitoring: Track data quality metrics
- Documentation: Document data transformations
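A minimal data-quality check in Python, assuming the customer_data frame from the extraction step; the thresholds and the helper name are purely illustrative.
# Basic data-quality report on an extracted DataFrame (illustrative thresholds).
def quality_report(df, key_column="customer_id", max_null_ratio=0.05):
    report = {
        "rows": len(df),
        "duplicate_keys": int(df[key_column].duplicated().sum()),
        "null_ratio_per_column": df.isna().mean().round(3).to_dict(),
    }
    # Flag columns whose share of missing values exceeds the threshold
    report["columns_over_threshold"] = [
        col for col, ratio in report["null_ratio_per_column"].items()
        if ratio > max_null_ratio
    ]
    return report

print(quality_report(customer_data))
Running such a report after every extraction makes quality regressions visible before they reach a model.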
2. Model Performance
- Validation: Use cross-validation for model evaluation (sketched below)
- Monitoring: Track model performance over time
- Retraining: Implement automated retraining
- A/B Testing: Test model improvements
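For the churn model above, cross-validation takes only a few lines with scikit-learn; this sketch reuses the X and y frames from the churn-prediction step and uses an illustrative F1 scoring choice.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# 5-fold stratified cross-validation of the churn classifier
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=cv, scoring="f1",
)
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")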
3. Scalability
- Incremental Processing: Process data in batches (see the example after this list)
- Caching: Cache frequently used data
- Parallel Processing: Use multiple cores for training
- Resource Management: Monitor memory and CPU usage
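One simple way to process order data incrementally is to pull it month by month instead of in a single query. A sketch, assuming the conn connection from earlier and that ERDAT can be compared against dates as in the previous examples.
import pandas as pd

# Pull order headers one month at a time and append the results (illustrative pattern).
months = pd.date_range("2024-01-01", "2024-07-01", freq="MS")
batches = []
for start, end in zip(months[:-1], months[1:]):
    batch = conn.execute(
        """
        SELECT KUNNR, NETWR, ERDAT
        FROM sap_read_table('VBAK')
        WHERE ERDAT >= ? AND ERDAT < ?
        """,
        [start.date(), end.date()],
    ).fetchdf()
    batches.append(batch)

orders = pd.concat(batches, ignore_index=True)
print(f"Loaded {len(orders)} rows in {len(batches)} monthly batches")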
4. Security
- Access Control: Implement proper authentication
- Data Encryption: Encrypt sensitive data
- Audit Logging: Log all model predictions (a logging sketch follows below)
- Compliance: Ensure regulatory compliance
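Audit logging of predictions can be as simple as writing one structured line per request. A minimal sketch for the FastAPI endpoint above; the file name, helper name, and logged fields are illustrative.
import json
import logging
from datetime import datetime, timezone

# Write one JSON line per prediction to an audit log file (illustrative setup).
audit_logger = logging.getLogger("churn_audit")
audit_logger.setLevel(logging.INFO)
audit_logger.addHandler(logging.FileHandler("churn_predictions_audit.log"))

def log_prediction(customer_id: str, prediction: int, probability: float) -> None:
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "churn_prediction": prediction,
        "churn_probability": probability,
    }))
Calling log_prediction from predict_churn before returning the response keeps a record of every scored customer.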
Conclusion
ERPL transforms SAP data integration for data science by providing:
- Direct Access: Real-time SAP data without ETL complexity
- High Performance: DuckDB's optimized execution engine
- Easy Integration: Works seamlessly with Python, R, and other tools
- Cost Effective: Single extension license vs. multiple tools
The combination of ERPL's SAP connectivity and DuckDB's analytical capabilities creates a powerful platform for data science workflows.
Next Steps
Ready to start your data science journey with ERPL? Here's how to get started:
- Install ERPL: Get started with installation
- Explore Examples: Review key tasks
- Build Your Pipeline: Use the provided examples as templates
- Contact Us: Get help with your specific requirements
Additional Resources
- ERPL Documentation
- DuckDB Documentation
- Scikit-learn Documentation
- R Documentation
- Contact Our Team for personalized guidance
