Apr 15, 2025 • HacknWatch

Bridging the Gap: Machine Learning with Elasticsearch and Eland

Learn how to use Eland to apply Machine Learning workflows directly to Elasticsearch using familiar Pandas syntax.

5 min read • machine-learning

python elasticsearch data-science

Machine Learning with Elasticsearch using Eland

For Python data scientists, Elasticsearch has historically been a source of friction. It is an excellent engine for storage and retrieval, but its Query DSL (Domain Specific Language), with its deeply nested JSON objects, feels alien next to the concise syntax of Python’s pandas.

Enter Eland.

Eland is a Python client that enables you to explore and analyze data in Elasticsearch with a familiar pandas-like API. More importantly, it acts as a bridge, allowing you to train models in Python and deploy them into Elasticsearch for high-speed inference.

The Python-Elasticsearch Ecosystem

To understand Eland’s value, we have to look at the available tools:

elasticsearch-py (The Driver): The low-level client. It maps 1:1 with REST API endpoints. Great for engineering, verbose for analysis.
elasticsearch-dsl (The Builder): A higher-level wrapper that makes writing queries easier, but still requires understanding Elasticsearch concepts (queries, aggregations).
eland (The Data Science Layer): Wraps the data in a DataFrame structure. It translates pandas syntax into Elasticsearch queries behind the scenes.

Getting Started

Installation

# Install Eland with all optional extras, plus the ML libraries used in this post
pip install "eland[all]" xgboost scikit-learn
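After installing, it can be worth confirming that the packages resolve in the environment your notebook or script actually uses. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for illustration, and the package list simply mirrors the pip command above (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages installed above.
# scikit-learn is distributed as "scikit-learn" but imported as "sklearn".
REQUIRED = ("eland", "xgboost", "sklearn")


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", missing if missing else "none")
```

An empty result means all three imports should succeed in this interpreter.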

The DataFrame Proxy

The magic of Eland is that the DataFrame is a proxy. When you create an Eland DataFrame, you aren’t downloading the data; you are establishing a view.

import eland as ed

# Connect to localhost (or Elastic Cloud)
df = ed.DataFrame(
    es_client="http://localhost:9200",
    es_index_pattern="network-traffic-logs"
)

# This looks like pandas, but it executes an Elasticsearch Search Query!
# Computation happens on the CLUSTER, not your laptop.
print(df.head())

# This runs metric aggregations (such as extended_stats and percentiles) on the cluster
summary = df.describe()
print(summary)
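To make the proxy idea concrete, here is roughly what a summary like `describe()` could look like if you wrote the request by hand. This is a sketch, not the exact request Eland sends: the helper `describe_request` and the aggregation names are invented for illustration, and the field names are the ones assumed elsewhere in this post.

```python
import json


def describe_request(numeric_fields, percents=(25, 50, 75)):
    """Build a search body that summarises numeric fields without returning hits."""
    aggs = {}
    for field in numeric_fields:
        # extended_stats returns count, min, max, avg and std_deviation together
        aggs[f"{field}_stats"] = {"extended_stats": {"field": field}}
        # percentiles supplies the quartiles that a describe() table shows
        aggs[f"{field}_percentiles"] = {
            "percentiles": {"field": field, "percents": list(percents)}
        }
    # size=0 asks Elasticsearch for aggregation results only, no documents
    return {"size": 0, "aggs": aggs}


body = describe_request(["bytes_sent", "response_time"])
print(json.dumps(body, indent=2))
```

Only the aggregated numbers travel over the network, which is why summarising a very large index from a laptop stays cheap.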

Exploratory Data Analysis (EDA)

You can scrub, filter, and visualize data without ever writing a JSON query.

import matplotlib.pyplot as plt

# Filtering: this becomes a 'bool' query combining a 'range' clause and a 'term' clause
high_traffic = df[
    (df['bytes_sent'] > 5000) & 
    (df['protocol'] == 'tcp')
]

# Identifier-like fields rarely help ML models, so drop them (lazy: nothing is fetched yet)
ml_view = high_traffic.drop(columns=['event_id', 'user_agent'])

# Histograms are computed server-side; only the bin counts are returned to Python
ml_view[['bytes_sent', 'response_time']].hist(figsize=(10,5))
plt.show()
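For comparison, the `high_traffic` filter above corresponds to something like the following hand-written Query DSL. Treat it as an approximation: the query Eland generates may be structured differently, the function name `high_traffic_query` is hypothetical, and the `term` clause assumes `protocol` is mapped as a keyword field.

```python
import json


def high_traffic_query(min_bytes=5000, protocol="tcp"):
    """Hand-written Query DSL for: bytes_sent > min_bytes AND protocol == protocol."""
    return {
        "query": {
            "bool": {
                # every clause under "filter" must match; scoring is skipped
                "filter": [
                    {"range": {"bytes_sent": {"gt": min_bytes}}},
                    {"term": {"protocol": protocol}},
                ]
            }
        }
    }


query_body = high_traffic_query()
print(json.dumps(query_body, indent=2))
```

The pandas expression is three lines, and the JSON grows with every extra condition, which is the friction Eland is meant to remove.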

The ML Workflow: Train Locally, Deploy Globally

The most powerful feature of Eland is the ability to bridge the gap between training and production.

Step 1: Data Egress for Training

Eland allows you to filter your data server-side, and then pull only what you need into local memory for training.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Filter on the server first (save bandwidth)
ed_filtered = df[df['status_code'] == 200]

# 2. Pull to local memory (converts to standard pandas DataFrame)
# WARNING: Ensure you have enough RAM for the result set
pandas_df = ed_filtered.to_pandas()

# 3. Standard Scikit-Learn Workflow
X = pandas_df[['bytes_sent', 'duration', 'geo_distance']]
y = pandas_df['is_suspicious']

# Fix the random seed so the split is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
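The RAM warning in step 2 can be turned into an explicit check. The sketch below relies on the fact that an Eland DataFrame exposes `.shape` like pandas does, so the row count can be inspected before anything is downloaded. The helper name `to_pandas_guarded` and the `max_rows` threshold are illustrative assumptions; pick a limit that matches your machine.

```python
def to_pandas_guarded(ed_df, max_rows=500_000):
    """Convert an Eland DataFrame to pandas only if it is small enough.

    `ed_df` must expose `.shape` and `.to_pandas()`, as Eland DataFrames do.
    `max_rows` is an arbitrary, illustrative limit.
    """
    n_rows, n_cols = ed_df.shape  # the row count is computed on the cluster
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows x {n_cols} columns exceeds the limit of {max_rows} rows; "
            "filter or sample on the server first"
        )
    return ed_df.to_pandas()
```

In the workflow above, `pandas_df = to_pandas_guarded(ed_filtered)` would then fail fast instead of exhausting local memory.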

Step 2: Deploying the Model to Elasticsearch

This is the “killer feature.” Instead of building a separate Flask API to host your model, you can serialize the model and upload it directly to the Elasticsearch cluster.

Elasticsearch can then use this model to make predictions during data ingestion by running it in an inference processor inside an Ingest Pipeline.

from eland.ml import MLModel

# Deploy the trained Scikit-Learn model to Elasticsearch
es_model = MLModel.import_model(
    es_client="http://localhost:9200",
    model_id="suspicious_traffic_detector_v1",
    model=dt_model,
    feature_names=['bytes_sent', 'duration', 'geo_distance'],
    es_if_exists='replace'  # accepted values are 'fail' (the default) and 'replace'
)

# Confirm the upload by asking the cluster whether the model now exists
print(f"Model present in cluster: {es_model.exists_model()}")
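To use the uploaded model at ingest time, it has to be referenced from an ingest pipeline. The sketch below builds a plausible pipeline body with a single inference processor. The helper name, the description text and the `ml.inference` target field are assumptions for illustration; the model ID is the one used in the upload above.

```python
import json

MODEL_ID = "suspicious_traffic_detector_v1"  # model ID used in the upload above


def inference_pipeline(model_id=MODEL_ID, target_field="ml.inference"):
    """Build an ingest pipeline body that scores each incoming document."""
    return {
        "description": "Score traffic logs with the uploaded decision tree",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    # predictions are written under this field in each document
                    "target_field": target_field,
                    "inference_config": {"classification": {}},
                }
            }
        ],
    }


pipeline_body = inference_pipeline()
print(json.dumps(pipeline_body, indent=2))
```

A body like this would be registered with `PUT _ingest/pipeline/<pipeline-name>`. Documents indexed through that pipeline need to carry the same feature fields the model was trained on.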

Step 3: Inference

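Once uploaded, the model runs inside Elasticsearch. From Python, predict() sends feature values to the cluster and returns the predictions, which is a quick way to confirm that the deployed model behaves like the local one. To score documents already in your index without moving them, attach the model to an ingest pipeline.

# Pull a small sample of the feature columns (same order as in training)
sample = df[['bytes_sent', 'duration', 'geo_distance']].head(10).to_pandas()

# predict() expects a list or NumPy array, not an Eland DataFrame.
# The prediction itself is computed by Elasticsearch, not locally.
predictions = es_model.predict(sample.to_numpy())

print(predictions)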
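For scoring a whole index rather than a handful of rows, one option is to reindex through an ingest pipeline that contains an inference processor, so the documents never leave the cluster. The sketch below only builds the request body. The destination index and the pipeline ID are hypothetical names, and the pipeline is assumed to exist already.

```python
import json


def reindex_with_pipeline(source_index, dest_index, pipeline_id):
    """Build a _reindex body that routes every copied document through a pipeline."""
    return {
        "source": {"index": source_index},
        # "pipeline" tells Elasticsearch to run each document through the ingest pipeline
        "dest": {"index": dest_index, "pipeline": pipeline_id},
    }


reindex_body = reindex_with_pipeline(
    source_index="network-traffic-logs",
    dest_index="network-traffic-logs-scored",  # hypothetical destination index
    pipeline_id="suspicious-traffic-scoring",  # hypothetical pipeline ID
)
print(json.dumps(reindex_body, indent=2))
```

Sending this body to `POST _reindex` would write scored copies of the documents into the destination index while leaving the source untouched.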

Supported Models and Limitations

Eland is powerful, but it relies on the inference capabilities of the Elasticsearch node. It is not a generic Python runner.

Supported Model Types

Eland can convert specific tree-based models into the internal format that Elasticsearch uses for native inference:

Scikit-Learn: Decision Trees, Random Forests.
XGBoost: Classifier and Regressor.
LightGBM: Classifier and Regressor.
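A cheap guard before attempting an upload is to check the estimator's class name against the list above. This is a sketch based only on that list: the set and the helper `is_deployable` are illustrative, and the Eland release you have installed may accept a different set of models.

```python
# Class names derived from the list above; other Eland releases may differ.
SUPPORTED_MODEL_CLASSES = frozenset(
    {
        "DecisionTreeClassifier",
        "DecisionTreeRegressor",
        "RandomForestClassifier",
        "RandomForestRegressor",
        "XGBClassifier",
        "XGBRegressor",
        "LGBMClassifier",
        "LGBMRegressor",
    }
)


def is_deployable(model):
    """Return True if the model's class name appears in the supported list."""
    return type(model).__name__ in SUPPORTED_MODEL_CLASSES
```

Calling `is_deployable(dt_model)` before the import step gives a clearer error path than a failed upload.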

Limitations

Memory Management: While Eland operates lazily, calling to_pandas() triggers a download. If your index has 100GB of data, you cannot pull it all onto a laptop. You must aggregate or sample first.
Unsupervised Learning: You can use Eland to fetch data for training Isolation Forests or K-Means locally, but you cannot currently deploy these unsupervised models back to Elasticsearch for native inference via Eland.
Complex Pipelines: Feature engineering steps (like StandardScaler or One-Hot Encoding) done in Python are not automatically part of the deployed model. You must use Elasticsearch Ingest Pipelines to replicate those data transformations before the data hits the model.
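As an example of replicating a preprocessing step, the sketch below turns standard scaling into a script processor that could sit in front of the model in an ingest pipeline. It assumes a top-level numeric field with a simple name and uses the population standard deviation, which is what scikit-learn's StandardScaler computes. The helper name `standard_scale_processor` is hypothetical.

```python
from statistics import fmean, pstdev


def standard_scale_processor(field, training_values):
    """Build a script processor applying (x - mean) / std to one field.

    The statistics must come from the same training data the model saw.
    """
    mean = fmean(training_values)
    std = pstdev(training_values)  # population std (ddof=0), as StandardScaler uses
    if std == 0:
        raise ValueError(f"Field {field!r} is constant; it cannot be scaled")
    return {
        "script": {
            "lang": "painless",
            # assumes a simple top-level field name without quotes or dots
            "source": f"ctx['{field}'] = (ctx['{field}'] - params.mean) / params.std",
            "params": {"mean": mean, "std": std},
        }
    }


processor = standard_scale_processor("bytes_sent", [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(processor["script"]["params"])
```

The processor would be placed ahead of the model's step in the pipeline so that the model receives values on the same scale it was trained on.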

Conclusion

Eland enables a hybrid workflow that lets data teams:

Push compute to the data for heavy lifting (filtering, aggregating).
Pull data to local memory for precision tasks (training).
Push logic back to the database for production (inference).

By mastering Eland, you remove the need for complex middleware when deploying standard classification and regression models on your log and metric data.