Best Practices¶
Opinionated recommendations for running the Elastic Stack in production. These are not hard requirements — adjust to your environment.
Elasticsearch¶
Cluster sizing¶
- Minimum 3 master-eligible nodes for quorum. A 2-node cluster has no fault tolerance — if one node goes down, the other can't form a quorum and the cluster locks up.
- Odd number of master-eligible nodes (3, 5, 7) to avoid split-brain during network partitions.
- Dedicate 3 nodes to master-only for any production cluster with non-trivial indexing or search load. A data node under heavy GC pressure or large merges can delay master duties (cluster state updates, shard allocation), which destabilizes the whole cluster. Combined master+data roles are fine for development and light workloads.
elasticsearch_node_types: ["master"]
elasticsearch_heap: "4" # masters need little heap — 4GB is plenty
JVM heap¶
- Set heap to half of available RAM, up to 30GB. The other half is for the OS filesystem cache, which Elasticsearch relies on for search performance. Beyond 30GB the JVM loses compressed ordinary object pointers (compressed OOPs), which increases pointer size from 4 to 8 bytes and actually reduces the amount of heap you can use effectively. A 64GB host should have elasticsearch_heap: "30".
- The role auto-calculates heap from physical RAM when elasticsearch_heap is not set. For production, set it explicitly.
Memory locking¶
Always enable memory locking in production.
Swapping destroys Elasticsearch performance. A node that swaps looks like a slow node to the cluster, causes GC storms, and can trigger unnecessary shard relocations.
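A minimal sketch, assuming the role passes arbitrary elasticsearch.yml settings through elasticsearch_extra_config (if the role exposes a dedicated memory-lock variable, prefer that):

elasticsearch_extra_config:
  bootstrap.memory_lock: true

With bootstrap.memory_lock enabled, the systemd unit also needs LimitMEMLOCK=infinity; if the role does not already manage that override, the node will fail its bootstrap checks when locking is refused.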
Disk watermarks¶
Tune watermarks before you run out of space, not after:
elasticsearch_cluster_settings:
  cluster.routing.allocation.disk.watermark.low: "85%"
  cluster.routing.allocation.disk.watermark.high: "90%"
  cluster.routing.allocation.disk.watermark.flood_stage: "95%"
At flood stage, Elasticsearch puts a read-only-allow-delete block on every index with a shard on the affected node. Recent versions release the block automatically once disk usage falls back below the high watermark; on older versions you must clear it yourself (PUT _all/_settings {"index.blocks.read_only_allow_delete": null}). Keep enough headroom to avoid hitting it.
Index Lifecycle Management (ILM)¶
Don't keep indices open forever. Use ILM policies to:
- Rollover indices when they hit a size or age threshold
- Force merge old indices to reduce segment count
- Move to cheaper storage tiers (warm → cold → frozen)
- Delete when retention expires
This isn't managed by Ansible (it's runtime Elasticsearch config), but make sure your deployment includes ILM policies from day one.
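As a starting point, a sketch of a simple rollover-and-delete policy created through the ILM API (the policy name, thresholds, and retention below are placeholders to adapt):

curl -k -u elastic:PASSWORD -X PUT https://localhost:9200/_ilm/policy/logs-default \
-H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "forcemerge": { "max_num_segments": 1 } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'

Attach the policy to new indices by setting index.lifecycle.name in your index templates (plus index.lifecycle.rollover_alias if you are not using data streams).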
Slow query logging¶
Enable slow query logs to catch problematic queries before they cause outages. These are index-level settings — apply them via an index template so all new indices inherit them:
curl -k -u elastic:PASSWORD -X PUT https://localhost:9200/_index_template/slowlog-defaults \
-H 'Content-Type: application/json' -d '{
"index_patterns": ["*"],
"priority": 0,
"template": {
"settings": {
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.indexing.slowlog.threshold.index.warn": "10s"
}
}
}'
Ship slow logs to a monitoring cluster (not the same cluster) via Filebeat.
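If you ship with the role's Filebeat inputs, a sketch mirroring the audit-log example in the Security section below; the file names assume the default JSON slow log naming (<cluster>_index_search_slowlog.json and <cluster>_index_indexing_slowlog.json):

beats_filebeat_log_inputs:
  es_slowlog:
    name: es-slowlog
    paths:
      - /var/log/elasticsearch/*_index_search_slowlog.json
      - /var/log/elasticsearch/*_index_indexing_slowlog.json
    fields:
      type: slowlog

Merge this with any other entries you define under beats_filebeat_log_inputs rather than declaring the variable twice.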
Security¶
Log all authentication events¶
Enable audit logging to track sign-ins, failures, and privilege escalations:
elasticsearch_extra_config:
  xpack.security.audit.enabled: true
  xpack.security.audit.logfile.events.include:
    - authentication_success
    - authentication_failed
    - access_denied
    - connection_denied
    - tampered_request
    - run_as_denied
    - run_as_granted
  xpack.security.audit.logfile.events.exclude:
    - anonymous_access_denied  # reduce noise from health checks
This writes structured JSON logs to <cluster>_audit.json. The access_granted event is omitted here because it logs every successful API call and generates very high volume. Add it only if you need full access tracing and have the storage for it. At minimum, always include authentication_failed and access_denied — these surface brute-force attempts and misconfigured services.
Ship audit logs to a separate cluster¶
Never store audit logs on the cluster being audited — a compromised cluster could delete its own audit trail:
beats_filebeat: true
beats_filebeat_log_inputs:
  es_audit:
    name: es-audit
    paths:
      - /var/log/elasticsearch/*_audit.json
    fields:
      type: audit
      source_cluster: "{{ elasticsearch_cluster_name }}"
Point this Filebeat at a separate monitoring/SIEM cluster.
Rotate the elastic superuser password¶
The elastic user has full cluster access. After initial setup:
- Create named admin accounts with appropriate roles
- Use API keys for applications instead of username/password (a sketch follows the password example below)
- Rotate the elastic password and store it in a vault
# Change the elastic password
curl -k -u elastic:OLD_PASSWORD -X POST \
https://localhost:9200/_security/user/elastic/_password \
-H 'Content-Type: application/json' \
-d '{"password": "NEW_PASSWORD"}'
Restrict network exposure¶
Elasticsearch should never be directly accessible from the internet. The role defaults to binding to _site_ (private network interface), which is correct for most deployments. Combine this with firewall rules to restrict port 9200 (HTTP) and 9300 (transport) to known hosts only — Kibana nodes, Logstash nodes, and your admin workstations.
For hosts with multiple network interfaces, set the bind address explicitly.
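A sketch, again going through elasticsearch_extra_config (the address is a placeholder for your private interface; use a dedicated role variable if one exists):

elasticsearch_extra_config:
  network.host: "10.0.1.15"

You can also split network.bind_host and network.publish_host if the node should listen on one address but advertise another.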
Kibana¶
Use a reverse proxy¶
Always put Kibana behind a reverse proxy (Nginx, Caddy, HAProxy) in production:
- Terminates TLS with proper certificates (Let's Encrypt, corporate CA)
- Adds security headers (HSTS, CSP, X-Frame-Options)
- Provides rate limiting and access control
- Handles HTTP/2 and compression
See Kibana behind a reverse proxy for a complete nginx example.
Session and encryption keys¶
For multi-instance Kibana deployments, set consistent encryption keys so sessions work across instances:
kibana_extra_config:
  xpack.security.encryptionKey: "{{ vault_kibana_encryption_key }}"  # min 32 chars
  xpack.encryptedSavedObjects.encryptionKey: "{{ vault_kibana_saved_objects_key }}"
  xpack.reporting.encryptionKey: "{{ vault_kibana_reporting_key }}"
Without these, each Kibana instance generates random keys on startup and sessions break when a load balancer sends requests to a different instance.
Kibana logging¶
Configure Kibana to log security events to a separate file for easier monitoring and shipping:
kibana_extra_config:
  logging.appenders.security.type: rolling-file
  logging.appenders.security.fileName: /var/log/kibana/security.log
  logging.appenders.security.policy.type: time-interval
  logging.appenders.security.policy.interval: 24h
  logging.appenders.security.strategy.type: numeric
  logging.appenders.security.strategy.max: 30
  logging.appenders.security.layout.type: json
  logging.loggers:
    - name: plugins.security
      level: info
      appenders: [security]
Logstash¶
Persistent queues for durability¶
If losing events during a Logstash restart is unacceptable, enable persistent queues.
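The settings below are standard logstash.yml options; the logstash_extra_config passthrough is an assumption modeled on the elasticsearch_extra_config and kibana_extra_config pattern used elsewhere in this role, so adjust to however your role injects logstash.yml settings:

logstash_extra_config:
  queue.type: persisted
  queue.max_bytes: 4gb
  path.queue: /var/lib/logstash/queue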
Use fast storage (SSD/NVMe) for the queue path. Monitor queue depth — if it grows consistently, your output (Elasticsearch) can't keep up.
Dead letter queue for failed events¶
Instead of dropping events that fail to index, send them to the DLQ for later analysis.
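Using the same hypothetical passthrough variable as above, the relevant logstash.yml settings are:

logstash_extra_config:
  dead_letter_queue.enable: true
  dead_letter_queue.max_bytes: 1gb

Read the DLQ back with the dead_letter_queue input plugin in a dedicated pipeline.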
Check the DLQ periodically for mapping conflicts or malformed events that need pipeline fixes.
Monitoring¶
Monitor the stack itself¶
At minimum, deploy Metricbeat on all nodes to collect cluster metrics:
beats_metricbeat: true
beats_metricbeat_modules:
- system
- elasticsearch-xpack
- kibana-xpack
- logstash-xpack
This feeds Kibana's Stack Monitoring dashboards. For production, send these metrics to a separate monitoring cluster so you can still see what's happening when the production cluster is down.
Alerting¶
Set up Kibana alerting rules for:
- Cluster health goes yellow or red
- Disk usage exceeds 80% on any node
- JVM heap consistently above 85%
- Search latency p95 exceeds your SLA
- Authentication failures spike (brute force detection)
- Audit log volume drops to zero (logging may be broken)
Backups¶
Snapshot every day¶
Configure automated snapshots from day one, not after you need them.
After deployment, create an SLM (Snapshot Lifecycle Management) policy via the API or Kibana. Minimum recommended: daily snapshots, 30-day retention.
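A hedged example using the snapshot and SLM APIs; the repository name, filesystem location, and schedule are placeholders, and a filesystem repository also requires path.repo in elasticsearch.yml on every node:

# Register a snapshot repository (shared filesystem shown; S3/GCS/Azure need the matching repository plugin or built-in support)
curl -k -u elastic:PASSWORD -X PUT https://localhost:9200/_snapshot/backups \
-H 'Content-Type: application/json' \
-d '{"type": "fs", "settings": {"location": "/mnt/es-backups"}}'

# Daily snapshot at 01:30 UTC, 30-day retention
curl -k -u elastic:PASSWORD -X PUT https://localhost:9200/_slm/policy/daily-snapshots \
-H 'Content-Type: application/json' \
-d '{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "backups",
  "config": { "include_global_state": true },
  "retention": { "expire_after": "30d", "min_count": 7, "max_count": 90 }
}'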
Test your restores¶
A backup that hasn't been tested is not a backup. Schedule quarterly restore tests to a separate cluster to verify:
- Snapshots are complete and not corrupted
- Restore procedures are documented and work
- Restore time fits within your RTO (Recovery Time Objective)
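To exercise a restore on the test cluster, a hedged sketch (the repository, snapshot, and index names are placeholders; list available snapshots first with GET _snapshot/backups/_all):

# Restore a snapshot under a new index prefix to avoid name collisions
curl -k -u elastic:PASSWORD -X POST \
https://localhost:9200/_snapshot/backups/daily-snap-2024.01.15/_restore \
-H 'Content-Type: application/json' \
-d '{
  "indices": "logs-*",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}'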