Troubleshooting distributed model monitoring (DMM) requires inspecting the Kubernetes pods, because the UI exposes only minimal logs. The Domino DMM architecture consists of many Kubernetes pods, and any pod in a failed state may indicate an issue with model monitoring. Check the pod status and logs shown below for problems.
<user-id>$ prod-field % kubectl get pods -n domino-platform| grep -i DMM
dmm-compute-59d655fb6d-zl6z7 1/1 Running 0 21d
dmm-config-backup-28111680-wpdb6 0/1 Completed 0 2d11h
dmm-config-backup-28113120-5xjbz 0/1 Completed 0 35h
dmm-config-backup-28114560-f5m48 0/1 Completed 0 11h
dmm-frontend-78c575d4dc-2wwbz 1/1 Running 0 21d
dmm-parquet-cleanup-job-28114560-2q7vh 0/1 Completed 0 11h
dmm-parquet-conversion-job-28115220-h7x9x 0/1 Completed 0 59m
dmm-plier-59d9b68c9c-gn2tq 1/1 Running 0 21d
dmm-plier-59d9b68c9c-w6zbj 1/1 Running 0 21d
dmm-plier-migrations-qmpzh 0/1 Completed 0 21d
dmm-redis-ha-server-0 4/4 Running 0 21d
dmm-scheduler-674946f659-7wskh 1/1 Running 0 21d
<user-id>$ prod-field % kubectl get pods -n domino-platform| grep -i spark
spark3-master-0 1/1 Running 0 21d
spark3-master-1 1/1 Running 0 21d
<user-id>$ prod-field % kubectl get pods -n domino-compute| grep -i spark
spark3-worker-0 1/1 Running 0 21d
<user-id>$ prod-field %
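In the listings above, healthy pods are either Running with all containers ready or Completed. As a minimal sketch of how that check could be automated, the hypothetical helper below (`unhealthy_pods` is not part of Domino or kubectl) reads `kubectl get pods` output on standard input and prints only the pods that look unhealthy. It assumes the default column order of NAME, READY, STATUS.

```shell
# Hypothetical helper: filter `kubectl get pods` output (NAME READY STATUS ...)
# and print pods that are neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "3/4"
    if ($3 == "Completed") next      # finished jobs are expected
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Example usage (namespace is an assumption; adjust to your deployment):
#   kubectl get pods -n domino-platform --no-headers | grep -i dmm | unhealthy_pods
```

Empty output means every matched pod is Running and fully ready, or Completed. Any line printed names a pod worth inspecting with `kubectl describe pod` and `kubectl logs`.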
The following simple script gathers logs that the support and engineering teams can use for further analysis. Update PLATFORM_NS and COMPUTE_NS to match your Domino deployment.
PLATFORM_NS=domino-platform
COMPUTE_NS=domino-compute
mkdir -p logs/
kubectl logs -n $PLATFORM_NS statefulset/spark3-master -c spark-master --timestamps=true --prefix=true > logs/spark3-master.log
kubectl logs -n $COMPUTE_NS statefulset/spark3-worker -c spark-worker --timestamps=true --prefix=true > logs/spark3-worker.log
kubectl logs -n $PLATFORM_NS deployment/dmm-compute -c compute --timestamps=true --prefix=true > logs/dmm-compute.log
kubectl logs -n $PLATFORM_NS deployment/dmm-plier -c plier --timestamps=true --prefix=true > logs/dmm-plier.log
kubectl logs -n $PLATFORM_NS deployment/dmm-scheduler -c scheduler --timestamps=true --prefix=true > logs/dmm-scheduler.log
kubectl logs -n $PLATFORM_NS -l app.kubernetes.io/component=dmm-parquet-conversion-job --timestamps=true --prefix=true > logs/dmm-parquet-conversion-job.log
kubectl describe cronjob -A > logs/describe-cronjobs.log
kubectl describe pods -A > logs/describe-pods.log
kubectl describe deployments -A > logs/describe-deployments.log
kubectl describe statefulsets -A > logs/describe-statefulsets.log
tar czvf logs.tar.gz logs/
rm -rf logs/
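Because the script above redirects each command's output with `>`, a log file is created even when the `kubectl` command fails, for example if a container name differs in your deployment. A minimal sketch for catching that case follows. The hypothetical helper `empty_logs` (not part of Domino or kubectl) lists zero-byte files in a directory. It would need to run before the `tar` and `rm` steps, or against an extracted copy of the archive.

```shell
# Hypothetical helper: list empty files under a log directory (default: logs).
# An empty file usually means the kubectl command that wrote it produced no output.
empty_logs() {
  find "${1:-logs}" -type f -size 0 | sort
}

# Example usage, before archiving:
#   empty_logs logs/
```

If any files are listed, rerun the matching `kubectl` command without the redirect to see its error message before sending the archive to support.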
The following sections provide useful steps to troubleshoot:
- Basic Domino health
- Connectivity and latency
- Workspaces and Jobs issues
- Compute environment build issues
- Model API issues
- Data sources issues