Automation Suite on Linux Installation Guide
Last updated Nov 21, 2024
Upgrade fails due to unhealthy Ceph
When trying to upgrade to a new Automation Suite version, you might see the following error message:
Ceph objectstore is not completely healthy at the moment. Inner exception - Timeout waiting for all PGs to become active+clean
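Before drilling down to individual pods, you can take a quick look at the overall Ceph cluster state. The following is a minimal sketch, not part of the official procedure; it only assumes the rook-ceph-tools deployment in the rook-ceph namespace, which the commands below rely on as well, plus jq for the optional second command:

# Print the overall Ceph health and status summary.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
# Optional: show only the PG state counts (requires jq where you run kubectl).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq '.pgmap.pgs_by_state'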
To fix this upgrade issue, verify if the OSD pods are running and healthy by running the following command:
kubectl -n rook-ceph get pod -l app=rook-ceph-osd --no-headers | grep -P '([0-9])/\1' -v
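The grep pattern '([0-9])/\1' matches READY values whose two digits are equal (for example 1/1 or 2/2), so with -v the command prints only OSD pods that are not fully ready. If you prefer a scripted check, here is a small sketch under the same assumptions as the command above (bash, same kubectl context); it is illustrative only and not part of the official procedure:

# Capture the filtered output; grep exits non-zero when nothing remains after
# filtering, so `|| true` keeps the assignment from failing under `set -e`.
unhealthy_osds=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd --no-headers | grep -P '([0-9])/\1' -v || true)
if [[ -z "${unhealthy_osds}" ]]; then
  echo "All OSD pods are fully ready; check whether the PGs are still recovering."
else
  echo "OSD pods that are not fully ready:"
  echo "${unhealthy_osds}"
fi

The two cases below apply regardless of whether you ran the one-line command or this wrapped version.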
- If the command does not output any pods, verify whether the Ceph placement groups (PGs) are recovering by running the following command:

  function is_ceph_pg_active_clean() {
    local return_code=1
    if kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq '. as $root | ($root | .pgmap.num_pgs) as $total_pgs | try ( ($root | .pgmap.pgs_by_state[] | select(.state_name == "active+clean").count) // 0) as $active_pgs | if $total_pgs == $active_pgs then true else false end' | grep -q 'true'; then
      return_code=0
    fi
    [[ $return_code -eq 0 ]] && echo "All Ceph Placement groups(PG) are active+clean"
    if [[ $return_code -ne 0 ]]; then
      echo "All Ceph Placement groups(PG) are not active+clean. Please wait for PGs to become active+clean"
      kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state!="active+clean") | [.pgid, .state] | @tsv'
    fi
    return "${return_code}"
  }
  # Execute the function multiple times to get updated ceph PG status
  is_ceph_pg_active_clean

  Note: If none of the affected Ceph PGs recover even after waiting for more than 30 minutes, raise a ticket with UiPath® Support.
- If the command outputs pod(s), you must first fix the issue affecting them:
  - If a pod is stuck in Init:0/4, it could be a PV provider (Longhorn) issue. To debug this issue, raise a ticket with UiPath® Support.
  - If a pod is in CrashLoopBackOff, fix the issue by running the following command:

    function cleanup_crashing_osd() {
      local restart_operator="false"
      local min_required_healthy_osd=1
      local in_osd
      local up_osd
      local healthy_osd_pod_count
      local crashed_osd_deploy
      local crashed_pvc_name

      if ! kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail | grep 'rook-ceph.rgw.buckets.data' | grep -q 'replicated'; then
        min_required_healthy_osd=2
      fi

      in_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_in_osds')
      up_osd=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status -f json | jq -r '.osdmap.num_up_osds')
      healthy_osd_pod_count=$(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'Running' | grep -c -P '([0-9])/\1')

      if ! [[ $in_osd -ge $min_required_healthy_osd && $up_osd -ge $min_required_healthy_osd && $healthy_osd_pod_count -ge $min_required_healthy_osd ]]; then
        return
      fi

      for crashed_osd_deploy in $(kubectl -n rook-ceph get pod -l app=rook-ceph-osd | grep 'CrashLoopBackOff' | cut -d'-' -f'1-4'); do
        if kubectl -n rook-ceph logs "deployment/${crashed_osd_deploy}" | grep -q '/crash/'; then
          echo "Found crashing OSD deployment: '${crashed_osd_deploy}'"
          crashed_pvc_name=$(kubectl -n rook-ceph get deployment "${crashed_osd_deploy}" -o json | jq -r '.metadata.labels["ceph.rook.io/pvc"]')
          echo "Removing crashing OSD deployment: '${crashed_osd_deploy}' and PVC: '${crashed_pvc_name}'"
          timeout 60 kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" || kubectl -n rook-ceph delete deployment "${crashed_osd_deploy}" --force --grace-period=0
          timeout 100 kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" || kubectl -n rook-ceph delete pvc "${crashed_pvc_name}" --force --grace-period=0
          restart_operator="true"
        fi
      done

      if [[ $restart_operator == "true" ]]; then
        kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator
      fi
      return 0
    }
    # Execute the cleanup function
    cleanup_crashing_osd
After fixing the crashing OSD, verify whether the PGs are recovering by running the following command:

is_ceph_pg_active_clean
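If you had to clean up a crashing OSD, you can wait for the Rook operator rollout to finish and then poll the PG check until it reports a clean state. This is a minimal sketch, assuming the is_ceph_pg_active_clean function defined earlier is loaded in your current shell; the 30-minute window mirrors the note earlier in this section:

# Wait for the operator restart triggered by cleanup_crashing_osd to complete.
kubectl -n rook-ceph rollout status deployment/rook-ceph-operator --timeout=300s
# Poll the PG status roughly once a minute for up to 30 minutes.
for attempt in $(seq 1 30); do
  if is_ceph_pg_active_clean; then
    echo "All PGs are active+clean; you can retry the upgrade."
    break
  fi
  echo "Attempt ${attempt}/30: PGs are still recovering; waiting 60 seconds..."
  sleep 60
done

If the PGs do not reach the active+clean state within this window, raise a ticket with UiPath® Support, as noted above.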