Skip to main content
Back to blog

Self-Healing Test Automation in Azure DevOps Pipelines

How to implement self-healing test automation in Azure DevOps. Covers Healenium for Selenium, Playwright's resilient locators, automatic retry strategies, flakiness detection, and maintaining test suite health in CI/CD.

InnovateBits5 min read
Share

Self-healing test automation reduces maintenance overhead by automatically adapting to UI changes. Combined with Azure DevOps's retry mechanisms and analytics, you can build a test suite that stays green even as the application evolves.


Why tests break unnecessarily

The majority of test maintenance work falls into these categories:

CauseFrequencyPreventable?
CSS selector changed (UI refactor)HighYes — with self-healing
Text changed (copy update)MediumYes — with flexible matchers
Timing (flakiness)HighYes — with waits and retries
API response structure changedMediumYes — with contract tests
Actual bug (genuine regression)LowNo — this is what we want to catch

Self-healing addresses the first three categories — reducing noise so genuine failures stand out.


Playwright's resilient locator strategy

Playwright's built-in locators are already more resilient than raw CSS selectors:

// Fragile — breaks when CSS class changes
await page.locator('.btn-checkout-v2').click()
 
// Resilient — tied to role and accessible name
await page.getByRole('button', { name: 'Place order' }).click()
 
// Resilient — tied to data attribute (never changes if dev keeps it)
await page.getByTestId('place-order-btn').click()
 
// Resilient — tied to user-visible text
await page.getByText('Place order').click()

Add data-testid to your development Definition of Done: every interactive element must have a data-testid attribute. This single convention eliminates 80% of selector-related test breaks.


Automatic retries in Azure Pipelines

Configure retries at the pipeline level:

# playwright.config.ts
export default defineConfig({
  retries: process.env.CI ? 2 : 0,  // Retry failed tests up to 2 times in CI
  expect: {
    timeout: 10_000,
  },
})
# azure-pipelines.yml — retry the entire job if flakiness is suspected
jobs:
  - job: E2ETests
    timeoutInMinutes: 30
    retryCountOnTaskFailure: 1   # Retry the entire job once if it fails
    steps:
      - script: npx playwright test

Use job-level retries sparingly — they mask systemic issues. Prefer test-level retries via retries: 2 in Playwright config.


Healenium for Selenium self-healing

Healenium uses ML to find elements when their locators break:

# docker-compose.healenium.yml
version: '3'
services:
  healenium-db:
    image: postgres:15
    environment:
      POSTGRES_DB: healenium
      POSTGRES_USER: healenium
      POSTGRES_PASSWORD: healenium
 
  healenium-proxy:
    image: healenium/hlm-proxy:3.5.0
    ports:
      - "8085:8085"
    environment:
      SPRING_DATASOURCE_URL: jdbc:postgresql://healenium-db:5432/healenium
    depends_on:
      - healenium-db
 
  healenium-backend:
    image: healenium/hlm-backend:3.5.0
    ports:
      - "7878:7878"
    environment:
      DB_HOST: healenium-db
    depends_on:
      - healenium-db
# azure-pipelines.yml — start Healenium before Selenium tests
steps:
  - script: |
      docker-compose -f docker-compose.healenium.yml up -d
      sleep 10  # Wait for services to start
    displayName: Start Healenium
 
  - script: |
      mvn test \
        -Dselenide.remote=http://localhost:8085/wd/hub \
        -DBASE_URL=$(STAGING_URL)
    displayName: Run Selenium with self-healing
 
  - script: docker-compose -f docker-compose.healenium.yml down
    displayName: Stop Healenium
    condition: always()

When a locator fails, Healenium:

  1. Takes a screenshot of the current page
  2. Compares it to the last successful screenshot
  3. Finds the element using visual similarity and nearby text
  4. Updates the locator for future runs
  5. Logs the healed locator so developers can update the code

Flakiness detection in Azure DevOps

Azure Pipelines automatically tracks test flakiness. Go to Pipelines → [Pipeline] → Analytics → Test flakiness:

  • Tests flagged as flaky: pass rate between 5–95% across multiple runs
  • Flakiness rate trend over time
  • Most flaky tests list

Act on flakiness immediately:

# Mark known flaky tests and quarantine them
- script: |
    # Generate flakiness report from pipeline analytics API
    az pipelines runs list \
      --pipeline-id $(System.DefinitionId) \
      --result failed \
      --query "[].{id:id, tests:url}" \
      --output table
  displayName: Check flakiness trend

Quarantine flaky tests so they don't block CI, then fix them:

// Temporarily skip flaky test while fixing
test.skip('TC-204: Wishlist limit — flaky timing issue', async ({ page }) => {
  // TODO: fix timing issue — tracked in bug #891
})

Create a bug in Azure Boards for each quarantined test and track resolution.


Proactive maintenance: test health dashboard

Build a dashboard that shows test suite health over time:

Test Suite Health — Week of 2025-10-20

Total tests:        284
Passing rate:        97.2%
Flaky tests:           4  (1.4% — threshold: < 3%)
Quarantined:           2
Average duration:  11m 34s  (target: < 15m)

Top flaky tests:
  checkout › payment › 3DS redirect     (fails 23% of runs)
  auth › session › token refresh        (fails 8% of runs)
  profile › avatar › upload large file  (fails 12% of runs)
  search › filters › price range        (fails 6% of runs)

Actions needed:
  ⚠ Fix 3DS redirect timing (priority: high)
  ⚠ Add explicit wait to token refresh test

Common errors and fixes

Error: Healenium doesn't heal broken locators — falls through to original failure Fix: Healenium requires a previous successful run to compare against. On the first run after a UI change, it will fail. After one successful run with the new UI, future runs self-heal.

Error: Playwright retries inflate test counts in the Tests tab Fix: Configure the JUnit reporter to only report final results: reporter: [['junit', { includeProjectInTestName: true }]]. Each retry shows as a separate row in the raw XML but Azure DevOps collapses them by default.

Error: Job-level retries run all tests again even when only 2 out of 200 failed Fix: Use Playwright's built-in --retries instead of job-level retry. Playwright retries only the failed tests, not the entire suite.

Error: Flakiness analytics show 0 flaky tests despite known intermittent failures Fix: Azure DevOps needs at least 7 pipeline runs to calculate flakiness. Also ensure test results are being published consistently — a pipeline that stops publishing results when it fails won't accumulate enough data.

Free newsletter

Stay ahead in AI-driven QA

Get practical tutorials on test automation, AI testing, and quality engineering — straight to your inbox. No spam, unsubscribe any time.

Discussion

Sign in with GitHub to comment · powered by Giscus