
Diagnosing Why Ready Jobs Won't Start

Understanding conditions, load levels, and scheduling constraints that prevent job execution

Understanding Job Ready States

A job becomes "ready" when:

  • Current time ≥ scheduled run time
  • All conditions satisfied
  • Not already running or held

However, being ready doesn't guarantee immediate start. Several factors can block execution.

Common Reasons Jobs Wait

Load level constraints : CLOAD + job load > LOADLEVEL

Unsatisfied conditions : One or more conditions evaluate to false

Remote variable unavailable : A critical condition references a variable on an unreachable remote host

User load limit exceeded : The user's running jobs already consume their total load allowance

Priority ordering : Higher-priority ready jobs are started first

Start rate limiting : STARTLIM/STARTWAIT is delaying the start of a batch of jobs

Diagnostic Approach

Step 1: Identify Waiting Job

bash

# List all jobs and their states
btjlist

# Focus on jobs that should be running
btjlist | grep -v " Run " | grep -v " Held "

Note the job number of the waiting job.

Step 2: Check Job Details

bash

# Get comprehensive job information
btjlist <job_number>

# Or use btjstat for detailed status
btjstat <job_number>

Look for:

  • Time to run (should be in the past)
  • Conditions list
  • Load level value
  • Priority
  • State/status

Step 3: Check Load Levels

bash

# View system load variables
btvar -v LOADLEVEL CLOAD

# Check job's load level
btjlist <job_number> | grep "Load level"

Calculation:

If CLOAD + job_load > LOADLEVEL, job cannot start.

Example:

LOADLEVEL: 20000
CLOAD: 18500
Job load: 2000

18500 + 2000 = 20500 > 20000  ← Job blocked
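
The same arithmetic can be scripted as a quick check. This is a sketch using the example values above; substitute the real numbers reported by `btvar -v LOADLEVEL CLOAD` and the job's "Load level" field:

```shell
# Example values from above; replace with real output of
# `btvar -v LOADLEVEL CLOAD` and the job's "Load level" field
LOADLEVEL=20000
CLOAD=18500
JOB_LOAD=2000

if [ $((CLOAD + JOB_LOAD)) -gt "$LOADLEVEL" ]; then
    echo "blocked: $((CLOAD + JOB_LOAD)) > $LOADLEVEL"
else
    echo "job can start"
fi
```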

Solution:

Wait for running jobs to complete (reducing CLOAD), or increase LOADLEVEL:

bash

btvar -s LOADLEVEL 25000

Step 4: Examine Conditions

bash

# View job conditions
btjlist <job_number>
# Look for "Conditions:" section

Each condition shows:

  • Variable name
  • Comparison operator
  • Expected value
  • Critical flag

Example condition:

STATUS = Ready

Check variable's actual value:

bash

btvar -v STATUS

If STATUS contains "Pending" instead of "Ready", the condition is false and the job keeps waiting.
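
Scripted, the equality check looks like this. This sketch simulates the variable lookup rather than calling btvar, and assumes the `NAME: value` output format shown in the examples in this article:

```shell
# Simulated output of `btvar -v STATUS` (assumed "NAME: value" format)
OUTPUT="STATUS: Pending"

ACTUAL=${OUTPUT#*: }    # strip the "STATUS: " prefix
EXPECTED="Ready"

if [ "$ACTUAL" = "$EXPECTED" ]; then
    echo "condition satisfied"
else
    echo "condition NOT satisfied (actual value: $ACTUAL)"
fi
```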

Step 5: Check Variable Values

For each condition on the job, verify variable value:

bash

# Check specific variable
btvar -v <variable_name>

# Check multiple variables
btvlist | grep -E "var1|var2|var3"

For remote variables:

bash

# Remote variable format: machine:varname
btvar -v remotemachine:STATUS

If the remote machine is unavailable and the condition is marked critical, the job is blocked.

Step 6: Verify User Load Limits

bash

# Check user's current load usage
btjlist -u <username> | grep " Run "

# Count running jobs' total load
btjlist -u <username> | grep " Run " | awk '{sum+=$NF} END {print sum}'

Compare against user's load level limit (requires admin access to view):

bash

btuser -l <username>
# Look for "Max total ll" (maximum total load level)
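
Putting the two numbers together, a sketch of the comparison. The listing below simulates `btjlist -u` output (state in column 2, load level in the last column, a layout assumed from the examples in this article); a real run would pipe the command itself into the same grep/awk pipeline shown above:

```shell
# Simulated `btjlist -u jsmith` output; replace with the real command
LISTING='101  Run  backup    2000
102  Run  reports   3500
103  Run  etl       4000'

# Sum the load levels of running jobs (last column)
USED=$(printf '%s\n' "$LISTING" | grep " Run " | awk '{sum+=$NF} END {print sum}')
LIMIT=10000    # "Max total ll" reported by btuser -l
NEW_JOB=1000

if [ $((USED + NEW_JOB)) -gt "$LIMIT" ]; then
    echo "blocked: user load $USED + $NEW_JOB exceeds limit $LIMIT"
fi
```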

Resolving Common Blocking Scenarios

Scenario 1: Load Level Exceeded

Symptom:

bash

btvar -v LOADLEVEL CLOAD
# LOADLEVEL: 20000
# CLOAD: 19500
# Job load: 1000
# 19500 + 1000 = 20500 > 20000  ← Job waits

Solutions:

Option A: Increase LOADLEVEL

bash

btvar -s LOADLEVEL 25000

Option B: Wait for jobs to complete

Monitor CLOAD:

bash

watch -n 5 'btvar -v CLOAD'

When CLOAD drops below 19000, job will start.

Option C: Reduce job's load level

bash

btjchange -l 500 <job_number>

Scenario 2: Condition Not Satisfied

Symptom:

bash

# Job condition: STATUS = Ready
btvar -v STATUS
# STATUS: Pending

# Condition false, job waits

Solutions:

Option A: Set variable to required value

bash

btvar -s STATUS Ready

Job should start immediately (if no other blocking conditions).

Option B: Remove condition

If condition no longer relevant:

bash

btjchange <job_number>
# In the btq interface, edit the job's conditions and delete the obsolete one

Option C: Wait for another job to set variable

If variable should be set by another job's assignment:

bash

# Check which job sets STATUS
btjlist | grep -i status
# Look for jobs with assignments to STATUS variable

Scenario 3: Remote Variable Unavailable

Symptom:

bash

# Job condition: server2:BACKUP_DONE = Yes (critical)
btvar -v server2:BACKUP_DONE
# Error: Cannot connect to server2

Solutions:

Option A: Restore remote host connectivity

Investigate network or Xi-Batch connectivity to server2:

bash

# Test network connectivity
ping server2

# Check Xi-Batch scheduler on server2
ssh server2 "ps aux | grep btsched"

Option B: Change condition to non-critical

If acceptable for job to start when remote unavailable:

bash

btjchange <job_number>
# Edit condition, remove critical flag

Option C: Use local variable instead

Create local copy of variable:

bash

# Create local variable
btvar -c BACKUP_DONE "Yes"

# Change job to use local variable
btjchange <job_number>
# Edit condition: server2:BACKUP_DONE → BACKUP_DONE

Scenario 4: User Load Limit Exceeded

Symptom:

User already running many jobs, new job waits:

bash

# User jsmith running jobs
btjlist -u jsmith | grep " Run "
# Shows 5 large jobs already running
# User's max total load level: 10000
# Already using: 9500
# New job load: 1000
# Would exceed limit

Solutions:

Option A: Wait for user's jobs to complete

Monitor user's current load:

bash

watch -n 10 'btjlist -u jsmith | grep " Run "'

Option B: Increase user's load limit (requires admin)

bash

btuser -u jsmith
# Edit user, increase "Max total ll"

Option C: Reduce job's load level

bash

btjchange -l 200 <job_number>

Scenario 5: Priority Ordering

Symptom:

Lower priority job waiting while higher priority jobs start:

bash

# Your job priority: 100
# Other ready jobs priority: 150-200
# LOADLEVEL limit reached
# Higher priority jobs start first
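
The ordering itself can be reproduced on a captured listing. This is a sketch with an invented column layout for illustration (job number, state, name, priority in the last column):

```shell
# Snapshot of ready jobs (invented layout: number, state, name, priority)
LISTING='201  Ready  nightly   150
202  Ready  cleanup   200
203  Ready  myjob     100'

# Highest priority first -- this mirrors the order in which the
# scheduler starts jobs once load headroom becomes available
printf '%s\n' "$LISTING" | sort -k4 -rn
```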

Solutions:

Option A: Increase job priority

bash

btjchange -p 200 <job_number>

Option B: Wait for higher priority jobs to complete

This is normal behavior: the scheduler intentionally starts higher-priority jobs first.

Option C: Reduce other jobs' priority (if you own them)

bash

btjchange -p 50 <other_job_number>

Systematic Troubleshooting Checklist

Use this checklist to diagnose waiting jobs:

1. Load Level Check

bash

btvar -v LOADLEVEL CLOAD
btjlist <job_number> | grep "Load level"
# Calculate: CLOAD + job_load vs LOADLEVEL

  • Load levels allow job to start

2. Time Check

bash

btjlist <job_number> | grep "Time to run"
# Compare against current time

  • Run time is in the past

3. Conditions Check

bash

btjlist <job_number>
# Review conditions list

For each condition:

  • Variable exists
  • Variable value satisfies condition
  • Remote variables accessible (if critical)

4. User Limits Check

bash

btjlist -u <username> | grep " Run "
# Sum load levels of running jobs

  • User hasn't exceeded load limit

5. Priority Check

bash

btjlist | grep " Ready " | sort -k6 -rn
# Shows ready jobs by priority

  • No higher priority jobs waiting

6. Hold Status Check

bash

btjstat <job_number> | grep -i hold

  • Job not on hold

Advanced Diagnostics

Trace Variable Changes

If condition involves variable that should change:

bash

# Enable variable logging
btvar -s LOGVARS varlog

# Watch variable changes
tail -f /var/spool/xi/batch/varlog | grep <variable_name>

Monitor Job Transitions

Enable job logging to track state changes:

bash

# Enable job logging
btvar -s LOGJOBS joblog

# Watch job state changes
tail -f /var/spool/xi/batch/joblog | grep <job_number>

Check Network Variables

For jobs with remote conditions:

bash

# List all remote variables in conditions
btjlist <job_number> | grep -E "machine:"

# Test each remote variable
for var in server1:VAR1 server2:VAR2; do
    echo "Testing $var:"
    btvar -v "$var"
done

Verification After Resolution

After making changes, verify job starts:

bash

# Watch job list
watch -n 2 'btjlist <job_number>'

# Should transition to "Run" state within moments

If still waiting, repeat diagnostic process.

Best Practices

Use informative variable names : Makes condition troubleshooting easier

Document job dependencies : Note which variables jobs depend on

Monitor variable changes : Enable LOGVARS for critical workflows

Set reasonable load levels : Avoid jobs with excessive load values

Use appropriate priorities : Reserve high priorities for truly critical jobs

Test conditions before deployment : Verify conditions work as expected

Provide fallback mechanisms : Use non-critical remote conditions when appropriate
