Understanding Job Start Rate Control
When many jobs become ready simultaneously (scheduled for the same time, or conditions satisfied together), Xi-Batch can overwhelm system resources by starting them all at once. The STARTLIM and STARTWAIT variables control this behaviour.
The Problem
Resource swamping occurs when:
- Hundreds of jobs are scheduled for the same time (e.g., midnight)
- One variable change releases many waiting jobs in a cascade
- Network-intensive jobs saturate bandwidth
- Disk I/O overwhelms the storage subsystem
- Too many simultaneous spawns exhaust the process table
Symptoms:
- System becomes unresponsive at job start times
- Network timeouts during peak job starts
- Jobs fail with "resource temporarily unavailable" errors
- High load average spikes
- Disk I/O wait times increase dramatically
How STARTLIM Works
STARTLIM : Maximum number of jobs Xi-Batch will start in a single batch
Default value: 5
When jobs are ready to start, Xi-Batch processes them in batches:
- Scheduler identifies all ready jobs
- Starts first STARTLIM jobs (highest priority first)
- Waits STARTWAIT seconds
- Starts next batch of STARTLIM jobs
- Repeats until all ready jobs have been started
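The batch/wait cycle above can be sketched as a small simulation. This only illustrates the behaviour as described in this section; it is not Xi-Batch's scheduler code and does not call Xi-Batch. The function name `simulate_starts` is hypothetical, and the sketch assumes no further jobs become ready while the batches are being released.

```bash
#!/bin/bash
# Hypothetical helper: print when each batch would start, following the
# batch/wait cycle described above.
# Arguments: number of ready jobs, STARTLIM, STARTWAIT (seconds).
simulate_starts() {
    local ready=$1 startlim=$2 startwait=$3
    local elapsed=0 batch=1 count
    # Guard against a zero or negative batch size, which would never finish
    [ "$startlim" -gt 0 ] || return 1
    while [ "$ready" -gt 0 ]; do
        # Each batch takes at most STARTLIM jobs from the ready set
        count=$(( ready < startlim ? ready : startlim ))
        echo "t=${elapsed}s batch ${batch}: start ${count} jobs"
        ready=$(( ready - count ))
        batch=$(( batch + 1 ))
        # The wait only applies while jobs are still left to start
        if [ "$ready" -gt 0 ]; then
            elapsed=$(( elapsed + startwait ))
        fi
    done
}

# 12 ready jobs with the default STARTLIM=5 and STARTWAIT=30
simulate_starts 12 5 30
```

With these inputs the sketch reports three batches, at 0, 30 and 60 seconds, containing 5, 5 and 2 jobs.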
How STARTWAIT Works
STARTWAIT : Waiting time in seconds between job start batches
Default value: 30 seconds
This delay allows started jobs to:
- Complete initialization
- Establish network connections
- Allocate resources
- Reduce competition for system resources
Checking Current Settings
bash
# View current values
btvar -v STARTLIM STARTWAIT
# Or use btq
btq -V   # Switch to variables screen
# Look for STARTLIM and STARTWAIT
Example output:
STARTLIM   5    # Number of jobs to start at once
STARTWAIT  30   # Wait time in seconds for job start
Adjusting STARTLIM
Increase STARTLIM when:
- High-performance hardware can handle more concurrent starts
- Jobs are lightweight and start quickly
- Network and I/O subsystems are fast
- No resource contention observed
bash
# Increase to 10 jobs per batch
btvar -s STARTLIM 10
Decrease STARTLIM when:
- System becomes unresponsive during job starts
- Network saturates during peak times
- Disk I/O bottlenecks occur
- Process table fills up
- Resource allocation failures observed
bash
# Reduce to 3 jobs per batch
btvar -s STARTLIM 3
Adjusting STARTWAIT
Increase STARTWAIT when:
- Jobs need more initialization time
- Network connections take time to establish
- Resource contention observed between batches
- Slower hardware or storage
bash
# Increase wait to 60 seconds
btvar -s STARTWAIT 60
Decrease STARTWAIT when:
- Jobs start quickly and cleanly
- No resource contention
- High-performance systems
- Want faster job throughput
bash
# Reduce wait to 15 seconds
btvar -s STARTWAIT 15
Finding Optimal Settings
Test different values to find optimal settings for your environment:
Step 1: Establish baseline
Monitor system during typical job start period:
bash
# Watch load average and job starts
watch -n 5 'uptime; btjlist | grep " Run " | wc -l'
Step 2: Test incremental changes
Make small adjustments:
bash
# Start conservative
btvar -s STARTLIM 3
btvar -s STARTWAIT 45
# Monitor for several days
# Gradually increase STARTLIM if system handles load well
Step 3: Monitor key metrics
- System load average
- Network utilization
- Disk I/O wait percentage
- Job failure rates
- Time to complete job batches
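To compare settings over several days, it helps to record samples in a consistent form. The following is a minimal sketch that covers only two of the metrics listed above (load average and number of running jobs). The helper names `load_1min` and `sample_line` are hypothetical, and reading `/proc/loadavg` assumes a Linux system; on other platforms the load would have to come from `uptime` instead.

```bash
#!/bin/bash
# Hypothetical helper: print the 1-minute load average.
# Assumes a Linux-style loadavg file; the path can be overridden for testing.
load_1min() {
    local file=${1:-/proc/loadavg}
    # The first whitespace-separated field is the 1-minute load average
    awk '{ print $1; exit }' "$file"
}

# Hypothetical helper: emit one CSV sample "epoch_seconds,load,running_jobs".
# The running-job count is supplied by the caller.
sample_line() {
    local running=$1 file=${2:-/proc/loadavg}
    echo "$(date +%s),$(load_1min "$file"),${running}"
}
```

A caller could append `sample_line "$(btjlist | grep -c " Run ")"` to a log file every few seconds during the job start period, reusing the `btjlist` filter from Step 1.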
Step 4: Iterate
Adjust based on observations until you reach an acceptable balance between how quickly jobs start and how much load the system can absorb.
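When iterating, a useful lower bound is the smallest STARTLIM that still gets every ready job started within a given time window. The sketch below derives it from the batch model described under How STARTLIM Works (first batch at time zero, one STARTWAIT pause between consecutive batches). The function name `min_startlim` is hypothetical, and the result is an estimate under that model, not a value reported by Xi-Batch.

```bash
#!/bin/bash
# Hypothetical helper: smallest STARTLIM that releases all ready jobs within
# a window, for a given STARTWAIT.
# Arguments: number of jobs, window in seconds, STARTWAIT in seconds.
min_startlim() {
    local jobs=$1 window=$2 startwait=$3
    [ "$startwait" -gt 0 ] || return 1
    # Batches that fit in the window: one at t=0 plus one per full wait
    local batches=$(( window / startwait + 1 ))
    # Ceiling division: spread the jobs over the available batches
    echo $(( (jobs + batches - 1) / batches ))
}

# 400 jobs that must all be started within one hour, with STARTWAIT=60
min_startlim 400 3600 60   # prints 7
```

Anything below this value cannot meet the window regardless of hardware, so testing can start from here and move upward only if the system copes.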
Example Scenarios
High-Volume Network Jobs
400 jobs are scheduled for midnight, and each performs a network file transfer:
bash
# Conservative settings to prevent network saturation
btvar -s STARTLIM 2
btvar -s STARTWAIT 60
# Jobs start 2 at a time, 60 seconds between batches
# Takes approximately 200 minutes to start all 400
Lightweight Batch Jobs
100 small jobs that complete in seconds:
bash
# Aggressive settings for fast throughput
btvar -s STARTLIM 15
btvar -s STARTWAIT 10
# Jobs start 15 at a time, 10 seconds between batches
# 7 batches with 6 waits between them: all 100 started within approximately 60 seconds
Mixed Workload
Mix of heavy and light jobs:
bash
# Moderate settings for balance
btvar -s STARTLIM 5
btvar -s STARTWAIT 30
# Default settings often work well for mixed workloads
Dynamic Adjustment
Adjust settings based on time of day or system load:
Example: Business hours vs overnight
bash
#!/bin/bash
# Scheduled job to adjust start rate
HOUR=$(date +%H)
if [ "$HOUR" -ge 8 ] && [ "$HOUR" -lt 18 ]; then
    # Business hours: be conservative
    btvar -s STARTLIM 2
    btvar -s STARTWAIT 60
else
    # Overnight: more aggressive
    btvar -s STARTLIM 10
    btvar -s STARTWAIT 20
fi
Schedule this to run hourly:
bash
echo "/usr/local/bin/adjust-startrate.sh" | btr -r 1:h
Integration with Monitoring
Monitor and alert on resource exhaustion:
bash
#!/bin/bash
# Alert if too many jobs starting causes issues
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
THRESHOLD=10
if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )); then
    # High load detected, reduce start rate
    CURRENT_STARTLIM=$(btvar -v STARTLIM | awk '{print $2}')
    if [ "$CURRENT_STARTLIM" -gt 2 ]; then
        NEW_STARTLIM=$((CURRENT_STARTLIM - 1))
        btvar -s STARTLIM "$NEW_STARTLIM"
        logger "Xi-Batch: Reduced STARTLIM to $NEW_STARTLIM due to high load"
        echo "Load average $LOAD exceeds threshold, reduced STARTLIM" | \
            mail -s "Xi-Batch Auto-Adjustment" admin@example.com
    fi
fi
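The decision logic in the script above can be separated from its side effects, which makes it possible to test without touching live settings. The following is a sketch of that idea, not part of Xi-Batch: the function name `next_startlim` and the optional floor argument are hypothetical. It uses `awk` for the floating-point comparison so that `bc` is not required.

```bash
#!/bin/bash
# Hypothetical helper: decide the next STARTLIM value.
# Arguments: current STARTLIM, observed load, load threshold, floor (default 2).
# Prints the value to use; never goes below the floor.
next_startlim() {
    local current=$1 load=$2 threshold=$3 floor=${4:-2}
    # awk exits 0 only when the load is numerically above the threshold
    if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l + 0 > t + 0) }' \
        && [ "$current" -gt "$floor" ]; then
        echo $(( current - 1 ))
    else
        echo "$current"
    fi
}

next_startlim 5 12.3 10   # prints 4: load above threshold, one step down
next_startlim 2 50 10     # prints 2: already at the floor
```

The monitoring script could then call `btvar -s STARTLIM` only when the printed value differs from the current one.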
Verifying Settings Are Effective
Watch job starts in real-time:
bash
# Terminal 1: Watch variables
watch -n 2 'btvar -v STARTLIM STARTWAIT; echo ""; btvar -v CLOAD'
# Terminal 2: Monitor job starts
btjlist | grep " Run "
# Terminal 3: System load
watch -n 2 uptime
Observe that:
- Jobs start in batches of STARTLIM size
- Delay of STARTWAIT seconds between batches
- System load remains manageable
- No resource exhaustion errors
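If job start times are available as epoch timestamps, the first two observations can be checked mechanically. The sketch below assumes such a list already exists, for example from your own job wrapper logs; it does not show how to obtain it from Xi-Batch. The function name `summarise_batches` and its grouping threshold are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: summarise job start times read from stdin, one epoch
# timestamp per line. Starts separated by less than THRESHOLD seconds are
# treated as the same batch.
# Prints "max_batch=<largest batch> min_interval=<shortest time between batch starts>".
summarise_batches() {
    local threshold=$1
    sort -n | awk -v g="$threshold" '
        NR == 1 { size = 1; max = 1; batch_start = $1; prev = $1; next }
        {
            if ($1 - prev >= g) {
                # Gap is large enough: a new batch begins here
                gap = $1 - batch_start
                if (min == "" || gap < min) min = gap
                size = 1
                batch_start = $1
            } else {
                size++
            }
            if (size > max) max = size
            prev = $1
        }
        END {
            if (NR > 0)
                printf "max_batch=%d min_interval=%s\n", max, (min == "" ? "n/a" : min)
        }
    '
}

# Six starts: three near t=100, two near t=130, one at t=160
printf '%s\n' 100 101 102 130 131 160 | summarise_batches 10
```

For this input the sketch prints `max_batch=3 min_interval=30`. A `max_batch` above STARTLIM, or a `min_interval` well below STARTWAIT, would suggest the settings are not taking effect.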
Troubleshooting
Jobs not starting despite being ready : Check LOADLEVEL hasn't been exceeded. STARTLIM only controls batch size, not total running jobs.
bash
btvar -v LOADLEVEL CLOAD
All jobs start simultaneously despite STARTLIM : Verify that STARTLIM is actually set:
bash
btvar -v STARTLIM
# Should show your configured value, not 0 or blank
System still overloaded : STARTLIM may still be too high, or STARTWAIT too short. Continue reducing:
bash
btvar -s STARTLIM 1
btvar -s STARTWAIT 120
Jobs taking too long to start : If STARTWAIT is too long, reduce it incrementally:
bash
# Current: 120 seconds
# Try: 90 seconds
btvar -s STARTWAIT 90
Best Practices
Start conservative : Begin with a low STARTLIM (2-3) and a high STARTWAIT (60-90 seconds), then raise STARTLIM and lower STARTWAIT gradually
Monitor before adjusting : Collect data on system behavior before making changes
Document changes : Record STARTLIM/STARTWAIT adjustments and reasons
Test during low-impact periods : Experiment with settings during non-critical times
Consider hardware limitations : Slower systems need more conservative settings
Account for network topology : Jobs accessing network resources need longer delays
Review regularly : As workload patterns change, revisit settings