Why Housekeeping Matters
Xi-Batch systems accumulate jobs over time. Schedules that were once critical fall out of use, one-off jobs remain on the queue in a Done state long after they served their purpose, and jobs targeting decommissioned remote hosts sit idle indefinitely. Each job consumes a slot in shared memory and adds visual clutter to the queue, making it harder for administrators to focus on what is actively running.
Regular housekeeping keeps the scheduler lean, reduces the risk of hitting shared memory limits, and ensures that conditions and assignments referencing orphaned variables do not cause unexpected behaviour.
This article walks through the process of exploring the job queue, identifying candidates for removal, and safely carrying out the cleanup.
Exploring the Job Queue
The btjlist command is the primary tool for reviewing jobs from the command line. By default it shows only local jobs in a terse format. Add -H for column headings and -R to include jobs on remote hosts:
bash
btjlist -H
btjlist -HR
The default display shows the job number, user, title, command interpreter, priority, load level, next scheduled time, conditions, and progress state. This is a good starting point, but for housekeeping you will want additional fields.
Useful Format Codes for Housekeeping
The -F option lets you specify exactly which fields to display. The format string uses % codes, each representing a job attribute. A selection of codes particularly useful for housekeeping:
| Code | Meaning |
|---|---|
| %N | Job number (includes host prefix for remote jobs) |
| %U | User |
| %H | Job name in full (includes queue prefix) |
| %h | Title without queue name |
| %q | Queue name |
| %P | Progress state (Run, Done, Err, Abrt, Canc, or blank) |
| %T | Date and time in full |
| %t | Time or date (short form) |
| %o | Time submitted |
| %W | Last or next run time |
| %r | Repeat specification |
| %d | Delete time (hours) |
| %c | Conditions (abbreviated) |
| %C | Conditions (in full) |
| %s | Assignments (abbreviated) |
| %S | Assignments (in full) |
| %x | Exit code returned by last run |
| %e | Export scope (local, export, or remote runnable) |
| %O | Originating host |
A format string tailored for housekeeping review might look like this:
bash
btjlist -HR -F "%N %U %h %P %T %r %d"
This shows the job number, owner, title, progress state, full date and time, repeat specification, and auto-delete time for every job on all connected hosts.
To filter by queue name, user, or group:
bash
btjlist -H -q "nightly*" -F "%N %h %P %W %r"
btjlist -H -u olduser -F "%N %h %P %T"
btjlist -H -g finance -F "%N %h %P %W"
Understanding Progress States
The progress state is the single most important indicator when deciding whether a job is still needed. The possible states are:
blank (no state) : The job is waiting to run, either for its scheduled time or for conditions to be satisfied. This is the normal state for a healthy repeating job that is not currently executing.
Run : The job is currently executing. Do not remove jobs in this state.
Done : The job completed successfully and has been retained on the queue. If there is no repeat specification, it will sit in Done state indefinitely unless it has an auto-delete time set.
Err : The job terminated with an exit code in the error range. This could indicate a problem that was never resolved, or a job whose underlying process no longer exists.
Abrt : The job was terminated by a signal, either killed by an operator or due to a program fault. Persistent Abrt states often indicate abandoned jobs.
Canc : The job was cancelled before it ran. Jobs left in Canc state were typically set up but never activated, or were cancelled and forgotten about.
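As a quick way to see how the queue breaks down across these states, the following sketch tallies jobs by progress state. It assumes the input comes from `btjlist -R -F "%N %P"` (without `-H`, so there is no heading line to skip) and that a job with a blank state produces a line holding only the job number. The function name `count_by_state` and the `Waiting` label are illustrative and not part of Xi-Batch.

```shell
# Tally jobs by progress state.
# Reads "jobnumber [state]" lines on stdin, for example:
#   btjlist -R -F "%N %P" | count_by_state
# Assumption: a blank progress state leaves only one field on the line.
count_by_state() {
    awk '
        NF == 0 { next }    # skip empty lines
        {
            # A missing second field means the job is waiting to run
            state = (NF >= 2) ? $2 : "Waiting"
            count[state]++
        }
        END { for (s in count) printf "%s %d\n", s, count[s] }
    ' | sort
}
```

A large Done, Err, Abrt, or Canc count relative to Waiting is a reasonable hint that a cleanup is due.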
Identifying Stale Jobs
When reviewing the queue, look for the following indicators that a job may be a candidate for removal.
Jobs in Done, Err, Abrt, or Canc state with old dates : Use the full date and time field (%T) or the last/next time field (%W) to see when the job last ran or was last scheduled. If the date is months or years ago and the job has no repeat specification, it is very likely orphaned.
bash
btjlist -HR -F "%N %h %P %W %r" | grep -E "(Done|Err|Abrt|Canc)"
Jobs with a repeat specification that are stuck : A repeating job should have a future date in the time field or be currently running. If a repeating job shows a date far in the past, it may have encountered an error and stopped advancing. Check whether the "advance time on error" flag is set by reviewing the job's process parameters in btq.
Jobs with conditions referencing non-existent variables : If a job's conditions reference a variable that has been deleted, the job will never run. In btq, press C on the job to view its conditions and check that the referenced variables still exist. From the command line:
bash
btjlist -F "%N %h %C" | grep -i "variable_name"
Jobs belonging to users who have left : Filter by user and review whether any of their jobs are still required:
bash
btjlist -HR -u departed_user -F "%N %h %P %W %r"
Jobs with no time and no conditions : A job with no scheduled time and no conditions that is not in Run state will never execute. These are typically test jobs or jobs that were submitted in cancelled state and never activated.
Jobs targeting remote hosts that are no longer connected : Jobs with the export scope set to "remote runnable" may reference hosts that have been decommissioned. Check the Xi-Batch hosts file (typically /etc/xibatch-hosts) and verify connectivity:
bash
btconn hostname
If the host does not respond, jobs configured to run on that host are candidates for removal.
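To see which remote hosts actually own jobs on the queue before testing each one, the host prefixes can be pulled out of the job numbers. This is a minimal sketch that assumes remote job numbers are printed as `host:number` and local ones as a bare number; `hosts_with_jobs` is a hypothetical helper name.

```shell
# List each remote host that owns at least one job, with a job count.
# Intended input: btjlist -R -F "%N"
hosts_with_jobs() {
    awk -F: '
        # Only lines with a "host:" prefix have two or more fields
        NF >= 2 { gsub(/[ \t]/, "", $1); jobs[$1]++ }
        END { for (h in jobs) printf "%s %d\n", h, jobs[h] }
    ' | sort
}
```

Each host in the output can then be compared against the hosts file and tested with btconn as described above.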
Checking When a Job Last Ran
Xi-Batch can maintain an audit trail of job activity if the LOGJOBS variable is configured. This variable specifies a file path where the scheduler writes a line for every job event - creation, completion, error, cancellation, and so on.
Each log line contains pipe-separated fields: date, time, job number, job title, status code, user, group, priority, and load level. The status codes include Completed, Abort, Cancel, Error, and others.
To see the most recent successful completions of a specific job, assuming LOGJOBS points to /usr/spool/batch/joblog:
bash
grep "jobname" /usr/spool/batch/joblog | grep "Completed" | tail -5
If the LOGJOBS variable is not configured, you can still use the last/next time field in btjlist and the progress state to infer activity. A job in Done state with an old date has not run recently.
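Where the log is available, the last successful completion of every job can be extracted in one pass. The sketch below assumes the field order described above (date, time, job number, title, status code, and so on), that `|` is the only separator apart from optional surrounding spaces, and that lines are appended in chronological order. The function name `last_completed` is illustrative, and dates are treated as opaque text rather than parsed.

```shell
# Print "title|date time" for the most recent Completed entry of each job title.
# Usage: last_completed < /usr/spool/batch/joblog
last_completed() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($5) == "Completed" {
            # Later lines overwrite earlier ones, so the final value is the latest run
            last[trim($4)] = trim($1) " " trim($2)
        }
        END { for (t in last) print t "|" last[t] }
    ' | sort
}
```

Jobs that never appear in the output have no recorded successful completion in the log, which makes them worth a closer look.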
Reviewing Variables for Orphaned References
Variables and jobs are tightly linked through conditions and assignments. Before removing jobs, check whether they set or depend on variables that other jobs also reference. Use btvlist to list all variables:
bash
btvlist -H
To see which jobs reference a particular variable in their conditions or assignments:
bash
btjlist -HR -F "%N %h %C %S" | grep "VARIABLE_NAME"
If a variable is only referenced by jobs you plan to remove, the variable itself can also be removed afterwards. However, attempting to delete a variable that is still referenced by any job will produce an error.
Backing Up Before Cleanup
Before removing anything, create a backup of the current state. Xi-Batch provides utilities that export jobs, variables, command interpreters, and user profiles as shell scripts.
To back up jobs:
bash
mkdir -p /usr/batchsave/$(date +%Y%m%d)/Scripts
cd /usr/batchsave/$(date +%Y%m%d)
gbch-cjlist -D /usr/spool/batch btsched_jfile Jcmd Scripts
This creates Jcmd, a shell script that would resubmit all jobs, with the job scripts saved in the Scripts directory.
To back up variables:
bash
gbch-cvlist -D /usr/spool/batch btsched_vfile Vcmd
To back up command interpreters:
bash
gbch-ciconv -D /usr/spool/batch cifile Cicmd
To back up user permissions:
bash
gbch-uconv -D /usr/spool/batch btufile6 Ucmd
When restoring, the recommended order is: user permissions, command interpreters, variables, then jobs. This avoids errors from jobs referencing items that do not yet exist.
These backup scripts can be edited before restoration if you only need to recover specific items.
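The four backup commands above can be wrapped so that they all write into one dated directory and stop at the first failure. This is a sketch rather than a supplied utility: the function name `backup_xibatch` and the `DRY_RUN` switch are hypothetical, while the commands, file names, and spool directory are the ones shown above. Computing the directory name once also avoids two separate `date` calls disagreeing if the backup starts just before midnight.

```shell
# Back up jobs, variables, command interpreters and user permissions
# into a single directory. With DRY_RUN set, the commands are only printed.
# The body runs in a subshell, so the caller's working directory is unchanged.
backup_xibatch() (
    dest=$1
    spool=${2:-/usr/spool/batch}

    # Either echo the command (dry run) or execute it
    run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

    run mkdir -p "$dest/Scripts" || exit 1
    [ -n "${DRY_RUN:-}" ] || cd "$dest" || exit 1

    # Each step runs only if the previous one succeeded
    run gbch-cjlist -D "$spool" btsched_jfile Jcmd Scripts &&
    run gbch-cvlist -D "$spool" btsched_vfile Vcmd &&
    run gbch-ciconv -D "$spool" cifile Cicmd &&
    run gbch-uconv -D "$spool" btufile6 Ucmd
)

# Example: backup_xibatch "/usr/batchsave/$(date +%Y%m%d)"
```

Running it first with `DRY_RUN=1` prints the commands without touching the spool directory, which is a cheap way to confirm the paths before a real backup.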
Safely Removing Jobs
Once you have confirmed a job is no longer needed, it can be removed in several ways.
Using btq interactively : Navigate to the job in the job list and press D to delete it. If the job is running, you must first kill it with K (which offers Int, Quit, Term, or Kill signals) and wait for it to stop before deleting. Confirmation may be requested depending on your settings.
Using btjdel from the command line : Delete a job by its job number:
bash
btjdel 1420
The job must not be running. For remote jobs, include the host prefix:
bash
btjdel avon:24918
Cancelling before deleting : If you want to stop a job from running but are not yet ready to remove it, use btjchange to set it to cancelled state:
bash
btjchange -C 1420
This prevents the job from executing whilst keeping it on the queue for review. The job can be deleted later when you are satisfied it is no longer needed.
Unqueueing for archival : If you want to preserve a copy of the job before removing it, use the unqueue function. In btq, press U on the job. This saves the job script and a command file that could resubmit it, then optionally removes it from the queue. This is the safest approach when you are unsure whether a job might be needed again.
Bulk identification : To list just the job numbers of all jobs in Done state for a specific user:
bash
btjlist -u olduser -F "%N %P" | awk '$2 == "Done" {print $1}'
This output can be used to script bulk removal, though care should be taken to review each job before deleting.
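One cautious way to script the bulk removal is to write the `btjdel` commands to a file, read through it, and only then run it. The sketch below assumes job numbers arrive one per line, either bare or with a `host:` prefix as in the earlier examples. `make_delete_script` is a hypothetical helper; anything that does not look like a job number is reported on standard error and skipped rather than passed to `btjdel`.

```shell
# Turn a list of job numbers on stdin into reviewable btjdel commands on stdout.
make_delete_script() {
    while read -r job rest; do
        [ -n "$job" ] || continue    # ignore blank lines
        # Accept "12345" or "host:12345" only
        if printf '%s\n' "$job" | grep -Eq '^([A-Za-z0-9._-]+:)?[0-9]+$'; then
            printf 'btjdel %s\n' "$job"
        else
            printf 'skipped: %s\n' "$job" >&2
        fi
    done
}

# Example (review delete_jobs.sh before running it):
#   btjlist -u olduser -F "%N %P" | awk '$2 == "Done" {print $1}' | make_delete_script > delete_jobs.sh
```

Keeping the generated file alongside the backup also gives a record of exactly which jobs were removed.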
Cleaning Up Variables After Job Removal
After removing jobs, check whether any variables are now orphaned. A variable is orphaned if no remaining job references it in a condition or assignment.
To check:
bash
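btvlist -F "%N" | while read -r VARNAME; do
    # -w matches whole words and -F treats the name literally, so VAR does not match VAR2;
    # -H is omitted so that column headings are never counted as references
    REFS=$(btjlist -R -F "%C %S" | grep -cwF -- "$VARNAME")
    if [ "$REFS" -eq 0 ]; then
        echo "Orphaned: $VARNAME"
    fi
done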
Orphaned variables can be removed using btvar:
bash
btvar -d VARIABLE_NAME
Or from btq, switch to the variable list with V and delete with D.
Be cautious with variables that have a system-wide purpose, such as LOGJOBS, LOGVARS, STARTLIM, and STARTWAIT. These control scheduler behaviour and should not be removed.
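When a list of candidate variables is fed into a removal step, it is worth filtering these names out first. This is a minimal sketch, assuming one variable name per line on stdin. The protected list holds only the four names mentioned above and may need extending for a given installation; `drop_system_vars` is a hypothetical name.

```shell
# Remove scheduler control variables from a list of candidate names (one per line).
drop_system_vars() {
    # -x matches whole lines only, so LOGJOBS_OLD is kept while LOGJOBS is dropped.
    # grep exits with 1 when nothing is left to print; treat that as success.
    grep -vxE 'LOGJOBS|LOGVARS|STARTLIM|STARTWAIT' || [ "$?" -eq 1 ]
}
```

Whatever survives the filter still deserves a manual check before it is passed to `btvar -d`.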
Housekeeping Checklist
A periodic review - quarterly or before major upgrades - should cover the following.
Review all jobs by progress state : List jobs in Done, Err, Abrt, and Canc states. Determine whether each is still needed or can be removed.
Check repeat specifications and scheduled times : Identify repeating jobs with dates in the past that are no longer advancing. Investigate whether conditions or errors are preventing them from running.
Review job ownership : Identify jobs belonging to users who have left the organisation or changed roles.
Check remote host connectivity : Verify that all hosts in the Xi-Batch hosts file are still reachable. Use btconn to test connectivity.
Inspect conditions and assignments : Ensure that all variables referenced by conditions and assignments still exist. Look for circular dependencies or conditions that can never be satisfied.
Review the job log : If LOGJOBS is configured, check for jobs that have not completed successfully in a long time.
Back up before removing : Always run the backup utilities before deleting jobs or variables.
Document changes : Keep a record of what was removed and why. The backup scripts serve as a partial record, but a separate log of the rationale is helpful for audit purposes.
Best Practices
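Schedule housekeeping during quiet periods when the batch schedule has no critical jobs running. Do not try to delete jobs whilst they are executing: a running job must first be killed and allowed to stop before it can be removed, and killing it mid-run may leave its work incomplete.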
When decommissioning a remote host, disconnect it cleanly with btdisconn before removing its entry from the hosts file. This avoids the scheduler attempting to reconnect during shutdown.
If shared memory is approaching capacity, removing old retained jobs frees slots immediately. Use btstart with appropriate sizing arguments when restarting the scheduler after significant cleanup to right-size the allocated shared memory.
For systems with complex job schedules, consider enabling LOGJOBS if not already configured. This provides an ongoing audit trail that makes future housekeeping reviews much simpler. Set it to a file path; the scheduler writes the log in a pipe-separated format that is easy to process with standard Unix tools.
When removing jobs that are part of a queue (jobs sharing a queue name prefix), review the entire queue as a group rather than individual jobs. Queues often represent workflows where jobs have interdependencies through conditions and assignments. Removing one job from a queue without considering the others may leave the remaining jobs unable to run.