Hot Takes: Stop using SQL Agent jobs for application logic - 6 mins
I get it, it’s a tempting proposition: your SQL Server instances come with a job scheduler out of the box and you even get a GUI with Management Studio to create and modify them. The database engineers say they use the very same agent to handle index maintenance and backups, so it must be a trustworthy feature. But please, don’t.
Stop using SQL Agent jobs for application logic. Don’t use them to run hotfixes to batch resolve data integrity issues. Don’t use them to clear the procedure cache for a stored procedure every 5 minutes. Don’t use them to check for data integrity errors not enforced by an application or schema. Don’t use them to send report emails to business analysts. Don’t.
Here are a few reasons why:
Outside of a SQL Server-specific monitoring tool (an investment that keeps getting more costly and less attractive), it can be difficult to properly monitor jobs. Surfacing job completion status, step performance, and similar overview metrics in a more generic observability platform usually requires complex custom queries that account for environment-specific factors.
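For a flavor of what "custom queries" means here, a sketch that pulls each job's most recent outcome out of msdb. The tables, `run_status` codes, and the integer `run_date`/`run_time` encoding are documented; `agent_datetime` is an undocumented msdb helper that converts them, so treat it as convenient rather than guaranteed:

```sql
-- Last outcome per job (step_id = 0 is the job-level outcome row)
SELECT j.name,
       h.run_status,  -- 0 = failed, 1 = succeeded, 3 = canceled
       msdb.dbo.agent_datetime(h.run_date, h.run_time) AS last_run
FROM msdb.dbo.sysjobs AS j
CROSS APPLY (SELECT TOP (1) run_status, run_date, run_time
             FROM msdb.dbo.sysjobhistory AS h
             WHERE h.job_id = j.job_id AND h.step_id = 0
             ORDER BY h.instance_id DESC) AS h;
```

And this is the easy version: add per-step durations, retries, and multi-instance environments and the query grows quickly.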
To get debug-level details, which are often desired and helpful for jobs that carry business context, agent jobs can log to discrete flat files or be configured to retain a higher number of rows in the msdb database. Logging to msdb is not without issues: an msdb that grows too large can hurt system and backup/restore performance, which is why Amazon’s RDS for SQL Server places a limit on the number of rows that can be logged. That RDS limitation should be taken as an advisory.
Logging to discrete files creates a downstream problem: ingesting N job-specific files, or ensuring that a single log file is written to with consistent logging patterns by N jobs. This option is not available in RDS at all, which again should be taken as an advisory.
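Both destinations are wired up at the job step level. A sketch using a hypothetical job name, with the parameters from `sp_update_jobstep`:

```sql
-- Point a step's output at a discrete flat file
EXEC msdb.dbo.sp_update_jobstep
    @job_name = N'NightlyFixup',   -- hypothetical job
    @step_id = 1,
    @output_file_name = N'D:\JobLogs\NightlyFixup.log',
    @flags = 2;   -- 2 = append to the file instead of overwriting each run
```

Every job that logs this way is another file path to track, rotate, and ship somewhere.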
If you use AlwaysOn to provide high availability for databases, you will probably need to synchronize SQL Agent jobs across each node. Since jobs exist at the instance level (in the msdb database), they must be kept in parity across all nodes with something like DbaTools’ Copy-DbaAgentJob, plus custom T-SQL at the job step level that checks the database’s status before attempting any write operations.
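A sketch of what that parity work looks like, assuming dbatools is installed and `SQLNODE1`–`SQLNODE3` are the availability group replicas (hypothetical names):

```powershell
# 1) Copy jobs from the current primary to the other replicas
Copy-DbaAgentJob -Source 'SQLNODE1' -Destination 'SQLNODE2', 'SQLNODE3' -Force

# 2) Each T-SQL job step still needs a guard so it no-ops on secondaries:
#    IF sys.fn_hadr_is_primary_replica(N'AppDb') <> 1 RETURN;
```

The copy has to be re-run (or scheduled, or hooked into deployments) every time a job changes, and the guard has to be remembered in every step that writes.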
At this point, ingesting job metrics becomes more difficult. Every potential failover node must be queried for job metrics and the results merged, to account for an automated failover changing where the jobs actually run and thus where their results are logged. It is wise at this point to filter out successful runs reported by non-primary nodes, since those are likely no-op executions of the job.
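With dbatools this merge might be sketched as follows (instance names hypothetical; property names per `Get-DbaAgentJobHistory` output):

```powershell
# Pull the last day of history from every replica and merge it
$history = Get-DbaAgentJobHistory -SqlInstance 'SQLNODE1', 'SQLNODE2', 'SQLNODE3' `
    -StartDate (Get-Date).AddDays(-1) -ExcludeJobSteps

# Keep failures from everywhere; "successes" on secondaries are usually
# just the primary-replica guard short-circuiting the step
$failures = $history | Where-Object Status -eq 'Failed'
```

Even this hides real decisions: which node was primary when, and whether a secondary's "success" was a genuine run or a no-op.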
This becomes even harder to manage when logging to discrete flat files now located across different servers.
Amazon’s RDS for SQL Server does replicate SQL Agent jobs as a feature, but has some pretty large limitations as well, such as only working on jobs that have T-SQL steps.
I undertook the task of source controlling SQL Agent jobs in my SQL Server collection for Ansible, lowlydba/sqlserver. The process was harder than anticipated due to no shortage of undocumented SMO bugs in the agent internals, which often return stale data even after a refresh. It was ultimately only possible through trial and error, and thanks to the groundwork DbaTools had laid for a smoother SMO experience overall.
Even then, there are still many potential pitfalls, like the fact that schedule names are not unique and a schedule can be tied to zero or more jobs. This forces configuration to rely on self-imposed restrictions (e.g. requiring schedule names to be unique) or on instance-specific GUIDs when defining job components (not recommended for the sane).
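You can see the ambiguity directly in msdb: the same schedule name can back several `schedule_id`s, each attached to any number of jobs.

```sql
-- Duplicate schedule names and their job attachments
SELECT s.name, s.schedule_id, COUNT(js.job_id) AS jobs_attached
FROM msdb.dbo.sysschedules AS s
LEFT JOIN msdb.dbo.sysjobschedules AS js
       ON js.schedule_id = s.schedule_id
GROUP BY s.name, s.schedule_id
ORDER BY s.name;
```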
While it is technically possible to source control the jobs with pure T-SQL, it is unreasonably complex when compared to SMO and is only realistically feasible by dropping and re-creating the jobs wholesale with each modification, thereby erasing job history.
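The pure T-SQL deployment pattern ends up looking roughly like this (job, step, and procedure names hypothetical); because history is keyed to the old `job_id`, the drop erases it:

```sql
-- Drop-and-recreate wholesale: the only practical pure T-SQL approach
IF EXISTS (SELECT 1 FROM msdb.dbo.sysjobs WHERE name = N'NightlyFixup')
    EXEC msdb.dbo.sp_delete_job @job_name = N'NightlyFixup';

EXEC msdb.dbo.sp_add_job     @job_name = N'NightlyFixup';
EXEC msdb.dbo.sp_add_jobstep @job_name = N'NightlyFixup',
    @step_name = N'Run fixup',
    @subsystem = N'TSQL',
    @command   = N'EXEC dbo.usp_NightlyFixup;';  -- hypothetical proc
EXEC msdb.dbo.sp_add_jobserver @job_name = N'NightlyFixup';
```

Multiply this by every step, schedule, notification, and retry setting on a real job and "unreasonably complex" starts to feel generous.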
One of the biggest problems with using SQL Agent jobs for business logic is that it is fundamentally separated from its application in most cases. The jobs may be monitored or owned by a wholly separate team. This leads to jobs being append-only in practice, where more and more get piled on over the years and context is lost to tribal knowledge.
These jobs are often left out of an application’s golden signals and are only revisited once they create problems for end users.
There are three built-in permission roles for managing SQL Agent jobs: SQLAgentUserRole, SQLAgentReaderRole, and SQLAgentOperatorRole. The documentation explains all of them in depth, but the takeaway is that they are woefully underwhelming when using SQL Agent as a generic job scheduler.
There are two ways to allow users to self-manage jobs:
The SQLAgentUserRole can manage jobs owned by the user’s login. A job’s owner can’t be a Windows security group, however, so this requires a 1:1 mapping between a login and its job(s), which usually results in a login’s credentials being insecurely shared amongst a group of people.
The SQLAgentOperatorRole can manage any job regardless of the job’s owner. This level of permissioning is usually at odds with the principle of least privilege and is often not an option for non-administrators.
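Both options reduce to a one-line membership grant in msdb (user names hypothetical, and each login needs a user mapped in msdb first):

```sql
USE msdb;
ALTER ROLE SQLAgentUserRole     ADD MEMBER [AppTeamUser];  -- own jobs only
ALTER ROLE SQLAgentOperatorRole ADD MEMBER [OnCallUser];   -- every job
```

There is no built-in middle ground, such as "manage this team's jobs but not that team's."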
To be effective, you usually need either an automation framework that abstracts away these permissioning issues, or an on-call team with the proper permissions at the ready to click a start/stop button in the event of a related incident. Both have complexities of their own, but those are out of scope for this post.
Performing QA on SQL Agent jobs may ideally look something like this:
- All queries are in stored procedures
- All stored procedures have tSQLt unit tests
- All stored procedure calls are wrapped in PowerShell functions
- All PowerShell functions have Pester tests
- All PowerShell functions are in a module that is updatable from a package manager
- All SQL Agent jobs call tested PowerShell scripts
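The Pester layer in that list might be sketched like this, with a hypothetical module `AppDbJobs` and function `Invoke-NightlyFixup` that wraps a stored procedure call via dbatools:

```powershell
Describe 'Invoke-NightlyFixup' {
    It 'invokes the stored procedure exactly once' {
        # Replace the database call so the test never touches a real server
        Mock Invoke-DbaQuery {} -ModuleName AppDbJobs
        Invoke-NightlyFixup -SqlInstance 'SQLNODE1' -Database 'AppDb'
        Should -Invoke Invoke-DbaQuery -ModuleName AppDbJobs -Times 1 -Exactly
    }
}
```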
I have never seen a company have any of these steps in place, despite the fact that jobs often entail data changes in critical systems. Usually, making a “database change” is a convenient way to bypass traditional QA requirements and push out data/bug fixes ASAP and never look back.
That is to say nothing of the behavioral differences when running PowerShell scripts from SQL Agent on different SQL Server versions and installed PowerShell versions, which should be investigated and tightly controlled.
All of this is to say, there are generally better options out there if you care about longevity, observability, and overall code quality.