Job dispatching with loads needs optimization

Steps to reproduce

Steps to reproduce:
1. Run Opencast 3.x with a large (100k+) number of jobs in the jobs table

Actual Results:
Scheduling can take between 20 seconds to a full minute. Expensive queries to ServiceRegistration.hostloads bog things down.

Expected Results:
Faster scheduling

Workaround (if any):

Activity

Show:
James Perrin
November 21, 2017, 4:22 PM

Yeah we are seeing a lot of deadlock errors regards queries to hostloads, which are probably due to slow queries clogging things up - though our NewRelic doesn't really show these as being particularly slow.

Internal Exception: java.sql.SQLException: Transaction (Process ID 103) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Error Code: 1205
Call: SELECT t0.id, t0.active, t0.error_state_trigger, t0.job_producer, t0.online, t0.online_from, t0.path, t0.service_state, t0.service_type, t0.state_changed, t0.warning_state_trigger, t0.host_registration, t1.status, SUM(t1.job_load) FROM mh_service_registration t0, mh_host_registration t2, mh_job t1 WHERE (((((t0.online = ?) AND (t0.active = ?)) AND (t2.maintenance = ?)) AND (t1.status IN (?,?,?))) AND ((t0.id = t1.processor_service) AND (t2.id = t0.host_registration))) GROUP BY t0.id, t0.active, t0.error_state_trigger, t0.job_producer, t0.online, t0.online_from, t0.path, t0.service_state, t0.service_type, t0.state_changed, t0.warning_state_trigger, t0.host_registration, t1.status
bind => [6 parameters bound]
Query: ReportQuery(name="ServiceRegistration.hostloads" referenceClass=JpaJob sql="SELECT t0.id, t0.active, t0.error_state_trigger, t0.job_producer, t0.online, t0.online_from, t0.path, t0.service_state, t0.service_type, t0.state_changed, t0.warning_state_trigger, t0.host_registration, t1.status, SUM(t1.job_load) FROM mh_service_registration t0, mh_host_registration t2, mh_job t1 WHERE (((((t0.online = ?) AND (t0.active = ?)) AND (t2.maintenance = ?)) AND (t1.status IN ?)) AND ((t0.id = t1.processor_service) AND (t2.id = t0.host_registration))) GROUP BY t0.id, t0.active, t0.error_state_trigger, t0.job_producer, t0.online, t0.online_from, t0.path, t0.service_state, t0.service_type, t0.state_changed, t0.warning_state_trigger, t0.host_registration, t1.status")
Default FetchGroup(){creatorServiceRegistration, creator, jobLoad, dispatchable, version, queueTime, dateCreated, blockingJobId, dateStarted, parentJob, dateCompleted, organization, rootJob, arguments, id, runTime, blockedJobIds, operation, childJobs, processorServiceRegistration, status}

Stephen Marquard
November 21, 2017, 4:30 PM

I would suspect there's some other cause behind that deadlock, like an incorrect transaction isolation level, or too much going on in one transaction. Generally a SELECT shouldn't deadlock.

Assignee

Greg Logan

Reporter

Greg Logan

Severity

Performance

Tags (folksonomy)

None

Components

Fix versions

Affects versions

Priority

Major
Configure