I was facing an issue where the email scheduler job kept sending duplicate activation emails to users after they registered with the system.

After some troubleshooting, I found the culprit: the Spring scheduled job was running independently on each node in a clustered environment.
I then decided to “borrow” the idea from Liquibase on how it ensures only one instance runs the changesets at a time. First, I created a table called scheduler_lock with the columns below:
```sql
CREATE TABLE scheduler_lock (
    id        integer NOT NULL,
    job_type  character varying(50),
    is_lock   boolean,
    lock_time timestamp without time zone,
    lock_by   character varying(255),
    CONSTRAINT pk_scheduler_lock PRIMARY KEY (id)
);
```
With the job_type column, we can store the status of different scheduled jobs in a single table. lock_time and lock_by simply record the lock time and host name for easier debugging later. The idea is pretty simple: before running, every job checks the lock status for its own job type.

For example: Node A runs the job and takes ownership of the lock by setting is_lock to TRUE and lock_time and lock_by to the current time and host name. When Node B kicks in with the same job, it checks those columns, finds is_lock = TRUE, and knows another node is already performing the job, so it can take a rest and discard its own run.

After Node A completes its job, it releases the lock by setting is_lock back to FALSE and lock_time and lock_by to null.

Here are the code snippets for getting and releasing the lock.
```java
// Assumes a static import of org.hibernate.criterion.Restrictions.eq
public boolean getLock(SchedulerJobType jobType, String author) {
    log.debug("[{}] getting the lock", author);
    Session session = sessionFactory.getCurrentSession();
    Criteria criteria = session.createCriteria(SchedulerLock.class);
    criteria.add(eq("jobType", jobType.getCode()));
    SchedulerLock scheduleLock = (SchedulerLock) criteria.uniqueResult();
    // Boolean.FALSE.equals(...) also guards against a null is_lock column
    if (Boolean.FALSE.equals(scheduleLock.getIsLock())) {
        scheduleLock.setLock(Boolean.TRUE);
        scheduleLock.setLockBy(author);
        scheduleLock.setLockTime(new Date());
        session.update(scheduleLock);
        log.debug("[{}] got the lock", author);
        return true;
    }
    return false;
}
```
```java
public void releaseLock(SchedulerJobType jobType, String author) {
    log.debug("[{}] releasing the lock", author);
    Session session = sessionFactory.getCurrentSession();
    Criteria criteria = session.createCriteria(SchedulerLock.class);
    criteria.add(eq("jobType", jobType.getCode()));
    criteria.add(eq("lockBy", author)); // only the lock owner may release it
    try {
        SchedulerLock scheduleLock = (SchedulerLock) criteria.uniqueResult();
        if (Boolean.TRUE.equals(scheduleLock.getIsLock())) {
            scheduleLock.setLock(Boolean.FALSE);
            scheduleLock.setLockBy(null);
            scheduleLock.setLockTime(null);
            session.update(scheduleLock);
            log.debug("[{}] released the lock", author);
        }
    } catch (HibernateException e) {
        log.debug("Not able to release lock by current thread {} due to {}",
                author, e.getCause());
    }
}
```
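Both snippets assume a SchedulerLock entity mapped to the scheduler_lock table. Here is a minimal sketch of what that entity might look like; the field and accessor names are inferred from the Criteria queries and the getIsLock()/setLock() calls above, and the actual Hibernate mapping (annotations or hbm.xml) is omitted:

```java
import java.util.Date;

// Sketch of the SchedulerLock entity used by the DAO snippets.
// Mapping metadata omitted; one row per job type in scheduler_lock.
class SchedulerLock {
    private Integer id;       // scheduler_lock.id (primary key)
    private String jobType;   // job_type, e.g. "EMAIL"
    private Boolean isLock;   // is_lock: TRUE while a node owns the lock
    private Date lockTime;    // lock_time: when the lock was taken
    private String lockBy;    // lock_by: host name of the owning node

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getJobType() { return jobType; }
    public void setJobType(String jobType) { this.jobType = jobType; }
    public Boolean getIsLock() { return isLock; }
    public void setLock(Boolean isLock) { this.isLock = isLock; }
    public Date getLockTime() { return lockTime; }
    public void setLockTime(Date lockTime) { this.lockTime = lockTime; }
    public String getLockBy() { return lockBy; }
    public void setLockBy(String lockBy) { this.lockBy = lockBy; }
}
```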
OK, it seemed everything was now in place and my Spring scheduler was “cluster safe”. Yay!

BUT… when the job ran again with this new implementation, some weird behavior happened again.

Node A and Node B got the lock one after another: Node A released the lock successfully, but Node B threw an exception when releasing it. The logs below show the situation:
```
#Node A
2016-02-02 03:12:00,000 DEBUG EmailServiceImpl: Email scheduler job started...
2016-02-02 03:12:00,013 DEBUG SchedulerLockDAOImpl: [Node-A] getting the lock
2016-02-02 03:12:00,022 DEBUG SchedulerLockDAOImpl: [Node-A] got the lock
.....
2016-02-02 03:12:06,199 DEBUG SchedulerLockDAOImpl: [Node-A] releasing the lock
2016-02-02 03:12:06,204 DEBUG SchedulerLockDAOImpl: [Node-A] released the lock
2016-02-02 03:12:06,210 DEBUG EmailServiceImpl: Email scheduler job end successfully...
```
```
#Node B
2016-02-02 03:12:00,000 DEBUG EmailServiceImpl: Email scheduler job started...
2016-02-02 03:12:00,013 DEBUG SchedulerLockDAOImpl: [Node-B] getting the lock
2016-02-02 03:12:00,019 DEBUG SchedulerLockDAOImpl: [Node-B] got the lock
.....
2016-02-02 03:12:05,082 DEBUG SchedulerLockDAOImpl: [Node-B] releasing the lock
2016-02-02 03:12:05,102 ERROR TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task.
```
Let's zoom into the code and see what is happening.
```java
if (Boolean.FALSE.equals(scheduleLock.getIsLock())) {
    scheduleLock.setLock(Boolean.TRUE);       // <-- Node A
    scheduleLock.setLockBy(author);
    scheduleLock.setLockTime(new Date());
    session.update(scheduleLock);             // <-- Node B
    log.debug("[{}] got the lock", author);
    return true;
}
```
In this situation, the only explanation I could think of was this: Node B ran the job first and took the lock by updating the record, but before Node B committed the change to the database, Node A kicked in, still saw the lock status as FALSE, and overwrote Node B's update almost immediately. In other words, the read-then-update in getLock() is not atomic, so two nodes can both read is_lock = FALSE and both believe they own the lock.

After some thought, I came up with the idea of making the nodes run at different times by giving each job a random Thread sleep, and guess what, problem solved 🙂
```java
@Scheduled(cron = "0 */2 * * * ?") // every 2 minutes
public void sendActivationEmailTask() {
    try {
        // Sleep a random 3-59 seconds so the nodes are unlikely to
        // check the lock at the same moment
        int sleepPeriod = ThreadLocalRandom.current().nextInt(3, 60);
        Thread.sleep(sleepPeriod * 1000);
        String author = InetAddress.getLocalHost().getHostName();
        if (schedulerLockDAO.getLock(SchedulerJobType.EMAIL, author)) {
            // GET THE RECORD AND SEND EMAIL
            schedulerLockDAO.releaseLock(SchedulerJobType.EMAIL, author);
        }
    } catch (Exception e) {
        // LOG EXCEPTION
    }
}
```
p/s: This might not be the best way, as the scheduled job will no longer run at exactly the 2-minute interval we set; in this case it becomes 2 minutes plus a Thread sleep of 3 to 59 seconds. Leave me a comment if you have a better idea.
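For what it's worth, one common way to close this kind of race (a sketch, not the approach used above) is to collapse the read-then-update into a single conditional UPDATE and let the database arbitrate: a node owns the lock only if the statement reports exactly one affected row. The table and column names below follow the scheduler_lock schema from earlier; the `:host` and `:jobType` placeholders are illustrative:

```sql
-- Atomically claim the lock: succeeds (1 row updated) only when free.
UPDATE scheduler_lock
   SET is_lock  = TRUE,
       lock_by  = :host,      -- current host name
       lock_time = now()
 WHERE job_type = :jobType
   AND is_lock  = FALSE;
-- If the driver reports 0 rows updated, another node holds the lock.
```

Because the WHERE clause re-checks is_lock inside the same statement, two nodes can no longer both observe FALSE and both "win"; the database serializes the two updates and only one affects a row.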