Fix Stale Thumbnails
Overview
The Fix Stale Thumbnails job is a health check and recovery mechanism that resets stuck thumbnail generation jobs. It runs hourly, finds sites whose thumbnail_build_in_progress flag has been set for more than 2 hours (15 minutes in any non-production environment), and clears the flag and its start timestamp so the build can be retried. This keeps sites from staying permanently in a "processing" state after a failed or crashed thumbnail generation job.
Complete Flow:
- Cron Initialization: queue-manager/crons/sites/fixStaleThumbnails.js
- Service Processing: queue-manager/services/sites/fixStaleThumbnails.js
- Queue Definition: None (direct database operations)
Execution Pattern: Cron-based (every hour)
Queue Name: N/A (no Bull queue, service-only)
Environment Flag: QM_SITES_FIX_STALE_THUMBNAILS=true (in index.js)
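The gating code in index.js is not shown in this document. As a minimal sketch, assuming the flag is compared against the literal string 'true', the check might look like the following. The helper name isJobEnabled and the commented-out require path are hypothetical.

```javascript
// Hypothetical helper: index.js is not shown here, so the exact check is an assumption.
// Environment variables are always strings, so compare against the literal 'true'.
function isJobEnabled(env, flagName) {
  return env[flagName] === 'true';
}

// Example: only start the cron when the flag is set.
if (isJobEnabled(process.env, 'QM_SITES_FIX_STALE_THUMBNAILS')) {
  // require('./crons/sites/fixStaleThumbnails').start(); // path assumed, relative to queue-manager/
  console.log('fix-stale-thumbnails cron enabled');
}
```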
Complete Processing Flow
sequenceDiagram
participant CRON as Cron Schedule<br/>(every hour)
participant SERVICE as Fix Stale Service
participant AGENCY_DB as Agency<br/>Websites
participant INSTA_DB as InstaSites<br/>Collection
participant LOGGER as Logger
CRON->>SERVICE: fixStaleThumbnails()
SERVICE->>SERVICE: Calculate stale threshold<br/>(2 hours prod / 15 min dev)
SERVICE->>AGENCY_DB: Find stuck jobs:<br/>- desktop thumbnail null<br/>- in_progress = true<br/>- started > 2 hours ago
AGENCY_DB-->>SERVICE: Return match count
alt Stuck jobs found
SERVICE->>AGENCY_DB: Clear flags:<br/>$unset in_progress<br/>$unset started_at
AGENCY_DB-->>SERVICE: Modified count
end
SERVICE->>INSTA_DB: Find stuck jobs:<br/>- status PUBLISHED<br/>- desktop thumbnail null<br/>- in_progress = true<br/>- started > 2 hours ago
INSTA_DB-->>SERVICE: Return match count
alt Stuck jobs found
SERVICE->>INSTA_DB: Clear flags:<br/>$unset in_progress<br/>$unset started_at
INSTA_DB-->>SERVICE: Modified count
end
alt Any jobs reset
SERVICE->>LOGGER: Log reset counts
end
Source Files
1. Cron Initialization
File: queue-manager/crons/sites/fixStaleThumbnails.js
Purpose: Schedule stale thumbnail cleanup every hour
Cron Pattern: 0 * * * * (every hour at minute 0)
Initialization:
const cron = require('node-cron');
const { fixStaleThumbnails } = require('../../services/sites/fixStaleThumbnails');
const logger = require('../../utilities/logger');
let inProgress = false;
exports.start = async () => {
try {
try {
cron.schedule('0 * * * *', async () => {
if (!inProgress) {
inProgress = true;
await fixStaleThumbnails();
}
});
} catch (e) {
throw e;
} finally {
inProgress = false;
}
} catch (err) {
logger.error({ initiator: 'QM/sites/fix-stale-thumbnails', error: err });
}
};
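In-Progress Lock: Intended to prevent overlapping executions (unlikely anyway given the hourly schedule).
Note: The finally block is misplaced in the source code. It runs once, right after cron.schedule() returns, rather than after each job run, so nothing resets inProgress once the first run has set it to true. As written, every tick after the first is skipped until the process restarts, so the job effectively runs once per process lifetime rather than hourly. The reset belongs inside the scheduled callback (see the Code Quality Note section).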
2. Service Processing (THE CORE LOGIC)
File: queue-manager/services/sites/fixStaleThumbnails.js
Purpose: Reset stale thumbnail generation flags
Key Functions:
- Calculate environment-specific stale threshold
- Query stuck agency website jobs
- Query stuck instasite jobs
- Reset flags using the $unset operation
- Log reset counts
Main Processing Function:
exports.fixStaleThumbnails = async () => {
try {
// Environment-specific stale thresholds:
// - Production: 2 hours
// - Development: 15 minutes to facilitate faster testing
const STALE_HOURS = process.env.NODE_ENV === 'production' ? 2 : 0.25;
const staleTimestamp = new Date(Date.now() - STALE_HOURS * 60 * 60 * 1000);
// Reset stuck agency website jobs
const stuckAgencyJobs = await AgencyWebsite.updateMany(
{
'details.thumbnails.desktop': null, // Still missing thumbnails
thumbnail_build_in_progress: true, // Marked as in-progress
thumbnail_process_started_at: { $lt: staleTimestamp }, // Started > 2 hours ago
},
{
$unset: {
thumbnail_process_started_at: '', // Remove timestamp
thumbnail_build_in_progress: '', // Remove flag
},
},
);
// Reset stuck instasite jobs
const stuckInstasiteJobs = await Instasite.updateMany(
{
status: 'PUBLISHED', // Only published sites
'details.thumbnails.desktop': null, // Still missing thumbnails
thumbnail_build_in_progress: true, // Marked as in-progress
thumbnail_process_started_at: { $lt: staleTimestamp }, // Started > 2 hours ago
},
{
$unset: {
thumbnail_process_started_at: '', // Remove timestamp
thumbnail_build_in_progress: '', // Remove flag
},
},
);
if (stuckAgencyJobs?.modifiedCount > 0 || stuckInstasiteJobs?.modifiedCount > 0) {
logger.log({
initiator: 'ThumbnailHealthCheck',
message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
stuckInstasiteJobs?.modifiedCount || 0
} instasite jobs`,
});
}
} catch (err) {
logger.error({
initiator: 'ThumbnailHealthCheck',
message: 'Error in thumbnail health check',
error: err,
});
}
};
Collections Used
agency_websites
- Operations: Update (bulk)
- Model: shared/models/agency-website.js
- Usage Context: Reset stuck thumbnail generation flags
Query Criteria (Stuck Jobs):
{
'details.thumbnails.desktop': null, // Missing desktop thumbnail
thumbnail_build_in_progress: true, // Marked as in-progress
thumbnail_process_started_at: {
$lt: new Date(Date.now() - 2 * 60 * 60 * 1000) // Started over 2 hours ago
}
}
Update Operation:
{
$unset: {
thumbnail_process_started_at: '', // Remove timestamp
thumbnail_build_in_progress: '', // Remove in-progress flag
}
}
Key Fields:
- details.thumbnails.desktop: Desktop screenshot URL (null = missing)
- thumbnail_build_in_progress: Boolean flag indicating generation in progress
- thumbnail_process_started_at: Timestamp when generation started
instasites
- Operations: Update (bulk)
- Model: shared/models/instasite.js
- Usage Context: Reset stuck thumbnail generation flags
Query Criteria (Stuck Jobs):
{
status: 'PUBLISHED', // Only published sites
'details.thumbnails.desktop': null, // Missing desktop thumbnail
thumbnail_build_in_progress: true, // Marked as in-progress
thumbnail_process_started_at: {
$lt: new Date(Date.now() - 2 * 60 * 60 * 1000) // Started over 2 hours ago
}
}
Update Operation: Same as agency_websites
Key Fields: Same structure as agency_websites
Job Configuration
Cron Schedule
'0 * * * *'; // Every hour at minute 0 (e.g., 1:00, 2:00, 3:00, etc.)
Frequency Rationale: Hourly cleanup is sufficient since stale threshold is 2 hours. More frequent checks would be unnecessary.
Stale Threshold
const STALE_HOURS = process.env.NODE_ENV === 'production' ? 2 : 0.25;
Thresholds:
- Production: 2 hours (7,200,000 milliseconds)
- Development: 15 minutes (900,000 milliseconds)
Why Different Thresholds?
- Production: Conservative threshold to avoid resetting legitimately slow jobs
- Development: Faster testing and debugging of stale job logic
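The threshold logic can be isolated into a small pure function for experimentation. This is a sketch rather than the service's actual structure: the function name staleCutoff and its parameters are illustrative. Note that any NODE_ENV value other than 'production', including an unset one, gets the 15-minute threshold.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Returns the cutoff Date: jobs whose start timestamp is earlier than this count as stale.
// `nodeEnv` and `nowMs` are parameters only so the function is easy to test.
function staleCutoff(nodeEnv, nowMs = Date.now()) {
  const staleHours = nodeEnv === 'production' ? 2 : 0.25;
  return new Date(nowMs - staleHours * HOUR_MS);
}

const now = Date.UTC(2024, 0, 1, 12, 0, 0); // arbitrary example instant
console.log(staleCutoff('production', now).toISOString());  // 2024-01-01T10:00:00.000Z
console.log(staleCutoff('development', now).toISOString()); // 2024-01-01T11:45:00.000Z
console.log(staleCutoff(undefined, now).toISOString());     // 2024-01-01T11:45:00.000Z
```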
Processing Logic - Detailed Flow
Stale Job Detection Criteria
A thumbnail job is considered "stale" if ALL conditions are met:
For Agency Websites:
- Missing Thumbnail: 'details.thumbnails.desktop': null (desktop thumbnail is still null, so generation didn't complete)
- In-Progress Flag Set: thumbnail_build_in_progress: true (job was marked as started)
- Started Long Ago: thumbnail_process_started_at: { $lt: staleTimestamp } (job started over 2 hours ago in production, or over 15 minutes ago in development)
For InstaSites:
Same criteria as Agency Websites, plus:
- Published Status: status: 'PUBLISHED' (only published sites matter; drafts can wait)
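The criteria above can be mirrored as an in-memory predicate, which is handy for reasoning about edge cases or unit-testing without a database. This is an illustrative sketch: isStale and its requirePublished option are hypothetical names, and the real job applies these conditions as MongoDB query filters instead.

```javascript
// In-memory mirror of the query filters, for reasoning and unit tests only.
// A `null` filter in MongoDB matches both an explicit null and a missing field,
// which `== null` reproduces here.
function isStale(doc, cutoff, { requirePublished = false } = {}) {
  const desktopMissing = doc.details?.thumbnails?.desktop == null;
  const inProgress = doc.thumbnail_build_in_progress === true;
  const startedAt = doc.thumbnail_process_started_at;
  const startedBeforeCutoff = startedAt instanceof Date && startedAt < cutoff;
  const statusOk = !requirePublished || doc.status === 'PUBLISHED';
  return desktopMissing && inProgress && startedBeforeCutoff && statusOk;
}

const cutoff = new Date('2024-01-01T10:00:00Z');
const stuck = {
  status: 'DRAFT',
  details: { thumbnails: { desktop: null } },
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date('2024-01-01T09:00:00Z'),
};
console.log(isStale(stuck, cutoff));                             // true  (agency website rules)
console.log(isStale(stuck, cutoff, { requirePublished: true })); // false (InstaSite rules: not published)
```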
Why These Criteria?
Missing Thumbnail + In-Progress Flag:
- Indicates job started but never completed
- Could be due to:
- Process crash during generation
- Puppeteer timeout
- Server restart mid-job
- Network failure during upload
Time Threshold:
- Normal thumbnail generation: 10-30 seconds
- Maximum expected time (slow sites): 3 minutes
- 2-hour threshold = 40x buffer for safety
- Prevents false positives from slow but legitimate jobs
Reset Operation
Bulk Update with $unset:
await AgencyWebsite.updateMany(
{
/* stale job criteria */
},
{
$unset: {
thumbnail_process_started_at: '',
thumbnail_build_in_progress: '',
},
},
);
What $unset Does:
- Removes the specified fields from documents
- Equivalent to deleting the properties
- More efficient than $set: { field: null }
Result:
- Sites become eligible for thumbnail generation again
- Next run of the build-thumbnails job will pick them up
- No data loss (just retry metadata removed)
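To make the effect of $unset concrete, here is a plain-object illustration of what happens to a document. It is not a MongoDB call; applyUnset is a hypothetical helper that handles top-level fields only.

```javascript
// In-memory illustration of $unset semantics on a plain object (not a MongoDB call).
// Handles top-level field names only; dotted paths are out of scope for this sketch.
function applyUnset(doc, unsetSpec) {
  const copy = { ...doc };
  for (const field of Object.keys(unsetSpec)) {
    delete copy[field]; // the value given in the $unset spec ('' here) is ignored
  }
  return copy;
}

const before = {
  _id: 'site-1', // hypothetical id
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date('2024-01-01T09:00:00Z'),
};
const after = applyUnset(before, {
  thumbnail_process_started_at: '',
  thumbnail_build_in_progress: '',
});

console.log('thumbnail_build_in_progress' in after); // false: the key is gone, not set to null
console.log(after.thumbnail_build_in_progress);      // undefined
console.log(Object.keys(after));                     // [ '_id' ]
```

Because the fields are removed rather than nulled, a later filter on thumbnail_build_in_progress: true no longer matches the document.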
Logging Logic
if (stuckAgencyJobs?.modifiedCount > 0 || stuckInstasiteJobs?.modifiedCount > 0) {
logger.log({
initiator: 'ThumbnailHealthCheck',
message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
stuckInstasiteJobs?.modifiedCount || 0
} instasite jobs`,
});
}
Why Conditional Logging?
- Only logs when jobs are actually reset
- Prevents log spam on normal (no stuck jobs) runs
- Provides clear count of reset jobs for monitoring
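The conditional can also be expressed as a small pure function that returns the message or nothing, which keeps the decision testable separately from the logger. The name buildResetMessage is hypothetical; the message text matches the one used by the service.

```javascript
// Returns the log message, or null when nothing was reset (so the caller can skip logging).
// `agencyResult` and `instasiteResult` are updateMany-style results; either may be undefined.
function buildResetMessage(agencyResult, instasiteResult) {
  const agency = agencyResult?.modifiedCount || 0;
  const instasite = instasiteResult?.modifiedCount || 0;
  if (agency === 0 && instasite === 0) return null;
  return `Reset ${agency} stuck agency website jobs and ${instasite} instasite jobs`;
}

console.log(buildResetMessage({ modifiedCount: 5 }, { modifiedCount: 3 }));
// Reset 5 stuck agency website jobs and 3 instasite jobs
console.log(buildResetMessage({ modifiedCount: 0 }, undefined)); // null
```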
Error Handling
Common Error Scenarios
Database Connection Error
catch (err) {
logger.error({
initiator: 'ThumbnailHealthCheck',
message: 'Error in thumbnail health check',
error: err,
});
}
Result: Error logged, no jobs reset, will retry next hour.
Query Timeout
Scenario: Large number of stuck jobs causes slow query
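Result: The timeout surfaces through Mongoose as an error and is logged by the catch block. Partial updates may persist: updateMany is not atomic across documents, and because both updates share one try block, the agency website reset may already be applied when the InstaSite update fails (or the InstaSite update may never run if the first one throws).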
Recovery: Next hourly run will catch remaining stuck jobs.
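If one collection's failure should not prevent the other from being processed, each update could run in its own try/catch. The following is a sketch of that alternative, not the current implementation; resetAll and the stand-in functions are hypothetical.

```javascript
// Sketch: run each collection's reset independently so one failure cannot skip the other.
// `resetters` maps a label to an async function returning an updateMany-style result.
async function resetAll(resetters, logError = console.error) {
  const counts = {};
  for (const [label, reset] of Object.entries(resetters)) {
    try {
      const result = await reset();
      counts[label] = result?.modifiedCount || 0;
    } catch (err) {
      logError({ initiator: 'ThumbnailHealthCheck', collection: label, error: err });
      counts[label] = null; // null marks "failed", as opposed to 0 resets
    }
  }
  return counts;
}

// Usage with stand-ins for AgencyWebsite.updateMany / Instasite.updateMany:
resetAll(
  {
    agency: async () => { throw new Error('simulated timeout'); },
    instasite: async () => ({ modifiedCount: 3 }),
  },
  () => {}, // silence error logging for the demo
).then(counts => console.log(counts)); // { agency: null, instasite: 3 }
```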
No Retry Mechanism
This job has no retry logic because:
- Runs hourly automatically
- Failures self-correct on next run
- Not time-critical (stuck jobs can wait another hour)
Monitoring & Logging
Success Logging
logger.log({
initiator: 'ThumbnailHealthCheck',
message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
stuckInstasiteJobs?.modifiedCount || 0
} instasite jobs`,
});
Example Output:
Reset 5 stuck agency website jobs and 3 instasite jobs
Error Logging
logger.error({
initiator: 'ThumbnailHealthCheck',
message: 'Error in thumbnail health check',
error: err,
});
Performance Metrics
- Average Processing Time: < 1 second
- Query Complexity: Simple indexed queries
- Typical Volume: 0-10 stuck jobs per hour (normal operation)
- High Volume: 50-100 stuck jobs after system restart
Integration Points
Triggers This Job
- Cron Schedule: Every hour automatically
- Manual Trigger: Via API endpoint (if QM_HOOKS=true)
Data Dependencies
- Agency Websites: Requires thumbnail_build_in_progress and thumbnail_process_started_at fields
- InstaSites: Same field requirements
Jobs That Depend On This
- Build Thumbnails Job: Picks up sites reset by this job
- Monitoring/Alerting: Tracks stuck job counts for system health
Important Notes
Side Effects
- Flag Reset: Clears thumbnail_build_in_progress and thumbnail_process_started_at
- Retry Eligibility: Sites become eligible for thumbnail generation retry
- No Data Loss: Only metadata removed, no site content affected
Performance Considerations
- Hourly Schedule: Low overhead, minimal database load
- Indexed Queries: Ensure indexes on thumbnail_build_in_progress, thumbnail_process_started_at
- Bulk Updates: Uses updateMany for efficiency
- No Queue: Direct database operations (no Bull overhead)
Maintenance Notes
- Stale Threshold: 2 hours hardcoded (requires code change to modify)
- Environment-Specific: Development uses 15-minute threshold for faster testing
- Log Review: Monitor reset counts for patterns indicating systemic issues
- Complementary Job: Works in tandem with the build-thumbnails job
Business Logic
Why Hourly Instead of More Frequent?
- Stale threshold is 2 hours (hourly checks catch jobs within 3 hours max)
- More frequent checks unnecessary and wasteful
- Stuck jobs are rare exceptions, not normal flow
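The "within 3 hours max" figure can be checked with a little arithmetic. The sketch below is idealized: it assumes ticks fire exactly on whole hours and that the query runs instantly. The function name firstResetTick is illustrative.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Earliest tick at which a job started at `startedAtMs` satisfies `startedAt < tick - threshold`.
// Idealized: ticks fire exactly on multiples of `intervalMs` and the query runs instantly.
function firstResetTick(startedAtMs, thresholdMs = 2 * HOUR_MS, intervalMs = HOUR_MS) {
  const staleAfter = startedAtMs + thresholdMs;
  return (Math.floor(staleAfter / intervalMs) + 1) * intervalMs;
}

const startedAt = Date.UTC(2024, 0, 1, 10, 0, 1); // one second past 10:00
const resetAt = firstResetTick(startedAt);
console.log(new Date(resetAt).toISOString());              // 2024-01-01T13:00:00.000Z
console.log(((resetAt - startedAt) / HOUR_MS).toFixed(2)); // 3.00
```

A job that starts just after a tick waits almost the full 3 hours; one that starts at 10:30 is reset at 13:00, after 2.5 hours.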
Why 2-Hour Threshold?
- Normal generation: 10-30 seconds
- Worst case (slow site): 3 minutes
- 2 hours = 40x safety margin
- Prevents resetting legitimately slow jobs
Why Not Retry Failed Jobs Immediately?
- Gives system time to recover from transient issues
- Prevents retry storms during incidents
- Allows manual investigation of repeated failures
Code Quality Note
Bug in Cron File:
try {
cron.schedule('0 * * * *', async () => {
if (!inProgress) {
inProgress = true;
await fixStaleThumbnails();
}
});
} catch (e) {
throw e;
} finally {
inProgress = false; // BUG: Runs once right after scheduling; the flag is never reset after a job run
}
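Impact: Significant as written. The finally block runs once, immediately after cron.schedule() returns, when inProgress is already false. Inside the callback, inProgress is set to true on the first tick and never reset, so every later tick fails the !inProgress check and skips fixStaleThumbnails(). Unless the flag is reset elsewhere, the job runs once per process start instead of hourly, and sites that get stuck afterwards are not recovered until the next restart.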
Correct Pattern (for reference):
cron.schedule('0 * * * *', async () => {
if (!inProgress) {
try {
inProgress = true;
await fixStaleThumbnails();
} finally {
inProgress = false;
}
}
});
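To see what the difference means in practice, the following sketch simulates three consecutive ticks without node-cron. The factory names and the stand-in job are hypothetical; only the locking logic mirrors the two snippets above.

```javascript
// Simulates hourly ticks without node-cron; `job` stands in for fixStaleThumbnails().
function makeBuggyTick(job) {
  let inProgress = false;
  // Mirrors the source callback: the flag is set on entry and never cleared afterwards.
  return async () => {
    if (!inProgress) {
      inProgress = true;
      await job();
    }
  };
}

function makeFixedTick(job) {
  let inProgress = false;
  return async () => {
    if (inProgress) return;
    inProgress = true;
    try {
      await job();
    } finally {
      inProgress = false; // released after every run, even if the job throws
    }
  };
}

// Fires `ticks` sequential ticks and returns how many times the job actually ran.
async function countRuns(makeTick, ticks = 3) {
  let runs = 0;
  const tick = makeTick(async () => { runs += 1; });
  for (let i = 0; i < ticks; i += 1) await tick();
  return runs;
}

countRuns(makeBuggyTick).then(n => console.log('buggy runs:', n)); // buggy runs: 1
countRuns(makeFixedTick).then(n => console.log('fixed runs:', n)); // fixed runs: 3
```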
Testing
Manual Trigger
# Via API (if QM_HOOKS=true)
POST http://localhost:6002/api/trigger/sites/fixStaleThumbnails
Create Stuck Jobs for Testing
// Create stuck agency website job
await AgencyWebsite.create({
status: 'PUBLISHED',
details: {
previews: { all: 'https://preview.example.com/site/123' },
thumbnails: {
desktop: null,
tablet: null,
mobile: null,
},
},
thumbnail_build_in_progress: true,
thumbnail_process_started_at: new Date(Date.now() - 3 * 60 * 60 * 1000), // 3 hours ago
});
// Create stuck instasite job
await Instasite.create({
status: 'PUBLISHED',
details: {
previews: { all: 'https://preview.example.com/site/456' },
thumbnails: {
desktop: null,
tablet: null,
mobile: null,
},
},
thumbnail_build_in_progress: true,
thumbnail_process_started_at: new Date(Date.now() - 3 * 60 * 60 * 1000), // 3 hours ago
});
// Wait for next hourly run or trigger manually
setTimeout(async () => {
const agencySite = await AgencyWebsite.findOne({
/* query */
});
console.log('Flag cleared:', !agencySite.thumbnail_build_in_progress); // true
}, 3600000); // 1 hour
Monitor Stuck Jobs
// Count currently stuck agency jobs
const stuckAgency = await AgencyWebsite.countDocuments({
'details.thumbnails.desktop': null,
thumbnail_build_in_progress: true,
thumbnail_process_started_at: {
$lt: new Date(Date.now() - 2 * 60 * 60 * 1000),
},
});
console.log('Stuck agency website jobs:', stuckAgency);
// Count currently stuck instasite jobs
const stuckInsta = await Instasite.countDocuments({
status: 'PUBLISHED',
'details.thumbnails.desktop': null,
thumbnail_build_in_progress: true,
thumbnail_process_started_at: {
$lt: new Date(Date.now() - 2 * 60 * 60 * 1000),
},
});
console.log('Stuck instasite jobs:', stuckInsta);
// Find oldest stuck jobs
const oldestStuck = await AgencyWebsite.find({
thumbnail_build_in_progress: true,
thumbnail_process_started_at: { $exists: true },
})
.sort({ thumbnail_process_started_at: 1 })
.limit(10);
console.log('Oldest stuck jobs:');
oldestStuck.forEach(site => {
const hoursStuck = (Date.now() - site.thumbnail_process_started_at) / (60 * 60 * 1000);
console.log(`Site ${site._id}: ${hoursStuck.toFixed(1)} hours stuck`);
});
Test Environment-Specific Threshold
// Development environment (15-minute threshold)
process.env.NODE_ENV = 'development';
await AgencyWebsite.create({
status: 'PUBLISHED',
details: {
thumbnails: { desktop: null },
},
thumbnail_build_in_progress: true,
thumbnail_process_started_at: new Date(Date.now() - 20 * 60 * 1000), // 20 minutes ago
});
// Run fix job
await fixStaleThumbnails();
// Verify reset
const site = await AgencyWebsite.findOne({
/* query */
});
console.log('Flag cleared in dev:', !site.thumbnail_build_in_progress); // true (20 min > 15 min threshold)
Verify Reset Operation
// Before reset
const beforeReset = await AgencyWebsite.findOne({
thumbnail_build_in_progress: true,
thumbnail_process_started_at: { $exists: true },
});
console.log('Before reset:', {
inProgress: beforeReset.thumbnail_build_in_progress, // true
startedAt: beforeReset.thumbnail_process_started_at, // Date object
});
// Run fix job
await fixStaleThumbnails();
// After reset
const afterReset = await AgencyWebsite.findById(beforeReset._id);
console.log('After reset:', {
inProgress: afterReset.thumbnail_build_in_progress, // undefined (field removed)
startedAt: afterReset.thumbnail_process_started_at, // undefined (field removed)
});
Job Type: Scheduled
Execution Frequency: Every hour
Average Duration: < 1 second
Status: Active