🔧 Fix Stale Thumbnails

📖 Overview

The Fix Stale Thumbnails job is a health check and recovery mechanism that resets stuck thumbnail generation jobs. It runs hourly, finds sites whose thumbnail_build_in_progress flag has been true for longer than the stale threshold (2 hours in production, 15 minutes in any other environment), and clears the flag and its start timestamp so generation can be retried. This keeps sites from staying stuck in a "processing" state after a thumbnail generation job fails or crashes.

Complete Flow:

  1. Cron Initialization: queue-manager/crons/sites/fixStaleThumbnails.js
  2. Service Processing: queue-manager/services/sites/fixStaleThumbnails.js
  3. Queue Definition: None (direct database operations)

Execution Pattern: Cron-based (every hour)

Queue Name: N/A (no Bull queue, service-only)

Environment Flag: QM_SITES_FIX_STALE_THUMBNAILS=true (in index.js)

🔄 Complete Processing Flow

sequenceDiagram
participant CRON as Cron Schedule<br/>(every hour)
participant SERVICE as Fix Stale Service
participant AGENCY_DB as Agency<br/>Websites
participant INSTA_DB as InstaSites<br/>Collection
participant LOGGER as Logger

CRON->>SERVICE: fixStaleThumbnails()
SERVICE->>SERVICE: Calculate stale threshold<br/>(2 hours prod / 15 min dev)

SERVICE->>AGENCY_DB: Find stuck jobs:<br/>- desktop thumbnail null<br/>- in_progress = true<br/>- started > 2 hours ago
AGENCY_DB-->>SERVICE: Return match count

alt Stuck jobs found
SERVICE->>AGENCY_DB: Clear flags:<br/>$unset in_progress<br/>$unset started_at
AGENCY_DB-->>SERVICE: Modified count
end

SERVICE->>INSTA_DB: Find stuck jobs:<br/>- status PUBLISHED<br/>- desktop thumbnail null<br/>- in_progress = true<br/>- started > 2 hours ago
INSTA_DB-->>SERVICE: Return match count

alt Stuck jobs found
SERVICE->>INSTA_DB: Clear flags:<br/>$unset in_progress<br/>$unset started_at
INSTA_DB-->>SERVICE: Modified count
end

alt Any jobs reset
SERVICE->>LOGGER: Log reset counts
end

๐Ÿ“ Source Filesโ€‹

1. Cron Initializationโ€‹

File: queue-manager/crons/sites/fixStaleThumbnails.js

Purpose: Schedule stale thumbnail cleanup every hour

Cron Pattern: 0 * * * * (every hour at minute 0)

Initialization:

const cron = require('node-cron');
const { fixStaleThumbnails } = require('../../services/sites/fixStaleThumbnails');
const logger = require('../../utilities/logger');

let inProgress = false;
exports.start = async () => {
  try {
    try {
      cron.schedule('0 * * * *', async () => {
        if (!inProgress) {
          inProgress = true;
          await fixStaleThumbnails();
        }
      });
    } catch (e) {
      throw e;
    } finally {
      inProgress = false;
    }
  } catch (err) {
    logger.error({ initiator: 'QM/sites/fix-stale-thumbnails', error: err });
  }
};

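In-Progress Lock: Intended to prevent overlapping executions (unlikely anyway, given the hourly schedule).

Note: The finally block is misplaced in the source code. It runs once, immediately after cron.schedule() registers the task, not after each job run. Inside the callback, inProgress is set to true on the first run and never reset, so as written every later tick skips fixStaleThumbnails() until the process restarts. The lock should instead be released in a try/finally inside the callback (see the Code Quality Note).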

2. Service Processing (THE CORE LOGIC)

File: queue-manager/services/sites/fixStaleThumbnails.js

Purpose: Reset stale thumbnail generation flags

Key Functions:

  • Calculate environment-specific stale threshold
  • Query stuck agency website jobs
  • Query stuck instasite jobs
  • Reset flags using $unset operation
  • Log reset counts

Main Processing Function:

exports.fixStaleThumbnails = async () => {
  try {
    // Environment-specific stale thresholds:
    // - Production: 2 hours
    // - Any other NODE_ENV: 15 minutes to facilitate faster testing
    const STALE_HOURS = process.env.NODE_ENV === 'production' ? 2 : 0.25;
    const staleTimestamp = new Date(Date.now() - STALE_HOURS * 60 * 60 * 1000);

    // Reset stuck agency website jobs
    const stuckAgencyJobs = await AgencyWebsite.updateMany(
      {
        'details.thumbnails.desktop': null, // Still missing thumbnails
        thumbnail_build_in_progress: true, // Marked as in-progress
        thumbnail_process_started_at: { $lt: staleTimestamp }, // Started before the stale cutoff
      },
      {
        $unset: {
          thumbnail_process_started_at: '', // Remove timestamp
          thumbnail_build_in_progress: '', // Remove flag
        },
      },
    );

    // Reset stuck instasite jobs
    const stuckInstasiteJobs = await Instasite.updateMany(
      {
        status: 'PUBLISHED', // Only published sites
        'details.thumbnails.desktop': null, // Still missing thumbnails
        thumbnail_build_in_progress: true, // Marked as in-progress
        thumbnail_process_started_at: { $lt: staleTimestamp }, // Started before the stale cutoff
      },
      {
        $unset: {
          thumbnail_process_started_at: '', // Remove timestamp
          thumbnail_build_in_progress: '', // Remove flag
        },
      },
    );

    // Log only when at least one job was actually reset
    if (stuckAgencyJobs?.modifiedCount > 0 || stuckInstasiteJobs?.modifiedCount > 0) {
      logger.log({
        initiator: 'ThumbnailHealthCheck',
        message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
          stuckInstasiteJobs?.modifiedCount || 0
        } instasite jobs`,
      });
    }
  } catch (err) {
    logger.error({
      initiator: 'ThumbnailHealthCheck',
      message: 'Error in thumbnail health check',
      error: err,
    });
  }
};

๐Ÿ—„๏ธ Collections Usedโ€‹

agency_websitesโ€‹

  • Operations: Update (bulk)
  • Model: shared/models/agency-website.js
  • Usage Context: Reset stuck thumbnail generation flags

Query Criteria (Stuck Jobs):

{
  'details.thumbnails.desktop': null, // Missing desktop thumbnail
  thumbnail_build_in_progress: true, // Marked as in-progress
  thumbnail_process_started_at: {
    $lt: new Date(Date.now() - 2 * 60 * 60 * 1000) // Started over 2 hours ago
  }
}

Update Operation:

{
  $unset: {
    thumbnail_process_started_at: '', // Remove timestamp
    thumbnail_build_in_progress: '', // Remove in-progress flag
  }
}

Key Fields:

  • details.thumbnails.desktop: Desktop screenshot URL (null = missing)
  • thumbnail_build_in_progress: Boolean flag indicating generation in progress
  • thumbnail_process_started_at: Timestamp when generation started

instasites

  • Operations: Update (bulk)
  • Model: shared/models/instasite.js
  • Usage Context: Reset stuck thumbnail generation flags

Query Criteria (Stuck Jobs):

{
  status: 'PUBLISHED', // Only published sites
  'details.thumbnails.desktop': null, // Missing desktop thumbnail
  thumbnail_build_in_progress: true, // Marked as in-progress
  thumbnail_process_started_at: {
    $lt: new Date(Date.now() - 2 * 60 * 60 * 1000) // Started over 2 hours ago
  }
}

Update Operation: Same as agency_websites

Key Fields: Same structure as agency_websites

🔧 Job Configuration

Cron Schedule

'0 * * * *'; // Every hour at minute 0 (e.g., 1:00, 2:00, 3:00, etc.)

Frequency Rationale: Hourly cleanup is sufficient since stale threshold is 2 hours. More frequent checks would be unnecessary.

Stale Threshold

const STALE_HOURS = process.env.NODE_ENV === 'production' ? 2 : 0.25;

Thresholds:

  • Production: 2 hours (7,200,000 milliseconds)
  • Development: 15 minutes (900,000 milliseconds); this applies to any NODE_ENV value other than 'production'

Why Different Thresholds?

  • Production: Conservative threshold to avoid resetting legitimately slow jobs
  • Development: Faster testing and debugging of stale job logic
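
The threshold arithmetic can be isolated into a small pure function, which makes it easy to check without touching the database. The sketch below is illustrative only: staleCutoff and its parameters are hypothetical names that do not appear in the service, and the function simply mirrors the STALE_HOURS expression shown above.

```javascript
// Hypothetical helper (not in the source): computes the cutoff Date that
// thumbnail_process_started_at is compared against.
const HOUR_MS = 60 * 60 * 1000;

function staleCutoff(nodeEnv, nowMs = Date.now()) {
  // Mirrors the source expression: 2 hours in production,
  // 0.25 hours (15 minutes) for every other NODE_ENV value.
  const staleHours = nodeEnv === 'production' ? 2 : 0.25;
  return new Date(nowMs - staleHours * HOUR_MS);
}

const now = Date.UTC(2024, 0, 1, 12, 0, 0);
console.log(staleCutoff('production', now).toISOString());  // 2024-01-01T10:00:00.000Z
console.log(staleCutoff('development', now).toISOString()); // 2024-01-01T11:45:00.000Z
console.log(staleCutoff(undefined, now).toISOString());     // 2024-01-01T11:45:00.000Z
```

Note that an unset NODE_ENV falls into the 15-minute branch, so a production deployment that forgets to set it would reset jobs much more aggressively.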

📋 Processing Logic - Detailed Flow

Stale Job Detection Criteria

A thumbnail job is considered "stale" if ALL conditions are met:

For Agency Websites:

  1. Missing Thumbnail: 'details.thumbnails.desktop': null
    • Desktop thumbnail is still null (generation didn't complete)
  2. In-Progress Flag Set: thumbnail_build_in_progress: true
    • Job was marked as started
  3. Started Long Ago: thumbnail_process_started_at: { $lt: staleTimestamp }
    • Job started over 2 hours ago (production) or 15 minutes ago (development)

For InstaSites:

Same criteria as Agency Websites, plus:

  1. Published Status: status: 'PUBLISHED'
    • Only care about published sites (drafts can wait)
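
For unit tests it can help to express the same criteria as an in-memory predicate. The following is a minimal sketch, not code from the service: isStale and the requirePublished option are hypothetical names, and the predicate only approximates how MongoDB evaluates the filter.

```javascript
// Hypothetical in-memory mirror of the MongoDB filter, for unit tests only.
function isStale(doc, cutoff, { requirePublished = false } = {}) {
  // In MongoDB, { field: null } matches both null and missing fields,
  // so a "null or undefined" check is the closest in-memory equivalent.
  const desktop = doc.details?.thumbnails?.desktop;
  if (desktop != null) return false;
  if (doc.thumbnail_build_in_progress !== true) return false;
  // $lt is strict: a job started exactly at the cutoff is not stale yet.
  const startedAt = doc.thumbnail_process_started_at;
  if (!(startedAt instanceof Date) || !(startedAt < cutoff)) return false;
  // InstaSites add one extra condition.
  if (requirePublished && doc.status !== 'PUBLISHED') return false;
  return true;
}

const cutoff = new Date('2024-01-01T10:00:00Z');
const stuck = {
  status: 'DRAFT',
  details: { thumbnails: { desktop: null } },
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date('2024-01-01T09:00:00Z'),
};

console.log(isStale(stuck, cutoff));                             // true  (agency website rules)
console.log(isStale(stuck, cutoff, { requirePublished: true })); // false (not PUBLISHED)
```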

Why These Criteria?

Missing Thumbnail + In-Progress Flag:

  • Indicates job started but never completed
  • Could be due to:
    • Process crash during generation
    • Puppeteer timeout
    • Server restart mid-job
    • Network failure during upload

Time Threshold:

  • Normal thumbnail generation: 10-30 seconds
  • Maximum expected time (slow sites): 3 minutes
  • 2-hour threshold = 40x buffer for safety
  • Prevents false positives from slow but legitimate jobs

Reset Operation

Bulk Update with $unset:

await AgencyWebsite.updateMany(
  {
    /* stale job criteria */
  },
  {
    $unset: {
      thumbnail_process_started_at: '',
      thumbnail_build_in_progress: '',
    },
  },
);

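What $unset Does:

  • Removes the specified fields from matching documents
  • Equivalent to deleting the properties; the value given for each field ('' here) is ignored
  • Unlike $set: { field: null }, leaves no null-valued field behind, so the document looks as if generation had never started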

Result:

  • Sites become eligible for thumbnail generation again
  • Next run of build-thumbnails job will pick them up
  • No data loss (just retry metadata removed)
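
The effect of the reset on a single document can be modelled in plain JavaScript. This is only a sketch for intuition: applyUnset is a hypothetical helper, it handles top-level fields only (no dotted paths), and the real operation is performed by MongoDB.

```javascript
// Hypothetical in-memory model of $unset for top-level fields.
function applyUnset(doc, unsetSpec) {
  const copy = { ...doc };
  for (const field of Object.keys(unsetSpec)) {
    delete copy[field]; // the value in the spec ('' in the source) is ignored
  }
  return copy;
}

const stuckDoc = {
  _id: 'site-1',
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date('2024-01-01T09:00:00Z'),
};
const resetDoc = applyUnset(stuckDoc, {
  thumbnail_process_started_at: '',
  thumbnail_build_in_progress: '',
});

console.log(resetDoc);                                  // { _id: 'site-1' }
console.log('thumbnail_build_in_progress' in resetDoc); // false
```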

Logging Logic

if (stuckAgencyJobs?.modifiedCount > 0 || stuckInstasiteJobs?.modifiedCount > 0) {
  logger.log({
    initiator: 'ThumbnailHealthCheck',
    message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
      stuckInstasiteJobs?.modifiedCount || 0
    } instasite jobs`,
  });
}

Why Conditional Logging?

  • Only logs when jobs are actually reset
  • Prevents log spam on normal (no stuck jobs) runs
  • Provides clear count of reset jobs for monitoring
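
The conditional can also be read as "build a message, or nothing". A small sketch under that reading follows; buildResetMessage is a hypothetical name, and the result objects only imitate the modifiedCount property used in the source.

```javascript
// Hypothetical helper: returns the log message, or null when nothing was reset.
function buildResetMessage(agencyResult, instasiteResult) {
  const agencyCount = agencyResult?.modifiedCount || 0;
  const instasiteCount = instasiteResult?.modifiedCount || 0;
  if (agencyCount === 0 && instasiteCount === 0) return null;
  return `Reset ${agencyCount} stuck agency website jobs and ${instasiteCount} instasite jobs`;
}

console.log(buildResetMessage({ modifiedCount: 5 }, { modifiedCount: 3 }));
// Reset 5 stuck agency website jobs and 3 instasite jobs
console.log(buildResetMessage({ modifiedCount: 0 }, undefined)); // null
```

Returning null for the quiet case keeps the "no log spam" rule in one place and lets a test assert on the exact message text.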

🚨 Error Handling

Common Error Scenarios

Database Connection Error

catch (err) {
  logger.error({
    initiator: 'ThumbnailHealthCheck',
    message: 'Error in thumbnail health check',
    error: err,
  });
}

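Result: The error is logged and the next hourly run retries. Both updates share one try block, so if the agency website update throws, the InstaSite update is skipped; if only the InstaSite update throws, agency website flags that were already reset stay reset, but the success message is not logged.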

Query Timeout

Scenario: Large number of stuck jobs causes slow query

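Result: The timeout surfaces as an error from Mongoose, which the catch block logs. Because updateMany is not atomic across documents, some documents may already have been updated when the error occurs.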

Recovery: Next hourly run will catch remaining stuck jobs.

No Retry Mechanism

This job has no retry logic because:

  • Runs hourly automatically
  • Failures self-correct on next run
  • Not time-critical (stuck jobs can wait another hour)

📊 Monitoring & Logging

Success Logging

logger.log({
  initiator: 'ThumbnailHealthCheck',
  message: `Reset ${stuckAgencyJobs?.modifiedCount || 0} stuck agency website jobs and ${
    stuckInstasiteJobs?.modifiedCount || 0
  } instasite jobs`,
});

Example Output:

Reset 5 stuck agency website jobs and 3 instasite jobs

Error Logging

logger.error({
  initiator: 'ThumbnailHealthCheck',
  message: 'Error in thumbnail health check',
  error: err,
});

Performance Metrics

  • Average Processing Time: < 1 second
  • Query Complexity: Simple indexed queries
  • Typical Volume: 0-10 stuck jobs per hour (normal operation)
  • High Volume: 50-100 stuck jobs after system restart

🔗 Integration Points

Triggers This Job

  • Cron Schedule: Every hour automatically
  • Manual Trigger: Via API endpoint (if QM_HOOKS=true)

Data Dependencies

  • Agency Websites: Requires thumbnail_build_in_progress and thumbnail_process_started_at fields
  • InstaSites: Same field requirements

Jobs That Depend On This

  • Build Thumbnails Job: Picks up sites reset by this job
  • Monitoring/Alerting: Tracks stuck job counts for system health

โš ๏ธ Important Notesโ€‹

Side Effectsโ€‹

  • โš ๏ธ Flag Reset: Clears thumbnail_build_in_progress and thumbnail_process_started_at
  • โš ๏ธ Retry Eligibility: Sites become eligible for thumbnail generation retry
  • โš ๏ธ No Data Loss: Only metadata removed, no site content affected

Performance Considerations

  • Hourly Schedule: Low overhead, minimal database load
  • Indexed Queries: Ensure indexes on thumbnail_build_in_progress, thumbnail_process_started_at
  • Bulk Updates: Uses updateMany for efficiency
  • No Queue: Direct database operations (no Bull overhead)

Maintenance Notes

  • Stale Threshold: 2 hours hardcoded (requires code change to modify)
  • Environment-Specific: Development uses 15-minute threshold for faster testing
  • Log Review: Monitor reset counts for patterns indicating systemic issues
  • Complementary Job: Works in tandem with build-thumbnails job

Business Logic

Why Hourly Instead of More Frequent?

  • Stale threshold is 2 hours (hourly checks catch jobs within 3 hours max)
  • More frequent checks unnecessary and wasteful
  • Stuck jobs are rare exceptions, not normal flow
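
The "within 3 hours max" figure follows from adding the 2-hour threshold to the wait for the next hourly tick. A worked sketch of that arithmetic is shown below; firstResetTick is a hypothetical helper, and it assumes ticks land exactly on UTC hour boundaries and ignores the few milliseconds the job takes to start.

```javascript
// Hypothetical helper: first hourly tick at which a job becomes resettable.
const ONE_HOUR_MS = 60 * 60 * 1000;

function firstResetTick(startedAtMs, thresholdMs = 2 * ONE_HOUR_MS) {
  // The filter is { $lt: now - threshold }, so a tick at time t resets the
  // job only when t is strictly later than startedAt + threshold.
  const staleAtMs = startedAtMs + thresholdMs;
  return (Math.floor(staleAtMs / ONE_HOUR_MS) + 1) * ONE_HOUR_MS;
}

const startedAt = Date.UTC(2024, 0, 1, 10, 0, 1);               // 10:00:01
const resetAt = firstResetTick(startedAt);
console.log(new Date(resetAt).toISOString());                   // 2024-01-01T13:00:00.000Z
console.log((resetAt - startedAt) / ONE_HOUR_MS < 3);           // true (just under 3 hours)

const lateStart = Date.UTC(2024, 0, 1, 10, 59, 59);             // 10:59:59
console.log(new Date(firstResetTick(lateStart)).toISOString()); // 2024-01-01T13:00:00.000Z
```

A job that starts just after the hour waits almost 3 hours for its reset, while one that starts just before the hour waits only slightly more than 2.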

Why 2-Hour Threshold?

  • Normal generation: 10-30 seconds
  • Worst case (slow site): 3 minutes
  • 2 hours = 40x safety margin
  • Prevents resetting legitimately slow jobs

Why Not Retry Failed Jobs Immediately?

  • Gives system time to recover from transient issues
  • Prevents retry storms during incidents
  • Allows manual investigation of repeated failures

Code Quality Note

Bug in Cron File:

try {
  cron.schedule('0 * * * *', async () => {
    if (!inProgress) {
      inProgress = true;
      await fixStaleThumbnails();
    }
  });
} catch (e) {
  throw e;
} finally {
  inProgress = false; // BUG: Resets immediately after scheduling, not after job completion
}

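Impact: The finally block resets inProgress once, right after the cron is scheduled, not after each job run. Because the callback sets inProgress = true on its first run and nothing clears it afterwards, every subsequent tick is skipped: as written, the job effectively runs once per process start. Overlap protection is the smaller concern here (the job completes in < 1 second and runs hourly); the flag that never resets is the real defect.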

Correct Pattern (for reference):

cron.schedule('0 * * * *', async () => {
  if (!inProgress) {
    try {
      inProgress = true;
      await fixStaleThumbnails();
    } finally {
      inProgress = false;
    }
  }
});
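
The difference between the two placements can be seen without node-cron by simulating a few ticks. The sketch below is a simplified, synchronous model: all names in it are hypothetical, the real handlers are async, and ticks are invoked by hand rather than by a scheduler.

```javascript
// Guard with the release in the wrong place: the finally block runs once,
// when the handler is created, and never again.
function makeMisplacedGuard(job) {
  let inProgress = false;
  let onTick;
  try {
    onTick = () => {
      if (!inProgress) {
        inProgress = true;
        job();
      }
    };
  } finally {
    inProgress = false; // runs at registration time only
  }
  return onTick;
}

// Guard with the release inside the handler: runs after every execution.
function makePerRunGuard(job) {
  let inProgress = false;
  return () => {
    if (inProgress) return;
    inProgress = true;
    try {
      job();
    } finally {
      inProgress = false; // released even if job() throws
    }
  };
}

// Fires the handler `ticks` times and counts how often the job really ran.
function countRuns(makeGuard, ticks) {
  let runs = 0;
  const onTick = makeGuard(() => { runs += 1; });
  for (let i = 0; i < ticks; i += 1) onTick();
  return runs;
}

console.log(countRuns(makeMisplacedGuard, 3)); // 1
console.log(countRuns(makePerRunGuard, 3));    // 3
```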

🧪 Testing

Manual Trigger

# Via API (if QM_HOOKS=true)
POST http://localhost:6002/api/trigger/sites/fixStaleThumbnails

Create Stuck Jobs for Testing

// Create stuck agency website job
await AgencyWebsite.create({
  status: 'PUBLISHED',
  details: {
    previews: { all: 'https://preview.example.com/site/123' },
    thumbnails: {
      desktop: null,
      tablet: null,
      mobile: null,
    },
  },
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date(Date.now() - 3 * 60 * 60 * 1000), // 3 hours ago
});

// Create stuck instasite job
await Instasite.create({
  status: 'PUBLISHED',
  details: {
    previews: { all: 'https://preview.example.com/site/456' },
    thumbnails: {
      desktop: null,
      tablet: null,
      mobile: null,
    },
  },
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date(Date.now() - 3 * 60 * 60 * 1000), // 3 hours ago
});

// Wait for next hourly run or trigger manually
setTimeout(async () => {
  const agencySite = await AgencyWebsite.findOne({
    /* query */
  });
  console.log('Flag cleared:', !agencySite.thumbnail_build_in_progress); // true
}, 3600000); // 1 hour

Monitor Stuck Jobs

// Count currently stuck agency jobs
const stuckAgency = await AgencyWebsite.countDocuments({
  'details.thumbnails.desktop': null,
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: {
    $lt: new Date(Date.now() - 2 * 60 * 60 * 1000),
  },
});
console.log('Stuck agency website jobs:', stuckAgency);

// Count currently stuck instasite jobs
const stuckInsta = await Instasite.countDocuments({
  status: 'PUBLISHED',
  'details.thumbnails.desktop': null,
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: {
    $lt: new Date(Date.now() - 2 * 60 * 60 * 1000),
  },
});
console.log('Stuck instasite jobs:', stuckInsta);

// Find oldest stuck jobs
const oldestStuck = await AgencyWebsite.find({
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: { $exists: true },
})
  .sort({ thumbnail_process_started_at: 1 })
  .limit(10);

console.log('Oldest stuck jobs:');
oldestStuck.forEach(site => {
  const hoursStuck = (Date.now() - site.thumbnail_process_started_at) / (60 * 60 * 1000);
  console.log(`Site ${site._id}: ${hoursStuck.toFixed(1)} hours stuck`);
});

Test Environment-Specific Threshold

// Development environment (15-minute threshold)
process.env.NODE_ENV = 'development';

await AgencyWebsite.create({
  status: 'PUBLISHED',
  details: {
    thumbnails: { desktop: null },
  },
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: new Date(Date.now() - 20 * 60 * 1000), // 20 minutes ago
});

// Run fix job
await fixStaleThumbnails();

// Verify reset
const site = await AgencyWebsite.findOne({
  /* query */
});
console.log('Flag cleared in dev:', !site.thumbnail_build_in_progress); // true (20 min > 15 min threshold)

Verify Reset Operation

// Before reset
const beforeReset = await AgencyWebsite.findOne({
  thumbnail_build_in_progress: true,
  thumbnail_process_started_at: { $exists: true },
});

console.log('Before reset:', {
  inProgress: beforeReset.thumbnail_build_in_progress, // true
  startedAt: beforeReset.thumbnail_process_started_at, // Date object
});

// Run fix job
await fixStaleThumbnails();

// After reset
const afterReset = await AgencyWebsite.findById(beforeReset._id);

console.log('After reset:', {
  inProgress: afterReset.thumbnail_build_in_progress, // undefined (field removed)
  startedAt: afterReset.thumbnail_process_started_at, // undefined (field removed)
});

Job Type: Scheduled
Execution Frequency: Every hour
Average Duration: < 1 second
Status: Active
