InstReports - Report Building

Overview

The InstReports Report Building module automates the generation of business intelligence reports by aggregating data from 7+ third-party sources: Yext listings, Google Maps, Yelp, Facebook, Semrush, PageSpeed Insights, and Facebook Ads Library. A cron job runs every 30 seconds to pick up queued reports. Each build verifies OneBalance credits, starts a Yext listings scan authenticated with a short-lived JWT, and then fans out to the dependent scrapers, passing an empty URL as a fallback when no listing is matched with sufficient confidence.

Key Features:

  • Multi-Source Aggregation: Combines data from 7+ external APIs
  • 30-Second Schedule: High-frequency processing for rapid report generation
  • OneBalance Integration: Verifies credits before building reports
  • Smart Matching: Uses confidence scores (>90%) to match listings across platforms
  • Cascading Scrapers: Yext scan triggers dependent scrapers (Map, Yelp, Facebook, etc.)
  • Configurable Scrapers: Per-report configuration controls which scrapers run
  • Retry Logic: Multiple attempts per queue, with fixed backoff for the build queue and exponential backoff for the Yext and scraper queues
  • Status Tracking: Real-time status updates for each scraper component

Critical Business Impact:

  • Competitive Intelligence: Provides 360° view of business online presence
  • Lead Generation: Powers demo reports for prospects
  • Revenue Generation: Consumes OneBalance credits (monetization)
  • Client Insights: Helps clients understand their digital footprint
  • SEO Analysis: Combines multiple data sources for comprehensive SEO insights

Architecture

Execution Flow

sequenceDiagram
participant Cron as Cron Scheduler
participant Service as Schedule Service
participant DB as MongoDB
participant BuildQ as Build Queue
participant YextQ as Yext Queue
participant Yext as Yext API
participant Map as Google Maps Queue
participant Yelp as Yelp Queue
participant FB as Facebook Queue
participant Semrush as Semrush Queue
participant PageSpeed as PageSpeed Queue
participant FBAds as FB Ads Library Queue
participant OneBalance as OneBalance Service

Note over Cron,OneBalance: Every 30 Seconds

Cron->>Service: Trigger schedule check
Service->>DB: Find reports ready to build
Note over DB: type: 'instareport'<br/>build_at <= now<br/>in_progress: false<br/>OR stale (30min+)
DB-->>Service: Queue items grouped by account

Service->>Service: Take first item per account
Note over Service: One report per account

Service->>BuildQ: Add to build queue
Note over BuildQ: 15 attempts, 4s backoff

BuildQ->>BuildQ: Verify OneBalance
BuildQ->>OneBalance: Check credits (event: 'instareport')

alt Sufficient Credits
OneBalance-->>BuildQ: Credits verified
BuildQ->>DB: Fetch InstaReport
BuildQ->>BuildQ: Extract business info
BuildQ->>YextQ: Add to Yext queue
Note over YextQ: Start scraping process

YextQ->>YextQ: Generate JWT token (10min)
YextQ->>Yext: POST /v1/e/yext/scan
Note over Yext: Start listings scan
Yext-->>YextQ: Job ID + Sites list

loop Poll every 10 seconds
YextQ->>Yext: GET /v1/e/yext/scan/{jobID}/{siteIDs}
Yext-->>YextQ: Scan status
Note over YextQ: Wait until all scans complete
end

YextQ->>DB: Update scrapers.yext data

YextQ->>YextQ: Match listings with confidence scores
Note over YextQ: Address, Name, Phone > 90%

alt Google Maps Found (>90% match)
YextQ->>Map: Add to Map queue with URL
else No Match
YextQ->>Map: Add to Map queue (empty URL)
end

alt Yelp Found (>90% match)
YextQ->>Yelp: Add to Yelp queue with URL
else No Match
YextQ->>Yelp: Add to Yelp queue (empty URL)
end

alt Facebook Found (>90% match)
YextQ->>FB: Add to FB queue with URL
YextQ->>FBAds: Add to FB Ads queue with URL
else Facebook URL Provided
YextQ->>FB: Use provided URL
YextQ->>FBAds: Use provided URL
else No Match
YextQ->>FB: Add to FB queue (empty URL)
YextQ->>FBAds: Add to FB Ads queue (empty URL)
end

alt Website Provided
YextQ->>Semrush: Add to Semrush queue with domain
YextQ->>PageSpeed: Add to PageSpeed queue with URL
else No Website
YextQ->>Semrush: Add to Semrush queue (empty domain)
YextQ->>PageSpeed: Add to PageSpeed queue (empty URL)
end

Note over Map,PageSpeed: Each scraper runs independently
Note over Map,PageSpeed: Updates report.scrapers.{name}

else Insufficient Credits
OneBalance-->>BuildQ: Error: Insufficient balance
BuildQ->>DB: Delete queue item
BuildQ->>DB: Set InstaReport status: FAILED
end

Component Structure

queue-manager/
├── crons/
│   └── instareports/
│       └── build.js            # Cron scheduler (30 seconds)
├── services/
│   └── instareports/
│       ├── index.js            # Service exports
│       └── build/
│           └── index.js        # Schedule service + Yext dispatcher
├── queues/
│   └── instareports/
│       ├── build.js            # Main build queue
│       ├── yext/
│       │   └── index.js        # Yext orchestrator
│       ├── map/                # Google Maps scraper
│       ├── yelp/               # Yelp scraper
│       ├── facebook/           # Facebook scraper
│       ├── semrush/            # Semrush SEO scraper
│       ├── pageSpeed/          # PageSpeed Insights scraper
│       └── adLibrary/          # Facebook Ads Library scraper
└── utilities/
    └── onebalance.js           # Credit verification

Cron Schedule

File: queue-manager/crons/instareports/build.js

'*/30 * * * * *'; // Every 30 seconds

Pattern: High-frequency scheduler for rapid report generation

  • In-Progress Locking: Prevents concurrent executions
  • Purpose: Quick turnaround for queued reports
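
The in-progress lock can be pictured as a flag that a tick sets on entry and clears on exit. The following is a minimal sketch, not the module's actual code: `createGuardedTick` and the returned status strings are hypothetical names, and the real scheduler may implement its lock differently.

```javascript
// Hypothetical helper: wraps an async task so that a tick arriving while the
// previous one is still running is skipped instead of running concurrently.
function createGuardedTick(task) {
  let inProgress = false;
  return async function tick() {
    if (inProgress) return 'skipped';
    inProgress = true;
    try {
      await task();
      return 'ran';
    } finally {
      inProgress = false; // Release the lock even if the task throws
    }
  };
}
```

With a 30-second schedule, a guard of this kind matters whenever a single pass over the queue can take longer than the interval.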

Configuration

Environment Variables

Variable       Type     Required   Description
API_BASE_URL   String   Yes        Internal API base URL for Yext/other integrations
APP_SECRET     String   Yes        JWT secret for token signing

Queue Retry Configuration

Build Queue: queue-manager/queues/instareports/build.js

{
  attempts: 15,
  backoff: 4000 // 4 seconds fixed delay
}

Yext Queue: queue-manager/queues/instareports/yext/index.js

{
  attempts: 10,
  backoff: {
    delay: 4000,
    type: 'exponential'
  }
}

Scraper Queues: (Map, Yelp, Facebook, Semrush, PageSpeed, FBAds)

{
  attempts: 6,
  backoff: {
    delay: 4000,
    type: 'exponential'
  },
  removeOnComplete: true
}
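
To get a feel for how long a job can stay in retry, the delays can be tabulated. The sketch below assumes Bull's built-in strategies behave as follows: `fixed` waits `delay` every time, and `exponential` waits `(2^attemptsMade - 1) * delay`. That formula is an assumption about the installed Bull version and should be checked against it; `retryDelays` is a hypothetical helper.

```javascript
// Assumed formulas (verify against the installed Bull version):
//   fixed:       delay
//   exponential: (2^attemptsMade - 1) * delay
function retryDelays({ attempts, delay, type = 'fixed' }) {
  const delays = [];
  // A job configured with N attempts can be retried at most N - 1 times.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(
      type === 'exponential' ? Math.round((2 ** attemptsMade - 1) * delay) : delay,
    );
  }
  return delays;
}

// Scraper queues: 6 attempts, exponential backoff starting at 4 seconds
const scraperDelays = retryDelays({ attempts: 6, delay: 4000, type: 'exponential' });
// Build queue: 15 attempts, fixed 4-second backoff
const buildDelays = retryDelays({ attempts: 15, delay: 4000 });
```

Under these assumptions a scraper job waits 4 s, 12 s, 28 s, 60 s, and 124 s between its six attempts, while a build job waits a flat 4 s before each of its fourteen retries.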

Service Implementation

Report Scheduling Logic

File: queue-manager/services/instareports/build/index.js

Query with Stale Lock Recovery

let filter = {
  type: 'instareport',
  build_at: { $lte: new Date() },
  $or: [
    { in_progress: { $ne: true } },
    { updated_at: { $lt: new Date(new Date().getTime() - 30 * 60 * 1000) } },
  ],
};

Conditions:

  1. type: 'instareport' - Report build type
  2. build_at <= now - Scheduled time has passed
  3. in_progress !== true OR updated_at < 30 minutes ago - Stale lock recovery

Purpose: Prevents permanently stuck reports with 30-minute timeout
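
The same conditions can be expressed as a plain predicate, which is convenient for unit-testing the selection rule without a database. This is a sketch only: `isReadyToBuild` and `STALE_AFTER_MS` are hypothetical names, and the module itself relies on the MongoDB filter shown above.

```javascript
const STALE_AFTER_MS = 30 * 60 * 1000; // 30 minutes, as in the filter

// Mirrors the MongoDB filter for a single plain object.
function isReadyToBuild(item, now = new Date()) {
  const due = item.type === 'instareport' && item.build_at <= now;
  // Like { $ne: true }, this is also satisfied when the field is missing.
  const unlocked = item.in_progress !== true;
  const stale = item.updated_at < new Date(now.getTime() - STALE_AFTER_MS);
  return due && (unlocked || stale);
}
```

An item that is still marked in progress is picked up again only once its `updated_at` is more than 30 minutes old.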

Account Grouping Aggregation

let query = [
  { $match: filter },
  {
    $facet: {
      count_instareports: [{ $count: 'count_instareports' }],
      items: [
        {
          $group: {
            _id: '$account_id',
            queue_items: {
              $push: '$_id',
            },
          },
        },
        { $sort: { _id: 1 } },
      ],
    },
  },
  {
    $project: {
      total: {
        $arrayElemAt: ['$count_instareports.count_instareports', 0],
      },
      users: '$items',
    },
  },
];

Purpose: Groups queue items by account, ensures one report per account

  • $facet: Parallel aggregation for count + grouping
  • $group: Groups by account_id, collects all queue item IDs
  • First Item Selection: Only processes queue_items[0] per account

Output Structure:

{
  total: 45, // Total reports ready
  users: [
    { _id: ObjectId("account1"), queue_items: [ObjectId("item1"), ObjectId("item2")] },
    { _id: ObjectId("account2"), queue_items: [ObjectId("item3")] }
  ]
}
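
For readers less familiar with `$facet` and `$group`, the grouping step can be reproduced in memory. The sketch below is illustrative only: `firstItemPerAccount` is a hypothetical helper, and plain strings stand in for ObjectId values.

```javascript
// Groups ready queue items by account, mirroring the $group stage, and
// returns the same { total, users } shape as the pipeline output.
function firstItemPerAccount(items) {
  const byAccount = new Map();
  for (const { _id, account_id } of items) {
    if (!byAccount.has(account_id)) byAccount.set(account_id, []);
    byAccount.get(account_id).push(_id);
  }
  return {
    total: items.length,
    users: [...byAccount.entries()]
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // { $sort: { _id: 1 } }
      .map(([accountId, queue_items]) => ({ _id: accountId, queue_items })),
  };
}

const grouped = firstItemPerAccount([
  { _id: 'item1', account_id: 'account1' },
  { _id: 'item2', account_id: 'account1' },
  { _id: 'item3', account_id: 'account2' },
]);
// Only the first queue item of each account is enqueued in this cycle.
const toEnqueue = grouped.users.map(user => user.queue_items[0]);
```

One detail worth keeping in mind: the pipeline has no `$sort` stage before `$group`, so which item ends up at `queue_items[0]` depends on the order in which documents reach the stage.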

Queue Addition

const queue = await build_queue.start();
await Promise.allSettled(
  users.map(async user => {
    // Declared outside the try block so the catch block can reference it
    const item = user.queue_items[0]; // First item only
    try {
      await queue.add(
        { id: item },
        {
          attempts: 15,
          backoff: 4000,
          jobId: item.toString(),
        },
      );
      await Queue.findByIdAndUpdate(item, { in_progress: true, failure_reason: null });
    } catch (err) {
      await Queue.updateOne(
        { _id: item },
        { in_progress: false, failure_reason: `${err.message}, ${err.stack}` },
      );
    }
  }),
);

Pattern: One report per account processed concurrently

  • Promise.allSettled: All queue additions attempted; a failure is recorded in the item's failure_reason and does not block other accounts
  • jobId: Prevents duplicate jobs in Bull queue

Build Queue Processor

OneBalance Verification

File: queue-manager/queues/instareports/build.js

const { verifyBalance } = require('../../utilities/onebalance');

try {
  const account = await Account.findById(report.account_id);
  await verifyBalance({
    event: 'instareport',
    account: account.toJSON(),
    user_id: report.created_by,
    quantity: 1,
  });
} catch (err) {
  await InstareportQueue.deleteOne({ _id: id });
  await InstaReport.findByIdAndUpdate(queue._doc.reference_id, {
    status: 'FAILED',
  });
}

Purpose: Ensures account has credits before building report

  • Event: instareport - Credit type
  • Quantity: 1 credit per report
  • Failure: Deletes queue item, marks report FAILED

Business Info Extraction

const address =
  (report.details.business_info.address.street
    ? report.details.business_info.address.street + ' '
    : '') +
  (report.details.business_info.address.unit
    ? report.details.business_info.address.unit + ' '
    : '') +
  (report.details.business_info.address.suite
    ? report.details.business_info.address.suite + ' '
    : '') +
  (report.details.business_info.address.city
    ? report.details.business_info.address.city + ' '
    : '') +
  (report.details.business_info.address.state_province
    ? report.details.business_info.address.state_province + ' '
    : '') +
  (report.details.business_info.address.postal_code
    ? report.details.business_info.address.postal_code + ' '
    : '') +
  (report.details.business_info.address.country
    ? report.details.business_info.address.country
    : '');

Purpose: Concatenates full address for Yext scan

  • Handles Missing Fields: Conditional concatenation
  • Space Delimited: Builds searchable address string
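
A more compact way to build the same kind of string is to filter out empty parts and join the rest. This is a sketch of an alternative, not the module's code: `formatAddress` and `ADDRESS_FIELDS` are hypothetical names. Unlike the concatenation above, it never leaves a trailing space when `country` is missing.

```javascript
// Field order matches the concatenation in the build queue.
const ADDRESS_FIELDS = [
  'street',
  'unit',
  'suite',
  'city',
  'state_province',
  'postal_code',
  'country',
];

// Hypothetical helper: joins the non-empty address parts with single spaces.
function formatAddress(address = {}) {
  return ADDRESS_FIELDS.map(field => address[field])
    .filter(Boolean)
    .join(' ');
}
```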

Yext Queue Addition

const data = {
  auth: {
    uid: report.created_by,
    account_id: report.account_id,
    parent_account: report.parent_account || report.account_id,
  },
  reportID: report._id,
  name: report.details.business_info.name,
  address: address,
  phone: report.details.business_info.phone,
  website: report.details.business_info.website,
  facebookURL: report.details.business_info.facebookURL,
};

const YextQueue = await yext_queue.start();
await YextQueue.add(data, {
  attempts: 10,
  backoff: {
    delay: 4000,
    type: 'exponential',
  },
  removeOnComplete: true,
});

Payload Structure: Contains auth context + business details

  • Auth: User/account for JWT generation
  • Business Info: Name, address, phone, URLs for matching

Yext Orchestrator

JWT Token Generation

File: queue-manager/queues/instareports/yext/index.js

importParams.integrationToken = jwt.sign(
  {
    type: 'access_token',
    uid: importParams.auth.uid.toString(),
    account_id: importParams.auth.account_id.toString(),
    parent_account:
      importParams.auth?.parent_account?.toString() || importParams.auth.account_id.toString(),
    scope: 'analytics',
  },
  process.env.APP_SECRET,
  { expiresIn: '10m' },
);

Token Claims:

  • type: access_token
  • scope: analytics - Permission for analytics APIs
  • expiresIn: 10m - Short-lived for security
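
To make the token structure concrete, the sketch below builds an equivalent HS256 token by hand with Node's built-in `crypto` module. It is for illustration only and assumes the default HS256 algorithm that `jwt.sign` uses with a string secret; `signAccessToken` and `decodePayload` are hypothetical helpers, and the module itself should keep using `jwt.sign`.

```javascript
const crypto = require('crypto');

const base64url = value => Buffer.from(JSON.stringify(value)).toString('base64url');

// Builds header.payload.signature with the same claims as the jwt.sign call.
function signAccessToken(auth, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const header = { alg: 'HS256', typ: 'JWT' };
  const payload = {
    type: 'access_token',
    uid: String(auth.uid),
    account_id: String(auth.account_id),
    parent_account: String(auth.parent_account || auth.account_id),
    scope: 'analytics',
    iat: nowSeconds,
    exp: nowSeconds + 10 * 60, // Equivalent of expiresIn: '10m'
  };
  const signingInput = `${base64url(header)}.${base64url(payload)}`;
  const signature = crypto
    .createHmac('sha256', secret)
    .update(signingInput)
    .digest()
    .toString('base64url');
  return `${signingInput}.${signature}`;
}

// Decodes the payload without verifying the signature (inspection only).
function decodePayload(token) {
  return JSON.parse(Buffer.from(token.split('.')[1], 'base64url').toString('utf8'));
}
```

Because the token expires ten minutes after it is issued, a scan that is still being polled after that window would need a freshly signed token.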

Yext Listings Scan

Initiate Scan

const res = await newScan(importParams.integrationToken, {
  name: importParams.name,
  address: importParams.address,
  phone: importParams.phone,
});

// API: POST /v1/e/yext/scan

Response:

{
  success: true,
  data: {
    response: {
      jobId: "scan_12345",
      sites: [
        { siteId: "GOOGLEPLACES", name: "Google My Business", homepage: "...", logo: "..." },
        { siteId: "YELP", name: "Yelp", homepage: "...", logo: "..." },
        { siteId: "FACEBOOK", name: "Facebook", homepage: "...", logo: "..." },
        // ... more sites
      ]
    }
  }
}

Purpose: Starts Yext scan job for business listings across platforms

Poll for Completion

do {
  await timer(10000); // Wait 10 seconds
  const scanDatares = await getScanData(importParams.integrationToken, jobID, sitesIDs);
  if (scanDatares.success === true) {
    scanData = scanDatares.data.response;
    // Stop once no site is still scanning
    let index = scanData.findIndex(x => x.status === 'SCAN_IN_PROGRESS');
    if (index === -1) {
      allScanPending = false;
    }
  }
} while (allScanPending);

// API: GET /v1/e/yext/scan/{jobID}/{siteIDs}

Polling Logic:

  • Interval: 10 seconds
  • Termination: When no sites have SCAN_IN_PROGRESS status
  • Timeout: No explicit limit in the loop; an error thrown while polling fails the attempt, and the job is retried after backoff
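
The loop above has no iteration cap of its own. A bounded variant might look like the sketch below. Everything in it is hypothetical: `pollScan`, `isScanComplete`, `maxPolls`, and the injected `fetchScan` and `wait` functions are illustrative names, not part of the module.

```javascript
// True once no site is still scanning (same check as the findIndex above).
const isScanComplete = sites => !sites.some(site => site.status === 'SCAN_IN_PROGRESS');

// Hypothetical bounded poller. fetchScan and wait are injected so the sketch
// stays independent of the real getScanData and timer helpers.
async function pollScan(fetchScan, { intervalMs = 10000, maxPolls = 30, wait } = {}) {
  const sleep = wait || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  for (let poll = 1; poll <= maxPolls; poll++) {
    await sleep(intervalMs);
    const res = await fetchScan();
    if (res.success === true && isScanComplete(res.data.response)) {
      return res.data.response;
    }
  }
  // Throwing lets the queue's retry and backoff settings take over.
  throw new Error(`Yext scan still in progress after ${maxPolls} polls`);
}
```

With the defaults shown, the poller would give up after roughly five minutes, which stays inside the ten-minute lifetime of the integration token.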

Scan Result Structure:

{
  siteId: "GOOGLEPLACES",
  status: "LISTING_FOUND" | "LISTING_NOT_FOUND" | "SCAN_IN_PROGRESS",
  url: "https://maps.google.com/...",
  match_address_score: 0.95,
  match_name_score: 0.98,
  match_phone_score: 1.00
}

Confidence-Based Matching

Threshold: Address, name, and phone scores must each exceed 0.9 (90%)

let addressScore = Yextmerged[GoogleMapindex].match_address_score;
let nameScore = Yextmerged[GoogleMapindex].match_name_score;
let phoneScore = Yextmerged[GoogleMapindex].match_phone_score;

if (addressScore > 0.9 && nameScore > 0.9 && phoneScore > 0.9) {
  // High confidence - use listing URL
  googleTopicPayload.mapURL = Yextmerged[GoogleMapindex].url;
} else {
  // Low confidence - empty URL (scraper will handle)
  googleTopicPayload.mapURL = '';
}

Purpose: Only uses matched listings if confidence is high

  • Address: Location match
  • Name: Business name match
  • Phone: Phone number match

Scraper Configuration

Default Config:

{
  yext: true,
  semrush: true,
  facebook_ads: true,
  yelp: true,
  seo: true,
  google_ads: true,
  facebook: true,
  google_map: true,
  page_speed: true
}

Conditional Execution:

let scrapesToRun = {
  yext: (init || previousScrapes?.yext?.status == 'FAILED') && configs.yext,
  semrush: (init || previousScrapes?.semrush?.status == 'FAILED') && configs.semrush,
  // ... etc
};

Logic: A scraper runs only when both conditions hold:

  1. It is the initial run (init=true), or the scraper's previous run has status FAILED
  2. The report config enables the scraper
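
The per-scraper expressions all follow one pattern, so the whole plan can be derived in a loop. This is a sketch under that assumption; `selectScrapers` is a hypothetical helper, and the elided entries in the original object may differ from the two that are shown.

```javascript
// Builds the run plan: (initial run OR previous run failed) AND enabled.
function selectScrapers({ init, previousScrapes = {}, configs = {} }) {
  const plan = {};
  for (const [name, enabled] of Object.entries(configs)) {
    const failedBefore = previousScrapes?.[name]?.status === 'FAILED';
    plan[name] = Boolean((init || failedBefore) && enabled);
  }
  return plan;
}
```

On a rebuild (`init` false), only scrapers whose previous status is FAILED are selected, so completed scrapers are not run again.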

Data Models

InstareportsQueue Collection

{
  _id: ObjectId,
  account_id: ObjectId,
  reference_id: ObjectId, // InstaReport document ID
  type: 'instareport',
  build_at: Date, // Scheduled build time
  in_progress: Boolean,
  failure_reason: String,
  created_at: Date,
  updated_at: Date
}

InstaReport Collection

{
  _id: ObjectId,
  account_id: ObjectId,
  parent_account: ObjectId,
  created_by: ObjectId,
  status: String, // 'QUEUED' | 'RUNNING' | 'COMPLETED' | 'FAILED'
  details: {
    business_info: {
      name: String,
      phone: String,
      website: String,
      facebookURL: String,
      address: {
        street: String,
        unit: String,
        suite: String,
        city: String,
        state_province: String,
        postal_code: String,
        country: String
      }
    },
    configs: {
      yext: Boolean,
      semrush: Boolean,
      facebook_ads: Boolean,
      yelp: Boolean,
      google_map: Boolean,
      page_speed: Boolean,
      facebook: Boolean
    }
  },
  scrapers: {
    yext: {
      status: String, // 'QUEUED' | 'RUNNING' | 'COMPLETED' | 'FAILED'
      data: Array, // Scan results
      error: String
    },
    google_map: { status, data, error },
    yelp: { status, data, error },
    facebook: { status, data, error },
    semrush: { status, data, error },
    page_speed: { status, data, error },
    facebook_ads: { status, data, error }
  },
  created_at: Date,
  updated_at: Date
}

Scraper Coordination

Execution Order

  1. Yext (Primary) - Scans 100+ listing platforms
  2. Parallel Dependent Scrapers:
    • Google Maps - Reviews, ratings, photos
    • Yelp - Reviews, ratings, business info
    • Facebook - Page info, followers, engagement
    • Semrush - SEO metrics, keywords, backlinks
    • PageSpeed - Performance metrics
    • Facebook Ads Library - Active ads

Status Tracking

Per-Scraper Status:

  • QUEUED: Waiting to start
  • RUNNING: Currently scraping
  • COMPLETED: Successfully finished
  • FAILED: Error occurred

Report Status:

  • Aggregated from all scraper statuses
  • COMPLETED when all scrapers complete
  • FAILED if critical scrapers fail

Error Handling

OneBalance Failures

Scenario: Insufficient credits

try {
  await verifyBalance({...});
} catch (err) {
  await InstareportQueue.deleteOne({ _id: id });
  await InstaReport.findByIdAndUpdate(reference_id, {
    status: 'FAILED'
  });
}

Impact: Report marked FAILED, queue item deleted, no retry

Yext Scan Failures

Scenarios:

  • API timeout
  • Invalid business data
  • Rate limiting

Handling: Up to 10 attempts with exponential backoff. In the handler shown here, the Yext scraper is marked FAILED once job.attemptsMade reaches 6.

if (job.attemptsMade >= 6) {
  await Instareport.findByIdAndUpdate(reportID, {
    'scrapers.yext': {
      error: err.response?.data || err.message,
      status: 'FAILED',
    },
  });
}

Scraper Failures

Individual Scraper Failure: Other scrapers continue

  • Isolation: Each scraper runs independently
  • Partial Results: Report completes with available data

Performance Considerations

Account-Level Throttling

One Report Per Account: Prevents overwhelming account resources

  • Grouping: Aggregation groups by account_id
  • Selection: Only first queue item per account processed

Polling Overhead

Yext Scan Polling: 10-second intervals until complete

  • Duration: Typically 30-120 seconds
  • Network: Multiple API calls per scan

Concurrency

Scraper Parallelization: All scrapers run concurrently

  • 6 Scrapers: After Yext completes
  • Bull Queues: Separate queue per scraper type
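
The fan-out after the Yext scan can be pictured as building one payload per scraper queue. The sketch below is illustrative: `buildScraperJobs` and `hostnameOf` are hypothetical helpers, and apart from `mapURL`, `reportID`, and `facebookURL`, the payload field names are assumptions. It applies the fallback order from the execution flow: matched Facebook URL first, then the URL provided on the report, then an empty string.

```javascript
// Hypothetical payload builder: one entry per scraper queue.
// matchedUrls holds the URLs that passed the confidence check ('' if none).
function buildScraperJobs({ reportID, website, facebookURL }, matchedUrls) {
  const facebook = matchedUrls.facebook || facebookURL || '';
  return {
    map: { reportID, mapURL: matchedUrls.google_map || '' },
    yelp: { reportID, yelpURL: matchedUrls.yelp || '' },
    facebook: { reportID, facebookURL: facebook },
    adLibrary: { reportID, facebookURL: facebook },
    semrush: { reportID, domain: website ? hostnameOf(website) : '' },
    pageSpeed: { reportID, url: website || '' },
  };
}

// Extracts the hostname, tolerating values entered without a scheme.
function hostnameOf(website) {
  try {
    return new URL(website.includes('://') ? website : `https://${website}`).hostname;
  } catch {
    return '';
  }
}
```

Every queue receives a job even when its URL is empty, which matches the flow in which each scraper handles the missing-listing case itself.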

Monitoring & Logging

Key Metrics

  1. Queue Depth: Pending instareport builds
  2. Build Duration: Time from queue to completion
  3. Scraper Success Rates: Per-scraper completion percentage
  4. OneBalance Rejections: Insufficient credit failures
  5. Yext Scan Duration: Time to complete listings scan
  6. Match Confidence: Distribution of confidence scores

Alerting Scenarios

  • High Queue Depth: > 100 pending builds
  • Low Success Rate: < 80% completion
  • Frequent Credit Failures: Many OneBalance rejections
  • Long Build Times: > 10 minutes per report
  • Scraper Failures: Individual scraper failure rate > 20%
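
The numeric thresholds above translate directly into rules that a monitoring job could evaluate. This is a sketch only: `ALERT_RULES`, `firedAlerts`, and the metric field names are hypothetical, and the credit-failure alert is omitted because the list gives no threshold for it.

```javascript
// Thresholds taken from the list above; metric field names are hypothetical.
const ALERT_RULES = [
  { name: 'High Queue Depth', test: m => m.queueDepth > 100 },
  { name: 'Low Success Rate', test: m => m.successRate < 0.8 },
  { name: 'Long Build Times', test: m => m.buildMinutes > 10 },
  {
    name: 'Scraper Failures',
    test: m => Object.values(m.scraperFailureRates).some(rate => rate > 0.2),
  },
];

// Returns the names of all alerts whose condition is met.
function firedAlerts(metrics) {
  return ALERT_RULES.filter(rule => rule.test(metrics)).map(rule => rule.name);
}
```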

Related Documentation

  • InstReports Module Overview - Module overview
  • Common OneBalance - Credit verification utility
  • External Yext Integration - Yext API details

Summary

The InstReports Report Building module assembles business intelligence reports by collecting data from 7+ sources through a multi-queue architecture: a build queue, a Yext orchestrator queue, and one queue per dependent scraper. Confidence-based matching keeps low-confidence listings out of reports, and OneBalance verification blocks builds for accounts without sufficient credits. The 30-second schedule keeps turnaround short for queued reports, and stale lock recovery picks up items that have been stuck in progress for more than 30 minutes.

Key Strengths:

  • Multi-Source Intelligence: Aggregates 7+ data sources
  • Smart Matching: Listings are used only when address, name, and phone scores all exceed 90%
  • Account Throttling: One report per account prevents overload
  • Credit Verification: OneBalance integration ensures authorized usage
  • Configurable: Per-report scraper configuration
  • Resilient: Retry logic for each scraper independently

Critical for:

  • Lead generation and demos
  • Competitive intelligence
  • SEO analysis
  • Revenue generation (credit consumption)
  • Client value delivery