InstReports - Report Building
Overview
The InstReports Report Building module automates the generation of business intelligence reports by aggregating data from 7+ third-party sources: Yext listings, Google Maps, Yelp, Facebook, Semrush, PageSpeed Insights, and Facebook Ads Library. A scheduler runs every 30 seconds to pick up queued reports. Each build verifies OneBalance credits, signs a short-lived JWT for the internal API, and then fans out to the individual scrapers, which are queued with an empty URL when no confident listing match is found.
Key Features:
- Multi-Source Aggregation: Combines data from 7+ external APIs
- 30-Second Schedule: High-frequency processing for rapid report generation
- OneBalance Integration: Verifies credits before building reports
- Smart Matching: Uses confidence scores (>90%) to match listings across platforms
- Cascading Scrapers: Yext scan triggers dependent scrapers (Map, Yelp, Facebook, etc.)
- Configurable Scrapers: Per-report configuration controls which scrapers run
- Retry Logic: Multiple attempts with exponential backoff for each scraper
- Status Tracking: Real-time status updates for each scraper component
Critical Business Impact:
- Competitive Intelligence: Provides 360° view of business online presence
- Lead Generation: Powers demo reports for prospects
- Revenue Generation: Consumes OneBalance credits (monetization)
- Client Insights: Helps clients understand their digital footprint
- SEO Analysis: Combines multiple data sources for comprehensive SEO insights
Architecture
Execution Flow
sequenceDiagram
participant Cron as Cron Scheduler
participant Service as Schedule Service
participant DB as MongoDB
participant BuildQ as Build Queue
participant YextQ as Yext Queue
participant Yext as Yext API
participant Map as Google Maps Queue
participant Yelp as Yelp Queue
participant FB as Facebook Queue
participant Semrush as Semrush Queue
participant PageSpeed as PageSpeed Queue
participant FBAds as FB Ads Library Queue
participant OneBalance as OneBalance Service
Note over Cron,OneBalance: Every 30 Seconds
Cron->>Service: Trigger schedule check
Service->>DB: Find reports ready to build
Note over DB: type: 'instareport'<br/>build_at <= now<br/>in_progress: false<br/>OR stale (30min+)
DB-->>Service: Queue items grouped by account
Service->>Service: Take first item per account
Note over Service: One report per account
Service->>BuildQ: Add to build queue
Note over BuildQ: 15 attempts, 4s backoff
BuildQ->>BuildQ: Verify OneBalance
BuildQ->>OneBalance: Check credits (event: 'instareport')
alt Sufficient Credits
OneBalance-->>BuildQ: Credits verified
BuildQ->>DB: Fetch InstaReport
BuildQ->>BuildQ: Extract business info
BuildQ->>YextQ: Add to Yext queue
Note over YextQ: Start scraping process
YextQ->>YextQ: Generate JWT token (10min)
YextQ->>Yext: POST /v1/e/yext/scan
Note over Yext: Start listings scan
Yext-->>YextQ: Job ID + Sites list
loop Poll every 10 seconds
YextQ->>Yext: GET /v1/e/yext/scan/{jobID}
Yext-->>YextQ: Scan status
Note over YextQ: Wait until all scans complete
end
YextQ->>DB: Update scrapers.yext data
YextQ->>YextQ: Match listings with confidence scores
Note over YextQ: Address, Name, Phone > 90%
alt Google Maps Found (>90% match)
YextQ->>Map: Add to Map queue with URL
else No Match
YextQ->>Map: Add to Map queue (empty URL)
end
alt Yelp Found (>90% match)
YextQ->>Yelp: Add to Yelp queue with URL
else No Match
YextQ->>Yelp: Add to Yelp queue (empty URL)
end
alt Facebook Found (>90% match)
YextQ->>FB: Add to FB queue with URL
YextQ->>FBAds: Add to FB Ads queue with URL
else Facebook URL Provided
YextQ->>FB: Use provided URL
YextQ->>FBAds: Use provided URL
else No Match
YextQ->>FB: Add to FB queue (empty URL)
YextQ->>FBAds: Add to FB Ads queue (empty URL)
end
alt Website Provided
YextQ->>Semrush: Add to Semrush queue with domain
YextQ->>PageSpeed: Add to PageSpeed queue with URL
else No Website
YextQ->>Semrush: Add to Semrush queue (empty domain)
YextQ->>PageSpeed: Add to PageSpeed queue (empty domain)
end
Note over Map,PageSpeed: Each scraper runs independently
Note over Map,PageSpeed: Updates report.scrapers.{name}
else Insufficient Credits
OneBalance-->>BuildQ: Error: Insufficient balance
BuildQ->>DB: Delete queue item
BuildQ->>DB: Set InstaReport status: FAILED
end
Component Structure
queue-manager/
├── crons/
│   └── instareports/
│       └── build.js          # Cron scheduler (30 seconds)
├── services/
│   └── instareports/
│       ├── index.js          # Service exports
│       └── build/
│           └── index.js      # Schedule service + Yext dispatcher
├── queues/
│   └── instareports/
│       ├── build.js          # Main build queue
│       ├── yext/
│       │   └── index.js      # Yext orchestrator
│       ├── map/              # Google Maps scraper
│       ├── yelp/             # Yelp scraper
│       ├── facebook/         # Facebook scraper
│       ├── semrush/          # Semrush SEO scraper
│       ├── pageSpeed/        # PageSpeed Insights scraper
│       └── adLibrary/        # Facebook Ads Library scraper
└── utilities/
    └── onebalance.js         # Credit verification
Cron Schedule
File: queue-manager/crons/instareports/build.js
'*/30 * * * * *'; // Every 30 seconds
Pattern: High-frequency scheduler for rapid report generation
- In-Progress Locking: Prevents concurrent executions
- Purpose: Quick turnaround for queued reports
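The in-progress locking noted above can be pictured as a small guard around the scheduled callback. The following is a minimal sketch under assumptions, not the module's actual code: `createGuardedTick` and `task` are hypothetical names, and the real cron file may implement the lock differently.

```javascript
// Hypothetical helper: wraps an async task so that overlapping ticks are skipped.
function createGuardedTick(task) {
  let inProgress = false;

  return function tick() {
    if (inProgress) {
      // The previous run has not finished yet, so this tick does nothing.
      return { started: false, done: Promise.resolve() };
    }
    inProgress = true;
    const done = Promise.resolve()
      .then(task)
      .catch(err => console.error('Scheduled run failed:', err.message))
      .finally(() => {
        inProgress = false; // Release the lock even when the task throws.
      });
    return { started: true, done };
  };
}
```

Whatever scheduler fires the `*/30 * * * * *` expression would call `tick`; a run that takes longer than 30 seconds simply causes the next tick to be skipped rather than stacked.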
Configuration
Environment Variables
| Variable | Type | Required | Description |
|---|---|---|---|
| API_BASE_URL | String | Yes | Internal API base URL for Yext/other integrations |
| APP_SECRET | String | Yes | JWT secret for token signing |
Queue Retry Configuration
Build Queue: queue-manager/queues/instareports/build.js
{
attempts: 15,
backoff: 4000 // 4 seconds fixed delay
}
Yext Queue: queue-manager/queues/instareports/yext/index.js
{
attempts: 10,
backoff: {
delay: 4000,
type: 'exponential'
}
}
Scraper Queues: (Map, Yelp, Facebook, Semrush, PageSpeed, FBAds)
{
attempts: 6,
backoff: {
delay: 4000,
type: 'exponential'
},
removeOnComplete: true
}
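To get a feel for what these retry settings mean in wall-clock time, the sketch below lists the wait before each retry. It assumes Bull's built-in strategies, where `fixed` waits `delay` ms and `exponential` waits `(2^attemptsMade - 1) * delay` ms; if the queues register a custom backoff strategy, the numbers will differ. `retryDelays` is a hypothetical helper.

```javascript
// Assumption: Bull's built-in backoff strategies.
//   fixed:       delay
//   exponential: (2^attemptsMade - 1) * delay
function retryDelays({ attempts, delay, type = 'fixed' }) {
  const delays = [];
  // A job with N attempts can be retried at most N - 1 times.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(
      type === 'exponential'
        ? Math.round((Math.pow(2, attemptsMade) - 1) * delay)
        : delay,
    );
  }
  return delays;
}

const scraperDelays = retryDelays({ attempts: 6, delay: 4000, type: 'exponential' });
console.log(scraperDelays); // [ 4000, 12000, 28000, 60000, 124000 ]
```

Under the same assumption, the build queue's 15 attempts with a fixed 4-second delay add up to at most 56 seconds of waiting between attempts.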
Service Implementation
Report Scheduling Logic
File: queue-manager/services/instareports/build/index.js
Query with Stale Lock Recovery
let filter = {
type: 'instareport',
build_at: { $lte: new Date() },
$or: [
{ in_progress: { $ne: true } },
{ updated_at: { $lt: new Date(new Date().getTime() - 30 * 60 * 1000) } },
],
};
Conditions:
- type: 'instareport' - Report build type
- build_at <= now - Scheduled time has passed
- in_progress !== true OR updated_at < 30 minutes ago - Stale lock recovery
Purpose: Prevents permanently stuck reports with 30-minute timeout
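The same conditions can be expressed as an in-memory predicate, which is handy for unit-testing the eligibility rules without a database. This is an illustrative sketch; `isReadyToBuild` is a hypothetical name and is not part of the service.

```javascript
const STALE_AFTER_MS = 30 * 60 * 1000; // 30 minutes, as in the filter above

// Hypothetical in-memory equivalent of the MongoDB filter.
function isReadyToBuild(item, now = new Date()) {
  const due = item.type === 'instareport' && item.build_at <= now;
  const unlocked = item.in_progress !== true;
  // A lock counts as stale when the item has not been updated for 30 minutes.
  const stale = item.updated_at < new Date(now.getTime() - STALE_AFTER_MS);
  return due && (unlocked || stale);
}
```

An item that is still marked in_progress becomes eligible again only once its updated_at falls outside the 30-minute window.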
Account Grouping Aggregation
let query = [
{ $match: filter },
{
$facet: {
count_instareports: [{ $count: 'count_instareports' }],
items: [
{
$group: {
_id: '$account_id',
queue_items: {
$push: '$_id',
},
},
},
{ $sort: { _id: 1 } },
],
},
},
{
$project: {
total: {
$arrayElemAt: ['$count_instareports.count_instareports', 0],
},
users: '$items',
},
},
];
Purpose: Groups queue items by account, ensures one report per account
- $facet: Parallel aggregation for count + grouping
- $group: Groups by account_id, collects all queue item IDs
- First Item Selection: Only processes queue_items[0] per account
Output Structure:
{
total: 45, // Total reports ready
users: [
{ _id: ObjectId("account1"), queue_items: [ObjectId("item1"), ObjectId("item2")] },
{ _id: ObjectId("account2"), queue_items: [ObjectId("item3")] }
]
}
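For readers less familiar with `$facet` and `$group`, the shape of this result can be reproduced in plain JavaScript. The sketch below is only an illustration of the grouping and first-item selection; `groupByAccount` and `pickFirstPerAccount` are hypothetical names, and IDs are treated as strings for simplicity.

```javascript
// Hypothetical in-memory version of the $facet/$group stages.
function groupByAccount(queueItems) {
  const byAccount = new Map();
  for (const item of queueItems) {
    const key = String(item.account_id);
    if (!byAccount.has(key)) byAccount.set(key, []);
    byAccount.get(key).push(item._id);
  }
  const users = [...byAccount.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // Mirrors { $sort: { _id: 1 } }
    .map(([accountId, queue_items]) => ({ _id: accountId, queue_items }));
  return { total: queueItems.length, users };
}

// Only the first queue item of each account is scheduled per run.
function pickFirstPerAccount({ users }) {
  return users.map(user => user.queue_items[0]);
}
```

Note that `$push` collects IDs in the order documents reach `$group`, so which item counts as "first" depends on the input order unless a sort stage precedes the grouping.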
Queue Addition
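const queue = await build_queue.start();
await Promise.allSettled(
  users.map(async user => {
    // Declared outside the try block so the catch handler can reference it.
    const item = user.queue_items[0]; // First item only
    try {
      await queue.add(
        { id: item },
        {
          attempts: 15,
          backoff: 4000,
          jobId: item.toString(), // Same ID as the queue item, so duplicates are rejected
        },
      );
      // Lock the queue item once the job has been accepted.
      await Queue.findByIdAndUpdate({ _id: item }, { in_progress: true, failure_reason: null });
    } catch (err) {
      // Release the lock and record why the item could not be queued.
      await Queue.updateOne(
        { _id: item },
        { in_progress: false, failure_reason: `${err.message}, ${err.stack}` },
      );
    }
  }),
);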
Pattern: One report per account processed concurrently
- Promise.allSettled: All queue additions are attempted; a failure is recorded in that item's failure_reason
- jobId: Prevents duplicate jobs in Bull queue
Build Queue Processor
OneBalance Verification
File: queue-manager/queues/instareports/build.js
const { verifyBalance } = require('../../utilities/onebalance');
try {
const account = await Account.findById(report.account_id);
await verifyBalance({
event: 'instareport',
account: account.toJSON(),
user_id: report.created_by,
quantity: 1,
});
} catch (err) {
await InstareportQueue.deleteOne({ _id: id });
await InstaReport.findByIdAndUpdate(queue._doc.reference_id, {
status: 'FAILED',
});
}
Purpose: Ensures account has credits before building report
- Event: instareport - Credit type
- Quantity: 1 credit per report
- Failure: Deletes queue item, marks report FAILED
Business Info Extraction
const address =
(report.details.business_info.address.street
? report.details.business_info.address.street + ' '
: '') +
(report.details.business_info.address.unit
? report.details.business_info.address.unit + ' '
: '') +
(report.details.business_info.address.suite
? report.details.business_info.address.suite + ' '
: '') +
(report.details.business_info.address.city
? report.details.business_info.address.city + ' '
: '') +
(report.details.business_info.address.state_province
? report.details.business_info.address.state_province + ' '
: '') +
(report.details.business_info.address.postal_code
? report.details.business_info.address.postal_code + ' '
: '') +
(report.details.business_info.address.country
? report.details.business_info.address.country
: '');
Purpose: Concatenates full address for Yext scan
- Handles Missing Fields: Conditional concatenation
- Space Delimited: Builds searchable address string
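The same conditional concatenation can be written more compactly by filtering out missing parts before joining. This is a sketch of an equivalent approach, not the code in the build queue; `buildAddress` and `ADDRESS_FIELDS` are hypothetical names.

```javascript
// Field order matches the concatenation above.
const ADDRESS_FIELDS = [
  'street',
  'unit',
  'suite',
  'city',
  'state_province',
  'postal_code',
  'country',
];

// Hypothetical compact equivalent: skip missing parts, join with single spaces.
function buildAddress(address = {}) {
  return ADDRESS_FIELDS
    .map(field => address[field])
    .filter(Boolean) // Drops undefined, null, and empty strings.
    .join(' ');
}
```

One small behavioral difference: the original leaves a trailing space when country is missing, while this version never does.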
Yext Queue Addition
const data = {
auth: {
uid: report.created_by,
account_id: report.account_id,
parent_account: report.parent_account || report.account_id,
},
reportID: report._id,
name: report.details.business_info.name,
address: address,
phone: report.details.business_info.phone,
website: report.details.business_info.website,
facebookURL: report.details.business_info.facebookURL,
};
const YextQueue = await yext_queue.start();
await YextQueue.add(data, {
attempts: 10,
backoff: {
delay: 4000,
type: 'exponential',
},
removeOnComplete: true,
});
Payload Structure: Contains auth context + business details
- Auth: User/account for JWT generation
- Business Info: Name, address, phone, URLs for matching
Yext Orchestrator
JWT Token Generation
File: queue-manager/queues/instareports/yext/index.js
importParams.integrationToken = jwt.sign(
{
type: 'access_token',
uid: importParams.auth.uid.toString(),
account_id: importParams.auth.account_id.toString(),
parent_account:
importParams.auth?.parent_account?.toString() || importParams.auth.account_id.toString(),
scope: 'analytics',
},
process.env.APP_SECRET,
{ expiresIn: '10m' },
);
Token Claims:
- type: access_token
- scope: analytics - Permission for analytics APIs
- expiresIn: 10m - Short-lived for security
Yext Listings Scan
Initiate Scan
const res = await newScan(importParams.integrationToken, {
name: importParams.name,
address: importParams.address,
phone: importParams.phone,
});
// API: POST /v1/e/yext/scan
Response:
{
success: true,
data: {
response: {
jobId: "scan_12345",
sites: [
{ siteId: "GOOGLEPLACES", name: "Google My Business", homepage: "...", logo: "..." },
{ siteId: "YELP", name: "Yelp", homepage: "...", logo: "..." },
{ siteId: "FACEBOOK", name: "Facebook", homepage: "...", logo: "..." },
// ... more sites
]
}
}
}
Purpose: Starts Yext scan job for business listings across platforms
Poll for Completion
do {
await timer(10000); // Wait 10 seconds
const scanDatares = await getScanData(importParams.integrationToken, jobID, sitesIDs);
if (scanDatares.success === true) {
scanData = scanDatares.data.response;
let index = scanData.findIndex(x => x.status === 'SCAN_IN_PROGRESS');
if (index == -1) {
allScanPending = false;
}
}
} while (allScanPending);
// API: GET /v1/e/yext/scan/{jobID}/{siteIDs}
Polling Logic:
- Interval: 10 seconds
- Termination: When no sites have SCAN_IN_PROGRESS status
- Timeout: Implicit (job will retry after backoff)
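The loop shown above has no explicit upper bound on the number of polls. One way to make the timeout explicit is sketched below; this is an assumption-laden illustration rather than the orchestrator's code. `pollScan`, `fetchScan`, `maxPolls`, and the injectable `sleep` are hypothetical, and only the response shape (`success`, `data.response`, `status`) is taken from the snippets above.

```javascript
// True once no site is still scanning (same check as the findIndex call above).
function allScansFinished(scanData) {
  return !scanData.some(site => site.status === 'SCAN_IN_PROGRESS');
}

// Hypothetical bounded poll. `fetchScan` and `sleep` are injected so the loop
// can be exercised without the real API or real waiting.
async function pollScan(fetchScan, { intervalMs = 10000, maxPolls = 30, sleep } = {}) {
  const wait = sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  for (let poll = 1; poll <= maxPolls; poll++) {
    await wait(intervalMs);
    const res = await fetchScan();
    if (res.success === true && allScansFinished(res.data.response)) {
      return res.data.response;
    }
  }
  // Throwing hands control back to the queue's retry/backoff settings.
  throw new Error(`Scan still in progress after ${maxPolls} polls`);
}
```

With the defaults used here, a scan that never settles would fail the job after roughly five minutes instead of polling indefinitely.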
Scan Result Structure:
{
siteId: "GOOGLEPLACES",
status: "LISTING_FOUND" | "LISTING_NOT_FOUND" | "SCAN_IN_PROGRESS",
url: "https://maps.google.com/...",
match_address_score: 0.95,
match_name_score: 0.98,
match_phone_score: 1.00
}
Confidence-Based Matching
Threshold: 90% confidence for all three scores
let addressScore = Yextmerged[GoogleMapindex].match_address_score;
let nameScore = Yextmerged[GoogleMapindex].match_name_score;
let phoneScore = Yextmerged[GoogleMapindex].match_phone_score;
if (addressScore > 0.9 && nameScore > 0.9 && phoneScore > 0.9) {
// High confidence - use listing URL
googleTopicPayload.mapURL = Yextmerged[GoogleMapindex].url;
} else {
// Low confidence - empty URL (scraper will handle)
googleTopicPayload.mapURL = '';
}
Purpose: Only uses matched listings if confidence is high
- Address: Location match
- Name: Business name match
- Phone: Phone number match
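Since the same three-score check applies to each platform, it can be factored into one helper. The sketch below assumes the scan result fields shown earlier (`url`, `match_address_score`, `match_name_score`, `match_phone_score`); `confidentUrl` is a hypothetical name.

```javascript
const MATCH_THRESHOLD = 0.9;

// Returns the listing URL only when address, name, and phone all score
// strictly above the threshold; otherwise returns an empty string.
function confidentUrl(listing, threshold = MATCH_THRESHOLD) {
  if (!listing) return '';
  const scores = [
    listing.match_address_score,
    listing.match_name_score,
    listing.match_phone_score,
  ];
  // A missing score compares as false, so incomplete results never match.
  const confident = scores.every(score => score > threshold);
  return confident ? listing.url || '' : '';
}
```

Because the comparison is strict, a score of exactly 0.9 does not count as a match.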
Scraper Configuration
Default Config:
{
yext: true,
semrush: true,
facebook_ads: true,
yelp: true,
seo: true,
google_ads: true,
facebook: true,
google_map: true,
page_speed: true
}
Conditional Execution:
let scrapesToRun = {
yext: (init || previousScrapes?.yext?.status == 'FAILED') && configs.yext,
semrush: (init || previousScrapes?.semrush?.status == 'FAILED') && configs.semrush,
// ... etc
};
Logic: Run scraper if:
- Initial run (init=true), OR
- Previous run failed, AND
- Config enables scraper
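The rule reads as (initial run OR previous failure) AND enabled, applied per scraper. A generic version that loops over the config keys might look like the sketch below; `planScrapes` is a hypothetical name, and the real code spells out each scraper explicitly.

```javascript
// Hypothetical generalization of the scrapesToRun object above.
function planScrapes(configs, previousScrapes, init) {
  const plan = {};
  for (const [name, enabled] of Object.entries(configs)) {
    const failedBefore = previousScrapes?.[name]?.status === 'FAILED';
    // Run on the first build, or re-run only what failed, and only if enabled.
    plan[name] = Boolean((init || failedBefore) && enabled);
  }
  return plan;
}
```

On a rebuild (init=false), scrapers that already completed are skipped, so only failed and still-enabled scrapers are queued again.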
Data Models
InstareportsQueue Collection
{
_id: ObjectId,
account_id: ObjectId,
reference_id: ObjectId, // InstaReport document ID
type: 'instareport',
build_at: Date, // Scheduled build time
in_progress: Boolean,
failure_reason: String,
created_at: Date,
updated_at: Date
}
InstaReport Collection
{
_id: ObjectId,
account_id: ObjectId,
parent_account: ObjectId,
created_by: ObjectId,
status: String, // 'QUEUED' | 'RUNNING' | 'COMPLETED' | 'FAILED'
details: {
business_info: {
name: String,
phone: String,
website: String,
facebookURL: String,
address: {
street: String,
unit: String,
suite: String,
city: String,
state_province: String,
postal_code: String,
country: String
}
},
configs: {
yext: Boolean,
semrush: Boolean,
facebook_ads: Boolean,
yelp: Boolean,
google_map: Boolean,
page_speed: Boolean,
facebook: Boolean
}
},
scrapers: {
yext: {
status: String, // 'QUEUED' | 'RUNNING' | 'COMPLETED' | 'FAILED'
data: Array, // Scan results
error: String
},
google_map: { status, data, error },
yelp: { status, data, error },
facebook: { status, data, error },
semrush: { status, data, error },
page_speed: { status, data, error },
facebook_ads: { status, data, error }
},
created_at: Date,
updated_at: Date
}
Scraper Coordination
Execution Order
- Yext (Primary) - Scans 100+ listing platforms
- Parallel Dependent Scrapers:
  - Google Maps - Reviews, ratings, photos
  - Yelp - Reviews, ratings, business info
  - Facebook - Page info, followers, engagement
  - Semrush - SEO metrics, keywords, backlinks
  - PageSpeed - Performance metrics
  - Facebook Ads Library - Active ads
Status Tracking
Per-Scraper Status:
- QUEUED: Waiting to start
- RUNNING: Currently scraping
- COMPLETED: Successfully finished
- FAILED: Error occurred
Report Status:
- Aggregated from all scraper statuses
- COMPLETED when all scrapers complete
- FAILED if critical scrapers fail
Error Handling
OneBalance Failures
Scenario: Insufficient credits
try {
await verifyBalance({...});
} catch (err) {
await InstareportQueue.deleteOne({ _id: id });
await InstaReport.findByIdAndUpdate(reference_id, {
status: 'FAILED'
});
}
Impact: Report marked FAILED, queue item deleted, no retry
Yext Scan Failures
Scenarios:
- API timeout
- Invalid business data
- Rate limiting
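Handling: Up to 10 attempts with exponential backoff. Note that the handler below marks the Yext scraper FAILED once attemptsMade reaches 6, before the configured attempts are exhausted.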
if (job.attemptsMade >= 6) {
await Instareport.findByIdAndUpdate(reportID, {
'scrapers.yext': {
error: err.response?.data || err.message,
status: 'FAILED',
},
});
}
Scraper Failures
Individual Scraper Failure: Other scrapers continue
- Isolation: Each scraper runs independently
- Partial Results: Report completes with available data
Performance Considerations
Account-Level Throttling
One Report Per Account: Prevents overwhelming account resources
- Grouping: Aggregation groups by account_id
- Selection: Only first queue item per account processed
Polling Overhead
Yext Scan Polling: 10-second intervals until complete
- Duration: Typically 30-120 seconds
- Network: Multiple API calls per scan
Concurrency
Scraper Parallelization: All scrapers run concurrently
- 6 Scrapers: After Yext completes
- Bull Queues: Separate queue per scraper type
Monitoring & Logging
Key Metrics
- Queue Depth: Pending instareport builds
- Build Duration: Time from queue to completion
- Scraper Success Rates: Per-scraper completion percentage
- OneBalance Rejections: Insufficient credit failures
- Yext Scan Duration: Time to complete listings scan
- Match Confidence: Distribution of confidence scores
Alerting Scenarios
- High Queue Depth: > 100 pending builds
- Low Success Rate: < 80% completion
- Frequent Credit Failures: Many OneBalance rejections
- Long Build Times: > 10 minutes per report
- Scraper Failures: Individual scraper failure rate > 20%
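The numeric thresholds above translate directly into a small check that a monitoring job could run. This is a sketch only: the metric field names (`queueDepth`, `successRate`, `buildMinutes`, `scraperFailureRates`) and alert labels are hypothetical, and the credit-failure scenario is left out because the list gives no numeric threshold for it.

```javascript
// Thresholds come from the list above; metric field names are assumptions.
function checkAlerts(metrics) {
  const alerts = [];
  if (metrics.queueDepth > 100) alerts.push('HIGH_QUEUE_DEPTH');
  if (metrics.successRate < 0.8) alerts.push('LOW_SUCCESS_RATE');
  if (metrics.buildMinutes > 10) alerts.push('LONG_BUILD_TIME');
  // Per-scraper failure rates are expressed as fractions (0.2 = 20%).
  for (const [name, rate] of Object.entries(metrics.scraperFailureRates || {})) {
    if (rate > 0.2) alerts.push(`SCRAPER_FAILURES:${name}`);
  }
  return alerts;
}
```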
Related Documentation
- InstReports Module Overview - Module overview
- Common OneBalance - Credit verification utility
- External Yext Integration - Yext API details
Summary
The InstReports Report Building module collects data from 7+ sources through a multi-queue architecture: a build queue, a Yext orchestrator queue, and one queue per dependent scraper. Confidence-based matching (all three scores above 90%) guards against attaching the wrong listing, and OneBalance verification blocks builds for accounts without credits. The 30-second schedule gives queued reports a rapid turnaround, while stale lock recovery keeps jobs from staying stuck.
Key Strengths:
- Multi-Source Intelligence: Aggregates 7+ data sources
- Smart Matching: 90%+ confidence thresholds ensure accuracy
- Account Throttling: One report per account prevents overload
- Credit Verification: OneBalance integration ensures authorized usage
- Configurable: Per-report scraper configuration
- Resilient: Retry logic for each scraper independently
Critical for:
- Lead generation and demos
- Competitive intelligence
- SEO analysis
- Revenue generation (credit consumption)
- Client value delivery