Reporting on Data in High Volume Mongo Collections

Several services here at Barstool generate millions of Mongo documents every day. To give the data and analytics team efficient reporting tools, we implemented a pattern that yields near real-time results without running long queries across millions of records. Our solution combines TTL indexes on Mongo collections, a Stitch sync to Snowflake for backup, and Mongo aggregations.

Mongo has built-in support for TTL indexes on collections, which act as a cleanup mechanism by automatically deleting documents after a set period of time. Even with proper indexing, running aggregations over a high volume of documents is not performant. We assessed our collections, determined the most useful reporting timeframe, and then created the indexes. When adding a TTL index to an existing collection, consider the strain the initial cleanup will place on the DB if a large number of documents already qualify for deletion. To launch, we began by setting expireAfterSeconds to a large value, so that only documents created far in the past qualified for deletion, and periodically lowered it (using collMod to update the index) until we reached our desired TTL. This kept the number of records deleted at each step small and spread the deletions out over time.

// Create the initial TTL index (STARTING_NUM_DAYS is a placeholder; start large)
db.reportCollection.createIndex(
  {
    created_at: 1
  },
  {
    expireAfterSeconds: 60 * 60 * 24 * STARTING_NUM_DAYS
  }
)

// Update the TTL (repeat with smaller values until NUM_DAYS is reached)
db.runCommand({
  collMod: 'reportCollection',
  index: {
    keyPattern: { created_at: 1 },
    expireAfterSeconds: 60 * 60 * 24 * NUM_DAYS
  }
})
TTL Index

Once a document's created_at is more than NUM_DAYS old, it is deleted. To ensure nothing is lost, we sync records into Snowflake via Stitch for long-term storage. The collection now holds a manageable number of documents that we can report on. To further streamline the process and keep Mongo from getting overloaded, we generate the aggregates on a cron and store the results in another collection.
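
As a rough sketch of that cron step, the body of the job (scheduling omitted) boils down to running the aggregation over the TTL-bounded collection and writing the results into a separate rollup collection. The collection name aggregatedReports and the record shape below are assumptions for illustration, not our exact implementation; the full reporting pipeline is shown in the next section.

// Sketch of the scheduled job body: aggregate the TTL-bounded collection and
// store the results in a rollup collection ('aggregatedReports' is an assumed name)
const results = db.reportCollection.aggregate([
  // ...any $match stages from the reporting pipeline go here (see below)...
  {
    $group: {
      _id: {
        topLevel: '$topLevel',
        midLevel: '$midLevel',
        baseLevel: '$baseLevel'
      },
      sum: { $sum: 1 }
    }
  }
]).toArray()

// Flatten each group into one stored record per granularity combination
// (depending on reporting needs, each run could instead upsert or replace prior records)
db.aggregatedReports.insertMany(
  results.map((r) => ({
    aggregation: 'baseLevel',
    topLevel: r._id.topLevel,
    midLevel: r._id.midLevel,
    baseLevel: r._id.baseLevel,
    sum: r.sum
  }))
)
Rollup Storage Sketch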

Working with the data and analytics team, we determined the smallest granularity of reporting we would need and wrote a Mongo aggregation based on their parameters. On our end, we store the results and create records for each level of granularity to cut down on the amount of processing needed when generating reports (one way to produce the coarser records is sketched after the record examples below).

// Aggregation
db.collection.aggregate([
  {
    $match: {
      field: 'foo'
    }
  },
  {
    $group: {
      _id: {
        topLevelGranularity: '$topLevel',
        midLevelGranularity: '$midLevel',
        baseLevelGranularity: '$baseLevel'
      },
      sum: {
        $sum: 1
      }
    }
  }
])
                        
// Records Examples
{
  aggregation: 'baseLevel',
  baseLevel: 'blog',
  midLevel: 'sports',
  topLevel: 'barstool',
  sum: 1
}
{
  aggregation: 'baseLevel',
  baseLevel: 'video',
  midLevel: 'sports',
  topLevel: 'barstool',
  sum: 1
}
{
  aggregation: 'midLevel',
  midLevel: 'sports',
  topLevel: 'barstool',
  sum: 2
}
{
  aggregation: 'topLevel',
  topLevel: 'barstool',
  sum: 2
}
Aggregate Query
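
The mid- and top-level records shown above can be produced in a few ways; one possible approach (a sketch only, with aggregatedReports as an assumed name for the rollup collection) is to run the same aggregation with fewer group keys and upsert one record per coarser group:

// Sketch: produce the mid-level records by grouping on fewer keys and
// upserting into the rollup collection ('aggregatedReports' is an assumed name).
// The top-level records follow the same pattern with only topLevel in the key.
db.collection.aggregate([
  {
    $match: { field: 'foo' }
  },
  {
    $group: {
      _id: { topLevel: '$topLevel', midLevel: '$midLevel' },
      sum: { $sum: 1 }
    }
  }
]).forEach((doc) => {
  db.aggregatedReports.updateOne(
    { aggregation: 'midLevel', topLevel: doc._id.topLevel, midLevel: doc._id.midLevel },
    { $set: { sum: doc.sum } },
    { upsert: true }
  )
})
Mid-Level Rollup Sketch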

The above is a very basic aggregation example, but it can be expanded to include as many pipeline stages as required. Additionally, add indexes on the aggregated data so that querying it for reports stays performant; a sketch of what that might look like follows below. The front-end team uses our API to collect and present this data in various charts and graphs. We have a roadmap to further offload these reports using a few services and will post about the outcome.
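
For example, assuming the rollup collection is named aggregatedReports and its records look like the examples above, the supporting index and a typical report lookup might be as simple as:

// Sketch: index the rollup collection on the fields reports filter by
// ('aggregatedReports' and the field names are illustrative assumptions)
db.aggregatedReports.createIndex({ aggregation: 1, topLevel: 1, midLevel: 1, baseLevel: 1 })

// e.g. fetch all mid-level counts under the 'barstool' top level
db.aggregatedReports.find({ aggregation: 'midLevel', topLevel: 'barstool' })
Report Index Sketch

With the heavy lifting done on the cron, the report endpoints reduce to indexed finds over a small, pre-aggregated collection.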