If you use Scan in your applications and would like to speed it up, a parallel scan can come in handy. As the name implies, a parallel scan consists of multiple scans that run in parallel to return the result faster.
The number of segments (TotalSegments) determines how many logical partitions the scan is split into. Each segment can then be read independently (and in parallel) by a separate worker or thread.
Be aware that while a parallel scan improves speed, it consumes read capacity units (RCUs) at a much higher rate, which can increase cost and throttle other traffic on the table.
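To illustrate the two parameters, here is a minimal sketch of a single segment's request using the AWS SDK for JavaScript v3; the table name and segment numbers are placeholders, and the full repository examples follow below.

import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb"

const client = new DynamoDBClient({})

// Reads only segment 2 of 4; three other workers would issue the same
// request with Segment 0, 1 and 3, all sharing TotalSegments: 4.
async function scanOneSegment() {
  const response = await client.send(
    new ScanCommand({
      TableName: "example-table", // placeholder table name
      Segment: 2,                 // zero-based index of this worker's slice
      TotalSegments: 4,           // total number of slices the scan is split into
    })
  )
  return response.Items ?? []
}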
How to choose the number of segments
For large tables (tens or hundreds of GB), more segments help distribute the work evenly.
For small tables, a few segments are enough; too many create overhead without benefit.
You might want to experiment with different numbers of segments to find the best speed; a simple starting-point heuristic is sketched below.
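As a rough starting point, you can derive a segment count from the table size reported by DescribeTable. The one-segment-per-2-GB ratio and the cap below are illustrative assumptions, not official guidance.

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

// Assumed heuristic: roughly one segment per 2 GB of data, capped so a
// small table is not split into dozens of tiny segments.
const BYTES_PER_SEGMENT = 2 * 1024 * 1024 * 1024
const MAX_SEGMENTS = 32

export async function suggestTotalSegments(
  client: DynamoDBClient,
  tableName: string
): Promise<number> {
  const { Table } = await client.send(
    new DescribeTableCommand({ TableName: tableName })
  )
  // TableSizeBytes is refreshed by DynamoDB roughly every six hours,
  // so treat it as an estimate rather than an exact value.
  const sizeBytes = Table?.TableSizeBytes ?? 0
  const suggested = Math.ceil(sizeBytes / BYTES_PER_SEGMENT)
  return Math.min(Math.max(suggested, 1), MAX_SEGMENTS)
}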
Examples and time measurements
Timing of a scan operation depends on many parameters, such as the number of items, the number of attributes to return, provisioned or on-demand capacity, filter expressions, and pagination.
As an example we take a table with 100k items and around 20 MB of data, using on-demand capacity, with no additional filter expressions applied to the scan. The TypeScript snippets below are simplified to show only the important parts.
Example of paginated scan:
import { AttributeValue, DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb"

export class ExampleRepository {
  // The client and table name are assumed to be injected via the
  // constructor (elided in the original snippet).
  constructor(
    private readonly dynamoClient: DynamoDBClient,
    private readonly tableName: string
  ) {}

  async scan() {
    let result: Record<string, AttributeValue>[] = []
    let lastEvaluatedKey: Record<string, AttributeValue> | undefined = undefined
    do {
      // A single Scan request returns at most 1 MB of data, so keep
      // paginating until no LastEvaluatedKey is returned.
      const scanResponse = await this.dynamoClient.send(
        new ScanCommand({
          TableName: this.tableName,
          ExclusiveStartKey: lastEvaluatedKey,
        })
      )
      lastEvaluatedKey = scanResponse.LastEvaluatedKey
      result = result.concat(scanResponse.Items ?? [])
    } while (lastEvaluatedKey)
    return result
  }
}
The approximate time to return the result is 5 seconds.
Now we modify the code to run the paginated scan once per segment, in parallel: we first create one promise per segment and then await them all with Promise.all().
Example of parallel scan:
import { AttributeValue, DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb"

export class ExampleRepository {
  // Same assumed constructor as in the previous example.
  constructor(
    private readonly dynamoClient: DynamoDBClient,
    private readonly tableName: string
  ) {}

  async scan() {
    const amountOfSegments = 10
    // Start one paginated scan per segment; they all run concurrently.
    const scanPromises: Promise<Record<string, AttributeValue>[]>[] = []
    for (let segment = 0; segment < amountOfSegments; segment++) {
      scanPromises.push(this.scanSegment(segment, amountOfSegments))
    }
    const segmentResults = await Promise.all(scanPromises)
    // Merge the per-segment results into a single array.
    let totalResult: Record<string, AttributeValue>[] = []
    segmentResults.forEach((segmentResult) => {
      totalResult = totalResult.concat(segmentResult)
    })
    return totalResult
  }

  private async scanSegment(
    segment: number,
    amountOfSegments: number
  ): Promise<Record<string, AttributeValue>[]> {
    let result: Record<string, AttributeValue>[] = []
    let lastEvaluatedKey: Record<string, AttributeValue> | undefined = undefined
    do {
      // Each request reads only this worker's segment and paginates
      // through it independently of the other segments.
      const scanResponse = await this.dynamoClient.send(
        new ScanCommand({
          TableName: this.tableName,
          ExclusiveStartKey: lastEvaluatedKey,
          Segment: segment,
          TotalSegments: amountOfSegments,
        })
      )
      lastEvaluatedKey = scanResponse.LastEvaluatedKey
      result = result.concat(scanResponse.Items ?? [])
    } while (lastEvaluatedKey)
    return result
  }
}
The approximate time to return the result is now 3 seconds, around 40% less time than the sequential scan (roughly a 1.7x speedup).
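For completeness, a short usage sketch; the region and table name are placeholders, and the constructor matches the assumed signature from the snippets above.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb"

async function main() {
  const repository = new ExampleRepository(
    new DynamoDBClient({ region: "eu-west-1" }), // placeholder region
    "example-table"                              // placeholder table name
  )
  const items = await repository.scan()
  console.log(`Scanned ${items.length} items`)
}

main().catch(console.error)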
Conclusion
Parallel scans in DynamoDB speed up processing of large tables by splitting the work across segments and using the table's read throughput more fully. They reduce scan time, but read capacity units (RCUs) are consumed at a much higher rate, which can increase cost and throttle other workloads on the table.