scan_table module

This module implements a PostgreSQL sequential table scanner. The rows in the table will be read and processed in batches. Each batch will consist of rows located physically close to each other, thus saving precious IO bandwidth.

The scanned table must be defined as an SQLAlchemy Core table, or using an SQLAlchemy’s declarative base class.

class swpt_pythonlib.scan_table.TableScanner

A table-scanner super-class. Sub-classes may override class attributes.

Exapmle:

from swpt_pythonlib.scan_table import TableScanner
from mymodels import Customer

class CustomerScanner(TableScanner):
    table = Customer.__table__
    columns = [Customer.id, Customer.last_order_date]

    def process_rows(self, rows):
        for row in rows:
            print(row['id'], row['last_order_date'])
blocks_per_query: int = 40

The number of database pages (blocks) to be retrieved per query. It might be a good idea to increase this number when the size of the table row is big.

columns: Optional[List[ColumnElement]] = None

An optional list of sqlalchemy.sql.expression.ColumnElement instances to be be retrieved for each row. Most of the time it will be a list of Column instances. Defaults to all columns.

process_rows(rows: list) None

Process a list or rows.

Must be defined in the subclass.

Parameters

rows – A list of table rows.

run(engine: ConnectionEventsTarget, completion_goal: timedelta, quit_early: bool = False) None

Scan table continuously.

The table is scanned sequentially, starting from a random row. During the scan process_rows() will be continuously invoked with a list of rows. When the end of the table is reached, the scan continues from the beginning, ad infinitum.

Parameters
  • engine – SQLAlchemy engine

  • completion_goal – The time interval in which the whole table should be processed. This is merely an approximate goal. In reality, scans can take any amount of time.

  • quit_early – Exit after some time. This is mainly useful during testing.

table: Optional[Table] = None

The sqlalchemy.schema.Table that will be scanned. (Model.__table__ if declarative base is used.)

Must be defined in the subclass.

target_beat_duration: int = 25

The scanning of the table is done in a sequence of “beats”. This attribute determines the ideal duration in milliseconds of those beats. The value should be big enough so that, on average, all the operations performed on table’s rows could be completed within this interval. Setting this value too high may have the effect of too many rows being processed simultaneously in one beat.