scan_table
module¶
This module implements a PostgreSQL sequential table scanner. The rows in the table will be read and processed in batches. Each batch will consist of rows located physically close to each other, thus saving precious IO bandwidth.
The scanned table must be defined as an SQLAlchemy Core table, or using an SQLAlchemy’s declarative base class.
- class swpt_pythonlib.scan_table.TableScanner¶
A table-scanner super-class. Sub-classes may override class attributes.
Exapmle:
from swpt_pythonlib.scan_table import TableScanner from mymodels import Customer class CustomerScanner(TableScanner): table = Customer.__table__ columns = [Customer.id, Customer.last_order_date] def process_rows(self, rows): for row in rows: print(row['id'], row['last_order_date'])
- blocks_per_query: int = 40¶
The number of database pages (blocks) to be retrieved per query. It might be a good idea to increase this number when the size of the table row is big.
- columns: Optional[List[ColumnElement]] = None¶
An optional list of
sqlalchemy.sql.expression.ColumnElement
instances to be be retrieved for each row. Most of the time it will be a list ofColumn
instances. Defaults to all columns.
- process_rows(rows: list) None ¶
Process a list or rows.
Must be defined in the subclass.
- Parameters
rows – A list of table rows.
- run(engine: ConnectionEventsTarget, completion_goal: timedelta, quit_early: bool = False) None ¶
Scan table continuously.
The table is scanned sequentially, starting from a random row. During the scan
process_rows()
will be continuously invoked with a list of rows. When the end of the table is reached, the scan continues from the beginning, ad infinitum.- Parameters
engine – SQLAlchemy engine
completion_goal – The time interval in which the whole table should be processed. This is merely an approximate goal. In reality, scans can take any amount of time.
quit_early – Exit after some time. This is mainly useful during testing.
- table: Optional[Table] = None¶
The
sqlalchemy.schema.Table
that will be scanned. (Model.__table__
if declarative base is used.)Must be defined in the subclass.
- target_beat_duration: int = 25¶
The scanning of the table is done in a sequence of “beats”. This attribute determines the ideal duration in milliseconds of those beats. The value should be big enough so that, on average, all the operations performed on table’s rows could be completed within this interval. Setting this value too high may have the effect of too many rows being processed simultaneously in one beat.