data_access_api.services.cohort
Attributes
Functions
|
|
|
Validates that an inbound SQL query is structurally safe to run against OMOP. |
|
Executes a SQL query and returns results. |
|
Returns counts of non-null values for each column in the DataFrame. |
|
Returns counts of null values for each column in the DataFrame. |
|
Returns the distribution of sexes in the DataFrame. |
|
Returns the distribution of ages in the DataFrame. |
|
Verifies that the number of unique values in each column of the DataFrame is not smaller than the threshold. |
|
Groups entries in the results list with counts less than min_count into an "Other" category. |
|
Returns aggregated statistics from the query results. |
Module Contents
- data_access_api.services.cohort.COHORT_QUERY_THRESHOLD = 10
- data_access_api.services.cohort.ALLOWED_SCHEMA = 'omop'
- data_access_api.services.cohort.MAX_QUERY_LENGTH = 10240
- data_access_api.services.cohort._ALLOWED_QUERY_TYPES: tuple[type[sqlglot.exp.Expression], Ellipsis]
- data_access_api.services.cohort._invalid_query(detail: str) fastapi.HTTPException
- data_access_api.services.cohort.validate_query(query: str) bool
Validates that an inbound SQL query is structurally safe to run against OMOP.
Database-layer protections already in place
The data-access-api connects as
data_analyst_reader(seeflip-omop-db/files/create_readonly_users.sql), a Postgres role granted onlyCONNECT+USAGEon schemaomop+SELECTon its tables and sequences, withINSERT,UPDATE,DELETE,TRUNCATE, andCREATEexplicitly REVOKEd. Any DDL or DML is therefore rejected by Postgres itself, so this function does NOT keyword-filter forDROP/INSERT/UPDATE/ etc. — those are already covered at the DB layer.What this function enforces
Since the read-only role can still issue arbitrary
SELECTqueries:The query is shorter than
MAX_QUERY_LENGTH(DoS guard).The query parses as exactly one non-empty statement (defeats query stacking, stray semicolons that bypass the count check, and empty inputs).
The top-level statement is SELECT-shaped (rejects
COPY,EXPLAIN, and any DDL/DML the DB would also reject — fail fast at the API).Any schema-qualified table reference targets only the
omopschema (blocks enumeration ofinformation_schema,pg_catalog,pg_classetc., which Postgres makes readable to rolepublicby default).Every
LIMITandOFFSETis a literal integer (defeats the blind data-extraction technique that abusesLIMIT CASE WHEN <predicate> THEN n ELSE m ENDto make the row count a function of a single character value, then reads it back via the cohort-size error message).
- param query:
The SQL query string from the caller.
- returns:
Truewhen the query is structurally safe.- raises HTTPException(400):
When any of the rules above is violated.
- data_access_api.services.cohort.get_records(query: str | sqlalchemy.sql.elements.TextClause, params: collections.abc.Mapping[str, Any] | None = None) pandas.DataFrame
Executes a SQL query and returns results.
- Parameters:
query (str | TextClause) – The SQL query to execute. Pass a
TextClausewith bind parameters when the query is parameterized.params (Mapping[str, Any] | None) – Optional mapping of bind parameter names to values for parameterized queries.
- Returns:
The results of the query as a DataFrame.
- Return type:
pd.DataFrame
- Raises:
HTTPException – If the query is invalid or if there is an error during execution.
- data_access_api.services.cohort.get_counts(df: pandas.DataFrame) dict
Returns counts of non-null values for each column in the DataFrame.
- Parameters:
df (pd.DataFrame) – The cohort DataFrame.
- Returns:
{"name": "Counts", "results": [{"value": <column>, "count": <int>}, ...]}.- Return type:
dict
- data_access_api.services.cohort.get_null_counts(df: pandas.DataFrame) dict
Returns counts of null values for each column in the DataFrame.
- Parameters:
df (pd.DataFrame) – The cohort DataFrame.
- Returns:
{"name": "Nulls", "results": [{"value": <column>, "count": <int>}, ...]}.- Return type:
dict
- data_access_api.services.cohort.get_sex_distribution(df: pandas.DataFrame) dict
Returns the distribution of sexes in the DataFrame.
Assumes the DataFrame has accesion_id, query the table to get the sex distribution.
- Parameters:
df (pd.DataFrame) – The cohort DataFrame. Must include a
person_idcolumn; otherwise an empty result set is returned.- Returns:
{"name": "Sex Distribution", "results": [{"value": <sex>, "count": <int>}, ...]}.- Return type:
dict
- data_access_api.services.cohort.get_age_distribution(df: pandas.DataFrame) dict
Returns the distribution of ages in the DataFrame.
Assumes the DataFrame has accesion_id, query the table to get the age distribution.
- Parameters:
df (pd.DataFrame) – The cohort DataFrame. Must include a
person_idcolumn; otherwise an empty result set is returned.- Returns:
{"name": "Age Distribution", "results": [{"value": "<decade>", "count": <int>}, ...]}where each value is a ten-year age bucket.- Return type:
dict
- data_access_api.services.cohort.verify_cardinality(df: pandas.DataFrame, threshold: float = 0.05) bool
Verifies that the number of unique values in each column of the DataFrame is not smaller than the threshold.
This is to prevent leaking information about individuals in the cohort.
- Parameters:
df (pd.DataFrame) – The cohort DataFrame to check.
threshold (float) – Minimum acceptable proportion of unique values per column. Defaults to
0.05.
- Returns:
Trueif every column has enough unique values (either aboveCOHORT_QUERY_THRESHOLDin absolute terms, or abovethresholdin relative terms).Falseif any column falls below both thresholds.- Return type:
bool
- data_access_api.services.cohort.make_other_category(results: list[dict], min_count: int = COHORT_QUERY_THRESHOLD) list[dict]
Groups entries in the results list with counts less than min_count into an “Other” category.
- Parameters:
results (list of dict) – List of dictionaries with ‘value’ and ‘count’ keys.
min_count (int) – Minimum count threshold to avoid grouping into “Other”.
- Returns:
Updated list with low-count entries grouped into “Other”.
- Return type:
list of dict
- data_access_api.services.cohort.get_statistics(df: pandas.DataFrame, query_input: data_access_api.routers.schema.CohortQueryInput, threshold: int) data_access_api.routers.schema.StatisticsResponse
Returns aggregated statistics from the query results.
Counts the number of records.
Aggregates the number of occurrences of each unique value per column.
Below-threshold counts are privacy-suppressed by returning a
StatisticsResponsewithrecord_count=0, emptydataandsuppressed=True— the count itself is suppressed, not just the per-field breakdown. A genuine zero is suppressed identically to a small (1..threshold-1) count, so the two are indistinguishable on the wire and the response cannot be used to infer that >=1 patient matched (membership disclosure — issue #519, security review). Thesuppressedflag only tells the hub/UI to render a “below-threshold” chip instead of a bare 0; it does not reveal which 0s were genuine. Suppression is intentional rather than an HTTPException so the trust still has a normal response to forward to the hub; raising here previously caused trust-api to skip the hub callback and leave the per-trust UI status stuck.- Parameters:
df (pd.DataFrame) – Query results dataframe.
query_input (data_access_api.routers.schema.CohortQueryInput) – Input object containing the query and metadata.
threshold (int) – Minimum number of records required to return results.
- Returns:
Contains the aggregated statistics, or a 0-count empty response when below
COHORT_QUERY_THRESHOLD.- Return type: