Baseline to stem_table
The baseline table contains one row per person and a column for each field. There can be thousands of columns and each column name is structured as field_id-instance.array
(e.g. 20001-1.15
). The field_id
encodes the variable, the instance
indicates one of the four assessment centre visits (ranging from 0 to 3) and with the array
index multiple values can be given for the same field_id at the same visit.
To map all these columns to individual events, we map them first to a staging table: the stem table. Depending on the domain of the target concepts, the events are then mapped to their respective tables. See the section on stem table mapping
The input for this transformation are mapping tables of thefield_id
. Each field can have multiple values, for which a separate mapping is made. The mappings are made using a modified version of Usagi, allowing for mapping of values and having different target types. They are saved as default Usagi save files in this location: resources/baseline_field_mapping
. This folder also contains a readme with a detailed description of how the mappings were made.
Each field or field/value combination can have a mapping to an event, unit and/or value concept. Also, each field is associated to a date field_id (resources/baseline_field_mapping/date_field_lookup.csv
). Based the mappings given, both the semantic mapping and structural mapping is made. The value can be either a discrete (mapping to a value_as_concept_id), numeric (mapping to value_as_number) or a text (mapping to value_as_string).
- discrete if the value is coded
- numeric if the value can be converted to float (or if the value is missing)
- text in all other cases
The ETL is implemented as follows:
- Loop through all rows of the baseline table:
- Loop through all columns in the row (except for the
eid
):- If the value is empty, go to next column.
- From the column name, extract
field_id
andinstance
index (we ignore the array index) - Look up
target
concepts byfield_id
and value. If thefield_id
is ignored, we continue to the next field. - Look up date_field_id by
field_id
. Create the column name, using the extractedinstance
and array ‘0’. (e.g.53-1.0
, if the date_field_id is ‘53’ and the instance is ‘1’) - Map record to stem table columns according to the overview below.
- Loop through all columns in the row (except for the
Notes:
- The field is considered numeric if it has no value mappings
- For numeric fields, values of -1 and -3 are filtered out.
- A field can also be ‘ignored’ meaning that no OMOP record should be created for this field.
- Fields not given in the mapping tables are also included with a
concept_id
of 0. Thesource_concept_id
is looked up for the givenfield_id
. - Mappings that have not been approved will be mapped to a 0 (see the mapping status column in the mapping tables).
- If a field has no value mappings, but the value can’t be converted to a float, then it is treated as free-text and populates the value_as_string field.
- There are some cases where instances run higher than 3. These correspond to positions in the death or cancer registry and are handled separately.
- Waist and Hip circumference (fields 48 and 49) are mapped to observation concepts (4172830 4111665, respectively). There does not exist an equivalent target concept for both in the measurement domain (see also issue #176).
Destination Field | Source field | Logic | Comment |
---|---|---|---|
id | Not used | ||
domain_id | Not used, the concept_id | ||
person_id | eid | ||
start_date | date_column | Lookup date_field_id from field_id (step d) | |
start_datetime | date_column | Lookup date_field_id from field_id (step d) | |
end_date | One date provided | ||
end_datetime | One date provided | ||
visit_occurrence_id | TBD | ||
provider_id | Not provided | ||
concept_id | target.event_concept_id | Lookup from field_id and value (step c) | |
source_value | field_id value | If value numeric: field_id If discrete: field_id |value . Truncated to 50 characters. | |
source_concept_id | field_id value | Use field_id to find the UK Biobank vocabulary concept_id | UK Biobank has it is own vocabulary in Athena |
value_as_concept_id | target.value_as_concept_id | Lookup from field_id and value (step c) | If value has a discrete mapping |
value_as_number | target.value_as_number | Lookup from field_id and value (step c) | If value is numeric |
value_as_string | target.value_as_string | Lookup from field_id and value (step c). Truncated to 50 characters. | If value is text |
value_source_value | Not used, field and value are in source_value column | ||
unit_concept_id | target.unit_concept_id | Lookup from field_id and value (step c) | If value is numeric |
unit_source_value | Not used, unit is directly derived from the field_id | ||
type_concept_id | Mapping given in field_id_to_type_concept_id.csv | Depending on the category of the field_id one of: - 32862 Patient filled survey - 32851 Healthcare professional filled survey - 32879 Registry - 32856 Lab - 32817 EHR |
There are 26 more columns in the stem table, each corresponding to columns in the event tables. These are not used and omitted for brevity.
Example
Given te following baseline row:
eid | 31-0.0 | 53-0.0 | 53-1.0 | 46-0.0 | 2443-1.0 |
---|---|---|---|---|---|
123 | 0 | 2010-01-01 | 2020-06-06 | 12.5 | 1 |
Loop through columns:
- Columns with field_id 31 (gender) and 53 (visit date) are marked as ‘IGNORED’ in the mapping files. We skip these for the mapping to stem table. These fields are used for other mappings (gender in the mapping to person).
- Column
46-0.0
(“Hand grip strength, left”, initial visit)- Not empty
field_id
=46,instance
=0, (array
=0)- Lookup mapping for field 46 in mapping tables, we get:
event_concept_id=44805437 value_as_number=12.5 unit_concept_id=9529 value_as_concept_id=None source_value=46
- Lookup the date field associated with this
field_id
. We get: 53. Together with the instance (0), this means we have to retrieve the date from column53-0.0
(=2010-01-01). - Map (see result for ‘Record 1’ in table below)
- Column
2443-1.0
(“Diabetes diagnosed by doctor”, repeat visit)- Not empty
field_id
=2443,instance
=1, (array
=0)- Lookup mapping for field 2443 in mapping tables, we get:
event_concept_id=4188893 value_as_number=None unit_concept_id=None value_as_concept_id=201820 source_value=2443|1
- Lookup the date field associated with this
field_id
. We get: 53. Together with the instance (1), this means we have to retrieve the date from column53-1.0
(=2020-06-06). - Map (see result for ‘Record 2’ in table below)
Destination Field | Record 1 | Record 2 |
---|---|---|
person_id | 123 | 123 |
start_date | 2010-01-01 | 2020-06-06 |
start_datetime | 2010-01-01T00:00:00 | 2020-06-06T00:00:00 |
visit_occurrence_id | ||
concept_id | 44805437 | 4214956 |
source_value | “46” | “2443|1” |
source_concept_id | 35810112 | 35810297 |
value_as_concept_id | 201820 | |
value_as_number | 12.5 | |
value_as_string | ||
value_source_value | ||
unit_concept_id | 9529 | |
type_concept_id | 32879 | 32862 |