A union dataset is a logical combination of multiple warehouse tables exposed to users as a single logical dataset. A typical use case for union datasets is to create a single logical event dataset from a set of event tables in the warehouse, one for each event type. Analytics' union datasets have performance and cost advantages over physically combining these tables in the warehouse.
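Conceptually, a union dataset behaves like a UNION ALL over its member event tables. The following is a minimal sketch in Python with SQLite using made-up table and column names; it illustrates the logical combination only, not the actual SQL Analytics runs:

```python
import sqlite3

# Hypothetical per-event-type warehouse tables; names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events_page_view (user_id TEXT, ts TEXT);
    CREATE TABLE events_purchase  (user_id TEXT, ts TEXT);
    INSERT INTO events_page_view VALUES ('u1', '2024-01-01');
    INSERT INTO events_purchase  VALUES ('u1', '2024-01-02');
""")

# The union dataset is logically equivalent to a UNION ALL over the
# member tables, with an event-type column derived from the source table.
rows = con.execute("""
    SELECT 'page_view' AS event_type, user_id, ts FROM events_page_view
    UNION ALL
    SELECT 'purchase'  AS event_type, user_id, ts FROM events_purchase
    ORDER BY ts
""").fetchall()
print(rows)  # two rows, one per source table
```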
Create a union dataset
- On the navigation pane, go to Data > Datasets > + New Dataset > Union Dataset.
- In the Tables section of the dataset definition page, click + Add Tables, select the warehouse tables to be included for the dataset, and click Confirm.
- Save the dataset to see the columns, primary keys, and other properties. Whenever you change the tables on a union dataset, you must click Save to see the properties.
- The Columns section contains the columns of the union dataset, where each column maps to one or more columns of the tables chosen for the dataset. By default, Analytics attempts to combine and map columns based on matching column names and data types.
- For example, the union dataset is created from two warehouse tables, Table 1 and Table 2. Say Table 1 has a string column called Country, and Table 2 also has a string column called Country. In the resulting union dataset, Analytics creates a string column called Country mapped to the Country columns of Table 1 and Table 2.
- To see the mapping of these columns, click an entry, and the related properties such as column name, description, data type, measure or attribute, default aggregation, data index, status, source tables, mapping type, and value are displayed. You can inspect the properties of individual columns if required.
You can edit column properties directly with one click in the Columns tab without opening individual entries.
The following sections describe the fields you can edit within individual column entries.
Name and description
You can edit the column name and description by clicking the respective fields. To save your changes, click Save to Dataset. You can hover over the description column in the Columns tab to see the column description.
Column type
To change the column type, click the field under Measure or Attribute and select a value from the drop-down list.
Default aggregation
To set the default aggregation, select an aggregator from the drop-down list. Learn more about Aggregators.
Data index and status
Toggle the Data Index setting and the column status (Enabled) on or off.
Mapping type
In the Column Mappings section, you can edit values in Mapping type (Column, Constant, and None) and Value.
Every column is mapped to zero or more columns of the underlying tables. This is called column mapping. The following are some related details:
- If the column should map to a specific table column, it has COLUMN mapping.
- If the column should just be a constant, it has CONSTANT mapping. This type of mapping is equivalent to selecting a constant, such as select 'val' as columnName from table.
- If the column has NONE mapping, nothing is projected from the underlying table for that column.
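As a sketch, the three mapping types could translate to SQL select expressions for one member table roughly as follows. The helper function and all names are hypothetical; this is not the SQL Analytics actually generates:

```python
# Sketch of how the three mapping types could map to SQL select
# expressions for one member table; names are illustrative.
def select_expr(column_name, mapping_type, value=None):
    if mapping_type == "COLUMN":    # project a specific table column
        return f"{value} AS {column_name}"
    if mapping_type == "CONSTANT":  # select a literal for every row
        return f"'{value}' AS {column_name}"
    if mapping_type == "NONE":      # nothing projected from this table
        # Projecting NULL is a common convention; an assumption here.
        return f"NULL AS {column_name}"
    raise ValueError(mapping_type)

print(select_expr("country", "COLUMN", "Country"))   # Country AS country
print(select_expr("source", "CONSTANT", "table_1"))  # 'table_1' AS source
print(select_expr("revenue", "NONE"))                # NULL AS revenue
```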
Overrides
Within this section, you can see the name of the source table, mapping type, and column value. You can also change the mapping type and column values here. To add an override row, click +.
Define primary keys
A primary key is a column or set of columns in a dataset that uniquely identifies each row within the dataset. While configuring a primary key is optional in Optimizely Analytics, it is a highly recommended best practice for ensuring the quality and reliability of your data.
The primary key's main job is to ensure that every row is unique. Without it, you could have duplicate rows that inflate your metrics. For example, if you have two identical rows for the same user, you might incorrectly count them as two separate users, leading to flawed analysis and business decisions. A primary key acts as a safeguard against this.
Before defining a primary key, ensure that the dataset does not include fully duplicated rows; you cannot establish a primary key for a dataset that does.
- Click + Add to add primary keys.
- Click the blank box under Columns Group.
- Select a column to set it as a key in your dataset from the drop-down list.
Health check
The primary key health check is a tool designed to verify that the primary key is functioning correctly, letting you proactively detect errors and build confidence in your data foundation.
Use Check health in the Primary Keys tab to evaluate a given set of columns designated as a primary key to determine how many distinct key values have duplicate rows.
For datasets identified as event streams, the check automatically filters data to the last 60 days. To ensure predictable performance and cost efficiency, the scan is limited to a maximum of 10 million rows.
Three health statuses are possible:
- Healthy – Indicates that all primary key values are unique. It displays the following message: All non-null PK values are unique!
- Warning – Indicates that a small number of duplicate primary key values (between 1 and 99) were found. It displays the following message: Some duplicate PK values found!
- Unhealthy – Indicates that a large number of duplicate primary key values (100 or more) were found. It displays the following message: Many duplicate PK values found!
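The underlying computation can be sketched as counting distinct non-null key values that occur on more than one row. A minimal SQLite example with hypothetical table and column names:

```python
import sqlite3

# Sketch of what a primary-key health check could compute: how many
# distinct key values have more than one row. Names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (event_id TEXT, ts TEXT);
    INSERT INTO events VALUES ('e1', 't1'), ('e2', 't2'), ('e2', 't3');
""")
dup_keys = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT event_id FROM events
        WHERE event_id IS NOT NULL
        GROUP BY event_id
        HAVING COUNT(*) > 1
    )
""").fetchone()[0]
print(dup_keys)  # 1: only 'e2' appears on more than one row
```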
Cohorts and derived columns (optional)
In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.
Define related datasets
The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. You can combine event and attribute data for richer analysis without understanding the underlying data structure. Analytics uses these relationships to determine which datasets to join in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.
- Click + Add Related Dataset.
- From the Dataset drop-down list, select the dataset you want to link to the current source dataset.
- Select the relevant Cardinality for the relationship:
- many to one – Match many rows in the current dataset to one row in the related dataset.
- one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
- one to one – Match one row in the current dataset to one row in the related dataset based on the selected columns.
- Click + and use the drop-down lists to select the columns that must match to establish a successful relationship.
- Click Save.
Health check
The Check health option in the Related Datasets tab evaluates a defined relationship between two datasets. It identifies how many distinct values on the many side of the relationship do not have a corresponding match on the one side. This helps uncover broken links or orphaned records that could impact data integrity.
For datasets identified as event streams, the check automatically filters data to the last 60 days. To ensure predictable performance and cost efficiency, the scan is limited to a maximum of 10 million rows.
Three health statuses are possible:
- Healthy – Indicates <10% of unmatched records, suggesting the relationship is mostly intact. For example, when the status is healthy, it displays the following message: Join null rate: 8% (80 nulls out of 1000 rows).
- Warning – Indicates 10-20% of unmatched records, which could impact data integrity. For example, when the status is a warning, it displays the following message: Join null rate: 15% (150 nulls out of 1000 rows).
- Unhealthy – Indicates >20% of unmatched records, signaling a critical data integrity issue. For example, when the status is unhealthy, it displays the following message: Join null rate: 25% (250 nulls out of 1000 rows).
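The join null rate can be sketched as a LEFT JOIN from the many side to the one side, counting rows with no match. A minimal SQLite example with hypothetical names:

```python
import sqlite3

# Sketch of a join null rate check: the share of rows on the many side
# of the relationship with no match on the one side. Names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id TEXT);
    CREATE TABLE users  (user_id TEXT PRIMARY KEY);
    INSERT INTO users  VALUES ('u1');
    INSERT INTO events VALUES ('u1'), ('u1'), ('u2'), ('u3');
""")
total, nulls = con.execute("""
    SELECT COUNT(*), SUM(u.user_id IS NULL)
    FROM events e LEFT JOIN users u USING (user_id)
""").fetchone()
print(f"Join null rate: {100 * nulls // total}% "
      f"({nulls} nulls out of {total} rows)")
```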
Set sampling keys
A sampling key is exclusive to actor datasets and lets you use sampled data when creating an exploration from the dataset.
To configure sampling:
- Click + Add Sampling Key, select a dataset in the For actor field, and select the key by which events are sampled.
- Configure sampling extracts by selecting the dataset, fast rate, fast suffix, precise rate, and precise suffix.
- Add schema path entry details.
In warehouses, sample extracts can live in a separate schema from the original table, which is useful for organizing warehouse tables. For example, the source table could be springsight.public.users while the extracts live in springsight.extracts.users_fast; in that case, the schema path entry is a list, [springsight, extracts] (DB name, schema name). If this schema is not provided, Analytics tries to find the sample as springsight.public.users_fast.
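Under the naming scheme described above, extract name resolution can be sketched as follows. The helper function is hypothetical; only the example names come from this article:

```python
# Sketch of how an extract table name could be resolved from a source
# table, a rate suffix, and an optional schema path entry.
def extract_table(source, suffix, extract_schema=None):
    db, schema, table = source.split(".")
    if extract_schema:                  # e.g. ["springsight", "extracts"]
        db, schema = extract_schema
    return f"{db}.{schema}.{table}_{suffix}"

# No schema path entry: look next to the source table.
print(extract_table("springsight.public.users", "fast"))
# springsight.public.users_fast

# With a schema path entry: look in the extracts schema.
print(extract_table("springsight.public.users", "fast",
                    ["springsight", "extracts"]))
# springsight.extracts.users_fast
```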
Configure semantics
The semantics tab lets you annotate this dataset as an event stream or an actor.
Actor
Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add properties.
Decision stream
Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:
- Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
- Experiment ID – The experiment ID used by Optimizely. The ID is preferred, but if it is unavailable, you can leave this blank as long as Experiment name is populated.
- Experiment name – The experiment name used by Optimizely.
- Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name instead.
- (Optional) Is holdback – Marks decisions belonging to a holdback group: visitors excluded from the experiment so that you can compare how this group performs on specific metrics against visitors who took part in the experiment.
- Timestamp – The time at which the decision was made.
- Custom partition time column – Select the column and set the time skew before and after the event time. This option optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data.
Event stream
Contains time-series actions performed by users or other actors. This is often behavioral data, such as clicks, page views, or transactions. When you select Event stream, click Add Event Stream and configure event and actor properties.
Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew.
Use the Index Time Range field to apply a time filter when indexing distinct column values and JSON fields, overriding the app-level setting if specified. You can reset this to the default value when required.
The Session Identifier field in the Event Stream Properties section determines how events are grouped for each actor. It is crucial for defining what constitutes a session for a user within your data. The chosen session identifier, such as a user_id or session_id, should be unique and consistent across your tables to ensure accurate event grouping and analysis.
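The grouping the session identifier drives can be sketched in plain Python. The field names are illustrative, not the actual columns in your data:

```python
from itertools import groupby

# Sketch of grouping events per actor by a session identifier;
# field names are illustrative.
events = [
    {"user_id": "u1", "session_id": "s1", "event": "page_view"},
    {"user_id": "u1", "session_id": "s1", "event": "click"},
    {"user_id": "u1", "session_id": "s2", "event": "page_view"},
]

def session_key(e):
    return (e["user_id"], e["session_id"])

# groupby requires the input to be sorted by the grouping key.
sessions = {k: [e["event"] for e in g]
            for k, g in groupby(sorted(events, key=session_key),
                                key=session_key)}
print(sessions)
# {('u1', 's1'): ['page_view', 'click'], ('u1', 's2'): ['page_view']}
```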
You can see the events in the dataset when you click See all events.
A union dataset with an event stream regularly collects event statistics. If any of these collections are still pending, the dataset remains in a pending state. Missing event data can lead to incomplete or inaccurate results, as queries may not reflect the full set of information until all event statistics are collected.
The Refresh button manually triggers statistics collection and removes the pending state.
Slowly changing dimensions (SCD)
SCD lets you handle historical changes in your data more effectively, with minimal configuration. SCD2 datasets require only a many-to-one relationship with the actor dataset. When you configure this, Optimizely Analytics auto-joins any event stream that joins with the actor dataset (for example, Users) with the SCD2 table; you do not have to perform any additional setup.
Model this type with the following required fields:
- Actor Dataset – Commonly Users.
- Start Timestamp Column – Must be a timestamp.
- End Timestamp Column – Must be a timestamp.
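The point-in-time join an SCD2 table enables can be sketched as matching each event to the attribute row whose validity window contains the event timestamp. A minimal SQLite example with hypothetical table and column names:

```python
import sqlite3

# Sketch of an SCD2 point-in-time join: each event matches the
# attribute row valid at the event timestamp. Names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id TEXT, ts INTEGER);
    CREATE TABLE users_scd2 (user_id TEXT, plan TEXT,
                             start_ts INTEGER, end_ts INTEGER);
    INSERT INTO users_scd2 VALUES ('u1', 'free', 0, 10),
                                  ('u1', 'paid', 10, 99);
    INSERT INTO events VALUES ('u1', 5), ('u1', 20);
""")
rows = con.execute("""
    SELECT e.ts, s.plan
    FROM events e JOIN users_scd2 s
      ON e.user_id = s.user_id
     AND e.ts >= s.start_ts AND e.ts < s.end_ts
    ORDER BY e.ts
""").fetchall()
print(rows)  # [(5, 'free'), (20, 'paid')]
```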
Auto-monitoring in union datasets
Auto-monitoring lets you dynamically manage large sets of tables within a union dataset, automatically tracking changes and additions to source schemas. It reduces manual effort and ensures that your union datasets remain current as tables are added to monitored schemas. Auto-monitoring does not add all tables at once, but adds them in batches in the background.
Create an auto-monitored union dataset
You can configure auto-monitoring on one or more schemas that contribute to a union dataset.
- Go to Data > Datasets > + New Dataset > Union Dataset.
- Click Auto Monitoring to select one or more schemas to monitor.
- Choose a schema and toggle Auto-monitor. This initially includes all tables from that schema. To exclude specific tables, add ignore patterns by clicking + in the Ignore Tables section.
- Click Save to save the schema configuration. You must save after configuring each schema to ensure the settings are applied correctly; skipping this may result in an incomplete setup. After saving, auto-monitoring batch-adds the tables, typically processing 100 tables every 10 minutes in the background. The dataset may show a pending state while tables are being added. You can monitor progress from the dataset view.
Update regular union datasets to auto-monitored ones
Converting an existing, regular union dataset to an auto-monitored one requires careful operation.
- Open the union dataset where you want to enable auto-monitoring.
- Click Auto Monitoring, select the schemas you want to monitor, and click Save. When saved, auto-monitoring replaces manually added tables with the selected auto-monitored ones. Ensure the number of tables matches the original to maintain consistency.
- If you have custom column mappings, ensure they are still valid post-migration.
- Auto-monitored tables replace manually added tables.
- This has no impact on dashboards, derived columns, or semantics.
- If this is an event dataset, after auto-monitoring is set up, the system automatically collects event statistics.
Edit your dataset
- Go to Data > Datasets.
- Click the dataset you want to edit.
- Make any necessary changes.
- Click Save.
Duplicate your dataset
- Go to Data > Datasets.
- Click the dataset you want to duplicate.
- Click More > Save As and complete the following:
- Name – Enter a name for your copy.
- (Optional) Description – Enter a description for your copy.
- Destination folder – Select a destination folder for your copy.
- Click Apply.
Delete your dataset
- Go to Data > Datasets.
- Click the dataset you want to delete.
- Click More > Delete.
- Click Delete on the confirmation page.