You can combine existing datasets (source or derived) to create a compound, derived dataset. A fundamental distinction between source and derived datasets is that you define source datasets by mapping tables in the data warehouse, whereas derived datasets are based on other datasets within Analytics.
Create a derived dataset
You can create a derived dataset from multiple datasets with custom SQL statements.
- Go to Data > Datasets > + New Dataset > Derived Dataset.
- Configure the dataset in the dataset definition page.
Dataset definition
The dataset definition page includes the following sections:
Query
Insert your SQL query into the SQL box. You can insert dataset names or columns into the box with the auto-complete feature that displays a list of selectable options. Click Run Query to preview your query results.
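For example, a derived dataset might join an events dataset to a users dataset and aggregate per user. The following sketch uses SQLite as a stand-in for the warehouse to illustrate the kind of statement you might enter in the SQL box; all table and column names (`users`, `orders`, `total_spend`) are hypothetical.

```python
import sqlite3

# Hypothetical source datasets; table and column names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (user_id INTEGER, country TEXT);
CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
INSERT INTO users  VALUES (1, 'US'), (2, 'DE');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 10.0);
""")

# The kind of statement you might paste into the SQL box: one row per user,
# combining an attribute (country) with aggregated event data (total spend).
rows = con.execute("""
SELECT u.user_id, u.country, SUM(o.amount) AS total_spend
FROM orders o
JOIN users u ON u.user_id = o.user_id
GROUP BY u.user_id, u.country
ORDER BY u.user_id
""").fetchall()
print(rows)  # [(1, 'US', 65.0), (2, 'DE', 10.0)]
```

Run Query then shows the result of this statement, one row per group, so you can confirm the shape of the derived dataset before saving it.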
Column properties
- Name – The display name of the column in the dataset.
- SQL definition – The SQL expression or underlying column name used to populate the field.
- Description – The optional text to explain the purpose or contents of the column.
- Data type – The data format of the column.
- Measure or attribute – The option to specify whether the column should be treated as a metric for analysis or as an attribute for grouping and filtering.
- Default aggregation – The default operation applied when the analysis uses the column.
- Data index – The toggle to control whether Analytics caches the column's distinct values in the warehouse, which reduces the latency with which column value dropdowns are populated. This is enabled by default for columns with string data types and disabled for others.
- Enabled – The toggle to control the visibility and usability of the column.
Primary keys
A primary key is a column or set of columns in a dataset that uniquely identifies each row within the dataset. While configuring a primary key is optional in Optimizely Analytics, it is a highly recommended best practice for ensuring the quality and reliability of your data.
The primary key's main job is to ensure that every row is unique. Without it, you could have duplicate rows that inflate your metrics. For example, if you have two identical rows for the same user, you might incorrectly count them as two separate users, leading to flawed analysis and business decisions. A primary key acts as a safeguard against this.
Before defining a primary key, ensure the dataset does not contain fully duplicated rows; you cannot establish a primary key for a dataset that does.
- Click + Add to add primary keys.
- Click the blank box under Columns Group.
- Select a column to set it as a key in your dataset from the drop-down list.
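Because a primary key cannot be established on a dataset with fully duplicated rows, it can help to check for such rows first. A minimal sketch of that check, using SQLite as a stand-in and a hypothetical `events` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (user_id INTEGER, event TEXT);
INSERT INTO events VALUES (1, 'click'), (1, 'click'), (2, 'view');
""")

# Count the extra copies of rows that are identical across every column;
# if any exist, no combination of columns can serve as a primary key.
dup_rows = con.execute("""
SELECT COALESCE(SUM(n - 1), 0) FROM (
    SELECT COUNT(*) AS n FROM events GROUP BY user_id, event
) WHERE n > 1
""").fetchone()[0]
print(dup_rows)  # 1 extra copy of the fully duplicated (1, 'click') row
```

If this count is greater than zero, deduplicate the rows (for example, in the derived dataset's SQL) before selecting key columns.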
Health check
The primary key health check is a tool designed to verify that the primary key is functioning correctly, letting you proactively detect errors and build confidence in your data foundation.
Use Check health in the Primary Keys tab to evaluate a given set of columns designated as a primary key to determine how many distinct key values have duplicate rows.
For datasets identified as event streams, the check automatically filters data to the last 60 days. To ensure predictable performance and cost efficiency, the scan is limited to a maximum of 10 million rows.
Three health statuses are possible:
- Healthy – Indicates that all primary key values are unique. For example, when all the values are unique, it displays the following message: All non-null PK values are unique!
- Warning – Indicates that a small number of duplicate primary key values were found. For example, when the number of duplicate values is between 1 and 99, it displays the following message: Some duplicate PK values found!
- Unhealthy – Indicates that 100 or more duplicate primary key values were found. For example, when the number of duplicate values is 100 or more, it displays the following message: Many duplicate PK values found!
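The quantity the check reports is the number of distinct key values that appear on more than one row. A sketch of that logic, using SQLite as a stand-in and a hypothetical `orders` table with `order_id` as the designated key; the status thresholds mirror the ones listed above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 20.0), (2, 25.0), (3, 5.0);
""")

# Number of distinct primary-key values that appear on more than one row.
dup_keys = con.execute("""
SELECT COUNT(*) FROM (
    SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
)
""").fetchone()[0]

# Map the count onto the documented statuses.
if dup_keys == 0:
    status = "Healthy"
elif dup_keys < 100:
    status = "Warning"
else:
    status = "Unhealthy"
print(dup_keys, status)  # 1 Warning
```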
Cohorts and derived columns
In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.
Related datasets
The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. You can combine event and attribute data for richer analysis without understanding the underlying data structure. Analytics uses these relationships to determine which datasets to join in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.
- Click + Add Related Dataset.
- From the Dataset drop-down list, select the dataset you want to link to the current source dataset.
- Select the relevant Cardinality for the relationship:
- many to one – Match many rows in the current dataset to one row in the related dataset.
- one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
- one to one – Match one row in the current dataset to one row in the related dataset based on the selected columns.
- Click + and use the drop-down lists to select the columns that must match to establish a successful relationship.
- Click Save.
Health check
The Check health option in the Related Datasets tab evaluates a defined relationship between two datasets. It identifies how many distinct values on the many side of the relationship do not have a corresponding match on the one side. This helps uncover broken links or orphaned records that could impact data integrity.
For datasets identified as event streams, the check automatically filters data to the last 60 days. To ensure predictable performance and cost efficiency, the scan is limited to a maximum of 10 million rows.
Three health statuses are possible:
- Healthy – Indicates that fewer than 10% of records are unmatched, suggesting the relationship is mostly intact. For example, when the status is healthy, it displays the following message: Join null rate: 8% (80 nulls out of 1000 rows).
- Warning – Indicates that 10-20% of records are unmatched, which could impact data integrity. For example, when the status is a warning, it displays the following message: Join null rate: 15% (150 nulls out of 1000 rows).
- Unhealthy – Indicates that more than 20% of records are unmatched, signaling a critical data integrity issue. For example, when the status is unhealthy, it displays the following message: Join null rate: 25% (250 nulls out of 1000 rows).
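The join null rate is the share of rows on the many side whose key finds no match on the one side. A sketch of that computation, using SQLite as a stand-in with hypothetical `events` (many side) and `users` (one side) tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY);
CREATE TABLE events (event_id INTEGER, user_id INTEGER);
INSERT INTO users  VALUES (1), (2);
INSERT INTO events VALUES (10, 1), (11, 1), (12, 2), (13, 99);  -- 99 is orphaned
""")

# LEFT JOIN from the many side; unmatched rows surface as NULLs on the one side.
total, nulls = con.execute("""
SELECT COUNT(*), SUM(CASE WHEN u.user_id IS NULL THEN 1 ELSE 0 END)
FROM events e LEFT JOIN users u ON u.user_id = e.user_id
""").fetchone()

null_rate = nulls / total
print(f"Join null rate: {null_rate:.0%} ({nulls} nulls out of {total} rows)")
```

Here the orphaned `user_id = 99` event gives a 25% null rate, which the thresholds above would classify as Unhealthy.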
Semantics
Lets you annotate your dataset as an actor, decision stream, event stream, or slowly changing dimension (SCD).
Actor
Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add properties.
Decision stream
Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:
- Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
- Experiment ID – The experiment ID used by Optimizely. The ID is preferred; if it is unavailable, you can leave this blank as long as the experiment name used by Optimizely is populated.
- Experiment name – The experiment name used by Optimizely.
- Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name instead.
- (Optional) Is holdback – Identifies a holdback group that is excluded from the experiment, so you can compare how this group performs on specific metrics against visitors who took part in the experiment.
- Timestamp – The time at which the decision was made.
- Custom partition time column – Select the column and set the time skew before and after the event time. This option optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data.
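To make the partition skew concrete, the sketch below builds the kind of predicate such a query might use: an exact filter on the event timestamp plus a loose filter on the partition column, widened by the skew on each side. The column names (`event_time`, `partition_date`) and the one-day skew are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical settings: a one-day window on event time, and a mapped
# partition column with a one-day skew on each side.
start = datetime(2024, 6, 10)
end = datetime(2024, 6, 11)
skew = timedelta(days=1)

# Exact filter on the event timestamp; loose filter on the partition column.
query = (
    "SELECT * FROM events "
    f"WHERE event_time >= '{start:%Y-%m-%d}' AND event_time < '{end:%Y-%m-%d}' "
    f"AND partition_date >= '{start - skew:%Y-%m-%d}' "
    f"AND partition_date < '{end + skew:%Y-%m-%d}'"
)
print(query)
```

The partition filter prunes whole partitions cheaply (scanning 2024-06-09 through 2024-06-11 here), while the event-time filter keeps the results exact.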
Event stream
Contains time-series actions performed by users or other actors. This is often behavioral data, such as clicks, page views, or transactions. When you select Event stream, click Add Event Stream and configure event and actor properties.
Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew.
Use the Index Time Range field to apply a time filter when indexing distinct column values and JSON fields, overriding the app-level setting if specified. You can reset this to the default value when required.
The Session Identifier field in the Event Stream Properties section determines how events are grouped for each actor. It is crucial for defining what constitutes a session for a user within your data. The chosen session identifier, such as a user_id or session_id, should be unique and consistent across your tables to ensure accurate event grouping and analysis.
You can see the events in the dataset when you click See all events.
Slowly changing dimensions (SCD)
SCD lets you handle historical changes in your data more effectively, with minimal configuration. SCD2 datasets require only a many-to-one relationship with the actor dataset. When you configure this, Optimizely Analytics auto-joins any event stream that joins with the actor dataset (for example, Users) with the SCD2 table; you do not have to perform any additional setup.
Model this type with the following required fields:
- Actor Dataset – Commonly Users.
- Start Timestamp Column – Must be a timestamp.
- End Timestamp Column – Must be a timestamp.
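The auto-join described above matches each event to the dimension row that was current at the event's timestamp, using the start and end timestamp columns. A sketch of that point-in-time join, using SQLite as a stand-in with hypothetical `user_plans` (SCD2) and `events` tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical SCD2 table tracking each user's plan over time.
CREATE TABLE user_plans (user_id INTEGER, plan TEXT,
                         start_ts TEXT, end_ts TEXT);
INSERT INTO user_plans VALUES
    (1, 'free', '2024-01-01', '2024-03-01'),
    (1, 'pro',  '2024-03-01', '9999-12-31');
CREATE TABLE events (user_id INTEGER, event_ts TEXT);
INSERT INTO events VALUES (1, '2024-02-15'), (1, '2024-04-01');
""")

# Each event joins to the dimension row whose validity window contains
# the event's timestamp.
rows = con.execute("""
SELECT e.event_ts, p.plan
FROM events e
JOIN user_plans p
  ON p.user_id = e.user_id
 AND e.event_ts >= p.start_ts AND e.event_ts < p.end_ts
ORDER BY e.event_ts
""").fetchall()
print(rows)  # [('2024-02-15', 'free'), ('2024-04-01', 'pro')]
```

The February event resolves to the plan in effect at that time ('free'), not the user's current plan, which is what makes SCD2 useful for historical analysis.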
Once you save your dataset, you can view and access it on the Dataset page.
Edit your dataset
- Go to Data > Datasets.
- Click the dataset you want to edit.
- Make any necessary changes.
- Click Save.
Duplicate your dataset
- Go to Data > Datasets.
- Click the dataset you want to duplicate.
- Click More > Save As and complete the following:
- Name – Enter a name for your copy.
- (Optional) Description – Enter a description for your copy.
- Destination folder – Select a destination folder for your copy.
- Click Apply.
Delete your dataset
- Go to Data > Datasets.
- Click the dataset you want to delete.
- Click More > Delete.
- Click Delete on the confirmation page.