Source dataset

A source dataset is a logical table from the data warehouse. When you connect to the data warehouse, you can map a database table, database view, or SQL query from the data warehouse to a source dataset within Analytics without physically transferring or moving data to Analytics. You can then use the source dataset to execute analysis or create derived datasets.

Create a source dataset

On the navigation panel, go to Data > Datasets > + New Dataset > Source Dataset.
Expand the schema using the tree view in the Pick a Source section and select the table or view you want to add to the query. To preview the selected data, click Preview in the Source Preview section. Click Confirm to move to the dataset definition.

Dataset definition

The following are the different sections available for dataset definition:

Query

Analytics displays a default SQL query on the definition page based on your data selections, including the connection source and contents. You can modify the suggested SQL command to shape the data as needed. Click Run Query and preview the data to ensure the SQL statement returns the expected information.

Column properties

This section lets you inspect the properties of individual columns. This section defines the following properties:

Name – The display name of the column in the dataset.
SQL definition – The SQL expression or underlying column name used to populate the field.
Description – The optional text to explain the purpose or contents of the column.
Data type – The data format of the column.
Measure or attribute – The option to specify whether the column should be treated as a metric for analysis or as an attribute for grouping and filtering.
Default aggregation – The default operation applied when the analysis uses the column.
Data index – The toggle to control whether Analytics caches the column's distinct values in the warehouse. This reduces the latency with which column value dropdowns are populated. This is true by default for columns with string data types and disabled for others.
Enabled – The toggle to control the visibility and usability of the column.

Primary keys (optional)

A primary key is a column or set of columns in a dataset that uniquely identifies each row within the dataset. While configuring a primary key is optional in Optimizely Analytics, it is a highly recommended best practice for ensuring the quality and reliability of your data.

The primary key's main job is to ensure that every row is unique. Without it, you could have duplicate rows that inflate your metrics. For example, if you have two identical rows for the same user, you might incorrectly count them as two separate users, leading to flawed analysis and business decisions. A primary key acts as a safeguard against this.

To define a primary key, ensure that a dataset within Analytics does not include fully duplicated rows, because you cannot establish a primary key for such a dataset.

Click + Add to add primary keys.
Click the blank box under Columns Group.
Select a column to set it as a key in your dataset from the drop-down list.

Health check

The primary key health check is a tool designed to verify that the primary key is functioning correctly, letting you proactively detect errors and build confidence in your data foundation.

Use Check health in the Primary Keys tab to evaluate a given set of columns designated as a primary key to determine how many distinct key values have duplicate rows.

For datasets identified as event streams, the check automatically filters data to the last 60 days. To ensure predictable performance and cost efficiency, the scan is limited to a maximum of 10 million rows.

There are three different health statuses possible, and they are as follows:

Healthy – Indicates that all primary key values are unique. For example, when all the values are unique, it displays the following message: All non-null PK values are unique!
Warning – Indicates that a small number of duplicate primary key values were found. For example, when the number of duplicate values is between 1 and 99, it displays the following message: Some duplicate PK values found!
Unhealthy – Indicates that more than 100 duplicate primary key values were found. For example, when the number of duplicate values is more than 100, it displays the following message: Many duplicate PK values found!

Cohorts and derived columns (optional)

In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.

Related Datasets

The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. You can combine event and attribute data for richer analysis without understanding the underlying data structure. Analytics uses these relationships to determine which datasets to join in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.

Click + Add Related Dataset.
From the Dataset drop-down list, select the dataset you want to link to the current source dataset.
Select the relevant Cardinality for the relationship:
- many to one – Match many rows in the current dataset to one row in the related dataset.
- one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
- one to one – Establish a relationship by directly matching a selected column in a single row of datasets.
Click + and use the drop-down lists to select the columns that must match to establish a successful relationship.
Click Save.

Health check

The Check health option in the Related Datasets tab evaluates a defined relationship between two datasets. It identifies how many distinct values on the many side of the relationship do not have a corresponding match on the one side. This helps uncover broken links or orphaned records that could impact data integrity.

There are three different health statuses possible, and they are as follows:

Healthy – Indicates <10% of unmatched records, suggesting the relationship is mostly intact. For example, when the status is healthy, it displays the following message: Join null rate: 8% (80 nulls out of 1000 rows).
Warning – Indicates 10-20% of unmatched records, which could impact data integrity. For example, when the status is a warning, it displays the following message: Join null rate: 15% (150 nulls out of 1000 rows).
Unhealthy – Indicates >20% of unmatched records, signaling a critical data integrity issue. For example, when the status is unhealthy, it displays the following message: Join null rate: 25% (250 nulls out of 1000 rows).

The health check option is not available for one-to-one cardinality.

Sampling

Sampling keys are exclusive to an actor dataset and let you use sampled data when creating an exploration using this source dataset.

To configure sampling,

Click + Add Sampling Key, select a dataset in the For actor field, and the key by which events are sampled.
Configure sampling extracts by selecting the dataset, fast rate, fast suffix, precise rate, and precise suffix.
Add schema path entry details.

In warehouses, sample extracts can live in a separate schema from the original table. This is useful in organizing warehouse tables. For example, the source table could be springsight.public.users, but extracts could be springsight.extracts.users_fast.

If this schema is not provided, Analytics tries to find the sample as springsight.public.users_fast in which the extract schema is a list: [springsight, extracts] (DB name, schema name).

Semantics

Semantics let you annotate your dataset as an actor, decision stream, event stream, or SCD.

Actor

Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add properties.

Decision stream

Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:

Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
Experiment ID – The experiment ID used by Optimizely. ID is preferred, but if it is not available, leave it blank if the experiment name used by Optimizely is populated.
Experiment name – The experiment name used by Optimizely.
Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name, and it works automatically.
(Optional) Is holdback – The definition of a group that must be excluded from experiments you are performing to compare how this group performs on specific metrics as compared to other visitors who took part in the experiment.
Timestamp – The time at which the decision was made.
Custom partition time column – Select the column and set the time skew before and after the event time. The option that optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data.

Event stream

Contains time-series actions performed by users or other actors. This is often behavioral data, such as clicks, page views, or transactions. When you select Event stream, click Add Event Stream and configure event and actor properties.

Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew.

Use the Index Time Range field to apply a time filter when indexing distinct column values and JSON fields, overriding the app-level setting if specified. You can reset this to the default value when required.

The Session Identifier field in the Event Stream Properties section determines how events are grouped for each actor. It is crucial for defining what constitutes a session for a user within your data. The chosen session identifier, such as a user_id or session_id, should be unique and consistent across your tables to ensure accurate event grouping and analysis.

You can see the events in the dataset when you click See all events.

Slowly changing dimensions (SCD)

SCD lets you handle historical changes in your data more effectively, with minimal configuration. SCD2 datasets require only a many-to-one relationship with the actor dataset. When you configure this, Optimizely Analytics auto-joins any event stream that joins with the actor dataset (for example, Users) with the SCD2 table; you do not have to perform any additional setup.

Model this type with the following required fields:

Actor Dataset – Commonly Users.
Start Timestamp Column – Must be a timestamp.
End Timestamp Column – Must be a timestamp.

You can view and access your dataset on the Datasets page when you save it.

Active event

Any event the user triggers becomes an active event. For example, a user click is an active event. Analytics lets you define active events by identifying a cohort on any event stream. To define an active event, complete the following:

On the navigation panel, click + > Cohort and choose a type.
Add a new block.
Set a value based on your requirements.
Click Save.

This cohort displays in the active events drop-down list in the Semantics tab of your dataset.

Why do you have to define active events?

You can use active events inside features like funnels when you define active events. While defining the stages in a funnel, Analytics lets you set stages in which only the active events display. In Funnels, you can only see active events if your event stream has an annotation for active events.

Edit your dataset

Go to Data > Datasets.
Click the dataset you want to edit
Make any necessary changes
Click Save.

Duplicate your dataset

Go to Data > Datasets.
Click the dataset you want to duplicate.
Click More > Save As and complete the following:
- Name – Enter a name for your copy.
- (Optional) Description – Enter a description for your copy.
- Destination folder – Select a destination folder for your copy.
Click Apply.

Delete your dataset

Go to Data > Datasets.
Click the dataset you want to delete.
Click More > Delete.
Click Delete on the confirmation page.

Source dataset

Create a source dataset

Dataset definition

Query

Column properties

Primary keys (optional)

Health check

Cohorts and derived columns (optional)

Related Datasets

Health check

Sampling

Semantics

Actor

Decision stream

Event stream

Slowly changing dimensions (SCD)

Active event

Why do you have to define active events?

Edit your dataset

Duplicate your dataset

Delete your dataset

<%= previousTitle %>

<%= nextTitle %>

In this article

<%= heading %>

<% if (!block.description) { %> <%= block.name %> <% } else { %> <%= block.name %> <% } %>

<%= heading %>

<% if (!block.description) { %> <%= parsed.title %> <% } else { %> <%= parsed.title %> <% } %>

User Research

Security Announcements

Still have questions?

Categories

Toggle navigation menu

<%= category.name %>