Derived dataset

  • Updated

You can create a derived dataset by combining existing datasets (source or derived) to create a compound dataset. A fundamental distinction between source and derived datasets is that you define source datasets by mapping tables in the data warehouse. In contrast, derived datasets are based on other datasets within Analytics.

Create a derived dataset

You can create a derived dataset from multiple datasets with custom SQL statements. 

  1. Go to Data > Datasets > + New DatasetDerived Dataset.
  2. Configure the dataset in the dataset definition page.

opti-derdataset-1.gif

Dataset definition

The following are the different sections available for dataset definition:

Query

Insert your SQL query into the SQL box. You can insert dataset names or columns into the box with the auto-complete feature that displays a list of selectable options. Click Run Query to preview your query results.

opti-derdataset-2.gif

Column properties

Inspect the properties of individual columns. The following properties are defined:
  • Name – The display name of the column in the dataset.
  • SQL definition – The SQL expression or underlying column name used to populate the field.
  • Description – The optional text to explain the purpose or contents of the column.
  • Data type – The data format of the column.
  • Measure or attribute – The option to specify whether the column should be treated as a metric for analysis or as an attribute for grouping and filtering.
  • Default aggregation – The default operation applied when the analysis uses the column.
  • Data index – The toggle to control whether Analytics caches the column's distinct values in the warehouse. This reduces the latency with which column value dropdowns are populated. This is true by default for columns with string data types and disabled for others.
  • Enabled – The toggle to control the visibility and usability of the column.
These properties have default values, which makes constructing queries for explorations easier.

opti-derdataset-3.png

 

Primary keys

A primary key consists of one or more columns in a dataset that uniquely identifies each row within the dataset. Configuring a primary key for datasets is optional. To define a primary key, ensure that a dataset within Analytics does not include any fully duplicated rows, as a primary key cannot be established for such a dataset. 

  1. Click + Add to add primary keys.
  2. Click the blank box under Columns Group.
  3. Select a column to set it as a key in your dataset from the drop-down list.

opti-derdataset-4.png

Cohorts and derived columns

In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.

opti-derdataset-5.png

Related datasets

The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. This lets you combine event and attribute data for richer analysis without understanding the underlying data structure. Analytics uses these relationships to determine which datasets can be joined in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.

  1. Click + Add Related Dataset.
  2. Click the Dataset drop-down list and choose the dataset you want to link to the current source dataset.
  3. Select the relevant Cardinality for the relationship. The following options are available:
    • many to one – Match many rows in the current dataset to one row in the related dataset.
    • one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
    • one to one – Establish a relationship by directly matching a selected column in a single row of datasets.
  4. Click + and use the drop-down lists to select the columns that must match to establish a successful relationship, and click Save.

opti-derdataset-6.png

Semantics

Lets you annotate your dataset as an actor, decision stream, event stream, or SCD.

Actor

Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add as many properties as needed.

opti-derdataset-7.png

Decision stream

Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:

  • Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
  • Experiment ID – The experiment ID used by Optimizely. ID is preferred, but if it is not available, it can be left blank if the experiment name used by Optimizely is populated.
  • Experiment name – The experiment name used by Optimizely.
  • Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name, and it works automatically.
  • Is holdback – The definition of a group that must be excluded from all experiments you are performing so that you can compare how this group performs on specific metrics as compared to other visitors who took part in the experiment.
  • Timestamp – The time at which the decision was made. This field is required.
  • Custom partition time column – The option that optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data. Select the column and set the time skew before and after the event time.

opti-derdataset-8.png

Event stream

Contains time-series actions performed by users or other actors. This is often behavioral data, like clicks, page views, or transactions. When you choose Event stream, click Add Event Stream and configure event and actor properties.

Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew. 

You can see all the events in the dataset when you click See all events.

opti-derdataset-9.png

SCD

SCD lets you handle historical changes in your data more effectively, with minimal configuration. Analytics auto-joins SCD2 data with any event stream linked to the actor dataset, no extra relationships are needed. This type is modeled with three required fields. 

  • The actor dataset (commonly Users).
  • The start timestamp column that must be a timestamp.
  • The end timestamp column that must be a timestamp.

opti-derdataset-10.png

Once you save your dataset, you can view and access it on the Dataset page.

Edit your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to edit, make necessary changes, and click Save.
opti-srcdataset-13.gif

Duplicate your dataset

  1. Go to Data > Datasets on the navigation panel.
  2. Click the dataset you want to duplicate.
  3. Click More > Save As and complete the following:
    • Name – Enter a name for your copy.
    • Description – (Optional) Enter a description for your copy.
    • Destination folder – Choose a destination folder for your copy.
  4. Click Apply.
opti-srcdataset-14.gif

Delete your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to delete.
  3. Click More > Delete. On the confirmation page, click Delete
opti-srcdataset-15.gif