Union dataset

A union dataset is a logical combination of multiple warehouse tables exposed to users as a single logical dataset. A typical use case for union datasets is to create a single logical event dataset from a set of event tables in the warehouse, one for each event type. Analytics' union datasets have performance and cost advantages over physically combining these tables in the warehouse.

A source dataset in Analytics maps to only one warehouse table, while a union dataset maps to multiple warehouse tables.
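
Conceptually, querying a union dataset is like querying a UNION ALL of the underlying tables. The following SQL sketch illustrates the idea for two hypothetical event tables, analytics.events_page_view and analytics.events_click; it shows the concept only, not the query Analytics generates.

    SELECT user_id, event_time, 'page_view' AS event_type
    FROM analytics.events_page_view
    UNION ALL
    SELECT user_id, event_time, 'click' AS event_type
    FROM analytics.events_click;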

Create a union dataset

  1. On the navigation panel, go to Data > Datasets > + New Dataset > Union Dataset.
  2. In the Tables section of the dataset definition page, click + Add Tables, select the warehouse tables to include in the dataset, and click Confirm.
  3. Save the dataset to see the columns, primary keys, and other properties. Whenever you change the tables of a union dataset, you must click Save again to refresh these properties.
opti-unidataset-1.gif
  4. The Columns section contains the columns of the union dataset, where each column maps to one or more columns of the tables chosen for the dataset. By default, Analytics attempts to combine and map columns based on matching column names and data types.
    • For example, say a union dataset is created from two warehouse tables, Table 1 and Table 2, and each has a string column called "Country." In the resulting union dataset, Analytics creates a single string column called "Country" mapped to the "Country" columns of both tables.
    • To see how a column is mapped, click its entry; related properties such as column name, description, data type, measure or attribute, default aggregation, data index, status, source tables, mapping type, and value are displayed.
    • You can edit column properties directly with one click in the Columns tab without opening individual entries.
opti-unidataset-2.gif

The following sections describe the fields you can edit for an individual column.

Name and description

You can edit the column name and description by clicking the respective fields. To save your changes, click Save to Dataset. Hover over the description column in the Columns tab to see a column's description.

opti-unidataset-3.gif

Column Type

To change the column type, click the field under Measure or Attribute and select a value from the drop-down list.

opti-unidataset-4.png

Default Aggregation

To set the default aggregation, click the drop-down list and choose an aggregator. Learn more about Aggregators.

opti-unidataset-5.png

Data Index and Status

Data index and status can be enabled or disabled using their respective toggles.

opti-unidataset-6.png

Mapping Type

In the Column Mappings section, you can edit default values like mapping type and column value. The following are available mapping types: Column, Constant, and None.

Every column is mapped to zero or more columns of the underlying tables. This is called column mapping. The following are some related details:

  • If the column should map to a specific table column, it has COLUMN mapping.
  • If the column should just be a constant, it has CONSTANT mapping. This type of mapping is equivalent to selecting a constant, like select 'val' as columnName from table.
  • If the column has NONE mapping, nothing is projected from the underlying table for that column.
opti-unidataset-7.gif

These properties have default values assigned to them. This makes constructing queries for explorations and visualizations downstream in the application easier.
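
To illustrate, the following sketch shows how the three mapping types might appear in the projection from one underlying table, using hypothetical table and column names; the SQL that Analytics actually generates may differ (in particular, treating a NONE mapping as a NULL projection is only one possible interpretation).

    SELECT
      country               AS "Country",    -- COLUMN mapping: reads the table's own column
      'web'                 AS "Channel",    -- CONSTANT mapping: the same fixed value for every row
      CAST(NULL AS VARCHAR) AS "Referrer"    -- NONE mapping: nothing is projected from this table
    FROM analytics.events_page_view;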

Overrides

Within this section, you can see the source table name, mapping type, and column value, and you can change the mapping type and column value. To add a new override, click +; a new row is added.

opti-unidataset-8.gif

Define primary keys

A primary key consists of one or more columns in a dataset that uniquely identify each row. Configuring a primary key is optional. Before you define one, ensure that the dataset does not include any fully duplicated rows; a primary key cannot be established for such a dataset.
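
Before configuring a key, you can check in the warehouse whether the columns you plan to use uniquely identify each row. A minimal sketch, assuming a hypothetical table analytics.events_click keyed by user_id and event_time:

    SELECT user_id, event_time, COUNT(*) AS row_count
    FROM analytics.events_click
    GROUP BY user_id, event_time
    HAVING COUNT(*) > 1;   -- any result means these columns do not uniquely identify rows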

  1. Click + Add to add primary keys.
  2. Click the blank box under Columns Group.
  3. Select a column to set it as a key in your dataset from the drop-down list.
opti-unidataset-9.gif
Cohorts and derived columns (optional)

In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.

opti-unidataset-10.png
Define related datasets

The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. This lets you combine event and attribute data for richer analysis without needing to understand the underlying data structure. Analytics uses these relationships to determine which datasets can be joined in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.

  1. Click + Add Related Dataset.
  2. Click the Dataset drop-down list and choose the dataset you want to link to the current source dataset.
  3. Select the relevant Cardinality for the relationship. The following options are available:
    • many to one – Match many rows in the current dataset to one row in the related dataset.
    • one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
    • one to one – Match one row in the current dataset to one row in the related dataset based on the selected columns.
  4. Click + and use the drop-down lists to select the columns that must match to establish a successful relationship, and click Save.
opti-unidataset-11.gif
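
For example, a many-to-one relationship from an events dataset to a users dataset on a common user ID column corresponds, conceptually, to a join like the following sketch (hypothetical table and column names; not the SQL Analytics generates):

    SELECT e.event_time, e.event_type, u.country
    FROM analytics.events AS e
    JOIN analytics.users  AS u
      ON e.user_id = u.user_id;   -- many event rows can match one user row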

Set sampling keys

A sampling key is exclusive to an actor dataset and lets you use sampled data when creating an exploration that uses this dataset.

To configure sampling:

  1. Click + Add Sampling Key, select a dataset in the For actor field, and select the key by which events are sampled.
  2. Configure sampling extracts by selecting the dataset, fast rate, fast suffix, precise rate, and precise suffix.
  3. Add schema path entry details. In warehouses, sample extracts can live in a separate schema from the original table, which helps keep warehouse tables organized. For example, the source table could be springsight.public.users, while its extracts live at springsight.extracts.users_fast. If this schema is not provided, Analytics tries to find the sample at springsight.public.users_fast. In this case, the extract schema is a list, [springsight, extracts] (database name, schema name). See the sketch after these steps for how such an extract might be produced.
opti-unidataset-11.png
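
As a rough sketch of how a fast extract could be produced in the warehouse (this is only an illustration, not Analytics' own mechanism), a 1% fast extract sampled deterministically on a hypothetical user_id sampling key might look like this:

    -- HASH() is warehouse-specific; substitute your warehouse's hashing function.
    CREATE TABLE springsight.extracts.users_fast AS
    SELECT *
    FROM springsight.public.users
    WHERE MOD(ABS(HASH(user_id)), 100) = 0;   -- keeps roughly 1% of actors (fast rate 0.01)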

Configure semantics

The Semantics tab lets you annotate this dataset as an actor, a decision stream, an event stream, or an SCD dataset.

Actor

Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add as many properties as needed.

opti-unidataset-12.png

Decision stream

Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:

  • Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
  • Experiment ID – The experiment ID used by Optimizely. The ID is preferred, but if it is unavailable, you can leave this blank as long as the experiment name is populated.
  • Experiment name – The experiment name used by Optimizely.
  • Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name instead.
  • Is holdback – Defines a group that is excluded from all experiments so that you can compare how this group performs on specific metrics against visitors who took part in the experiment.
  • Timestamp – The time at which the decision was made. This field is required.
  • Custom partition time column – The option that optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data. Select the column and set the time skew before and after the event time.
opti-unidataset-13.png
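
Taken together, these fields describe the shape of a decision table in the warehouse. A minimal sketch of what such a table could contain, using hypothetical column and table names:

    SELECT
      decision_time,     -- Timestamp of the decision (required)
      user_id,           -- identifier matching the actor dataset
      experiment_id,     -- Experiment ID (preferred; can be blank if the name is populated)
      experiment_name,   -- Experiment name
      variation_id,      -- Variation ID (or the variation name)
      is_holdback        -- TRUE for visitors held out of the experiment
    FROM analytics.experiment_decisions;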

Event stream

Contains time-series actions performed by users or other actors. This is often behavioral data, like clicks, page views, or transactions. When you choose Event stream, click Add Event Stream and configure event and actor properties.

Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew. 
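
For example, if the event time column is event_time and the mapped partition column is event_date with a one-day skew, a query filtered to the first week of June conceptually behaves like the following sketch (hypothetical column names; the generated SQL may differ):

    SELECT *
    FROM analytics.events
    WHERE event_time >= TIMESTAMP '2024-06-01'   -- exact filter on the event time
      AND event_time <  TIMESTAMP '2024-06-08'
      AND event_date  >= DATE '2024-05-31'       -- relaxed filter on the partition column,
      AND event_date  <= DATE '2024-06-08';      -- widened by the one-day skew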

You can see all the events in the dataset when you click See all events.

opti-unidataset-14.png

SCD

SCD (slowly changing dimension) support lets you handle historical changes in your data with minimal configuration. Analytics automatically joins SCD type 2 (SCD2) data with any event stream linked to the actor dataset; no extra relationships are needed. This type is modeled with three required fields:

  • The actor dataset (commonly Users).
  • The start timestamp column, which must be a timestamp.
  • The end timestamp column, which must be a timestamp.

opti-unidataset-15.png
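
Conceptually, the auto-join matches each event to the SCD2 row that was valid at the event's timestamp. A sketch with hypothetical table and column names (treating a NULL end timestamp as the current row is a common SCD2 convention, not something specified here):

    SELECT e.event_time, e.event_type, s.plan_tier
    FROM analytics.events       AS e
    JOIN analytics.user_history AS s
      ON  e.user_id = s.user_id
      AND e.event_time >= s.valid_from                          -- start timestamp column
      AND (e.event_time < s.valid_to OR s.valid_to IS NULL);    -- end timestamp column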

Edit your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to edit, make necessary changes, and click Save.
opti-srcdataset-13.gif

Duplicate your dataset

  1. Go to Data > Datasets on the navigation panel.
  2. Click the dataset you want to duplicate.
  3. Click More > Save As and complete the following:
    • Name – Enter a name for your copy.
    • Description – (Optional) Enter a description for your copy.
    • Destination folder – Choose a destination folder for your copy.
  4. Click Apply.
opti-srcdataset-14.gif

Delete your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to delete.
  3. Click More > Delete. On the confirmation page, click Delete.
opti-srcdataset-15.gif