Source dataset

  • Updated

A source dataset is a logical table from the data warehouse. When you connect to the data warehouse, you can map a database table, database view, or SQL query from the data warehouse to a source dataset within Analytics without physically transferring or moving data to Analytics. You can then utilize the source dataset to execute analysis or create derived datasets.

Create a source dataset

  1. On the navigation panel, go to Data > Datasets  > + New Dataset > Source Dataset.opti-srcdataset-1.gif
  2. Expand the schema using the tree view in the Pick a Source section and select the table or view you want to add to the query. To preview the selected data, click Preview in the Source Preview section. Click Confirm to move to the dataset definition.opti-srcdataset-2.gif

Dataset definition

The following are the different sections available for dataset definition:

Query

Analytics displays a default SQL query on the definition page based on your data selections, including the connection source and contents. You can modify the suggested SQL command to shape the data as needed. Click Run Query and preview the data to ensure the SQL statement returns the expected information.

opti-srcdataset-3.gif

Column properties

This section lets you inspect the properties of individual columns. The following properties are defined:

  • Name – The display name of the column in the dataset.
  • SQL definition – The SQL expression or underlying column name used to populate the field.
  • Description – The optional text to explain the purpose or contents of the column.
  • Data type – The data format of the column.
  • Measure or attribute – The option to specify whether the column should be treated as a metric for analysis or as an attribute for grouping and filtering.
  • Default aggregation – The default operation applied when the analysis uses the column.
  • Data index – The toggle to control whether Analytics caches the column's distinct values in the warehouse. This reduces the latency with which column value dropdowns are populated. This is true by default for columns with string data types and disabled for others.
  • Enabled – The toggle to control the visibility and usability of the column.

opti-srcdataset-4.gif

Primary keys (Optional)

A primary key consists of one or more columns in a dataset that uniquely identifies each row within the dataset. Configuring a primary key for datasets is optional. To define a primary key, ensure that a dataset within Analytics does not include any fully duplicated rows, as a primary key cannot be established for such a dataset. 

  1. Click + Add to add primary keys.
  2. Click the blank box under Columns Group.
  3. Select a column to set it as a key in your dataset from the drop-down list.

opti-srcdataset-5.gif

Cohorts and derived columns (Optional)

In this section, you can define new derived columns and cohorts of rows that represent logical groupings within the respective tabs. This is an optional step. Learn more about cohorts and derived columns.

opti-srcdataset-7.png

Related Datasets

The Related datasets tab lets you define logical relationships between datasets, similar to how tables relate in a data warehouse. This lets you combine event and attribute data for richer analysis without understanding the underlying data structure. Analytics uses these relationships to determine which datasets can be joined in an exploration and only shows relevant options in the UI. For example, using common user ID columns, you can form a relationship between an events dataset and a users dataset.

  1. Click + Add Related Dataset.
  2. Click the Dataset drop-down list and choose the dataset you want to link to the current source dataset.
  3. Select the relevant Cardinality for the relationship. The following options are available:
    • many to one – Match many rows in the current dataset to one row in the related dataset.
    • one to many – Match one row in the current dataset to multiple rows in the related dataset based on the selected columns.
    • one to one – Establish a relationship by directly matching a selected column in a single row of datasets.
  4. Click + and use the drop-down lists to select the columns that must match to establish a successful relationship, and click Save.
opti-srcdataset-6.gif

Sampling

Sampling keys are exclusive to an actor dataset and let you utilize sampled data when creating an exploration using this source dataset.

To configure sampling,

  1. Click + Add Sampling Key, select a dataset in the For actor field, and the key by which events are sampled.
  2. Configure sampling extracts by selecting the dataset, fast rate, fast suffix, precise rate, and precise suffix.
  3. Add schema path entry details. In warehouses, sample extracts can live in a separate schema from the original table. This is useful in the organization of warehouse tables. For example, the source table could be springsight.public.users, but extracts could be springsight.extracts.users_fast. If this schema is not provided, then we would try to find the sample as springsight.public.users_fast. In this case, the extract schema is a list - [springsight, extracts] (DB name, schema name).

opti-srcdataset-8.png

Semantics

Semantics let you annotate your dataset as an actor, decision stream, event stream, or SCD.

Actor

Represents entities that perform actions. When Actor is chosen, click Add Preferred Property and add as many properties as needed.

opti-srcdataset-9.png

Decision stream

Captures decision events such as experiment bucketing. If Decision Stream is chosen, you must configure the following fields:

  • Actor dataset – The actor corresponding to whatever identifier is used for experiment variation decisions.
  • Experiment ID – The experiment ID used by Optimizely. ID is preferred, but if it is not available, it can be left blank if the experiment name used by Optimizely is populated.
  • Experiment name – The experiment name used by Optimizely.
  • Variation ID – The variation ID used by Optimizely. If the variation ID is unavailable, you can use the same picker to select the variation name, and it works automatically.
  • Is holdback – The definition of a group that must be excluded from all experiments you are performing so that you can compare how this group performs on specific metrics as compared to other visitors who took part in the experiment.
  • Timestamp – The time at which the decision was made. This field is required.
  • Custom partition time column – The option that optimizes time-based queries by mapping your event time to a warehouse partition column. When enabled, time filters in queries apply precisely to the event timestamp and loosely to the partition column (with a configurable skew), improving performance without missing relevant data. Select the column and set the time skew before and after the event time.

opti-srcdataset-10.png

Event stream

Contains time-series actions performed by users or other actors. This is often behavioral data, like clicks, page views, or transactions. When you choose Event stream, click Add Event Stream and configure event and actor properties.

Custom partition time column improves query performance by applying exact filters on event time and relaxed filters on a mapped partition column, based on a set time skew. 

You can see all the events in the dataset when you click See all events.

opti-srcdataset-11.png

 

SCD

SCD lets you handle historical changes in your data more effectively, with minimal configuration. Analytics auto-joins SCD2 data with any event stream linked to the actor dataset, no extra relationships are needed. This type is modeled with three required fields. 

  • The actor dataset (commonly Users).
  • The start timestamp column that must be a timestamp.
  • The end timestamp column that must be a timestamp.

opti-srcdataset-16.png 

You can view and access your dataset on the Datasets page when you save it.

opti-srcdataset-12.png

Active event

Any event triggered by the user becomes an active event. For example, a click by the user is an active event. Analytics lets you define active events by identifying a cohort on any event stream. To define an active event,

  1. On the navigation panel, click + > Cohort and choose a type.
  2. Add a new block, set a value based on your requirements, and click Save.
  3. This cohort is displayed in the active events drop-down list in the Semantics tab of your dataset.
opti-srcdataset-activevents.gif

Why do you have to define active events?

You can use active events inside features like funnels when you define active events. While defining the stages in a funnel, Analytics lets you set stages in which only the active events are displayed. In Funnels, you can only see active events if your event stream has an annotation for active events.

Edit your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to edit, make necessary changes, and click Save.
opti-srcdataset-13.gif

Duplicate your dataset

  1. Go to Data > Datasets on the navigation panel.
  2. Click the dataset you want to duplicate.
  3. Click More > Save As and complete the following:
    • Name – Enter a name for your copy.
    • Description – (Optional) Enter a description for your copy.
    • Destination folder – Choose a destination folder for your copy.
  4. Click Apply.
opti-srcdataset-14.gif

Delete your dataset

  1. Go to Data > Datasets.
  2. Click the dataset you want to delete.
  3. Click More > Delete. On the confirmation page, click Delete
opti-srcdataset-15.gif