NetScript language reference

  • Updated

Netscript is a powerful analytical language that lets you author and manipulate SQL queries. NetScript query expressions, henceforth called quads, represent SQL queries, and operations on these quads transform the SQL queries according to common analytical requirements.

This document introduces NetScript by presenting a few examples that highlight the key capabilities of the language.

Notes about syntax

Before using NetScript, note the following rules about the syntax:

  • Function names are case-insensitive. For example, countDistinct and countdistinct are synonymous.
  • You can omit underscores in function names. For example, extract_week and extractweek are synonymous.
  • Time zone-sensitive functions use Pacific Time (PT) by default, but you can configure the time zone for each Analytics app.
  • Functions and aggregations can use function syntax (round(x)) or pipe syntax (x | round). These are equivalent.

Examples

NetScript quads represent SQL queries that are written for analytical use cases. These use cases typically require aggregating quantities and slicing them by various dimensions. The following notation comes in handy while describing the semantics of quads. For a quad P:

  • P. Query refers to the SQL query represented by P
  • P. Grain refers to the set of dimensions P on which the quad P is aggregated.
  • P. Measures refers to the set of measures computed by a quad. Most quads only have a single measure, but NetScript provides limited support for multi-measure quads. In the SQL analogy, if P.Grain corresponds to group by expressions, then P.Measures corresponds to aggregated expressions.

For example, say there is a simple sample ecommerce-inspired schema that includes the following tables:

  • Orders – Each record in this table is an individual order with foreign keys to Customer and Product.
  • LineItems – Each record in this table is an individual line item within an order, with a foreign key to the Orders table.
  • Customers – Each record in this table is a customer with basic account information such as name, address, and so on.
  • Product – Each record in this table is a product with basic catalog information such as product name, category, etc.

Quad 2

2 is the simplest quad. It represents the constant 2 and corresponds to the following SQL query.

select 2

The measure here is the constant 2, and its grain is the empty set.

Orders dataset

Orders represents rows from the dataset called Orders. You must define the dataset in the system or the quad results in an error. The quad directly references the dataset (and not some column within the dataset), so the SQL query for this quad produces the primary key for each row in Orders.

No aggregation is happening in this query, and it does not fit within the measure or dimension paradigm described previously. To make it fit, pretend that this query is aggregated at the grain of each row in Orders. The pretense is technically valid because that is the finest possible aggregation grain.

select OrderKey as dim0, OrderKey as measure0 from Orders

Orders.Amount quad

This quad references a specific column from the Orders dataset. This maps to the following SQL query:

select OrderKey as dim0, Amount as measure0 from Orders

Amount must be a valid column within Orders. This query is also assumed to be implicitly aggregated at the grain of each row in the Orders dataset.

2 * Orders.Amount

Binary operators like * conceptually represent a join between the sub-queries for the quad 2 and Orders.Amount. You must join the two subqueries on their shared dimensions. In this case, the quad 2 has no dimensions; thus, the join is a cross-product join. The SQL for this query is shown in optimized and unoptimized forms.

with A as select 2 as measure0,
B as select OrderKey as dim0, Amount as measure0 from Orders
select B.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B

// Equivalent optimized form
select OrderKey as dim0, 2 * Amount as measure0 from Orders

Orders.Amount * Orders.Discount

As in the previous example, the * operator represents a join between the two subqueries produced by the two operands. Because both sub-queries are aggregated at the OrderKey grain, this join is trivial, and the sub-queries are merged into a single select statement.

with A as select OrderKey as dim0, Amount as measure0 from Orders,
B as select OrderKey as dim0, Discount as measure0 from Orders
select A.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B
where A.dim0 = B.dim0

// Equivalent optimized form
select OrderKey as dim0, Amount * Discount as measure0 from Orders

Orders.Amount | by (Orders.ShipMode)

Recall that quads are always aggregated at some grain. The by operator takes an aggregated quad and slices it by the supplied dimension. Notice that the input Orders.Amount is already grouped in the finest possible grain OrderKey. Even though further slicing is not possible, the by operation is still valid and its only effect is to add the ShipMode column to the set of dimensions. The by operator is also often referred to as the slice operator, and its behavior in the face of more complicated input and slicer quads is discussed later in the article.

This quad corresponds to the following SQL query:

select OrderKey as dim0, ShipMode as dim1, Amount as measure0 from Orders

Successive slices over the same dimension are a no-op. For any quad, the following identity holds:

P | by (Q)    is equivalent to     P | by (Q) | by (Q)

The slice operation is a no-op if the requested dimension is already present in the grain of the input quad.

Orders.Amount | sum by (Orders.ShipMode)

With this quad, row-level order data is taken and aggregated at the level of the different ShipModes. Note that ShipMode is not present in the input query. So, you must first inject that field and then aggregate over it. It is simpler to think of this as the following equivalent quad:

Orders.Amount | by (Orders.ShipMode) | sum by (Orders.ShipMode)`

This maps to the following SQL query:

with A as select OrderKey as dim0, ShipMode as dim1, Amount as measure0
from Orders
select A.dim1 as dim0, sum(A.measure0) as measure0 from A group by A.dim1

// Equivalent optimized form
select ShipMode as dim0, sum(Amount) as measure0
from Orders group by A.ShipMode

NetScript supports common aggregation functions such as count, min, max, sum, avg, and so on.

Orders.Amount | sum by (timeBucket(Orders.OrderDate))

<timebucket(...)></timebucket(...)> is a built-in function that takes a timestamp-valued quad and buckets it into discrete time buckets. The granularity of that bucketing is not specified in the quad and is bound at run-time. To support that capability, timeBucket has an optional second argument, which defaults to $TIME_GRAIN and controls the granularity at which the aggregation happens. The previous quad is equivalent to the following:

Orders.Amount | sum by (timeBucket(Orders.OrderDate, $TIME_GRAIN))

This flexibility of changing the time granularity is useful in monitoring use cases where visualizing time series data at various grains is a core workflow. If you want to view the data at fine granularity, you could set this parameter to 10 seconds, or you could zoom out to see the big picture by setting it to 1 day. Run-time parameters let quads represent a flexible stack of computations whose meaning can be post-facto adjusted by end users. These adjustments may include changing the grain of a quad, applying filters, and so on.

For the sake of completeness, the previous quad corresponds to the following SQL query:

select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN)

In some scenarios, it may make sense to permanently fix the granularity of certain calculations. You may provide any desired constant value for the optional second argument to the timeBucket(...) function for those use cases.

Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)​

This example demonstrates that you can stack operations on quads on top of each other. This corresponds to the following SQL query:

select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
ShipMode as dim1,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN), ShipMode

As expected, the second by operator further slices the first aggregation by an additional ShipMode dimension. This is achieved by adding that field to the group by clause.

Orders | count​

NetScript also supports dimensionless aggregations where the entire input is aggregated to a single row. In this case, the query computes the total number of orders in the system.

with A as select OrderKey as dim0, OrderKey as measure0 from Orders
select count(A.measure0) as measure0 from A

// Equivalent optimized form
select count(OrderKey) as measure0 from Orders

Variables

You can assign quads to variables, which you can then reference in other parts of a NetScript program. Consider the following example:

amountByTime = Orders.Amount | sum by (timeBucket(Orders.OrderDate));
amountByTime | by (Orders.ShipMode)

This quad is equivalent to the earlier quad.

Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)

NetScript treats such variables as untyped macros and expands them like the C preprocessor.

Quads over multiple datasets

NetScript knows the underlying database schema and can automatically traverse schema joins to generate queries. Consider the following quad:

LineItems.TotalPrice | sum by (Product.Category)

Given its knowledge of the underlying schema, NetScript can generate the following query for this quad:

select Category as dim0, sum(TotalPrice) as measure0
from LineItems, Product where LineItems.ProductId = Product.Id
group by Category

Joins on quads

The simplest way to join NetScript quads is to combine them using binary operators. The simplest case of such a join is where the associated grains of operands are identical. For example:

revenue  = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost

Revenue and cost are computed at the same grain (Orders.ShipMode). This lets you interpret the / operation as a join on this shared ShipMode dimension. For example:

with A as select ShipMode as dim0, sum(Amount) as measure0
from Orders group by ShipMode,
B as select ShipMode as dim0, sum(Cost) as measure0
from Orders group by ShipMode
select A.dim0 as dim0, A.measure0 / B.measure0 as measure0
from A, B where A.dim0 == B.dim0

The join is also well-defined when the set of dimensions on one side is a subset of the dimensions on the other. Consider the following example:

revenue = Orders.Amount | sum by (Orders.Channel, Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost

The / operator represents a join. The left-hand side (LHS) grain is {Channel, ShipMode} while the right-hand side (RHS) grain is {ShipMode}. This quad produces the following SQL, where the join condition is only joining on the shared dimensions:

with A as select Channel as dim0,
ShipMode as dim1,
sum(Orders.Amount) as measure0
from Orders group by Channel, ShipMode
B as select ShipMode as dim0, sum(Orders.Cost) as measure0
from Orders group by ShipMode
select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 == B.dim0

Now consider the following program:

revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue

The intent is to calculate the percentage of revenue each product contributes to its category. The LHS grain is {Product} and the RHS grain is {Product.Category}, and neither side is a subset of the other. One simple solution is to modify the program as follows:

revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue

Because productRevenue is already sliced at the {Product} grain, further slicing it by Product.Category does not change any of the aggregated revenue numbers. You can use the subset rule to generate the appropriate SQL query for this quad. This example presents an important idea of how NetScript joins quads. There is a functional dependency from Product to Product.Category in which a given product has only one category. In database speak, there is a many-to-one relationship from Product to Product.Category. Adding the extra dimension to productRevenue does not change the aggregation granularity of the quad and makes the join possible. NetScript does not require users to manually add these extra dimensions to their quads; it can do it automatically. This means the original quad (without the extra Product.Category dimension) is perfectly well-defined and generates the following SQL query:

with A as select Product.ProductKey as dim0,
Product.Category as dim1,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.ProductKey, Product.Category,
B as select Product.Category as dim0,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.Category select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 = B.dim0

Quads with multiple measures​

NetScript has limited support for multi-measure quads with a feature called dataframes.

revenue = Orders.Amount | sum
cost = Orders.Cost | sum
[revenue, cost]

This quad results in the following SQL query:

with A as select sum(Amount) as measure0 from Orders,
B as select sum(Cost) as measure0 from Orders
select A.measure0 as measure0, B.measure0 as measure1 from A, B

Current NetScript places several limitations on dataframes. For example, dataframes cannot be dimensions of other quads, cannot be aggregated, and so on. But they come in handy for the simple use cases of presenting two columns side by side.

Dataframes also offer the full join sophistication of NetScript joins. Consider the following quad:

revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
[productRevenue, categoryRevenue]

This is very similar to the last example in the previous section, but it presents product and category revenues side by side instead of taking a ratio. The generated SQL query would also have a similar structure, but the top-level select clause would be to select A.measure0 as measure0 and B.measure0 as measure1 instead of selecting A.measure0 / B.measure0 as measure0.

Multiple levels of aggregation

The following quad performs a two-level aggregation to calculate the average revenue across ship modes. It is impossible to express this calculation as a single level of aggregation. You must calculate the total revenue for each ship mode (using sum) and then calculate the average across ship modes.

Orders.Amount | sum by (Orders.ShipMode) | avg

The quad produces the following SQL:

  with A as select ShipMode as dim0, sum(Amount) as measure0 from Orders
group by ShipMode
select average(A.measure0) as measure0 from A

Slice quads with joins​

Consider a quad that is produced as a result of a join that is sliced.

revenue  = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
profit = revenue - cost
profit | by (Orders.ShipMode)

A reasonable interpretation in this case is to slice revenue and cost by Orders.ShipMode and then join the result. That is, you produce the SQL for the following NetScript:

(revenue | by (Orders.ShipMode)) - (cost | by (Orders.ShipMode))

This generalizes to the following identity: P, Q, and R are arbitrary quads, and f is a join operator.

f(P, Q) | by (R) = f(P | by R, Q | by R)

While this is mathematically elegant, it is not true in all cases. Remember that you cannot slide one of the sides by the provided quad R. This can happen if P and Q come from different fact tables, and R only makes sense for one side. Slices over joins will attempt to slice all branches and are well-defined as long as at least one of those slice attempts succeeds.

Orders | count | where (Orders.ShipMode == 'priority')​

You can also filter quads with the where keyword. Quads represent queries and not the data returned by those queries. Consequently, the filter operation does not filter the data that a quad produces. Instead, it mutates the query represented by the quad in a way that makes sense concerning the semantics of the filter. This query produces the following SQL:

select count(Orders) from Orders where ShipMode = 'priority'

This example is simplistic, but NetScript filters can be arbitrary quads (as long as they have a single boolean-typed measure). The query compiler can interpret the quad and produce the right SQL query if the underlying joins are possible.