NetScript Language Reference

  • Updated

Summary

Netscript is a powerful analytical language that allows users to author and manipulate SQL queries in an easy and natural fashion. NetScript query expressions, henceforth called quads, represent SQL queries and operations on these quads transform the SQL queries according to common analytical requirements.

This document will introduce NetScript by presenting a few examples that highlight the key capabilities of the language.

Notes About Syntax

Before using NetScript, please note the following rules about NetScript syntax.

  • Function names are case-insensitive. For example, countDistinct and countdistinct are synonymous.
  • Underscores in function names can be omitted. For example, extract_week and extractweek are synonymous.
  • Timezone-sensitive functions use PT by default, but you can configure the timezone for each Analytics app.
  • Functions and aggregations can use function syntax (round(x)) or pipe syntax (x | round). These are equivalent.

Motivating Examples

NetScript quads represent SQL queries that are written for analytical use cases. These use cases typically require aggregating quantities and slicing them by various dimensions. The following notation will come in handy while describing the semantics of quads. For a quad P:

  • P.Query refers to the SQL query represented by P
  • P.Grain refers to the set of dimensions  P on which the quad P is aggregated.
  • P.Measures refers to the set of measures computed by a quad. Most quads only have a single measure, but NetScript provides some limited support for multi-measure quads. In the SQL analogy, if P.Grain corresponds to group by expressions, then P.Measures corresponds to aggregated expressions

For these examples, we will use a simple sample e-commerce inspired schema that includes the following tables:

  • Orders - Each record in this table is an individual order. Has foreign keys to Customer and Product.
  • LineItems - Each record in this table is an individual line item within an order with a foreign key to the Orders table.
  • Customers - Each record in this table is a customer with basic account information such as name, address etc.
  • Product - Each record in this table is a product with basic catalog information such as product name, category etc.

2

2 is the simplest quad. It represents the constant 2 and corresponds to the following SQL query.

select 2
The measure here is the constant 2 and its grain is the empty set.
Orders​

Orders represents rows from the dataset called Orders. The dataset must be defined in the system or the quad will result in an error. Since the quad directly references the dataset (and not some column within the dataset), the SQL query for this quad will produce the primary key for each row in Orders.

There is no aggregation happening in this query and it does not quite fit within the measure / dimension paradigm described above. In order to make it fit, we pretend that this query is aggregated at the grain of each row in Orders. Since that is the finest possible aggregation grain, our pretense is technically valid.

select OrderKey as dim0, OrderKey as measure0 from Orders

Orders.Amount

This quad references a specific column from the Orders dataset. As is hopefully intuitive, this will map to the following SQL query.

select OrderKey as dim0, Amount as measure0 from Orders
Clearly, Amount must be a valid column within Orders. This query is also assumed to be implicitly aggregated at the grain of each row in the Orders dataset.

2 * Orders.Amount

Binary operators like * conceptually represent a join between the subqueries for the quad 2 and Orders.Amount. In later examples, we will discuss the semantics of such joins in detail. Loosely speaking, the two subqueries must be joined on their shared dimensions. In this case, the quad 2 has no dimensions and thus the join is simply a cross-product join. We show the SQL for this quad in both optimized and unoptimized form.

with A as select 2 as measure0,
B as select OrderKey as dim0, Amount as measure0 from Orders
select B.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B

// Equivalent optimized form
select OrderKey as dim0, 2 * Amount as measure0 from Orders

Orders.Amount * Orders.Discount

As in the previous example, the * operator represents a join between the two subqueries produced by the two operands. Since both sub-queries are aggregated at the. OrderKey grain, this join is trivial and we simply merge them into a single select statement.

with A as select OrderKey as dim0, Amount as measure0 from Orders,
B as select OrderKey as dim0, Discount as measure0 from Orders
select A.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B
where A.dim0 = B.dim0

// Equivalent optimized form
select OrderKey as dim0, Amount * Discount as measure0 from Orders

Orders.Amount | by (Orders.ShipMode)

Recall that quads are always aggregated at some grain. The by operator will take an aggregated quad and slice it by the supplied dimension. Notice that the input Orders.Amount is already grouped at the finest possible grain OrderKey. Even though further slicing is not possible, the by operation is still valid and its only effect is to add an additional ShipMode column to the set of dimensions. The by` operator is also often referred to as the slice operator and its behavior in the face of more complicated input and slicer quads is discussed later in the document.

This quad corresponds to the following SQL query.

select OrderKey as dim0, ShipMode as dim1, Amount as measure0 from Orders
As might be obvious, successive slices over the same dimension is a no-op. For any quad, the following identity holds true:
P | by (Q)    is equivalent to     P | by (Q) | by (Q)
The slice operation is a no-op if the requested dimension is already present in the grain of the input quad.

Orders.Amount | sum by (Orders.ShipMode)​

With this quad, we are taking row level order data and aggregating them at the level of the different ShipModes. Note that ShipMode is not present in the input query. So we must first inject that field and then aggregate over it. It is simpler to think of this as the following equivalent quad:

Orders.Amount | by (Orders.ShipMode) | sum by (Orders.ShipMode)`
This will map to the following SQL query.
with A as select OrderKey as dim0, ShipMode as dim1, Amount as measure0
from Orders
select A.dim1 as dim0, sum(A.measure0) as measure0 from A group by A.dim1

// Equivalent optimized form
select ShipMode as dim0, sum(Amount) as measure0
from Orders group by A.ShipMode
NetScript supports several common aggregation functions such as count, min, max, sum, avg etc.

Orders.Amount | sum by (timeBucket(Orders.OrderDate))​

timeBucket(...) is a built-in function that takes a timestamp valued quad and bucketizes that into discrete time buckets. The granularity of that bucketing is not specified in the quad and is bound at "run-time". To support that capability, timeBucket has an optional second argument which defaults to $TIME_GRAIN and controls the granularity at which the aggregation happens. The above quad is therefore equivalent to the following:
Orders.Amount | sum by (timeBucket(Orders.OrderDate, $TIME_GRAIN))
This flexibility of changing the time granularity is useful in monitoring use cases where visualizing time series data at various grains is a core workflow. If one wants to view the data at fine granularity, this parameter could be set to 10 seconds, or one could zoom out to see the big picture by setting it to 1 day. We will later see that $TIME_GRAIN is an instance of a special NetScript concept called a parameter, whose value is externally bound at the time the NetScript quad is evaluated. Run-time parameters allow quads to represent a flexible stack of computations whose meaning can be post-facto adjusted by end users. These adjustments may including changing the grain of a quad, applying filters etc.

For the sake of completeness, the above quad corresponds to the following SQL query.

select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN)
In some scenarios, it may make sense to permanently fix the granularity of certain calculations. Users may provide any desired constant value for the optional second argument to the timeBucket(...) function for those use cases.

Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)​

This example demonstrates that operations on quads can be stacked on top of each other. This corresponds to the following SQL query.
select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
ShipMode as dim1,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN), ShipMode
As expected, the second by operator further slices the first aggregation by an additional ShipMode dimension. This is achieved by adding that field to the group by clause.

Orders | count​

NetScript also support dimension-less aggregations where the entire input is aggregated down to a single row. In this case, the query computes the total number of orders in the system.
with A as select OrderKey as dim0, OrderKey as measure0 from Orders
select count(A.measure0) as measure0 from A

// Equivalent optimized form
select count(OrderKey) as measure0 from Orders

​Variables

Quads can be assigned to variables and these variables can then be referenced in other parts of a NetScript program. Consider the following example:
amountByTime = Orders.Amount | sum by (timeBucket(Orders.OrderDate));
amountByTime | by (Orders.ShipMode)
This quad is equivalent to our earlier quad
Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)
NetScript treats such variables as untyped macros and expands them much like the C preprocessor.

Quads over multiple datasets

NetScript is aware of the underlying database schema and can automatically traverse schema joins to generate queries. Consider the following quad:

LineItems.TotalPrice | sum by (Product.Category)
Given its knowledge of the underlying schema, NetScript can generate the following query for this quad.
select Category as dim0, sum(TotalPrice) as measure0
from LineItems, Product where LineItems.ProductId = Product.Id
group by Category
The precise rules around how the relationships are traversed and the constraints on what quads are valid vs not are described in the formal spec sections below.

Joins on quads

The simplest way to join NetScript quads is to combine them using binary operators. The simplest case of such a join is where the associated grains of both operands are identical. Here is an example:

revenue  = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost
Both revenue and cost are computed at the same grain (Orders.ShipMode). This allows us to interpret the / operation as a join on this shared ShipMode dimension. Here is the SQL corresponding to this quad.
with A as select ShipMode as dim0, sum(Amount) as measure0
from Orders group by ShipMode,
B as select ShipMode as dim0, sum(Cost) as measure0
from Orders grou pby ShipMode
select A.dim0 as dim0, A.measure0 / B.measure0 as measure0
from A, B where A.dim0 == B.dim0
The join is also well-defined when the set of dimensions on one side are a subset of the dimensions on the other side. Consider the following example:
revenue = Orders.Amount | sum by (Orders.Channel, Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost
The / operator represents a join. The LHS grain is {Channel, ShipMode} while the RHS grain is {ShipMode}. This quad will produce the following SQL. Notice how the join condition is only joining on the shared dimensions.
with A as select Channel as dim0,
ShipMode as dim1,
sum(Orders.Amount) as measure0
from Orders group by Channel, ShipMode
B as select ShipMode as dim0, sum(Orders.Cost) as measure0
from Orders group by ShipMode
select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 == B.dim0
Now consider the following program..
revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue
The intent is to calculate the percentage of revenue that each product contributes to its category. The LHS grain is {Product} and the RHS grain is {Product.Category} and neither side is a subset of the other. One simple solution is to modify the program as follows:
revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue
Since productRevenue is already sliced at {Product} grain, further slicing it by Product.Category will not change any of the aggregated revenue numbers. Now we can use our subset rule to generate the appropriate SQL query for this quad. This example presents an important idea of how NetScript joins quads. There is a functional dependency from Product to Product.Category in that a given product has only a given category. In database speak, there is a many-to-one relationship from Product to Product.Category. Therefore adding the extra dimension to productRevenue does not change the aggregation granularity of the quad and makes the join possible. NetScript does not require users to manually add these extra dimensions to their quads and can do it automatically. This means the original quad (without the extra Product.Category dimension) is perfectly well-defined and generates the following SQL query.
with A as select Product.ProductKey as dim0,
Product.Category as dim1,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.ProductKey, Product.Category,
B as select Product.Category as dim0,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.Category
select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 = B.dim0

Quads with Multiple Measures​

So far we have only looked at quads with a single measure named measure0. NetScript has limited support for multi-measure quads with a feature called dataframes.

revenue = Orders.Amount | sum
cost = Orders.Cost | sum
[revenue, cost]
This quad will result in the following SQL query.
with A as select sum(Amount) as measure0 from Orders,
B as select sum(Cost) as measure0 from Orders
select A.measure0 as measure0, B.measure0 as measure1 from A, B
Current NetScript places several limitations on dataframes. For example, dataframes cannot be dimensions of other quads, cannot be aggregated etc. But they are quite handy for the simple use cases of presenting two columns side by side.

Dataframes also offer the full join sophisitication of NetScript joins. Consider the following quad:

revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
[productRevenue, categoryRevenue]
This is very similar to the last example in the previous section but we present product and category revenues side by side instead of taking a ratio. The generated SQL query would also have pretty similar structure but the top level select clause would be select A.measure0 as measure0, B.measure0 as measure1 instead of select A.measure0 / B.measure0 as measure0.

Multiple Levels of Aggregation

The following quad will perform two level aggregation to calculate the average revenue across all ship modes. Note that it is not possible to express this calculation as a single level of aggregation. We need to first the total revenue for each shipmode (using sum) and then calculate the average across all ship modes.

Orders.Amount | sum by (Orders.ShipMode) | avg
The quad will produce the following SQL.
  with A as select ShipMode as dim0, sum(Amount) as measure0 from Orders
group by ShipMode
select average(A.measure0) as measure0 from A

Slicing Quads With Joins​

Consider the following quad where we are slicing a quad that is produced as a result of a join (the operator).

revenue  = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
profit = revenue - cost
profit | by (Orders.ShipMode)
A reasonable interpretation in this case is to slice both revenue and cost by Orders.ShipMode and then join the result. That is, we produce the SQL for the following netscript:
(revenue | by (Orders.ShipMode)) - (cost | by (Orders.ShipMode))
This generalizes to the following identity where P, Q, R are arbitrary quads and f is a join operator.
f(P, Q) | by (R) = f(P | by R, Q | by R)
While this is mathematically elegant, it is not true in all cases! The case to keep in mind is when one of the sides cannot be sliced by the provided quad R. This can happen if P and Q come from different fact tables and R only makes sense for one side. Slices over joins will attempt to slice all branches and are well-defined as long as at least one of those slice attempts succeed.

Orders | count | where (Orders.ShipMode == 'priority')​

Quads can also be filtered with the where keyword. Recall that quads represent queries and not the data returned by those queries. Consequently, the filter operation does not literally filter the data produced by a quad. Rather it mutates the query represented by the quad in a way that makes sense with respect to the semantics of the filter. This query would produce the following SQL:

select count(Orders) from Orders where ShipMode = 'priority'
This example is of course quite simplistic but NetScript filters can be arbitrary quads (as long as they have a single boolean typed measure). The query compiler can interpret the quad and produce the right SQL query as long as the underlying joins are possible.