Summary
Netscript is a powerful analytical language that allows users to author and manipulate SQL queries in an easy and natural fashion. NetScript query expressions, henceforth called quads, represent SQL queries and operations on these quads transform the SQL queries according to common analytical requirements.
This document will introduce NetScript by presenting a few examples that highlight the key capabilities of the language.
Notes About Syntax
Before using NetScript, please note the following rules about NetScript syntax.
- Function names are case-insensitive. For example, countDistinct and countdistinct are synonymous.
- Underscores in function names can be omitted. For example, extract_week and extractweek are synonymous.
- Timezone-sensitive functions use PT by default, but you can configure the timezone for each Analytics app.
- Functions and aggregations can use function syntax (round(x)) or pipe syntax (x | round). These are equivalent.
Motivating Examples
NetScript quads represent SQL queries that are written for analytical use cases. These use cases typically require aggregating quantities and slicing them by various dimensions. The following notation will come in handy while describing the semantics of quads. For a quad P:
- P.Query refers to the SQL query represented by P.
- P.Grain refers to the set of dimensions P on which the quad P is aggregated.
- P.Measures refers to the set of measures computed by a quad. Most quads only have a single measure, but NetScript provides some limited support for multi-measure quads. In the SQL analogy, if P.Grain corresponds to group by expressions, then P.Measures corresponds to aggregated expressions
For these examples, we will use a simple sample e-commerce inspired schema that includes the following tables:
- Orders - Each record in this table is an individual order. Has foreign keys to Customer and Product.
- LineItems - Each record in this table is an individual line item within an order with a foreign key to the Orders table.
- Customers - Each record in this table is a customer with basic account information such as name, address etc.
- Product - Each record in this table is a product with basic catalog information such as product name, category etc.
2
2 is the simplest quad. It represents the constant 2 and corresponds to the following SQL query.
select 2
Orders represents rows from the dataset called Orders. The dataset must be defined in the system or the quad will result in an error. Since the quad directly references the dataset (and not some column within the dataset), the SQL query for this quad will produce the primary key for each row in Orders.
There is no aggregation happening in this query and it does not quite fit within the measure / dimension paradigm described above. In order to make it fit, we pretend that this query is aggregated at the grain of each row in Orders. Since that is the finest possible aggregation grain, our pretense is technically valid.
select OrderKey as dim0, OrderKey as measure0 from Orders
Orders.Amount
This quad references a specific column from the Orders dataset. As is hopefully intuitive, this will map to the following SQL query.
select OrderKey as dim0, Amount as measure0 from Orders
2 * Orders.Amount
Binary operators like * conceptually represent a join between the subqueries for the quad 2 and Orders.Amount. In later examples, we will discuss the semantics of such joins in detail. Loosely speaking, the two subqueries must be joined on their shared dimensions. In this case, the quad 2 has no dimensions and thus the join is simply a cross-product join. We show the SQL for this quad in both optimized and unoptimized form.
with A as select 2 as measure0,
B as select OrderKey as dim0, Amount as measure0 from Orders
select B.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B
// Equivalent optimized form
select OrderKey as dim0, 2 * Amount as measure0 from Orders
As in the previous example, the * operator represents a join between the two subqueries produced by the two operands. Since both sub-queries are aggregated at the. OrderKey grain, this join is trivial and we simply merge them into a single select statement.
with A as select OrderKey as dim0, Amount as measure0 from Orders,
B as select OrderKey as dim0, Discount as measure0 from Orders
select A.dim0 as dim0, A.measure0 * B.measure0 as measure0 from A, B
where A.dim0 = B.dim0
// Equivalent optimized form
select OrderKey as dim0, Amount * Discount as measure0 from Orders
Orders.Amount | by (Orders.ShipMode)
Recall that quads are always aggregated at some grain. The by operator will take an aggregated quad and slice it by the supplied dimension. Notice that the input Orders.Amount is already grouped at the finest possible grain OrderKey. Even though further slicing is not possible, the by operation is still valid and its only effect is to add an additional ShipMode column to the set of dimensions. The by` operator is also often referred to as the slice operator and its behavior in the face of more complicated input and slicer quads is discussed later in the document.
This quad corresponds to the following SQL query.
select OrderKey as dim0, ShipMode as dim1, Amount as measure0 from Orders
P | by (Q) is equivalent to P | by (Q) | by (Q)
Orders.Amount | sum by (Orders.ShipMode)
With this quad, we are taking row level order data and aggregating them at the level of the different ShipModes. Note that ShipMode is not present in the input query. So we must first inject that field and then aggregate over it. It is simpler to think of this as the following equivalent quad:
Orders.Amount | by (Orders.ShipMode) | sum by (Orders.ShipMode)`
with A as select OrderKey as dim0, ShipMode as dim1, Amount as measure0
from Orders
select A.dim1 as dim0, sum(A.measure0) as measure0 from A group by A.dim1
// Equivalent optimized form
select ShipMode as dim0, sum(Amount) as measure0
from Orders group by A.ShipMode
Orders.Amount | sum by (timeBucket(Orders.OrderDate))
Orders.Amount | sum by (timeBucket(Orders.OrderDate, $TIME_GRAIN))
For the sake of completeness, the above quad corresponds to the following SQL query.
select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN)
Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)
select timeBucket(OrderDate, $TIME_GRAIN) as dim0,
ShipMode as dim1,
sum(Amount) as measure0 from Orders
group by timeBucket(OrderDate, $TIME_GRAIN), ShipMode
Orders | count
with A as select OrderKey as dim0, OrderKey as measure0 from Orders
select count(A.measure0) as measure0 from A
// Equivalent optimized form
select count(OrderKey) as measure0 from Orders
Variables
amountByTime = Orders.Amount | sum by (timeBucket(Orders.OrderDate));
amountByTime | by (Orders.ShipMode)
Orders.Amount | sum by (timeBucket(Orders.OrderDate)) | by (Orders.ShipMode)
Quads over multiple datasets
NetScript is aware of the underlying database schema and can automatically traverse schema joins to generate queries. Consider the following quad:
LineItems.TotalPrice | sum by (Product.Category)
select Category as dim0, sum(TotalPrice) as measure0
from LineItems, Product where LineItems.ProductId = Product.Id
group by Category
Joins on quads
The simplest way to join NetScript quads is to combine them using binary operators. The simplest case of such a join is where the associated grains of both operands are identical. Here is an example:
revenue = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost
with A as select ShipMode as dim0, sum(Amount) as measure0
from Orders group by ShipMode,
B as select ShipMode as dim0, sum(Cost) as measure0
from Orders grou pby ShipMode
select A.dim0 as dim0, A.measure0 / B.measure0 as measure0
from A, B where A.dim0 == B.dim0
revenue = Orders.Amount | sum by (Orders.Channel, Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
revenue / cost
with A as select Channel as dim0,
ShipMode as dim1,
sum(Orders.Amount) as measure0
from Orders group by Channel, ShipMode
B as select ShipMode as dim0, sum(Orders.Cost) as measure0
from Orders group by ShipMode
select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 == B.dim0
revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue
revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
productRevenue / categoryRevenue
with A as select Product.ProductKey as dim0,
Product.Category as dim1,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.ProductKey, Product.Category,
B as select Product.Category as dim0,
sum(Amount) as measure0
from Product, LineItems
where LineItems.ProductKey = Product.ProductKey
group by Product.Category
select A.dim0 as dim0,
A.dim1 as dim1,
A.measure0 / B.measure0 as measure0
from A, B where A.dim1 = B.dim0
Quads with Multiple Measures
So far we have only looked at quads with a single measure named measure0. NetScript has limited support for multi-measure quads with a feature called dataframes.
revenue = Orders.Amount | sum
cost = Orders.Cost | sum
[revenue, cost]
with A as select sum(Amount) as measure0 from Orders,
B as select sum(Cost) as measure0 from Orders
select A.measure0 as measure0, B.measure0 as measure1 from A, B
Dataframes also offer the full join sophisitication of NetScript joins. Consider the following quad:
revenue = LineItems.Price | sum;
productRevenue = revenue | by (Product, Product.Category);
categoryRevenue = revenue | by (Product.Category);
[productRevenue, categoryRevenue]
Multiple Levels of Aggregation
The following quad will perform two level aggregation to calculate the average revenue across all ship modes. Note that it is not possible to express this calculation as a single level of aggregation. We need to first the total revenue for each shipmode (using sum) and then calculate the average across all ship modes.
Orders.Amount | sum by (Orders.ShipMode) | avg
with A as select ShipMode as dim0, sum(Amount) as measure0 from Orders
group by ShipMode
select average(A.measure0) as measure0 from A
Slicing Quads With Joins
Consider the following quad where we are slicing a quad that is produced as a result of a join (the - operator).
revenue = Orders.Amount | sum by (Orders.ShipMode);
cost = Orders.Cost | sum by (Orders.ShipMode);
profit = revenue - cost
profit | by (Orders.ShipMode)
(revenue | by (Orders.ShipMode)) - (cost | by (Orders.ShipMode))
f(P, Q) | by (R) = f(P | by R, Q | by R)
Orders | count | where (Orders.ShipMode == 'priority')
Quads can also be filtered with the where keyword. Recall that quads represent queries and not the data returned by those queries. Consequently, the filter operation does not literally filter the data produced by a quad. Rather it mutates the query represented by the quad in a way that makes sense with respect to the semantics of the filter. This query would produce the following SQL:
select count(Orders) from Orders where ShipMode = 'priority'
Please sign in to leave a comment.