Steps definition
This document describes the steps that can be used in our pipelines. A step is the unit of transformation generated by a user interaction with the visual query builder.
Unless explicitly specified, all steps are supported by all backends.
absolutevalue step
This step is meant to compute the absolute value of a given input column.
{
name: 'absolutevalue',
column: 'my-column',
new_column_name: 'my-new-column'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | -33 |
| Company 2 | 0 |
| Company 3 | 10 |
Step configuration:
{
name: 'absolutevalue',
column: "Value",
new_column_name: "My-absolute-value",
}
Output dataset:
| Company | Value | My-absolute-value |
|---|---|---|
| Company 1 | -33 | 33 |
| Company 2 | 0 | 0 |
| Company 3 | 10 | 10 |
addmissingdates step
Add missing dates as new rows in a dates column. The completed dates will range between the minimum and maximum dates found in the dataset (or in each group if a group by logic is applied - see below).
Added rows will be set to null in columns not referenced in the step configuration.
You should make sure to use a group by logic if you want to add missing dates in
independent groups of rows (e.g. you may need to add missing rows for every
country found in a “COUNTRY” column). And you should ensure that every date is
unique in every group of rows at the specified granularity, else you may get
inconsistent results. You can specify “group by” columns in the groups
parameter.
An addmissingdates step has the following structure:
{
name: 'addmissingdates',
datesColumn: 'DATE', // the dates columns with missing rows
datesGranularity: 'day', // the granularity of the date column (day, month or year)
groups: ['COUNTRY'] // (optional) columns used for group by logic if necessary
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
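For intuition, here is a minimal pandas sketch of what this step computes (an illustration, not the actual backend implementation), assuming the dates column holds datetime values:

import pandas as pd

def add_missing_dates(df, dates_column, freq='D', groups=None):
    # Reindex on a complete date range (freq='D' corresponds to 'day' granularity);
    # dates missing from the range appear as new rows filled with nulls.
    def complete(g):
        full = pd.date_range(g[dates_column].min(), g[dates_column].max(), freq=freq)
        g = g.set_index(dates_column).reindex(full).rename_axis(dates_column).reset_index()
        if groups:
            g[groups] = g[groups].ffill().bfill()  # added rows keep their group labels
        return g
    if groups:
        return df.groupby(groups, group_keys=False).apply(complete).reset_index(drop=True)
    return complete(df)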
Example 1: day granularity without groups
Input dataset:
| DATE | VALUE |
|---|---|
| “2018-01-01T00:00:00.000Z” | 75 |
| “2018-01-02T00:00:00.000Z” | 80 |
| “2018-01-03T00:00:00.000Z” | 82 |
| “2018-01-04T00:00:00.000Z” | 83 |
| “2018-01-05T00:00:00.000Z” | 80 |
| “2018-01-07T00:00:00.000Z” | 86 |
| “2018-01-08T00:00:00.000Z” | 79 |
| “2018-01-09T00:00:00.000Z” | 76 |
| “2018-01-10T00:00:00.000Z” | 79 |
| “2018-01-11T00:00:00.000Z” | 75 |
Here the day “2018-01-06” is missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day'
}
Output dataset:
| DATE | VALUE |
|---|---|
| “2018-01-01T00:00:00.000Z” | 75 |
| “2018-01-02T00:00:00.000Z” | 80 |
| “2018-01-03T00:00:00.000Z” | 82 |
| “2018-01-04T00:00:00.000Z” | 83 |
| “2018-01-05T00:00:00.000Z” | 80 |
| “2018-01-06T00:00:00.000Z” | |
| “2018-01-07T00:00:00.000Z” | 86 |
| “2018-01-08T00:00:00.000Z” | 79 |
| “2018-01-09T00:00:00.000Z” | 76 |
| “2018-01-10T00:00:00.000Z” | 79 |
| “2018-01-11T00:00:00.000Z” | 75 |
Example 2: day granularity with groups
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | “2018-01-01T00:00:00.000Z” | 75 |
| France | “2018-01-02T00:00:00.000Z” | 80 |
| France | “2018-01-03T00:00:00.000Z” | 82 |
| France | “2018-01-04T00:00:00.000Z” | 83 |
| France | “2018-01-05T00:00:00.000Z” | 80 |
| France | “2018-01-07T00:00:00.000Z” | 86 |
| France | “2018-01-08T00:00:00.000Z” | 79 |
| France | “2018-01-09T00:00:00.000Z” | 76 |
| France | “2018-01-10T00:00:00.000Z” | 79 |
| France | “2018-01-11T00:00:00.000Z” | 85 |
| USA | “2018-01-01T00:00:00.000Z” | 69 |
| USA | “2018-01-02T00:00:00.000Z” | 73 |
| USA | “2018-01-03T00:00:00.000Z” | 73 |
| USA | “2018-01-05T00:00:00.000Z” | 75 |
| USA | “2018-01-06T00:00:00.000Z” | 70 |
| USA | “2018-01-07T00:00:00.000Z” | 76 |
| USA | “2018-01-08T00:00:00.000Z” | 73 |
| USA | “2018-01-09T00:00:00.000Z” | 70 |
| USA | “2018-01-10T00:00:00.000Z” | 72 |
| USA | “2018-01-12T00:00:00.000Z” | 78 |
Here the day “2018-01-06” is missing for “France” rows, and “2018-01-04” and “2018-01-11” are missing for “USA” rows.
Note that “2018-01-12” will not be considered as a missing row for “France” rows, because the latest date found for this group of rows is “2018-01-11” (even though “2018-01-12” is the latest date found for “USA” rows).
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day',
groups: ['COUNTRY']
}
Output dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | “2018-01-01T00:00:00.000Z” | 75 |
| France | “2018-01-02T00:00:00.000Z” | 80 |
| France | “2018-01-03T00:00:00.000Z” | 82 |
| France | “2018-01-04T00:00:00.000Z” | 83 |
| France | “2018-01-05T00:00:00.000Z” | 80 |
| France | “2018-01-06T00:00:00.000Z” | |
| France | “2018-01-07T00:00:00.000Z” | 86 |
| France | “2018-01-08T00:00:00.000Z” | 79 |
| France | “2018-01-09T00:00:00.000Z” | 76 |
| France | “2018-01-10T00:00:00.000Z” | 79 |
| France | “2018-01-11T00:00:00.000Z” | 85 |
| USA | “2018-01-01T00:00:00.000Z” | 69 |
| USA | “2018-01-02T00:00:00.000Z” | 73 |
| USA | “2018-01-03T00:00:00.000Z” | 73 |
| USA | “2018-01-04T00:00:00.000Z” | |
| USA | “2018-01-05T00:00:00.000Z” | 75 |
| USA | “2018-01-06T00:00:00.000Z” | 70 |
| USA | “2018-01-07T00:00:00.000Z” | 76 |
| USA | “2018-01-08T00:00:00.000Z” | 73 |
| USA | “2018-01-09T00:00:00.000Z” | 70 |
| USA | “2018-01-10T00:00:00.000Z” | 72 |
| USA | “2018-01-11T00:00:00.000Z” | |
| USA | “2018-01-12T00:00:00.000Z” | 78 |
Example 3: month granularity
Input dataset:
| DATE | VALUE |
|---|---|
| “2019-01-01T00:00:00.000Z” | 74 |
| “2019-02-01T00:00:00.000Z” | 73 |
| “2019-03-01T00:00:00.000Z” | 68 |
| “2019-04-01T00:00:00.000Z” | 71 |
| “2019-06-01T00:00:00.000Z” | 74 |
| “2019-07-01T00:00:00.000Z” | 74 |
| “2019-08-01T00:00:00.000Z” | 73 |
| “2019-09-01T00:00:00.000Z” | 72 |
| “2019-10-01T00:00:00.000Z” | 75 |
| “2019-12-01T00:00:00.000Z” | 76 |
Here “2019-05” and “2019-11” are missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'month',
}
Output dataset:
| DATE | VALUE |
|---|---|
| “2019-01-01T00:00:00.000Z” | 74 |
| “2019-02-01T00:00:00.000Z” | 73 |
| “2019-03-01T00:00:00.000Z” | 68 |
| “2019-04-01T00:00:00.000Z” | 71 |
| “2019-05-01T00:00:00.000Z” | |
| “2019-06-01T00:00:00.000Z” | 74 |
| “2019-07-01T00:00:00.000Z” | 74 |
| “2019-08-01T00:00:00.000Z” | 73 |
| “2019-09-01T00:00:00.000Z” | 72 |
| “2019-10-01T00:00:00.000Z” | 75 |
| “2019-11-01T00:00:00.000Z” | |
| “2019-12-01T00:00:00.000Z” | 76 |
aggregate step
Perform aggregations on one or several columns. Available aggregation functions are sum, average, count, count distinct, min, max, first, last.
An aggregation step has the following structure:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
// ...
],
keepOriginalGranularity: false, // whether to keep the original granularity, in which
// case computed aggregations will be added in new columns
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column and newcolumn (in the singular) properties of the aggregation
parameter are deprecated and are supported only for retrocompatibility purposes.
When this step was first created, only 1 column at a time could be aggregated for
a given aggregation.
Now columns and newcolumns are lists of columns, allowing the same
aggregation function to be applied to several columns at once.
An old-fashioned step looked like this:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1: keepOriginalGranularity set to false
Input dataset:
| Label | Group | Value1 | Value2 |
|---|---|---|---|
| Label 1 | Group 1 | 13 | 10 |
| Label 2 | Group 1 | 7 | 21 |
| Label 3 | Group 1 | 20 | 4 |
| Label 4 | Group 2 | 1 | 17 |
| Label 5 | Group 2 | 9 | 12 |
| Label 6 | Group 2 | 5 | 2 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Sum-Value1', 'Sum-Value2'],
aggfunction: 'sum',
columns: ['Value1', 'Value2']
},
{
newcolumns: ['Avg-Value1'],
aggfunction: 'avg',
columns: ['Value1']
}
],
keepOriginalGranularity: false,
}
Output dataset:
| Group | Sum-Value1 | Sum-Value2 | Avg-Value1 |
|---|---|---|---|
| Group 1 | 40 | 35 | 13.333333 |
| Group 2 | 16 | 31 | 5.333333 |
Example 2: keepOriginalGranularity set to true
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Total'],
aggfunction: 'sum',
columns: ['Value']
}
],
keepOriginalGranularity: true,
}
Output dataset:
| Label | Group | Value | Total |
|---|---|---|---|
| Label 1 | Group 1 | 13 | 40 |
| Label 2 | Group 1 | 7 | 40 |
| Label 3 | Group 1 | 20 | 40 |
| Label 4 | Group 2 | 1 | 16 |
| Label 5 | Group 2 | 10 | 16 |
| Label 6 | Group 2 | 5 | 16 |
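For intuition, keepOriginalGranularity maps onto the difference between pandas’ groupby().agg() and groupby().transform(). A minimal sketch (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({
    'Label': [f'Label {i}' for i in range(1, 7)],
    'Group': ['Group 1'] * 3 + ['Group 2'] * 3,
    'Value': [13, 7, 20, 1, 10, 5],
})

# keepOriginalGranularity: false -> one row per group
totals = df.groupby('Group', as_index=False).agg(Total=('Value', 'sum'))

# keepOriginalGranularity: true -> original rows kept, aggregate written in a new column
df['Total'] = df.groupby('Group')['Value'].transform('sum')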
append step
Appends one or several datasets resulting from other pipelines to the current
dataset.
WeaverBird allows you to save pipelines referenced by name in the Vuex store
of the application. You can then call them by their unique names in this step.
{
name: 'append',
pipelines: ['pipeline1', 'pipeline2'] // the name of the pipelines to append
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
dataset1 (saved in the application Vuex store):
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
dataset2 (saved in the application Vuex store):
| Label | Group | Value |
|---|---|---|
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'append',
pipelines: ['dataset1', 'dataset2']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
argmax step
Get row(s) matching the maximum value in a given column, by group if groups
is specified.
{
name: 'argmax',
column: 'value', // column in which to search for the max value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value'
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
Example 2: with groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value',
groups: ['Group']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 5 | Group 2 | 10 |
argmin step
Get row(s) matching the minimum value in a given column, by group if groups
is specified.
{
name: 'argmin',
column: 'value', // column in which to search for the min value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value'
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 4 | Group 2 | 1 |
Example 2: with groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value',
groups: ['Group']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 2 | Group 1 | 7 |
| Label 4 | Group 2 | 1 |
comparetext step
Compares 2 string columns and returns true if the string values are equal, and false otherwise. The comparison is case-sensitive (see examples below).
{
name: 'comparetext',
newColumnName: 'NEW', // new column to be created, with the comparison result
strCol1: 'TEXT_1', // first string column to compare
strCol2: 'TEXT_2', // second string column to compare
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| TEXT_1 | TEXT_2 |
|---|---|
| France | Fr |
| France | France |
| France | france |
| France | England |
| France | USA |
Step configuration:
{
name: 'comparetext',
newColumnName: 'RESULT',
strCol1: 'TEXT_1',
strCol2: 'TEXT_2',
}
Output dataset:
| TEXT_1 | TEXT_2 | RESULT |
|---|---|---|
| France | Fr | false |
| France | France | true |
| France | france | false |
| France | England | false |
| France | USA | false |
concatenate step
This step allows concatenating several columns using a separator.
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ', // can be a string of any length
new_column_name: 'Label' // the new column in which to write the concatenation
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value |
|---|---|---|
| Company 1 | Group 1 | 13 |
| Company 2 | Group 1 | 7 |
| Company 3 | Group 1 | 20 |
| Company 4 | Group 2 | 1 |
| Company 5 | Group 2 | 10 |
| Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ',
new_column_name: 'Label'
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
convert step
This step allows converting column data types.
{
name: 'convert',
columns: ['col1', 'col2'],
data_type: 'integer' // the data type to convert the columns into; either
// 'integer', 'float', 'text', 'date' or 'boolean'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
- Snowflake
In an effort to harmonize conversion behaviors across backends as much as possible, the SQL translator implements some casts differently from the plain CAST AS method.
Specifically, when casting float to integer, the default CAST behavior rounds the result, whereas other languages truncate it.
That’s why TRUNCATE is used when converting float to int.
The same implementation was done when converting strings to int (for dates represented as strings).
As for the conversion of date to int, it is handled by assuming the dataset’s timestamps are in TIMESTAMP_NTZ format.
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | ‘13’ |
| Company 2 | ‘7’ |
| Company 3 | ‘20’ |
| Company 4 | ‘1’ |
| Company 5 | ‘10’ |
| Company 6 | ‘5’ |
Step configuration:
{
name: 'convert',
columns: ['Value'],
data_type: 'integer'
}
Output dataset:
| Company | Value |
|---|---|
| Company 1 | 13 |
| Company 2 | 7 |
| Company 3 | 20 |
| Company 4 | 1 |
| Company 5 | 10 |
| Company 6 | 5 |
cumsum step
This step allows computing the cumulative sum of value columns, based on a reference column (usually dates) sorted in ascending order for the needs of the computation. The computation can be scoped by group if needed.
The toCumSum parameter takes as input a list of 2-elements lists in the form
[‘valueColumn’, ‘newColumn’].
{
name: 'cumsum',
toCumSum: [['myValues', 'myCumsum']],
referenceColumn: 'myDates',
groupby: ['foo', 'bar'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
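A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def cumsum_step(df, to_cum_sum, reference_column, groupby=None):
    # Sort by the reference column, then accumulate each value column (per group if any).
    df = df.sort_values(reference_column)
    for value_col, new_col in to_cum_sum:
        new_col = new_col or f'{value_col}_CUMSUM'  # default name when left empty
        series = df.groupby(groupby)[value_col] if groupby else df[value_col]
        df[new_col] = series.cumsum()
    return df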
Example 1: Basic usage
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-01 | 2 |
| 2019-02 | 5 |
| 2019-03 | 3 |
| 2019-04 | 8 |
| 2019-05 | 9 |
| 2019-06 | 6 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', '']], // an empty new column name defaults to '<valueColumn>_CUMSUM'
referenceColumn: 'DATE',
}
Output dataset:
| DATE | VALUE | VALUE_CUMSUM |
|---|---|---|
| 2019-01 | 2 | 2 |
| 2019-02 | 5 | 7 |
| 2019-03 | 3 | 10 |
| 2019-04 | 8 | 18 |
| 2019-05 | 9 | 27 |
| 2019-06 | 6 | 33 |
Example 2: With more advanced options
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | 2019-01 | 2 |
| France | 2019-02 | 5 |
| France | 2019-03 | 3 |
| France | 2019-04 | 8 |
| France | 2019-05 | 9 |
| France | 2019-06 | 6 |
| USA | 2019-01 | 10 |
| USA | 2019-02 | 6 |
| USA | 2019-03 | 6 |
| USA | 2019-04 | 4 |
| USA | 2019-05 | 8 |
| USA | 2019-06 | 7 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', 'MY_CUMSUM']],
referenceColumn: 'DATE',
groupby: ['COUNTRY'],
}
Output dataset:
| COUNTRY | DATE | VALUE | MY_CUMSUM |
|---|---|---|---|
| France | 2019-01 | 2 | 2 |
| France | 2019-02 | 5 | 7 |
| France | 2019-03 | 3 | 10 |
| France | 2019-04 | 8 | 18 |
| France | 2019-05 | 9 | 27 |
| France | 2019-06 | 6 | 33 |
| USA | 2019-01 | 10 | 10 |
| USA | 2019-02 | 6 | 16 |
| USA | 2019-03 | 6 | 22 |
| USA | 2019-04 | 4 | 26 |
| USA | 2019-05 | 8 | 34 |
| USA | 2019-06 | 7 | 41 |
custom step
This step allows defining a custom query that can’t be expressed using the other existing steps.
{
name: 'custom',
query: '$group: {_id: ...}'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
Example: using Mongo query language
Input dataset:
| Company | Group | Value |
|---|---|---|
| Company 1 | Group 1 | 13 |
| Company 2 | Group 1 | 7 |
| Company 3 | Group 1 | 20 |
| Company 4 | Group 2 | 1 |
| Company 5 | Group 2 | 10 |
| Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'custom',
query: '$addFields: { Label: { $concat: [ "$Company", " - ", "$Group" ] } }'
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
dateextract step
Extract date information (e.g. day, week, year). The following information can be extracted:
- year: extract ‘year’ from date
- month: extract ‘month’ from date
- day: extract ‘day of month’ from date
- week: extract ‘week number’ (ranging from 0 to 53) from date
- quarter: extract ‘quarter number’ from date (1 for Jan-Feb-Mar)
- dayOfWeek: extract ‘day of week’ (ranging from 1 for Sunday to 7 for Saturday) from date
- dayOfYear: extract ‘day of year’ from date
- isoYear: extract ‘year number’ in ISO 8601 format from date
- isoWeek: extract ‘week number’ in ISO 8601 format (ranging from 1 to 53) from date
- isoDayOfWeek: extract ‘day of week’ in ISO 8601 format (ranging from 1 for Monday to 7 for Sunday) from date
- firstDayOfYear: calendar date corresponding to the first day (1st of January) of the year
- firstDayOfMonth: calendar date corresponding to the first day of the month
- firstDayOfWeek: calendar date corresponding to the first day of the week
- firstDayOfQuarter: calendar date corresponding to the first day of the quarter
- firstDayOfIsoWeek: calendar date corresponding to the first day of the week in ISO 8601 format
- currentDay: calendar date of the target date
- previousDay: calendar date one day before the target date
- firstDayOfPreviousYear: calendar date corresponding to the first day (1st of January) of the previous year
- firstDayOfPreviousMonth: calendar date corresponding to the first day of the previous month
- firstDayOfPreviousWeek: calendar date corresponding to the first day of the previous week
- firstDayOfPreviousQuarter: calendar date corresponding to the first day of the previous quarter
- firstDayOfPreviousISOWeek: calendar date corresponding to the first day of the previous ISO week
- previousYear: extract previous ‘year number’ from date
- previousMonth: extract previous ‘month number’ from date
- previousWeek: extract previous ‘week number’ from date
- previousQuarter: extract previous ‘quarter number’ from date
- previousISOWeek: extract previous ‘week number’ in ISO 8601 format from date
- hour: extract ‘hour’ from date
- minutes: extract ‘minutes’ from date
- seconds: extract ‘seconds’ from date
- milliseconds: extract ‘milliseconds’ from date
Here’s an example of such a step:
{
name: 'dateextract',
column: 'date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['date_year', 'date_month', 'date_day'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
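For intuition, most dateInfo values map onto pandas’ .dt accessors. A sketch of the year/month/day case (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-10-30', '2019-09-15'])})

# One new column per extracted piece of date information.
df['Date_year'] = df['Date'].dt.year
df['Date_month'] = df['Date'].dt.month
df['Date_day'] = df['Date'].dt.day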
Deprecation note:
The operation and new_column_name parameters are deprecated and are
supported for retrocompatibility purposes only.
An old-fashioned step looks like this:
{
name: 'dateextract',
column: 'date',
operation: 'day',
new_column_name: 'date_day',
}
Example
Input dataset:
| Date |
|---|
| 2019-10-30T00:00:00.000Z |
| 2019-10-15T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-30T00:00:00.000Z |
| 2019-09-15T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
Step configuration:
{
name: 'dateextract',
column: 'Date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['Date_year', 'Date_month', 'Date_day'],
}
Output dataset:
| Date | Date_year | Date_month | Date_day |
|---|---|---|---|
| 2019-10-30T00:00:00.000Z | 2019 | 10 | 30 |
| 2019-10-15T00:00:00.000Z | 2019 | 10 | 15 |
| 2019-10-01T00:00:00.000Z | 2019 | 10 | 1 |
| 2019-09-30T00:00:00.000Z | 2019 | 9 | 30 |
| 2019-09-15T00:00:00.000Z | 2019 | 9 | 15 |
| 2019-09-01T00:00:00.000Z | 2019 | 9 | 1 |
dategranularity step
Truncate dates to a given granularity (e.g. month or year), typically in a column intended for aggregation. The following granularities are supported:
- year: calendar date corresponding to the first day (1st of January) of the year
- quarter: calendar date corresponding to the first day of the quarter
- month: calendar date corresponding to the first day of the month
- week: calendar date corresponding to the first day of the week (Sunday)
- isoWeek: calendar date corresponding to the first day of the week (Monday)
- day: calendar date corresponding to the first hour of the day
Here’s an example of such a step:
{
name: 'dategranularity',
column: 'date',
granularity: 'year',
newColumn: 'do_the_aggregate_on_this' // optional, the source column is overwritten when omitted
}
This step is supported by the following backends:
- Mongo
- Pandas (python)
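For intuition, a pandas sketch of the 'month' granularity (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-10-30', '2019-09-30 05:11:31'])})

# Truncate each date to the first day of its month.
df['Date'] = df['Date'].dt.to_period('M').dt.to_timestamp()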
Example
Input dataset:
| Date |
|---|
| 2019-10-30T00:00:00.000Z |
| 2019-10-15T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-30T05:11:31.000Z |
| 2019-09-15T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
Step configuration:
{
name: 'dategranularity',
column: 'Date',
granularity: 'month'
}
Output dataset:
| Date |
|---|
| 2019-10-01T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
delete step
Delete one or several columns.
{
name: 'delete',
columns: ['my-column', 'some-other-column']
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'delete',
columns: ['Company', 'Group']
}
Output dataset:
| Value | Label |
|---|---|
| 13 | Company 1 - Group 1 |
| 7 | Company 2 - Group 1 |
| 20 | Company 3 - Group 1 |
| 1 | Company 4 - Group 2 |
| 10 | Company 5 - Group 2 |
| 5 | Company 6 - Group 2 |
dissolve step
Dissolve geographical data, i.e. merge the geometries of rows sharing the same values in the group columns.
{
name: 'dissolve',
groups: ['my-column', 'some-other-column'],
include_nulls: true,
aggregations: [], // optional, same syntax as in the aggregate step
}
This step is supported by the following backends:
- Pandas (python)
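The behavior can be pictured with geopandas’ dissolve. A sketch in which the file and column names are illustrative assumptions (a recent geopandas is assumed for the aggfunc dict and dropna parameters):

import geopandas as gpd

gdf = gpd.read_file('cities.geojson')  # hypothetical input with 'Country' and 'Population' columns

dissolved = gdf.dissolve(
    by='Country',                   # the step's 'groups' parameter
    aggfunc={'Population': 'sum'},  # the step's optional 'aggregations'
    dropna=False,                   # include_nulls: true
).reset_index()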
Example without aggregations
Input dataset:
| Country | City | geometry |
|---|---|---|
| Country 1 | City 1 | Polygon |
| Country 2 | City 2 | Polygon |
| Country 2 | City 3 | Polygon |
| Country 1 | City 4 | Polygon |
| Country 2 | City 5 | Polygon |
| Country 1 | City 6 | Polygon |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
}
Output dataset:
| Country | geometry |
|---|---|
| Country 1 | MultiPolygon |
| Country 2 | MultiPolygon |
Example with aggregations
Input dataset:
| Country | City | geometry | Population |
|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 |
| Country 2 | City 2 | Polygon | 50_000 |
| Country 2 | City 3 | Polygon | 200_000 |
| Country 1 | City 4 | Polygon | 30_000 |
| Country 2 | City 5 | Polygon | 25_000 |
| Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
aggregations: [
{
aggfunction: 'sum',
columns: ['Population'],
newcolumns: ['Total population'],
}
]
}
Output dataset:
| Country | geometry | Total population |
|---|---|---|
| Country 1 | MultiPolygon | 140_000 |
| Country 2 | MultiPolygon | 275_000 |
domain step
This step is meant to select a specific domain (using MongoDB terminology).
{
name: 'domain',
domain: 'my-domain'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
duplicate step
This step is meant to duplicate a column.
{
name: 'duplicate',
column: 'my-column',
new_column_name: 'my-duplicate'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | 13 |
| Company 2 | 0 |
| Company 3 | 20 |
Step configuration:
{
name: 'duplicate',
column: "Company",
new_column_name: "Company-copy",
}
Output dataset:
| Company | Value | Company-copy |
|---|---|---|
| Company 1 | 13 | Company 1 |
| Company 2 | 0 | Company 2 |
| Company 3 | 20 | Company 3 |
duration step
Compute the duration (in days, hours, minutes or seconds) between 2 dates in a new column.
{
name: 'duration',
newColumnName: 'DURATION', // new column name
startDateColumn: 'START_DATE', // start date
endDateColumn: 'END_DATE', // end date
durationIn: 'days', // duration units (days, hours, minutes or seconds)
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
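A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

SECONDS_PER_UNIT = {'days': 86400, 'hours': 3600, 'minutes': 60, 'seconds': 1}

def duration(df, new_column_name, start_date_column, end_date_column, duration_in):
    # Difference between the two datetime columns, converted into the chosen unit.
    delta = pd.to_datetime(df[end_date_column]) - pd.to_datetime(df[start_date_column])
    df[new_column_name] = delta.dt.total_seconds() / SECONDS_PER_UNIT[duration_in]
    return df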
Example 1: duration in days
Input dataset:
| START_DATE | END_DATE |
|---|---|
| “2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” |
| “2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_DATE',
endDateColumn: 'END_DATE',
durationIn: 'days',
}
Output dataset:
| START_DATE | END_DATE | DURATION |
|---|---|---|
| “2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” | 30 |
| “2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” | 365 |
Example 2: duration in minutes
Input dataset:
| START_HOUR | END_HOUR |
|---|---|
| “2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” |
| “2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_HOUR',
endDateColumn: 'END_HOUR',
durationIn: 'minutes',
}
Output dataset:
| START_HOUR | END_HOUR | DURATION |
|---|---|---|
| “2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” | 60 |
| “2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” | 300 |
evolution step
Use this step if you need to compute the row-by-row evolution of a value column, based on a date column. It will output 2 columns: one for the evolution in absolute value, the other for the evolution in percentage.
You must be careful that the computation is scoped so that there are no
duplicate dates (so that any date finds no more than one previous date). That
means that you may need to specify “group by” columns to make any date unique
inside each group. You should specify those columns in the indexColumns parameter.
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear', // or vsLastMonth, vsLastWeek, vsLastDay
evolutionFormat: 'abs', // or pct
// optional, if computation has to be performed by group so that any date finds no more than one previous date
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL', // optional, <originalColumnName>_EVOL_ABS or <originalColumnName>_EVOL_PCT by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
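For intuition, a pandas sketch of the 'abs' computation (an illustration, not the actual backend implementation; offset would be pd.DateOffset(years=1) for vsLastYear, pd.DateOffset(months=1) for vsLastMonth, and so on):

import pandas as pd

def evolution_abs(df, date_col, value_col, offset, index_columns=()):
    # Look up the value at (date - offset) inside each group, then subtract.
    keys = list(index_columns) + [date_col]
    if df.duplicated(keys).any():
        raise ValueError('duplicate dates inside a group')  # prohibited, see example 3
    previous = df[keys + [value_col]].copy()
    previous[date_col] = previous[date_col] + offset  # shift forward to align with the next period
    previous = previous.rename(columns={value_col: '_previous'})
    out = df.merge(previous, on=keys, how='left')
    out[f'{value_col}_EVOL_ABS'] = out[value_col] - out['_previous']
    return out.drop(columns='_previous')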
Example 1: Basic configuration - evolution in absolute value
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-06 | 79 |
| 2019-07 | 81 |
| 2019-08 | 77 |
| 2019-09 | 75 |
| 2019-11 | 78 |
| 2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
| DATE | VALUE | VALUE_EVOL_ABS |
|---|---|---|
| 2019-06 | 79 | |
| 2019-07 | 81 | 2 |
| 2019-08 | 77 | -4 |
| 2019-09 | 75 | -2 |
| 2019-11 | 78 | |
| 2019-12 | 88 | 10 |
Example 2: Basic configuration - evolution in percentage
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-06 | 79 |
| 2019-07 | 81 |
| 2019-08 | 77 |
| 2019-09 | 75 |
| 2019-11 | 78 |
| 2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'pct',
indexColumns: [],
}
Output dataset:
| DATE | VALUE | VALUE_EVOL_PCT |
|---|---|---|
| 2019-06 | 79 | |
| 2019-07 | 81 | 0.02531645569620253 |
| 2019-08 | 77 | -0.04938271604938271 |
| 2019-09 | 75 | -0.025974025974025976 |
| 2019-11 | 78 | |
| 2019-12 | 88 | 0.1282051282051282 |
Example 3: Error on duplicate dates
If ‘COUNTRY’ is not specified in indexColumns, the computation will not be scoped by country. In that case there are duplicate dates in the “DATE” column, which is prohibited and will lead to an error.
Input dataset:
| DATE | COUNTRY | VALUE |
|---|---|---|
| 2014-12 | France | 79 |
| 2015-12 | France | 81 |
| 2016-12 | France | 77 |
| 2017-12 | France | 75 |
| 2014-12 | USA | 74 |
| 2015-12 | USA | 74 |
| 2016-12 | USA | 73 |
| 2017-12 | USA | 72 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
With the mongo translator, you will get an error at row-level as shown below:
| DATE | COUNTRY | VALUE | VALUE_EVOL_ABS |
|---|---|---|---|
| 2014-12 | France | 79 | |
| 2015-12 | France | 81 | Error … |
| 2016-12 | France | 77 | Error … |
| 2017-12 | France | 75 | Error … |
| 2014-12 | USA | 74 | |
| 2015-12 | USA | 74 | Error … |
| 2016-12 | USA | 73 | Error … |
| 2017-12 | USA | 72 | Error … |
The pandas translator will just return an error, and you will not get any data.
Example 4: Complete configuration with index columns
Input dataset:
| DATE | COUNTRY | VALUE |
|---|---|---|
| 2014-12 | France | 79 |
| 2015-12 | France | 81 |
| 2016-12 | France | 77 |
| 2017-12 | France | 75 |
| 2019-12 | France | 78 |
| 2020-12 | France | 88 |
| 2014-12 | USA | 74 |
| 2015-12 | USA | 74 |
| 2016-12 | USA | 73 |
| 2017-12 | USA | 72 |
| 2018-12 | USA | 75 |
| 2020-12 | USA | 76 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL',
}
Output dataset:
| DATE | COUNTRY | VALUE | MY_EVOL |
|---|---|---|---|
| 2014-12 | France | 79 | |
| 2015-12 | France | 81 | 2 |
| 2016-12 | France | 77 | -4 |
| 2017-12 | France | 75 | -2 |
| 2019-12 | France | 78 | |
| 2020-12 | France | 88 | 10 |
| 2014-12 | USA | 74 | |
| 2015-12 | USA | 74 | 0 |
| 2016-12 | USA | 73 | -1 |
| 2017-12 | USA | 72 | -1 |
| 2018-12 | USA | 75 | 3 |
| 2020-12 | USA | 76 | |
fillna step
Replace null values by a given value in specified columns.
{
name: 'fillna',
columns: ["foo", "bar"],
value: 0
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column (in the singular) parameter is deprecated and is supported only for
retrocompatibility purposes.
An old-fashioned step looks like this:
{
name: 'fillna',
column: 'foo',
value: 0
}
Example
Input dataset:
| Company | Group | Value | KPI |
|---|---|---|---|
| Company 1 | Group 1 | 13 | |
| Company 2 | Group 1 | | 12 |
| Company 3 | Group 1 | 20 | 40 |
| Company 4 | Group 2 | 1 | |
| Company 5 | Group 2 | | 38 |
| Company 6 | Group 2 | 5 | 4 |
Step configuration:
{
name: 'fillna',
columns: ["Value", "KPI"],
value: 0
}
Output dataset:
| Company | Group | Value | KPI |
|---|---|---|---|
| Company 1 | Group 1 | 13 | 0 |
| Company 2 | Group 1 | 0 | 12 |
| Company 3 | Group 1 | 20 | 40 |
| Company 4 | Group 2 | 1 | 0 |
| Company 5 | Group 2 | 0 | 38 |
| Company 6 | Group 2 | 5 | 4 |
filter step
Filter out lines that don’t match a filter definition.
{
name: 'filter',
condition: {
column: 'my-column',
value: 42,
operator: 'ne'
}
}
operator is optional, and defaults to eq. Allowed operators are eq, ne,
gt, ge, lt, le, in, nin, matches, notmatches, isnull or notnull.
value can be an arbitrary value depending on the selected operator (e.g a list
when used with the in operator, or null when used with the isnull
operator).
matches and notmatches operators are used to test value against a regular expression.
Conditions can be grouped and nested with logical operators and and or.
{
name: 'filter',
condition: {
and: [
{
column: 'my-column',
value: 42,
operator: 'ge'
},
{
column: 'my-column',
value: 118,
operator: 'le'
},
{
or: [
{
column: 'my-other-column',
value: 'blue',
operator: 'eq'
},
{
column: 'my-other-column',
value: 'red',
operator: 'eq'
}
]
}
]
}
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Relative dates
Date values can be relative to the moment when the query is executed.
This is expressed by using a RelativeDate object instead of the value, of the form:
{
quantity: Number // can be positive (future) or negative (past)
duration: 'year' | 'quarter' | 'month' | 'week' | 'day'
}
This feature is supported by the following backends:
- Mongo 5.0
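For intuition, here is a hypothetical Python helper showing how such an object could resolve to a concrete date at query time (assuming python-dateutil; the backend resolves relative dates natively):

from datetime import datetime
from dateutil.relativedelta import relativedelta

def resolve_relative_date(quantity, duration, now=None):
    # Translate a RelativeDate object into a concrete datetime relative to 'now'.
    now = now or datetime.utcnow()
    unit = {'year': 'years', 'quarter': 'months', 'month': 'months',
            'week': 'weeks', 'day': 'days'}[duration]
    if duration == 'quarter':
        quantity *= 3  # a quarter is three months
    return now + relativedelta(**{unit: quantity})

# e.g. resolve_relative_date(-7, 'day') resolves to seven days in the past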
formula step
Add a computation based on a formula. Usually column names do not need to be
escaped, unless they include whitespaces, in which case you’ll need to use
brackets ‘[]’ (e.g. [myColumn]). Any string escaped with quotes (', ", ''', """)
will be considered a string literal.
{
name: 'formula',
new_column: 'result', // if the column already exists, overwrites it
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Supported operators
The following operators are supported by the formula step (note that a value can be a column name or a literal, such as 42 or foo).
- +: does an addition of two numeric values. See the concatenate step to append strings.
- -: does a subtraction of two numeric values. See the replace step to remove a part of a string.
- *: multiplies two numeric values.
- /: divides a numeric value by another. Divisions by zero will return null.
- %: returns the rest of an integer division. Divisions by zero will return null.
Example 1: Basic usage
Input dataset:
| Label | Value1 | Value2 | Value3 | Value4 |
|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 |
| Label 2 | 1 | 13 | 7 | 3 |
| Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Output dataset:
| Label | Value1 | Value2 | Value3 | Value4 | Result |
|---|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 | 2 |
| Label 2 | 1 | 13 | 7 | 3 | -4 |
| Label 3 | 5 | 20 | 5 | 2 | 1 |
Example 2: Column name with whitespaces
Input dataset:
| Label | Value1 | Value2 | Value3 | Value 4 |
|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 |
| Label 2 | 1 | 13 | 7 | 3 |
| Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - [Value 4] * 2'
}
Output dataset:
| Label | Value1 | Value2 | Value3 | Value 4 | Result |
|---|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 | 2 |
| Label 2 | 1 | 13 | 7 | 3 | -4 |
| Label 3 | 5 | 20 | 5 | 2 | 1 |
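For intuition, the bracket escaping is similar to pandas’ backtick syntax in DataFrame.eval. A sketch (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Value1': [10, 1, 5], 'Value2': [2, 13, 20],
                   'Value3': [3, 7, 5], 'Value 4': [1, 3, 2]})

# Column names containing whitespace are escaped with backticks, like '[]' in the step.
df['Result'] = df.eval('(Value1 + Value2) / Value3 - `Value 4` * 2')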
hierarchy step
Hierarchy for geographical data.
This step dissolves data at every level of the hierarchy, and adds a hierarchy_level column indicating the level of each row (0 being the lowest granularity, i.e. the highest level).
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
This step is supported by the following backends:
- Pandas (python)
Example
Input dataset:
| Country | City | geometry | Population |
|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 |
| Country 2 | City 2 | Polygon | 50_000 |
| Country 2 | City 3 | Polygon | 200_000 |
| Country 1 | City 4 | Polygon | 30_000 |
| Country 2 | City 5 | Polygon | 25_000 |
| Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
Output dataset:
| Country | City | geometry | Population | hierarchy_level |
|---|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 | 2 |
| Country 2 | City 2 | Polygon | 50_000 | 2 |
| Country 2 | City 3 | Polygon | 200_000 | 2 |
| Country 1 | City 4 | Polygon | 30_000 | 2 |
| Country 2 | City 5 | Polygon | 25_000 | 2 |
| Country 1 | City 6 | Polygon | 10_000 | 2 |
| Country 1 | null | MultiPolygon | null | 1 |
| Country 2 | null | MultiPolygon | null | 1 |
| null | null | MultiPolygon | null | 0 |
ifthenelse step
Creates a new column whose values will depend on a condition expressed on existing columns.
The condition is expressed in the if parameter with a condition object, which
is the same object expected by the condition parameter of the
filter step. Conditions can be grouped and nested with logical
operators and and or.
The then parameter only supports a string, that will be interpreted as a
formula (cf. formula step). If you want it to be interpreted
strictly as a string and not a formula, you must escape the string with quotes
(e.g. ‘“this is a text”’).
if...then...else blocks can be nested as the else parameter
supports either a string that will be interpreted as a formula (cf.
formula step), or a nested if...then...else object.
{
name: 'ifthenelse',
newColumn: '', // the name of the new column to be created
if: { column: '', value: '', operator: 'eq' }, // a condition, same object as in the filter step
then: '', // a string that will be interpreted as a formula
else: '', // a string interpreted as a formula or a nested if...then...else object
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
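For intuition, the nested configuration of the example below can be pictured as nested numpy.where calls (a sketch, not the actual backend implementation):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Label': ['Label 1', 'Label 2', 'Label 3'], 'number': [-2, 2, 0]})

# if number == 0 then 'zero' else (if number < 0 then number * -1 else number)
df['result'] = np.where(df['number'] == 0, 'zero',
                        np.where(df['number'] < 0, df['number'] * -1, df['number']))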
Example
Input dataset:
| Label | number |
|---|---|
| Label 1 | -2 |
| Label 2 | 2 |
| Label 3 | 0 |
Step configuration:
{
name: 'ifthenelse',
newColumn: 'result',
if: { column: 'number', value: 0, operator: 'eq' },
then: '"zero"',
else: {
if: { column: 'number', value: 0, operator: 'lt' },
then: 'number * -1',
else: 'number'
}
}
Output dataset:
| Label | number | result |
|---|---|---|
| Label 1 | -2 | 2 |
| Label 2 | 2 | 2 |
| Label 3 | 0 | zero |
join step
Joins a dataset to the current dataset, i.e. brings columns from the former into
the latter, and matches rows based on column correspondence. It is similar to
a JOIN clause in SQL, or to a VLOOKUP in Excel. The joined dataset is the
result from the query of the right_pipeline.
The join type can be:
- 'left': will keep every row of the current dataset and fill unmatched rows with null values,
- 'inner': will only keep rows that match rows of the joined dataset.
In the on parameter, you must specify 1 or more column couple(s) that will be
compared to determine row correspondence between the 2 datasets. The first
element of a couple is for the current dataset column, and the second for the
corresponding column in the right dataset to be joined. If you specify more than
1 couple, the matching rows will be those that find a correspondence between the
2 datasets for every column couple specified (logical 'AND').
Weaverbird allows you to save pipelines referenced by name in the Vuex store
of the application. You can then call them by their unique names in this step.
{
name: 'join',
right_pipeline: 'somePipelineReference',
type: 'left', // or 'inner'
on: [
['currentDatasetColumn1', 'rightDatasetColumn1'],
['currentDatasetColumn2', 'rightDatasetColumn2'],
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Left join with one column couple as on parameter
Input dataset:
| Label | Value |
|---|---|
| Label 1 | 13 |
| Label 2 | 7 |
| Label 3 | 20 |
| Label 4 | 1 |
| Label 5 | 1 |
| Label 6 | 1 |
rightDataset (saved in the application Vuex store):
| Label | Group |
|---|---|
| Label 1 | Group 1 |
| Label 2 | Group 1 |
| Label 3 | Group 2 |
| Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'left',
on: [['Label', 'Label']]
}
Output dataset:
| Label | Value | Group |
|---|---|---|
| Label 1 | 13 | Group 1 |
| Label 2 | 7 | Group 1 |
| Label 3 | 20 | Group 2 |
| Label 4 | 1 | Group 2 |
| Label 5 | 1 | |
| Label 6 | 1 | |
Example 2: inner join with different column names in the on parameter
Input dataset:
| Label | Value |
|---|---|
| Label 1 | 13 |
| Label 2 | 7 |
| Label 3 | 20 |
| Label 4 | 1 |
| Label 5 | 1 |
| Label 6 | 1 |
rightDataset (saved in the application Vuex store):
| LabelRight | Group |
|---|---|
| Label 1 | Group 1 |
| Label 2 | Group 1 |
| Label 3 | Group 2 |
| Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'inner',
on: [['Label', 'LabelRight']]
}
Output dataset:
| Label | Value | LabelRight | Group |
|---|---|---|---|
| Label 1 | 13 | Label 1 | Group 1 |
| Label 2 | 7 | Label 2 | Group 1 |
| Label 3 | 20 | Label 3 | Group 2 |
| Label 4 | 1 | Label 4 | Group 2 |
fromdate step
Converts a date column into a string column based on a specified format.
{
name: 'fromdate',
column: 'myDateColumn',
format: '%Y-%m-%d' // for further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 2019-10-06T00:00:00.000Z | 13 |
| Company 1 | 2019-10-07T00:00:00.000Z | 7 |
| Company 1 | 2019-10-08T00:00:00.000Z | 20 |
| Company 2 | 2019-10-06T00:00:00.000Z | 1 |
| Company 2 | 2019-10-07T00:00:00.000Z | 10 |
| Company 2 | 2019-10-08T00:00:00.000Z | 5 |
Step configuration:
{
name: 'fromdate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 06/10/2019 | 13 |
| Company 1 | 07/10/2019 | 7 |
| Company 1 | 08/10/2019 | 20 |
| Company 2 | 06/10/2019 | 1 |
| Company 2 | 07/10/2019 | 10 |
| Company 2 | 08/10/2019 | 5 |
lowercase step
Converts a string column to lowercase.
Note: Mongo’s $toLower operator does not support accents. If you need to
lowercase accented characters with Mongo, use a replacetext step after
lowercase.
{
name: 'lowercase',
column: 'foo',
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| LABEL 1 | Group 1 | 13 |
| LABEL 2 | Group 1 | 7 |
| LABEL 3 | Group 1 | 20 |
Step configuration:
{
name: 'lowercase',
column: 'Label',
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| label 1 | Group 1 | 13 |
| label 2 | Group 1 | 7 |
| label 3 | Group 1 | 20 |
movingaverage step
Compute the moving average based on a value column, a reference column to sort (usually a date column) and a moving window (in number of rows, i.e. data points). If needed, the computation can be performed by group of rows. The computation result is added in a new column.
{
name: 'movingaverage',
valueColumn: 'value',
columnToSort: 'dates',
movingWindow: 12,
groups: ['foo', 'bar'], // optional
newColumnName: 'myNewColumn' // optional, '<originalColumnName>_MOVING_AVG' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
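A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def moving_average(df, value_column, column_to_sort, window, groups=None):
    # Sort, then average over a rolling window of `window` rows (per group if any).
    df = df.sort_values(column_to_sort)
    def rolling_mean(s):
        return s.rolling(window).mean()
    df[f'{value_column}_MOVING_AVG'] = (
        df.groupby(groups)[value_column].transform(rolling_mean)
        if groups else rolling_mean(df[value_column])
    )
    return df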
Example 1: Basic usage
Input dataset:
| DATE | VALUE |
|---|---|
| 2018-01-01 | 75 |
| 2018-01-02 | 80 |
| 2018-01-03 | 82 |
| 2018-01-04 | 83 |
| 2018-01-05 | 80 |
| 2018-01-06 | 86 |
| 2018-01-07 | 79 |
| 2018-01-08 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 2,
}
Output dataset:
| DATE | VALUE | VALUE_MOVING_AVG |
|---|---|---|
| 2018-01-01 | 75 | null |
| 2018-01-02 | 80 | 77.5 |
| 2018-01-03 | 82 | 81 |
| 2018-01-04 | 83 | 82.5 |
| 2018-01-05 | 80 | 81.5 |
| 2018-01-06 | 86 | 83 |
| 2018-01-07 | 79 | 82.5 |
| 2018-01-08 | 76 | 77.5 |
Example 2: with groups and custom newColumnName
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | 2018-01-01 | 75 |
| France | 2018-01-02 | 80 |
| France | 2018-01-03 | 82 |
| France | 2018-01-04 | 83 |
| France | 2018-01-05 | 80 |
| France | 2018-01-06 | 86 |
| USA | 2018-01-01 | 69 |
| USA | 2018-01-02 | 73 |
| USA | 2018-01-03 | 73 |
| USA | 2018-01-04 | 75 |
| USA | 2018-01-05 | 70 |
| USA | 2018-01-06 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 3,
groups: ['COUNTRY'],
newColumnName: 'ROLLING_AVERAGE'
}
Output dataset:
| COUNTRY | DATE | VALUE | ROLLING_AVERAGE |
|---|---|---|---|
| France | 2018-01-01 | 75 | null |
| France | 2018-01-02 | 80 | null |
| France | 2018-01-03 | 82 | 79 |
| France | 2018-01-04 | 83 | 81.7 |
| France | 2018-01-05 | 80 | 81.7 |
| France | 2018-01-06 | 86 | 83 |
| USA | 2018-01-01 | 69 | null |
| USA | 2018-01-02 | 73 | null |
| USA | 2018-01-03 | 73 | 71.7 |
| USA | 2018-01-04 | 75 | 73.7 |
| USA | 2018-01-05 | 70 | 72.7 |
| USA | 2018-01-06 | 76 | 73.7 |
percentage step
Compute the percentage of total, i.e. for every row, the value in column divided
by the total (the sum of every value in column). The computation can be performed
by group if specified. The result is written in a new column.
{
name: 'percentage',
column: 'bar',
group: ['foo'], // optional
newColumnName: 'myNewColumn' // optional, '<originalColumnName>_PCT' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
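A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def percentage(df, column, group=None, new_column_name=None):
    # Each value divided by the total of its column (or of its group).
    total = df.groupby(group)[column].transform('sum') if group else df[column].sum()
    df[new_column_name or f'{column}_PCT'] = df[column] / total
    return df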
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 5 |
| Label 2 | Group 1 | 10 |
| Label 3 | Group 1 | 15 |
| Label 4 | Group 2 | 2 |
| Label 5 | Group 2 | 7 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'percentage',
column: 'Value',
group: ['Group'],
newColumnName: 'Percentage'
}
Output dataset:
| Label | Group | Value | Percentage |
|---|---|---|---|
| Label 1 | Group 1 | 5 | 0.167 |
| Label 2 | Group 1 | 10 | 0.333 |
| Label 3 | Group 1 | 15 | 0.5 |
| Label 4 | Group 2 | 2 | 0.143 |
| Label 5 | Group 2 | 7 | 0.5 |
| Label 6 | Group 2 | 5 | 0.357 |
pivot step
Pivot rows into columns around a given index (expressed as a combination of column(s)).
Values to be used as new column names are found in the column column_to_pivot.
Values to populate new columns are found in the column value_column.
The function used to aggregate data (when several rows are found by index group) must be
among sum, avg, count, min or max.
{
name: 'pivot',
index: ['column_1', 'column_2'],
column_to_pivot: 'column_3',
value_column: 'column_4',
agg_function: 'sum',
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
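For intuition, this step maps closely onto pandas’ pivot_table (a sketch, not the actual backend implementation):

import pandas as pd

def pivot_step(df, index, column_to_pivot, value_column, agg_function):
    # 'avg' is spelled 'mean' in pandas; the other functions share the same name.
    aggfunc = 'mean' if agg_function == 'avg' else agg_function
    pivoted = pd.pivot_table(df, index=index, columns=column_to_pivot,
                             values=value_column, aggfunc=aggfunc)
    return pivoted.reset_index()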
Example:
Input dataset:
| Label | Country | Value |
|---|---|---|
| Label 1 | Country1 | 13 |
| Label 2 | Country1 | 7 |
| Label 3 | Country1 | 20 |
| Label 1 | Country2 | 1 |
| Label 2 | Country2 | 10 |
| Label 3 | Country2 | 5 |
| Label 3 | Country2 | 1 |
Step configuration:
{
name: 'pivot',
index: ['Label'],
column_to_pivot: 'Country',
value_column: 'Value',
agg_function: 'sum',
}
Output dataset:
| Label | Country1 | Country2 |
|---|---|---|
| Label 1 | 13 | 1 |
| Label 2 | 7 | 10 |
| Label 3 | 20 | 6 |
statistics step
Compute statistics of a column.
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
Output dataset:
| average | count | median |
|---|---|---|
| 9.33333 | 6 | 8.5 |
rank step
This step allows computing a rank column based on a value column that can be sorted in ascending or descending order. The ranking can be computed by group.
There are 2 ranking methods available, illustrated by the following examples:
- standard: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 5, 5, 7]
- dense: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 3, 3, 4]
(The dense method is basically the same as the standard method, except that the rank always increases by 1 at most.)
{
name: 'rank',
valueCol: 'VALUE', // The value column that will be ordered to determine rank
order: 'desc', // How to order the value column to determine ranking.
// either asc(ending) or desc(ending), 'desc' by default
method: 'standard', // either 'standard' or 'dense', as explained above
groupby: ['foo', 'bar'], // specify columns if you want to group the ranking computation
newColumnName: 'columnA', // specify the new column name to be created (by default '<yourValueColumn>_RANK')
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
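For intuition, both methods map onto pandas’ rank: 'min' matches standard and 'dense' matches dense. A sketch (an illustration, not the actual backend implementation):

import pandas as pd

def rank_step(df, value_col, order='desc', method='standard', groupby=None,
              new_column_name=None):
    # Rank the value column (per group if any), then sort by it.
    target = df.groupby(groupby)[value_col] if groupby else df[value_col]
    ranks = target.rank(method='min' if method == 'standard' else 'dense',
                        ascending=(order == 'asc'))
    df[new_column_name or f'{value_col}_RANK'] = ranks.astype(int)
    return df.sort_values(value_col, ascending=(order == 'asc'))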
Example 1: Basic usage
Input dataset:
| COUNTRY | VALUE |
|---|---|
| FRANCE | 15 |
| FRANCE | 5 |
| FRANCE | 10 |
| FRANCE | 20 |
| FRANCE | 10 |
| FRANCE | 15 |
| USA | 20 |
| USA | 30 |
| USA | 20 |
| USA | 25 |
| USA | 15 |
| USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'desc',
method: 'standard',
}
Output dataset:
| COUNTRY | VALUE | VALUE_RANK |
|---|---|---|
| USA | 30 | 1 |
| USA | 25 | 2 |
| FRANCE | 20 | 3 |
| USA | 20 | 3 |
| USA | 20 | 3 |
| USA | 20 | 3 |
| FRANCE | 15 | 7 |
| FRANCE | 15 | 7 |
| USA | 15 | 7 |
| FRANCE | 10 | 10 |
| FRANCE | 10 | 10 |
| FRANCE | 5 | 12 |
Example 2: With more options
Input dataset:
| COUNTRY | VALUE |
|---|---|
| FRANCE | 15 |
| FRANCE | 5 |
| FRANCE | 10 |
| FRANCE | 20 |
| FRANCE | 10 |
| FRANCE | 15 |
| USA | 20 |
| USA | 30 |
| USA | 20 |
| USA | 25 |
| USA | 15 |
| USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'asc',
method: 'dense',
groupby: ['COUNTRY'],
newColumnName: 'MY_RANK',
}
Output dataset:
| COUNTRY | VALUE | MY_RANK |
|---|---|---|
| FRANCE | 5 | 1 |
| FRANCE | 10 | 2 |
| FRANCE | 10 | 2 |
| FRANCE | 15 | 3 |
| FRANCE | 15 | 3 |
| FRANCE | 20 | 4 |
| USA | 15 | 1 |
| USA | 20 | 2 |
| USA | 20 | 2 |
| USA | 20 | 2 |
| USA | 25 | 3 |
| USA | 30 | 4 |
rename step
Rename one or several columns.
The toRename parameter takes as input a list of 2-elements lists in the form
[‘oldColumnName’, ‘newColumnName’].
{
name: 'rename',
toRename: [
['oldCol1', 'newCol1'], // the first value is the column to be renamed, the second is the new column name
['oldCol2', 'newCol2']
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The oldname and newname parameters are deprecated and are supported only for
retrocompatibility purposes. When this step was first created, only 1 column at
a time could be renamed, and the step was configured via these 2 parameters.
An old-fashioned step with oldname set to ‘old’ and newname set to ‘new’
is equivalent to a new-fashioned step with the toRename parameter set to [['old', 'new']].
So an old-fashioned step looks like this:
{
name: 'rename',
oldname: 'old-column-name',
newname: 'new-column-name'
}
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'rename',
oldname: 'Label',
newname: 'Company'
}
Output dataset:
| Company | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
replace step
Replace one or several values in a column.
A replace step has the following structure:
{
name: 'replace',
search_column: "column_1",
to_replace: [
['foo', 'bar'], // The first value is the one to be replaced, the second is the new value
[42, 0]
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | Fr |
| Company 2 | UK |
Step configuration:
{
name: 'replace',
search_column: "COUNTRY",
to_replace: [
['Fr', 'France'],
['UK', 'United Kingdom']
]
}
Output dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | France |
| Company 2 | United Kingdom |
replacetext step
Replace a substring in a column.
A replace-text step has the following structure:
{
name: 'replacetext',
search_column: "column_1",
old_str: 'foo',
new_str: 'bar'
}
This step is supported by the following backends:
- Mongo 5.0
- Pandas (python)
- AWS Athena
- Google Big Query
- MySQL
- PostgreSQL
- Redshift
- Snowflake
Example
Input dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | Fr is boring |
| Company 2 | UK |
Step configuration:
{
name: 'replacetext',
search_column: "COUNTRY",
old_str: 'Fr',
new_str: 'France',
}
Output dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | France is boring |
| Company 2 | UK |
rollup step
Use this step if you need to compute aggregated data at every level of a
hierarchy, specified as a series of columns from top to bottom level.
The output data structure stacks the data of every level of the hierarchy,
specifying for every row the label, level and parent in dedicated columns.
Aggregated rows can be computed using either sum, average, count, count distinct, min, max, first or last.
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'], // Hierarchical columns, from top to bottom level
aggregations: [ // The columns to be aggregated, and how
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
],
groupby: ['date'], // optional, if needing to segment the rollup computation by group
labelCol: 'label', // optional, name of the output label column, 'label' by default
levelCol: 'level', // optional, name of the output level column, 'level' by default
childLevelCol: 'child_level', // optional, name of the next output level column, 'child_level' by default
parentLabelCol: 'parent', // optional, name of the output parent column, 'parent' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column and newcolumn (singular) properties of the aggregation
parameter are deprecated and supported only for backward-compatibility purposes.
When this step was first created, only one column at a time could be aggregated
for a given aggregation.
Now columns and newcolumns are lists of columns, allowing the same aggregation
function to be applied to several columns at once.
An old-fashioned step looked like this:
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1: Basic configuration
Input dataset:
| CITY | COUNTRY | CONTINENT | YEAR | VALUE |
|---|---|---|---|---|
| Paris | France | Europe | 2018 | 10 |
| Bordeaux | France | Europe | 2018 | 5 |
| Barcelona | Spain | Europe | 2018 | 8 |
| Madrid | Spain | Europe | 2018 | 3 |
| Boston | USA | North America | 2018 | 12 |
| New-York | USA | North America | 2018 | 21 |
| Montreal | Canada | North America | 2018 | 10 |
| Ottawa | Canada | North America | 2018 | 7 |
| Paris | France | Europe | 2019 | 13 |
| Bordeaux | France | Europe | 2019 | 8 |
| Barcelona | Spain | Europe | 2019 | 11 |
| Madrid | Spain | Europe | 2019 | 6 |
| Boston | USA | North America | 2019 | 15 |
| New-York | USA | North America | 2019 | 24 |
| Montreal | Canada | North America | 2019 | 10 |
| Ottawa | Canada | North America | 2019 | 13 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE'],
aggfunction: 'sum',
columns: ['VALUE']
},
],
}
Output dataset:
| CITY | COUNTRY | CONTINENT | label | level | child_level | parent | VALUE |
|---|---|---|---|---|---|---|---|
| | | Europe | Europe | CONTINENT | COUNTRY | | 64 |
| | | North America | North America | CONTINENT | COUNTRY | | 112 |
| | France | Europe | France | COUNTRY | CITY | Europe | 36 |
| | Spain | Europe | Spain | COUNTRY | CITY | Europe | 28 |
| | USA | North America | USA | COUNTRY | CITY | North America | 72 |
| | Canada | North America | Canada | COUNTRY | CITY | North America | 40 |
| Paris | France | Europe | Paris | CITY | | France | 23 |
| Bordeaux | France | Europe | Bordeaux | CITY | | France | 13 |
| Barcelona | Spain | Europe | Barcelona | CITY | | Spain | 19 |
| Madrid | Spain | Europe | Madrid | CITY | | Spain | 9 |
| Boston | USA | North America | Boston | CITY | | USA | 27 |
| New-York | USA | North America | New-York | CITY | | USA | 45 |
| Montreal | Canada | North America | Montreal | CITY | | Canada | 20 |
| Ottawa | Canada | North America | Ottawa | CITY | | Canada | 20 |
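To illustrate the stacking logic of this step, here is a minimal pandas sketch on a reduced sample (an illustration of the idea only; it simplifies column handling and is not the Pandas backend's actual implementation):
```python
import pandas as pd

df = pd.DataFrame({
    "CONTINENT": ["Europe", "Europe", "North America"],
    "COUNTRY": ["France", "Spain", "USA"],
    "CITY": ["Paris", "Madrid", "Boston"],
    "VALUE": [10, 3, 12],
})
hierarchy = ["CONTINENT", "COUNTRY", "CITY"]
levels = []
for depth in range(1, len(hierarchy) + 1):
    keys = hierarchy[:depth]
    level = df.groupby(keys, as_index=False)["VALUE"].sum()
    level["label"] = level[keys[-1]]       # the row's own label
    level["level"] = keys[-1]              # which hierarchy level the row belongs to
    if depth > 1:
        level["parent"] = level[keys[-2]]  # label one level up
    levels.append(level)
# stack every level of the hierarchy, as in the output above
result = pd.concat(levels, ignore_index=True)
```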
Example 2: Configuration with optional parameters
Input dataset:
| CITY | COUNTRY | CONTINENT | YEAR | VALUE | COUNT |
|---|---|---|---|---|---|
| Paris | France | Europe | 2018 | 10 | 1 |
| Bordeaux | France | Europe | 2018 | 5 | 1 |
| Barcelona | Spain | Europe | 2018 | 8 | 1 |
| Madrid | Spain | Europe | 2018 | 3 | 1 |
| Boston | USA | North America | 2018 | 12 | 1 |
| New-York | USA | North America | 2018 | 21 | 1 |
| Montreal | Canada | North America | 2018 | 10 | 1 |
| Ottawa | Canada | North America | 2018 | 7 | 1 |
| Paris | France | Europe | 2019 | 13 | 1 |
| Bordeaux | France | Europe | 2019 | 8 | 1 |
| Barcelona | Spain | Europe | 2019 | 11 | 1 |
| Madrid | Spain | Europe | 2019 | 6 | 1 |
| Boston | USA | North America | 2019 | 15 | 1 |
| New-York | USA | North America | 2019 | 24 | 1 |
| Montreal | Canada | North America | 2019 | 10 | 1 |
| Ottawa | Canada | North America | 2019 | 13 | 1 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE-sum', 'COUNT'],
aggfunction: 'sum',
columns: ['VALUE', 'COUNT']
},
{
newcolumns: ['VALUE-avg'],
aggfunction: 'avg',
columns: ['VALUE']
}
],
groupby: ['YEAR'],
labelCol: 'MY_LABEL',
levelCol: 'MY_LEVEL',
childLevelCol: 'MY_CHILD_LEVEL',
parentLabelCol: 'MY_PARENT',
}
Output dataset:
| CITY | COUNTRY | CONTINENT | YEAR | MY_LABEL | MY_LEVEL | MY_CHILD_LEVEL | MY_PARENT | VALUE-sum | VALUE-avg | COUNT |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Europe | 2018 | Europe | CONTINENT | COUNTRY | | 26 | 6.5 | 4 |
| | | North America | 2018 | North America | CONTINENT | COUNTRY | | 50 | 12.5 | 4 |
| | France | Europe | 2018 | France | COUNTRY | CITY | Europe | 15 | 7.5 | 2 |
| | Spain | Europe | 2018 | Spain | COUNTRY | CITY | Europe | 11 | 5.5 | 2 |
| | USA | North America | 2018 | USA | COUNTRY | CITY | North America | 33 | 16.5 | 2 |
| | Canada | North America | 2018 | Canada | COUNTRY | CITY | North America | 17 | 8.5 | 2 |
| Paris | France | Europe | 2018 | Paris | CITY | | France | 10 | 10 | 1 |
| Bordeaux | France | Europe | 2018 | Bordeaux | CITY | | France | 5 | 5 | 1 |
| Barcelona | Spain | Europe | 2018 | Barcelona | CITY | | Spain | 8 | 8 | 1 |
| Madrid | Spain | Europe | 2018 | Madrid | CITY | | Spain | 3 | 3 | 1 |
| Boston | USA | North America | 2018 | Boston | CITY | | USA | 12 | 12 | 1 |
| New-York | USA | North America | 2018 | New-York | CITY | | USA | 21 | 21 | 1 |
| Montreal | Canada | North America | 2018 | Montreal | CITY | | Canada | 10 | 10 | 1 |
| Ottawa | Canada | North America | 2018 | Ottawa | CITY | | Canada | 7 | 7 | 1 |
| | | Europe | 2019 | Europe | CONTINENT | COUNTRY | | 38 | 9.5 | 4 |
| | | North America | 2019 | North America | CONTINENT | COUNTRY | | 62 | 15.5 | 4 |
| | France | Europe | 2019 | France | COUNTRY | CITY | Europe | 21 | 10.5 | 2 |
| | Spain | Europe | 2019 | Spain | COUNTRY | CITY | Europe | 17 | 8.5 | 2 |
| | USA | North America | 2019 | USA | COUNTRY | CITY | North America | 39 | 19.5 | 2 |
| | Canada | North America | 2019 | Canada | COUNTRY | CITY | North America | 23 | 11.5 | 2 |
| Paris | France | Europe | 2019 | Paris | CITY | | France | 13 | 13 | 1 |
| Bordeaux | France | Europe | 2019 | Bordeaux | CITY | | France | 8 | 8 | 1 |
| Barcelona | Spain | Europe | 2019 | Barcelona | CITY | | Spain | 11 | 11 | 1 |
| Madrid | Spain | Europe | 2019 | Madrid | CITY | | Spain | 6 | 6 | 1 |
| Boston | USA | North America | 2019 | Boston | CITY | | USA | 15 | 15 | 1 |
| New-York | USA | North America | 2019 | New-York | CITY | | USA | 24 | 24 | 1 |
| Montreal | Canada | North America | 2019 | Montreal | CITY | | Canada | 10 | 10 | 1 |
| Ottawa | Canada | North America | 2019 | Ottawa | CITY | | Canada | 13 | 13 | 1 |
select step
Select columns. By default, every column of the input domain is kept. If a
select step is used, only the selected columns are kept in the output.
{
name: 'select',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'select',
columns: ['Value', 'Label']
}
Output dataset:
| Value | Label |
|---|---|
| 13 | Company 1 - Group 1 |
| 7 | Company 2 - Group 1 |
| 20 | Company 3 - Group 1 |
| 1 | Company 4 - Group 2 |
| 10 | Company 5 - Group 2 |
| 5 | Company 6 - Group 2 |
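The pandas equivalent is a plain column selection (a sketch on the example data, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Company": ["Company 1"], "Group": ["Group 1"],
                   "Value": [13], "Label": ["Company 1 - Group 1"]})
df = df[["Value", "Label"]]  # keep only the selected columns
```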
sort step
Sort values in one or several columns. Order can be either 'asc' or 'desc'.
When sorting on several columns, the order in which they are listed in columns matters.
{
name: 'sort',
columns: [{column: 'foo', order: 'asc'}, {column: 'bar', order: 'desc'}],
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'sort',
columns: [{ column: 'Group', order: 'asc'}, {column: 'Value', order: 'desc' }]
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
| Label 4 | Group 2 | 1 |
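A minimal pandas sketch of this multi-column sort, mirroring the configuration above (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 4"],
                   "Group": ["Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 1]})
# one ascending/descending flag per sort column, in the same order
df = df.sort_values(by=["Group", "Value"], ascending=[True, False])
```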
split step
Split a string column into several columns based on a delimiter.
{
name: 'split',
column: 'foo', // the column to split
delimiter: ' - ', // the delimiter can be a string of any length
number_cols_to_keep: 3, // the number of columns to keep resulting from the
// split (starting from first chunk)
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1
Input dataset:
| Label | Value |
|---|---|
| Label 1 - Group 1 - France | 13 |
| Label 2 - Group 1 - Spain | 7 |
| Label 3 - Group 1 - USA | 20 |
| Label 4 - Group 2 - France | 1 |
| Label 5 - Group 2 - Spain | 10 |
| Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 3,
}
Output dataset:
| Label_1 | Label_2 | Label_3 | Value |
|---|---|---|---|
| Label 1 | Group 1 | France | 13 |
| Label 2 | Group 1 | Spain | 7 |
| Label 3 | Group 1 | USA | 20 |
| Label 4 | Group 2 | France | 1 |
| Label 5 | Group 2 | Spain | 10 |
| Label 6 | Group 2 | USA | 5 |
Example 2: keeping fewer columns
Input dataset:
| Label | Value |
|---|---|
| Label 1 - Group 1 - France | 13 |
| Label 2 - Group 1 - Spain | 7 |
| Label 3 - Group 1 - USA | 20 |
| Label 4 - Group 2 - France | 1 |
| Label 5 - Group 2 - Spain | 10 |
| Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 2,
}
Output dataset:
| Label_1 | Label_2 | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
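A minimal pandas sketch of the same split-and-keep logic (an illustration of Example 2 on a reduced sample, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1 - Group 1 - France",
                             "Label 2 - Group 1 - Spain"],
                   "Value": [13, 7]})
number_cols_to_keep = 2
# split on the delimiter, then keep only the first chunks
parts = df["Label"].str.split(" - ", expand=True).iloc[:, :number_cols_to_keep]
parts.columns = [f"Label_{i + 1}" for i in range(parts.shape[1])]
df = pd.concat([parts, df.drop(columns="Label")], axis=1)
```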
simplify step
Simplifies geographical data.
When simplifying your data, every point that is closer than a specific distance to the previous one is suppressed. This step can be useful if you have a very precise shape for a country (such as one-meter precision), but want to quickly draw a map chart. In that case, you may want to simplify your data.
After simplification, no point will be closer than tolerance to the previous one. The unit of tolerance depends on the data's projection; for CRS projections it is generally expressed in meters. For more details, see the
GeoPandas documentation.
Step configuration:
{
name: 'simplify',
tolerance: 1.0,
}
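Since the step description points to GeoPandas, a minimal GeoPandas sketch of the idea might look like this (the input file name is hypothetical, and this is an illustration, not the step's actual implementation):
```python
import geopandas as gpd

# 'regions.geojson' is a hypothetical input file
gdf = gpd.read_file("regions.geojson")
# simplify each geometry; tolerance is expressed in the units of the layer's CRS
gdf["geometry"] = gdf.geometry.simplify(tolerance=1.0)
```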
substring step
Extract a substring from a string column. The substring begins at index
start_index (1-based) and stops at end_index (inclusive).
You can specify negative indexes; in that case, indexing starts from the end
of the string (with -1 being the last index of the string).
Please refer to the examples below for illustration.
Neither start_index nor end_index can be equal to 0.
{
name: 'substring',
column: 'foo',
start_index: 1, // 1 = first character
end_index: -1, // -1 = last character
newColumnName: 'myNewColumn' // optional, <originalColumnName>_SUBSTR by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: positive start_index and end_index
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: 1,
end_index: 4,
}
Output dataset:
| Label | Value | Label_SUBSTR |
|---|---|---|
| foo | 13 | foo |
| overflow | 7 | over |
| some_text | 20 | some |
| a_word | 1 | a_wo |
| toucan | 10 | touc |
| toco | 5 | toco |
Example 2: start_index is positive and end_index is negative
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: 2,
end_index: -2,
newColumnName: 'short_label',
}
Output dataset:
| Label | Value | short_label |
|---|---|---|
| foo | 13 | o |
| overflow | 7 | verflo |
| some_text | 20 | ome_tex |
| a_word | 1 | _wor |
| toucan | 10 | ouca |
| toco | 5 | oc |
Example 3: start_index and end_index are negative
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: -3,
end_index: -1,
}
Output dataset:
| Label | Value | Label_SUBSTR |
|---|---|---|
| foo | 13 | foo |
| overflow | 7 | low |
| some_text | 20 | ext |
| a_word | 1 | ord |
| toucan | 10 | can |
| toco | 5 | oco |
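The 1-based, inclusive bounds translate to Python slicing as in this minimal pandas sketch (the substring helper is a hypothetical name introduced for illustration, not the backend implementation):
```python
import pandas as pd

def substring(col: pd.Series, start_index: int, end_index: int) -> pd.Series:
    # translate the step's 1-based, inclusive bounds into Python slicing
    start = start_index - 1 if start_index > 0 else start_index
    stop = None if end_index == -1 else (end_index if end_index > 0 else end_index + 1)
    return col.str.slice(start, stop)

df = pd.DataFrame({"Label": ["foo", "overflow", "toucan"]})
df["short_label"] = substring(df["Label"], 2, -2)  # 'o', 'verflo', 'ouca'
```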
text step
Use this step to add a text column where every value will be equal to the
specified text.
{
name: 'text',
new_column: 'new', // if the column already exists, overwrites it
text: 'some text'
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Value1 |
|---|---|
| Label 1 | 10 |
| Label 2 | 1 |
| Label 3 | 5 |
Step configuration:
{
name: 'text',
new_column: 'KPI',
text: 'Sales'
}
Output dataset:
| Label | Value1 | KPI |
|---|---|---|
| Label 1 | 10 | Sales |
| Label 2 | 1 | Sales |
| Label 3 | 5 | Sales |
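In pandas terms this is a simple broadcast assignment (a sketch, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2"], "Value1": [10, 1]})
df["KPI"] = "Sales"  # broadcast the constant to every row
```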
todate step
Converts a string column into a date column based on a specified format.
{
name: 'todate',
column: 'myTextColumn',
format: '%Y-%m-%d' // Optional. If undefined, the backend tries to infer the date format.
// For further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
// Note: custom format is not supported in Mongo versions older than 4.0
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 06/10/2019 | 13 |
| Company 1 | 07/10/2019 | 7 |
| Company 1 | 08/10/2019 | 20 |
| Company 2 | 06/10/2019 | 1 |
| Company 2 | 07/10/2019 | 10 |
| Company 2 | 08/10/2019 | 5 |
Step configuration:
{
name: 'todate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 2019-10-06T00:00:00.000Z | 13 |
| Company 1 | 2019-10-07T00:00:00.000Z | 7 |
| Company 1 | 2019-10-08T00:00:00.000Z | 20 |
| Company 2 | 2019-10-06T00:00:00.000Z | 1 |
| Company 2 | 2019-10-07T00:00:00.000Z | 10 |
| Company 2 | 2019-10-08T00:00:00.000Z | 5 |
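A minimal pandas sketch of the same conversion (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Date": ["06/10/2019", "07/10/2019"]})
# with format omitted, pandas would try to infer it, like the step does
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
```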
top step
Return the top N rows by group if groups is specified, or over the full dataset otherwise.
{
name: 'top',
groups: ['foo'],
rank_on: 'bar',
sort: 'desc', // or 'asc'
limit: 10
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: top without groups, ascending order
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
rank_on: 'Value',
sort: 'asc',
limit: 3
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 4 | Group 2 | 1 |
| Label 6 | Group 2 | 5 |
| Label 2 | Group 1 | 7 |
Example 2: top with groups, descending order
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
groups: ['Group'],
rank_on: 'Value',
sort: 'desc',
limit: 1
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 5 | Group 2 | 10 |
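Both variants can be sketched in pandas as follows (an illustration of the two examples above, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 3", "Label 4"],
                   "Group": ["Group 1", "Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 20, 1]})
# without groups: the 3 lowest values (sort: 'asc', limit: 3)
top_asc = df.nsmallest(3, "Value")
# with groups: the best row per group (sort: 'desc', limit: 1)
top_by_group = (df.sort_values("Value", ascending=False)
                  .groupby("Group", sort=False)
                  .head(1))
```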
totals step
Append “total” rows to the dataset for specified dimensions. Computed rows result from an aggregation (either sum, average, count, count distinct, min, max, first or last).
{
name: 'totals',
// an array of objects: { totalColumn: <column to add total rows in>, totalRowsLabel: <the label of the added total rows>}
totalDimensions: [
{ totalColumn: 'foo', totalRowsLabel: 'Total foos' },
{ totalColumn: 'bar', totalRowsLabel: 'Total bars' }
],
aggregations: [ // The columns to be aggregated, and how
{
columns: ['value1', 'value2'],
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
},
{
columns: ['value1'],
aggfunction: 'avg',
newcolumns: ['avg_value1'],
}
],
groups: ['someDimension'] // specify columns if the totals computation needs to be scoped by group of rows
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: basic usage
Input dataset:
| COUNTRY | PRODUCT | YEAR | VALUE |
|---|---|---|---|
| France | product A | 2019 | 5 |
| USA | product A | 2019 | 10 |
| France | product B | 2019 | 10 |
| USA | product B | 2019 | 15 |
| France | product A | 2020 | 20 |
| USA | product A | 2020 | 20 |
| France | product B | 2020 | 30 |
| USA | product B | 2020 | 25 |
Step configuration:
{
name: 'totals',
totalDimensions: [{ totalColumn: 'COUNTRY', totalRowsLabel: 'All countries' }],
aggregations: [
{
columns: ['VALUE'],
aggfunction: 'sum',
newcolumns: ['VALUE'],
}
]
}
Output dataset:
| COUNTRY | PRODUCT | YEAR | VALUE |
|---|---|---|---|
| France | product A | 2019 | 5 |
| USA | product A | 2019 | 10 |
| France | product B | 2019 | 10 |
| USA | product B | 2019 | 15 |
| France | product A | 2020 | 20 |
| USA | product A | 2020 | 20 |
| France | product B | 2020 | 30 |
| USA | product B | 2020 | 25 |
| All countries | null | null | 135 |
Example 2: With several totals and groups
Input dataset:
| COUNTRY | PRODUCT | YEAR | VALUE_1 | VALUE_2 |
|---|---|---|---|---|
| France | product A | 2019 | 5 | 50 |
| USA | product A | 2019 | 10 | 100 |
| France | product B | 2019 | 10 | 100 |
| USA | product B | 2019 | 15 | 150 |
| France | product A | 2020 | 20 | 200 |
| USA | product A | 2020 | 20 | 200 |
| France | product B | 2020 | 30 | 300 |
| USA | product B | 2020 | 25 | 250 |
Step configuration:
{
name: 'totals',
totalDimensions: [
{totalColumn: 'COUNTRY', totalRowsLabel: 'All countries'},
{totalColumn: 'PRODUCT', totalRowsLabel: 'All products'}
],
aggregations: [
{
columns: ['VALUE_1', 'VALUE_2'],
aggfunction: 'sum',
newcolumns: ['VALUE_1-sum', 'VALUE_2'],
},
{
columns: ['VALUE_1'],
aggfunction: 'avg',
newcolumns: ['VALUE_1-avg'],
}
],
groups: ['YEAR']
}
Output dataset:
| COUNTRY | PRODUCT | YEAR | VALUE_2 | VALUE_1-sum | VALUE_1-avg |
|---|---|---|---|---|---|
| France | product A | 2019 | 50 | 5 | 5 |
| USA | product A | 2019 | 100 | 10 | 10 |
| France | product B | 2019 | 100 | 10 | 10 |
| USA | product B | 2019 | 150 | 15 | 15 |
| France | product A | 2020 | 200 | 20 | 20 |
| USA | product A | 2020 | 200 | 20 | 20 |
| France | product B | 2020 | 300 | 30 | 30 |
| USA | product B | 2020 | 250 | 25 | 25 |
| USA | All products | 2020 | 450 | 45 | 22.5 |
| France | All products | 2020 | 500 | 50 | 25 |
| USA | All products | 2019 | 250 | 25 | 12.5 |
| France | All products | 2019 | 150 | 15 | 7.5 |
| All countries | product B | 2020 | 550 | 55 | 27.5 |
| All countries | product A | 2020 | 400 | 40 | 20 |
| All countries | product B | 2019 | 250 | 25 | 12.5 |
| All countries | product A | 2019 | 150 | 15 | 7.5 |
| All countries | All products | 2020 | 950 | 95 | 23.75 |
| All countries | All products | 2019 | 400 | 40 | 10 |
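The single-dimension case of Example 1 can be sketched in pandas like this (an illustration only; the multi-dimension and grouped cases add more aggregate rows in the same spirit, and this is not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"COUNTRY": ["France", "USA", "France", "USA"],
                   "PRODUCT": ["product A"] * 2 + ["product B"] * 2,
                   "YEAR": [2019] * 4,
                   "VALUE": [5, 10, 10, 15]})
# one total row over the whole dataset for the COUNTRY dimension;
# columns not referenced in the step (PRODUCT, YEAR) come out as NaN
total = pd.DataFrame([{"COUNTRY": "All countries", "VALUE": df["VALUE"].sum()}])
result = pd.concat([df, total], ignore_index=True)
```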
trim step
Trim leading and trailing spaces in one or several columns.
{
name: 'trim',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
This step is unsupported by the following backends:
- Mongo 3.6
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| ' Company 1 ' | Group 1 | 13 | Company 1 - Group 1 |
| ' Company 2 ' | Group 1 | 7 | Company 2 - Group 1 |
Step configuration:
{
name: 'trim',
columns: ['Company']
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| 'Company 1' | Group 1 | 13 | Company 1 - Group 1 |
| 'Company 2' | Group 1 | 7 | Company 2 - Group 1 |
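The pandas equivalent is a straightforward strip (a sketch, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Company": [" Company 1 ", " Company 2 "]})
df["Company"] = df["Company"].str.strip()  # removes leading and trailing spaces
```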
unpivot step
Unpivot a list of columns to rows.
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'], // columns to keep fixed around which to unpivot columns
unpivot: ['NB_CLIENTS', 'REVENUES'], // columns to unpivot
unpivot_column_name: 'KPI', // name of the new dimension column created after unpivot
value_column_name: 'VALUE', // name of the new value column created after unpivot
dropna: true // whether rows with null values should be discarded (true) or kept (false)
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: with dropna parameter set to true
Input dataset:
| COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
|---|---|---|---|
| Company 1 | France | 7 | 10 |
| Company 2 | France | 2 | |
| Company 1 | USA | 12 | 6 |
| Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: true
}
Output dataset:
| COMPANY | COUNTRY | KPI | VALUE |
|---|---|---|---|
| Company 1 | France | NB_CLIENTS | 7 |
| Company 1 | France | REVENUES | 10 |
| Company 2 | France | NB_CLIENTS | 2 |
| Company 1 | USA | NB_CLIENTS | 12 |
| Company 1 | USA | REVENUES | 6 |
| Company 2 | USA | NB_CLIENTS | 1 |
| Company 2 | USA | REVENUES | 3 |
Example 2: with dropna parameter set to false
Input dataset:
| COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
|---|---|---|---|
| Company 1 | France | 7 | 10 |
| Company 2 | France | 2 | |
| Company 1 | USA | 12 | 6 |
| Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: false
}
Output dataset:
| COMPANY | COUNTRY | KPI | VALUE |
|---|---|---|---|
| Company 1 | France | NB_CLIENTS | 7 |
| Company 1 | France | REVENUES | 10 |
| Company 2 | France | NB_CLIENTS | 2 |
| Company 2 | France | REVENUES | |
| Company 1 | USA | NB_CLIENTS | 12 |
| Company 1 | USA | REVENUES | 6 |
| Company 2 | USA | NB_CLIENTS | 1 |
| Company 2 | USA | REVENUES | 3 |
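This step maps naturally onto a pandas melt, as in this minimal sketch of the two examples (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"COMPANY": ["Company 1", "Company 2"],
                   "COUNTRY": ["France", "France"],
                   "NB_CLIENTS": [7, 2],
                   "REVENUES": [10, None]})
melted = df.melt(id_vars=["COMPANY", "COUNTRY"],
                 value_vars=["NB_CLIENTS", "REVENUES"],
                 var_name="KPI", value_name="VALUE")
melted = melted.dropna(subset=["VALUE"])  # only when dropna is true
```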
uppercase step
Converts a string column to uppercase.
Note: Mongo's $toUpper operator does not support accented characters. If the
strings you need to uppercase with Mongo contain accents, use a replacetext
step after the uppercase step.
{
name: 'uppercase',
column: 'foo',
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
Step configuration:
{
name: 'uppercase',
column: 'Label',
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| LABEL 1 | Group 1 | 13 |
| LABEL 2 | Group 1 | 7 |
| LABEL 3 | Group 1 | 20 |
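With the Pandas backend the operation amounts to (a sketch, not the backend implementation; Python's str.upper handles accented characters natively):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2"]})
df["Label"] = df["Label"].str.upper()
```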
uniquegroups step
Get the unique groups of values from one or several columns.
{
name: 'uniquegroups',
on: ['foo', 'bar']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 1 | Group 2 | 1 |
| Label 2 | Group 1 | 2 |
| Label 3 | Group 1 | 3 |
Step configuration:
{
name: 'uniquegroups',
on: ['Label', 'Group'],
}
Output dataset:
| Label | Group |
|---|---|
| Label 1 | Group 1 |
| Label 1 | Group 2 |
| Label 2 | Group 1 |
| Label 3 | Group 1 |
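A minimal pandas sketch of the same deduplication (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 1"],
                   "Group": ["Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 1]})
unique_groups = df[["Label", "Group"]].drop_duplicates()
```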
waterfall step
This step generates a data structure useful for building waterfall charts.
It breaks down the variation between two values (usually between two dates)
across entities. Entities are found in the labelsColumn, and can optionally
be regrouped under common parents found in the parentsColumn for drill-down
purposes.
{
name: 'waterfall',
valueColumn: 'VALUE', // the column of values
milestonesColumn: 'DATE', // the column where the starting and ending block labels
// are to be found (usually dates)
start: '2019', // the starting block label, to be found in `milestonesColumn`
end: '2020', // the ending block label, to be found in `milestonesColumn`
labelsColumn: 'PRODUCT', // the column of labels for the variations breakdown
groupby: ['COUNTRY'], // specify columns if you want to group the waterfall computation
sortBy: 'value', // either 'value' (default) or 'label' depending on whether to
// sort waterfall blocks by their value or label
order: 'desc', // either 'asc' or 'desc' (default) depending on whether to sort
// in ascending or descending order
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
| city | year | revenue |
|---|---|---|
| Bordeaux | 2019 | 135 |
| Boston | 2019 | 275 |
| New-York | 2019 | 115 |
| Paris | 2019 | 450 |
| Bordeaux | 2018 | 98 |
| Boston | 2018 | 245 |
| New-York | 2018 | 103 |
| Paris | 2018 | 385 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
sortBy: 'value',
order: 'desc',
}
Output dataset:
| LABEL_waterfall | TYPE_waterfall | revenue |
|---|---|---|
| 2018 | null | 831 |
| Paris | parent | 65 |
| Bordeaux | parent | 37 |
| Boston | parent | 30 |
| New-York | parent | 12 |
| 2019 | null | 975 |
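The core of this computation can be sketched in pandas as follows (an illustration on a reduced sample, assuming string years; it computes only the per-entity variations and milestone totals and leaves out the final stacking into typed rows, so it is not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"city": ["Bordeaux", "Paris", "Bordeaux", "Paris"],
                   "year": ["2019", "2019", "2018", "2018"],
                   "revenue": [135, 450, 98, 385]})
start, end = "2018", "2019"
by_city = df.pivot_table(index="city", columns="year",
                         values="revenue", aggfunc="sum")
# per-entity variation between the two milestones, sorted by value descending
variation = (by_city[end] - by_city[start]).sort_values(ascending=False)
start_total, end_total = by_city[start].sum(), by_city[end].sum()
```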
Example 2: With more options
Input dataset:
| city | country | product | year | revenue |
|---|---|---|---|---|
| Bordeaux | France | product1 | 2019 | 65 |
| Bordeaux | France | product2 | 2019 | 70 |
| Paris | France | product1 | 2019 | 210 |
| Paris | France | product2 | 2019 | 240 |
| Boston | USA | product1 | 2019 | 130 |
| Boston | USA | product2 | 2019 | 145 |
| New-York | USA | product1 | 2019 | 55 |
| New-York | USA | product2 | 2019 | 60 |
| Bordeaux | France | product1 | 2018 | 38 |
| Bordeaux | France | product2 | 2018 | 60 |
| Paris | France | product1 | 2018 | 175 |
| Paris | France | product2 | 2018 | 210 |
| Boston | USA | product1 | 2018 | 95 |
| Boston | USA | product2 | 2018 | 150 |
| New-York | USA | product1 | 2018 | 50 |
| New-York | USA | product2 | 2018 | 53 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
parentsColumn: 'country',
groupby: ['product'],
sortBy: 'label',
order: 'asc',
}
Output dataset:
| LABEL_waterfall | GROUP_waterfall | TYPE_waterfall | product | revenue |
|---|---|---|---|---|
| 2018 | 2018 | null | product1 | 358 |
| 2018 | 2018 | null | product2 | 473 |
| Bordeaux | France | child | product1 | 27 |
| Bordeaux | France | child | product2 | 10 |
| Boston | USA | child | product1 | 35 |
| Boston | USA | child | product2 | -5 |
| France | France | parent | product2 | 40 |
| France | France | parent | product1 | 62 |
| New-York | USA | child | product1 | 5 |
| New-York | USA | child | product2 | 7 |
| Paris | France | child | product1 | 35 |
| Paris | France | child | product2 | 30 |
| USA | USA | parent | product2 | 2 |
| USA | USA | parent | product1 | 40 |
| 2019 | 2019 | null | product2 | 515 |
| 2019 | 2019 | null | product1 | 460 |