Steps definition
This document describes the steps that can be used in our pipelines. A step is the unit of transformation generated by a user interaction with the visual query builder.
Unless explicitly specified, all steps are supported by all backends.
absolutevalue step
This step is meant to compute the absolute value of a given input column.
{
name: 'absolutevalue',
column: 'my-column',
new_column_name: 'my-new-column'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Value |
---|---|
Company 1 | -33 |
Company 2 | 0 |
Company 3 | 10 |
Step configuration:
{
name: 'absolutevalue',
column: "Value",
new_column_name: "My-absolute-value",
}
Output dataset:
Company | Value | My-absolute-value |
---|---|---|
Company 1 | -33 | 33 |
Company 2 | 0 | 0 |
Company 3 | 10 | 10 |
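With the Pandas backend in mind, the effect of this step can be sketched in a few lines of pandas (an illustration only, not the actual WeaverBird implementation; the helper name and sample data are ours):

```python
import pandas as pd

def absolute_value(df: pd.DataFrame, column: str, new_column_name: str) -> pd.DataFrame:
    # Add a new column holding the absolute value of `column`,
    # leaving the original column untouched
    out = df.copy()
    out[new_column_name] = out[column].abs()
    return out

df = pd.DataFrame({
    "Company": ["Company 1", "Company 2", "Company 3"],
    "Value": [-33, 0, 10],
})
result = absolute_value(df, "Value", "My-absolute-value")
```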
addmissingdates step
Add missing dates as new rows in a dates column. Exhaustive dates will range between the minimum and maximum date found in the dataset (or in each group if a group by logic is applied - see thereafter).
Added rows will be set to null in columns not referenced in the step configuration.
You should make sure to use a group by logic if you want to add missing dates in
independent groups of rows (e.g. you may need to add missing rows for every
country found in a “COUNTRY” column). And you should ensure that every date is
unique in every group of rows at the specified granularity, else you may get
inconsistent results. You can specify “group by” columns in the groups
parameter.
An addmissingdates step has the following structure:
{
name: 'addmissingdates',
datesColumn: 'DATE', // the date column with missing rows
datesGranularity: 'day', // the granularity of the date column (day, month or year)
groups: ['COUNTRY'] // (optional) columns used for group by logic if necessary
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: day granularity without groups
Input dataset:
DATE | VALUE |
---|---|
“2018-01-01T00:00:00.000Z” | 75 |
“2018-01-02T00:00:00.000Z” | 80 |
“2018-01-03T00:00:00.000Z” | 82 |
“2018-01-04T00:00:00.000Z” | 83 |
“2018-01-05T00:00:00.000Z” | 80 |
“2018-01-07T00:00:00.000Z” | 86 |
“2018-01-08T00:00:00.000Z” | 79 |
“2018-01-09T00:00:00.000Z” | 76 |
“2018-01-10T00:00:00.000Z” | 79 |
“2018-01-11T00:00:00.000Z” | 75 |
Here the day “2018-01-06” is missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day'
}
Output dataset:
DATE | VALUE |
---|---|
“2018-01-01T00:00:00.000Z” | 75 |
“2018-01-02T00:00:00.000Z” | 80 |
“2018-01-03T00:00:00.000Z” | 82 |
“2018-01-04T00:00:00.000Z” | 83 |
“2018-01-05T00:00:00.000Z” | 80 |
“2018-01-06T00:00:00.000Z” | |
“2018-01-07T00:00:00.000Z” | 86 |
“2018-01-08T00:00:00.000Z” | 79 |
“2018-01-09T00:00:00.000Z” | 76 |
“2018-01-10T00:00:00.000Z” | 79 |
“2018-01-11T00:00:00.000Z” | 75 |
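At day granularity without groups, the behaviour can be sketched in pandas by reindexing on an exhaustive date range — a simplified illustration, not the actual implementation (the helper name and sample data are ours; a group by logic would apply the same reindexing to each group separately):

```python
import pandas as pd

def add_missing_dates(df: pd.DataFrame, dates_column: str, freq: str = "D") -> pd.DataFrame:
    # Build the exhaustive range between the min and max date, then reindex:
    # rows added for missing dates are null in every other column
    full_range = pd.date_range(df[dates_column].min(), df[dates_column].max(), freq=freq)
    return (
        df.set_index(dates_column)
          .reindex(full_range)
          .rename_axis(dates_column)
          .reset_index()
    )

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04"]),
    "VALUE": [75, 80, 83],
})
result = add_missing_dates(df, "DATE")  # adds a null row for 2018-01-03
```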
Example 2: day granularity with groups
Input dataset:
COUNTRY | DATE | VALUE |
---|---|---|
France | “2018-01-01T00:00:00.000Z” | 75 |
France | “2018-01-02T00:00:00.000Z” | 80 |
France | “2018-01-03T00:00:00.000Z” | 82 |
France | “2018-01-04T00:00:00.000Z” | 83 |
France | “2018-01-05T00:00:00.000Z” | 80 |
France | “2018-01-07T00:00:00.000Z” | 86 |
France | “2018-01-08T00:00:00.000Z” | 79 |
France | “2018-01-09T00:00:00.000Z” | 76 |
France | “2018-01-10T00:00:00.000Z” | 79 |
France | “2018-01-11T00:00:00.000Z” | 85 |
USA | “2018-01-01T00:00:00.000Z” | 69 |
USA | “2018-01-02T00:00:00.000Z” | 73 |
USA | “2018-01-03T00:00:00.000Z” | 73 |
USA | “2018-01-05T00:00:00.000Z” | 75 |
USA | “2018-01-06T00:00:00.000Z” | 70 |
USA | “2018-01-07T00:00:00.000Z” | 76 |
USA | “2018-01-08T00:00:00.000Z” | 73 |
USA | “2018-01-09T00:00:00.000Z” | 70 |
USA | “2018-01-10T00:00:00.000Z” | 72 |
USA | “2018-01-12T00:00:00.000Z” | 78 |
Here the day “2018-01-06” is missing for “France” rows, and “2018-01-04” and “2018-01-11” are missing for “USA” rows.
Note that “2018-01-12” will not be considered as a missing row for “France” rows, because the latest date found for this group of rows is “2018-01-11” (even though “2018-01-12” is the latest date found for “USA” rows).
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day',
groups: ['COUNTRY']
}
Output dataset:
COUNTRY | DATE | VALUE |
---|---|---|
France | “2018-01-01T00:00:00.000Z” | 75 |
France | “2018-01-02T00:00:00.000Z” | 80 |
France | “2018-01-03T00:00:00.000Z” | 82 |
France | “2018-01-04T00:00:00.000Z” | 83 |
France | “2018-01-05T00:00:00.000Z” | 80 |
France | “2018-01-06T00:00:00.000Z” | |
France | “2018-01-07T00:00:00.000Z” | 86 |
France | “2018-01-08T00:00:00.000Z” | 79 |
France | “2018-01-09T00:00:00.000Z” | 76 |
France | “2018-01-10T00:00:00.000Z” | 79 |
France | “2018-01-11T00:00:00.000Z” | 85 |
USA | “2018-01-01T00:00:00.000Z” | 69 |
USA | “2018-01-02T00:00:00.000Z” | 73 |
USA | “2018-01-03T00:00:00.000Z” | 73 |
USA | “2018-01-04T00:00:00.000Z” | |
USA | “2018-01-05T00:00:00.000Z” | 75 |
USA | “2018-01-06T00:00:00.000Z” | 70 |
USA | “2018-01-07T00:00:00.000Z” | 76 |
USA | “2018-01-08T00:00:00.000Z” | 73 |
USA | “2018-01-09T00:00:00.000Z” | 70 |
USA | “2018-01-10T00:00:00.000Z” | 72 |
USA | “2018-01-11T00:00:00.000Z” | |
USA | “2018-01-12T00:00:00.000Z” | 78 |
Example 3: month granularity
Input dataset:
DATE | VALUE |
---|---|
“2019-01-01T00:00:00.000Z” | 74 |
“2019-02-01T00:00:00.000Z” | 73 |
“2019-03-01T00:00:00.000Z” | 68 |
“2019-04-01T00:00:00.000Z” | 71 |
“2019-06-01T00:00:00.000Z” | 74 |
“2019-07-01T00:00:00.000Z” | 74 |
“2019-08-01T00:00:00.000Z” | 73 |
“2019-09-01T00:00:00.000Z” | 72 |
“2019-10-01T00:00:00.000Z” | 75 |
“2019-12-01T00:00:00.000Z” | 76 |
Here “2019-05” and “2019-11” are missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'month',
}
Output dataset:
DATE | VALUE |
---|---|
“2019-01-01T00:00:00.000Z” | 74 |
“2019-02-01T00:00:00.000Z” | 73 |
“2019-03-01T00:00:00.000Z” | 68 |
“2019-04-01T00:00:00.000Z” | 71 |
“2019-05-01T00:00:00.000Z” | |
“2019-06-01T00:00:00.000Z” | 74 |
“2019-07-01T00:00:00.000Z” | 74 |
“2019-08-01T00:00:00.000Z” | 73 |
“2019-09-01T00:00:00.000Z” | 72 |
“2019-10-01T00:00:00.000Z” | 75 |
“2019-11-01T00:00:00.000Z” | |
“2019-12-01T00:00:00.000Z” | 76 |
aggregate step
Perform aggregations on one or several columns. Available aggregation functions are sum, average, count, count distinct, min, max, first and last.
An aggregation step has the following structure:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
// ...
],
keepOriginalGranularity: false, // whether to keep the original granularity, in that
// case computed aggregations will be added in new columns
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column and newcolumn (singular) properties of the aggregations parameter are deprecated and are supported only for retrocompatibility purposes.
When this step was first created, only one column at a time could be aggregated for a given aggregation. Now columns and newcolumns are lists of columns, allowing the same aggregation function to be applied to several columns at once.
An old-fashioned step looked like this:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1: keepOriginalGranularity set to false
Input dataset:
Label | Group | Value1 | Value2 |
---|---|---|---|
Label 1 | Group 1 | 13 | 10 |
Label 2 | Group 1 | 7 | 21 |
Label 3 | Group 1 | 20 | 4 |
Label 4 | Group 2 | 1 | 17 |
Label 5 | Group 2 | 9 | 12 |
Label 6 | Group 2 | 5 | 2 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Sum-Value1', 'Sum-Value2'],
aggfunction: 'sum',
columns: ['Value1', 'Value2']
},
{
newcolumns: ['Avg-Value1'],
aggfunction: 'avg',
columns: ['Value1']
}
],
keepOriginalGranularity: false,
}
Output dataset:
Group | Sum-Value1 | Sum-Value2 | Avg-Value1 |
---|---|---|---|
Group 1 | 40 | 35 | 13.333333 |
Group 2 | 16 | 31 | 5.333333 |
Example 2: keepOriginalGranularity set to true
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Total'],
aggfunction: 'sum',
columns: ['Value']
}
],
keepOriginalGranularity: true,
}
Output dataset:
Label | Group | Value | Total |
---|---|---|---|
Label 1 | Group 1 | 13 | 40 |
Label 2 | Group 1 | 7 | 40 |
Label 3 | Group 1 | 20 | 40 |
Label 4 | Group 2 | 1 | 16 |
Label 5 | Group 2 | 10 | 16 |
Label 6 | Group 2 | 5 | 16 |
append step
Appends one or several datasets resulting from other pipelines to the current dataset.
WeaverBird allows you to save pipelines, referenced by name in the Vuex store of the application. You can then call them by their unique names in this step.
{
name: 'append',
pipelines: ['pipeline1', 'pipeline2'] // the names of the pipelines to append
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
dataset1 (saved in the application Vuex store):
Label | Group | Value |
---|---|---|
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
dataset2 (saved in the application Vuex store):
Label | Group | Value |
---|---|---|
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'append',
pipelines: ['dataset1', 'dataset2']
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
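On the Pandas backend, appending amounts to a row-wise concatenation — a minimal sketch (sample data ours; in WeaverBird the other datasets come from named pipelines, not local variables):

```python
import pandas as pd

current = pd.DataFrame({"Label": ["Label 1", "Label 2"], "Value": [13, 7]})
dataset1 = pd.DataFrame({"Label": ["Label 3", "Label 4"], "Value": [20, 1]})
dataset2 = pd.DataFrame({"Label": ["Label 5", "Label 6"], "Value": [10, 5]})

# Append rows from the other datasets; columns are aligned by name
result = pd.concat([current, dataset1, dataset2], ignore_index=True)
```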
argmax step
Get row(s) matching the maximum value in a given column, by group if groups is specified.
{
name: 'argmax',
column: 'value', // column in which to search for max value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value'
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 3 | Group 1 | 20 |
Example 2: with groups
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value',
groups: ['Group']
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 3 | Group 1 | 20 |
Label 5 | Group 2 | 10 |
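A pandas sketch of both variants, keeping every row that ties with the maximum (illustrative only; sample data ours):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": [f"Label {i}" for i in range(1, 7)],
    "Group": ["Group 1"] * 3 + ["Group 2"] * 3,
    "Value": [13, 7, 20, 1, 10, 5],
})

# Without groups: keep every row matching the global maximum
global_argmax = df[df["Value"] == df["Value"].max()]

# With groups: keep every row matching the maximum of its own group
group_max = df.groupby("Group")["Value"].transform("max")
grouped_argmax = df[df["Value"] == group_max]
```

The argmin step is the mirror image, with `min` in place of `max`.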
argmin step
Get row(s) matching the minimum value in a given column, by group if groups is specified.
{
name: 'argmin',
column: 'value', // column in which to search for min value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value'
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 4 | Group 2 | 1 |
Example 2: with groups
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value',
groups: ['Group']
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 2 | Group 1 | 7 |
Label 4 | Group 2 | 1 |
comparetext step
Compares 2 string columns and returns true if the string values are equal, and false otherwise. The comparison is case-sensitive (see examples below).
{
name: 'comparetext',
newColumnName: 'NEW', // new column to be created, with the comparison result
strCol1: 'TEXT_1', // first string column to compare
strCol2: 'TEXT_2', // second string column to compare
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
TEXT_1 | TEXT_2 |
---|---|
France | Fr |
France | France |
France | france |
France | England |
France | USA |
Step configuration:
{
name: 'comparetext',
newColumnName: 'RESULT',
strCol1: 'TEXT_1',
strCol2: 'TEXT_2',
}
Output dataset:
TEXT_1 | TEXT_2 | RESULT |
---|---|---|
France | Fr | false |
France | France | true |
France | france | false |
France | England | false |
France | USA | false |
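In pandas, the case-sensitive comparison boils down to a vectorized equality test — a minimal sketch (sample data ours):

```python
import pandas as pd

df = pd.DataFrame({
    "TEXT_1": ["France", "France", "France"],
    "TEXT_2": ["Fr", "France", "france"],
})
# Element-wise, case-sensitive equality between the two string columns
df["RESULT"] = df["TEXT_1"] == df["TEXT_2"]
```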
concatenate step
This step concatenates several columns using a separator.
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ', // can be a string of any length
new_column_name: 'Label' // the new column in which to write the concatenation
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Group | Value |
---|---|---|
Company 1 | Group 1 | 13 |
Company 2 | Group 1 | 7 |
Company 3 | Group 1 | 20 |
Company 4 | Group 2 | 1 |
Company 5 | Group 2 | 10 |
Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ',
new_column_name: 'Label'
}
Output dataset:
Company | Group | Value | Label |
---|---|---|---|
Company 1 | Group 1 | 13 | Company 1 - Group 1 |
Company 2 | Group 1 | 7 | Company 2 - Group 1 |
Company 3 | Group 1 | 20 | Company 3 - Group 1 |
Company 4 | Group 2 | 1 | Company 4 - Group 2 |
Company 5 | Group 2 | 10 | Company 5 - Group 2 |
Company 6 | Group 2 | 5 | Company 6 - Group 2 |
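A pandas sketch of the concatenation (illustrative only; sample data ours — columns are cast to string first so that numeric columns can be concatenated too):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Company 1", "Company 2"],
    "Group": ["Group 1", "Group 1"],
})
# Cast every listed column to string, then join them row-wise with the separator
df["Label"] = df[["Company", "Group"]].astype(str).agg(" - ".join, axis=1)
```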
convert step
This step converts the data type of one or several columns.
{
name: 'convert',
columns: ['col1', 'col2'],
data_type: 'integer' // The data type to convert the column into. Can be either
// 'integer', 'float', 'text', 'date' or 'boolean'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
- Snowflake
In an effort to harmonize conversion behaviors across backends, the SQL translator implements some casts differently from the plain CAST AS method. Specifically, when casting float to integer, SQL's default behaviour rounds the result while the other backends truncate it; that is why TRUNCATE is used when converting float to integer. The same approach is applied when converting strings to integer (for dates represented as strings). As for the conversion of date to integer, it is handled by assuming the dataset's timestamps are in TIMESTAMP_NTZ format.
Example
Input dataset:
Company | Value |
---|---|
Company 1 | ‘13’ |
Company 2 | ‘7’ |
Company 3 | ‘20’ |
Company 4 | ‘1’ |
Company 5 | ‘10’ |
Company 6 | ‘5’ |
Step configuration:
{
name: 'convert',
columns: ['Value'],
data_type: 'integer'
}
Output dataset:
Company | Value |
---|---|
Company 1 | 13 |
Company 2 | 7 |
Company 3 | 20 |
Company 4 | 1 |
Company 5 | 10 |
Company 6 | 5 |
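The Pandas side of a conversion can be sketched with astype (illustrative only; sample data ours). Note that, consistent with the harmonized behaviour described above, pandas truncates towards zero when casting float to integer:

```python
import pandas as pd

df = pd.DataFrame({"Value": ["13", "7", "20"]})
# String to integer conversion
df["Value"] = df["Value"].astype(int)

# Float to integer truncates towards zero (it does not round)
truncated = pd.Series([1.9, -1.9]).astype(int)
```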
cumsum step
This step computes the cumulated sum of value columns, based on a reference column (usually dates) that is sorted in ascending order for the computation. The computation can be scoped by group if needed.
The toCumSum parameter takes as input a list of 2-element lists of the form ['valueColumn', 'newColumn'].
{
name: 'cumsum',
toCumSum: [['myValues', 'myCumsum']],
referenceColumn: 'myDates',
groupby: ['foo', 'bar'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
DATE | VALUE |
---|---|
2019-01 | 2 |
2019-02 | 5 |
2019-03 | 3 |
2019-04 | 8 |
2019-05 | 9 |
2019-06 | 6 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', '']], // an empty new column name defaults to '<column>_CUMSUM'
referenceColumn: 'DATE',
}
Output dataset:
DATE | VALUE | VALUE_CUMSUM |
---|---|---|
2019-01 | 2 | 2 |
2019-02 | 5 | 7 |
2019-03 | 3 | 10 |
2019-04 | 8 | 18 |
2019-05 | 9 | 27 |
2019-06 | 6 | 33 |
Example 2: With more advanced options
Input dataset:
COUNTRY | DATE | VALUE |
---|---|---|
France | 2019-01 | 2 |
France | 2019-02 | 5 |
France | 2019-03 | 3 |
France | 2019-04 | 8 |
France | 2019-05 | 9 |
France | 2019-06 | 6 |
USA | 2019-01 | 10 |
USA | 2019-02 | 6 |
USA | 2019-03 | 6 |
USA | 2019-04 | 4 |
USA | 2019-05 | 8 |
USA | 2019-06 | 7 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', 'MY_CUMSUM']],
referenceColumn: 'DATE',
groupby: ['COUNTRY'],
}
Output dataset:
COUNTRY | DATE | VALUE | MY_CUMSUM |
---|---|---|---|
France | 2019-01 | 2 | 2 |
France | 2019-02 | 5 | 7 |
France | 2019-03 | 3 | 10 |
France | 2019-04 | 8 | 18 |
France | 2019-05 | 9 | 27 |
France | 2019-06 | 6 | 33 |
USA | 2019-01 | 10 | 10 |
USA | 2019-02 | 6 | 16 |
USA | 2019-03 | 6 | 22 |
USA | 2019-04 | 4 | 26 |
USA | 2019-05 | 8 | 34 |
USA | 2019-06 | 7 | 41 |
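A pandas sketch of the grouped variant (illustrative only; sample data ours — sort by the reference column first, then accumulate within each group):

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["France"] * 3 + ["USA"] * 3,
    "DATE": ["2019-01", "2019-02", "2019-03"] * 2,
    "VALUE": [2, 5, 3, 10, 6, 6],
})

# Sort by the reference column inside each group, then accumulate
df = df.sort_values(["COUNTRY", "DATE"]).reset_index(drop=True)
df["MY_CUMSUM"] = df.groupby("COUNTRY")["VALUE"].cumsum()
```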
custom step
This step lets you define a custom query that can’t be expressed using the other existing steps.
{
name: 'custom',
query: '$group: {_id: ...}'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
Example: using Mongo query language
Input dataset:
Company | Group | Value |
---|---|---|
Company 1 | Group 1 | 13 |
Company 2 | Group 1 | 7 |
Company 3 | Group 1 | 20 |
Company 4 | Group 2 | 1 |
Company 5 | Group 2 | 10 |
Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'custom',
query: '$addFields: { Label: { $concat: [ "$Company", " - ", "$Group" ] } }'
}
Output dataset:
Company | Group | Value | Label |
---|---|---|---|
Company 1 | Group 1 | 13 | Company 1 - Group 1 |
Company 2 | Group 1 | 7 | Company 2 - Group 1 |
Company 3 | Group 1 | 20 | Company 3 - Group 1 |
Company 4 | Group 2 | 1 | Company 4 - Group 2 |
Company 5 | Group 2 | 10 | Company 5 - Group 2 |
Company 6 | Group 2 | 5 | Company 6 - Group 2 |
dateextract step
Extract date information (e.g. day, week, year). The following information can be extracted:
- year: extract ‘year’ from date
- month: extract ‘month’ from date
- day: extract ‘day of month’ from date
- week: extract ‘week number’ (ranging from 0 to 53) from date
- quarter: extract ‘quarter number’ from date (1 for Jan-Feb-Mar)
- dayOfWeek: extract ‘day of week’ (ranging from 1 for Sunday to 7 for Saturday) from date
- dayOfYear: extract ‘day of year’ from date
- isoYear: extract ‘year number’ in ISO 8601 format from date
- isoWeek: extract ‘week number’ in ISO 8601 format (ranging from 1 to 53) from date
- isoDayOfWeek: extract ‘day of week’ in ISO 8601 format (ranging from 1 for Monday to 7 for Sunday) from date
- firstDayOfYear: calendar date corresponding to the first day (1st of January) of the year
- firstDayOfMonth: calendar date corresponding to the first day of the month
- firstDayOfWeek: calendar date corresponding to the first day of the week
- firstDayOfQuarter: calendar date corresponding to the first day of the quarter
- firstDayOfIsoWeek: calendar date corresponding to the first day of the week in ISO 8601 format
- previousDay: calendar date one day before the target date
- firstDayOfPreviousYear: calendar date corresponding to the first day (1st of January) of the previous year
- firstDayOfPreviousMonth: calendar date corresponding to the first day of the previous month
- firstDayOfPreviousWeek: calendar date corresponding to the first day of the previous week
- firstDayOfPreviousQuarter: calendar date corresponding to the first day of the previous quarter
- firstDayOfPreviousISOWeek: calendar date corresponding to the first day of the previous ISO week
- previousYear: extract previous ‘year number’ from date
- previousMonth: extract previous ‘month number’ from date
- previousWeek: extract previous ‘week number’ from date
- previousQuarter: extract previous ‘quarter number’ from date
- previousISOWeek: extract previous ‘week number’ in ISO 8601 format from date
- hour: extract ‘hour’ from date
- minutes: extract ‘minutes’ from date
- seconds: extract ‘seconds’ from date
- milliseconds: extract ‘milliseconds’ from date
Here’s an example of such a step:
{
name: 'dateextract',
column: 'date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['date_year', 'date_month', 'date_day'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The operation and new_column_name parameters are deprecated and are supported for retrocompatibility purposes only.
An old-fashioned step looks like this:
{
name: 'dateextract',
column: 'date',
operation: 'day',
new_column_name: 'date_day',
}
Example
Input dataset:
Date |
---|
2019-10-30T00:00:00.000Z |
2019-10-15T00:00:00.000Z |
2019-10-01T00:00:00.000Z |
2019-09-30T00:00:00.000Z |
2019-09-15T00:00:00.000Z |
2019-09-01T00:00:00.000Z |
Step configuration:
{
name: 'dateextract',
column: 'Date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['Date_year', 'Date_month', 'Date_day'],
}
Output dataset:
Date | Date_year | Date_month | Date_day |
---|---|---|---|
2019-10-30T00:00:00.000Z | 2019 | 10 | 30 |
2019-10-15T00:00:00.000Z | 2019 | 10 | 15 |
2019-10-01T00:00:00.000Z | 2019 | 10 | 1 |
2019-09-30T00:00:00.000Z | 2019 | 9 | 30 |
2019-09-15T00:00:00.000Z | 2019 | 9 | 15 |
2019-09-01T00:00:00.000Z | 2019 | 9 | 1 |
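On the Pandas backend, the simple extracts map onto the `.dt` accessor — a sketch covering year, month and day only (the calendar-date and ISO variants need more logic; sample data ours):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-10-30", "2019-09-15"])})
# The pandas `.dt` accessor exposes the extracted date parts
df["Date_year"] = df["Date"].dt.year
df["Date_month"] = df["Date"].dt.month
df["Date_day"] = df["Date"].dt.day
```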
delete step
Delete one or several columns.
{
name: 'delete',
columns: ['my-column', 'some-other-column']
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Group | Value | Label |
---|---|---|---|
Company 1 | Group 1 | 13 | Company 1 - Group 1 |
Company 2 | Group 1 | 7 | Company 2 - Group 1 |
Company 3 | Group 1 | 20 | Company 3 - Group 1 |
Company 4 | Group 2 | 1 | Company 4 - Group 2 |
Company 5 | Group 2 | 10 | Company 5 - Group 2 |
Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'delete',
columns: ['Company', 'Group']
}
Output dataset:
Value | Label |
---|---|
13 | Company 1 - Group 1 |
7 | Company 2 - Group 1 |
20 | Company 3 - Group 1 |
1 | Company 4 - Group 2 |
10 | Company 5 - Group 2 |
5 | Company 6 - Group 2 |
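In pandas this is a plain column drop (illustrative sketch; sample data ours):

```python
import pandas as pd

df = pd.DataFrame({"Company": ["Company 1"], "Group": ["Group 1"], "Value": [13]})
# Drop the listed columns; the remaining columns are untouched
result = df.drop(columns=["Company", "Group"])
```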
dissolve step
Geographically dissolve data: rows sharing the same values in the group columns are merged, their geometries combined into a single (multi-)geometry, with optional aggregations on the other columns.
{
name: 'dissolve',
groups: ['my-column', 'some-other-column'],
include_nulls: true,
}
This step is supported by the following backends:
- Pandas (python)
Example without aggregations
Input dataset:
Country | City | geometry |
---|---|---|
Country 1 | City 1 | Polygon |
Country 2 | City 2 | Polygon |
Country 2 | City 3 | Polygon |
Country 1 | City 4 | Polygon |
Country 2 | City 5 | Polygon |
Country 1 | City 6 | Polygon |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
}
Output dataset:
Country | geometry |
---|---|
Country 1 | MultiPolygon |
Country 2 | MultiPolygon |
Example with aggregations
Input dataset:
Country | City | geometry | Population |
---|---|---|---|
Country 1 | City 1 | Polygon | 100_000 |
Country 2 | City 2 | Polygon | 50_000 |
Country 2 | City 3 | Polygon | 200_000 |
Country 1 | City 4 | Polygon | 30_000 |
Country 2 | City 5 | Polygon | 25_000 |
Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
aggregations: [
{
aggfunction: 'sum',
columns: ['Population'],
newcolumns: ['Total population'],
}
]
}
Output dataset:
Country | geometry | Total population |
---|---|---|
Country 1 | MultiPolygon | 140_000 |
Country 2 | MultiPolygon | 275_000 |
domain step
This step is meant to select a specific domain (using MongoDB terminology).
{
name: 'domain',
domain: 'my-domain'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
duplicate step
This step is meant to duplicate a column.
{
name: 'duplicate',
column: 'my-column',
new_column_name: 'my-duplicate'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Value |
---|---|
Company 1 | 13 |
Company 2 | 0 |
Company 3 | 20 |
Step configuration:
{
name: 'duplicate',
column: "Company",
new_column_name: "Company-copy",
}
Output dataset:
Company | Value | Company-copy |
---|---|---|
Company 1 | 13 | Company 1 |
Company 2 | 0 | Company 2 |
Company 3 | 20 | Company 3 |
duration step
Compute the duration (in days, hours, minutes or seconds) between 2 dates in a new column.
{
name: 'duration',
newColumnName: 'DURATION', // new column name
startDateColumn: 'START_DATE', // start date
endDateColumn: 'END_DATE', // end date
durationIn: 'days', // duration units (days, hours, minutes or seconds)
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: duration in days
Input dataset:
START_DATE | END_DATE |
---|---|
“2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” |
“2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_DATE',
endDateColumn: 'END_DATE',
durationIn: 'days',
}
Output dataset:
START_DATE | END_DATE | DURATION |
---|---|---|
“2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” | 30 |
“2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” | 365 |
Example 2: duration in minutes
Input dataset:
START_HOUR | END_HOUR |
---|---|
“2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” |
“2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_HOUR',
endDateColumn: 'END_HOUR',
durationIn: 'minutes',
}
Output dataset:
START_HOUR | END_HOUR | DURATION |
---|---|---|
“2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” | 60 |
“2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” | 300 |
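In pandas, the duration is the difference of two datetime columns converted to the requested unit — a minimal sketch (sample data ours):

```python
import pandas as pd

df = pd.DataFrame({
    "START_HOUR": pd.to_datetime(["2020-01-01T14:00:00Z", "2020-01-01T15:00:00Z"]),
    "END_HOUR": pd.to_datetime(["2020-01-01T15:00:00Z", "2020-01-01T20:00:00Z"]),
})
# Subtracting two datetime columns yields a timedelta; convert it to minutes
df["DURATION"] = (df["END_HOUR"] - df["START_HOUR"]).dt.total_seconds() / 60
```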
evolution step
Use this step if you need to compute the row-by-row evolution of a value column, based on a date column. It will output a new column containing the evolution, either in absolute value or in percentage depending on the evolutionFormat parameter.
You must be careful that the computation is scoped so that there are no duplicate dates (so that any date finds no more than one previous date). That means you may need to specify “group by” columns to make every date unique inside each group. You should specify those columns in the indexColumns parameter.
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear', // or vsLastMonth, vsLastWeek, vsLastDay
evolutionFormat: 'abs', // or pct
// optional, if computation has to be performed by group so that any date finds no more than one previous date
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL', // optional, <originalColumnName>_EVOL_ABS or <originalColumnName>_EVOL_PCT by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic configuration - evolution in absolute value
Input dataset:
DATE | VALUE |
---|---|
2019-06 | 79 |
2019-07 | 81 |
2019-08 | 77 |
2019-09 | 75 |
2019-11 | 78 |
2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
DATE | VALUE | VALUE_EVOL_ABS |
---|---|---|
2019-06 | 79 | |
2019-07 | 81 | 2 |
2019-08 | 77 | -4 |
2019-09 | 75 | -2 |
2019-11 | 78 | |
2019-12 | 88 | 10 |
Example 2: Basic configuration - evolution in percentage
Input dataset:
DATE | VALUE |
---|---|
2019-06 | 79 |
2019-07 | 81 |
2019-08 | 77 |
2019-09 | 75 |
2019-11 | 78 |
2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'pct',
indexColumns: [],
}
Output dataset:
DATE | VALUE | VALUE_EVOL_PCT |
---|---|---|
2019-06 | 79 | |
2019-07 | 81 | 0.02531645569620253 |
2019-08 | 77 | -0.04938271604938271 |
2019-09 | 75 | -0.025974025974025976 |
2019-11 | 78 | |
2019-12 | 88 | 0.1282051282051282 |
Example 3: Error on duplicate dates
If ‘COUNTRY’ is not specified in indexColumns, the computation is not scoped by country. There are then duplicate dates in the “DATE” column, which is prohibited and leads to an error.
Input dataset:
DATE | COUNTRY | VALUE |
---|---|---|
2014-12 | France | 79 |
2015-12 | France | 81 |
2016-12 | France | 77 |
2017-12 | France | 75 |
2014-12 | USA | 74 |
2015-12 | USA | 74 |
2016-12 | USA | 73 |
2017-12 | USA | 72 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
With the mongo translator, you will get an error at row-level as shown below:
DATE | COUNTRY | VALUE | VALUE_EVOL_ABS |
---|---|---|---|
2014-12 | France | 79 | |
2015-12 | France | 81 | Error … |
2016-12 | France | 77 | Error … |
2017-12 | France | 75 | Error … |
2014-12 | USA | 74 | |
2015-12 | USA | 74 | Error … |
2016-12 | USA | 73 | Error … |
2017-12 | USA | 72 | Error … |
The pandas translator will just return an error, and you will not get any data.
Example 4: Complete configuration with index columns
Input dataset:
DATE | COUNTRY | VALUE |
---|---|---|
2014-12 | France | 79 |
2015-12 | France | 81 |
2016-12 | France | 77 |
2017-12 | France | 75 |
2019-12 | France | 78 |
2020-12 | France | 88 |
2014-12 | USA | 74 |
2015-12 | USA | 74 |
2016-12 | USA | 73 |
2017-12 | USA | 72 |
2018-11 | USA | 75 |
2020-12 | USA | 76 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL',
}
Output dataset:
DATE | COUNTRY | VALUE | MY_EVOL |
---|---|---|---|
2014-12 | France | 79 | |
2015-12 | France | 81 | 2 |
2016-12 | France | 77 | -4 |
2017-12 | France | 75 | -2 |
2019-12 | France | 78 | |
2020-12 | France | 88 | 10 |
2014-12 | USA | 74 | |
2015-12 | USA | 74 | 0 |
2016-12 | USA | 73 | -1 |
2017-12 | USA | 72 | -1 |
2018-11 | USA | 75 | 3 |
2020-12 | USA | 76 | |
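A simplified pandas sketch of a grouped evolution, comparing each row to the previous one inside its group (illustrative only; sample data and column names are ours — the real step matches periods by date arithmetic, e.g. exactly one year earlier, not by row position, so this sketch assumes consecutive periods with no gap):

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["France"] * 3 + ["USA"] * 3,
    "YEAR": [2014, 2015, 2016] * 2,
    "VALUE": [79, 81, 77, 74, 74, 73],
})

# Sort inside each group by date, then compare each row to the previous one
df = df.sort_values(["COUNTRY", "YEAR"]).reset_index(drop=True)
df["EVOL_ABS"] = df.groupby("COUNTRY")["VALUE"].diff()        # absolute evolution
df["EVOL_PCT"] = df.groupby("COUNTRY")["VALUE"].pct_change()  # evolution in percentage
```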
fillna step
Replace null values with a given value in specified columns.
{
name: 'fillna',
columns: ["foo", "bar"],
value: 0
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column (singular) parameter is deprecated and is supported only for retrocompatibility purposes.
An old-fashioned step looks like this:
{
name: 'fillna',
column: 'foo',
value: 0
}
Example
Input dataset:
Company | Group | Value | KPI |
---|---|---|---|
Company 1 | Group 1 | 13 | |
Company 2 | Group 1 | | 12 |
Company 3 | Group 1 | 20 | 40 |
Company 4 | Group 2 | 1 | |
Company 5 | Group 2 | | 38 |
Company 6 | Group 2 | 5 | 4 |
Step configuration:
{
name: 'fillna',
columns: ["Value", "KPI"],
value: 0
}
Output dataset:
Company | Group | Value | KPI |
---|---|---|---|
Company 1 | Group 1 | 13 | 0 |
Company 2 | Group 1 | 0 | 12 |
Company 3 | Group 1 | 20 | 40 |
Company 4 | Group 2 | 1 | 0 |
Company 5 | Group 2 | 0 | 38 |
Company 6 | Group 2 | 5 | 4 |
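A pandas sketch of the step, replacing nulls only in the listed columns (illustrative; sample data ours):

```python
import pandas as pd

df = pd.DataFrame({"Value": [13, None, 20], "KPI": [None, 12, 40]})
# Replace nulls with 0 in the listed columns only
df[["Value", "KPI"]] = df[["Value", "KPI"]].fillna(0)
```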
filter step
Filter out lines that don’t match a filter definition.
{
name: 'filter',
condition: {
column: 'my-column',
value: 42,
operator: 'ne'
}
}
operator is optional, and defaults to eq. Allowed operators are eq, ne, gt, ge, lt, le, in, nin, matches, notmatches, isnull or notnull.
value can be an arbitrary value depending on the selected operator (e.g. a list when used with the in operator, or null when used with the isnull operator).
The matches and notmatches operators are used to test a value against a regular expression.
Conditions can be grouped and nested with the logical operators and and or.
{
name: 'filter',
condition: {
and: [
{
column: 'my-column',
value: 42,
operator: 'ge'
},
{
column: 'my-column',
value: 118,
operator: 'le'
},
{
or: [
{
column: 'my-other-column',
value: 'blue',
operator: 'eq'
},
{
column: 'my-other-column',
value: 'red',
operator: 'eq'
}
]
}
]
}
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
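On the Pandas backend, such a condition tree can be sketched with boolean masks, where and/or groups become & and | (an illustrative sketch, not the actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "my-column": [10, 50, 120],
    "my-other-column": ["blue", "red", "green"],
})

# The nested and/or condition from the example above, as a boolean mask
mask = (
    (df["my-column"] >= 42)
    & (df["my-column"] <= 118)
    & ((df["my-other-column"] == "blue") | (df["my-other-column"] == "red"))
)
result = df[mask]
```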
Relative dates
Date values can be relative to the moment when the query is executed.
This is expressed by using a RelativeDate
object instead of the value, of the form:
{
quantity: Number // can be positive (future) or negative (past)
duration: 'year' | 'quarter' | 'month' | 'week' | 'day'
}
*This feature is supported by the following backends:
- Mongo 5.0
formula
step
Add a computation based on a formula. Usually column names do not need to be escaped, unless they include whitespace, in which case you'll need to use brackets '[]' (e.g. [myColumn]). Any string escaped with quotes (', ", ''', """) will be considered a string literal.
{
name: 'formula',
new_column: 'result', // if the column already exists, overwrites it
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Supported operators
The following operators are supported by the formula step (note that a value can be a column name or a literal, such as 42 or foo):
- + : adds two numeric values. See the concatenate step to append strings.
- - : subtracts two numeric values. See the replace step to remove a part of a string.
- * : multiplies two numeric values.
- / : divides a numeric value by another. Divisions by zero will return null.
- % : returns the rest of an integer division. Divisions by zero will return null.
Example 1: Basic usage
Input dataset:
Label | Value1 | Value2 | Value3 | Value4 |
---|---|---|---|---|
Label 1 | 10 | 2 | 3 | 1 |
Label 2 | 1 | 13 | 7 | 3 |
Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Output dataset:
Label | Value1 | Value2 | Value3 | Value4 | Result |
---|---|---|---|---|---|
Label 1 | 10 | 2 | 3 | 1 | 2 |
Label 2 | 1 | 13 | 7 | 3 | -4 |
Label 3 | 5 | 20 | 5 | 2 | 1 |
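For illustration, the same formula can be evaluated in Pandas with DataFrame.eval (a sketch, not the backend's actual formula parser):

```python
import pandas as pd

df = pd.DataFrame({
    "Value1": [10, 1, 5],
    "Value2": [2, 13, 20],
    "Value3": [3, 7, 5],
    "Value4": [1, 3, 2],
})

# DataFrame.eval evaluates an arithmetic expression over columns,
# much like the 'formula' step
df["Result"] = df.eval("(Value1 + Value2) / Value3 - Value4 * 2")
```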
Example 2: Column name with whitespaces
Input dataset:
Label | Value1 | Value2 | Value3 | Value 4 |
---|---|---|---|---|
Label 1 | 10 | 2 | 3 | 1 |
Label 2 | 1 | 13 | 7 | 3 |
Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - [Value 4] * 2'
}
Output dataset:
Label | Value1 | Value2 | Value3 | Value 4 | Result |
---|---|---|---|---|---|
Label 1 | 10 | 2 | 3 | 1 | 2 |
Label 2 | 1 | 13 | 7 | 3 | -4 |
Label 3 | 5 | 20 | 5 | 2 | 1 |
hierarchy
step
Hierarchy for geographical data.
This step dissolves data for every hierarchy level, and adds a hierarchy level column containing a level (with 0 being the lowest granularity, i.e. the highest level).
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
*This step is supported by the following backends:
- Pandas (python)
Example
Input dataset:
Country | City | geometry | Population |
---|---|---|---|
Country 1 | City 1 | Polygon | 100_000 |
Country 2 | City 2 | Polygon | 50_000 |
Country 2 | City 3 | Polygon | 200_000 |
Country 1 | City 4 | Polygon | 30_000 |
Country 2 | City 5 | Polygon | 25_000 |
Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
Output dataset:
Country | City | geometry | Population | hierarchy_level |
---|---|---|---|---|
Country 1 | City 1 | Polygon | 100_000 | 2 |
Country 2 | City 2 | Polygon | 50_000 | 2 |
Country 2 | City 3 | Polygon | 200_000 | 2 |
Country 1 | City 4 | Polygon | 30_000 | 2 |
Country 2 | City 5 | Polygon | 25_000 | 2 |
Country 1 | City 6 | Polygon | 10_000 | 2 |
Country 1 | null | MultiPolygon | null | 1 |
Country 2 | null | MultiPolygon | null | 1 |
null | null | MultiPolygon | null | 0 |
ifthenelse
step
Creates a new column, whose values will depend on a condition expressed on existing columns.
The condition is expressed in the if parameter with a condition object, which is the same object expected by the condition parameter of the filter step. Conditions can be grouped and nested with logical operators and and or.
The then parameter only supports a string, that will be interpreted as a formula (cf. formula step). If you want it to be interpreted strictly as a string and not a formula, you must escape the string with quotes (e.g. '"this is a text"').
if...then...else blocks can be nested: the else parameter supports either a string that will be interpreted as a formula (cf. formula step), or a nested if...then...else object.
{
name: 'ifthenelse',
newColumn: '', // the name of the new column to be created
if: { column: '', value: '', operator: 'eq' }, // a condition, same object as in the filter step
then: '', // a string that will be interpreted as a formula
else: '', // a string interpreted as a formula or a nested if...then...else object
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Label | number |
---|---|
Label 1 | -2 |
Label 2 | 2 |
Label 3 | 0 |
Step configuration:
{
name: 'ifthenelse',
newColumn: 'result',
if: { column: 'number', value: 0, operator: 'eq' },
then: '"zero"',
else: {
if: { column: 'number', value: 0, operator: 'lt' },
then: 'number * -1',
else: 'number'
}
}
Output dataset:
Label | number | result |
---|---|---|
Label 1 | -2 | 2 |
Label 2 | 2 | 2 |
Label 3 | 0 | zero |
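The nested branching of this example can be sketched in Python as a plain function applied row by row (illustrative only; the backends compile the condition tree rather than call a Python function):

```python
import pandas as pd

df = pd.DataFrame({"number": [-2, 2, 0]})

def ifthenelse(n):
    # mirrors the nested if...then...else configuration above
    if n == 0:
        return "zero"
    elif n < 0:
        return n * -1
    else:
        return n

df["result"] = df["number"].apply(ifthenelse)
```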
join
step
Joins a dataset to the current dataset, i.e. brings columns from the former into
the latter, and matches rows based on columns correspondence. It is similar to
a JOIN clause in SQL, or to a VLOOKUP in Excel. The joined dataset is the
result from the query of the right_pipeline.
The join type can be:
- 'left': will keep every row of the current dataset and fill unmatched rows with null values,
- 'inner': will only keep rows that match rows of the joined dataset.
In the on parameter, you must specify 1 or more column couple(s) that will be
compared to determine rows correspondence between the 2 datasets. The first
element of a couple is for the current dataset column, and the second for the
corresponding column in the right dataset to be joined. If you specify more than
1 couple, the matching rows will be those that find a correspondence between the
2 datasets for every column couple specified (logical 'AND').
Weaverbird allows you to save pipelines
referenced by name in the Vuex store
of the application. You can then call them by their unique names in this step.
{
name: 'join',
right_pipeline: 'somePipelineReference',
type: 'left', // or 'inner'
on: [
['currentDatasetColumn1', 'rightDatasetColumn1'],
['currentDatasetColumn2', 'rightDatasetColumn2'],
]
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Left join with one column couple as on
parameter
Input dataset:
Label | Value |
---|---|
Label 1 | 13 |
Label 2 | 7 |
Label 3 | 20 |
Label 4 | 1 |
Label 5 | 1 |
Label 6 | 1 |
rightDataset (saved in the application Vuex store):
Label | Group |
---|---|
Label 1 | Group 1 |
Label 2 | Group 1 |
Label 3 | Group 2 |
Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'left',
on: [['Label', 'Label']]
}
Output dataset:
Label | Value | Group |
---|---|---|
Label 1 | 13 | Group 1 |
Label 2 | 7 | Group 1 |
Label 3 | 20 | Group 2 |
Label 4 | 1 | Group 2 |
Label 5 | 1 | |
Label 6 | 1 | |
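Example 1 can be sketched with pandas.DataFrame.merge, where the step's type maps to the how parameter and each on couple maps to left_on/right_on (an illustrative sketch, not the backend implementation):

```python
import pandas as pd

left = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 5"], "Value": [13, 7, 1]})
right = pd.DataFrame({"Label": ["Label 1", "Label 2"], "Group": ["Group 1", "Group 1"]})

# type: 'left' -> how='left'; on: [['Label', 'Label']] -> left_on/right_on
joined = left.merge(right, how="left", left_on=["Label"], right_on=["Label"])
```

Rows of the current dataset without a match (Label 5 here) keep null values in the joined columns.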
Example 2: inner join with different column names in the on
parameter
Input dataset:
Label | Value |
---|---|
Label 1 | 13 |
Label 2 | 7 |
Label 3 | 20 |
Label 4 | 1 |
Label 5 | 1 |
Label 6 | 1 |
rightDataset (saved in the application Vuex store):
LabelRight | Group |
---|---|
Label 1 | Group 1 |
Label 2 | Group 1 |
Label 3 | Group 2 |
Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'inner',
on: [['Label', 'LabelRight']]
}
Output dataset:
Label | Value | LabelRight | Group |
---|---|---|---|
Label 1 | 13 | Label 1 | Group 1 |
Label 2 | 7 | Label 2 | Group 1 |
Label 3 | 20 | Label 3 | Group 2 |
Label 4 | 1 | Label 4 | Group 2 |
fromdate
step
Converts a date column
into a string column based on a specified format
.
{
name: 'fromdate',
column: 'myDateColumn',
format: '%Y-%m-%d' // For further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Date | Value |
---|---|---|
Company 1 | 2019-10-06T00:00:00.000Z | 13 |
Company 1 | 2019-10-07T00:00:00.000Z | 7 |
Company 1 | 2019-10-08T00:00:00.000Z | 20 |
Company 2 | 2019-10-06T00:00:00.000Z | 1 |
Company 2 | 2019-10-07T00:00:00.000Z | 10 |
Company 2 | 2019-10-08T00:00:00.000Z | 5 |
Step configuration:
{
name: 'fromdate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
Company | Date | Value |
---|---|---|
Company 1 | 06/10/2019 | 13 |
Company 1 | 07/10/2019 | 7 |
Company 1 | 08/10/2019 | 20 |
Company 2 | 06/10/2019 | 1 |
Company 2 | 07/10/2019 | 10 |
Company 2 | 08/10/2019 | 5 |
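For reference, the Pandas equivalent is Series.dt.strftime, which accepts the same %-style format specifiers (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-10-06", "2019-10-07"])})

# Convert the datetime column to formatted strings, like 'fromdate'
df["Date"] = df["Date"].dt.strftime("%d/%m/%Y")
```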
lowercase
step
Note: Mongo's $toLower operator does not support accents. If you need to lowercase accented strings with a Mongo backend, use a replacetext step after the lowercase step.
Converts a string column
to lowercase.
{
name: 'lowercase',
column: 'foo',
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Group | Value |
---|---|---|
LABEL 1 | Group 1 | 13 |
LABEL 2 | Group 1 | 7 |
LABEL 3 | Group 1 | 20 |
Step configuration:
{
name: 'lowercase',
column: 'Label',
}
Output dataset:
Label | Group | Value |
---|---|---|
label 1 | Group 1 | 13 |
label 2 | Group 1 | 7 |
label 3 | Group 1 | 20 |
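For reference, the Pandas equivalent is Series.str.lower, which (unlike Mongo's $toLower) handles accented characters (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["LABEL 1", "LABEL 2"]})

# Lowercase the string column, like the 'lowercase' step
df["Label"] = df["Label"].str.lower()
```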
movingaverage
step
Compute the moving average based on a value column, a reference column to sort (usually a date column) and a moving window (in number of rows i.e. data points). If needed, the computation can be performed by group of rows. The computation result is added in a new column.
{
name: 'movingaverage',
valueColumn: 'value',
columnToSort: 'dates',
movingWindow: 12,
groups: ['foo', 'bar'], // optional
newColumnName: 'myNewColumn' // optional, '<valueColumn>_MOVING_AVG' by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
DATE | VALUE |
---|---|
2018-01-01 | 75 |
2018-01-02 | 80 |
2018-01-03 | 82 |
2018-01-04 | 83 |
2018-01-05 | 80 |
2018-01-06 | 86 |
2018-01-07 | 79 |
2018-01-08 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 2,
}
Output dataset:
DATE | VALUE | VALUE_MOVING_AVG |
---|---|---|
2018-01-01 | 75 | null |
2018-01-02 | 80 | 77.5 |
2018-01-03 | 82 | 81 |
2018-01-04 | 83 | 82.5 |
2018-01-05 | 80 | 81.5 |
2018-01-06 | 86 | 83 |
2018-01-07 | 79 | 82.5 |
2018-01-08 | 76 | 77.5 |
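Example 1 can be sketched in Pandas with a rolling mean after sorting on the reference column (illustrative, not the actual backend code):

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "VALUE": [75, 80, 82],
})

# Sort by the reference column, then average over a 2-row moving window;
# the first row has no complete window, hence a null result
df = df.sort_values("DATE")
df["VALUE_MOVING_AVG"] = df["VALUE"].rolling(window=2).mean()
```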
Example 2: with groups and custom newColumnName
Input dataset:
COUNTRY | DATE | VALUE |
---|---|---|
France | 2018-01-01 | 75 |
France | 2018-01-02 | 80 |
France | 2018-01-03 | 82 |
France | 2018-01-04 | 83 |
France | 2018-01-05 | 80 |
France | 2018-01-06 | 86 |
USA | 2018-01-01 | 69 |
USA | 2018-01-02 | 73 |
USA | 2018-01-03 | 73 |
USA | 2018-01-04 | 75 |
USA | 2018-01-05 | 70 |
USA | 2018-01-06 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 3,
groups: ['COUNTRY'],
newColumnName: 'ROLLING_AVERAGE'
}
Output dataset:
COUNTRY | DATE | VALUE | ROLLING_AVERAGE |
---|---|---|---|
France | 2018-01-01 | 75 | null |
France | 2018-01-02 | 80 | null |
France | 2018-01-03 | 82 | 79 |
France | 2018-01-04 | 83 | 81.7 |
France | 2018-01-05 | 80 | 81.7 |
France | 2018-01-06 | 86 | 83 |
USA | 2018-01-01 | 69 | null |
USA | 2018-01-02 | 73 | null |
USA | 2018-01-03 | 73 | 71.7 |
USA | 2018-01-04 | 75 | 73.7 |
USA | 2018-01-05 | 70 | 72.7 |
USA | 2018-01-06 | 76 | 73.7 |
percentage
step
Compute the percentage of total, i.e. for every row the value in column
divided
by the total as the sum of every values in column
. The computation can be performed
by group
if specified. The result is written in a new column.
{
name: 'percentage',
column: 'bar',
group: ['foo'], // optional
newColumnName: 'myNewColumn' // optional, '<originalColumnName>_PCT' by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 5 |
Label 2 | Group 1 | 10 |
Label 3 | Group 1 | 15 |
Label 4 | Group 2 | 2 |
Label 5 | Group 2 | 7 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'percentage',
column: 'Value',
group: ['Group'],
newColumnName: 'Percentage'
}
Output dataset:
Label | Group | Value | Percentage |
---|---|---|---|
Label 1 | Group 1 | 5 | 0.167 |
Label 2 | Group 1 | 10 | 0.333 |
Label 3 | Group 1 | 15 | 0.5 |
Label 4 | Group 2 | 2 | 0.143 |
Label 5 | Group 2 | 7 | 0.5 |
Label 6 | Group 2 | 5 | 0.357 |
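A Pandas sketch of the grouped percentage-of-total: divide each value by its group-wise sum (illustrative only, not the backend implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["Group 1", "Group 1", "Group 1"],
    "Value": [5, 10, 15],
})

# Percentage of total per group: divide by the group-wise sum
df["Percentage"] = df["Value"] / df.groupby("Group")["Value"].transform("sum")
```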
pivot
step
Pivot rows into columns around a given index
(expressed as a combination of column(s)).
Values to be used as new column names are found in the column column_to_pivot
.
Values to populate new columns are found in the column value_column
.
The function used to aggregate data (when several rows are found by index group) must be
among sum
, avg
, count
, min
or max
.
{
name: 'pivot',
index: ['column_1', 'column_2'],
column_to_pivot: 'column_3',
value_column: 'column_4',
agg_function: 'sum',
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Country | Value |
---|---|---|
Label 1 | Country1 | 13 |
Label 2 | Country1 | 7 |
Label 3 | Country1 | 20 |
Label 1 | Country2 | 1 |
Label 2 | Country2 | 10 |
Label 3 | Country2 | 5 |
Label 3 | Country2 | 1 |
Step configuration:
{
name: 'pivot',
index: ['Label'],
column_to_pivot: 'Country',
value_column: 'Value',
agg_function: 'sum',
}
Output dataset:
Label | Country1 | Country2 |
---|---|---|
Label 1 | 13 | 1 |
Label 2 | 7 | 10 |
Label 3 | 20 | 6 |
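This example maps naturally to pandas.DataFrame.pivot_table, where index, column_to_pivot, value_column and agg_function correspond to the index, columns, values and aggfunc parameters (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Label 1", "Label 1", "Label 3", "Label 3"],
    "Country": ["Country1", "Country2", "Country2", "Country2"],
    "Value": [13, 1, 5, 1],
})

# Pivot the 'Country' values into columns, summing duplicates per index group
pivoted = df.pivot_table(
    index="Label", columns="Country", values="Value", aggfunc="sum"
).reset_index()
```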
statistics
step
Compute statistics of a column.
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
Output dataset:
average | count | median |
---|---|---|
9.33333 | 6 | 8.5 |
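For illustration, the requested statistics can be sketched in Pandas with the usual Series aggregations; note that the median is exactly the 1st 2-quantile (nth=1, order=2) of the quantiles parameter. This is a sketch, not the backend implementation:

```python
import pandas as pd

s = pd.Series([13, 7, 20, 1, 10, 5])

# 'average', 'count' and the median quantile from the example above
stats = {"average": s.mean(), "count": s.count(), "median": s.median()}
```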
rank
step
This step computes a rank column based on a value column that can be sorted in ascending or descending order. The ranking can be computed by group.
There are 2 ranking methods available, illustrated by the following examples:
- standard: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 5, 5, 7]
- dense: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 3, 3, 4]
(The dense method is basically the same as the standard method, but rank always increases by 1 at most.)
{
name: 'rank',
valueCol: 'VALUE', // The value column that will be ordered to determine rank
order: 'desc', // How to order the value column to determine ranking.
// either asc(ending) or desc(ending), 'desc' by default
method: 'standard', // either 'standard' or 'dense', as explained above
groupby: ['foo', 'bar'], // specify columns if you want to group the ranking computation
newColumnName: 'columnA', // specify the new column name to be created (by default '<yourValueColumn>_RANK')
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
COUNTRY | VALUE |
---|---|
FRANCE | 15 |
FRANCE | 5 |
FRANCE | 10 |
FRANCE | 20 |
FRANCE | 10 |
FRANCE | 15 |
USA | 20 |
USA | 30 |
USA | 20 |
USA | 25 |
USA | 15 |
USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'desc',
method: 'standard',
}
Output dataset:
COUNTRY | VALUE | VALUE_RANK |
---|---|---|
USA | 30 | 1 |
USA | 25 | 2 |
FRANCE | 20 | 3 |
USA | 20 | 3 |
USA | 20 | 3 |
USA | 20 | 3 |
FRANCE | 15 | 7 |
FRANCE | 15 | 7 |
USA | 15 | 7 |
FRANCE | 10 | 10 |
FRANCE | 10 | 10 |
FRANCE | 5 | 12 |
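In Pandas terms, the two methods correspond to Series.rank with method='min' (standard) and method='dense' (an illustrative sketch, not the backend code):

```python
import pandas as pd

df = pd.DataFrame({"VALUE": [30, 25, 20, 20, 15]})

# 'standard' ranking corresponds to pandas method='min',
# 'dense' to method='dense'; order: 'desc' -> ascending=False
df["VALUE_RANK"] = df["VALUE"].rank(method="min", ascending=False).astype(int)
```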
Example 2: With more options
Input dataset:
COUNTRY | VALUE |
---|---|
FRANCE | 15 |
FRANCE | 5 |
FRANCE | 10 |
FRANCE | 20 |
FRANCE | 10 |
FRANCE | 15 |
USA | 20 |
USA | 30 |
USA | 20 |
USA | 25 |
USA | 15 |
USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'asc',
method: 'dense',
groupby: ['COUNTRY'],
newColumnName: 'MY_RANK',
}
Output dataset:
COUNTRY | VALUE | MY_RANK |
---|---|---|
FRANCE | 5 | 1 |
FRANCE | 10 | 2 |
FRANCE | 10 | 2 |
FRANCE | 15 | 3 |
FRANCE | 15 | 3 |
FRANCE | 20 | 4 |
USA | 15 | 1 |
USA | 20 | 2 |
USA | 20 | 2 |
USA | 20 | 2 |
USA | 25 | 3 |
USA | 30 | 4 |
rename
step
Rename one or several columns.
The toRename
parameter takes as input a list of 2-elements lists in the form
[‘oldColumnName’, ‘newColumnName’].
{
name: 'rename',
toRename: [
['oldCol1', 'newCol1'], // The first value is the column to be renamed, the second is the new column name
['oldCol2', 'newCol2']
]
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The oldname and newname parameters are deprecated and are supported only for
retrocompatibility purposes. When this step was first created, only 1 column at
a time could be renamed, and the step was configured via these 2 parameters.
An old-fashioned step with oldname set to 'old' and newname set to 'new'
is equivalent to a new-fashioned step with the toRename parameter set to [['old', 'new']].
So an old-fashioned step looks like this:
{
name: 'rename',
oldname: 'old-column-name',
newname: 'new-column-name'
}
Example:
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'rename',
oldname: 'Label',
newname: 'Company'
}
Output dataset:
Company | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
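The toRename couples translate directly to a pandas rename mapping (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1"], "Value": [13]})

# Each ['old', 'new'] couple becomes an old -> new entry in the mapping
df = df.rename(columns={"Label": "Company"})
```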
replace
step
Replace one or several values in a column.
A replace step has the following structure:
{
name: 'replace',
search_column: "column_1",
to_replace: [
['foo', 'bar'], // The first value is the one to be replaced, the second is the new value
[42, 0]
]
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
COMPANY | COUNTRY |
---|---|
Company 1 | Fr |
Company 2 | UK |
Step configuration:
{
name: 'replace',
search_column: "COUNTRY",
to_replace: [
['Fr', 'France'],
['UK', 'United Kingdom']
]
}
Output dataset:
COMPANY | COUNTRY |
---|---|
Company 1 | France |
Company 2 | United Kingdom |
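On the Pandas side, whole-value replacement in a column corresponds to Series.replace with a mapping (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"COUNTRY": ["Fr", "UK"]})

# Each to_replace couple becomes an entry in the replacement mapping
df["COUNTRY"] = df["COUNTRY"].replace({"Fr": "France", "UK": "United Kingdom"})
```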
replacetext
step
Replace a substring in a column.
A replace-text step has the following structure:
{
name: 'replacetext',
search_column: "column_1",
old_str: 'foo',
new_str: 'bar'
}
*This step is supported by the following backends:
- Mongo 5.0
- Pandas (python)
- AWS Athena
- Google Big Query
- MySQL
- PostgreSQL
- Redshift
- Snowflake
Example
Input dataset:
COMPANY | COUNTRY |
---|---|
Company 1 | Fr is boring |
Company 2 | UK |
Step configuration:
{
name: 'replacetext',
search_column: "COUNTRY",
old_str: 'Fr',
new_str: 'France',
}
Output dataset:
COMPANY | COUNTRY |
---|---|
Company 1 | France is boring |
Company 2 | UK |
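Unlike the replace step, replacetext substitutes a substring inside each value. In Pandas this corresponds to Series.str.replace with regex disabled so old_str is treated literally (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"COUNTRY": ["Fr is boring", "UK"]})

# Substring replacement, not whole-value replacement;
# regex=False treats old_str as a literal string
df["COUNTRY"] = df["COUNTRY"].str.replace("Fr", "France", regex=False)
```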
rollup
step
Use this step if you need to compute aggregated data at every level of a
hierarchy
, specified as a series of columns from top to bottom level.
The output data structure stacks the data of every level of the hierarchy,
specifying for every row the label, level and parent in dedicated columns.
Aggregated rows can be computed using either sum, average, count, count distinct, min, max, first or last.
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'], // Hierarchical columns, from top to bottom level
aggregations: [ // The columns to be aggregated, and how
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
],
groupby: ['date'], // optional, if needing to segment the rollup computation by group
labelCol: 'label', // optional, name of the output label column, 'label' by default
levelCol: 'level', // optional, name of the output level column, 'level' by default
parentLabelCol: 'parent', // optional, name of the output parent column, 'parent' by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column
and newcolumn
(in the singular) properties of the aggregation
parameter are deprecated and are supported only for retrocompatibility purposes.
When this step was first created, only 1 column at a time could be aggregated for
a given aggregation.
Now columns
and newcolumns
are lists of columns, allowing to apply the same
aggregation function to several columns at once.
An old-fashioned step looked like this:
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1 : Basic configuration
Input dataset:
CITY | COUNTRY | CONTINENT | YEAR | VALUE |
---|---|---|---|---|
Paris | France | Europe | 2018 | 10 |
Bordeaux | France | Europe | 2018 | 5 |
Barcelona | Spain | Europe | 2018 | 8 |
Madrid | Spain | Europe | 2018 | 3 |
Boston | USA | North America | 2018 | 12 |
New-York | USA | North America | 2018 | 21 |
Montreal | Canada | North America | 2018 | 10 |
Ottawa | Canada | North America | 2018 | 7 |
Paris | France | Europe | 2019 | 13 |
Bordeaux | France | Europe | 2019 | 8 |
Barcelona | Spain | Europe | 2019 | 11 |
Madrid | Spain | Europe | 2019 | 6 |
Boston | USA | North America | 2019 | 15 |
New-York | USA | North America | 2019 | 24 |
Montreal | Canada | North America | 2019 | 10 |
Ottawa | Canada | North America | 2019 | 13 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE'],
aggfunction: 'sum',
columns: ['VALUE']
},
],
}
Output dataset:
CITY | COUNTRY | CONTINENT | label | level | parent | VALUE |
---|---|---|---|---|---|---|
| | Europe | Europe | CONTINENT | | 64 |
| | North America | North America | CONTINENT | | 112 |
| France | Europe | France | COUNTRY | Europe | 36 |
| Spain | Europe | Spain | COUNTRY | Europe | 28 |
| USA | North America | USA | COUNTRY | North America | 72 |
| Canada | North America | Canada | COUNTRY | North America | 40 |
Paris | France | Europe | Paris | CITY | France | 23 |
Bordeaux | France | Europe | Bordeaux | CITY | France | 13 |
Barcelona | Spain | Europe | Barcelona | CITY | Spain | 19 |
Madrid | Spain | Europe | Madrid | CITY | Spain | 9 |
Boston | USA | North America | Boston | CITY | USA | 27 |
New-York | USA | North America | New-York | CITY | USA | 45 |
Montreal | Canada | North America | Montreal | CITY | Canada | 20 |
Ottawa | Canada | North America | Ottawa | CITY | Canada | 20 |
Example 2 : Configuration with optional parameters
Input dataset:
CITY | COUNTRY | CONTINENT | YEAR | VALUE | COUNT |
---|---|---|---|---|---|
Paris | France | Europe | 2018 | 10 | 1 |
Bordeaux | France | Europe | 2018 | 5 | 1 |
Barcelona | Spain | Europe | 2018 | 8 | 1 |
Madrid | Spain | Europe | 2018 | 3 | 1 |
Boston | USA | North America | 2018 | 12 | 1 |
New-York | USA | North America | 2018 | 21 | 1 |
Montreal | Canada | North America | 2018 | 10 | 1 |
Ottawa | Canada | North America | 2018 | 7 | 1 |
Paris | France | Europe | 2019 | 13 | 1 |
Bordeaux | France | Europe | 2019 | 8 | 1 |
Barcelona | Spain | Europe | 2019 | 11 | 1 |
Madrid | Spain | Europe | 2019 | 6 | 1 |
Boston | USA | North America | 2019 | 15 | 1 |
New-York | USA | North America | 2019 | 24 | 1 |
Montreal | Canada | North America | 2019 | 10 | 1 |
Ottawa | Canada | North America | 2019 | 13 | 1 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE-sum', 'COUNT'],
aggfunction: 'sum',
columns: ['VALUE', 'COUNT']
},
{
newcolumns: ['VALUE-avg'],
aggfunction: 'avg',
columns: ['VALUE']
}
],
groupby: ['YEAR'],
labelCol: 'MY_LABEL',
levelCol: 'MY_LEVEL',
parentLabelCol: 'MY_PARENT',
}
Output dataset:
CITY | COUNTRY | CONTINENT | YEAR | MY_LABEL | MY_LEVEL | MY_PARENT | VALUE-sum | VALUE-avg | COUNT |
---|---|---|---|---|---|---|---|---|---|
| | Europe | 2018 | Europe | CONTINENT | | 26 | 6.5 | 4 |
| | North America | 2018 | North America | CONTINENT | | 50 | 12.5 | 4 |
| France | Europe | 2018 | France | COUNTRY | Europe | 15 | 7.5 | 2 |
| Spain | Europe | 2018 | Spain | COUNTRY | Europe | 11 | 5.5 | 2 |
| USA | North America | 2018 | USA | COUNTRY | North America | 33 | 16.5 | 2 |
| Canada | North America | 2018 | Canada | COUNTRY | North America | 17 | 8.5 | 2 |
Paris | France | Europe | 2018 | Paris | CITY | France | 10 | 10 | 1 |
Bordeaux | France | Europe | 2018 | Bordeaux | CITY | France | 5 | 5 | 1 |
Barcelona | Spain | Europe | 2018 | Barcelona | CITY | Spain | 8 | 8 | 1 |
Madrid | Spain | Europe | 2018 | Madrid | CITY | Spain | 3 | 3 | 1 |
Boston | USA | North America | 2018 | Boston | CITY | USA | 12 | 12 | 1 |
New-York | USA | North America | 2018 | New-York | CITY | USA | 21 | 21 | 1 |
Montreal | Canada | North America | 2018 | Montreal | CITY | Canada | 10 | 10 | 1 |
Ottawa | Canada | North America | 2018 | Ottawa | CITY | Canada | 7 | 7 | 1 |
| | Europe | 2019 | Europe | CONTINENT | | 38 | 9.5 | 4 |
| | North America | 2019 | North America | CONTINENT | | 62 | 15.5 | 4 |
| France | Europe | 2019 | France | COUNTRY | Europe | 21 | 10.5 | 2 |
| Spain | Europe | 2019 | Spain | COUNTRY | Europe | 17 | 8.5 | 2 |
| USA | North America | 2019 | USA | COUNTRY | North America | 39 | 19.5 | 2 |
| Canada | North America | 2019 | Canada | COUNTRY | North America | 23 | 11.5 | 2 |
Paris | France | Europe | 2019 | Paris | CITY | France | 13 | 13 | 1 |
Bordeaux | France | Europe | 2019 | Bordeaux | CITY | France | 8 | 8 | 1 |
Barcelona | Spain | Europe | 2019 | Barcelona | CITY | Spain | 11 | 11 | 1 |
Madrid | Spain | Europe | 2019 | Madrid | CITY | Spain | 6 | 6 | 1 |
Boston | USA | North America | 2019 | Boston | CITY | USA | 15 | 15 | 1 |
New-York | USA | North America | 2019 | New-York | CITY | USA | 24 | 24 | 1 |
Montreal | Canada | North America | 2019 | Montreal | CITY | Canada | 10 | 10 | 1 |
Ottawa | Canada | North America | 2019 | Ottawa | CITY | Canada | 13 | 13 | 1 |
select
step
Select columns. The default is to keep every column of the input domain. If
the select step is used, only the selected columns are kept in the output.
{
name: 'select',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Group | Value | Label |
---|---|---|---|
Company 1 | Group 1 | 13 | Company 1 - Group 1 |
Company 2 | Group 1 | 7 | Company 2 - Group 1 |
Company 3 | Group 1 | 20 | Company 3 - Group 1 |
Company 4 | Group 2 | 1 | Company 4 - Group 2 |
Company 5 | Group 2 | 10 | Company 5 - Group 2 |
Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'select',
columns: ['Value', 'Label']
}
Output dataset:
Value | Label |
---|---|
13 | Company 1 - Group 1 |
7 | Company 2 - Group 1 |
20 | Company 3 - Group 1 |
1 | Company 4 - Group 2 |
10 | Company 5 - Group 2 |
5 | Company 6 - Group 2 |
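In Pandas, the equivalent is plain column selection (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Company 1"],
    "Value": [13],
    "Label": ["Company 1 - Group 1"],
})

# Keep only the selected columns, like the 'select' step
df = df[["Value", "Label"]]
```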
sort
step
Sort values in one or several columns. Order can be either ‘asc’ or ‘desc’.
When sorting on several columns, order of columns specified in columns
matters.
{
name: 'sort',
columns: [{column: 'foo', order: 'asc'}, {column: 'bar', order: 'desc'}],
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'sort',
columns: [{ column: 'Group', order: 'asc'}, {column: 'Value', order: 'desc' }]
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 3 | Group 1 | 20 |
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Label 4 | Group 2 | 1 |
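A multi-column sort with mixed orders maps to pandas sort_values, with one ascending flag per column (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["Group 1", "Group 1", "Group 2"],
    "Value": [7, 20, 10],
})

# Column order matters, exactly as in the 'sort' step;
# 'asc'/'desc' become per-column ascending flags
df = df.sort_values(by=["Group", "Value"], ascending=[True, False])
```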
split
step
Split a string column
into several columns based on a delimiter
.
{
name: 'split',
column: 'foo', // the column to split
delimiter: ' - ', // the delimiter can be a string of any length
number_cols_to_keep: 3, // the number of columns to keep resulting from the
// split (starting from first chunk)
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1
Input dataset:
Label | Value |
---|---|
Label 1 - Group 1 - France | 13 |
Label 2 - Group 1 - Spain | 7 |
Label 3 - Group 1 - USA | 20 |
Label 4 - Group 2 - France | 1 |
Label 5 - Group 2 - Spain | 10 |
Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 3,
}
Output dataset:
Label_1 | Label_2 | Label_3 | Value |
---|---|---|---|
Label 1 | Group 1 | France | 13 |
Label 2 | Group 1 | Spain | 7 |
Label 3 | Group 1 | USA | 20 |
Label 4 | Group 2 | France | 1 |
Label 5 | Group 2 | Spain | 10 |
Label 6 | Group 2 | USA | 5 |
Example 2: keeping fewer columns
Input dataset:
Label | Value |
---|---|
Label 1 - Group 1 - France | 13 |
Label 2 - Group 1 - Spain | 7 |
Label 3 - Group 1 - USA | 20 |
Label 4 - Group 2 - France | 1 |
Label 5 - Group 2 - Spain | 10 |
Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 2,
}
Output dataset:
Label_1 | Label_2 | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
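In Pandas, this can be sketched with Series.str.split using expand=True, keeping only the first number_cols_to_keep chunks (an illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1 - Group 1 - France"], "Value": [13]})

# Split on the delimiter and keep the first 2 chunks as new columns
parts = df["Label"].str.split(" - ", expand=True)
df[["Label_1", "Label_2"]] = parts[[0, 1]]
```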
simplify
step
Simplifies geographical data.
When simplifying your data, every point closer than a given distance to the previous one is suppressed. This step can be useful if you have a very precise shape for a country (such as one-meter precision) but want to quickly draw a map chart; in that case, you may want to simplify your data.
After simplification, no points will be closer than tolerance. The unit depends on the data's projection and on its unit, but in general it is expressed in meters for CRS projections. For more details, see the GeoPandas documentation.
Step configuration:
{
name: 'simplify',
tolerance: 1.0,
}
substring
step
Extract a substring in a string column. The substring begins at index start_index (beginning at 1) and stops at end_index.
You can specify negative indexes; in that case the index search will start from the end of the string (with -1 being the last index of the string). Please refer to the examples below for illustration.
Neither start_index nor end_index can be equal to 0.
{
name: 'substring',
column: 'foo',
start_index: 1, // 1 = first character
end_index: -1, // -1 = last character
newColumnName: 'myNewColumn' // optional, <originalColumnName>_SUBSTR by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: positive start_index
and end_index
Input dataset:
Label | Value |
---|---|
foo | 13 |
overflow | 7 |
some_text | 20 |
a_word | 1 |
toucan | 10 |
toco | 5 |
Step configuration:
{
column: 'Label',
name: 'substring',
start_index: 1,
end_index: 4,
}
Output dataset:
Label | Value | Label_SUBSTR |
---|---|---|
foo | 13 | foo |
overflow | 7 | over |
some_text | 20 | some |
a_word | 1 | a_wo |
toucan | 10 | touc |
toco | 5 | toco |
Example 2: start_index
is positive and end_index
is negative
Input dataset:
Label | Value |
---|---|
foo | 13 |
overflow | 7 |
some_text | 20 |
a_word | 1 |
toucan | 10 |
toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: 2,
end_index: -2,
newColumnName: 'short_label',
}
Output dataset:
Label | Value | short_label |
---|---|---|
foo | 13 | o |
overflow | 7 | verflo |
some_text | 20 | ome_tex |
a_word | 1 | _wor |
toucan | 10 | ouca |
toco | 5 | oc |
Example 3: start_index
and end_index
are negative
Input dataset:
Label | Value |
---|---|
foo | 13 |
overflow | 7 |
some_text | 20 |
a_word | 1 |
toucan | 10 |
toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: -3,
end_index: -1,
}
Output dataset:
Label | Value | Label_SUBSTR |
---|---|---|
foo | 13 | foo |
overflow | 7 | low |
some_text | 20 | ext |
a_word | 1 | ord |
toucan | 10 | can |
toco | 5 | oco |
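In pandas, the 1-based inclusive bounds translate to a 0-based, end-exclusive slice. A minimal sketch (the substring_step helper is ours, not the actual backend code):

```python
import pandas as pd

def substring_step(df, column, start_index, end_index, new_column_name=None):
    # convert 1-based inclusive bounds to Python's 0-based, end-exclusive slice
    start = start_index - 1 if start_index > 0 else start_index
    stop = end_index if end_index > 0 else (end_index + 1 or None)
    df[new_column_name or f"{column}_SUBSTR"] = df[column].str.slice(start, stop)
    return df

df = pd.DataFrame({"Label": ["foo", "overflow", "some_text"]})
out = substring_step(df, "Label", 2, -2, new_column_name="short_label")
```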
text
step
Use this step to add a text column where every value will be equal to the specified text.
{
name: 'text',
new_column: 'new', // if the column already exists, overwrites it
text: 'some text'
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Label | Value1 |
---|---|
Label 1 | 10 |
Label 2 | 1 |
Label 3 | 5 |
Step configuration:
{
name: 'text',
new_column: 'KPI',
text: 'Sales'
}
Output dataset:
Label | Value1 | KPI |
---|---|---|
Label 1 | 10 | Sales |
Label 2 | 1 | Sales |
Label 3 | 5 | Sales |
todate
step
Converts a string column into a date column based on a specified format.
{
name: 'todate',
column: 'myTextColumn',
format: '%Y-%m-%d' // Optional. If undefined, the backend tries to infer the date format.
// For further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
// Note: custom format is not supported in Mongo versions older than 4.0
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
Company | Date | Value |
---|---|---|
Company 1 | 06/10/2019 | 13 |
Company 1 | 07/10/2019 | 7 |
Company 1 | 08/10/2019 | 20 |
Company 2 | 06/10/2019 | 1 |
Company 2 | 07/10/2019 | 10 |
Company 2 | 08/10/2019 | 5 |
Step configuration:
{
name: 'todate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
Company | Date | Value |
---|---|---|
Company 1 | 2019-10-06T00:00:00.000Z | 13 |
Company 1 | 2019-10-07T00:00:00.000Z | 7 |
Company 1 | 2019-10-08T00:00:00.000Z | 20 |
Company 2 | 2019-10-06T00:00:00.000Z | 1 |
Company 2 | 2019-10-07T00:00:00.000Z | 10 |
Company 2 | 2019-10-08T00:00:00.000Z | 5 |
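For the Pandas backend, this example maps directly onto pd.to_datetime (a sketch, not the actual backend code):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["06/10/2019", "07/10/2019", "08/10/2019"]})
# an explicit format avoids ambiguous day/month inference
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
```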
top
step
Return the top N rows by group if groups is specified, else over the full dataset.
{
name: 'top',
groups: ['foo'],
rank_on: 'bar',
sort: 'desc', // or 'asc'
limit: 10
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: top without groups
, ascending order
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
rank_on: 'Value',
sort: 'asc',
limit: 3
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 4 | Group 2 | 1 |
Label 6 | Group 2 | 5 |
Label 2 | Group 1 | 7 |
Example 2: top with groups
, descending order
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 4 | Group 2 | 1 |
Label 5 | Group 2 | 10 |
Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
groups: ['Group'],
rank_on: 'Value',
sort: 'desc',
limit: 1
}
Output dataset:
Label | Group | Value |
---|---|---|
Label 3 | Group 1 | 20 |
Label 5 | Group 2 | 10 |
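The grouped variant (example 2) can be sketched in pandas as a sort followed by a per-group head (the top_step helper is ours, not the actual backend implementation):

```python
import pandas as pd

def top_step(df, rank_on, sort, limit, groups=None):
    ordered = df.sort_values(rank_on, ascending=(sort == "asc"))
    if groups:
        # head() keeps the first `limit` rows of each group, preserving the sort
        return ordered.groupby(groups).head(limit)
    return ordered.head(limit)

df = pd.DataFrame({
    "Label": [f"Label {i}" for i in range(1, 7)],
    "Group": ["Group 1"] * 3 + ["Group 2"] * 3,
    "Value": [13, 7, 20, 1, 10, 5],
})
out = top_step(df, "Value", "desc", 1, groups=["Group"])
```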
totals
step
Append “total” rows to the dataset for specified dimensions. Computed rows result from an aggregation (either sum, average, count, count distinct, min, max, first or last).
{
name: 'totals',
// an array of objects: { totalColumn: <column to add total rows in>, totalRowsLabel: <the label of the added total rows>}
totalDimensions: [
{ totalColumn: 'foo', totalRowsLabel: 'Total foos' },
{ totalColumn: 'bar', totalRowsLabel: 'Total bars' }
],
aggregations: [ // The columns to be aggregated, and how
{
columns: ['value1', 'value2'],
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
},
{
columns: ['value1'],
aggfunction: 'avg',
newcolumns: ['avg_value1'],
}
],
groups: ['someDimension']// specify columns if the totals computation needs to be scoped by group of rows
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: basic usage
Input dataset:
COUNTRY | PRODUCT | YEAR | VALUE |
---|---|---|---|
France | product A | 2019 | 5 |
USA | product A | 2019 | 10 |
France | product B | 2019 | 10 |
USA | product B | 2019 | 15 |
France | product A | 2020 | 20 |
USA | product A | 2020 | 20 |
France | product B | 2020 | 30 |
USA | product B | 2020 | 25 |
Step configuration:
{
name: 'totals',
totalDimensions: [{ totalColumn: 'COUNTRY', totalRowsLabel: 'All countries' }],
aggregations: [
{
columns: ['VALUE'],
aggfunction: 'sum',
newcolumns: ['VALUE'],
}
]
}
Output dataset:
COUNTRY | PRODUCT | YEAR | VALUE |
---|---|---|---|
France | product A | 2019 | 5 |
USA | product A | 2019 | 10 |
France | product B | 2019 | 10 |
USA | product B | 2019 | 15 |
France | product A | 2020 | 20 |
USA | product A | 2020 | 20 |
France | product B | 2020 | 30 |
USA | product B | 2020 | 25 |
All countries | null | null | 135 |
Example 2: With several totals and groups
Input dataset:
COUNTRY | PRODUCT | YEAR | VALUE_1 | VALUE_2 |
---|---|---|---|---|
France | product A | 2019 | 5 | 50 |
USA | product A | 2019 | 10 | 100 |
France | product B | 2019 | 10 | 100 |
USA | product B | 2019 | 15 | 150 |
France | product A | 2020 | 20 | 200 |
USA | product A | 2020 | 20 | 200 |
France | product B | 2020 | 30 | 300 |
USA | product B | 2020 | 25 | 250 |
Step configuration:
{
name: 'totals',
totalDimensions: [
{totalColumn: 'COUNTRY', totalRowsLabel: 'All countries'},
{totalColumn: 'PRODUCT', totalRowsLabel: 'All products'}
],
aggregations: [
{
columns: ['VALUE_1', 'VALUE_2'],
aggfunction: 'sum',
newcolumns: ['VALUE_1-sum', 'VALUE_2'],
},
{
columns: ['VALUE_1'],
aggfunction: 'avg',
newcolumns: ['VALUE_1-avg'],
}
],
groups: ['YEAR']
}
Output dataset:
COUNTRY | PRODUCT | YEAR | VALUE_2 | VALUE_1-sum | VALUE_1-avg |
---|---|---|---|---|---|
France | product A | 2019 | 50 | 5 | 5 |
USA | product A | 2019 | 100 | 10 | 10 |
France | product B | 2019 | 100 | 10 | 10 |
USA | product B | 2019 | 150 | 15 | 15 |
France | product A | 2020 | 200 | 20 | 20 |
USA | product A | 2020 | 200 | 20 | 20 |
France | product B | 2020 | 300 | 30 | 30 |
USA | product B | 2020 | 250 | 25 | 25 |
USA | All products | 2020 | 450 | 45 | 22.5 |
France | All products | 2020 | 500 | 50 | 25 |
USA | All products | 2019 | 250 | 25 | 12.5 |
France | All products | 2019 | 150 | 15 | 7.5 |
All countries | product B | 2020 | 550 | 55 | 27.5 |
All countries | product A | 2020 | 400 | 40 | 20 |
All countries | product B | 2019 | 250 | 25 | 12.5 |
All countries | product A | 2019 | 150 | 15 | 7.5 |
All countries | All products | 2020 | 950 | 95 | 23.75 |
All countries | All products | 2019 | 400 | 40 | 10 |
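The basic single-dimension case (example 1) can be sketched in pandas as an aggregation appended to the original rows (the totals_step helper is ours and covers only one total dimension, not the full multi-dimension logic):

```python
import pandas as pd

def totals_step(df, total_column, total_rows_label, value_column, groups=None):
    # aggregate within groups if any, otherwise over the whole dataset
    if groups:
        total = df.groupby(groups, as_index=False)[value_column].sum()
    else:
        total = pd.DataFrame({value_column: [df[value_column].sum()]})
    total[total_column] = total_rows_label  # unreferenced columns stay null
    return pd.concat([df, total], ignore_index=True)

df = pd.DataFrame({
    "COUNTRY": ["France", "USA"] * 4,
    "VALUE": [5, 10, 10, 15, 20, 20, 30, 25],
})
out = totals_step(df, "COUNTRY", "All countries", "VALUE")
```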
trim
step
Trim leading and trailing spaces in one or several columns.
{
name: 'trim',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
This step is unsupported by the following backends:
- Mongo 3.6
Example
Input dataset:
Company | Group | Value | Label |
---|---|---|---|
' Company 1 ' | Group 1 | 13 | Company 1 - Group 1 |
' Company 2 ' | Group 1 | 7 | Company 2 - Group 1 |
Step configuration:
{
name: 'trim',
columns: ['Company']
}
Output dataset:
Company | Group | Value | Label |
---|---|---|---|
'Company 1' | Group 1 | 13 | Company 1 - Group 1 |
'Company 2' | Group 1 | 7 | Company 2 - Group 1 |
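In pandas this is a straight application of str.strip per column (a sketch, not the actual backend code):

```python
import pandas as pd

df = pd.DataFrame({"Company": ["  Company 1  ", "  Company 2  "]})
for col in ["Company"]:
    df[col] = df[col].str.strip()  # removes leading and trailing whitespace
```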
unpivot
step
Unpivot a list of columns to rows.
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'], // columns to keep fixed around which to unpivot columns
unpivot: ['NB_CLIENTS', 'REVENUES'], // columns to unpivot
unpivot_column_name: 'KPI', // name of the new dimension column created after unpivot
value_column_name: 'VALUE', // name of the new value column created after unpivot
dropna: true // whether null values have to be kept or the corresponding rows discarded
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: with dropna parameter set to true
Input dataset:
COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
---|---|---|---|
Company 1 | France | 7 | 10 |
Company 2 | France | 2 | |
Company 1 | USA | 12 | 6 |
Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: true
}
Output dataset:
COMPANY | COUNTRY | KPI | VALUE |
---|---|---|---|
Company 1 | France | NB_CLIENTS | 7 |
Company 1 | France | REVENUES | 10 |
Company 2 | France | NB_CLIENTS | 2 |
Company 1 | USA | NB_CLIENTS | 12 |
Company 1 | USA | REVENUES | 6 |
Company 2 | USA | NB_CLIENTS | 1 |
Company 2 | USA | REVENUES | 3 |
Example 2: with dropna parameter set to false
Input dataset:
COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
---|---|---|---|
Company 1 | France | 7 | 10 |
Company 2 | France | 2 | |
Company 1 | USA | 12 | 6 |
Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: false
}
Output dataset:
COMPANY | COUNTRY | KPI | VALUE |
---|---|---|---|
Company 1 | France | NB_CLIENTS | 7 |
Company 1 | France | REVENUES | 10 |
Company 2 | France | NB_CLIENTS | 2 |
Company 2 | France | REVENUES | |
Company 1 | USA | NB_CLIENTS | 12 |
Company 1 | USA | REVENUES | 6 |
Company 2 | USA | NB_CLIENTS | 1 |
Company 2 | USA | REVENUES | 3 |
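This step maps naturally onto pandas' melt (the unpivot_step helper is ours, not the actual backend implementation):

```python
import pandas as pd

def unpivot_step(df, keep, unpivot, unpivot_column_name, value_column_name, dropna):
    out = df.melt(id_vars=keep, value_vars=unpivot,
                  var_name=unpivot_column_name, value_name=value_column_name)
    if dropna:
        # discard rows whose unpivoted value is null
        out = out.dropna(subset=[value_column_name])
    return out

df = pd.DataFrame({
    "COMPANY": ["Company 1", "Company 2"],
    "COUNTRY": ["France", "France"],
    "NB_CLIENTS": [7, 2],
    "REVENUES": [10, None],
})
out = unpivot_step(df, ["COMPANY", "COUNTRY"], ["NB_CLIENTS", "REVENUES"],
                   "KPI", "VALUE", dropna=True)
```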
uppercase
step
Converts a string column to uppercase.
Note: Mongo's $toUpper operator does not support accents. If you need to uppercase accented characters with a Mongo backend, use a replacetext step after uppercase.
{
name: 'uppercase',
column: 'foo',
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Step configuration:
{
name: 'uppercase',
column: 'Label',
}
Output dataset:
Label | Group | Value |
---|---|---|
LABEL 1 | Group 1 | 13 |
LABEL 2 | Group 1 | 7 |
LABEL 3 | Group 1 | 20 |
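Unlike Mongo's $toUpper, Python's string handling uppercases accented characters correctly, so the Pandas backend needs no workaround (a sketch, not the actual backend code):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "défilé"]})
df["Label"] = df["Label"].str.upper()  # handles accented characters
```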
uniquegroups
step
Get unique groups of values from one or several columns.
{
name: 'uniquegroups',
on: ['foo', 'bar']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
Label | Group | Value |
---|---|---|
Label 1 | Group 1 | 13 |
Label 2 | Group 1 | 7 |
Label 3 | Group 1 | 20 |
Label 1 | Group 2 | 1 |
Label 2 | Group 1 | 2 |
Label 3 | Group 1 | 3 |
Step configuration:
{
name: 'uniquegroups',
on: ['Label', 'Group'],
}
Output dataset:
Label | Group |
---|---|
Label 1 | Group 1 |
Label 1 | Group 2 |
Label 2 | Group 1 |
Label 3 | Group 1 |
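In pandas this amounts to selecting the 'on' columns and deduplicating (a sketch, not the actual backend code):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Label 1", "Label 2", "Label 3", "Label 1", "Label 2", "Label 3"],
    "Group": ["Group 1", "Group 1", "Group 1", "Group 2", "Group 1", "Group 1"],
    "Value": [13, 7, 20, 1, 2, 3],
})
# keep only the 'on' columns, then deduplicate
out = df[["Label", "Group"]].drop_duplicates().sort_values(["Label", "Group"])
```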
waterfall
step
This step generates a data structure useful to build waterfall charts. It breaks down the variation between two values (usually between two dates) across entities. Entities are found in the labelsColumn, and can optionally be regrouped under common parents found in the parentsColumn for drill-down purposes.
{
name: 'waterfall',
valueColumn: 'VALUE', // the column of values
milestonesColumn: 'DATE', // the column where starting and ending blocks labels
// are to be found (usually dates)
start: '2019', // the starting block label to be found in `milestonesColumn`
end: '2020', // the ending block label to be found in `milestonesColumn`
labelsColumn: 'PRODUCT', // the column of labels for the variations breakdown
groupby: ['COUNTRY'], // specify columns if you want to group the waterfall computation
sortBy: 'value', // either 'value' (default) or 'label' depending on whether to
// sort waterfall blocks by their label or value
order: 'desc', // either 'asc' or 'desc' (default) depending on whether to sort
// by ascending or descending order
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
city | year | revenue |
---|---|---|
Bordeaux | 2019 | 135 |
Boston | 2019 | 275 |
New-York | 2019 | 115 |
Paris | 2019 | 450 |
Bordeaux | 2018 | 98 |
Boston | 2018 | 245 |
New-York | 2018 | 103 |
Paris | 2018 | 385 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
sortBy: 'value',
order: 'desc',
}
Output dataset:
LABEL_waterfall | TYPE_waterfall | revenue |
---|---|---|
2018 | null | 831 |
Paris | parent | 65 |
Bordeaux | parent | 37 |
Boston | parent | 30 |
New-York | parent | 12 |
2019 | null | 975 |
Example 2: With more options
Input dataset:
city | country | product | year | revenue |
---|---|---|---|---|
Bordeaux | France | product1 | 2019 | 65 |
Bordeaux | France | product2 | 2019 | 70 |
Paris | France | product1 | 2019 | 210 |
Paris | France | product2 | 2019 | 240 |
Boston | USA | product1 | 2019 | 130 |
Boston | USA | product2 | 2019 | 145 |
New-York | USA | product1 | 2019 | 55 |
New-York | USA | product2 | 2019 | 60 |
Bordeaux | France | product1 | 2018 | 38 |
Bordeaux | France | product2 | 2018 | 60 |
Paris | France | product1 | 2018 | 175 |
Paris | France | product2 | 2018 | 210 |
Boston | USA | product1 | 2018 | 95 |
Boston | USA | product2 | 2018 | 150 |
New-York | USA | product1 | 2018 | 50 |
New-York | USA | product2 | 2018 | 53 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
parentsColumn: 'country',
groupby: ['product'],
sortBy: 'label',
order: 'asc',
}
Output dataset:
LABEL_waterfall | GROUP_waterfall | TYPE_waterfall | product | revenue |
---|---|---|---|---|
2018 | 2018 | null | product1 | 358 |
2018 | 2018 | null | product2 | 473 |
Bordeaux | France | child | product1 | 27 |
Bordeaux | France | child | product2 | 10 |
Boston | USA | child | product1 | 35 |
Boston | USA | child | product2 | -5 |
France | France | parent | product2 | 40 |
France | France | parent | product1 | 62 |
New-York | USA | child | product1 | 5 |
New-York | USA | child | product2 | 7 |
Paris | France | child | product1 | 35 |
Paris | France | child | product2 | 30 |
USA | USA | parent | product2 | 2 |
USA | USA | parent | product1 | 40 |
2019 | 2019 | null | product2 | 515 |
2019 | 2019 | null | product1 | 460 |
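The basic case (example 1) can be sketched in pandas as a pivot on the milestones followed by a per-label difference (the waterfall_step helper is ours and covers only the simple case: no parentsColumn, no groupby):

```python
import pandas as pd

def waterfall_step(df, value_column, milestones_column, start, end, labels_column):
    # one row per label, one column per milestone
    pivot = df.pivot(index=labels_column, columns=milestones_column,
                     values=value_column)
    variation = (pivot[end] - pivot[start]).sort_values(ascending=False)
    rows = [{"LABEL_waterfall": start, "TYPE_waterfall": None,
             value_column: pivot[start].sum()}]
    rows += [{"LABEL_waterfall": label, "TYPE_waterfall": "parent",
              value_column: delta} for label, delta in variation.items()]
    rows.append({"LABEL_waterfall": end, "TYPE_waterfall": None,
                 value_column: pivot[end].sum()})
    return pd.DataFrame(rows)

df = pd.DataFrame({
    "city": ["Bordeaux", "Boston", "New-York", "Paris"] * 2,
    "year": ["2019"] * 4 + ["2018"] * 4,
    "revenue": [135, 275, 115, 450, 98, 245, 103, 385],
})
out = waterfall_step(df, "revenue", "year", "2018", "2019", "city")
```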