Steps definition
This document describes the steps that can be used in our pipelines. A step is the unit of transformation generated by a user interaction with the visual query builder.
Unless explicitly specified, all steps are supported by all backends.
absolutevalue step
This step is meant to compute the absolute value of a given input column.
{
name: 'absolutevalue',
column: 'my-column',
new_column_name: 'my-new-column'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | -33 |
| Company 2 | 0 |
| Company 3 | 10 |
Step configuration:
{
name: 'absolutevalue',
column: "Value",
new_column_name: "My-absolute-value",
}
Output dataset:
| Company | Value | My-absolute-value |
|---|---|---|
| Company 1 | -33 | 33 |
| Company 2 | 0 | 0 |
| Company 3 | 10 | 10 |
addmissingdates step
Add missing dates as new rows in a dates column. The completed dates will range between the minimum and maximum dates found in the dataset (or in each group if a group by logic is applied - see below).
Added rows will be set to null in columns not referenced in the step configuration.
You should make sure to use a group by logic if you want to add missing dates in
independent groups of rows (e.g. you may need to add missing rows for every
country found in a “COUNTRY” column). And you should ensure that every date is
unique in every group of rows at the specified granularity, else you may get
inconsistent results. You can specify “group by” columns in the groups
parameter.
An addmissingdates step has the following structure:
{
name: 'addmissingdates',
datesColumn: 'DATE', // the dates columns with missing rows
datesGranularity: 'day', // the granularity of the date column (day, month or year)
groups: ['COUNTRY'] // (optional) columns used for group by logic if necessary
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
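For intuition, here is a minimal pandas sketch of what this step computes (an illustration, not the actual backend implementation), assuming the dates column holds datetime values:

import pandas as pd

def add_missing_dates(df, dates_column, freq='D', groups=None):
    # Reindex on a complete date range (freq='D' corresponds to 'day' granularity);
    # dates missing from the range appear as new rows filled with nulls.
    def complete(g):
        full = pd.date_range(g[dates_column].min(), g[dates_column].max(), freq=freq)
        g = g.set_index(dates_column).reindex(full).rename_axis(dates_column).reset_index()
        if groups:
            g[groups] = g[groups].ffill().bfill()  # added rows keep their group labels
        return g
    if groups:
        return df.groupby(groups, group_keys=False).apply(complete).reset_index(drop=True)
    return complete(df)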
Example 1: day granularity without groups
Input dataset:
| DATE | VALUE |
|---|---|
| “2018-01-01T00:00:00.000Z” | 75 |
| “2018-01-02T00:00:00.000Z” | 80 |
| “2018-01-03T00:00:00.000Z” | 82 |
| “2018-01-04T00:00:00.000Z” | 83 |
| “2018-01-05T00:00:00.000Z” | 80 |
| “2018-01-07T00:00:00.000Z” | 86 |
| “2018-01-08T00:00:00.000Z” | 79 |
| “2018-01-09T00:00:00.000Z” | 76 |
| “2018-01-10T00:00:00.000Z” | 79 |
| “2018-01-11T00:00:00.000Z” | 75 |
Here the day “2018-01-06” is missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day'
}
Output dataset:
| DATE | VALUE |
|---|---|
| “2018-01-01T00:00:00.000Z” | 75 |
| “2018-01-02T00:00:00.000Z” | 80 |
| “2018-01-03T00:00:00.000Z” | 82 |
| “2018-01-04T00:00:00.000Z” | 83 |
| “2018-01-05T00:00:00.000Z” | 80 |
| “2018-01-06T00:00:00.000Z” | |
| “2018-01-07T00:00:00.000Z” | 86 |
| “2018-01-08T00:00:00.000Z” | 79 |
| “2018-01-09T00:00:00.000Z” | 76 |
| “2018-01-10T00:00:00.000Z” | 79 |
| “2018-01-11T00:00:00.000Z” | 75 |
Example 2: day granularity with groups
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | “2018-01-01T00:00:00.000Z” | 75 |
| France | “2018-01-02T00:00:00.000Z” | 80 |
| France | “2018-01-03T00:00:00.000Z” | 82 |
| France | “2018-01-04T00:00:00.000Z” | 83 |
| France | “2018-01-05T00:00:00.000Z” | 80 |
| France | “2018-01-07T00:00:00.000Z” | 86 |
| France | “2018-01-08T00:00:00.000Z” | 79 |
| France | “2018-01-09T00:00:00.000Z” | 76 |
| France | “2018-01-10T00:00:00.000Z” | 79 |
| France | “2018-01-11T00:00:00.000Z” | 85 |
| USA | “2018-01-01T00:00:00.000Z” | 69 |
| USA | “2018-01-02T00:00:00.000Z” | 73 |
| USA | “2018-01-03T00:00:00.000Z” | 73 |
| USA | “2018-01-05T00:00:00.000Z” | 75 |
| USA | “2018-01-06T00:00:00.000Z” | 70 |
| USA | “2018-01-07T00:00:00.000Z” | 76 |
| USA | “2018-01-08T00:00:00.000Z” | 73 |
| USA | “2018-01-09T00:00:00.000Z” | 70 |
| USA | “2018-01-10T00:00:00.000Z” | 72 |
| USA | “2018-01-12T00:00:00.000Z” | 78 |
Here the day “2018-01-06” is missing for “France” rows, and “2018-01-04” and “2018-01-11” are missing for “USA” rows.
Note that “2018-01-12” will not be considered as a missing row for “France” rows, because the latest date found for this group of rows is “2018-01-11” (even though “2018-01-12” is the latest date found for “USA” rows).
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'day',
groups: ['COUNTRY']
}
Output dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | “2018-01-01T00:00:00.000Z” | 75 |
| France | “2018-01-02T00:00:00.000Z” | 80 |
| France | “2018-01-03T00:00:00.000Z” | 82 |
| France | “2018-01-04T00:00:00.000Z” | 83 |
| France | “2018-01-05T00:00:00.000Z” | 80 |
| France | “2018-01-06T00:00:00.000Z” | |
| France | “2018-01-07T00:00:00.000Z” | 86 |
| France | “2018-01-08T00:00:00.000Z” | 79 |
| France | “2018-01-09T00:00:00.000Z” | 76 |
| France | “2018-01-10T00:00:00.000Z” | 79 |
| France | “2018-01-11T00:00:00.000Z” | 85 |
| USA | “2018-01-01T00:00:00.000Z” | 69 |
| USA | “2018-01-02T00:00:00.000Z” | 73 |
| USA | “2018-01-03T00:00:00.000Z” | 73 |
| USA | “2018-01-04T00:00:00.000Z” | |
| USA | “2018-01-05T00:00:00.000Z” | 75 |
| USA | “2018-01-06T00:00:00.000Z” | 70 |
| USA | “2018-01-07T00:00:00.000Z” | 76 |
| USA | “2018-01-08T00:00:00.000Z” | 73 |
| USA | “2018-01-09T00:00:00.000Z” | 70 |
| USA | “2018-01-10T00:00:00.000Z” | 72 |
| USA | “2018-01-11T00:00:00.000Z” | |
| USA | “2018-01-12T00:00:00.000Z” | 78 |
Example 3: month granularity
Input dataset:
| DATE | VALUE |
|---|---|
| “2019-01-01T00:00:00.000Z” | 74 |
| “2019-02-01T00:00:00.000Z” | 73 |
| “2019-03-01T00:00:00.000Z” | 68 |
| “2019-04-01T00:00:00.000Z” | 71 |
| “2019-06-01T00:00:00.000Z” | 74 |
| “2019-07-01T00:00:00.000Z” | 74 |
| “2019-08-01T00:00:00.000Z” | 73 |
| “2019-09-01T00:00:00.000Z” | 72 |
| “2019-10-01T00:00:00.000Z” | 75 |
| “2019-12-01T00:00:00.000Z” | 76 |
Here “2019-05” and “2019-11” are missing.
Step configuration:
{
name: 'addmissingdates',
datesColumn: 'DATE',
datesGranularity: 'month',
}
Output dataset:
| DATE | VALUE |
|---|---|
| “2019-01-01T00:00:00.000Z” | 74 |
| “2019-02-01T00:00:00.000Z” | 73 |
| “2019-03-01T00:00:00.000Z” | 68 |
| “2019-04-01T00:00:00.000Z” | 71 |
| “2019-05-01T00:00:00.000Z” | |
| “2019-06-01T00:00:00.000Z” | 74 |
| “2019-07-01T00:00:00.000Z” | 74 |
| “2019-08-01T00:00:00.000Z” | 73 |
| “2019-09-01T00:00:00.000Z” | 72 |
| “2019-10-01T00:00:00.000Z” | 75 |
| “2019-11-01T00:00:00.000Z” | |
| “2019-12-01T00:00:00.000Z” | 76 |
aggregate step
Perform aggregations on one or several columns. Available aggregation functions are sum, average, count, count distinct, min, max, first, last.
An aggregation step has the following structure:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
// ...
],
keepOriginalGranularity: false, // whether to keep the original granularity, in which
// case computed aggregations will be added in new columns
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column and newcolumn (in the singular) properties of the aggregation
parameter are deprecated and are supported only for retrocompatibility purposes.
When this step was first created, only 1 column at a time could be aggregated for
a given aggregation.
Now columns and newcolumns are lists of columns, allowing the same
aggregation function to be applied to several columns at once.
An old-fashioned step looked like this:
{
name: 'aggregate',
on: ['column1', 'column2'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1: keepOriginalGranularity set to false
Input dataset:
| Label | Group | Value1 | Value2 |
|---|---|---|---|
| Label 1 | Group 1 | 13 | 10 |
| Label 2 | Group 1 | 7 | 21 |
| Label 3 | Group 1 | 20 | 4 |
| Label 4 | Group 2 | 1 | 17 |
| Label 5 | Group 2 | 9 | 12 |
| Label 6 | Group 2 | 5 | 2 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Sum-Value1', 'Sum-Value2'],
aggfunction: 'sum',
columns: ['Value1', 'Value2']
},
{
newcolumns: ['Avg-Value1'],
aggfunction: 'avg',
columns: ['Value1']
}
],
keepOriginalGranularity: false,
}
Output dataset:
| Group | Sum-Value1 | Sum-Value2 | Avg-Value1 |
|---|---|---|---|
| Group 1 | 40 | 35 | 13.333333 |
| Group 2 | 16 | 31 | 5.333333 |
Example 2: keepOriginalGranularity set to true
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'aggregate',
on: ['Group'],
aggregations: [
{
newcolumns: ['Total'],
aggfunction: 'sum',
columns: ['Value']
}
],
keepOriginalGranularity: true,
}
Output dataset:
| Label | Group | Value | Total |
|---|---|---|---|
| Label 1 | Group 1 | 13 | 40 |
| Label 2 | Group 1 | 7 | 40 |
| Label 3 | Group 1 | 20 | 40 |
| Label 4 | Group 2 | 1 | 16 |
| Label 5 | Group 2 | 10 | 16 |
| Label 6 | Group 2 | 5 | 16 |
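For intuition, keepOriginalGranularity maps onto the difference between pandas’ groupby().agg() and groupby().transform(). A minimal sketch (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({
    'Label': [f'Label {i}' for i in range(1, 7)],
    'Group': ['Group 1'] * 3 + ['Group 2'] * 3,
    'Value': [13, 7, 20, 1, 10, 5],
})

# keepOriginalGranularity: false -> one row per group
totals = df.groupby('Group', as_index=False).agg(Total=('Value', 'sum'))

# keepOriginalGranularity: true -> original rows kept, aggregate written in a new column
df['Total'] = df.groupby('Group')['Value'].transform('sum')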
append step
Appends one or several datasets resulting from other pipelines to the current
dataset.
WeaverBird allows you to save pipelines referenced by name in the Vuex store
of the application. You can then call them by their unique names in this step.
{
name: 'append',
pipelines: ['pipeline1', 'pipeline2'] // the name of the pipelines to append
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
dataset1 (saved in the application Vuex store):
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
dataset2 (saved in the application Vuex store):
| Label | Group | Value |
|---|---|---|
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'append',
pipelines: ['dataset1', 'dataset2']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
argmax step
Get row(s) matching the maximum value in a given column, by group if groups
is specified.
{
name: 'argmax',
column: 'value', // column in which to search for the max value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value'
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
Example 2: with groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmax',
column: 'Value',
groups: ['Group']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 5 | Group 2 | 10 |
argmin step
Get row(s) matching the minimum value in a given column, by group if groups
is specified.
{
name: 'argmin',
column: 'value', // column in which to search for the min value
groups: ['group1', 'group2'] // optional
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: without groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value'
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 4 | Group 2 | 1 |
Example 2: with groups
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'argmin',
column: 'Value',
groups: ['Group']
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 2 | Group 1 | 7 |
| Label 4 | Group 2 | 1 |
comparetext step
Compares 2 string columns and returns true if the string values are equal, and false otherwise. The comparison is case-sensitive (see examples below).
{
name: 'comparetext',
newColumnName: 'NEW', // new column to be created, with the comparison result
strCol1: 'TEXT_1', // first string column to compare
strCol2: 'TEXT_2', // second string column to compare
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| TEXT_1 | TEXT_2 |
|---|---|
| France | Fr |
| France | France |
| France | france |
| France | England |
| France | USA |
Step configuration:
{
name: 'comparetext',
newColumnName: 'RESULT',
strCol1: 'TEXT_1',
strCol2: 'TEXT_2',
}
Output dataset:
| TEXT_1 | TEXT_2 | RESULT |
|---|---|---|
| France | Fr | false |
| France | France | true |
| France | france | false |
| France | England | false |
| France | USA | false |
concatenate step
This step allows concatenating several columns using a separator.
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ', // can be a string of any length
new_column_name: 'Label' // the new column in which to write the concatenation
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value |
|---|---|---|
| Company 1 | Group 1 | 13 |
| Company 2 | Group 1 | 7 |
| Company 3 | Group 1 | 20 |
| Company 4 | Group 2 | 1 |
| Company 5 | Group 2 | 10 |
| Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'concatenate',
columns: ['Company', 'Group'],
separator: ' - ',
new_column_name: 'Label'
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
convert step
This step allows converting column data types.
{
name: 'convert',
columns: ['col1', 'col2'],
data_type: 'integer' // the data type to convert the columns into; either
// 'integer', 'float', 'text', 'date' or 'boolean'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
- Snowflake
In an effort to harmonize conversion behaviors across backends as much as possible, the SQL translator implements some casts differently from the plain CAST AS method.
Specifically, when casting float to integer, the default CAST behavior rounds the result, whereas other languages truncate it.
That’s why TRUNCATE is used when converting float to int.
The same implementation was done when converting strings to int (for dates represented as strings).
As for the conversion of date to int, it is handled by assuming the dataset’s timestamps are in TIMESTAMP_NTZ format.
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | ‘13’ |
| Company 2 | ‘7’ |
| Company 3 | ‘20’ |
| Company 4 | ‘1’ |
| Company 5 | ‘10’ |
| Company 6 | ‘5’ |
Step configuration:
{
name: 'convert',
columns: ['Value'],
data_type: 'integer'
}
Output dataset:
| Company | Value |
|---|---|
| Company 1 | 13 |
| Company 2 | 7 |
| Company 3 | 20 |
| Company 4 | 1 |
| Company 5 | 10 |
| Company 6 | 5 |
cumsum step
This step allows computing the cumulative sum of value columns, based on a reference column (usually dates) sorted in ascending order for the needs of the computation. The computation can be scoped by group if needed.
The toCumSum parameter takes as input a list of 2-elements lists in the form
[‘valueColumn’, ‘newColumn’].
{
name: 'cumsum',
toCumSum: [['myValues', 'myCumsum']],
referenceColumn: 'myDates',
groupby: ['foo', 'bar'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
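A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def cumsum_step(df, to_cum_sum, reference_column, groupby=None):
    # Sort by the reference column, then accumulate each value column (per group if any).
    df = df.sort_values(reference_column)
    for value_col, new_col in to_cum_sum:
        new_col = new_col or f'{value_col}_CUMSUM'  # default name when left empty
        series = df.groupby(groupby)[value_col] if groupby else df[value_col]
        df[new_col] = series.cumsum()
    return df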
Example 1: Basic usage
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-01 | 2 |
| 2019-02 | 5 |
| 2019-03 | 3 |
| 2019-04 | 8 |
| 2019-05 | 9 |
| 2019-06 | 6 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', '']], // an empty new column name defaults to '<valueColumn>_CUMSUM'
referenceColumn: 'DATE',
}
Output dataset:
| DATE | VALUE | VALUE_CUMSUM |
|---|---|---|
| 2019-01 | 2 | 2 |
| 2019-02 | 5 | 7 |
| 2019-03 | 3 | 10 |
| 2019-04 | 8 | 18 |
| 2019-05 | 9 | 27 |
| 2019-06 | 6 | 33 |
Example 2: With more advanced options
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | 2019-01 | 2 |
| France | 2019-02 | 5 |
| France | 2019-03 | 3 |
| France | 2019-04 | 8 |
| France | 2019-05 | 9 |
| France | 2019-06 | 6 |
| USA | 2019-01 | 10 |
| USA | 2019-02 | 6 |
| USA | 2019-03 | 6 |
| USA | 2019-04 | 4 |
| USA | 2019-05 | 8 |
| USA | 2019-06 | 7 |
Step configuration:
{
name: 'cumsum',
toCumSum: [['VALUE', 'MY_CUMSUM']],
referenceColumn: 'DATE',
groupby: ['COUNTRY'],
}
Output dataset:
| COUNTRY | DATE | VALUE | MY_CUMSUM |
|---|---|---|---|
| France | 2019-01 | 2 | 2 |
| France | 2019-02 | 5 | 7 |
| France | 2019-03 | 3 | 10 |
| France | 2019-04 | 8 | 18 |
| France | 2019-05 | 9 | 27 |
| France | 2019-06 | 6 | 33 |
| USA | 2019-01 | 10 | 10 |
| USA | 2019-02 | 6 | 16 |
| USA | 2019-03 | 6 | 22 |
| USA | 2019-04 | 4 | 26 |
| USA | 2019-05 | 8 | 34 |
| USA | 2019-06 | 7 | 41 |
custom step
This step allows defining a custom query that can’t be expressed using the other existing steps.
{
name: 'custom',
query: '$group: {_id: ...}'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
Example: using Mongo query language
Input dataset:
| Company | Group | Value |
|---|---|---|
| Company 1 | Group 1 | 13 |
| Company 2 | Group 1 | 7 |
| Company 3 | Group 1 | 20 |
| Company 4 | Group 2 | 1 |
| Company 5 | Group 2 | 10 |
| Company 6 | Group 2 | 5 |
Step configuration:
{
name: 'custom',
query: '$addFields: { Label: { $concat: [ "$Company", " - ", "$Group" ] } }'
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
dateextract step
Extract date information (e.g. day, week, year). The following information can be extracted:
- year: extract ‘year’ from date
- month: extract ‘month’ from date
- day: extract ‘day of month’ from date
- week: extract ‘week number’ (ranging from 0 to 53) from date
- quarter: extract ‘quarter number’ from date (1 for Jan-Feb-Mar)
- dayOfWeek: extract ‘day of week’ (ranging from 1 for Sunday to 7 for Saturday) from date
- dayOfYear: extract ‘day of year’ from date
- isoYear: extract ‘year number’ in ISO 8601 format from date
- isoWeek: extract ‘week number’ in ISO 8601 format (ranging from 1 to 53) from date
- isoDayOfWeek: extract ‘day of week’ in ISO 8601 format (ranging from 1 for Monday to 7 for Sunday) from date
- firstDayOfYear: calendar date corresponding to the first day (1st of January) of the year
- firstDayOfMonth: calendar date corresponding to the first day of the month
- firstDayOfWeek: calendar date corresponding to the first day of the week
- firstDayOfQuarter: calendar date corresponding to the first day of the quarter
- firstDayOfIsoWeek: calendar date corresponding to the first day of the week in ISO 8601 format
- currentDay: calendar date of the target date
- previousDay: calendar date one day before the target date
- firstDayOfPreviousYear: calendar date corresponding to the first day (1st of January) of the previous year
- firstDayOfPreviousMonth: calendar date corresponding to the first day of the previous month
- firstDayOfPreviousWeek: calendar date corresponding to the first day of the previous week
- firstDayOfPreviousQuarter: calendar date corresponding to the first day of the previous quarter
- firstDayOfPreviousISOWeek: calendar date corresponding to the first day of the previous ISO week
- previousYear: extract previous ‘year number’ from date
- previousMonth: extract previous ‘month number’ from date
- previousWeek: extract previous ‘week number’ from date
- previousQuarter: extract previous ‘quarter number’ from date
- previousISOWeek: extract previous ‘week number’ in ISO 8601 format from date
- hour: extract ‘hour’ from date
- minutes: extract ‘minutes’ from date
- seconds: extract ‘seconds’ from date
- milliseconds: extract ‘milliseconds’ from date
Here’s an example of such a step:
{
name: 'dateextract',
column: 'date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['date_year', 'date_month', 'date_day'],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
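For intuition, most dateInfo values map onto pandas’ .dt accessors. A sketch of the year/month/day case (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-10-30', '2019-09-15'])})

# One new column per extracted piece of date information.
df['Date_year'] = df['Date'].dt.year
df['Date_month'] = df['Date'].dt.month
df['Date_day'] = df['Date'].dt.day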
Deprecation note:
The operation and new_column_name parameters are deprecated and are
supported for retrocompatibility purposes only.
An old-fashioned step looks like this:
{
name: 'dateextract',
column: 'date',
operation: 'day',
new_column_name: 'date_day',
}
Example
Input dataset:
| Date |
|---|
| 2019-10-30T00:00:00.000Z |
| 2019-10-15T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-30T00:00:00.000Z |
| 2019-09-15T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
Step configuration:
{
name: 'dateextract',
column: 'Date',
dateInfo: ['year', 'month', 'day'],
newColumns: ['Date_year', 'Date_month', 'Date_day'],
}
Output dataset:
| Date | Date_year | Date_month | Date_day |
|---|---|---|---|
| 2019-10-30T00:00:00.000Z | 2019 | 10 | 30 |
| 2019-10-15T00:00:00.000Z | 2019 | 10 | 15 |
| 2019-10-01T00:00:00.000Z | 2019 | 10 | 1 |
| 2019-09-30T00:00:00.000Z | 2019 | 9 | 30 |
| 2019-09-15T00:00:00.000Z | 2019 | 9 | 15 |
| 2019-09-01T00:00:00.000Z | 2019 | 9 | 1 |
dategranularity step
Truncate dates to a given granularity (e.g. month or year), typically in a column intended for aggregation. The following granularities are supported:
- year: calendar date corresponding to the first day (1st of January) of the year
- quarter: calendar date corresponding to the first day of the quarter
- month: calendar date corresponding to the first day of the month
- week: calendar date corresponding to the first day of the week (Sunday)
- isoWeek: calendar date corresponding to the first day of the week (Monday)
- day: calendar date corresponding to the first hour of the day
Here’s an example of such a step:
{
name: 'dategranularity',
column: 'date',
granularity: 'year',
newColumn: 'do_the_aggregate_on_this' // optional, the source column is overwritten when omitted
}
This step is supported by the following backends:
- Mongo
- Pandas (python)
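For intuition, a pandas sketch of the 'month' granularity (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-10-30', '2019-09-30 05:11:31'])})

# Truncate each date to the first day of its month.
df['Date'] = df['Date'].dt.to_period('M').dt.to_timestamp()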
Example
Input dataset:
| Date |
|---|
| 2019-10-30T00:00:00.000Z |
| 2019-10-15T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-30T05:11:31.000Z |
| 2019-09-15T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
Step configuration:
{
name: 'dategranularity',
column: 'Date',
granularity: 'month'
}
Output dataset:
| Date |
|---|
| 2019-10-01T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-10-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
| 2019-09-01T00:00:00.000Z |
delete step
Delete one or several columns.
{
name: 'delete',
columns: ['my-column', 'some-other-column']
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'delete',
columns: ['Company', 'Group']
}
Output dataset:
| Value | Label |
|---|---|
| 13 | Company 1 - Group 1 |
| 7 | Company 2 - Group 1 |
| 20 | Company 3 - Group 1 |
| 1 | Company 4 - Group 2 |
| 10 | Company 5 - Group 2 |
| 5 | Company 6 - Group 2 |
dissolve step
Dissolve geographical data, i.e. merge the geometries of rows sharing the same values in the group columns.
{
name: 'dissolve',
groups: ['my-column', 'some-other-column'],
include_nulls: true,
aggregations: [], // optional, same syntax as in the aggregate step
}
This step is supported by the following backends:
- Pandas (python)
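The behavior can be pictured with geopandas’ dissolve. A sketch in which the file and column names are illustrative assumptions (a recent geopandas is assumed for the aggfunc dict and dropna parameters):

import geopandas as gpd

gdf = gpd.read_file('cities.geojson')  # hypothetical input with 'Country' and 'Population' columns

dissolved = gdf.dissolve(
    by='Country',                   # the step's 'groups' parameter
    aggfunc={'Population': 'sum'},  # the step's optional 'aggregations'
    dropna=False,                   # include_nulls: true
).reset_index()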
Example without aggregations
Input dataset:
| Country | City | geometry |
|---|---|---|
| Country 1 | City 1 | Polygon |
| Country 2 | City 2 | Polygon |
| Country 2 | City 3 | Polygon |
| Country 1 | City 4 | Polygon |
| Country 2 | City 5 | Polygon |
| Country 1 | City 6 | Polygon |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
}
Output dataset:
| Country | geometry |
|---|---|
| Country 1 | MultiPolygon |
| Country 2 | MultiPolygon |
Example with aggregations
Input dataset:
| Country | City | geometry | Population |
|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 |
| Country 2 | City 2 | Polygon | 50_000 |
| Country 2 | City 3 | Polygon | 200_000 |
| Country 1 | City 4 | Polygon | 30_000 |
| Country 2 | City 5 | Polygon | 25_000 |
| Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'dissolve',
groups: ['Country'],
include_nulls: true,
aggregations: [
{
aggfunction: 'sum',
columns: ['Population'],
newcolumns: ['Total population'],
}
]
}
Output dataset:
| Country | geometry | Total population |
|---|---|---|
| Country 1 | MultiPolygon | 140_000 |
| Country 2 | MultiPolygon | 275_000 |
domain step
This step is meant to select a specific domain (using MongoDB terminology).
{
name: 'domain',
domain: 'my-domain'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
duplicate step
This step is meant to duplicate a column.
{
name: 'duplicate',
column: 'my-column',
new_column_name: 'my-duplicate'
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Value |
|---|---|
| Company 1 | 13 |
| Company 2 | 0 |
| Company 3 | 20 |
Step configuration:
{
name: 'duplicate',
column: "Company",
new_column_name: "Company-copy",
}
Output dataset:
| Company | Value | Company-copy |
|---|---|---|
| Company 1 | 13 | Company 1 |
| Company 2 | 0 | Company 2 |
| Company 3 | 20 | Company 3 |
duration step
Compute the duration (in days, hours, minutes or seconds) between 2 dates in a new column.
{
name: 'duration',
newColumnName: 'DURATION', // new column name
startDateColumn: 'START_DATE', // start date
endDateColumn: 'END_DATE', // end date
durationIn: 'days', // duration units (days, hours, minutes or seconds)
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
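A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

SECONDS_PER_UNIT = {'days': 86400, 'hours': 3600, 'minutes': 60, 'seconds': 1}

def duration(df, new_column_name, start_date_column, end_date_column, duration_in):
    # Difference between the two datetime columns, converted into the chosen unit.
    delta = pd.to_datetime(df[end_date_column]) - pd.to_datetime(df[start_date_column])
    df[new_column_name] = delta.dt.total_seconds() / SECONDS_PER_UNIT[duration_in]
    return df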
Example 1: duration in days
Input dataset:
| START_DATE | END_DATE |
|---|---|
| “2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” |
| “2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_DATE',
endDateColumn: 'END_DATE',
durationIn: 'days',
}
Output dataset:
| START_DATE | END_DATE | DURATION |
|---|---|---|
| “2020-01-01T00:00:00.000Z” | “2020-01-31T00:00:00.000Z” | 30 |
| “2020-01-01T00:00:00.000Z” | “2020-12-31T00:00:00.000Z” | 365 |
Example 2: duration in minutes
Input dataset:
| START_HOUR | END_HOUR |
|---|---|
| “2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” |
| “2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” |
Step configuration:
{
name: 'duration',
newColumnName: 'DURATION',
startDateColumn: 'START_HOUR',
endDateColumn: 'END_HOUR',
durationIn: 'minutes',
}
Output dataset:
| START_HOUR | END_HOUR | DURATION |
|---|---|---|
| “2020-01-01T14:00:00.000Z” | “2020-01-01T15:00:00.000Z” | 60 |
| “2020-01-01T15:00:00.000Z” | “2020-01-01T20:00:00.000Z” | 300 |
evolution step
Use this step if you need to compute the row-by-row evolution of a value column, based on a date column. It will output 2 columns: one for the evolution in absolute value, the other for the evolution in percentage.
You must be careful that the computation is scoped so that there are no
duplicate dates (so that any date finds no more than one previous date). That
means that you may need to specify “group by” columns to make any date unique
inside each group. You should specify those columns in the indexColumns parameter.
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear', // or vsLastMonth, vsLastWeek, vsLastDay
evolutionFormat: 'abs', // or pct
// optional, if computation has to be performed by group so that any date finds no more than one previous date
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL', // optional, <originalColumnName>_EVOL_ABS or <originalColumnName>_EVOL_PCT by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
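For intuition, a pandas sketch of the 'abs' computation (an illustration, not the actual backend implementation; offset would be pd.DateOffset(years=1) for vsLastYear, pd.DateOffset(months=1) for vsLastMonth, and so on):

import pandas as pd

def evolution_abs(df, date_col, value_col, offset, index_columns=()):
    # Look up the value at (date - offset) inside each group, then subtract.
    keys = list(index_columns) + [date_col]
    if df.duplicated(keys).any():
        raise ValueError('duplicate dates inside a group')  # prohibited, see example 3
    previous = df[keys + [value_col]].copy()
    previous[date_col] = previous[date_col] + offset  # shift forward to align with the next period
    previous = previous.rename(columns={value_col: '_previous'})
    out = df.merge(previous, on=keys, how='left')
    out[f'{value_col}_EVOL_ABS'] = out[value_col] - out['_previous']
    return out.drop(columns='_previous')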
Example 1: Basic configuration - evolution in absolute value
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-06 | 79 |
| 2019-07 | 81 |
| 2019-08 | 77 |
| 2019-09 | 75 |
| 2019-11 | 78 |
| 2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
| DATE | VALUE | VALUE_EVOL_ABS |
|---|---|---|
| 2019-06 | 79 | |
| 2019-07 | 81 | 2 |
| 2019-08 | 77 | -4 |
| 2019-09 | 75 | -2 |
| 2019-11 | 78 | |
| 2019-12 | 88 | 10 |
Example 2: Basic configuration - evolution in percentage
Input dataset:
| DATE | VALUE |
|---|---|
| 2019-06 | 79 |
| 2019-07 | 81 |
| 2019-08 | 77 |
| 2019-09 | 75 |
| 2019-11 | 78 |
| 2019-12 | 88 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastMonth',
evolutionFormat: 'pct',
indexColumns: [],
}
Output dataset:
| DATE | VALUE | VALUE_EVOL_PCT |
|---|---|---|
| 2019-06 | 79 | |
| 2019-07 | 81 | 0.02531645569620253 |
| 2019-08 | 77 | -0.04938271604938271 |
| 2019-09 | 75 | -0.025974025974025976 |
| 2019-11 | 78 | |
| 2019-12 | 88 | 0.1282051282051282 |
Example 3: Error on duplicate dates
If ‘COUNTRY’ is not specified in indexColumns, the computation will not be scoped by country. In that case there are duplicate dates in the “DATE” column, which is prohibited and will lead to an error.
Input dataset:
| DATE | COUNTRY | VALUE |
|---|---|---|
| 2014-12 | France | 79 |
| 2015-12 | France | 81 |
| 2016-12 | France | 77 |
| 2017-12 | France | 75 |
| 2014-12 | USA | 74 |
| 2015-12 | USA | 74 |
| 2016-12 | USA | 73 |
| 2017-12 | USA | 72 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: [],
}
Output dataset:
With the mongo translator, you will get an error at row-level as shown below:
| DATE | COUNTRY | VALUE | VALUE_EVOL_ABS |
|---|---|---|---|
| 2014-12 | France | 79 | |
| 2015-12 | France | 81 | Error … |
| 2016-12 | France | 77 | Error … |
| 2017-12 | France | 75 | Error … |
| 2014-12 | USA | 74 | |
| 2015-12 | USA | 74 | Error … |
| 2016-12 | USA | 73 | Error … |
| 2017-12 | USA | 72 | Error … |
The pandas translator will just return an error, and you will not get any data.
Example 4: Complete configuration with index columns
Input dataset:
| DATE | COUNTRY | VALUE |
|---|---|---|
| 2014-12 | France | 79 |
| 2015-12 | France | 81 |
| 2016-12 | France | 77 |
| 2017-12 | France | 75 |
| 2019-12 | France | 78 |
| 2020-12 | France | 88 |
| 2014-12 | USA | 74 |
| 2015-12 | USA | 74 |
| 2016-12 | USA | 73 |
| 2017-12 | USA | 72 |
| 2018-12 | USA | 75 |
| 2020-12 | USA | 76 |
Step configuration:
{
name: 'evolution',
dateCol: 'DATE',
valueCol: 'VALUE',
evolutionType: 'vsLastYear',
evolutionFormat: 'abs',
indexColumns: ['COUNTRY'],
newColumn: 'MY_EVOL',
}
Output dataset:
| DATE | COUNTRY | VALUE | MY_EVOL |
|---|---|---|---|
| 2014-12 | France | 79 | |
| 2015-12 | France | 81 | 2 |
| 2016-12 | France | 77 | -4 |
| 2017-12 | France | 75 | -2 |
| 2019-12 | France | 78 | |
| 2020-12 | France | 88 | 10 |
| 2014-12 | USA | 74 | |
| 2015-12 | USA | 74 | 0 |
| 2016-12 | USA | 73 | -1 |
| 2017-12 | USA | 72 | -1 |
| 2018-12 | USA | 75 | 3 |
| 2020-12 | USA | 76 | |
fillna step
Replace null values by a given value in specified columns.
{
name: 'fillna',
columns: ["foo", "bar"],
value: 0
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column (in the singular) parameter is deprecated and is supported only for
retrocompatibility purposes.
An old-fashioned step looks like this:
{
name: 'fillna',
column: 'foo',
value: 0
}
Example
Input dataset:
| Company | Group | Value | KPI |
|---|---|---|---|
| Company 1 | Group 1 | 13 | |
| Company 2 | Group 1 | | 12 |
| Company 3 | Group 1 | 20 | 40 |
| Company 4 | Group 2 | 1 | |
| Company 5 | Group 2 | | 38 |
| Company 6 | Group 2 | 5 | 4 |
Step configuration:
{
name: 'fillna',
columns: ["Value", "KPI"],
value: 0
}
Output dataset:
| Company | Group | Value | KPI |
|---|---|---|---|
| Company 1 | Group 1 | 13 | 0 |
| Company 2 | Group 1 | 0 | 12 |
| Company 3 | Group 1 | 20 | 40 |
| Company 4 | Group 2 | 1 | 0 |
| Company 5 | Group 2 | 0 | 38 |
| Company 6 | Group 2 | 5 | 4 |
filter step
Filter out lines that don’t match a filter definition.
{
name: 'filter',
condition: {
column: 'my-column',
value: 42,
operator: 'ne'
}
}
operator is optional, and defaults to eq. Allowed operators are eq, ne,
gt, ge, lt, le, in, nin, matches, notmatches, isnull or notnull.
value can be an arbitrary value depending on the selected operator (e.g a list
when used with the in operator, or null when used with the isnull
operator).
matches and notmatches operators are used to test value against a regular expression.
Conditions can be grouped and nested with logical operators and and or.
{
name: 'filter',
condition: {
and: [
{
column: 'my-column',
value: 42,
operator: 'ge'
},
{
column: 'my-column',
value: 118,
operator: 'le'
},
{
or: [
{
column: 'my-other-column',
value: 'blue',
operator: 'eq'
},
{
column: 'my-other-column',
value: 'red',
operator: 'eq'
}
]
}
]
}
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Relative dates
Date values can be relative to the moment when the query is executed.
This is expressed by using a RelativeDate object instead of the value, of the form:
{
quantity: Number // can be positive (future) or negative (past)
duration: 'year' | 'quarter' | 'month' | 'week' | 'day'
}
This feature is supported by the following backends:
- Mongo 5.0
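For intuition, here is a hypothetical Python helper showing how such an object could resolve to a concrete date at query time (assuming python-dateutil; the backend resolves relative dates natively):

from datetime import datetime
from dateutil.relativedelta import relativedelta

def resolve_relative_date(quantity, duration, now=None):
    # Translate a RelativeDate object into a concrete datetime relative to 'now'.
    now = now or datetime.utcnow()
    unit = {'year': 'years', 'quarter': 'months', 'month': 'months',
            'week': 'weeks', 'day': 'days'}[duration]
    if duration == 'quarter':
        quantity *= 3  # a quarter is three months
    return now + relativedelta(**{unit: quantity})

# e.g. resolve_relative_date(-7, 'day') resolves to seven days in the past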
formula step
Add a computation based on a formula. Usually column names do not need to be
escaped, unless they include whitespaces, in which case you’ll need to use
brackets ‘[]’ (e.g. [myColumn]). Any string escaped with quotes (', ", ''', """)
will be considered a string literal.
{
name: 'formula',
new_column: 'result', // if the column already exists, overwrites it
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Supported operators
The following operators are supported by the formula step (note that a value can be a column name or a literal, such as 42 or foo).
- +: does an addition of two numeric values. See the concatenate step to append strings.
- -: does a subtraction of two numeric values. See the replace step to remove a part of a string.
- *: multiplies two numeric values.
- /: divides a numeric value by another. Divisions by zero will return null.
- %: returns the rest of an integer division. Divisions by zero will return null.
Example 1: Basic usage
Input dataset:
| Label | Value1 | Value2 | Value3 | Value4 |
|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 |
| Label 2 | 1 | 13 | 7 | 3 |
| Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}
Output dataset:
| Label | Value1 | Value2 | Value3 | Value4 | Result |
|---|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 | 2 |
| Label 2 | 1 | 13 | 7 | 3 | -4 |
| Label 3 | 5 | 20 | 5 | 2 | 1 |
Example 2: Column name with whitespaces
Input dataset:
| Label | Value1 | Value2 | Value3 | Value 4 |
|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 |
| Label 2 | 1 | 13 | 7 | 3 |
| Label 3 | 5 | 20 | 5 | 2 |
Step configuration:
{
name: 'formula',
new_column: 'Result',
formula: '(Value1 + Value2) / Value3 - [Value 4] * 2'
}
Output dataset:
| Label | Value1 | Value2 | Value3 | Value 4 | Result |
|---|---|---|---|---|---|
| Label 1 | 10 | 2 | 3 | 1 | 2 |
| Label 2 | 1 | 13 | 7 | 3 | -4 |
| Label 3 | 5 | 20 | 5 | 2 | 1 |
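For intuition, the bracket escaping is similar to pandas’ backtick syntax in DataFrame.eval. A sketch (an illustration, not the actual backend implementation):

import pandas as pd

df = pd.DataFrame({'Value1': [10, 1, 5], 'Value2': [2, 13, 20],
                   'Value3': [3, 7, 5], 'Value 4': [1, 3, 2]})

# Column names containing whitespace are escaped with backticks, like '[]' in the step.
df['Result'] = df.eval('(Value1 + Value2) / Value3 - `Value 4` * 2')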
hierarchy step
Hierarchy for geographical data.
This step dissolves data at every level of the hierarchy, and adds a hierarchy_level column indicating the level of each row (0 being the lowest granularity, i.e. the highest level).
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
This step is supported by the following backends:
- Pandas (python)
Example
Input dataset:
| Country | City | geometry | Population |
|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 |
| Country 2 | City 2 | Polygon | 50_000 |
| Country 2 | City 3 | Polygon | 200_000 |
| Country 1 | City 4 | Polygon | 30_000 |
| Country 2 | City 5 | Polygon | 25_000 |
| Country 1 | City 6 | Polygon | 10_000 |
Step configuration:
{
name: 'hierarchy',
hierarchy: ['Country', 'City'],
include_nulls: false,
}
Output dataset:
| Country | City | geometry | Population | hierarchy_level |
|---|---|---|---|---|
| Country 1 | City 1 | Polygon | 100_000 | 2 |
| Country 2 | City 2 | Polygon | 50_000 | 2 |
| Country 2 | City 3 | Polygon | 200_000 | 2 |
| Country 1 | City 4 | Polygon | 30_000 | 2 |
| Country 2 | City 5 | Polygon | 25_000 | 2 |
| Country 1 | City 6 | Polygon | 10_000 | 2 |
| Country 1 | null | MultiPolygon | null | 1 |
| Country 2 | null | MultiPolygon | null | 1 |
| null | null | MultiPolygon | null | 0 |
ifthenelse step
Creates a new column whose values will depend on a condition expressed on existing columns.
The condition is expressed in the if parameter with a condition object, which
is the same object expected by the condition parameter of the
filter step. Conditions can be grouped and nested with logical
operators and and or.
The then parameter only supports a string, that will be interpreted as a
formula (cf. formula step). If you want it to be interpreted
strictly as a string and not a formula, you must escape the string with quotes
(e.g. ‘“this is a text”’).
if...then...else blocks can be nested as the else parameter
supports either a string that will be interpreted as a formula (cf.
formula step), or a nested if...then...else object.
{
name: 'ifthenelse',
newColumn: '', // the name of the new column to be created
if: { column: '', value: '', operator: 'eq' }, // a condition, same object as in the filter step
then: '', // a string that will be interpreted as a formula
else: '', // a string interpreted as a formula or a nested if...then...else object
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
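For intuition, the nested configuration of the example below can be pictured as nested numpy.where calls (a sketch, not the actual backend implementation):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Label': ['Label 1', 'Label 2', 'Label 3'], 'number': [-2, 2, 0]})

# if number == 0 then 'zero' else (if number < 0 then number * -1 else number)
df['result'] = np.where(df['number'] == 0, 'zero',
                        np.where(df['number'] < 0, df['number'] * -1, df['number']))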
Example
Input dataset:
| Label | number |
|---|---|
| Label 1 | -2 |
| Label 2 | 2 |
| Label 3 | 0 |
Step configuration:
{
name: 'ifthenelse',
newColumn: 'result',
if: { column: 'number', value: 0, operator: 'eq' },
then: '"zero"',
else: {
if: { column: 'number', value: 0, operator: 'lt' },
then: 'number * -1',
else: 'number'
}
}
Output dataset:
| Label | number | result |
|---|---|---|
| Label 1 | -2 | 2 |
| Label 2 | 2 | 2 |
| Label 3 | 0 | zero |
join step
Joins a dataset to the current dataset, i.e. brings columns from the former into
the latter, and matches rows based on column correspondence. It is similar to
a JOIN clause in SQL, or to a VLOOKUP in Excel. The joined dataset is the
result from the query of the right_pipeline.
The join type can be:
- 'left': will keep every row of the current dataset and fill unmatched rows with null values,
- 'inner': will only keep rows that match rows of the joined dataset.
In the on parameter, you must specify 1 or more column couple(s) that will be
compared to determine row correspondence between the 2 datasets. The first
element of a couple is for the current dataset column, and the second for the
corresponding column in the right dataset to be joined. If you specify more than
1 couple, the matching rows will be those that find a correspondence between the
2 datasets for every column couple specified (logical 'AND').
Weaverbird allows you to save pipelines referenced by name in the Vuex store
of the application. You can then call them by their unique names in this step.
{
name: 'join',
right_pipeline: 'somePipelineReference',
type: 'left', // or 'inner'
on: [
['currentDatasetColumn1', 'rightDatasetColumn1'],
['currentDatasetColumn2', 'rightDatasetColumn2'],
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Left join with one column couple as on parameter
Input dataset:
| Label | Value |
|---|---|
| Label 1 | 13 |
| Label 2 | 7 |
| Label 3 | 20 |
| Label 4 | 1 |
| Label 5 | 1 |
| Label 6 | 1 |
rightDataset (saved in the application Vuex store):
| Label | Group |
|---|---|
| Label 1 | Group 1 |
| Label 2 | Group 1 |
| Label 3 | Group 2 |
| Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'left',
on: [['Label', 'Label']]
}
Output dataset:
| Label | Value | Group |
|---|---|---|
| Label 1 | 13 | Group 1 |
| Label 2 | 7 | Group 1 |
| Label 3 | 20 | Group 2 |
| Label 4 | 1 | Group 2 |
| Label 5 | 1 | |
| Label 6 | 1 | |
Example 2: inner join with different column names in the on parameter
Input dataset:
| Label | Value |
|---|---|
| Label 1 | 13 |
| Label 2 | 7 |
| Label 3 | 20 |
| Label 4 | 1 |
| Label 5 | 1 |
| Label 6 | 1 |
rightDataset (saved in the application Vuex store):
| LabelRight | Group |
|---|---|
| Label 1 | Group 1 |
| Label 2 | Group 1 |
| Label 3 | Group 2 |
| Label 4 | Group 2 |
Step configuration:
{
name: 'join',
right_pipeline: 'rightDataset',
type: 'inner',
on: [['Label', 'LabelRight']]
}
Output dataset:
| Label | Value | LabelRight | Group |
|---|---|---|---|
| Label 1 | 13 | Label 1 | Group 1 |
| Label 2 | 7 | Label 2 | Group 1 |
| Label 3 | 20 | Label 3 | Group 2 |
| Label 4 | 1 | Label 4 | Group 2 |
fromdate step
Converts a date column into a string column based on a specified format.
{
name: 'fromdate',
column: 'myDateColumn',
format: '%Y-%m-%d' // for further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 2019-10-06T00:00:00.000Z | 13 |
| Company 1 | 2019-10-07T00:00:00.000Z | 7 |
| Company 1 | 2019-10-08T00:00:00.000Z | 20 |
| Company 2 | 2019-10-06T00:00:00.000Z | 1 |
| Company 2 | 2019-10-07T00:00:00.000Z | 10 |
| Company 2 | 2019-10-08T00:00:00.000Z | 5 |
Step configuration:
{
name: 'fromdate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 06/10/2019 | 13 |
| Company 1 | 07/10/2019 | 7 |
| Company 1 | 08/10/2019 | 20 |
| Company 2 | 06/10/2019 | 1 |
| Company 2 | 07/10/2019 | 10 |
| Company 2 | 08/10/2019 | 5 |
lowercase step
Converts a string column to lowercase.
Note: Mongo’s $toLower operator does not support accents. If you need to
lowercase accented characters with Mongo, use a replacetext step after
lowercase.
{
name: 'lowercase',
column: 'foo',
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| LABEL 1 | Group 1 | 13 |
| LABEL 2 | Group 1 | 7 |
| LABEL 3 | Group 1 | 20 |
Step configuration:
{
name: 'lowercase',
column: 'Label',
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| label 1 | Group 1 | 13 |
| label 2 | Group 1 | 7 |
| label 3 | Group 1 | 20 |
movingaverage step
Compute the moving average based on a value column, a reference column to sort (usually a date column) and a moving window (in number of rows, i.e. data points). If needed, the computation can be performed by group of rows. The computation result is added in a new column.
{
name: 'movingaverage',
valueColumn: 'value',
columnToSort: 'dates',
movingWindow: 12,
groups: ['foo', 'bar'], // optional
newColumnName: 'myNewColumn' // optional, '<originalColumnName>_MOVING_AVG' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
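A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def moving_average(df, value_column, column_to_sort, window, groups=None):
    # Sort, then average over a rolling window of `window` rows (per group if any).
    df = df.sort_values(column_to_sort)
    def rolling_mean(s):
        return s.rolling(window).mean()
    df[f'{value_column}_MOVING_AVG'] = (
        df.groupby(groups)[value_column].transform(rolling_mean)
        if groups else rolling_mean(df[value_column])
    )
    return df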
Example 1: Basic usage
Input dataset:
| DATE | VALUE |
|---|---|
| 2018-01-01 | 75 |
| 2018-01-02 | 80 |
| 2018-01-03 | 82 |
| 2018-01-04 | 83 |
| 2018-01-05 | 80 |
| 2018-01-06 | 86 |
| 2018-01-07 | 79 |
| 2018-01-08 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 2,
}
Output dataset:
| DATE | VALUE | VALUE_MOVING_AVG |
|---|---|---|
| 2018-01-01 | 75 | null |
| 2018-01-02 | 80 | 77.5 |
| 2018-01-03 | 82 | 81 |
| 2018-01-04 | 83 | 82.5 |
| 2018-01-05 | 80 | 81.5 |
| 2018-01-06 | 86 | 83 |
| 2018-01-07 | 79 | 82.5 |
| 2018-01-08 | 76 | 77.5 |
Example 2: with groups and custom newColumnName
Input dataset:
| COUNTRY | DATE | VALUE |
|---|---|---|
| France | 2018-01-01 | 75 |
| France | 2018-01-02 | 80 |
| France | 2018-01-03 | 82 |
| France | 2018-01-04 | 83 |
| France | 2018-01-05 | 80 |
| France | 2018-01-06 | 86 |
| USA | 2018-01-01 | 69 |
| USA | 2018-01-02 | 73 |
| USA | 2018-01-03 | 73 |
| USA | 2018-01-04 | 75 |
| USA | 2018-01-05 | 70 |
| USA | 2018-01-06 | 76 |
Step configuration:
{
name: 'movingaverage',
valueColumn: 'VALUE',
columnToSort: 'DATE',
movingWindow: 3,
groups: ['COUNTRY'],
newColumnName: 'ROLLING_AVERAGE'
}
Output dataset:
| COUNTRY | DATE | VALUE | ROLLING_AVERAGE |
|---|---|---|---|
| France | 2018-01-01 | 75 | null |
| France | 2018-01-02 | 80 | null |
| France | 2018-01-03 | 82 | 79 |
| France | 2018-01-04 | 83 | 81.7 |
| France | 2018-01-05 | 80 | 81.7 |
| France | 2018-01-06 | 86 | 83 |
| USA | 2018-01-01 | 69 | null |
| USA | 2018-01-02 | 73 | null |
| USA | 2018-01-03 | 73 | 71.7 |
| USA | 2018-01-04 | 75 | 73.7 |
| USA | 2018-01-05 | 70 | 72.7 |
| USA | 2018-01-06 | 76 | 73.7 |
percentage step
Compute the percentage of total, i.e. for every row, the value in column divided
by the total (the sum of every value in column). The computation can be performed
by group if specified. The result is written in a new column.
{
name: 'percentage',
column: 'bar',
group: ['foo'], // optional
newColumnName: 'myNewColumn' // optional, '<originalColumnName>_PCT' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
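A minimal pandas sketch of the computation (an illustration, not the actual backend implementation):

import pandas as pd

def percentage(df, column, group=None, new_column_name=None):
    # Each value divided by the total of its column (or of its group).
    total = df.groupby(group)[column].transform('sum') if group else df[column].sum()
    df[new_column_name or f'{column}_PCT'] = df[column] / total
    return df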
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 5 |
| Label 2 | Group 1 | 10 |
| Label 3 | Group 1 | 15 |
| Label 4 | Group 2 | 2 |
| Label 5 | Group 2 | 7 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'percentage',
column: 'Value',
group: ['Group'],
newColumnName: 'Percentage'
}
Output dataset:
| Label | Group | Value | Percentage |
|---|---|---|---|
| Label 1 | Group 1 | 5 | 0.167 |
| Label 2 | Group 1 | 10 | 0.333 |
| Label 3 | Group 1 | 15 | 0.5 |
| Label 4 | Group 2 | 2 | 0.143 |
| Label 5 | Group 2 | 7 | 0.5 |
| Label 6 | Group 2 | 5 | 0.357 |
pivot step
Pivot rows into columns around a given index (expressed as a combination of column(s)).
Values to be used as new column names are found in the column column_to_pivot.
Values to populate new columns are found in the column value_column.
The function used to aggregate data (when several rows are found by index group) must be
among sum, avg, count, min or max.
{
name: 'pivot',
index: ['column_1', 'column_2'],
column_to_pivot: 'column_3',
value_column: 'column_4',
agg_function: 'sum',
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
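For intuition, this step maps closely onto pandas’ pivot_table (a sketch, not the actual backend implementation):

import pandas as pd

def pivot_step(df, index, column_to_pivot, value_column, agg_function):
    # 'avg' is spelled 'mean' in pandas; the other functions share the same name.
    aggfunc = 'mean' if agg_function == 'avg' else agg_function
    pivoted = pd.pivot_table(df, index=index, columns=column_to_pivot,
                             values=value_column, aggfunc=aggfunc)
    return pivoted.reset_index()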
Example:
Input dataset:
| Label | Country | Value |
|---|---|---|
| Label 1 | Country1 | 13 |
| Label 2 | Country1 | 7 |
| Label 3 | Country1 | 20 |
| Label 1 | Country2 | 1 |
| Label 2 | Country2 | 10 |
| Label 3 | Country2 | 5 |
| Label 3 | Country2 | 1 |
Step configuration:
{
name: 'pivot',
index: ['Label'],
column_to_pivot: 'Country',
value_column: 'Value',
agg_function: 'sum',
}
Output dataset:
| Label | Country1 | Country2 |
|---|---|---|
| Label 1 | 13 | 1 |
| Label 2 | 7 | 10 |
| Label 3 | 20 | 6 |
statistics step
Compute statistics of a column.
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'statistics',
column: 'Value',
groupby: [],
statistics: ['average', 'count'],
quantiles: [{label: 'median', nth: 1, order: 2}],
}
Output dataset:
| average | count | median |
|---|---|---|
| 9.33333 | 6 | 8.5 |
rank step
This step allows computing a rank column based on a value column that can be sorted in ascending or descending order. The ranking can be computed by group.
There are 2 ranking methods available, illustrated by the following examples:
- standard: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 5, 5, 7]
- dense: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 3, 3, 4]
(The dense method is basically the same as the standard method, except that the rank always increases by 1 at most.)
{
name: 'rank',
valueCol: 'VALUE', // The value column that will be ordered to determine rank
order: 'desc', // How to order the value column to determine ranking.
// either asc(ending) or desc(ending), 'desc' by default
method: 'standard', // either 'standard' or 'dense', as explained above
groupby: ['foo', 'bar'], // specify columns if you want to group the ranking computation
newColumnName: 'columnA', // specify the new column name to be created (by default '<yourValueColumn>_RANK')
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
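For intuition, both methods map onto pandas’ rank: 'min' matches standard and 'dense' matches dense. A sketch (an illustration, not the actual backend implementation):

import pandas as pd

def rank_step(df, value_col, order='desc', method='standard', groupby=None,
              new_column_name=None):
    # Rank the value column (per group if any), then sort by it.
    target = df.groupby(groupby)[value_col] if groupby else df[value_col]
    ranks = target.rank(method='min' if method == 'standard' else 'dense',
                        ascending=(order == 'asc'))
    df[new_column_name or f'{value_col}_RANK'] = ranks.astype(int)
    return df.sort_values(value_col, ascending=(order == 'asc'))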
Example 1: Basic usage
Input dataset:
| COUNTRY | VALUE |
|---|---|
| FRANCE | 15 |
| FRANCE | 5 |
| FRANCE | 10 |
| FRANCE | 20 |
| FRANCE | 10 |
| FRANCE | 15 |
| USA | 20 |
| USA | 30 |
| USA | 20 |
| USA | 25 |
| USA | 15 |
| USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'desc',
method: 'standard',
}
Output dataset:
| COUNTRY | VALUE | VALUE_RANK |
|---|---|---|
| USA | 30 | 1 |
| USA | 25 | 2 |
| FRANCE | 20 | 3 |
| USA | 20 | 3 |
| USA | 20 | 3 |
| USA | 20 | 3 |
| FRANCE | 15 | 7 |
| FRANCE | 15 | 7 |
| USA | 15 | 7 |
| FRANCE | 10 | 10 |
| FRANCE | 10 | 10 |
| FRANCE | 5 | 12 |
Example 2: With more options
Input dataset:
| COUNTRY | VALUE |
|---|---|
| FRANCE | 15 |
| FRANCE | 5 |
| FRANCE | 10 |
| FRANCE | 20 |
| FRANCE | 10 |
| FRANCE | 15 |
| USA | 20 |
| USA | 30 |
| USA | 20 |
| USA | 25 |
| USA | 15 |
| USA | 20 |
Step configuration:
{
name: 'rank',
valueCol: 'VALUE',
order: 'asc',
method: 'dense',
groupby: ['COUNTRY'],
newColumnName: 'MY_RANK',
}
Output dataset:
| COUNTRY | VALUE | MY_RANK |
|---|---|---|
| FRANCE | 5 | 1 |
| FRANCE | 10 | 2 |
| FRANCE | 10 | 2 |
| FRANCE | 15 | 3 |
| FRANCE | 15 | 3 |
| FRANCE | 20 | 4 |
| USA | 15 | 1 |
| USA | 20 | 2 |
| USA | 20 | 2 |
| USA | 20 | 2 |
| USA | 25 | 3 |
| USA | 30 | 4 |
rename step
Rename one or several columns.
The toRename parameter takes as input a list of 2-elements lists in the form
[‘oldColumnName’, ‘newColumnName’].
{
name: 'rename',
toRename: [
['oldCol1', 'newCol1'], // the first value is the column to be renamed, the second is the new column name
['oldCol2', 'newCol2']
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The oldname and newname parameters are deprecated and are supported only for
retrocompatibility purposes. When this step was first created, only 1 column at
a time could be renamed, and the step was configured via these 2 parameters.
An old-fashioned step with oldname set to ‘old’ and newname set to ‘new’
is equivalent to a new-fashioned step with the toRename parameter set to [['old', 'new']].
So an old-fashioned step looks like this:
{
name: 'rename',
oldname: 'old-column-name',
newname: 'new-column-name'
}
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'rename',
oldname: 'Label',
newname: 'Company'
}
Output dataset:
| Company | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
replace step
Replace one or several values in a column.
A replace step has the following structure:
{
name: 'replace',
search_column: "column_1",
to_replace: [
['foo', 'bar'], // The first value is the one to be replaced, the second is the new value
[42, 0]
]
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | Fr |
| Company 2 | UK |
Step configuration:
{
name: 'replace',
search_column: "COUNTRY",
to_replace: [
['Fr', 'France'],
['UK', 'United Kingdom']
]
}
Output dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | France |
| Company 2 | United Kingdom |
replacetext step
Replace a substring in a column.
A replace-text step has the following structure:
{
name: 'replacetext',
search_column: "column_1",
old_str: 'foo',
new_str: 'bar'
}
This step is supported by the following backends:
- Mongo 5.0
- Pandas (python)
- AWS Athena
- Google Big Query
- MySQL
- PostgreSQL
- Redshift
- Snowflake
Example
Input dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | Fr is boring |
| Company 2 | UK |
Step configuration:
{
name: 'replacetext',
search_column: "COUNTRY",
old_str: 'Fr',
new_str: 'France',
}
Output dataset:
| COMPANY | COUNTRY |
|---|---|
| Company 1 | France is boring |
| Company 2 | UK |
rollup step
Use this step if you need to compute aggregated data at every level of a
hierarchy, specified as a series of columns from top to bottom level.
The output data structure stacks the data of every level of the hierarchy,
specifying for every row the label, level and parent in dedicated columns.
Aggregated rows can be computed using either sum, average, count, count distinct, min, max, first or last.
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'], // Hierarchical columns, from top to bottom level
aggregations: [ // The columns to be aggregated, and how
{
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
columns: ['value1', 'value2']
},
{
newcolumns: ['avg_value1'],
aggfunction: 'avg',
columns: ['value1']
}
],
groupby: ['date'], // optional, if needing to segment the rollup computation by group
labelCol: 'label', // optional, name of the output label column, 'label' by default
levelCol: 'level', // optional, name of the output level column, 'level' by default
childLevelCol: 'child_level', // optional, name of the next output level column, 'child_level' by default
parentLabelCol: 'parent', // optional, name of the output parent column, 'parent' by default
}
This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Deprecation note:
The column and newcolumn (singular) properties of the aggregation
parameter are deprecated and supported only for backward-compatibility purposes.
When this step was first created, only one column at a time could be aggregated
for a given aggregation.
Now columns and newcolumns are lists of columns, allowing the same aggregation
function to be applied to several columns at once.
An old-fashioned step looked like this:
{
name: 'rollup',
hierarchy: ['continent', 'country', 'city'],
aggregations: [
{
newcolumn: 'sum_value1',
aggfunction: 'sum',
column: 'value1'
},
{
newcolumn: 'sum_value2',
aggfunction: 'sum',
column: 'value2'
},
{
newcolumn: 'avg_value1',
aggfunction: 'avg',
column: 'value1'
}
// ...
]
}
Example 1: Basic configuration
Input dataset:
| CITY | COUNTRY | CONTINENT | YEAR | VALUE |
|---|---|---|---|---|
| Paris | France | Europe | 2018 | 10 |
| Bordeaux | France | Europe | 2018 | 5 |
| Barcelona | Spain | Europe | 2018 | 8 |
| Madrid | Spain | Europe | 2018 | 3 |
| Boston | USA | North America | 2018 | 12 |
| New-York | USA | North America | 2018 | 21 |
| Montreal | Canada | North America | 2018 | 10 |
| Ottawa | Canada | North America | 2018 | 7 |
| Paris | France | Europe | 2019 | 13 |
| Bordeaux | France | Europe | 2019 | 8 |
| Barcelona | Spain | Europe | 2019 | 11 |
| Madrid | Spain | Europe | 2019 | 6 |
| Boston | USA | North America | 2019 | 15 |
| New-York | USA | North America | 2019 | 24 |
| Montreal | Canada | North America | 2019 | 10 |
| Ottawa | Canada | North America | 2019 | 13 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE'],
aggfunction: 'sum',
columns: ['VALUE']
},
],
}
Output dataset:
| CITY | COUNTRY | CONTINENT | label | level | child_level | parent | VALUE |
|---|---|---|---|---|---|---|---|
| | | Europe | Europe | CONTINENT | COUNTRY | | 64 |
| | | North America | North America | CONTINENT | COUNTRY | | 112 |
| | France | Europe | France | COUNTRY | CITY | Europe | 36 |
| | Spain | Europe | Spain | COUNTRY | CITY | Europe | 28 |
| | USA | North America | USA | COUNTRY | CITY | North America | 72 |
| | Canada | North America | Canada | COUNTRY | CITY | North America | 40 |
| Paris | France | Europe | Paris | CITY | | France | 23 |
| Bordeaux | France | Europe | Bordeaux | CITY | | France | 13 |
| Barcelona | Spain | Europe | Barcelona | CITY | | Spain | 19 |
| Madrid | Spain | Europe | Madrid | CITY | | Spain | 9 |
| Boston | USA | North America | Boston | CITY | | USA | 27 |
| New-York | USA | North America | New-York | CITY | | USA | 45 |
| Montreal | Canada | North America | Montreal | CITY | | Canada | 20 |
| Ottawa | Canada | North America | Ottawa | CITY | | Canada | 20 |
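To illustrate the stacking logic of this step, here is a minimal pandas sketch on a reduced sample (an illustration of the idea only; it simplifies column handling and is not the Pandas backend's actual implementation):
```python
import pandas as pd

df = pd.DataFrame({
    "CONTINENT": ["Europe", "Europe", "North America"],
    "COUNTRY": ["France", "Spain", "USA"],
    "CITY": ["Paris", "Madrid", "Boston"],
    "VALUE": [10, 3, 12],
})
hierarchy = ["CONTINENT", "COUNTRY", "CITY"]
levels = []
for depth in range(1, len(hierarchy) + 1):
    keys = hierarchy[:depth]
    level = df.groupby(keys, as_index=False)["VALUE"].sum()
    level["label"] = level[keys[-1]]       # the row's own label
    level["level"] = keys[-1]              # which hierarchy level the row belongs to
    if depth > 1:
        level["parent"] = level[keys[-2]]  # label one level up
    levels.append(level)
# stack every level of the hierarchy, as in the output above
result = pd.concat(levels, ignore_index=True)
```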
Example 2: Configuration with optional parameters
Input dataset:
| CITY | COUNTRY | CONTINENT | YEAR | VALUE | COUNT |
|---|---|---|---|---|---|
| Paris | France | Europe | 2018 | 10 | 1 |
| Bordeaux | France | Europe | 2018 | 5 | 1 |
| Barcelona | Spain | Europe | 2018 | 8 | 1 |
| Madrid | Spain | Europe | 2018 | 3 | 1 |
| Boston | USA | North America | 2018 | 12 | 1 |
| New-York | USA | North America | 2018 | 21 | 1 |
| Montreal | Canada | North America | 2018 | 10 | 1 |
| Ottawa | Canada | North America | 2018 | 7 | 1 |
| Paris | France | Europe | 2019 | 13 | 1 |
| Bordeaux | France | Europe | 2019 | 8 | 1 |
| Barcelona | Spain | Europe | 2019 | 11 | 1 |
| Madrid | Spain | Europe | 2019 | 6 | 1 |
| Boston | USA | North America | 2019 | 15 | 1 |
| New-York | USA | North America | 2019 | 24 | 1 |
| Montreal | Canada | North America | 2019 | 10 | 1 |
| Ottawa | Canada | North America | 2019 | 13 | 1 |
Step configuration:
{
name: 'rollup',
hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
aggregations: [
{
newcolumns: ['VALUE-sum', 'COUNT'],
aggfunction: 'sum',
columns: ['VALUE', 'COUNT']
},
{
newcolumns: ['VALUE-avg'],
aggfunction: 'avg',
columns: ['VALUE']
}
],
groupby: ['YEAR'],
labelCol: 'MY_LABEL',
levelCol: 'MY_LEVEL',
childLevelCol: 'MY_CHILD_LEVEL',
parentLabelCol: 'MY_PARENT',
}
Output dataset:
| CITY | COUNTRY | CONTINENT | YEAR | MY_LABEL | MY_LEVEL | MY_CHILD_LEVEL | MY_PARENT | VALUE-sum | VALUE-avg | COUNT |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Europe | 2018 | Europe | CONTINENT | COUNTRY | | 26 | 6.5 | 4 |
| | | North America | 2018 | North America | CONTINENT | COUNTRY | | 50 | 12.5 | 4 |
| | France | Europe | 2018 | France | COUNTRY | CITY | Europe | 15 | 7.5 | 2 |
| | Spain | Europe | 2018 | Spain | COUNTRY | CITY | Europe | 11 | 5.5 | 2 |
| | USA | North America | 2018 | USA | COUNTRY | CITY | North America | 33 | 16.5 | 2 |
| | Canada | North America | 2018 | Canada | COUNTRY | CITY | North America | 17 | 8.5 | 2 |
| Paris | France | Europe | 2018 | Paris | CITY | | France | 10 | 10 | 1 |
| Bordeaux | France | Europe | 2018 | Bordeaux | CITY | | France | 5 | 5 | 1 |
| Barcelona | Spain | Europe | 2018 | Barcelona | CITY | | Spain | 8 | 8 | 1 |
| Madrid | Spain | Europe | 2018 | Madrid | CITY | | Spain | 3 | 3 | 1 |
| Boston | USA | North America | 2018 | Boston | CITY | | USA | 12 | 12 | 1 |
| New-York | USA | North America | 2018 | New-York | CITY | | USA | 21 | 21 | 1 |
| Montreal | Canada | North America | 2018 | Montreal | CITY | | Canada | 10 | 10 | 1 |
| Ottawa | Canada | North America | 2018 | Ottawa | CITY | | Canada | 7 | 7 | 1 |
| | | Europe | 2019 | Europe | CONTINENT | COUNTRY | | 38 | 9.5 | 4 |
| | | North America | 2019 | North America | CONTINENT | COUNTRY | | 62 | 15.5 | 4 |
| | France | Europe | 2019 | France | COUNTRY | CITY | Europe | 21 | 10.5 | 2 |
| | Spain | Europe | 2019 | Spain | COUNTRY | CITY | Europe | 17 | 8.5 | 2 |
| | USA | North America | 2019 | USA | COUNTRY | CITY | North America | 39 | 19.5 | 2 |
| | Canada | North America | 2019 | Canada | COUNTRY | CITY | North America | 23 | 11.5 | 2 |
| Paris | France | Europe | 2019 | Paris | CITY | | France | 13 | 13 | 1 |
| Bordeaux | France | Europe | 2019 | Bordeaux | CITY | | France | 8 | 8 | 1 |
| Barcelona | Spain | Europe | 2019 | Barcelona | CITY | | Spain | 11 | 11 | 1 |
| Madrid | Spain | Europe | 2019 | Madrid | CITY | | Spain | 6 | 6 | 1 |
| Boston | USA | North America | 2019 | Boston | CITY | | USA | 15 | 15 | 1 |
| New-York | USA | North America | 2019 | New-York | CITY | | USA | 24 | 24 | 1 |
| Montreal | Canada | North America | 2019 | Montreal | CITY | | Canada | 10 | 10 | 1 |
| Ottawa | Canada | North America | 2019 | Ottawa | CITY | | Canada | 13 | 13 | 1 |
select step
Select columns. By default, every column of the input domain is kept. If a
select step is used, only the selected columns are kept in the output.
{
name: 'select',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| Company 1 | Group 1 | 13 | Company 1 - Group 1 |
| Company 2 | Group 1 | 7 | Company 2 - Group 1 |
| Company 3 | Group 1 | 20 | Company 3 - Group 1 |
| Company 4 | Group 2 | 1 | Company 4 - Group 2 |
| Company 5 | Group 2 | 10 | Company 5 - Group 2 |
| Company 6 | Group 2 | 5 | Company 6 - Group 2 |
Step configuration:
{
name: 'select',
columns: ['Value', 'Label']
}
Output dataset:
| Value | Label |
|---|---|
| 13 | Company 1 - Group 1 |
| 7 | Company 2 - Group 1 |
| 20 | Company 3 - Group 1 |
| 1 | Company 4 - Group 2 |
| 10 | Company 5 - Group 2 |
| 5 | Company 6 - Group 2 |
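The pandas equivalent is a plain column selection (a sketch on the example data, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Company": ["Company 1"], "Group": ["Group 1"],
                   "Value": [13], "Label": ["Company 1 - Group 1"]})
df = df[["Value", "Label"]]  # keep only the selected columns
```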
sort step
Sort values in one or several columns. Order can be either 'asc' or 'desc'.
When sorting on several columns, the order in which they are listed in columns matters.
{
name: 'sort',
columns: [{column: 'foo', order: 'asc'}, {column: 'bar', order: 'desc'}],
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'sort',
columns: [{ column: 'Group', order: 'asc'}, {column: 'Value', order: 'desc' }]
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
| Label 4 | Group 2 | 1 |
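A minimal pandas sketch of this multi-column sort, mirroring the configuration above (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 4"],
                   "Group": ["Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 1]})
# one ascending/descending flag per sort column, in the same order
df = df.sort_values(by=["Group", "Value"], ascending=[True, False])
```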
split step
Split a string column into several columns based on a delimiter.
{
name: 'split',
column: 'foo', // the column to split
delimiter: ' - ', // the delimiter can be a string of any length
number_cols_to_keep: 3, // the number of columns to keep resulting from the
// split (starting from first chunk)
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1
Input dataset:
| Label | Value |
|---|---|
| Label 1 - Group 1 - France | 13 |
| Label 2 - Group 1 - Spain | 7 |
| Label 3 - Group 1 - USA | 20 |
| Label 4 - Group 2 - France | 1 |
| Label 5 - Group 2 - Spain | 10 |
| Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 3,
}
Output dataset:
| Label_1 | Label_2 | Label_3 | Value |
|---|---|---|---|
| Label 1 | Group 1 | France | 13 |
| Label 2 | Group 1 | Spain | 7 |
| Label 3 | Group 1 | USA | 20 |
| Label 4 | Group 2 | France | 1 |
| Label 5 | Group 2 | Spain | 10 |
| Label 6 | Group 2 | USA | 5 |
Example 2: keeping fewer columns
Input dataset:
| Label | Value |
|---|---|
| Label 1 - Group 1 - France | 13 |
| Label 2 - Group 1 - Spain | 7 |
| Label 3 - Group 1 - USA | 20 |
| Label 4 - Group 2 - France | 1 |
| Label 5 - Group 2 - Spain | 10 |
| Label 6 - Group 2 - USA | 5 |
Step configuration:
{
name: 'split',
column: 'Label',
delimiter: ' - ',
number_cols_to_keep: 2,
}
Output dataset:
| Label_1 | Label_2 | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
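A minimal pandas sketch of the same split-and-keep logic (an illustration of Example 2 on a reduced sample, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1 - Group 1 - France",
                             "Label 2 - Group 1 - Spain"],
                   "Value": [13, 7]})
number_cols_to_keep = 2
# split on the delimiter, then keep only the first chunks
parts = df["Label"].str.split(" - ", expand=True).iloc[:, :number_cols_to_keep]
parts.columns = [f"Label_{i + 1}" for i in range(parts.shape[1])]
df = pd.concat([parts, df.drop(columns="Label")], axis=1)
```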
simplify step
Simplifies geographical data.
When simplifying your data, every point that is closer than a specific distance to the previous one is suppressed. This step can be useful if you have a very precise shape for a country (such as one-meter precision), but want to quickly draw a map chart. In that case, you may want to simplify your data.
After simplification, no point will be closer than tolerance to the previous one. The unit of tolerance depends on the data's projection; for CRS projections it is generally expressed in meters. For more details, see the
GeoPandas documentation.
Step configuration:
{
name: 'simplify',
tolerance: 1.0,
}
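Since the step description points to GeoPandas, a minimal GeoPandas sketch of the idea might look like this (the input file name is hypothetical, and this is an illustration, not the step's actual implementation):
```python
import geopandas as gpd

# 'regions.geojson' is a hypothetical input file
gdf = gpd.read_file("regions.geojson")
# simplify each geometry; tolerance is expressed in the units of the layer's CRS
gdf["geometry"] = gdf.geometry.simplify(tolerance=1.0)
```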
substring step
Extract a substring from a string column. The substring begins at index
start_index (1-based) and stops at end_index (inclusive).
You can specify negative indexes; in that case, indexing starts from the end
of the string (with -1 being the last index of the string).
Please refer to the examples below for illustration.
Neither start_index nor end_index can be equal to 0.
{
name: 'substring',
column: 'foo',
start_index: 1, // 1 = first character
end_index: -1, // -1 = last character
newColumnName: 'myNewColumn' // optional, <originalColumnName>_SUBSTR by default
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: positive start_index and end_index
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: 1,
end_index: 4,
}
Output dataset:
| Label | Value | Label_SUBSTR |
|---|---|---|
| foo | 13 | foo |
| overflow | 7 | over |
| some_text | 20 | some |
| a_word | 1 | a_wo |
| toucan | 10 | touc |
| toco | 5 | toco |
Example 2: start_index is positive and end_index is negative
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: 2,
end_index: -2,
newColumnName: 'short_label',
}
Output dataset:
| Label | Value | short_label |
|---|---|---|
| foo | 13 | o |
| overflow | 7 | verflo |
| some_text | 20 | ome_tex |
| a_word | 1 | _wor |
| toucan | 10 | ouca |
| toco | 5 | oc |
Example 3: start_index and end_index are negative
Input dataset:
| Label | Value |
|---|---|
| foo | 13 |
| overflow | 7 |
| some_text | 20 |
| a_word | 1 |
| toucan | 10 |
| toco | 5 |
Step configuration:
{
name: 'substring',
column: 'Label',
start_index: -3,
end_index: -1,
}
Output dataset:
| Label | Value | Label_SUBSTR |
|---|---|---|
| foo | 13 | foo |
| overflow | 7 | low |
| some_text | 20 | ext |
| a_word | 1 | ord |
| toucan | 10 | can |
| toco | 5 | oco |
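The 1-based, inclusive bounds translate to Python slicing as in this minimal pandas sketch (the substring helper is a hypothetical name introduced for illustration, not the backend implementation):
```python
import pandas as pd

def substring(col: pd.Series, start_index: int, end_index: int) -> pd.Series:
    # translate the step's 1-based, inclusive bounds into Python slicing
    start = start_index - 1 if start_index > 0 else start_index
    stop = None if end_index == -1 else (end_index if end_index > 0 else end_index + 1)
    return col.str.slice(start, stop)

df = pd.DataFrame({"Label": ["foo", "overflow", "toucan"]})
df["short_label"] = substring(df["Label"], 2, -2)  # 'o', 'verflo', 'ouca'
```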
text step
Use this step to add a text column where every value will be equal to the
specified text.
{
name: 'text',
new_column: 'new', // if the column already exists, overwrites it
text: 'some text'
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Label | Value1 |
|---|---|
| Label 1 | 10 |
| Label 2 | 1 |
| Label 3 | 5 |
Step configuration:
{
name: 'text',
new_column: 'KPI',
text: 'Sales'
}
Output dataset:
| Label | Value1 | KPI |
|---|---|---|
| Label 1 | 10 | Sales |
| Label 2 | 1 | Sales |
| Label 3 | 5 | Sales |
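In pandas terms this is a simple broadcast assignment (a sketch, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2"], "Value1": [10, 1]})
df["KPI"] = "Sales"  # broadcast the constant to every row
```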
todate step
Converts a string column into a date column based on a specified format.
{
name: 'todate',
column: 'myTextColumn',
format: '%Y-%m-%d' // Optional. If undefined, the backend tries to infer the date format.
// For further details on supported format specifiers,
// see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
// Note: custom format is not supported in Mongo versions older than 4.0
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example
Input dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 06/10/2019 | 13 |
| Company 1 | 07/10/2019 | 7 |
| Company 1 | 08/10/2019 | 20 |
| Company 2 | 06/10/2019 | 1 |
| Company 2 | 07/10/2019 | 10 |
| Company 2 | 08/10/2019 | 5 |
Step configuration:
{
name: 'todate',
column: 'Date',
format: '%d/%m/%Y'
}
Output dataset:
| Company | Date | Value |
|---|---|---|
| Company 1 | 2019-10-06T00:00:00.000Z | 13 |
| Company 1 | 2019-10-07T00:00:00.000Z | 7 |
| Company 1 | 2019-10-08T00:00:00.000Z | 20 |
| Company 2 | 2019-10-06T00:00:00.000Z | 1 |
| Company 2 | 2019-10-07T00:00:00.000Z | 10 |
| Company 2 | 2019-10-08T00:00:00.000Z | 5 |
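A minimal pandas sketch of the same conversion (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Date": ["06/10/2019", "07/10/2019"]})
# with format omitted, pandas would try to infer it, like the step does
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
```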
top step
Return the top N rows by group if groups is specified, or over the full dataset otherwise.
{
name: 'top',
groups: ['foo'],
rank_on: 'bar',
sort: 'desc', // or 'asc'
limit: 10
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: top without groups, ascending order
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
rank_on: 'Value',
sort: 'asc',
limit: 3
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 4 | Group 2 | 1 |
| Label 6 | Group 2 | 5 |
| Label 2 | Group 1 | 7 |
Example 2: top with groups, descending order
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 4 | Group 2 | 1 |
| Label 5 | Group 2 | 10 |
| Label 6 | Group 2 | 5 |
Step configuration:
{
name: 'top',
groups: ['Group'],
rank_on: 'Value',
sort: 'desc',
limit: 1
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| Label 3 | Group 1 | 20 |
| Label 5 | Group 2 | 10 |
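Both variants can be sketched in pandas as follows (an illustration of the two examples above, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 3", "Label 4"],
                   "Group": ["Group 1", "Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 20, 1]})
# without groups: the 3 lowest values (sort: 'asc', limit: 3)
top_asc = df.nsmallest(3, "Value")
# with groups: the best row per group (sort: 'desc', limit: 1)
top_by_group = (df.sort_values("Value", ascending=False)
                  .groupby("Group", sort=False)
                  .head(1))
```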
totals step
Append “total” rows to the dataset for specified dimensions. Computed rows result from an aggregation (either sum, average, count, count distinct, min, max, first or last).
{
name: 'totals',
// an array of objects: { totalColumn: <column to add total rows in>, totalRowsLabel: <the label of the added total rows>}
totalDimensions: [
{ totalColumn: 'foo', totalRowsLabel: 'Total foos' },
{ totalColumn: 'bar', totalRowsLabel: 'Total bars' }
],
aggregations: [ // The columns to be aggregated, and how
{
columns: ['value1', 'value2'],
aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
},
{
columns: ['value1'],
aggfunction: 'avg',
newcolumns: ['avg_value1'],
}
],
groups: ['someDimension'] // specify columns if the totals computation needs to be scoped by group of rows
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: basic usage
Input dataset:
| COUNTRY | PRODUCT | YEAR | VALUE |
|---|---|---|---|
| France | product A | 2019 | 5 |
| USA | product A | 2019 | 10 |
| France | product B | 2019 | 10 |
| USA | product B | 2019 | 15 |
| France | product A | 2020 | 20 |
| USA | product A | 2020 | 20 |
| France | product B | 2020 | 30 |
| USA | product B | 2020 | 25 |
Step configuration:
{
name: 'totals',
totalDimensions: [{ totalColumn: 'COUNTRY', totalRowsLabel: 'All countries' }],
aggregations: [
{
columns: ['VALUE'],
aggfunction: 'sum',
newcolumns: ['VALUE'],
}
]
}
Output dataset:
| COUNTRY | PRODUCT | YEAR | VALUE |
|---|---|---|---|
| France | product A | 2019 | 5 |
| USA | product A | 2019 | 10 |
| France | product B | 2019 | 10 |
| USA | product B | 2019 | 15 |
| France | product A | 2020 | 20 |
| USA | product A | 2020 | 20 |
| France | product B | 2020 | 30 |
| USA | product B | 2020 | 25 |
| All countries | null | null | 135 |
Example 2: With several totals and groups
Input dataset:
| COUNTRY | PRODUCT | YEAR | VALUE_1 | VALUE_2 |
|---|---|---|---|---|
| France | product A | 2019 | 5 | 50 |
| USA | product A | 2019 | 10 | 100 |
| France | product B | 2019 | 10 | 100 |
| USA | product B | 2019 | 15 | 150 |
| France | product A | 2020 | 20 | 200 |
| USA | product A | 2020 | 20 | 200 |
| France | product B | 2020 | 30 | 300 |
| USA | product B | 2020 | 25 | 250 |
Step configuration:
{
name: 'totals',
totalDimensions: [
{totalColumn: 'COUNTRY', totalRowsLabel: 'All countries'},
{totalColumn: 'PRODUCT', totalRowsLabel: 'All products'}
],
aggregations: [
{
columns: ['VALUE_1', 'VALUE_2'],
aggfunction: 'sum',
newcolumns: ['VALUE_1-sum', 'VALUE_2'],
},
{
columns: ['VALUE_1'],
aggfunction: 'avg',
newcolumns: ['VALUE_1-avg'],
}
],
groups: ['YEAR']
}
Output dataset:
| COUNTRY | PRODUCT | YEAR | VALUE_2 | VALUE_1-sum | VALUE_1-avg |
|---|---|---|---|---|---|
| France | product A | 2019 | 50 | 5 | 5 |
| USA | product A | 2019 | 100 | 10 | 10 |
| France | product B | 2019 | 100 | 10 | 10 |
| USA | product B | 2019 | 150 | 15 | 15 |
| France | product A | 2020 | 200 | 20 | 20 |
| USA | product A | 2020 | 200 | 20 | 20 |
| France | product B | 2020 | 300 | 30 | 30 |
| USA | product B | 2020 | 250 | 25 | 25 |
| USA | All products | 2020 | 450 | 45 | 22.5 |
| France | All products | 2020 | 500 | 50 | 25 |
| USA | All products | 2019 | 250 | 25 | 12.5 |
| France | All products | 2019 | 150 | 15 | 7.5 |
| All countries | product B | 2020 | 550 | 55 | 27.5 |
| All countries | product A | 2020 | 400 | 40 | 20 |
| All countries | product B | 2019 | 250 | 25 | 12.5 |
| All countries | product A | 2019 | 150 | 15 | 7.5 |
| All countries | All products | 2020 | 950 | 95 | 23.75 |
| All countries | All products | 2019 | 400 | 40 | 10 |
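The single-dimension case of Example 1 can be sketched in pandas like this (an illustration only; the multi-dimension and grouped cases add more aggregate rows in the same spirit, and this is not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"COUNTRY": ["France", "USA", "France", "USA"],
                   "PRODUCT": ["product A"] * 2 + ["product B"] * 2,
                   "YEAR": [2019] * 4,
                   "VALUE": [5, 10, 10, 15]})
# one total row over the whole dataset for the COUNTRY dimension;
# columns not referenced in the step (PRODUCT, YEAR) come out as NaN
total = pd.DataFrame([{"COUNTRY": "All countries", "VALUE": df["VALUE"].sum()}])
result = pd.concat([df, total], ignore_index=True)
```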
trim step
Trim leading and trailing spaces in one or several columns.
{
name: 'trim',
columns: ['my-column', 'some-other-column']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Pandas (python)
This step is unsupported by the following backends:
- Mongo 3.6
Example
Input dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| ' Company 1 ' | Group 1 | 13 | Company 1 - Group 1 |
| ' Company 2 ' | Group 1 | 7 | Company 2 - Group 1 |
Step configuration:
{
name: 'trim',
columns: ['Company']
}
Output dataset:
| Company | Group | Value | Label |
|---|---|---|---|
| 'Company 1' | Group 1 | 13 | Company 1 - Group 1 |
| 'Company 2' | Group 1 | 7 | Company 2 - Group 1 |
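The pandas equivalent is a straightforward strip (a sketch, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Company": [" Company 1 ", " Company 2 "]})
df["Company"] = df["Company"].str.strip()  # removes leading and trailing spaces
```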
unpivot step
Unpivot a list of columns to rows.
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'], // columns to keep fixed around which to unpivot columns
unpivot: ['NB_CLIENTS', 'REVENUES'], // columns to unpivot
unpivot_column_name: 'KPI', // name of the new dimension column created after unpivot
value_column_name: 'VALUE', // name of the new value column created after unpivot
dropna: true // whether rows with null values should be discarded (true) or kept (false)
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: with dropna parameter set to true
Input dataset:
| COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
|---|---|---|---|
| Company 1 | France | 7 | 10 |
| Company 2 | France | 2 | |
| Company 1 | USA | 12 | 6 |
| Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: true
}
Output dataset:
| COMPANY | COUNTRY | KPI | VALUE |
|---|---|---|---|
| Company 1 | France | NB_CLIENTS | 7 |
| Company 1 | France | REVENUES | 10 |
| Company 2 | France | NB_CLIENTS | 2 |
| Company 1 | USA | NB_CLIENTS | 12 |
| Company 1 | USA | REVENUES | 6 |
| Company 2 | USA | NB_CLIENTS | 1 |
| Company 2 | USA | REVENUES | 3 |
Example 2: with dropna parameter set to false
Input dataset:
| COMPANY | COUNTRY | NB_CLIENTS | REVENUES |
|---|---|---|---|
| Company 1 | France | 7 | 10 |
| Company 2 | France | 2 | |
| Company 1 | USA | 12 | 6 |
| Company 2 | USA | 1 | 3 |
Step configuration:
{
name: 'unpivot',
keep: ['COMPANY', 'COUNTRY'],
unpivot: ['NB_CLIENTS', 'REVENUES'],
unpivot_column_name: 'KPI',
value_column_name: 'VALUE',
dropna: false
}
Output dataset:
| COMPANY | COUNTRY | KPI | VALUE |
|---|---|---|---|
| Company 1 | France | NB_CLIENTS | 7 |
| Company 1 | France | REVENUES | 10 |
| Company 2 | France | NB_CLIENTS | 2 |
| Company 2 | France | REVENUES | |
| Company 1 | USA | NB_CLIENTS | 12 |
| Company 1 | USA | REVENUES | 6 |
| Company 2 | USA | NB_CLIENTS | 1 |
| Company 2 | USA | REVENUES | 3 |
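This step maps naturally onto a pandas melt, as in this minimal sketch of the two examples (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"COMPANY": ["Company 1", "Company 2"],
                   "COUNTRY": ["France", "France"],
                   "NB_CLIENTS": [7, 2],
                   "REVENUES": [10, None]})
melted = df.melt(id_vars=["COMPANY", "COUNTRY"],
                 value_vars=["NB_CLIENTS", "REVENUES"],
                 var_name="KPI", value_name="VALUE")
melted = melted.dropna(subset=["VALUE"])  # only when dropna is true
```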
uppercase step
Converts a string column to uppercase.
Note: Mongo's $toUpper operator does not support accented characters. If the
strings you need to uppercase with Mongo contain accents, use a replacetext
step after the uppercase step.
{
name: 'uppercase',
column: 'foo',
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
Step configuration:
{
name: 'uppercase',
column: 'Label',
}
Output dataset:
| Label | Group | Value |
|---|---|---|
| LABEL 1 | Group 1 | 13 |
| LABEL 2 | Group 1 | 7 |
| LABEL 3 | Group 1 | 20 |
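With the Pandas backend the operation amounts to (a sketch, not the backend implementation; Python's str.upper handles accented characters natively):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2"]})
df["Label"] = df["Label"].str.upper()
```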
uniquegroups step
Get the unique groups of values from one or several columns.
{
name: 'uniquegroups',
on: ['foo', 'bar']
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example:
Input dataset:
| Label | Group | Value |
|---|---|---|
| Label 1 | Group 1 | 13 |
| Label 2 | Group 1 | 7 |
| Label 3 | Group 1 | 20 |
| Label 1 | Group 2 | 1 |
| Label 2 | Group 1 | 2 |
| Label 3 | Group 1 | 3 |
Step configuration:
{
name: 'uniquegroups',
on: ['Label', 'Group'],
}
Output dataset:
| Label | Group |
|---|---|
| Label 1 | Group 1 |
| Label 1 | Group 2 |
| Label 2 | Group 1 |
| Label 3 | Group 1 |
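A minimal pandas sketch of the same deduplication (an illustration, not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 1"],
                   "Group": ["Group 1", "Group 1", "Group 2"],
                   "Value": [13, 7, 1]})
unique_groups = df[["Label", "Group"]].drop_duplicates()
```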
waterfall step
This step generates a data structure useful for building waterfall charts.
It breaks down the variation between two values (usually between two dates)
across entities. Entities are found in the labelsColumn, and can optionally
be regrouped under common parents found in the parentsColumn for drill-down
purposes.
{
name: 'waterfall',
valueColumn: 'VALUE', // the column of values
milestonesColumn: 'DATE', // the column where the starting and ending block labels
// are to be found (usually dates)
start: '2019', // the starting block label, to be found in `milestonesColumn`
end: '2020', // the ending block label, to be found in `milestonesColumn`
labelsColumn: 'PRODUCT', // the column of labels for the variations breakdown
groupby: ['COUNTRY'], // specify columns if you want to group the waterfall computation
sortBy: 'value', // either 'value' (default) or 'label' depending on whether to
// sort waterfall blocks by their value or label
order: 'desc', // either 'asc' or 'desc' (default) depending on whether to sort
// in ascending or descending order
}
*This step is supported by the following backends:
- Mongo 5.0
- Mongo 4.2
- Mongo 4.0
- Mongo 3.6
- Pandas (python)
Example 1: Basic usage
Input dataset:
| city | year | revenue |
|---|---|---|
| Bordeaux | 2019 | 135 |
| Boston | 2019 | 275 |
| New-York | 2019 | 115 |
| Paris | 2019 | 450 |
| Bordeaux | 2018 | 98 |
| Boston | 2018 | 245 |
| New-York | 2018 | 103 |
| Paris | 2018 | 385 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
sortBy: 'value',
order: 'desc',
}
Output dataset:
| LABEL_waterfall | TYPE_waterfall | revenue |
|---|---|---|
| 2018 | null | 831 |
| Paris | parent | 65 |
| Bordeaux | parent | 37 |
| Boston | parent | 30 |
| New-York | parent | 12 |
| 2019 | null | 975 |
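The core of this computation can be sketched in pandas as follows (an illustration on a reduced sample, assuming string years; it computes only the per-entity variations and milestone totals and leaves out the final stacking into typed rows, so it is not the backend implementation):
```python
import pandas as pd

df = pd.DataFrame({"city": ["Bordeaux", "Paris", "Bordeaux", "Paris"],
                   "year": ["2019", "2019", "2018", "2018"],
                   "revenue": [135, 450, 98, 385]})
start, end = "2018", "2019"
by_city = df.pivot_table(index="city", columns="year",
                         values="revenue", aggfunc="sum")
# per-entity variation between the two milestones, sorted by value descending
variation = (by_city[end] - by_city[start]).sort_values(ascending=False)
start_total, end_total = by_city[start].sum(), by_city[end].sum()
```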
Example 2: With more options
Input dataset:
| city | country | product | year | revenue |
|---|---|---|---|---|
| Bordeaux | France | product1 | 2019 | 65 |
| Bordeaux | France | product2 | 2019 | 70 |
| Paris | France | product1 | 2019 | 210 |
| Paris | France | product2 | 2019 | 240 |
| Boston | USA | product1 | 2019 | 130 |
| Boston | USA | product2 | 2019 | 145 |
| New-York | USA | product1 | 2019 | 55 |
| New-York | USA | product2 | 2019 | 60 |
| Bordeaux | France | product1 | 2018 | 38 |
| Bordeaux | France | product2 | 2018 | 60 |
| Paris | France | product1 | 2018 | 175 |
| Paris | France | product2 | 2018 | 210 |
| Boston | USA | product1 | 2018 | 95 |
| Boston | USA | product2 | 2018 | 150 |
| New-York | USA | product1 | 2018 | 50 |
| New-York | USA | product2 | 2018 | 53 |
Step configuration:
{
name: 'waterfall',
valueColumn: 'revenue',
milestonesColumn: 'year',
start: '2018',
end: '2019',
labelsColumn: 'city',
parentsColumn: 'country',
groupby: ['product'],
sortBy: 'label',
order: 'asc',
}
Output dataset:
| LABEL_waterfall | GROUP_waterfall | TYPE_waterfall | product | revenue |
|---|---|---|---|---|
| 2018 | 2018 | null | product1 | 358 |
| 2018 | 2018 | null | product2 | 473 |
| Bordeaux | France | child | product1 | 27 |
| Bordeaux | France | child | product2 | 10 |
| Boston | USA | child | product1 | 35 |
| Boston | USA | child | product2 | -5 |
| France | France | parent | product2 | 40 |
| France | France | parent | product1 | 62 |
| New-York | USA | child | product1 | 5 |
| New-York | USA | child | product2 | 7 |
| Paris | France | child | product1 | 35 |
| Paris | France | child | product2 | 30 |
| USA | USA | parent | product2 | 2 |
| USA | USA | parent | product1 | 40 |
| 2019 | 2019 | null | product2 | 515 |
| 2019 | 2019 | null | product1 | 460 |