Steps definition

This document describes the steps that can be used in our pipelines. A step is the unit of transformation generated by a user interaction with the visual query builder.

Unless explicitly specified, all steps are supported by all backends.

absolutevalue step

This step is meant to compute the absolute value of a given input column.

{
    name: 'absolutevalue',
    column: 'my-column',
    new_column: 'my-new-column'
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Value
Company 1 -33
Company 2 0
Company 3 10

Step configuration:

{
  name: 'absolutevalue',
  column: 'Value',
  new_column: 'My-absolute-value'
}

Output dataset:

Company Value My-absolute-value
Company 1 -33 33
Company 2 0 0
Company 3 10 10
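
For reference, the transformation can be sketched with pandas (illustrative only, not WeaverBird's actual backend code):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Company 1", "Company 2", "Company 3"],
    "Value": [-33, 0, 10],
})
# 'column' is passed through abs() and written to 'new_column'
df["My-absolute-value"] = df["Value"].abs()
```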

addmissingdates step

Add missing dates as new rows in a dates column. Exhaustive dates will range between the minimum and maximum dates found in the dataset (or in each group if a group by logic is applied - see below).

Added rows will be set to null in columns not referenced in the step configuration.

Make sure to use a group by logic if you want to add missing dates in independent groups of rows (e.g. you may need to add missing rows for every country found in a “COUNTRY” column). Also ensure that every date is unique in every group of rows at the specified granularity, otherwise you may get inconsistent results. You can specify “group by” columns in the groups parameter.

An addmissingdates step has the following structure:

{
  name: 'addmissingdates',
  datesColumn: 'DATE', // the dates column with missing rows
  datesGranularity: 'day', // the granularity of the date column (day, month or year)
  groups: ['COUNTRY'] // (optional) columns used for group by logic if necessary
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: day granularity without groups

Input dataset:

DATE VALUE
“2018-01-01T00:00:00.000Z” 75
“2018-01-02T00:00:00.000Z” 80
“2018-01-03T00:00:00.000Z” 82
“2018-01-04T00:00:00.000Z” 83
“2018-01-05T00:00:00.000Z” 80
“2018-01-07T00:00:00.000Z” 86
“2018-01-08T00:00:00.000Z” 79
“2018-01-09T00:00:00.000Z” 76
“2018-01-10T00:00:00.000Z” 79
“2018-01-11T00:00:00.000Z” 75

Here the day “2018-01-06” is missing.

Step configuration:

{
  name: 'addmissingdates',
  datesColumn: 'DATE',
  datesGranularity: 'day'
}

Output dataset:

DATE VALUE
“2018-01-01T00:00:00.000Z” 75
“2018-01-02T00:00:00.000Z” 80
“2018-01-03T00:00:00.000Z” 82
“2018-01-04T00:00:00.000Z” 83
“2018-01-05T00:00:00.000Z” 80
“2018-01-06T00:00:00.000Z”  
“2018-01-07T00:00:00.000Z” 86
“2018-01-08T00:00:00.000Z” 79
“2018-01-09T00:00:00.000Z” 76
“2018-01-10T00:00:00.000Z” 79
“2018-01-11T00:00:00.000Z” 75
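
Conceptually, at day granularity and without groups, this amounts to reindexing on an exhaustive date range. A minimal pandas sketch (not the actual backend implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03",
                            "2018-01-04", "2018-01-05", "2018-01-07"]),
    "VALUE": [75, 80, 82, 83, 80, 86],
})
# Build the exhaustive daily range between min and max dates; rows added
# this way get NaN (null) in every column not referenced in the step
full_range = pd.date_range(df["DATE"].min(), df["DATE"].max(), freq="D")
out = (df.set_index("DATE")
         .reindex(full_range)
         .rename_axis("DATE")
         .reset_index())
```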

Example 2: day granularity with groups

Input dataset:

COUNTRY DATE VALUE
France “2018-01-01T00:00:00.000Z” 75
France “2018-01-02T00:00:00.000Z” 80
France “2018-01-03T00:00:00.000Z” 82
France “2018-01-04T00:00:00.000Z” 83
France “2018-01-05T00:00:00.000Z” 80
France “2018-01-07T00:00:00.000Z” 86
France “2018-01-08T00:00:00.000Z” 79
France “2018-01-09T00:00:00.000Z” 76
France “2018-01-10T00:00:00.000Z” 79
France “2018-01-11T00:00:00.000Z” 85
USA “2018-01-01T00:00:00.000Z” 69
USA “2018-01-02T00:00:00.000Z” 73
USA “2018-01-03T00:00:00.000Z” 73
USA “2018-01-05T00:00:00.000Z” 75
USA “2018-01-06T00:00:00.000Z” 70
USA “2018-01-07T00:00:00.000Z” 76
USA “2018-01-08T00:00:00.000Z” 73
USA “2018-01-09T00:00:00.000Z” 70
USA “2018-01-10T00:00:00.000Z” 72
USA “2018-01-12T00:00:00.000Z” 78

Here the day “2018-01-06” is missing for “France” rows, and “2018-01-04” and “2018-01-11” are missing for “USA” rows.

Note that “2018-01-12” will not be considered as a missing row for “France” rows, because the latest date found for this group of rows is “2018-01-11” (even though “2018-01-12” is the latest date found for “USA” rows).

Step configuration:

{
  name: 'addmissingdates',
  datesColumn: 'DATE',
  datesGranularity: 'day',
  groups: ['COUNTRY']
}

Output dataset:

COUNTRY DATE VALUE
France “2018-01-01T00:00:00.000Z” 75
France “2018-01-02T00:00:00.000Z” 80
France “2018-01-03T00:00:00.000Z” 82
France “2018-01-04T00:00:00.000Z” 83
France “2018-01-05T00:00:00.000Z” 80
France “2018-01-06T00:00:00.000Z”  
France “2018-01-07T00:00:00.000Z” 86
France “2018-01-08T00:00:00.000Z” 79
France “2018-01-09T00:00:00.000Z” 76
France “2018-01-10T00:00:00.000Z” 79
France “2018-01-11T00:00:00.000Z” 85
USA “2018-01-01T00:00:00.000Z” 69
USA “2018-01-02T00:00:00.000Z” 73
USA “2018-01-03T00:00:00.000Z” 73
USA “2018-01-04T00:00:00.000Z”  
USA “2018-01-05T00:00:00.000Z” 75
USA “2018-01-06T00:00:00.000Z” 70
USA “2018-01-07T00:00:00.000Z” 76
USA “2018-01-08T00:00:00.000Z” 73
USA “2018-01-09T00:00:00.000Z” 70
USA “2018-01-10T00:00:00.000Z” 72
USA “2018-01-11T00:00:00.000Z”  
USA “2018-01-12T00:00:00.000Z” 78

Example 3: month granularity

Input dataset:

DATE VALUE
“2019-01-01T00:00:00.000Z” 74
“2019-02-01T00:00:00.000Z” 73
“2019-03-01T00:00:00.000Z” 68
“2019-04-01T00:00:00.000Z” 71
“2019-06-01T00:00:00.000Z” 74
“2019-07-01T00:00:00.000Z” 74
“2019-08-01T00:00:00.000Z” 73
“2019-09-01T00:00:00.000Z” 72
“2019-10-01T00:00:00.000Z” 75
“2019-12-01T00:00:00.000Z” 76

Here “2019-05” and “2019-11” are missing.

Step configuration:

{
  name: 'addmissingdates',
  datesColumn: 'DATE',
  datesGranularity: 'month',
}

Output dataset:

DATE VALUE
“2019-01-01T00:00:00.000Z” 74
“2019-02-01T00:00:00.000Z” 73
“2019-03-01T00:00:00.000Z” 68
“2019-04-01T00:00:00.000Z” 71
“2019-05-01T00:00:00.000Z”  
“2019-06-01T00:00:00.000Z” 74
“2019-07-01T00:00:00.000Z” 74
“2019-08-01T00:00:00.000Z” 73
“2019-09-01T00:00:00.000Z” 72
“2019-10-01T00:00:00.000Z” 75
“2019-11-01T00:00:00.000Z”  
“2019-12-01T00:00:00.000Z” 76

aggregate step

Perform aggregations on one or several columns. Available aggregation functions are sum, average, count, count distinct, min, max, first, last.

An aggregation step has the following structure:

{
  name: 'aggregate',
  on: ['column1', 'column2'],
  aggregations: [
    {
      newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
      aggfunction: 'sum', // the aggregation function applied to the specified columns
      columns: ['value1', 'value2']
    },
    {
      newcolumns: ['avg_value1'],
      aggfunction: 'avg',
      columns: ['value1']
    }
    // ...
  ],
  keepOriginalGranularity: false // whether to keep the original granularity, in which
                                 // case computed aggregations are added in new columns
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Deprecation note:

The column and newcolumn (singular) properties of the aggregation parameter are deprecated and are supported only for backward compatibility. When this step was first created, only one column at a time could be aggregated in a given aggregation. Now columns and newcolumns are lists of columns, allowing the same aggregation function to be applied to several columns at once.

An old-fashioned step looked like this:

{
  name: 'aggregate',
  on: ['column1', 'column2'],
  aggregations: [
    {
      newcolumn: 'sum_value1',
      aggfunction: 'sum',
      column: 'value1'
    },
    {
      newcolumn: 'sum_value2',
      aggfunction: 'sum',
      column: 'value2'
    },
    {
      newcolumn: 'avg_value1',
      aggfunction: 'avg',
      column: 'value1'
    }
    // ...
  ]
}

Example 1: keepOriginalGranularity set to false

Input dataset:

Label Group Value1 Value2
Label 1 Group 1 13 10
Label 2 Group 1 7 21
Label 3 Group 1 20 4
Label 4 Group 2 1 17
Label 5 Group 2 9 12
Label 6 Group 2 5 2

Step configuration:

{
  name: 'aggregate',
  on: ['Group'],
  aggregations: [
    {
      newcolumns: ['Sum-Value1', 'Sum-Value2'],
      aggfunction: 'sum',
      columns: ['Value1', 'Value2']
    },
    {
      newcolumns: ['Avg-Value1'],
      aggfunction: 'avg',
      columns: ['Value1']
    }
  ],
  keepOriginalGranularity: false
}

Output dataset:

Group Sum-Value1 Sum-Value2 Avg-Value1
Group 1 40 35 13.333333
Group 2 16 31 5.333333

Example 2: keepOriginalGranularity set to true

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'aggregate',
   on: ['Group'],
   aggregations:  [
    {
      newcolumns: ['Total'],
      aggfunction: 'sum',
      columns: ['Value']
    }
  ],
  keepOriginalGranularity: true,
}

Output dataset:

Label Group Value Total
Label 1 Group 1 13 40
Label 2 Group 1 7 40
Label 3 Group 1 20 40
Label 4 Group 2 1 16
Label 5 Group 2 10 16
Label 6 Group 2 5 16
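
With keepOriginalGranularity set to true, the aggregated value is broadcast back onto every input row. In pandas terms this is a group-wise transform (a sketch, not the backend code):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": [f"Label {i}" for i in range(1, 7)],
    "Group": ["Group 1"] * 3 + ["Group 2"] * 3,
    "Value": [13, 7, 20, 1, 10, 5],
})
# One output row per input row; the group total is repeated on each row
df["Total"] = df.groupby("Group")["Value"].transform("sum")
```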

append step

Appends one or several datasets resulting from other pipelines to the current dataset. WeaverBird allows you to save pipelines referenced by name in the Vuex store of the application; you can then reference them by their unique names in this step.

{
  name: 'append',
  pipelines: ['pipeline1', 'pipeline2'] // the name of the pipelines to append
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7

dataset1 (saved in the application Vuex store):

Label Group Value
Label 3 Group 1 20
Label 4 Group 2 1

dataset2 (saved in the application Vuex store):

Label Group Value
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'append',
  pipelines: ['dataset1', 'dataset2']
}

Output dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5
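
In pandas terms, appending pipelines amounts to a row-wise concatenation (illustrative sketch only):

```python
import pandas as pd

current = pd.DataFrame({"Label": ["Label 1", "Label 2"],
                        "Group": ["Group 1", "Group 1"],
                        "Value": [13, 7]})
dataset1 = pd.DataFrame({"Label": ["Label 3", "Label 4"],
                         "Group": ["Group 1", "Group 2"],
                         "Value": [20, 1]})
dataset2 = pd.DataFrame({"Label": ["Label 5", "Label 6"],
                         "Group": ["Group 2", "Group 2"],
                         "Value": [10, 5]})
# Rows of the appended datasets follow the rows of the current dataset
out = pd.concat([current, dataset1, dataset2], ignore_index=True)
```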

argmax step

Get row(s) matching the maximum value in a given column, by group if groups is specified.

{
  name: 'argmax',
  column: 'value', // column in which to search for max value
  groups: ['group1', 'group2'] // optional
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: without groups

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'argmax',
  column: 'Value'
}

Output dataset:

Label Group Value
Label 3 Group 1 20

Example 2: with groups

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'argmax',
  column: 'Value',
  groups: ['Group']
}

Output dataset:

Label Group Value
Label 3 Group 1 20
Label 5 Group 2 10
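
Both variants can be sketched with pandas (illustrative only; note that rows tied with the max are all kept):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": [f"Label {i}" for i in range(1, 7)],
    "Group": ["Group 1"] * 3 + ["Group 2"] * 3,
    "Value": [13, 7, 20, 1, 10, 5],
})
# Without groups: keep rows matching the global max
global_max = df[df["Value"] == df["Value"].max()]
# With groups: keep rows matching the max within each group
by_group = df[df["Value"] == df.groupby("Group")["Value"].transform("max")]
```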

argmin step

Get row(s) matching the minimum value in a given column, by group if groups is specified.

{
  name: 'argmin',
  column: 'value', // column in which to search for min value
  groups: ['group1', 'group2'] // optional
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: without groups

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'argmin',
  column: 'Value'
}

Output dataset:

Label Group Value
Label 4 Group 2 1

Example 2: with groups

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'argmin',
  column: 'Value',
  groups: ['Group']
}

Output dataset:

Label Group Value
Label 2 Group 1 7
Label 4 Group 2 1

comparetext step

Compares 2 string columns and returns true if the string values are equal, and false otherwise. The comparison is case-sensitive (see examples below).

{
  name: 'comparetext',
  newColumnName: 'NEW', // new column to be created, with the comparison result
  strCol1: 'TEXT_1', // first string column to compare
  strCol2: 'TEXT_2', // second string column to compare
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

TEXT_1  TEXT_2
France Fr
France France
France france
France England
France USA

Step configuration:

{
  name: 'comparetext',
  newColumnName: 'RESULT',
  strCol1: 'TEXT_1',
  strCol2: 'TEXT_2',
}

Output dataset:

TEXT_1  TEXT_2 RESULT
France Fr false
France France true
France france false
France England false
France USA false
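
A pandas sketch of this case-sensitive comparison (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "TEXT_1": ["France"] * 5,
    "TEXT_2": ["Fr", "France", "france", "England", "USA"],
})
# Row-by-row, case-sensitive string equality
df["RESULT"] = df["TEXT_1"] == df["TEXT_2"]
```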

concatenate step

This step concatenates several columns using a separator.

{
  name: 'concatenate',
  columns: ['Company', 'Group'],
  separator: ' - ', // can be a string of any length
  new_column_name: 'Label' // the new column in which to write the concatenation
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Group Value
Company 1 Group 1 13
Company 2 Group 1 7
Company 3 Group 1 20
Company 4 Group 2 1
Company 5 Group 2 10
Company 6 Group 2 5

Step configuration:

{
  name: 'concatenate',
  columns: ['Company', 'Group'],
  separator: ' - ',
  new_column_name: 'Label'
}

Output dataset:

Company Group Value Label
Company 1 Group 1 13 Company 1 - Group 1
Company 2 Group 1 7 Company 2 - Group 1
Company 3 Group 1 20 Company 3 - Group 1
Company 4 Group 2 1 Company 4 - Group 2
Company 5 Group 2 10 Company 5 - Group 2
Company 6 Group 2 5 Company 6 - Group 2
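
In pandas this is a straightforward string concatenation with the separator in between (sketch only):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Company 1", "Company 2"],
    "Group": ["Group 1", "Group 1"],
})
# Join the selected columns with the separator, into the new column
df["Label"] = df["Company"] + " - " + df["Group"]
```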

convert step

This step converts columns to another data type.

{
  name: 'convert',
  columns: ['col1', 'col2'],
  data_type: 'integer' // the data type to convert the columns into; can be either
                       // 'integer', 'float', 'text', 'date' or 'boolean'
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Pandas (python)
  • Snowflake

In an effort to harmonize conversion behaviors as much as possible, the SQL translator implements casting differently from the plain CAST AS method in some cases.

Precisely, when casting float to integer, the default behaviour rounds the result whereas other languages truncate it; that is why TRUNCATE is used when converting float to int. The same implementation was used when converting strings to int (for dates represented as strings). As for the conversion of date to int, it is handled by assuming the dataset’s timestamps are in TIMESTAMP_NTZ format.

Example

Input dataset:

Company Value
Company 1 ‘13’
Company 2 ‘7’
Company 3 ‘20’
Company 4 ‘1’
Company 5 ‘10’
Company 6 ‘5’

Step configuration:

{
  name: 'convert',
  columns: ['Value'],
  data_type: 'integer'
}

Output dataset:

Company Value
Company 1 13
Company 2 7
Company 3 20
Company 4 1
Company 5 10
Company 6 5
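
The Pandas backend's behavior for this example can be sketched as a plain dtype cast (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Company 1", "Company 2"],
    "Value": ["13", "7"],  # strings, as in the input dataset
})
# Convert the 'Value' column from text to integer
df["Value"] = df["Value"].astype(int)
```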

cumsum step

This step computes the cumulated sum of value columns, based on a reference column (usually dates) sorted in ascending order for the needs of the computation. The computation can be scoped by group if needed.

The toCumSum parameter takes as input a list of two-element lists of the form [‘valueColumn’, ‘newColumn’].

{
  name: 'cumsum',
  toCumSum: [['myValues', 'myCumsum']],
  referenceColumn: 'myDates',
  groupby: ['foo', 'bar'],
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Basic usage

Input dataset:

DATE VALUE
2019-01 2
2019-02 5
2019-03 3
2019-04 8
2019-05 9
2019-06 6

Step configuration:

{
  name: 'cumsum',
  toCumSum: [['VALUE', '']], // an empty new column name defaults to VALUE_CUMSUM
  referenceColumn: 'DATE',
}

Output dataset:

DATE VALUE VALUE_CUMSUM
2019-01 2 2
2019-02 5 7
2019-03 3 10
2019-04 8 18
2019-05 9 27
2019-06 6 33

Example 2: With more advanced options

Input dataset:

COUNTRY DATE VALUE
France 2019-01 2
France 2019-02 5
France 2019-03 3
France 2019-04 8
France 2019-05 9
France 2019-06 6
USA 2019-01 10
USA 2019-02 6
USA 2019-03 6
USA 2019-04 4
USA 2019-05 8
USA 2019-06 7

Step configuration:

{
  name: 'cumsum',
  toCumSum: [['VALUE', 'MY_CUMSUM']],
  referenceColumn: 'DATE',
  groupby: ['COUNTRY'],
}

Output dataset:

COUNTRY DATE VALUE MY_CUMSUM
France 2019-01 2 2
France 2019-02 5 7
France 2019-03 3 10
France 2019-04 8 18
France 2019-05 9 27
France 2019-06 6 33
USA 2019-01 10 10
USA 2019-02 6 16
USA 2019-03 6 22
USA 2019-04 4 26
USA 2019-05 8 34
USA 2019-06 7 41
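
A grouped cumulated sum can be sketched with pandas as follows (illustrative only; a three-month subset of the data above):

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["France"] * 3 + ["USA"] * 3,
    "DATE": ["2019-01", "2019-02", "2019-03"] * 2,
    "VALUE": [2, 5, 3, 10, 6, 6],
})
# Sort by the reference column, then cumulate within each group
df = df.sort_values(["COUNTRY", "DATE"])
df["MY_CUMSUM"] = df.groupby("COUNTRY")["VALUE"].cumsum()
```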

custom step

This step defines a custom query for transformations that can’t be expressed using the other existing steps.

{
    name: 'custom',
    query: '$group: {_id: ...}'
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6

Example: using Mongo query language

Input dataset:

Company Group Value
Company 1 Group 1 13
Company 2 Group 1 7
Company 3 Group 1 20
Company 4 Group 2 1
Company 5 Group 2 10
Company 6 Group 2 5

Step configuration:

{
  name: 'custom',
  query: '$addFields: { Label: { $concat: [ "$Company", " - ", "$Group" ] } }'
}

Output dataset:

Company Group Value Label
Company 1 Group 1 13 Company 1 - Group 1
Company 2 Group 1 7 Company 2 - Group 1
Company 3 Group 1 20 Company 3 - Group 1
Company 4 Group 2 1 Company 4 - Group 2
Company 5 Group 2 10 Company 5 - Group 2
Company 6 Group 2 5 Company 6 - Group 2

dateextract step

Extract date information (e.g. day, week, year, etc.). The following information can be extracted:

  • year: extract ‘year’ from date,
  • month: extract ‘month’ from date,
  • day: extract ‘day of month’ from date,
  • week: extract ‘week number’ (ranging from 0 to 53) from date,
  • quarter: extract ‘quarter number’ from date (1 for Jan-Feb-Mar),
  • dayOfWeek: extract ‘day of week’ (ranging from 1 for Sunday to 7 for Saturday) from date,
  • dayOfYear: extract ‘day of year’ from date,
  • isoYear: extract ‘year number’ in ISO 8601 format from date,
  • isoWeek: extract ‘week number’ in ISO 8601 format (ranging from 1 to 53) from date,
  • isoDayOfWeek: extract ‘day of week’ in ISO 8601 format (ranging from 1 for Monday to 7 for Sunday) from date,
  • firstDayOfYear: calendar date corresponding to the first day (1st of January) of the year,
  • firstDayOfMonth: calendar date corresponding to the first day of the month,
  • firstDayOfWeek: calendar date corresponding to the first day of the week,
  • firstDayOfQuarter: calendar date corresponding to the first day of the quarter,
  • firstDayOfIsoWeek: calendar date corresponding to the first day of the week in ISO 8601 format,
  • previousDay: calendar date one day before the target date,
  • firstDayOfPreviousYear: calendar date corresponding to the first day (1st of January) of the previous year,
  • firstDayOfPreviousMonth: calendar date corresponding to the first day of the previous month,
  • firstDayOfPreviousWeek: calendar date corresponding to the first day of the previous week,
  • firstDayOfPreviousQuarter: calendar date corresponding to the first day of the previous quarter,
  • firstDayOfPreviousISOWeek: calendar date corresponding to the first day of the previous ISO week,
  • previousYear: extract previous ‘year number’ from date,
  • previousMonth: extract previous ‘month number’ from date,
  • previousWeek: extract previous ‘week number’ from date,
  • previousQuarter: extract previous ‘quarter number’ from date,
  • previousISOWeek: extract previous ‘week number’ in ISO 8601 format from date,
  • hour: extract ‘hour’ from date,
  • minutes: extract ‘minutes’ from date,
  • seconds: extract ‘seconds’ from date,
  • milliseconds: extract ‘milliseconds’ from date.

Here’s an example of such a step:

{
  name: 'dateextract',
  column: 'date',
  dateInfo: ['year', 'month', 'day'],
  newColumns: ['date_year', 'date_month', 'date_day'],
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Deprecation note:

The operation and new_column_name parameters are deprecated and are supported for backward compatibility only.

An old-fashioned step looks like this:

{
  name: 'dateextract',
  column: 'date',
  operation: 'day',
  new_column_name: 'date_day',
}

Example

Input dataset:

Date
2019-10-30T00:00:00.000Z
2019-10-15T00:00:00.000Z
2019-10-01T00:00:00.000Z
2019-09-30T00:00:00.000Z
2019-09-15T00:00:00.000Z
2019-09-01T00:00:00.000Z

Step configuration:

{
  name: 'dateextract',
  column: 'Date',
  dateInfo: ['year', 'month', 'day'],
  newColumns: ['Date_year', 'Date_month', 'Date_day'],
}

Output dataset:

Date Date_year Date_month Date_day
2019-10-30T00:00:00.000Z 2019 10 30
2019-10-15T00:00:00.000Z 2019 10 15
2019-10-01T00:00:00.000Z 2019 10 1
2019-09-30T00:00:00.000Z 2019 9 30
2019-09-15T00:00:00.000Z 2019 9 15
2019-09-01T00:00:00.000Z 2019 9 1
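
For the basic year/month/day extractions, the pandas equivalent is the `.dt` accessor (illustrative sketch only):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-30", "2019-10-15", "2019-09-01"]),
})
# Each dateInfo item yields one new column
df["Date_year"] = df["Date"].dt.year
df["Date_month"] = df["Date"].dt.month
df["Date_day"] = df["Date"].dt.day
```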

delete step

Delete one or several columns.

{
    name: 'delete',
    columns: ['my-column', 'some-other-column']
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Group Value Label
Company 1 Group 1 13 Company 1 - Group 1
Company 2 Group 1 7 Company 2 - Group 1
Company 3 Group 1 20 Company 3 - Group 1
Company 4 Group 2 1 Company 4 - Group 2
Company 5 Group 2 10 Company 5 - Group 2
Company 6 Group 2 5 Company 6 - Group 2

Step configuration:

{
  name: 'delete',
  columns: ['Company', 'Group']
}

Output dataset:

Value Label
13 Company 1 - Group 1
7 Company 2 - Group 1
20 Company 3 - Group 1
1 Company 4 - Group 2
10 Company 5 - Group 2
5 Company 6 - Group 2

dissolve step

Geographically dissolve data (merge the geometries of grouped rows).

{
    name: 'dissolve',
    groups: ['my-column', 'some-other-column'],
    include_nulls: true,
}

This step is supported by the following backends:

  • Pandas (python)

Example without aggregations

Input dataset:

Country City geometry
Country 1 City 1 Polygon
Country 2 City 2 Polygon
Country 2 City 3 Polygon
Country 1 City 4 Polygon
Country 2 City 5 Polygon
Country 1 City 6 Polygon

Step configuration:

{
    name: 'dissolve',
    groups: ['Country'],
    include_nulls: true,
}

Output dataset:

Country geometry
Country 1 MultiPolygon
Country 2 MultiPolygon

Example with aggregations

Input dataset:

Country City geometry Population
Country 1 City 1 Polygon 100_000
Country 2 City 2 Polygon 50_000
Country 2 City 3 Polygon 200_000
Country 1 City 4 Polygon 30_000
Country 2 City 5 Polygon 25_000
Country 1 City 6 Polygon 10_000

Step configuration:

{
    name: 'dissolve',
    groups: ['Country'],
    include_nulls: true,
    aggregations: [
        {
            aggfunction: 'sum',
            columns: ['Population'],
            newcolumns: ['Total population'],
        }
    ]
}

Output dataset:

Country geometry Total population
Country 1 MultiPolygon 140_000
Country 2 MultiPolygon 275_000

domain step

This step is meant to select a specific domain (using MongoDB terminology).

{
    name: 'domain',
    domain: 'my-domain'
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

duplicate step

This step is meant to duplicate a column.

{
    name: 'duplicate',
    column: 'my-column',
    new_column_name: 'my-duplicate'
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Value
Company 1 13
Company 2 0
Company 3 20

Step configuration:

{
  name: 'duplicate',
  column: "Company",
  new_column_name: "Company-copy",
}

Output dataset:

Company Value Company-copy
Company 1 13 Company 1
Company 2 0 Company 2
Company 3 20 Company 3

duration step

Compute the duration (in days, hours, minutes or seconds) between 2 dates in a new column.

{
  name: 'duration',
  newColumnName: 'DURATION', // new column name
  startDateColumn: 'START_DATE', // start date
  endDateColumn: 'END_DATE', // end date
  durationIn: 'days', // duration units (days, hours, minutes or seconds)
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: duration in days

Input dataset:

START_DATE END_DATE
“2020-01-01T00:00:00.000Z” “2020-01-31T00:00:00.000Z”
“2020-01-01T00:00:00.000Z” “2020-12-31T00:00:00.000Z”

Step configuration:

{
  name: 'duration',
  newColumnName: 'DURATION',
  startDateColumn: 'START_DATE',
  endDateColumn: 'END_DATE',
  durationIn: 'days',
}

Output dataset:

START_DATE END_DATE DURATION
“2020-01-01T00:00:00.000Z” “2020-01-31T00:00:00.000Z” 30
“2020-01-01T00:00:00.000Z” “2020-12-31T00:00:00.000Z” 365

Example 2: duration in minutes

Input dataset:

START_HOUR END_HOUR
“2020-01-01T14:00:00.000Z” “2020-01-01T15:00:00.000Z”
“2020-01-01T15:00:00.000Z” “2020-01-01T20:00:00.000Z”

Step configuration:

{
  name: 'duration',
  newColumnName: 'DURATION',
  startDateColumn: 'START_HOUR',
  endDateColumn: 'END_HOUR',
  durationIn: 'minutes',
}

Output dataset:

START_HOUR END_HOUR DURATION
“2020-01-01T14:00:00.000Z” “2020-01-01T15:00:00.000Z” 60
“2020-01-01T15:00:00.000Z” “2020-01-01T20:00:00.000Z” 300
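
A duration in a given unit is just the date difference divided by the unit's length in seconds. A pandas sketch for the days case (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "END_DATE": pd.to_datetime(["2020-01-31", "2020-12-31"]),
})
delta = df["END_DATE"] - df["START_DATE"]
# durationIn: 'days' -> divide the total seconds by 86400;
# use 3600, 60 or 1 for hours, minutes or seconds respectively
df["DURATION"] = delta.dt.total_seconds() / 86400
```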

evolution step

Use this step if you need to compute the row-by-row evolution of a value column, based on a date column. It will output a new column containing the evolution, either in absolute value or in percentage depending on the evolutionFormat parameter.

You must ensure that the computation is scoped so that there are no duplicate dates (so that any date finds no more than one previous date). That means you may need to specify “group by” columns to make every date unique inside each group. You should specify those columns in the indexColumns parameter.

{
  name: 'evolution',
  dateCol: 'DATE',
  valueCol: 'VALUE',
  evolutionType: 'vsLastYear', // or vsLastMonth, vsLastWeek, vsLastDay
  evolutionFormat: 'abs', // or pct
  // optional, if computation has to be performed by group so that any date finds no more than one previous date
  indexColumns: ['COUNTRY'],
  newColumn: 'MY_EVOL', // optional, <originalColumnName>_EVOL_ABS or <originalColumnName>_EVOL_PCT by default
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Basic configuration - evolution in absolute value

Input dataset:

DATE VALUE
2019-06 79
2019-07 81
2019-08 77
2019-09 75
2019-11 78
2019-12 88

Step configuration:

{
  name: 'evolution',
  dateCol: 'DATE',
  valueCol: 'VALUE',
  evolutionType: 'vsLastMonth',
  evolutionFormat: 'abs',
  indexColumns: [],
}

Output dataset:

DATE VALUE VALUE_EVOL_ABS
2019-06 79  
2019-07 81 2
2019-08 77 -4
2019-09 75 -2
2019-11 78  
2019-12 88 10

Example 2: Basic configuration - evolution in percentage

Input dataset:

DATE VALUE
2019-06 79
2019-07 81
2019-08 77
2019-09 75
2019-11 78
2019-12 88

Step configuration:

{
  name: 'evolution',
  dateCol: 'DATE',
  valueCol: 'VALUE',
  evolutionType: 'vsLastMonth',
  evolutionFormat: 'pct',
  indexColumns: [],
}

Output dataset:

DATE VALUE VALUE_EVOL_PCT
2019-06 79  
2019-07 81 0.02531645569620253
2019-08 77 -0.04938271604938271
2019-09 75 -0.025974025974025976
2019-11 78  
2019-12 88 0.1282051282051282

Example 3: Error on duplicate dates

If ‘COUNTRY’ is not specified in indexColumns, the computation will not be scoped by country. There are then duplicate dates in the “DATE” column, which is prohibited and will lead to an error.

Input dataset:

DATE COUNTRY VALUE
2014-12 France 79
2015-12 France 81
2016-12 France 77
2017-12 France 75
2014-12 USA 74
2015-12 USA 74
2016-12 USA 73
2017-12 USA 72

Step configuration:

{
  name: 'evolution',
  dateCol: 'DATE',
  valueCol: 'VALUE',
  evolutionType: 'vsLastYear',
  evolutionFormat: 'abs',
  indexColumns: [],
}

Output dataset:

With the mongo translator, you will get an error at row-level as shown below:

DATE COUNTRY VALUE MY_EVOL
2014-12 France 79  
2015-12 France 81 Error …
2016-12 France 77 Error …
2017-12 France 75 Error …
2014-12 USA 74  
2015-12 USA 74 Error …
2016-12 USA 73 Error …
2017-12 USA 72 Error …

The pandas translator will just return an error, and you will not get any data.

Example 4: Complete configuration with index columns

Input dataset:

DATE COUNTRY VALUE
2014-12 France 79
2015-12 France 81
2016-12 France 77
2017-12 France 75
2019-12 France 78
2020-12 France 88
2014-12 USA 74
2015-12 USA 74
2016-12 USA 73
2017-12 USA 72
2018-12 USA 75
2020-12 USA 76

Step configuration:

{
  name: 'evolution',
  dateCol: 'DATE',
  valueCol: 'VALUE',
  evolutionType: 'vsLastYear',
  evolutionFormat: 'abs',
  indexColumns: ['COUNTRY'],
  newColumn: 'MY_EVOL',
}

Output dataset:

DATE COUNTRY VALUE MY_EVOL
2014-12 France 79  
2015-12 France 81 2
2016-12 France 77 -4
2017-12 France 75 -2
2019-12 France 78  
2020-12 France 88 10
2014-12 USA 74  
2015-12 USA 74 0
2016-12 USA 73 -1
2017-12 USA 72 -1
2018-12 USA 75 3
2020-12 USA 76  
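
For the simple case where consecutive rows within a group are exactly one period apart, the computation can be sketched with a grouped shift in pandas (illustrative only; WeaverBird's actual matching is on the date itself, not the row order):

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2014-12", "2015-12", "2016-12"] * 2,
    "COUNTRY": ["France"] * 3 + ["USA"] * 3,
    "VALUE": [79, 81, 77, 74, 74, 73],
})
df = df.sort_values(["COUNTRY", "DATE"])
# Evolution in absolute value vs the previous year within each group;
# the first row of each group has no previous date, hence NaN
df["MY_EVOL"] = df["VALUE"] - df.groupby("COUNTRY")["VALUE"].shift(1)
```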

fillna step

Replace null values by a given value in specified columns.

{
    name: 'fillna',
    columns: ["foo", "bar"],
    value: 0
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Deprecation note:

The column (singular) parameter is deprecated and is supported only for backward compatibility.

An old-fashioned step looks like this:

{
    name: 'fillna',
    column: 'foo',
    value: 0
}

Example

Input dataset:

Company Group Value KPI
Company 1 Group 1 13  
Company 2 Group 1   12
Company 3 Group 1 20 40
Company 4 Group 2 1  
Company 5 Group 2   38
Company 6 Group 2 5 4

Step configuration:

{
  name: 'fillna',
  columns: ["Value", "KPI"],
  value: 0
}

Output dataset:

Company Group Value KPI
Company 1 Group 1 13 0
Company 2 Group 1 0 12
Company 3 Group 1 20 40
Company 4 Group 2 1 0
Company 5 Group 2 0 38
Company 6 Group 2 5 4
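
In pandas terms this maps to fillna applied to the selected columns (illustrative sketch only):

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [13, None, 20],
    "KPI": [None, 12, 40],
})
# Replace null values by 0, only in the specified columns
df[["Value", "KPI"]] = df[["Value", "KPI"]].fillna(0)
```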

filter step

Filter out rows that don’t match a filter definition.

{
    name: 'filter',
    condition: {
      column: 'my-column',
      value: 42,
      operator: 'ne'
    }
}

operator is optional, and defaults to eq. Allowed operators are eq, ne, gt, ge, lt, le, in, nin, matches, notmatches, isnull or notnull.

value can be an arbitrary value depending on the selected operator (e.g a list when used with the in operator, or null when used with the isnull operator).

matches and notmatches operators are used to test value against a regular expression.

Conditions can be grouped and nested with logical operators and and or.

{
    name: 'filter',
    condition: {
      and: [
        {
          column: 'my-column',
          value: 42,
          operator: 'ge'
        },
        {
          column: 'my-column',
          value: 118,
          operator: 'le'
        },
        {
          or: [
            {
              column: 'my-other-column',
              value: 'blue',
              operator: 'eq'
            },
            {
              column: 'my-other-column',
              value: 'red',
              operator: 'eq'
            }
          ]
        }
      ]
    }
}

This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)
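
The nested and/or condition above translates to combined boolean masks. A pandas sketch (illustrative only; the sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "my-column": [10, 50, 200, 100],
    "my-other-column": ["blue", "green", "blue", "red"],
})
# (my-column >= 42 AND my-column <= 118) AND (blue OR red)
mask = (df["my-column"].ge(42)
        & df["my-column"].le(118)
        & (df["my-other-column"].eq("blue") | df["my-other-column"].eq("red")))
out = df[mask]
```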

Relative dates

Date values can be relative to the moment when the query is executed. This is expressed by using a RelativeDate object instead of the value, of the form:

{
  quantity: Number, // can be positive (future) or negative (past)
  duration: 'year' | 'quarter' | 'month' | 'week' | 'day'
}

This feature is supported by the following backends:

  • Mongo 5.0

formula step

Add a computation based on a formula. Column names usually do not need to be escaped, unless they include whitespace, in which case you’ll need to use brackets ‘[]’ (e.g. [my column]). Any string escaped with quotes (', ", ''', """) will be considered a string literal.

{
  name: 'formula',
  new_column: 'result', // if the column already exists, it is overwritten
  formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}

Supported operators

The following operators are supported by the formula step (note that a value can be a column name or a literal, such as 42 or foo).

  • +: Does an addition of two numeric values. See the concatenate step to append strings
  • -: Does a subtraction of two numeric values. See the replace step to remove a part of a string
  • *: Multiplies two numeric values.
  • /: Divides a numeric value by another. Divisions by zero will return null.
  • %: Returns the rest of an integer division. Divisions by zero will return null.

Example 1: Basic usage

Input dataset:

Label Value1 Value2 Value3 Value4
Label 1 10 2 3 1
Label 2 1 13 7 3
Label 3 5 20 5 2

Step configuration:

{
  name: 'formula',
  new_column: 'Result',
  formula: '(Value1 + Value2) / Value3 - Value4 * 2'
}

Output dataset:

Label Value1 Value2 Value3 Value4 Result
Label 1 10 2 3 1 2
Label 2 1 13 7 3 -4
Label 3 5 20 5 2 1
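
As a rough illustration of how such a formula behaves, pandas (one of the supported backends) can evaluate the same expression with DataFrame.eval. This is a sketch for this example's data, not the backend's actual code.

```python
import pandas as pd

df = pd.DataFrame({
    'Value1': [10, 1, 5],
    'Value2': [2, 13, 20],
    'Value3': [3, 7, 5],
    'Value4': [1, 3, 2],
})

# Evaluate the arithmetic expression against the columns.
df['Result'] = df.eval('(Value1 + Value2) / Value3 - Value4 * 2')
```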

Example 2: Column name with whitespaces

Input dataset:

Label Value1 Value2 Value3 Value 4
Label 1 10 2 3 1
Label 2 1 13 7 3
Label 3 5 20 5 2

Step configuration:

{
  name: 'formula',
  new_column: 'Result',
  formula: '(Value1 + Value2) / Value3 - [Value 4] * 2'
}

Output dataset:

Label Value1 Value2 Value3 Value 4 Result
Label 1 10 2 3 1 2
Label 2 1 13 7 3 -4
Label 3 5 20 5 2 1

hierarchy step

Hierarchy for geographical data.

This step dissolves data for every hierarchy level, and adds a hierarchy_level column containing a level (with 0 being the coarsest granularity, i.e. the top of the hierarchy).

{
    name: 'hierarchy',
    hierarchy: ['Country', 'City'],
    include_nulls: false,
}

*This step is supported by the following backends:

  • Pandas (python)

Example

Input dataset:

Country City geometry Population
Country 1 City 1 Polygon 100_000
Country 2 City 2 Polygon 50_000
Country 2 City 3 Polygon 200_000
Country 1 City 4 Polygon 30_000
Country 2 City 5 Polygon 25_000
Country 1 City 6 Polygon 10_000

Step configuration:

{
    name: 'hierarchy',
    hierarchy: ['Country', 'City'],
    include_nulls: false,
}

Output dataset:

Country City geometry Population hierarchy_level
Country 1 City 1 Polygon 100_000 2
Country 2 City 2 Polygon 50_000 2
Country 2 City 3 Polygon 200_000 2
Country 1 City 4 Polygon 30_000 2
Country 2 City 5 Polygon 25_000 2
Country 1 City 6 Polygon 10_000 2
Country 1 null MultiPolygon null 1
Country 2 null MultiPolygon null 1
null null MultiPolygon null 0

ifthenelse step

Creates a new column whose values will depend on a condition expressed on existing columns.

The condition is expressed in the if parameter with a condition object (the same object expected by the condition parameter of the filter step). Conditions can be grouped and nested with logical operators and and or.

The then parameter only supports a string, which will be interpreted as a formula (cf. formula step). If you want it to be interpreted strictly as a string and not as a formula, you must escape the string with quotes (e.g. ‘“this is a text”’).

if...then...else blocks can be nested, as the else parameter supports either a string that will be interpreted as a formula (cf. formula step), or a nested if...then...else object.

{
  name: 'ifthenelse',
  newColumn: '', // the name of the new column to be created
  if: { column: '', value: '', operator: 'eq' }, // a condition, same object as in the filter step
  then: '', // a string that will be interpreted as a formula
  else: '', // a string interpreted as a formula or a nested if...then...else object
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Label number
Label 1 -2
Label 2 2
Label 3 0

Step configuration:

{
    name: 'ifthenelse',
    newColumn: 'result',
    if: { column: 'number', value: 0, operator: 'eq' },
    then: '"zero"',
    else: {
      if: { column: 'number', value: 0, operator: 'lt' },
      then: 'number * -1',
      else: 'number'
    }
}

Output dataset:

Label number result
Label 1 -2 2
Label 2 2 2
Label 3 0 zero
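
A minimal pandas sketch of the nested condition above, assuming the example's data. Note that mixing a text result ('zero') with numeric results coerces the whole column to strings, which is why the numeric outputs below come out as text; this is a property of the sketch, not necessarily of the backends.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': [-2, 2, 0]})

# Outer condition first, then the nested if...then...else as the fallback.
df['result'] = np.where(
    df['number'] == 0,
    'zero',
    np.where(df['number'] < 0, df['number'] * -1, df['number']),
)
```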

join step

Joins a dataset to the current dataset, i.e. brings columns from the former into the latter, and matches rows based on columns correspondence. It is similar to a JOIN clause in SQL, or to a VLOOKUP in Excel. The joined dataset is the result of the query of the right_pipeline.

The join type can be:

  • ‘left’: will keep every row of the current dataset and fill unmatched rows with null values,

  • ‘inner’: will only keep rows that match rows of the joined dataset.

In the on parameter, you must specify 1 or more column couple(s) that will be compared to determine rows correspondence between the 2 datasets. The first element of a couple is for the current dataset column, and the second for the corresponding column in the right dataset to be joined. If you specify more than 1 couple, the matching rows will be those that find a correspondence between the 2 datasets for every column couple specified (logical ‘AND’).

Weaverbird allows you to save pipelines referenced by name in the Vuex store of the application. You can then call them by their unique names in this step.

{
  name: 'join',
  right_pipeline: 'somePipelineReference',
  type: 'left', // or 'inner'
  on: [
    ['currentDatasetColumn1', 'rightDatasetColumn1'],
    ['currentDatasetColumn2', 'rightDatasetColumn2'],
  ]
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Left join with one column couple as on parameter

Input dataset:

Label Value
Label 1 13
Label 2 7
Label 3 20
Label 4 1
Label 5 1
Label 6 1

rightDataset (saved in the application Vuex store):

Label Group
Label 1 Group 1
Label 2 Group 1
Label 3 Group 2
Label 4 Group 2

Step configuration:

{
  name: 'join',
  right_pipeline: 'rightDataset',
  type: 'left',
  on: [['Label', 'Label']]
}

Output dataset:

Label Value Group
Label 1 13 Group 1
Label 2 7 Group 1
Label 3 20 Group 2
Label 4 1 Group 2
Label 5 1  
Label 6 1  

Example 2: inner join with different column names in the on parameter

Input dataset:

Label Value
Label 1 13
Label 2 7
Label 3 20
Label 4 1
Label 5 1
Label 6 1

rightDataset (saved in the application Vuex store):

LabelRight Group
Label 1 Group 1
Label 2 Group 1
Label 3 Group 2
Label 4 Group 2

Step configuration:

{
  name: 'join',
  right_pipeline: 'rightDataset',
  type: 'inner',
  on: [['Label', 'LabelRight']]
}

Output dataset:

Label Value LabelRight Group
Label 1 13 Label 1 Group 1
Label 2 7 Label 2 Group 1
Label 3 20 Label 3 Group 2
Label 4 1 Label 4 Group 2
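
Both join types can be sketched with pandas merge, using trimmed-down versions of the example datasets above. This is an illustration, not the backend's actual implementation.

```python
import pandas as pd

left = pd.DataFrame({'Label': ['Label 1', 'Label 2', 'Label 5'],
                     'Value': [13, 7, 1]})
right = pd.DataFrame({'LabelRight': ['Label 1', 'Label 2'],
                      'Group': ['Group 1', 'Group 1']})

# 'left' keeps every row of the current dataset; unmatched rows get NaN.
left_joined = left.merge(right, how='left',
                         left_on='Label', right_on='LabelRight')

# 'inner' only keeps rows that match a row of the joined dataset.
inner_joined = left.merge(right, how='inner',
                          left_on='Label', right_on='LabelRight')
```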

fromdate step

Converts a date column into a string column based on a specified format.

{
    name: 'fromdate',
    column: 'myDateColumn',
    format: '%Y-%m-%d' // For further details on supported format specifiers,
                       // see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Date Value
Company 1 2019-10-06T00:00:00.000Z 13
Company 1 2019-10-07T00:00:00.000Z 7
Company 1 2019-10-08T00:00:00.000Z 20
Company 2 2019-10-06T00:00:00.000Z 1
Company 2 2019-10-07T00:00:00.000Z 10
Company 2 2019-10-08T00:00:00.000Z 5

Step configuration:

{
  name: 'fromdate',
  column: 'Date',
  format: '%d/%m/%Y'
}

Output dataset:

Company Date Value
Company 1 06/10/2019 13
Company 1 07/10/2019 7
Company 1 08/10/2019 20
Company 2 06/10/2019 1
Company 2 07/10/2019 10
Company 2 08/10/2019 5
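
In pandas terms, this date-to-string conversion is a strftime call; a minimal sketch using the example's format:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-10-06', '2019-10-07'])})

# Render datetimes as strings with the requested format.
df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')
```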

lowercase step

Converts a string column to lowercase.

{
  name: 'lowercase',
  column: 'foo',
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Group Value
LABEL 1 Group 1 13
LABEL 2 Group 1 7
LABEL 3 Group 1 20

Step configuration:

{
  name: 'lowercase',
  column: 'Label',
}

Output dataset:

Label Group Value
label 1 Group 1 13
label 2 Group 1 7
label 3 Group 1 20

movingaverage step

Compute the moving average based on a value column, a reference column to sort (usually a date column) and a moving window (in number of rows i.e. data points). If needed, the computation can be performed by group of rows. The computation result is added in a new column.

{
  name: 'movingaverage',
  valueColumn: 'value',
  columnToSort: 'dates',
  movingWindow: 12,
  groups: ['foo', 'bar'], // optional
  newColumnName: 'myNewColumn' // optional, <valueColumn>_MOVING_AVG by default
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Basic usage

Input dataset:

DATE VALUE
2018-01-01 75
2018-01-02 80
2018-01-03 82
2018-01-04 83
2018-01-05 80
2018-01-06 86
2018-01-07 79
2018-01-08 76

Step configuration:

{
  name: 'movingaverage',
  valueColumn: 'VALUE',
  columnToSort: 'DATE',
  movingWindow: 2,
}

Output dataset:

DATE VALUE VALUE_MOVING_AVG
2018-01-01 75 null
2018-01-02 80 77.5
2018-01-03 82 81
2018-01-04 83 82.5
2018-01-05 80 81.5
2018-01-06 86 83
2018-01-07 79 82.5
2018-01-08 76 77.5

Example 2: with groups and custom newColumnName

Input dataset:

COUNTRY DATE VALUE
France 2018-01-01 75
France 2018-01-02 80
France 2018-01-03 82
France 2018-01-04 83
France 2018-01-05 80
France 2018-01-06 86
USA 2018-01-01 69
USA 2018-01-02 73
USA 2018-01-03 73
USA 2018-01-04 75
USA 2018-01-05 70
USA 2018-01-06 76

Step configuration:

{
  name: 'movingaverage',
  valueColumn: 'VALUE',
  columnToSort: 'DATE',
  movingWindow: 3,
  groups: ['COUNTRY'],
  newColumnName: 'ROLLING_AVERAGE'
}

Output dataset:

COUNTRY DATE VALUE ROLLING_AVERAGE
France 2018-01-01 75 null
France 2018-01-02 80 null
France 2018-01-03 82 79
France 2018-01-04 83 81.7
France 2018-01-05 80 81.7
France 2018-01-06 86 83
USA 2018-01-01 69 null
USA 2018-01-02 73 null
USA 2018-01-03 73 71.7
USA 2018-01-04 75 73.7
USA 2018-01-05 70 72.7
USA 2018-01-06 76 73.7
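
The grouped computation can be sketched in pandas with a rolling mean per group, here on a shortened version of the example data. This is an illustration of the logic, not the backend's code.

```python
import pandas as pd

df = pd.DataFrame({
    'COUNTRY': ['France'] * 4 + ['USA'] * 4,
    'DATE': pd.to_datetime(['2018-01-01', '2018-01-02',
                            '2018-01-03', '2018-01-04'] * 2),
    'VALUE': [75, 80, 82, 83, 69, 73, 73, 75],
})

# Sort on the reference column, then take a rolling mean within each group.
df = df.sort_values(['COUNTRY', 'DATE'])
df['ROLLING_AVERAGE'] = (
    df.groupby('COUNTRY')['VALUE']
      .rolling(window=2)
      .mean()
      .reset_index(level=0, drop=True)  # realign on the original row index
)
```

The first window - 1 rows of each group have no complete window, hence the null values in the example outputs above.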

percentage step

Compute the percentage of total, i.e. for every row the value in column divided by the total as the sum of every values in column. The computation can be performed by group if specified. The result is written in a new column.

{
  name: 'percentage',
  column: 'bar',
  group: ['foo'], // optional
  newColumnName: 'myNewColumn' // optional, <originalColumnName>_PCT by default
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Group Value
Label 1 Group 1 5
Label 2 Group 1 10
Label 3 Group 1 15
Label 4 Group 2 2
Label 5 Group 2 7
Label 6 Group 2 5

Step configuration:


{
  name: 'percentage',
  column: 'Value',
  group: ['Group'],
  newColumnName: 'Percentage'
}

Output dataset:

Label Group Value Percentage
Label 1 Group 1 5 0.167
Label 2 Group 1 10 0.333
Label 3 Group 1 15 0.5
Label 4 Group 2 2 0.143
Label 5 Group 2 7 0.5
Label 6 Group 2 5 0.357
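
The per-group share computed by this step boils down to dividing each value by its group total, which a pandas transform sketches in one line (illustrative, using the example's data):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['Group 1'] * 3 + ['Group 2'] * 3,
    'Value': [5, 10, 15, 2, 7, 5],
})

# Each value divided by the total of its group.
df['Percentage'] = df['Value'] / df.groupby('Group')['Value'].transform('sum')
```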

pivot step

Pivot rows into columns around a given index (expressed as a combination of column(s)). Values to be used as new column names are found in the column column_to_pivot. Values to populate new columns are found in the column value_column. The function used to aggregate data (when several rows are found by index group) must be among sum, avg, count, min or max.

{
 name: 'pivot',
 index: ['column_1', 'column_2'],
 column_to_pivot: 'column_3',
 value_column: 'column_4',
 agg_function: 'sum',
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Country Value
Label 1 Country1 13
Label 2 Country1 7
Label 3 Country1 20
Label 1 Country2 1
Label 2 Country2 10
Label 3 Country2 5
Label 3 Country2 1

Step configuration:

{
 name: 'pivot',
 index: ['Label'],
 column_to_pivot: 'Country',
 value_column: 'Value',
 agg_function: 'sum',
}

Output dataset:

Label Country1 Country2
Label 1 13 1
Label 2 7 10
Label 3 20 6
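
The same pivot-with-aggregation logic can be sketched with pandas pivot_table, here on a reduced version of the example data. Note how duplicate (index, column) pairs are folded by the aggregation function, like 'Label 3' / 'Country2' in the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'Label': ['Label 1', 'Label 2', 'Label 3', 'Label 3'],
    'Country': ['Country1', 'Country1', 'Country2', 'Country2'],
    'Value': [13, 7, 5, 1],
})

# Duplicate (index, column) pairs are aggregated with aggfunc (5 + 1 = 6).
pivoted = df.pivot_table(
    index='Label', columns='Country', values='Value', aggfunc='sum',
).reset_index()
```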

statistics step

Compute statistics of a column. Custom quantiles can be requested via the quantiles parameter, each defined as the nth quantile of a given order (e.g. the median is the 1st quantile of order 2).

{
    name: 'statistics',
    column: 'Value',
    groupby: [],
    statistics: ['average', 'count'],
    quantiles: [{label: 'median', nth: 1, order: 2}],
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
    name: 'statistics',
    column: 'Value',
    groupby: [],
    statistics: ['average', 'count'],
    quantiles: [{label: 'median', nth: 1, order: 2}],
}

Output dataset:

average count median
9.33333 6 8.5
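
A pandas sketch of the same statistics, reproducing the example's figures (the nth quantile of order q is the n/q quantile):

```python
import pandas as pd

values = pd.Series([13, 7, 20, 1, 10, 5])

stats = {
    'average': values.mean(),
    'count': int(values.count()),
    # the median is the 1st quantile of order 2, i.e. the 0.5 quantile
    'median': values.quantile(1 / 2),
}
```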

rank step

This step computes a rank column based on a value column that can be sorted in ascending or descending order. The ranking can be computed by group.

There are 2 ranking methods available, best understood through these examples:

  • standard: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 5, 5, 7]
  • dense: input = [10, 20, 20, 20, 25, 25, 30] => ranking = [1, 2, 2, 2, 3, 3, 4]

(The dense method is basically the same as the standard method, except that rank always increases by 1 at most.)

{
  name: 'rank',
  valueCol: 'VALUE', // The value column that will be ordered to determine rank
  order: 'desc', // How to order the value column to determine ranking.
                 // either asc(ending) or desc(ending), 'desc' by default
  method: 'standard', // either 'standard' or 'dense', as explained above
  groupby: ['foo', 'bar'], // specify columns if you want to group the ranking computation
  newColumnName: 'columnA', // specify the new column name to be created (by default '<yourValueColumn>_RANK')
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Basic usage

Input dataset:

COUNTRY VALUE
FRANCE 15
FRANCE 5
FRANCE 10
FRANCE 20
FRANCE 10
FRANCE 15
USA 20
USA 30
USA 20
USA 25
USA 15
USA 20

Step configuration:

{
  name: 'rank',
  valueCol: 'VALUE',
  order: 'desc',
  method: 'standard',
}

Output dataset:

COUNTRY VALUE VALUE_RANK
USA 30 1
USA 25 2
FRANCE 20 3
USA 20 3
USA 20 3
USA 20 3
FRANCE 15 7
FRANCE 15 7
USA 15 7
FRANCE 10 10
FRANCE 10 10
FRANCE 5 12

Example 2: With more options

Input dataset:

COUNTRY VALUE
FRANCE 15
FRANCE 5
FRANCE 10
FRANCE 20
FRANCE 10
FRANCE 15
USA 20
USA 30
USA 20
USA 25
USA 15
USA 20

Step configuration:

{
  name: 'rank',
  valueCol: 'VALUE',
  order: 'asc',
  method: 'dense',
  groupby: ['COUNTRY'],
  newColumnName: 'MY_RANK',
}

Output dataset:

COUNTRY VALUE MY_RANK
FRANCE 5 1
FRANCE 10 2
FRANCE 10 2
FRANCE 15 3
FRANCE 15 3
FRANCE 20 4
USA 15 1
USA 20 2
USA 20 2
USA 20 2
USA 25 3
USA 30 4
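
A grouped ranking can be sketched with pandas' rank method on a shortened version of example 2's data. Note that pandas calls the "standard" method 'min', while 'dense' keeps the same name.

```python
import pandas as pd

df = pd.DataFrame({
    'COUNTRY': ['FRANCE'] * 3 + ['USA'] * 3,
    'VALUE': [15, 5, 10, 20, 30, 20],
})

# Dense, ascending ranking computed independently inside each group.
df['MY_RANK'] = (
    df.groupby('COUNTRY')['VALUE']
      .rank(method='dense', ascending=True)
      .astype(int)
)
```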

rename step

Rename one or several columns. The toRename parameter takes as input a list of 2-element lists of the form [‘oldColumnName’, ‘newColumnName’].

{
    name: 'rename',
    toRename: [
      ['oldCol1', 'newCol1'], // The first value is the column to be renamed, the second is the new column name
      ['oldCol2', 'newCol2']
    ]
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Deprecation note:

When this step was first created, only 1 column at a time could be renamed, and the step was configured via 2 parameters, oldname and newname, which are now deprecated and supported only for retrocompatibility purposes. An old-fashioned step with oldname set to ‘old’ and newname set to ‘new’ is equivalent to a new-fashioned step with the toRename parameter set to [[‘old’, ‘new’]].

So an old-fashioned step looks like this:

{
    name: 'rename',
    oldname: 'old-column-name',
    newname: 'new-column-name'
}

Example:

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'rename',
  oldname: 'Label',
  newname: 'Company'
}

Output dataset:

Company Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

replace step

Replace one or several values in a column.

A replace step has the following structure:

{
   name: 'replace',
   search_column: "column_1",
   to_replace: [
     ['foo', 'bar'], // The first value is the one to be replaced, the second is the new value
     [42, 0]
   ]
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

COMPANY COUNTRY
Company 1 Fr
Company 2 UK

Step configuration:

{
   name: 'replace',
   search_column: "COUNTRY",
   to_replace: [
     ['Fr', 'France'],
     ['UK', 'United Kingdom']
   ]
}

Output dataset:

COMPANY COUNTRY
Company 1 France
Company 2 United Kingdom
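
This whole-cell replacement maps directly to a pandas replace call with a mapping, sketched here on the example's data:

```python
import pandas as pd

df = pd.DataFrame({'COUNTRY': ['Fr', 'UK']})

# Whole-cell replacement driven by a mapping of old -> new values.
df['COUNTRY'] = df['COUNTRY'].replace({'Fr': 'France', 'UK': 'United Kingdom'})
```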

replacetext step

Replace a substring in a column.

A replace-text step has the following structure:

{
   name: 'replacetext',
   search_column: "column_1",
   old_str: 'foo',
   new_str: 'bar'
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Pandas (python)
  • AWS Athena
  • Google Big Query
  • MySQL
  • PostgreSQL
  • Redshift
  • Snowflake

Example

Input dataset:

COMPANY COUNTRY
Company 1 Fr is boring
Company 2 UK

Step configuration:

{
   name: 'replacetext',
   search_column: "COUNTRY",
   old_str: 'Fr',
   new_str: 'France',
}

Output dataset:

COMPANY COUNTRY
Company 1 France is boring
Company 2 UK
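
Unlike the replace step, which matches whole cell values, this step replaces substrings; in pandas terms that is a str.replace, sketched on the example's data:

```python
import pandas as pd

df = pd.DataFrame({'COUNTRY': ['Fr is boring', 'UK']})

# Substring replacement (not whole-cell matching).
df['COUNTRY'] = df['COUNTRY'].str.replace('Fr', 'France', regex=False)
```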

rollup step

Use this step if you need to compute aggregated data at every level of a hierarchy, specified as a series of columns from top to bottom level. The output data structure stacks the data of every level of the hierarchy, specifying for every row the label, level and parent in dedicated columns.

Aggregated rows can be computed using either sum, average, count, count distinct, min, max, first or last.

{
   name: 'rollup',
   hierarchy: ['continent', 'country', 'city'], // Hierarchical columns, from top to bottom level
   aggregations: [ // The columns to be aggregated, and how
   {
      newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
      aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
      columns: ['value1', 'value2']
    },
    {
      newcolumns: ['avg_value1'],
      aggfunction: 'avg',
      columns: ['value1']
    }
   ],
   groupby: ['date'], // optional, if needing to segment the rollup computation by group
   labelCol: 'label', // optional, name of the output label column, 'label' by default
   levelCol: 'level', // optional, name of the output level column, 'level' by default
   parentLabelCol: 'parent', // optional, name of the output parent column, 'parent' by default
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Deprecation note:

The column and newcolumn (in the singular) properties of the aggregation parameter are deprecated and are supported only for retrocompatibility purposes. When this step was first created, only 1 column at a time could be aggregated for a given aggregation. Now columns and newcolumns are lists of columns, allowing to apply the same aggregation function to several columns at once.

An old-fashioned step looked like this:

{
   name: 'rollup',
   hierarchy: ['continent', 'country', 'city'],
   aggregations:  [
    {
      newcolumn: 'sum_value1',
      aggfunction: 'sum',
      column: 'value1'
    },
    {
      newcolumn: 'sum_value2',
      aggfunction: 'sum',
      column: 'value2'
    },
    {
      newcolumn: 'avg_value1',
      aggfunction: 'avg',
      column: 'value1'
    }
    // ...
  ]
}

Example 1 : Basic configuration

Input dataset:

CITY COUNTRY CONTINENT YEAR VALUE
Paris France Europe 2018 10
Bordeaux France Europe 2018 5
Barcelona Spain Europe 2018 8
Madrid Spain Europe 2018 3
Boston USA North America 2018 12
New-York USA North America 2018 21
Montreal Canada North America 2018 10
Ottawa Canada North America 2018 7
Paris France Europe 2019 13
Bordeaux France Europe 2019 8
Barcelona Spain Europe 2019 11
Madrid Spain Europe 2019 6
Boston USA North America 2019 15
New-York USA North America 2019 24
Montreal Canada North America 2019 10
Ottawa Canada North America 2019 13

Step configuration:

{
   name: 'rollup',
   hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
   aggregations: [
    {
      newcolumns: ['VALUE'],
      aggfunction: 'sum',
      columns: ['VALUE']
    },
   ],
}

Output dataset:

CITY COUNTRY CONTINENT label level parent VALUE
    Europe Europe CONTINENT   64
    North America North America CONTINENT   112
  France Europe France COUNTRY Europe 36
  Spain Europe Spain COUNTRY Europe 28
  USA North America USA COUNTRY North America 72
  Canada North America Canada COUNTRY North America 40
Paris France Europe Paris CITY France 23
Bordeaux France Europe Bordeaux CITY France 13
Barcelona Spain Europe Barcelona CITY Spain 19
Madrid Spain Europe Madrid CITY Spain 9
Boston USA North America Boston CITY USA 27
New-York USA North America New-York CITY USA 45
Montreal Canada North America Montreal CITY Canada 20
Ottawa Canada North America Ottawa CITY Canada 20
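
The stacking logic of this step can be sketched in pandas: aggregate once per hierarchy depth, annotate each result with its label, level and parent, then concatenate. This is an illustrative sketch on tiny made-up data (single-row countries), not the backend's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    'CONTINENT': ['Europe', 'Europe', 'North America'],
    'COUNTRY': ['France', 'Spain', 'USA'],
    'VALUE': [15, 11, 33],
})

hierarchy = ['CONTINENT', 'COUNTRY']
levels = []
for depth, level_name in enumerate(hierarchy):
    group_cols = hierarchy[:depth + 1]
    agg = df.groupby(group_cols, as_index=False)['VALUE'].sum()
    agg['label'] = agg[level_name]
    agg['level'] = level_name
    # the parent is the label one level up (none at the top level)
    agg['parent'] = agg[group_cols[-2]] if depth > 0 else None
    levels.append(agg)

rollup = pd.concat(levels, ignore_index=True)
```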

Example 2 : Configuration with optional parameters

Input dataset:

CITY COUNTRY CONTINENT YEAR VALUE COUNT
Paris France Europe 2018 10 1
Bordeaux France Europe 2018 5 1
Barcelona Spain Europe 2018 8 1
Madrid Spain Europe 2018 3 1
Boston USA North America 2018 12 1
New-York USA North America 2018 21 1
Montreal Canada North America 2018 10 1
Ottawa Canada North America 2018 7 1
Paris France Europe 2019 13 1
Bordeaux France Europe 2019 8 1
Barcelona Spain Europe 2019 11 1
Madrid Spain Europe 2019 6 1
Boston USA North America 2019 15 1
New-York USA North America 2019 24 1
Montreal Canada North America 2019 10 1
Ottawa Canada North America 2019 13 1

Step configuration:

{
   name: 'rollup',
   hierarchy: ['CONTINENT', 'COUNTRY', 'CITY'],
   aggregations: [
    {
      newcolumns: ['VALUE-sum', 'COUNT'],
      aggfunction: 'sum',
      columns: ['VALUE', 'COUNT']
    },
    {
      newcolumns: ['VALUE-avg'],
      aggfunction: 'avg',
      columns: ['VALUE']
    }
   ],
   groupby: ['YEAR'],
   labelCol: 'MY_LABEL',
   levelCol: 'MY_LEVEL',
   parentLabelCol: 'MY_PARENT',
}

Output dataset:

CITY COUNTRY CONTINENT YEAR MY_LABEL MY_LEVEL MY_PARENT VALUE-sum VALUE-avg COUNT
    Europe 2018 Europe CONTINENT   26 6.5 4
    North America 2018 North America CONTINENT   50 12.5 4
  France Europe 2018 France COUNTRY Europe 15 7.5 2
  Spain Europe 2018 Spain COUNTRY Europe 11 5.5 2
  USA North America 2018 USA COUNTRY North America 33 16.5 2
  Canada North America 2018 Canada COUNTRY North America 17 8.5 2
Paris France Europe 2018 Paris CITY France 10 10 1
Bordeaux France Europe 2018 Bordeaux CITY France 5 5 1
Barcelona Spain Europe 2018 Barcelona CITY Spain 8 8 1
Madrid Spain Europe 2018 Madrid CITY Spain 3 3 1
Boston USA North America 2018 Boston CITY USA 12 12 1
New-York USA North America 2018 New-York CITY USA 21 21 1
Montreal Canada North America 2018 Montreal CITY Canada 10 10 1
Ottawa Canada North America 2018 Ottawa CITY Canada 7 7 1
    Europe 2019 Europe CONTINENT   38 9.5 4
    North America 2019 North America CONTINENT   62 15.5 4
  France Europe 2019 France COUNTRY Europe 21 10.5 2
  Spain Europe 2019 Spain COUNTRY Europe 17 8.5 2
  USA North America 2019 USA COUNTRY North America 39 19.5 2
  Canada North America 2019 Canada COUNTRY North America 23 11.5 2
Paris France Europe 2019 Paris CITY France 13 13 1
Bordeaux France Europe 2019 Bordeaux CITY France 8 8 1
Barcelona Spain Europe 2019 Barcelona CITY Spain 11 11 1
Madrid Spain Europe 2019 Madrid CITY Spain 6 6 1
Boston USA North America 2019 Boston CITY USA 15 15 1
New-York USA North America 2019 New-York CITY USA 24 24 1
Montreal Canada North America 2019 Montreal CITY Canada 10 10 1
Ottawa Canada North America 2019 Ottawa CITY Canada 13 13 1

select step

Select columns. By default, every column of the input domain is kept. If this step is used, only the selected columns are kept in the output.

{
    name: 'select',
    columns: ['my-column', 'some-other-column']
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Group Value Label
Company 1 Group 1 13 Company 1 - Group 1
Company 2 Group 1 7 Company 2 - Group 1
Company 3 Group 1 20 Company 3 - Group 1
Company 4 Group 2 1 Company 4 - Group 2
Company 5 Group 2 10 Company 5 - Group 2
Company 6 Group 2 5 Company 6 - Group 2

Step configuration:

{
  name: 'select',
  columns: ['Value', 'Label']
}

Output dataset:

Value Label
13 Company 1 - Group 1
7 Company 2 - Group 1
20 Company 3 - Group 1
1 Company 4 - Group 2
10 Company 5 - Group 2
5 Company 6 - Group 2

sort step

Sort values in one or several columns. Order can be either ‘asc’ or ‘desc’. When sorting on several columns, order of columns specified in columns matters.

{
    name: 'sort',
    columns: [{column: 'foo', order: 'asc'}, {column: 'bar', order: 'desc'}],
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
    name: 'sort',
    columns: [{ column: 'Group', order: 'asc'}, {column: 'Value', order: 'desc' }]
}

Output dataset:

Label Group Value
Label 3 Group 1 20
Label 1 Group 1 13
Label 2 Group 1 7
Label 5 Group 2 10
Label 6 Group 2 5
Label 4 Group 2 1

split step

Split a string column into several columns based on a delimiter.

{
  name: 'split',
  column: 'foo', // the column to split
  delimiter: ' - ', // the delimiter can be a string of any length
  number_cols_to_keep: 3, // the number of columns to keep resulting from the
                          // split (starting from first chunk)
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1

Input dataset:

Label Value
Label 1 - Group 1 - France 13
Label 2 - Group 1 - Spain 7
Label 3 - Group 1 - USA 20
Label 4 - Group 2 - France 1
Label 5 - Group 2 - Spain 10
Label 6 - Group 2 - USA 5

Step configuration:

{
  name: 'split',
  column: 'Label',
  delimiter: ' - ',
  number_cols_to_keep: 3,
}

Output dataset:

Label_1 Label_2 Label_3 Value
Label 1 Group 1 France 13
Label 2 Group 1 Spain 7
Label 3 Group 1 USA 20
Label 4 Group 2 France 1
Label 5 Group 2 Spain 10
Label 6 Group 2 USA 5

Example 2: keeping fewer columns

Input dataset:

Label Value
Label 1 - Group 1 - France 13
Label 2 - Group 1 - Spain 7
Label 3 - Group 1 - USA 20
Label 4 - Group 2 - France 1
Label 5 - Group 2 - Spain 10
Label 6 - Group 2 - USA 5

Step configuration:

{
  name: 'split',
  column: 'Label',
  delimiter: ' - ',
  number_cols_to_keep: 2,
}

Output dataset:

Label_1 Label_2 Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5
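
The split-and-keep behavior can be sketched with pandas' str.split, here on two rows borrowed from the examples:

```python
import pandas as pd

df = pd.DataFrame({'Label': ['Label 1 - Group 1 - France',
                             'Label 2 - Group 2 - Spain']})

number_cols_to_keep = 2
chunks = df['Label'].str.split(' - ', expand=True)
for i in range(number_cols_to_keep):
    df[f'Label_{i + 1}'] = chunks[i]  # Label_1, Label_2, ...
```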

simplify step

Simplifies geographical data.

When simplifying your data, every point that is closer than a specific distance to the previous one is suppressed. This step can be useful if you have a very precise shape for a country (such as one-meter precision), but want to quickly draw a map chart. In that case, you may want to simplify your data.

After simplification, no points will be closer to each other than tolerance. The unit depends on the data’s projection; in general, it is expressed in meters for CRS projections. For more details, see the GeoPandas documentation.

Step configuration:

{
  name: 'simplify',
  tolerance: 1.0,
}

substring step

Extract a substring in a string column. The substring begins at index start_index (beginning at 1) and stops at end_index. You can specify negative indexes, in such a case the index search will start from the end of the string (with -1 being the last index of the string). Please refer to the examples below for illustration. Neither start_index nor end_index can be equal to 0.

{
  name: 'substring',
  column: 'foo',
  start_index: 1, // 1 = first character
  end_index: -1, // -1 = last character
  newColumnName: 'myNewColumn' // optional, <originalColumnName>_SUBSTR by default
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: positive start_index and end_index

Input dataset:

Label Value
foo 13
overflow 7
some_text 20
a_word 1
toucan 10
toco 5

Step configuration:

{
  column: 'Label',
  name: 'substring',
  start_index: 1,
  end_index: 4,
}

Output dataset:

Label Value Label_SUBSTR
foo 13 foo
overflow 7 over
some_text 20 some
a_word 1 a_wo
toucan 10 touc
toco 5 toco

Example 2: start_index is positive and end_index is negative

Input dataset:

Label Value
foo 13
overflow 7
some_text 20
a_word 1
toucan 10
toco 5

Step configuration:

{
  name: 'substring',
  column: 'Label',
  start_index: 2,
  end_index: -2,
  newColumnName: 'short_label',
}

Output dataset:

Label Value short_label
foo 13 o
overflow 7 verflo
some_text 20 ome_tex
a_word 1 _wor
toucan 10 ouca
toco 5 oc

Example 3: start_index and end_index are negative

Input dataset:

Label Value
foo 13
overflow 7
some_text 20
a_word 1
toucan 10
toco 5

Step configuration:

{
  name: 'substring',
  column: 'Label',
  start_index: -3,
  end_index: -1,
}

Output dataset:

Label Value Label_SUBSTR
foo 13 foo
overflow 7 low
some_text 20 ext
a_word 1 ord
toucan 10 can
toco 5 oco
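
The 1-based, inclusive (and possibly negative) indexing of this step can be translated to Python's 0-based, end-exclusive slicing. The substring helper below is hypothetical, written for this documentation to mirror the examples above:

```python
import pandas as pd

def substring(series: pd.Series, start_index: int, end_index: int) -> pd.Series:
    # 1-based inclusive indexes (negative = counted from the end),
    # converted to Python's 0-based, end-exclusive slicing.
    start = start_index - 1 if start_index > 0 else start_index
    stop = end_index if end_index > 0 else (end_index + 1 or None)
    return series.str.slice(start, stop)

labels = pd.Series(['foo', 'overflow', 'toucan'])
short = substring(labels, 2, -2)  # mirrors example 2
```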

text step

Use this step to add a text column where every value will be equal to the specified text.

{
  name: 'text',
  new_column: 'new', // if the column already exists, overwrites it
  text: 'some text'
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Label Value1
Label 1 10
Label 2 1
Label 3 5

Step configuration:

{
  name: 'text',
  new_column: 'KPI',
  text: 'Sales'
}

Output dataset:

Label Value1 KPI
Label 1 10 Sales
Label 2 1 Sales
Label 3 5 Sales

todate step

Converts a string column into a date column based on a specified format.

{
    name: 'todate',
    column: 'myTextColumn',
    format: '%Y-%m-%d' // Optional. If undefined, the backend tries to infer the date format.
                       // For further details on supported format specifiers,
                       // see https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/#datefromstring-format-specifiers
                       // Note: custom format is not supported in Mongo versions older than 4.0
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example

Input dataset:

Company Date Value
Company 1 06/10/2019 13
Company 1 07/10/2019 7
Company 1 08/10/2019 20
Company 2 06/10/2019 1
Company 2 07/10/2019 10
Company 2 08/10/2019 5

Step configuration:

{
  name: 'todate',
  column: 'Date',
  format: '%d/%m/%Y'
}

Output dataset:

Company Date Value
Company 1 2019-10-06T00:00:00.000Z 13
Company 1 2019-10-07T00:00:00.000Z 7
Company 1 2019-10-08T00:00:00.000Z 20
Company 2 2019-10-06T00:00:00.000Z 1
Company 2 2019-10-07T00:00:00.000Z 10
Company 2 2019-10-08T00:00:00.000Z 5

top step

Return top N rows by group if groups is specified, else over the full dataset.

{
  name: 'top',
  groups: ['foo'],
  rank_on: 'bar',
  sort: 'desc', // or 'asc'
  limit: 10
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: top without groups, ascending order

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'top',
  rank_on: 'Value',
  sort: 'asc',
  limit: 3
}

Output dataset:

Label Group Value
Label 4 Group 2 1
Label 6 Group 2 5
Label 2 Group 1 7

Example 2: top with groups, descending order

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 4 Group 2 1
Label 5 Group 2 10
Label 6 Group 2 5

Step configuration:

{
  name: 'top',
  groups: ['Group'],
  rank_on: 'Value',
  sort: 'desc',
  limit: 1
}

Output dataset:

Label Group Value
Label 3 Group 1 20
Label 5 Group 2 10
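
Example 2 can be sketched in pandas as a sort followed by keeping the first `limit` rows of each group (an illustrative sketch, not the backend implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": [f"Label {i}" for i in range(1, 7)],
    "Group": ["Group 1"] * 3 + ["Group 2"] * 3,
    "Value": [13, 7, 20, 1, 10, 5],
})

# 'top' step with groups: sort on the ranking column,
# then keep the first `limit` rows of each group
limit = 1
top = (
    df.sort_values("Value", ascending=False)
      .groupby("Group", sort=False)
      .head(limit)
)
```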

totals step

Appends “total” rows to the dataset for the specified dimensions. Computed rows result from an aggregation (either sum, average, count, count distinct, min, max, first or last).

{
  name: 'totals',
  // an array of objects: { totalColumn: <column to add total rows in>, totalRowsLabel: <the label of the added total rows>}
  totalDimensions: [
    { totalColumn: 'foo', totalRowsLabel: 'Total foos' },
    { totalColumn: 'bar', totalRowsLabel: 'Total bars' }
  ],
  aggregations: [ // The columns to be aggregated, and how
    {
      columns: ['value1', 'value2'],
      aggfunction: 'sum', // the aggregation function that will be applied to the specified columns
      newcolumns: ['sum_value1', 'sum_value2'], // same length as 'columns'
    },
    {
      columns: ['value1'],
      aggfunction: 'avg',
      newcolumns: ['avg_value1'],
    }
   ],
  groups: ['someDimension']// specify columns if the totals computation needs to be scoped by group of rows
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: basic usage

Input dataset:

COUNTRY PRODUCT YEAR VALUE
France product A 2019 5
USA product A 2019 10
France product B 2019 10
USA product B 2019 15
France product A 2020 20
USA product A 2020 20
France product B 2020 30
USA product B 2020 25

Step configuration:

{
  name: 'totals',
  totalDimensions: [{ totalColumn: 'COUNTRY', totalRowsLabel: 'All countries' }],
  aggregations: [
    {
      columns: ['VALUE'],
      aggfunction: 'sum',
      newcolumns: ['VALUE'],
    }
   ]
}

Output dataset:

COUNTRY PRODUCT YEAR VALUE
France product A 2019 5
USA product A 2019 10
France product B 2019 10
USA product B 2019 15
France product A 2020 20
USA product A 2020 20
France product B 2020 30
USA product B 2020 25
All countries null null 135
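
Conceptually, example 1 amounts to aggregating over the whole dataset and appending the result as a labelled row. A minimal pandas sketch (columns not referenced in the step are left null, as described above):

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["France", "USA"] * 4,
    "PRODUCT": ["product A", "product A", "product B", "product B"] * 2,
    "YEAR": [2019] * 4 + [2020] * 4,
    "VALUE": [5, 10, 10, 15, 20, 20, 30, 25],
})

# 'totals' step, single dimension: aggregate over the whole
# dataset and append the result as a labelled total row;
# unreferenced columns (PRODUCT, YEAR) are left null
total_row = {"COUNTRY": "All countries", "VALUE": df["VALUE"].sum()}
out = pd.concat([df, pd.DataFrame([total_row])], ignore_index=True)
```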

Example 2: With several totals and groups

Input dataset:

COUNTRY PRODUCT YEAR VALUE_1 VALUE_2
France product A 2019 5 50
USA product A 2019 10 100
France product B 2019 10 100
USA product B 2019 15 150
France product A 2020 20 200
USA product A 2020 20 200
France product B 2020 30 300
USA product B 2020 25 250

Step configuration:

{
  name: 'totals',
  totalDimensions: [
    {totalColumn: 'COUNTRY', totalRowsLabel: 'All countries'},
    {totalColumn: 'PRODUCT', totalRowsLabel: 'All products'}
  ],
  aggregations: [
    {
      columns: ['VALUE_1', 'VALUE_2'],
      aggfunction: 'sum',
      newcolumns: ['VALUE_1-sum', 'VALUE_2'],
    },
    {
      columns: ['VALUE_1'],
      aggfunction: 'avg',
      newcolumns: ['VALUE_1-avg'],
    }
   ],
   groups: ['YEAR']
}

Output dataset:

COUNTRY PRODUCT YEAR VALUE_2 VALUE_1-sum VALUE_1-avg
France product A 2019 50 5 5
USA product A 2019 100 10 10
France product B 2019 100 10 10
USA product B 2019 150 15 15
France product A 2020 200 20 20
USA product A 2020 200 20 20
France product B 2020 300 30 30
USA product B 2020 250 25 25
USA All products 2020 450 45 22.5
France All products 2020 500 50 25
USA All products 2019 250 25 12.5
France All products 2019 150 15 7.5
All countries product B 2020 550 55 27.5
All countries product A 2020 400 40 20
All countries product B 2019 250 25 12.5
All countries product A 2019 150 15 7.5
All countries All products 2020 950 95 23.75
All countries All products 2019 400 40 10

trim step

Trims leading and trailing spaces in the specified columns.

{
    name: 'trim',
    columns: ['my-column', 'some-other-column']
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Pandas (python)

This step is unsupported by the following backends:

  • Mongo 3.6

Example

Input dataset:

Company Group Value Label
‘ Company 1 ’ Group 1 13 Company 1 - Group 1
‘ Company 2 ’ Group 1 7 Company 2 - Group 1

Step configuration:

{
  name: 'trim',
  columns: ['Company']
}

Output dataset:

Company Group Value Label
‘Company 1’ Group 1 13 Company 1 - Group 1
‘Company 2’ Group 1 7 Company 2 - Group 1
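
A one-line pandas sketch of the same transformation (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Company": [" Company 1 ", " Company 2 "]})

# 'trim' step: strip leading and trailing whitespace
df["Company"] = df["Company"].str.strip()
```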

unpivot step

Unpivot a list of columns to rows.

{
  name: 'unpivot',
  keep: ['COMPANY', 'COUNTRY'], // columns to keep fixed around which to unpivot columns
  unpivot: ['NB_CLIENTS', 'REVENUES'], // columns to unpivot
  unpivot_column_name: 'KPI', // name of the new dimension column created after unpivot
  value_column_name: 'VALUE', // name of the new value column created after unpivot
  dropna: true // if true, rows with a null value are discarded; if false, they are kept
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: with dropna parameter set to true

Input dataset:

COMPANY COUNTRY NB_CLIENTS REVENUES
Company 1 France 7 10
Company 2 France 2  
Company 1 USA 12 6
Company 2 USA 1 3

Step configuration:

{
  name: 'unpivot',
  keep: ['COMPANY', 'COUNTRY'],
  unpivot: ['NB_CLIENTS', 'REVENUES'],
  unpivot_column_name: 'KPI',
  value_column_name: 'VALUE',
  dropna: true
}

Output dataset:

COMPANY COUNTRY KPI VALUE
Company 1 France NB_CLIENTS 7
Company 1 France REVENUES 10
Company 2 France NB_CLIENTS 2
Company 1 USA NB_CLIENTS 12
Company 1 USA REVENUES 6
Company 2 USA NB_CLIENTS 1
Company 2 USA REVENUES 3

Example 2: with dropna parameter set to false

Input dataset:

COMPANY COUNTRY NB_CLIENTS REVENUES
Company 1 France 7 10
Company 2 France 2  
Company 1 USA 12 6
Company 2 USA 1 3

Step configuration:

{
  name: 'unpivot',
  keep: ['COMPANY', 'COUNTRY'],
  unpivot: ['NB_CLIENTS', 'REVENUES'],
  unpivot_column_name: 'KPI',
  value_column_name: 'VALUE',
  dropna: false
}

Output dataset:

COMPANY COUNTRY KPI VALUE
Company 1 France NB_CLIENTS 7
Company 1 France REVENUES 10
Company 2 France NB_CLIENTS 2
Company 2 France REVENUES  
Company 1 USA NB_CLIENTS 12
Company 1 USA REVENUES 6
Company 2 USA NB_CLIENTS 1
Company 2 USA REVENUES 3
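
In pandas terms this step is essentially a melt, with an optional dropna pass afterwards (a sketch under the example's data, not the backend's actual code):

```python
import pandas as pd

df = pd.DataFrame({
    "COMPANY": ["Company 1", "Company 2", "Company 1", "Company 2"],
    "COUNTRY": ["France", "France", "USA", "USA"],
    "NB_CLIENTS": [7, 2, 12, 1],
    "REVENUES": [10, None, 6, 3],
})

# 'unpivot' step: melt the value columns into rows
out = df.melt(
    id_vars=["COMPANY", "COUNTRY"],
    value_vars=["NB_CLIENTS", "REVENUES"],
    var_name="KPI",
    value_name="VALUE",
)
# dropna: true discards rows whose VALUE is null
out_dropna = out.dropna(subset=["VALUE"])
```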

uppercase step

Converts a string column to uppercase.

{
  name: 'uppercase',
  column: 'foo',
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20

Step configuration:

{
  name: 'uppercase',
  column: 'Label',
}

Output dataset:

Label Group Value
LABEL 1 Group 1 13
LABEL 2 Group 1 7
LABEL 3 Group 1 20
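
The equivalent pandas one-liner (illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Label 1", "Label 2", "Label 3"]})

# 'uppercase' step: convert every value to uppercase
df["Label"] = df["Label"].str.upper()
```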

uniquegroups step

Gets the unique combinations of values found in one or several columns.

{
  name: 'uniquegroups',
  on: ['foo', 'bar']
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example:

Input dataset:

Label Group Value
Label 1 Group 1 13
Label 2 Group 1 7
Label 3 Group 1 20
Label 1 Group 2 1
Label 2 Group 1 2
Label 3 Group 1 3

Step configuration:

{
  name: 'uniquegroups',
  on: ['Label', 'Group']
}

Output dataset:

Label Group
Label 1 Group 1
Label 1 Group 2
Label 2 Group 1
Label 3 Group 1
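
In pandas terms this is a column selection followed by a deduplication (a sketch, not the backend implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Label 1", "Label 2", "Label 3", "Label 1", "Label 2", "Label 3"],
    "Group": ["Group 1", "Group 1", "Group 1", "Group 2", "Group 1", "Group 1"],
    "Value": [13, 7, 20, 1, 2, 3],
})

# 'uniquegroups' step: keep one row per unique (Label, Group) pair
out = df[["Label", "Group"]].drop_duplicates().sort_values(["Label", "Group"])
```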

waterfall step

This step generates a data structure useful for building waterfall charts. It breaks down the variation between two values (usually between two dates) across entities. Entities are found in the labelsColumn, and can optionally be regrouped under common parents found in the parentsColumn for drill-down purposes.

{
  name: 'waterfall',
  valueColumn: 'VALUE', // the column of values
  milestonesColumn: 'DATE', // the column where starting and ending blocks labels
                            // are to be found (usually dates)
  start: '2019', // the starting block label to be found in `milestonesColumn`
  end: '2020', // the ending block label to be found in `milestonesColumn`
  labelsColumn: 'PRODUCT', // the column of labels for the variations breakdown
  groupby: ['COUNTRY'], // specify columns if you want to group the waterfall computation
  sortBy: 'value', // either 'value' (default) or 'label' depending on whether to
                   // sort waterfall blocks by their label or value
  order: 'desc', // either 'asc' or 'desc' (default) depending on whether to sort
                 // by ascending or descending order
}

*This step is supported by the following backends:

  • Mongo 5.0
  • Mongo 4.2
  • Mongo 4.0
  • Mongo 3.6
  • Pandas (python)

Example 1: Basic usage

Input dataset:

city year revenue
Bordeaux 2019 135
Boston 2019 275
New-York 2019 115
Paris 2019 450
Bordeaux 2018 98
Boston 2018 245
New-York 2018 103
Paris 2018 385

Step configuration:

{
  name: 'waterfall',
  valueColumn: 'revenue',
  milestonesColumn: 'year',
  start: '2018',
  end: '2019',
  labelsColumn: 'city',
  sortBy: 'value',
  order: 'desc',
}

Output dataset:

LABEL_waterfall TYPE_waterfall revenue
2018 null 831
Paris parent 65
Bordeaux parent 37
Boston parent 30
New-York parent 12
2019 null 975
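
To make the computation concrete, here is a pandas sketch reproducing example 1 (an illustration of the logic, not the backend's actual code): pivot the milestones to columns, take the per-label variation, and wrap it between the start and end totals.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Bordeaux", "Boston", "New-York", "Paris"] * 2,
    "year": ["2019"] * 4 + ["2018"] * 4,
    "revenue": [135, 275, 115, 450, 98, 245, 103, 385],
})

start, end = "2018", "2019"
# pivot milestones to columns, then take the per-city variation
pivoted = df.pivot(index="city", columns="year", values="revenue")
variation = (pivoted[end] - pivoted[start]).sort_values(ascending=False)

# assemble: start block, one block per label, end block
rows = (
    [{"LABEL_waterfall": start, "TYPE_waterfall": None,
      "revenue": pivoted[start].sum()}]
    + [{"LABEL_waterfall": city, "TYPE_waterfall": "parent", "revenue": v}
       for city, v in variation.items()]
    + [{"LABEL_waterfall": end, "TYPE_waterfall": None,
       "revenue": pivoted[end].sum()}]
)
out = pd.DataFrame(rows)
```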

Example 2: With more options

Input dataset:

city country product year revenue
Bordeaux France product1 2019 65
Bordeaux France product2 2019 70
Paris France product1 2019 210
Paris France product2 2019 240
Boston USA product1 2019 130
Boston USA product2 2019 145
New-York USA product1 2019 55
New-York USA product2 2019 60
Bordeaux France product1 2018 38
Bordeaux France product2 2018 60
Paris France product1 2018 175
Paris France product2 2018 210
Boston USA product1 2018 95
Boston USA product2 2018 150
New-York USA product1 2018 50
New-York USA product2 2018 53

Step configuration:

{
  name: 'waterfall',
  valueColumn: 'revenue',
  milestonesColumn: 'year',
  start: '2018',
  end: '2019',
  labelsColumn: 'city',
  parentsColumn: 'country',
  groupby: ['product'],
  sortBy: 'label',
  order: 'asc',
}

Output dataset:

LABEL_waterfall GROUP_waterfall TYPE_waterfall product revenue
2018 2018 null product1 358
2018 2018 null product2 473
Bordeaux France child product1 27
Bordeaux France child product2 10
Boston USA child product1 35
Boston USA child product2 -5
France France parent product2 40
France France parent product1 62
New-York USA child product1 5
New-York USA child product2 7
Paris France child product1 35
Paris France child product2 30
USA USA parent product2 2
USA USA parent product1 40
2019 2019 null product2 515
2019 2019 null product1 460