Tutorial

This tutorial teaches how to use a DSL as well as how to extend it with new grammar rules or create an entirely new DSL.

DSL Syntax

There are two kinds of DSL statements:

  1. Evaluation statements, which are translated into executable code.

  2. Definition statements, which allow you to add new grammar rules.

An Evaluation statement may be compiled into executable code by a code generator. A Definition statement, on the other hand, adds a new rule to the grammar. This section focuses only on the former; the latter is described in the section “Adding Internal Grammar Rules”.

Let us look at some examples:

## x = on df | drop duplicates | group by df.data apply sum
## x = 5 + 7 % 2
## on df | show

Every statement starts with ‘##’ followed by an optional assignment and a pipeline.

The first pipeline removes all duplicates from the dataframe ‘df’, groups it by the data column and computes the sum. The results are stored in the variable x. In the second statement, a simple expression is evaluated and stored in the variable x. In the last statement the dataframe ‘df’ is printed to the console.

The general syntax of an Evaluation statement is:

## (<name> = )? (<Function> | <Initialization> (<Transformation>)* (<Operation>)?)

Here (…)? means zero or one occurrence, (…)* means zero or more occurrences and ‘|’ indicates that one or the other is possible.

  1. <name> Is simply a variable name.

  2. <Function> Is a function which is not related to a dataframe pipeline.

  3. <Initialization> First element of a pipeline (e.g. ‘on df’, ‘read myfile.csv as csv’).

  4. <Transformation> A function which transforms a dataframe (e.g. ‘drop duplicates’).

  5. <Operation> Computes some operation on a dataframe (e.g. show).
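
To make the shape of an Evaluation statement more concrete, the following minimal sketch splits a statement into its optional assignment and its pipeline stages. It is not the NLDSL parser: the helper name split_statement and the regular expression are assumptions made for illustration, and the sketch ignores the case where ‘|’ appears inside an expression.

```python
import re

# Hypothetical helper for illustration only; this is not the NLDSL parser.
_STATEMENT = re.compile(
    r"^##\s*(?:(?P<name>[A-Za-z_]\w*)\s*=(?!=)\s*)?(?P<body>.+)$"
)


def split_statement(stmt):
    """Split an Evaluation statement into (assigned name, list of stages)."""
    match = _STATEMENT.match(stmt.strip())
    if match is None:
        raise ValueError("not an Evaluation statement: {!r}".format(stmt))
    # Each '|' separates two stages of the pipeline.
    stages = [stage.strip() for stage in match.group("body").split("|")]
    return match.group("name"), stages


print(split_statement("## x = on df | drop duplicates | group by df.data apply sum"))
print(split_statement("## on df | show"))
```

The first stage of the returned list corresponds to the <Initialization> (or to a <Function>), the following stages to <Transformation> steps and, optionally, a final <Operation>.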

Using the Compiler

We provide a Python script called nldsl-compile.py, which serves as a compiler for the DSL. It expects a Python file containing NLDSL code as input and generates executable Python code for the desired target framework. For example, consider the following input file:

#! python3

if __name__ == "__main__":
        ## df = load from 'examples/src/main/resources/people.json' as json

        ## on df | show
        ## on df | describe

        ## on df | select columns 'name' | show
        ## on df | select columns 'name', 'age' | show

        ## on df | select rows df.age > 21 | show

        ## on df | group by 'age' apply count | show

If we want to generate PySpark code from this input file, we would run:

python nldsl-compile.py input_file.py output_file.py -t spark -s "DSL Experiment" -o -a

And the resulting code would be:

#! python3
from pyspark.sql import SparkSession

if __name__ == "__main__":
        spark = SparkSession.builder.appName('DSL Experiment').getOrCreate()

        df = spark.read.format('json').load('examples/src/main/resources/people.json')

        df.show()
        df.describe()

        df.select('name').show()
        df.select(['name', 'age']).show()

        df.filter(df.age > 21).show()

        df.groupBy('age').count().show()

        spark.stop()

Similarly, if we want to generate pandas code from it, we would execute:

python nldsl-compile.py dsl_experiment.py dsl_pandas_results.py -t pandas -a -n pd

For a description of the available parameters type:

python nldsl-compile.py -h
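
Conceptually, compiling a file means replacing each ‘##’ line with generated code while leaving ordinary Python lines and the indentation untouched. The sketch below illustrates that idea under stated assumptions: compile_source and toy_translate are hypothetical names, the translator only understands ‘on <name> | show’, and the actual script may work differently.

```python
def compile_source(source, translate):
    """Replace every '##' line with translated code, keeping its indentation."""
    output = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("##"):
            indent = line[: len(line) - len(stripped)]
            output.append(indent + translate(stripped[2:].strip()))
        else:
            # Ordinary Python lines pass through unchanged.
            output.append(line)
    return "\n".join(output)


def toy_translate(statement):
    # Stand-in translator (assumption): only understands 'on <name> | show'.
    parts = [part.strip() for part in statement.split("|")]
    if len(parts) == 2 and parts[0].startswith("on ") and parts[1] == "show":
        return parts[0][3:].strip() + ".show()"
    return "# untranslated: " + statement


source = 'if __name__ == "__main__":\n        ## on df | show\n        print("done")'
print(compile_source(source, toy_translate))
```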

Using a Code Generator

A code generator translates DSL statements into executable code for a certain target. To create a code generator for spark simply write:

from nldsl import SparkCodeGenerator

code_gen = SparkCodeGenerator()

There are two kinds of arguments which may be provided to a code generator. The first kind are arguments which modify the behavior of the code generator. The remaining arguments form the so-called environment for the grammar rules. This environment is provided to any grammar rule. For example, we could have created the SparkCodeGenerator like so:

code_gen = SparkCodeGenerator(spark_name="my_spark", start_session_named="my spark app")

Here the argument ‘spark_name’ describes the name of the variable which shall hold the spark session and will become part of the environment. ‘start_session_named’, on the other hand, tells the code generator to create the code necessary to start a session under the given name.

The code generator's only purpose is to generate code from DSL statements:

code = code_gen("## x = on dataset | select rows df.label == 'C' and df.data > 9000")
# code == ["x = dataset.filter(df.label == 'C' & df.data > 9000)"]

Notice that the code generator returns a list, because in general more than one statement will be parsed at once. If the input contains multiple statements, the parser expects them to be separated by newlines:

stmts = ["unique_df = on initial_df | drop duplicates",
         "clean_df = on unique_df | select rows df.data != None",
         "on clean_df | show"]

code = code_gen("\n".join(stmts))

We may also parse an entire file of DSL code:

code = code_gen("filename", is_file=True)

Adding Internal Grammar Rules

A Definition statement adds a new rule (a so-called internal dynamic grammar rule) to the grammar. It does so by combining existing Initializations, Transformations and Operations. For example:

#$ my custom pipeline $column = drop duplicates | select rows $column != 1

Unlike an Evaluation statement, a Definition statement starts with ‘#$’. The left-hand side of the equal sign specifies the syntax of the new grammar rule and the right-hand side defines what the pipeline does.

In this case we defined a new pipeline called ‘my custom pipeline’, which takes the name of a single column as argument, removes all duplicates from a dataframe and selects all of the remaining rows where the specified column has a value unequal to 1.

We may use this pipeline in the same way as any other grammar rule:

## x = on df | my custom pipeline labels

There are four types of arguments which a new rule may have:

  1. keywords - Denoted as plain words.

  2. variables - Denoted as words prefixed with a ‘$’ sign (e.g. ‘$my_variable’).

  3. expressions - Denoted as words prefixed with a ‘!’ sign (e.g. ‘!my_expr’).

  4. lists - Denoted as $[(key | $var)*] (e.g. $[$old to $new]).

Every variable which occurs to the left of the equal sign must also occur on the right.
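
The way a Definition statement behaves can be pictured as a simple substitution: the words of a call are matched against the left-hand side, and every bound variable is filled into the right-hand side. The sketch below shows this for keywords and ‘$’ variables only. The helper names parse_definition and expand are hypothetical, and expressions and lists are deliberately left out; it is an illustration of the idea, not the NLDSL implementation.

```python
import re


def parse_definition(definition):
    """Split '#$ <pattern> = <body>' into pattern tokens and the body string."""
    text = definition.strip()
    if not text.startswith("#$"):
        raise ValueError("a Definition statement must start with '#$'")
    # Split at the first '=' that is not part of '==', '!=', '<=' or '>='.
    pattern, body = re.split(r"(?<![=!<>])=(?!=)", text[2:], maxsplit=1)
    return pattern.split(), body.strip()


def expand(pattern, body, call):
    """Bind the '$' variables of the pattern to the call and fill in the body."""
    words = call.split()
    if len(words) != len(pattern):
        raise ValueError("call does not match the rule")
    for token, word in zip(pattern, words):
        if token.startswith("$"):
            # Replace every occurrence of the variable with the bound word.
            body = re.sub(re.escape(token) + r"\b", lambda _m, value=word: value, body)
        elif token != word:
            raise ValueError("expected keyword {!r}, got {!r}".format(token, word))
    return body


pattern, body = parse_definition(
    "#$ my custom pipeline $column = drop duplicates | select rows $column != 1"
)
print(expand(pattern, body, "my custom pipeline labels"))
```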

Adding External Grammar Rules

An external grammar rule is implemented as a python function with the following prototype:

my_grammar_rule(code: str, args: List[str], env: Dict[str, Any]) -> str

For example:

def read_csv(code, args, env):
        # args should be ["filename"]
        return code + env["pandas_name"] + args[0]

Meaning ‘my_grammar_rule’ is a function that takes the following parameters:

  • code - A string containing the already generated code.

  • args (optional) - A list of arguments.

  • env (optional) - A dictionary with environment variables.

And returns a string containing executable code. Only the ‘code’ parameter is mandatory, meaning the following prototypes are also valid:

my_grammar_rule(code: str, args: List[str]) -> str
my_grammar_rule(code: str, env: Dict[str, Any]) -> str
my_grammar_rule(code: str) -> str

It is recommended that you stick to the parameter names ‘code’, ‘args’ and ‘env’. However, this is only enforced if the grammar rule function takes exactly two arguments, because otherwise the framework would be unable to determine whether the second argument is the ‘env’ dictionary or the ‘args’ list.
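
One way a framework could support all four prototypes is to inspect the signature of the rule and pass only what it asks for. The following sketch illustrates this with the standard inspect module. The dispatcher call_rule and the three small rules are hypothetical and only demonstrate why the parameter name matters in the two-argument case; NLDSL's own dispatch may be implemented differently.

```python
import inspect


def call_rule(rule, code, args, env):
    """Call a grammar rule, passing only the parameters its signature asks for."""
    names = list(inspect.signature(rule).parameters)
    if len(names) == 1:
        return rule(code)
    if len(names) == 3:
        return rule(code, args, env)
    # With exactly two parameters, the name of the second one decides.
    if names[1] == "args":
        return rule(code, args)
    if names[1] == "env":
        return rule(code, env)
    raise TypeError("second parameter must be named 'args' or 'env'")


def show(code):
    return code + ".show()"


def head(code, args):
    return code + ".head({})".format(args[0])


def import_line(code, env):
    return code + "import pandas as " + env["pandas_name"]


print(call_rule(show, "df", ["5"], {"pandas_name": "pd"}))
print(call_rule(head, "df", ["5"], {"pandas_name": "pd"}))
print(call_rule(import_line, "", [], {"pandas_name": "pd"}))
```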

A new grammar rule may be added to a code generator like so:

# Add the grammar rule to the CodeGenerator class as a 'static member'.
CodeGenerator.register_function(my_grammar_rule, "my grammar rule name")

# Only add the rule to a specific instance of the code generator.
code_gen = CodeGenerator()
code_gen["my grammar rule name"] = my_grammar_rule
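
The difference between the two registration styles can be illustrated with a small stand-in class. ToyGenerator below is hypothetical and models only the rule lookup: rules registered on the class are shared by all instances, while rules assigned through item access belong to a single instance. That instance rules take precedence over class rules is an assumption of this sketch.

```python
class ToyGenerator:
    """Hypothetical stand-in that only models rule lookup, not NLDSL itself."""

    shared_rules = {}  # rules visible to every instance

    def __init__(self):
        self.own_rules = {}  # rules visible to this instance only

    @classmethod
    def register_function(cls, func, name):
        cls.shared_rules[name] = func

    def __setitem__(self, name, func):
        self.own_rules[name] = func

    def __getitem__(self, name):
        # Assumption: instance rules shadow class-level rules of the same name.
        if name in self.own_rules:
            return self.own_rules[name]
        return self.shared_rules[name]


def show(code):
    return code + ".show()"


def count(code):
    return code + ".count()"


ToyGenerator.register_function(show, "show")
first, second = ToyGenerator(), ToyGenerator()
first["count"] = count

print(second["show"]("df"))        # class-level rule, available everywhere
print(first["count"]("df"))        # instance-level rule, only on 'first'
print("count" in second.own_rules)
```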

While it is possible to use any Python function with the correct prototype as a grammar rule, we provide a dedicated decorator for this purpose, which unlocks several useful features:

@grammar(doc: str, expr: ExpressionRule)
def my_grammar_rule(code: str, args: List[str], env: Dict[str, Any]) -> str:
        """...
        Grammar:
                <name> (<variable> | <variable list> | <expression> | <keyword>)*
        ...
        """
        # args is now a Dict[str, x] where x is either List[Tuple[str]] or str.
        ...

The following is provided by the grammar decorator:

  • ‘args’ is converted into a dictionary, which maps from variable names to their values.

  • Automated parsing of boolean/comparison/arithmetic expressions may be used.

  • Enables recommendations on how to continue in case of errors, when using the DSL.

  • Allows for automatic name inference when adding the rule to a code generator.

Consider the following example:

@grammar(expr=PandasExpressionRule)
def select_rows(code, args):
        """...
        Grammar:
                select rows !condition
        """
        return code + "[{}]".format(args["condition"])

result = select_rows("df", ["select", "rows", "df.col1", "<", "42", "and", "(", "df.col2", "!=",
                            "7", "or", "df.col2", "in", "[", "1", "2", "3", "]", ")"])

# result == "df[df.col1 < 42 & (df.col2 != 7 | df.col2.isin([1, 2, 3]))]"

As one can imagine, implementing the parsing of arbitrary expressions can become somewhat labor-intensive; the grammar decorator, however, makes this task trivial. There are three aspects to this implementation:

  1. The PandasExpressionRule provided to the grammar decorator that specifies how the expression will be translated into code. If no expression rule is provided the default is used, which parses the expression into a typical python expression.

  2. The special Grammar section in the function's docstring. This section describes the syntax of the DSL rule implemented by this function. Providing this section in the docstring is required when using the grammar decorator.

  3. The implementation, which simply requests the values of the DSL arguments from the args dictionary.

The grammar is specified similarly to the left-hand side of an internal dynamic grammar rule: one starts with the name of the rule (in the above case ‘select rows’), followed by any number of keywords (e.g. ‘my_keyword’), variables (‘$my_variable’), expressions (‘!my_expression’) or lists of variables (e.g. ‘$my_varlist[$my_variable, … , my_keyword]’). The one exception is that a variable list may have a name. A variable list may contain any number of variables, keywords or expressions in any order, but variable lists must not be nested.
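
As a rough illustration of how such a grammar line decomposes, the sketch below classifies each token by its prefix and treats the leading run of plain keywords as the rule name. The helper classify is hypothetical, variable lists are not handled, and the assumption that the name ends at the first variable or expression is made for this example only.

```python
def classify(grammar_line):
    """Return the rule name and the (kind, text) pairs that follow it."""
    tokens = grammar_line.split()
    # Assumption: the rule name is the leading run of plain keywords.
    split_at = next(
        (i for i, tok in enumerate(tokens) if tok[0] in "$!"), len(tokens)
    )
    name = " ".join(tokens[:split_at])
    kinds = {"$": "variable", "!": "expression"}
    rest = [
        (kinds.get(tok[0], "keyword"), tok.lstrip("$!"))
        for tok in tokens[split_at:]
    ]
    return name, rest


print(classify("select rows !condition"))
print(classify("group by $column apply $agg"))
```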

When writing the docstring of an external dynamic grammar rule, one should document the corresponding DSL sentence and not the Python function itself. The full example of the ‘select rows’ function looks like this:

@grammar(expr=PandasExpressionRule)
def select_rows(code, args):
        """Select the rows of a DataFrame based on some condition.

        The condition can be composed out of boolean, comparison and arithmetic
        expression. The operator precedence is equivalent to python and it is possible
        to use brackets to modify it.

        Grammar:
                select rows !condition

        Type:
                Transformation

        Examples:
                1. ## x = on df | select rows df.col1 > (14.2 + z) and df.col2 == 'A'
                2. ## x = on df | select rows df.col1 != 0 and not df.col2 in [3, 5, 7]
                3. ## x = on df | select rows df.col3 % df.col1 != 2 or df.col1 <= 12

        Args:
                condition (expression): A boolean expression used as a row filter.
        """
        return code + "[{}]".format(args["condition"])

For the most part this looks like a typical Google-style Python docstring with the two special sections ‘Grammar’ and ‘Type’. The ‘Type’ section may only contain one of four words: ‘Function’, ‘Initialization’, ‘Transformation’ or ‘Operation’. The type of a grammar rule indicates how it may be used relative to a pipeline. Any pipeline starts with an Initialization, possibly followed by several Transformations. An Operation must always be the last step of a pipeline. A regular Function may not be used as part of a pipeline.

Besides the already introduced grammar syntax, the ‘Grammar’ section may contain additional lines specifying a finite set of possible values for a certain variable:

"""...
Grammar:
        group by $column apply $agg
        agg := { min, max, count, sum, avg }
...
"""
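
A value-set line of this form is easy to picture as a name mapped to a set of allowed words. The following sketch parses such a line and checks a value against it. The helpers parse_value_set and check are hypothetical and only illustrate the notation, not how the grammar decorator processes it internally.

```python
import re


def parse_value_set(line):
    """Parse a line such as 'agg := { min, max }' into ('agg', {'min', 'max'})."""
    match = re.fullmatch(r"\s*(\w+)\s*:=\s*\{(.*)\}\s*", line)
    if match is None:
        raise ValueError("not a value-set line: {!r}".format(line))
    values = {v.strip() for v in match.group(2).split(",") if v.strip()}
    return match.group(1), values


name, allowed = parse_value_set("agg := { min, max, count, sum, avg }")


def check(value):
    """Return the value if it is allowed for the variable, else raise."""
    if value not in allowed:
        raise ValueError("{} must be one of {}".format(name, sorted(allowed)))
    return value


print(name, sorted(allowed))
print(check("sum"))
```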

The grammar decorator also takes another argument called ‘doc’. This argument allows the user to pass the docstring of the DSL statement directly to the decorator instead of writing it inside the function body. On the one hand, this allows the user to document the function implementing the grammar rule itself; more importantly, it allows the docstring to be reused by several functions which implement the same DSL statement for different targets.

As mentioned previously, if the grammar rule is implemented with the grammar decorator, one may let the code generator automatically infer the name of the function:

CodeGenerator.register_function(my_grammar_rule)

code_gen = CodeGenerator()
code_gen["__infer__"] = my_grammar_rule

Implementing a Code Generator

Adding new grammar rules to the CodeGenerator class itself is rarely a good idea. Hence the simplest yet useful code generator one may implement is the following:

class MyCodeGenerator(CodeGenerator):
        pass

# my_grammar_rule_1 ... my_grammar_rule_n are external dynamic grammar rules.
MyCodeGenerator.register_function(my_grammar_rule_1)
# ...
MyCodeGenerator.register_function(my_grammar_rule_n)

All it does is hold grammar rules, which are supposed to be shared between all instances of the MyCodeGenerator class.

Assume we want to make certain variables available to the grammar rules we added, e.g. the name under which a certain Python module is imported. This is easily done as follows:

class MyCodeGenerator(CodeGenerator):
        def __init__(self, my_var_1, ... , my_var_n=default_value):
                super().__init__(env_1=my_var_1, ... , env_n=my_var_n)

This will add the variables my_var_1, … , my_var_n to the environment dictionary under the corresponding names env_1, … , env_n. Every grammar rule which requests this dictionary by specifying the parameter ‘env’ in its prototype will receive it and may query it for the values provided when constructing an instance of MyCodeGenerator.

Last but not least we provide a hook to modify the code generated by the code generator right before it is returned from the __call__ method:

class MyCodeGenerator(CodeGenerator):
        def __init__(self, my_var_1, ... , my_var_n=default_1):
                super().__init__(env_1=my_var_1, ... , env_n=my_var_n)

        def postprocessing(self, code_list):
                # do something with the list of generated code ...
                return modified_code_list
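
A typical use of such a hook is to add lines that the generated statements depend on, for example an import. The sketch below shows the pattern with a hypothetical stand-in base class (ToyBase) that calls the hook at the end of __call__; it does not use NLDSL, and the generated lines are placeholders.

```python
class ToyBase:
    """Hypothetical stand-in for a base class that calls the hook in __call__."""

    def __call__(self, source):
        # Placeholder "code generation": one output line per input line.
        code_list = ["# generated from: " + line for line in source.splitlines()]
        return self.postprocessing(code_list)

    def postprocessing(self, code_list):
        # Default hook: return the generated code unchanged.
        return code_list


class WithImports(ToyBase):
    def __init__(self, pandas_name="pd"):
        self.pandas_name = pandas_name

    def postprocessing(self, code_list):
        # Prepend the import that the generated lines would rely on.
        return ["import pandas as " + self.pandas_name] + code_list


print(WithImports()("on df | show\non df | describe"))
```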

Implementing an Expression Rule

Writing a custom expression rule is done by deriving from the ExpressionRule class and modifying the operator_map and operator_type dictionaries in the constructor:

class MyExpressionRule(ExpressionRule):
        def __init__(self, expr_name, next_keyword):
                super().__init__(expr_name, next_keyword)
                self.operator_map["and"] = " & "
                self.operator_map["+"] = " plus "
                self.operator_map["in"] = ".isin"

                self.operator_type["in"] = OperatorType.UNARY_FUNCTION

You need to stick to the API of the ExpressionRule base class as shown above. Both the operator_map and the operator_type dictionary are indexed by a string containing the DSL operator (the DSL operators are the same as the Python ones).

When assigning a new value to an operator via the operator_map one must explicitly add the desired amount of spaces surrounding it.

There are three possible values for an operator type:

  1. OPERATOR (default) - Uses the infix notation (e.g. ‘x isin y’)

  2. BINARY_FUNCTION - Treat the operator as a binary function (e.g. ‘isin(x, y)’)

  3. UNARY_FUNCTION - Treat the operator as a unary function (e.g. ‘x isin(y)’)

As shown in the example, the latter is mostly useful when the operator is implemented as a member function of the left-hand operand.
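
The effect of the three operator types on the generated text can be sketched as follows. Kind and render are hypothetical names that re-create the three cases listed above for a single operation; the sketch says nothing about how ExpressionRule handles precedence or nesting.

```python
from enum import Enum


class Kind(Enum):  # hypothetical re-creation of the three operator types
    OPERATOR = 1
    BINARY_FUNCTION = 2
    UNARY_FUNCTION = 3


def render(left, op, right, kind=Kind.OPERATOR):
    """Render one DSL operation according to its operator type."""
    if kind is Kind.BINARY_FUNCTION:
        return "{}({}, {})".format(op.strip(), left, right)
    if kind is Kind.UNARY_FUNCTION:
        # The mapped operator is appended to the left operand as-is.
        return "{}{}({})".format(left, op, right)
    # Infix notation: surrounding spaces must be part of the mapped operator.
    return "{}{}{}".format(left, op, right)


print(render("x", " & ", "y"))
print(render("x", "isin", "y", Kind.BINARY_FUNCTION))
print(render("df.col2", ".isin", "[1, 2, 3]", Kind.UNARY_FUNCTION))
```

The last call shows why the spaces have to be written into operator_map explicitly: mapping ‘in’ to ‘.isin’ without spaces is what allows the result to read as a method call on the left-hand operand.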