Getting started

Installation

Run the following in your command line:

pip install -U bite-parser

Implementing your first parser

from bite import Combine, CharacterSet, Forward, Literal

value = Forward()
product = value + ((Literal(b'*') | Literal(b'/')) + value)[0, ...]
sum = product + ((Literal(b'+') | Literal(b'-')) + product)[0, ...]
expr = sum
value.assign(
    Combine(CharacterSet(b'0123456789')[1, ...], name="number")
    | (Literal(b'(') + expr + Literal(b')'))
)

This defines a grammar to accept simple mathematical expressions consisting of the four basic operations. Let us discuss the most basic elements first. Each Literal defines a byte or sequence of bytes that needs to be matched exactly. The | (or) operator combines parsers such that the first matching alternative determines the parse result. Order can matter here (but not in this example)! The + (addition) operator chains parsers together, i.e. first the left-hand side parser needs to match the input, followed by a match of the right-hand parser. The bracket notation declares the minimum and maximum number of repeats. Zero or more matches are denoted by [0, ...], one or more matches with [1, ...].

Further, the CharacterSet matches a single byte from the given set. Because a number might have multiple digits, [1, ...] is used to allow for more than one digit. However, we do not want to parse each digit individually, but combine all of the digits into a single token. This is done with Combine.

Finally, the Forward allows to do a forward declaration of value without specifying what constitutes a value until value.assign is called. This allows for the definition of recursive rules. Be careful though, a left-recursive rule will give an infinite recursion! You can add names such as “number” to declarations in your grammar. This can come in handy when evaluating the parse result.

There many more classes and operators that you can use to construct your grammar. Check the reference documentation. Also note, that for each operator an equivalent class exists that can be used interchangeably.

Using the parser

import asyncio
from bite import parse_bytes

result = asyncio.run(parse_bytes(expr, b'123+45*(67+89)'))

Here the parse_bytes function is used to parse the bytes b'123+45*(67+89)' with the expr grammar (or parser). The result is the concrete parse tree which contains all the relevant parsing information. The children of each node will be stored in the parse_tree attribute. For example, we can take a look at the first child of result which corresponds to 123 from the parsed expression:

print(result.parse_tree[0])

ParsedMatchFirst(name="(number) | ((b'(') + ((forward) + ((((b'*') | (b'/')) + (forward))[0, None]) + ((((b'+') | (b'-')) + ((forward) + ((((b'*') | (b'/')) + (forward))[0, None])))[0, None])) + (b')'))", parse_tree=ParsedLeaf(name='number', parse_tree=b'123', start_loc=0, end_loc=3), choice_index=0)

The ParsedMatchFirst corresponds to the | operator of the rule assigned to value. It has also a somewhat unwieldy, auto-generated name besides some other attributes. On each node in the parse tree, you can query start_loc and end_loc to find out to what extent in the input the parsed node corresponds. The end_loc is exclusive.

print(
    "Extent from", result.parse_tree[0].start_loc,
    "to", result.parse_tree[0].end_loc
)

Extent from 0 to 3

The leaf nodes of the parse tree will have the bytes itself as parse_tree attribute:

print("The leaf:", result.parse_tree[0].parse_tree)
print(
    "The children of the leaf are the bytes itself:",
    result.parse_tree[0].parse_tree.parse_tree
)

The leaf: ParsedLeaf(name='number', parse_tree=b'123', start_loc=0, end_loc=3)
The children of the leaf are the bytes itself: b'123'

While the parse tree gives you all the information, it is often more than you need and sometimes cumbersome to work with. Often you only want a transformed representation that you can get with the values attribute on a node.

print(result.values)

(b'123', b'+', b'45', b'*', b'(', b'67', b'+', b'89', b')')

By default this gives a flat tuple of all the parsed tokens. In the next section, we will introduce some more structure to retain the operator precedence.

Introducing structure in the parse result

A simple way to introduce more structure into the parse values is the Group transform. It will put the values of its sub-parsers into a tuple. In addition, Suppress allows to remove parsed tokens completely from the values. In this example, we use this to insert grouping according to the precedence.

import asyncio
from bite import Combine, CharacterSet, Forward, Group, Literal, Suppress
from bite import parse_bytes

value = Forward()
product = Group(value + ((Literal(b'*') | Literal(b'/')) + value)[0, ...])
sum = product + ((Literal(b'+') | Literal(b'-')) + product)[0, ...]
expr = sum
value.assign(
    Combine(CharacterSet(b'0123456789')[1, ...], name="number")
    | Group(Suppress(Literal(b'(')) + expr + Suppress(Literal(b')')))
)

result = asyncio.run(parse_bytes(expr, b'123+45*(67+89)'))

print(result.values)

((b'123',), b'+', (b'45', b'*', ((b'67',), b'+', (b'89',))))

It is also possible to define custom transforms and use this, for example, to evaluate the whole expression.

import asyncio
from bite import Combine, CharacterSet, Forward, Literal, Suppress, TransformValues
from bite import parse_bytes

def eval_product(values):
    acc = values[0]
    for i in range(1, len(values), 2):
        if values[i] == b'*':
            acc *= values[i + 1]
        elif values[i] == b'/':
            acc /= values[i + 1]
        else:
            raise ValueError()
    return (acc,)

def eval_sum(values):
    acc = values[0]
    for i in range(1, len(values), 2):
        if values[i] == b'+':
            acc += values[i + 1]
        elif values[i] == b'-':
            acc -= values[i + 1]
        else:
            raise ValueError()
    return (acc,)

value = Forward()
product = TransformValues(
    value + ((Literal(b'*') | Literal(b'/')) + value)[0, ...],
    eval_product
)
sum = TransformValues(
    product + ((Literal(b'+') | Literal(b'-')) + product)[0, ...],
    eval_sum
)
expr = sum
value.assign(
    TransformValues(
        Combine(CharacterSet(b'0123456789')[1, ...], name="number"),
        lambda values: (int(v) for v in values)
    )
    | Suppress(Literal(b'(')) + expr + Suppress(Literal(b')'))
)

result = asyncio.run(parse_bytes(expr, b'123+45*(67+89)'))

print(result.values)

(7143,)

Parsing asynchronous streams

So far we only parsed complete byte arrays. However, bite-parser is asynchronous and can emit parsed results as they come in. For this the parse_incremental method is used. It returns an asynchronous iterator that returns each successive complete parse tree.

The following example implements a simple server that parses and evaluates each incoming line with the grammar defined in the previous section.

from bite import Opt

request = expr + Opt(Literal(b'\r')) + Literal(b'\n')

async def handle_connection(reader, writer):
    try:
        # The following line is where the magic happens
        async for result in parse_incremental(request, reader):
            writer.write(b'Result: ')
            writer.write(str(result.values[0]).encode('ascii'))
            writer.write(b'\n')
            await writer.drain()
    finally:
        writer.write_eof()
        await writer.drain()
        writer.close()

async def main():
    server = await asyncio.start_server(handle_connection, 'localhost', 4000)

    addrs = ', '.join(str(sock.getsockname()) for sock in server.sockets)
    print(f'Serving on {addrs}')

    async with server:
        await server.serve_forever()

if __name__ == '__main__':
    asyncio.run(main())

After starting the server, you can test the server with netcat (for example):

nc localhost 4000

1+1
Result: 2
21+4*(5+6)-4/2
Result: 63.0