Skip to content

Building a tagged index (or replace elasticsearch)

Introduction

Elasticsearch is a commonly used software for comparing data across domains. It comes at a hefty price: It is resource hungry and you have to leave the relational sql world for TCP which introduces round-trips on a by magnitudes slower lane.

In short: it is only useful for very big shared high performance setups. Not for yours, most probably. Here I explain how to do it much simpler and with way less resources in a relational db (this saves you hardware costs and shorts your electricity bill).

Setup

First we need a generic table which maps to all other tables. We have ContentType. Done.

Second we need a Tag. Here are many flavors possible:

  • Tags with seperate key, value fields.
  • Tags with merged key, value fields.
  • Tags with unique key, values; seperate or merged. Note: some dbs have 255 char limit.

Depending if mysql and others shall be supported

Secondly we need tags, that are text fields with a key value syntax. We can use TextFields for this. In my projects I use a syntax: key=value. Stupidly simple but you have to check that you only seperate on the first = which is a bit hard in some programming languages (use regex, e.g. /^([^=]+)=(.*)/ for seperating in js).

Third (optionally): It would be nice to detect collisions among data of different tables --> collision_key. For building a hash for a collision key we can leverage an hash method adapted from the rdf guys.

First merge the keys with values with a seperator like = (or just use the tags) into an array. Sort the array. The entries are now hashed (each entry) and afterwards a hash is build from all the hashes as if they would be a long bytestring.

import edgy

database = edgy.Database("sqlite:///db.sqlite")
models = edgy.Registry(database=database, with_content_type=True)


class ContentTypeTag(edgy.Model):
    # this prevents the normally set ContentTypeField and replaces it with a common ForeignKey
    content_type = edgy.fields.ForeignKey("ContentType", related_name="tags")
    tag = edgy.fields.TextField()

    class Meta:
        registry = models


class Person(edgy.Model):
    first_name = edgy.fields.CharField(max_length=100)
    last_name = edgy.fields.CharField(max_length=100)

    class Meta:
        registry = models
        unique_together = [("first_name", "last_name")]


class Organisation(edgy.Model):
    name = edgy.fields.CharField(max_length=100, unique=True)

    class Meta:
        registry = models


class Company(edgy.Model):
    name = edgy.fields.CharField(max_length=100, unique=True)

    class Meta:
        registry = models


async def main():
    async with database:
        await models.create_all()
        person = await Person.query.create(first_name="John", last_name="Doe")
        await person.content_type.tags.add({"tag": "name=John Doe"})
        await person.content_type.tags.add({"tag": "type=natural_person"})
        org = await Organisation.query.create(name="Edgy org")
        await org.content_type.tags.add({"tag": "name=Edgy org"})
        await org.content_type.tags.add({"tag": "type=organisation"})
        comp = await Company.query.create(name="Edgy inc")
        await comp.content_type.tags.add({"tag": "name=Edgy inc"})
        await comp.content_type.tags.add({"tag": "type=organisation"})
        # now we can query via content_type
        assert await models.content_type.query.filter(tags__tag="type=organisation").count() == 2


edgy.run_sync(main())

Note

It is crucial that each entry is mangled (either by hash or an other mangling method) because otherwise malicious users could inject = in the value data and provoke

collisions.

Note

The seperator is up to you. I just = because I used this in the secretgraph project, but all chars are elligable. More logic you can lookup there.

Alternative implementations

If you don't like the shared field for key value operations you may want seperate fields for both. Also it would be possible (for postgres and other more powerful dbs) to make the tag field unique.

Operations

Searching for a key:

use registry.content_type.query.filter(tags__tag__startswith='key=')

Searching for a key and a value starting with:

use registry.content_type.query.filter(tags__tag__startswith='key=value_start')

References

secretgraph