This article is part of a series on rewriting the projects from 500 Lines or Less. This installment covers Dagoba: an in-memory graph database.
Background
Dagoba was designed by its author to show how to implement a graph database from scratch. The name seems to come from a band the author likes, and it also works because its prefix, DAG, happens to be an acronym for Directed Acyclic Graph. We keep the name in this article.
A graph is a common data structure that represents information as a number of independent nodes (vertices; called nodes in this article to be symmetrical with the edges below) and edges that relate the nodes to each other. Familiar structures such as linked lists and trees can be seen as graphs that conform to specific rules. Graphs are core data structures in path finding, recommendation algorithms, neural networks, and more.
Since graphs are so versatile, an important question is how to store them. If you store a graph in a traditional relational database, the natural approach is to create one table each for nodes and edges and associate them with foreign keys. In that case, to find all of someone's children, you could write a query like the following:
select child.* from nodes as child
  join edges on edges.to_id = child.id
  join nodes as parent on edges.from_id = parent.id
where parent.name = 'Jack' and edges.type = 'child'
Luckily, it's not too complicated. But what if you want to find the grandchildren? Then I'm afraid we'll have to use a subquery or a special construct like a CTE (Common Table Expression). And thinking further: what about great-grandchildren? What about great-great-grandchildren?
Then we realize that SQL, as a query language, was designed for structures like two-dimensional tables; using it to query graphs is clumsy, and the queries quickly become extremely complex and hard to extend. For graphs, we want a more natural and intuitive query syntax, something like this:
node('someone').children.children
Graph databases were created to store and query graph-like data structures efficiently. Because they are very different from traditional relational databases, they belong to a new type of database and form one branch of NoSQL (other branches include document databases, column databases, and so on). Well-known graph databases include Neo4j among others. The Dagoba presented in this article is a simple graph database with the core functions of one, intended mainly for teaching and demonstration.
Rewrite
The original code is written in JavaScript and makes extensive use of prototypes, a language-specific construct, when defining the calling interface. For users of other mainstream languages, prototypes may seem a bit awkward and unnatural. Given that most of the other database examples in this series are implemented in Python, this article follows that tradition and rewrites the original code in Python. Also, in keeping with previous practice, we build the program up step by step so that the reader can better understand how it is constructed.
The original article in the 500lines GitHub repository contains only the implementation code, not the tests. According to the code comments, the test program is located in another of the author's repositories, but it seems to differ slightly from the 500lines implementation. The code in this article refers to the original author's tests but skips the Norse mythology example: I admit I am really not familiar with the kinship between these gods, and I suspect many readers from a Chinese background won't be either. Although the author is fond of this example, I think it is better not to add confusion. Therefore, the test cases in this article only use the original example of family relations and drop the mythology part; that makes it a little less entertaining, but quite enough for introductory code.
Source code
The implementation for this article is located in the dagoba directory of the code repository. To run each completed step directly, the reader can find the corresponding entry in main.py in the root directory, uncomment it, and run it, following the conventions agreed upon for this series.
All steps of this program require only Python 3, and the tests use the built-in unittest, so no third-party libraries are needed. In principle it should work with Python 3.6 and above, but I have only tested it thoroughly with Python 3.8.3. If you find compatibility issues, feel free to write to me or open an issue.
Implementation
The program implemented in this article starts with the simplest case and expands step by step into a complete program. The steps are:
- Create the data model
- Manage primary keys
- Implement eager query
- Add bidirectional association support
- Implement lazy query
- Test access efficiency
- Custom query methods
The following sections describe each step in turn.
Step 0: Create the data model
Recall that a graph database is a collection of nodes and edges. Now one of the major decisions we have to make is how to model nodes and edges. For an edge, we must specify its association: from which node it points to which node. Most of the time edges are directional; parent-child relationships would be a mess without direction! In addition, a graph usually contains multiple association types at the same time: family applications typically have relations such as father, son, brother, sister, and spouse, while social applications have relations such as follows, mentions, membership, and interests. As for the content of a node, it depends on specific needs and often varies greatly between applications, so the database itself cannot determine it in advance.
Considering extensibility and generality, we can store the data as dictionaries (dict), so that users can easily attach whatever data they need. Some fields are reserved for internal database management; to make the distinction clear, we can agree that special fields starting with an underscore are maintained internally by the database, similar to private members, and that users should not modify them themselves. This is a convention generally followed in the Python community.
In addition, nodes and edges refer to each other. We already know that an edge refers to the nodes at both of its ends, and as we will see later, nodes also refer to edges for efficiency. Pointer access would be intuitive if these relationships were maintained only in memory, but a database must also consider serialization to disk, and at that point pointers no longer work. For this reason it is better to maintain a primary key (_id) for each node, as databases generally require, and use primary keys to describe the relationships between them.
Our first step is to model the database. For testing purposes, we use the simplest possible model, containing only two nodes and one edge, as follows:

Figure: Simple Database
According to the principles of TDD, first write tests:
from unittest import TestCase


class DbModelTest(TestCase):
    nodes = [
        {'_id': 1, 'name': 'foo'},
        {'_id': 2, 'name': 'bar'},
    ]
    edges = [
        {'_from': 1, '_to': 2},
    ]

    def setUp(self):
        self.db = Dagoba(self.nodes, self.edges)
As in the original article, we name the database management interface Dagoba. Currently, the simplest test we can think of is to confirm that nodes and edges have been added to the database:
def test_nodes(self):
    nodes = list(self.db.nodes())
    self.assertEqual(2, len(nodes))
    self.assert_item(nodes, _id=1, name='foo')
    self.assert_item(nodes, _id=2, name='bar')

def test_edges(self):
    edges = list(self.db.edges())
    self.assertEqual(1, len(edges))
    self.assert_item(edges, _from=1, _to=2)
assert_item() is a helper method that checks whether some dictionary in a collection contains all the expected fields; readers can find the exact implementation in the complete source code on GitHub.
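As a rough sketch (my approximation, not necessarily the repository version), it might look like this:

def assert_item(self, items, **expected):
    # succeed if any item contains all the expected key/value pairs
    for item in items:
        if all(item.get(k) == v for k, v in expected.items()):
            return
    self.fail(f'No item matches {expected}')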
Now the tests fail. Let's implement the database in the simplest possible way:
class Dagoba:
    def __init__(self, nodes=None, edges=None):
        self._nodes = []
        self._edges = []
        for node in (nodes or []):
            self.add_node(node)
        for edge in (edges or []):
            self.add_edge(edge)

    def add_node(self, node):
        self._nodes.append(node.copy())

    def add_edge(self, edge):
        self._edges.append(edge.copy())

    def nodes(self):
        return (x.copy() for x in self._nodes)

    def edges(self):
        return (x.copy() for x in self._edges)
It is important to note that the program stores and returns copies of the data rather than using the original dictionaries directly, both when adding and when querying. Why is this necessary? Because dictionaries are mutable, and users can modify their contents at any time. If the database is unaware that the data has changed, untraceable consistency problems can easily arise, and in the worst case everything ends up in disarray. Copying the data avoids these problems, at the cost of more memory and processing time. Databases typically serve far more queries than modifications, so this is an acceptable cost.
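A quick sketch of the aliasing problem that copying prevents, using the class defined above:

node = {'_id': 1, 'name': 'foo'}
db = Dagoba([node], [])
node['name'] = 'changed'                   # the caller mutates its own dict
assert next(db.nodes())['name'] == 'foo'   # the stored copy is unaffected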
The tests should now pass. To make things even more robust, we can test some edge cases to see whether the database handles abnormal data correctly, such as:
- Adding duplicate records (same primary key)
- Adding edges that point to invalid nodes
- Getting a node whose primary key does not exist
For example, if a user tries to add a duplicate primary key, we expect a ValueError exception to be thrown. So write the test as follows:
def test_nodes_with_duplicate_id(self):
    nodes = [
        {'_id': 1, 'name': 'foo'},
        {'_id': 1, 'name': 'bar'},
    ]
    with self.assertRaises(ValueError):
        Dagoba(nodes, self.edges)
To satisfy these tests, the code needs slight modification. In particular, looking a node up by primary key is a common operation, and traversing the whole list is too inefficient, so it is better to access nodes directly by primary key. We therefore add a dictionary to the database:
class Dagoba:
    def __init__(self, nodes=None, edges=None):
        ...
        self._nodes_by_id = {}

    def add_node(self, node):
        pk = node.get('_id', None)
        if pk in self._nodes_by_id:
            raise ValueError(f'Node with _id={pk} already exists.')
        node = node.copy()
        self._nodes.append(node)
        self._nodes_by_id[pk] = node
For the complete code, please refer to the Github repository.
Step 1: Manage primary keys
In the previous step, we explicitly specified the node's primary key when initializing the database. According to general principles of database design, a primary key should preferably be a surrogate key with no business meaning, and users should not care about its specific value, so it is usually more reasonable to let the database manage primary keys. Of course, in some scenarios, such as importing external data, explicitly specifying the primary key is still useful. To support both requirements, we agree that the field _id holds the node's primary key: if the user specifies it, the user-supplied value is used (it is then, of course, the user's responsibility to ensure uniqueness); otherwise, the database assigns one automatically.
If the primary key is generated by the database, there is no way to predict its value in advance, and since an edge must specify the nodes it refers to, edges can only be added after the primary keys have been generated. For this reason, initializing the database is slightly more involved when primary keys are generated dynamically. Still, let's write a test first:
class PrimaryKeyTest(TestCase):
    def test_nodes(self):
        db = Dagoba()
        pk1 = db.add_node({'name': 'foo'})
        pk2 = db.add_node({'name': 'bar'})
        db.add_edge({'_from': pk1, '_to': pk2})
        self.assertIsNotNone(db.node(pk1))
        self.assertIsNotNone(db.node(pk2))
        self.assertTrue(pk1 != pk2)
To support this, we add an internal field _next_id to the database for generating primary keys, and have the add_node() method return the newly generated key:
class Dagoba:
    def __init__(self, nodes=None, edges=None):
        ...
        self._next_id = 1

    def add_node(self, node):
        ...
        pk = node.get('_id', None)
        if pk in self._nodes_by_id:
            raise ValueError(f'Node with _id={pk} already exists.')
        if not pk:
            pk = self._next_id
            node['_id'] = pk
            self._next_id += 1
        ...
        return pk
Next, we check that the edge can be accessed properly:
def test_edge(self):
    edge = self.get_item(self.db.edges(), _from=self.pk1, _to=self.pk2)
    self.assertIsNotNone(edge)
Run the tests and everything works. This step was easy, but the two test classes (DbModelTest and PrimaryKeyTest) now contain some duplicated code, such as get_item(). We can extract this common code. Since these helpers internally call methods such as TestCase.assertXXX(), inheritance might seem appropriate, but deriving a base class from TestCase tends to cause subtle problems, so I turned to another trick: a mixin.
class DbModelTest(TestCase, fixtures.TestMixin):
...
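A sketch of what such a mixin might contain (assuming the shared helpers live in a fixtures module; the repository version may differ):

class TestMixin:
    def get_item(self, items, **expected):
        # return the first item whose fields all match `expected`, or None
        for item in items:
            if all(item.get(k) == v for k, v in expected.items()):
                return item
        return None

    def assert_item(self, items, **expected):
        # relies on being mixed into a TestCase for assertIsNotNone()
        self.assertIsNotNone(self.get_item(items, **expected))

Because Python looks methods up across all base classes, the mixin's helpers can call TestCase assertion methods on self without the mixin itself inheriting from TestCase.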
Step 2: Implement eager query
After implementing the database model, the next step is to consider how to query it.
There are several issues to consider when designing queries. Graph access almost always starts from a node (or a class of nodes meeting some condition) and jumps along adjacent edges to other nodes, and so on. So chained calls are a natural style for queries. For example, to find out how many cats Tom's grandchildren have, you could use a query like this:
node('Tom').children().children().pet('cat')
It is conceivable that each of the above methods returns the set of nodes matching the conditions. This implementation is intuitive, but it has a potential problem: often the user only needs a small fraction of the results, and it would be extremely wasteful to always compute the complete collection regardless of cost. Consider the following query:
node('Tom').children().children().take(1)
To avoid unnecessary waste, we need another mechanism, often called "lazy query" (or "deferred query"). The basic idea is that calling a query method only records the query conditions, and the database is not actually searched until some method is explicitly called. If you are familiar with popular Python ORMs such as SQLAlchemy or the Django ORM, you will know that they are almost all lazy: you have to call something like list(result) or result[0:10] to get concrete query results.
In Dagoba, the method that triggers the query is called run(). That is, the following query only actually looks for data when run() is executed:
node('Tom').children().children().take(1).run()
In contrast to lazy queries, methods that return results immediately are generally called eager queries. The internal search logic of eager and lazy queries is basically the same; they differ only in when they are triggered. Since eager queries are simpler to implement and easier to troubleshoot, let's start with them.
Let's start with a test. The simple database used so far is too small for interesting queries, so in this step we first create a more complex data model:
Figure: Family Database
One complication in this model is reverse association: if A is B's brother, then B is A's brother or sister. To query relationships in both directions, both forward and reverse associations need to exist, so the number of edges to define when initializing the database becomes large. The same reverse-association issue exists between parents and children. To keep the problem slightly simpler, for now we only need downward lookups (children and grandchildren), which reduces the number of associations a little.
We therefore define the data model as follows. To reduce duplicated work, we declare the reverse association with a _backward field; for query convenience, the database maintains such an edge internally as two edges:
class EagerQueryTest(TestCase):
    nodes = [
        {'_id': 1, 'name': 'Fred'},
        ...
        {'_id': 6, 'name': 'Lucy'},
    ]
    edges = [
        {'_from': 1, '_to': 2, '_type': 'son'},
        ...
        {'_from': 3, '_to': 6, '_type': 'sister', '_backward': 'brother'},
        ...
    ]
Then, test the simplest query, such as finding all of someone's grandchildren:
def test_grandkids(self):
    nodes = self.q.node(1).outcome().outcome().run()
    self.assert_nodes(nodes, [3, 4, 5, 6])
Here outcome() denotes the set of nodes reachable from a node along its outgoing edges, and income() the set of nodes that reach it along incoming edges. In the original author's code these methods are called out()/in(). That is certainly more concise, but unfortunately in is a keyword in Python and cannot be used as a method name. I also considered adding underscores, as in out_().in_(), but that looked a little odd too, so the tradeoff was to go with slightly more verbose names.
Now we can define the query interfaces. As mentioned earlier, we plan two separate implementations: eager query and lazy query. Their underlying query logic is similar, so it may seem possible to share it via inheritance. However, following the YAGNI principle, we will not do that for now; we just define two new classes and keep extending them as the tests demand. Later we will see that putting the common logic into the database itself actually makes more sense than inheritance.
class Dagoba:
    def query(self, eager=False):
        return EagerQuery(self) if eager else LazyQuery(self)


class EagerQuery:
    def __init__(self, db):
        self._db = db


class LazyQuery:
    def __init__(self, db):
        self._db = db
The next step is to implement the method for accessing a node. Since EagerQuery executes immediately, we record the results in an internal _result field. Although node() returns a single result, almost all other query methods return collections; for consistency we let it return a collection too, which avoids branching to support both collections and single values and makes the code simpler and less error-prone. Also, if the queried object does not exist, we simply return an empty collection rather than treating it as an error.
class EagerQuery:
    def __init__(self, db):
        self._db = db
        self._result = None

    def node(self, pk: int):
        try:
            self._result = [self._db.node(pk)]
        except KeyError:
            self._result = []
        return self
The method for querying outgoing/incoming nodes is implemented like this:
def outcome(self, type_=None):
    result = []
    for node in self._result:
        pk = Dagoba.pk(node)
        result.extend(self._db.outcome(pk, type_))
    self._result = result
    return self
The core node-lookup logic is defined in the database itself:
def to_node(self, edge):
    return self.node(edge['_to'])

def outcome(self, pk: int, type_=None):
    return [self.to_node(x) for x in self.edges()
            if Dagoba.is_edge(x, '_from', pk, type_)]
The above uses some auxiliary query methods defined internally. income() is defined with similar logic; these implementations are simple, so we won't go over them here and the reader can refer directly to the source code.
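For reference, here is a sketch of these helpers consistent with the calls above (the repository version may differ in detail):

@staticmethod
def pk(node):
    # primary key of a node
    return node['_id']

@staticmethod
def is_edge(edge, direction, pk, type_=None):
    # direction is '_from' or '_to'; type_ of None matches any edge type
    if edge.get(direction) != pk:
        return False
    return type_ is None or edge.get('_type') == type_

def from_node(self, edge):
    return self.node(edge['_from'])

def income(self, pk: int, type_=None):
    return [self.from_node(x) for x in self.edges()
            if Dagoba.is_edge(x, '_to', pk, type_)]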
At the end of this step we implement one more refinement. When query methods are called repeatedly, the result may contain duplicates, which is often unwanted. Just as relational databases usually support unique/distinct, we want Dagoba to be able to filter out duplicates.
Suppose we want to look up the grandfather of all of someone's children: obviously, no matter how many children there are, they share the same grandfather. So we write a test as follows:
def test_grandfather_unique(self):
    nodes = self.q.node(3).outcome().income('son').income('son').unique().run()
    self.assertEqual(1, len(nodes))
    self.assertEqual(1, Dagoba.pk(nodes[0]))
Now let's implement unique(). We just need to remove the duplicate data according to the primary key:
def unique(self):
    d = {}
    for node in self._result:
        pk = Dagoba.pk(node)
        d.setdefault(pk, node)
    self._result = list(d.values())
    return self
Step 3: Add bidirectional association support
In the previous step, the database initialization specified bidirectional associations, but we did not test them. Since we haven't written the code to support them yet, let's now add a test that should fail:
def test_toms_sisters_brother(self):
    nodes = self.q.node(3).outcome('sister').outcome('brother').run()
    self.assert_nodes(nodes, [3, 4, 5])
Run the test and it does fail. Now let's see how to support it. Recall that when finding nodes from edges, the following method is used:
def outcome(self, pk: int, type_=None):
    return [self.to_node(x) for x in self.edges()
            if Dagoba.is_edge(x, '_from', pk, type_)]
There is also a lurking problem here: calling self.edges() means traversing all edges, which is hugely wasteful once the database grows. To improve performance, we can record the edges associated with a node on the node itself, so that finding an edge only requires looking at the node. We define the sets of incoming/outgoing edges at initialization time:
def add_node(self, node):
    ...
    node['_out'] = []
    node['_in'] = []
    self._nodes.append(node)
    self._nodes_by_id[pk] = node
    return pk
When adding an edge, besides maintaining the reverse association, we also have to update the corresponding references on the nodes. This involves copying part of a dictionary's content, so we start with a helper method:
def copy_dict(src: dict, *excludes) -> dict:
    return {k: v for k, v in src.items() if k not in excludes}
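A quick illustration of what it does:

edge = {'_from': 3, '_to': 6, '_type': 'sister', '_backward': 'brother'}
copy_dict(edge, '_backward')   # -> {'_from': 3, '_to': 6, '_type': 'sister'}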
Then the implementation of add_edge() is modified as follows:
def add_edge(self, edge):
    from_id = edge.get('_from', None)
    to_id = edge.get('_to', None)
    try:
        from_node = self.node(from_id)
        to_node = self.node(to_id)
        forward = copy_dict(edge, '_backward')
        self._edges.append(forward)
        from_node['_out'].append(forward)
        to_node['_in'].append(forward)
        if '_backward' in edge:
            backward = copy_dict(edge, '_backward')
            backward['_type'] = edge['_backward']
            backward['_from'] = edge['_to']
            backward['_to'] = edge['_from']
            self._edges.append(backward)
            from_node['_in'].append(backward)
            to_node['_out'].append(backward)
    except KeyError:
        raise ValueError(f'Invalid edge: node(_id={from_id}/{to_id}) does not exist.')
The code here adds both the forward and the backward association. Some of you may notice that it is slightly repetitive; yes, but the repetition only occurs inside this one function, so following the "rule of three" for refactoring, we will not extract it for now.
Once implemented, the previous tests will pass properly.
Step 4: Implement lazy query
In this step, let's implement lazy queries.

The requirement for a lazy query is that calling a query method does not execute anything immediately; the entire query is executed and its results returned only when a specific method, such as run(), is called.
Implementing a lazy query is a bit more complex than an eager one. Instead of returning results directly, each query method must record the action to be performed and the arguments passed in, and then execute the recorded steps in order when run() is called. If you look at the original author's implementation, you will see that he uses a data structure to record the pending actions and arguments, plus logic that dispatches each recorded structure to the action to execute. That certainly works, but the bookkeeping and dispatching are relatively complex and error-prone. This article takes a different approach, based on Python's inner functions: we turn a sequence of queries into a list of functions, where each function takes the previous function's result as input and the last function's output is the result of the whole query. Since inner functions are closures, the arguments of each query, although different, can all be "captured" by the closure and become internal variables, so these inner functions can share a uniform signature. This removes the need to design an extra data structure per query and considerably simplifies execution.
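To see the idea in isolation, here is a toy pipeline (an illustration only, not Dagoba code). Each stage is a closure that captures its own argument and transforms the previous stage's output:

def make_stage(n):
    def stage(items):
        # `n` is captured by the closure; every stage has the same signature
        return [x + n for x in items]
    return stage

pipeline = [make_stage(1), make_stage(10)]

data = [1, 2, 3]
for step in pipeline:
    data = step(data)
print(data)  # [12, 13, 14]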
The test cases in LazyQueryTest and EagerQueryTest are almost identical (yes, the two kinds of query differ only in their internal mechanism; their calling interfaces are nearly the same). So we can copy the EagerQueryTest tests into LazyQueryTest as they are. Of course, copying and pasting the long, fixed initialization part is not a good idea, so we extract it into a common function shared by both tests. The reader can refer to the step04_lazy_query/tests/test_lazy_query.py part of the code.
The program calls the serial execution of the query functions a pipeline and records it in a variable:
class LazyQuery:
    def __init__(self, db):
        self._db = db
        self._pipeline = []
Then each calling interface is implemented in turn. Each follows the same pattern: the real query logic lives in an inner function, which is then appended to the pipeline. For example, the implementation of node() looks like this:
def node(self, pk: int):
    def func(arg):
        try:
            result = self._db.node(pk)
            return [result]
        except KeyError:
            return []
    self._pipeline.append(func)
    return self
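The other interfaces are implemented in the same way. As a sketch (assuming the Dagoba.pk() and db.outcome() helpers shown in the eager implementation), outcome() can use a generator so that downstream steps pull results on demand:

def outcome(self, type_=None):
    def func(arg):
        # yield matching nodes one at a time instead of building the
        # whole list, so that a later step such as take() can stop early
        for node in arg:
            yield from self._db.outcome(Dagoba.pk(node), type_)
    self._pipeline.append(func)
    return self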
Finally, the run() method executes all the recorded steps in order and returns the final result:
def run(self):
    input_, output_ = None, None
    for step in self._pipeline:
        output_ = step(input_)
        input_ = output_
    return list(output_)
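End to end, usage looks like this (a sketch, assuming the nodes and edges lists of the family model above):

db = Dagoba(nodes, edges)
q = db.query(eager=False)      # a LazyQuery; nothing is executed yet
q = q.node(1).outcome('son')   # steps are only recorded
print(q.run())                 # the whole pipeline executes here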
After completing the above implementation, run the tests to make sure it is correct.
Step 5: Test access efficiency
As we said earlier, the biggest advantage of lazy queries over eager ones is that for many queries they can fetch data on demand instead of materializing the full result at every step, thus improving performance. Take the following query as an example:
q.node(1).outcome('son').outcome('son').take(1)
The intent of this query is to find just one matching node among the grandchildren. For this query, the eager version iterates through all nodes each time outcome('son') is called, even though only the first result is needed in the final step. A lazy query, to be efficient, should stop as soon as it finds a result matching the condition.
We have not implemented the take() method yet. As usual, first add a test:
def test_take(self):
    nodes = self.q.node(1).outcome('son').outcome('son').take(1).run()
    pk = Dagoba.pk(nodes[0])
    self.assertIn(pk, [3, 4, 5])
The eager implementation of take() is simple: we just keep the first n entries of the result:
def take(self, count: int):
    self._result = self._result[:count]
    return self
The lazy implementation is a bit more involved. To avoid unnecessary lookups, the intermediate result should not be a complete list but an iterable that produces values on demand, from which we pull at most the first n items:
def take(self, count: int):
    def func(arg):
        it = iter(arg)  # accept any iterable, not just iterators
        # zip stops at the shorter side, so at most `count` items are
        # pulled, and fewer than `count` results is handled gracefully
        return [x for _, x in zip(range(count), it)]
    self._pipeline.append(func)
    return self
Write and run tests to make sure they are correct.
From the outside, eager and lazy queries are almost identical, so it is hard to confirm with a purely data-driven test that the latter is more efficient, and measuring wall-clock time is unreliable. To test efficiency, we introduce the notion of a node-access count: if the lazy query is more efficient, it should access fewer nodes than the eager query.
For this purpose, the following test is written:
def test_node_visits(self):
    self.db.reset_visits()
    eager_query = self.db.query(eager=True)
    eager_query.node(1).outcome('son').outcome('son').take(1).run()
    eager_visits = self.db.node_visits()

    self.db.reset_visits()
    lazy_query = self.db.query(eager=False)
    lazy_query.node(1).outcome('son').outcome('son').take(1).run()
    lazy_visits = self.db.node_visits()

    self.assertTrue(lazy_visits < eager_visits)
We add a member to the Dagoba class to record the total number of node accesses, plus two helper methods to read and reset the count:
class Dagoba:
    def __init__(self, nodes=None, edges=None):
        ...
        self._node_visits = 0

    def node_visits(self) -> int:
        return self._node_visits

    def reset_visits(self):
        self._node_visits = 0
Then we increment the count when nodes are accessed. Note that we care about query efficiency: initialization also needs to access nodes, but those accesses should not be counted. We therefore increase the counter conditionally:
def node(self, pk: int, visit=False):
    if visit:
        self._node_visits += 1
    return self._nodes_by_id[pk]
Then we browse the code for the places that need updating. The count mainly matters when finding nodes from edges, so we modify that section as follows:
def from_node(self, edge):
    return self.node(edge['_from'], visit=True)

def to_node(self, edge):
    return self.node(edge['_to'], visit=True)
The income()/outcome() methods need the same simple change, so they are not listed here.
After making these changes, we run the test again. It passes, showing that the lazy query is indeed more efficient than the eager one.
Step 6: Custom query methods
In the steps above, we implemented the graph database's storage structure and query methods. This last step focuses on the interface and extensibility.
Unlike relational databases, whose structure is fixed, graphs can take many forms, and the query mechanism must be flexible enough. In principle, every query starts from a node and searches in a specific direction, so the three methods node/income/outcome can be combined into almost any query you need. But for complex queries the resulting code can be tedious and lengthy, while specific domains often have more concise names: a mother's brother, for example, can simply be called an uncle. For such scenarios it would be friendlier to let users extend the query language themselves according to their domain, in the spirit of a DSL (Domain Specific Language), making queries simpler and easier to read.
If you look at the original author's implementation, you will find that he uses a special addAlias mechanism to define new queries and then, at call time, looks up what should be executed; both the interface and the internals are fairly complex. I wanted a simpler way. Fortunately, Python is a highly dynamic language that allows new members to be added to a class at runtime, so this may be easier than expected.
To verify this, write tests as follows:
def test_custom_pipeline_grandkids(self):
    def grandkids(self_):
        self_.outcome().outcome()
        return self_
    LazyQuery.grandkids = grandkids
    nodes = self.q.node(1).grandkids().run()
    self.assert_nodes(nodes, [3, 4, 5, 6])
The test passes without any change to Dagoba's implementation! All we did was dynamically add a member function to the class. As Python's object mechanism requires, the first parameter of a member function is the object itself, conventionally named self; since we are already inside a unittest method where self refers to the test class, the new function's parameter is written as self_ to distinguish the two. In addition, the function must return the object it belongs to, as chained calls require. We see that the flexibility of a dynamic language makes adding new syntax very easy.
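In the same spirit, the uncle shorthand mentioned earlier might be sketched like this (hypothetical, assuming the 'son' and 'brother' edge types of our family model):

def uncles(self_):
    # a parent's brothers: walk up a 'son' edge, then across 'brother' edges
    return self_.income('son').outcome('brother')

LazyQuery.uncles = uncles
nodes = db.query().node(2).uncles().run()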
Summary
At this point, a rudimentary graph database has taken shape.
Compared with the original article, this one still lacks some content, such as how to serialize the database to disk. But as the reader has surely noticed, the internal structure of our database is basically simple native data structures (lists and dictionaries), so serialization, whether via pickle or JSON, should be fairly straightforward. Interested readers can try it on their own.
To improve query performance, our implementation stores pointers (references) to edges inside the nodes. The advantage is that getting from a node to an adjacent node takes constant time no matter how large the database is, so data access is very efficient. A potential problem, however, is that if the database grows too large to fit in memory, or if distributed access is required for reasons such as reliability, pointers no longer work, and other mechanisms must be considered. Distributed databases are a hard problem regardless of the data model, and this article does not cover them; interested readers may want to look at the other articles in the 500 Lines series on distributed and clustered algorithms.
Like the other databases in this series, this implementation uses Python, while the original author uses JavaScript, which presumably reflects his background. I believe Python's object mechanism is easier to read and understand for most developers than JavaScript's prototype-based syntax. Of course, the original author's version is in places better implemented and more flexible than this one. For a more elegant implementation we might consider Python metaprogramming, which would be closer to the author's design but would also make the program considerably more complex. If you are interested, you may want to read the original version side by side with this one.