Disk capacity planning for Neo4J
A Neo4J database is composed of (or better, stores) the following discrete data items on disk:
- nodes
- relationships
- properties; string values are stored separately, in 128-byte chunks
- Lucene indexes over properties
- other stuff: house-keeping files, logs,…
I will ignore the size of all other stuff and just focus on how the data itself affects disk consumption.
It is important to note that in a cluster setup every cluster member holds a complete copy of the data, so the estimate applies to each member separately.
Despite Neo’s great documentation, there is no central reference for the overheads and disk sizes. This is understandable to a degree, as this is something internal and not a published contract. However, it is still a little bit annoying, as one has to look in various places to find bits and pieces.
The following table gives an overview of the various items
Let’s go through an example to illustrate how the numbers above come into play.
Let’s assume that we have a Neo4J database with the following 3 node instances (JSON representation) and 2 relationships with no properties attached to them.
So for this tiny dataset we have:
- 3 node instances
- 2 relationship instances
- 24 property “instances” (or better individual property values)
- …16 of which are string values
- 8 index entries
So the Neo4J database folder storing our model will be around 3 KB on disk (provided we ignore logs and other housekeeping files).
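The arithmetic behind an estimate like this can be sketched in a few lines of Python. The per-record sizes below are assumptions, not values copied from the table above: they are the figures commonly quoted for Neo4j 2.x-era store files (15 bytes per node, 34 per relationship, 41 per property, one 128-byte chunk per string value), so check them against the version you run. The index entry size is a hypothetical parameter left at zero, since Lucene's overhead depends on the indexed values.

```python
# Assumed on-disk record sizes in bytes (Neo4j 2.x-era figures; verify for your version).
NODE_BYTES = 15
RELATIONSHIP_BYTES = 34
PROPERTY_BYTES = 41
STRING_CHUNK_BYTES = 128  # assumption: each string value fits in a single chunk


def estimate_store_bytes(nodes, relationships, properties, string_values,
                         index_entries=0, index_entry_bytes=0):
    """Return a rough store size in bytes, ignoring logs and housekeeping files."""
    return (nodes * NODE_BYTES
            + relationships * RELATIONSHIP_BYTES
            + properties * PROPERTY_BYTES
            + string_values * STRING_CHUNK_BYTES
            + index_entries * index_entry_bytes)  # hypothetical per-entry index cost


# The tiny dataset from the example, with the index contribution left out.
size = estimate_store_bytes(nodes=3, relationships=2, properties=24, string_values=16)
print(f"{size} bytes, about {size / 1024:.1f} KB")
```

Under these assumed sizes the string chunks dominate the total, which is worth keeping in mind when a model has many text properties.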
A calculation template
Let’s consider the following example model to store in Neo4J.
To make things simple, we can choose to implement relational inheritance (in this case Contact) by simply collapsing and adding additional labels. So an employee node will have 2 labels:
We should know (or at least guesstimate) the following:
- how many and what type of properties each class will have
- how many instances of each class (or better, node type) we will have
- how many other nodes each one is connected to on average
Once we have this information, we can create an Excel spreadsheet similar to the following
The right table captures how many nodes of each type we have along with the properties per node.
The left table calculates the number of relationships in total based on the averages. So, for example, each Country has on average 50 Employees and each Client has 0.5 Contacts (because we know we have a lot of empty/unused
These figures are then summed to calculate the final DB footprint.
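For readers who would rather script the template than use a spreadsheet, here is a minimal Python sketch of the same idea: node counts and properties per node type on one side, average fan-out per relationship on the other, all summed into one footprint. The record sizes are the same assumed Neo4j 2.x-era figures (15, 34, 41 and 128 bytes), the class and function names are hypothetical, and the node counts in the sample model are made up for illustration; only the averages of 50 and 0.5 come from the text. Relationship properties and index entries are left out.

```python
from dataclasses import dataclass

# Assumed on-disk record sizes in bytes; verify against your Neo4j version.
NODE_BYTES, RELATIONSHIP_BYTES, PROPERTY_BYTES, STRING_CHUNK_BYTES = 15, 34, 41, 128


@dataclass
class NodeType:
    name: str
    count: int                       # expected number of nodes of this type
    properties_per_node: int         # all properties, strings included
    string_properties_per_node: int  # how many of those are strings


@dataclass
class RelationshipType:
    source: str            # node type the average is measured from
    avg_per_source: float  # average number of relationships per source node


def total_relationships(node_types, relationship_types):
    """Relationships in total: source node count times average fan-out, summed."""
    counts = {n.name: n.count for n in node_types}
    return round(sum(counts[r.source] * r.avg_per_source for r in relationship_types))


def footprint_bytes(node_types, relationship_types, cluster_members=1):
    """Rough data footprint; every cluster member stores a complete copy."""
    nodes = sum(n.count for n in node_types)
    properties = sum(n.count * n.properties_per_node for n in node_types)
    strings = sum(n.count * n.string_properties_per_node for n in node_types)
    relationships = total_relationships(node_types, relationship_types)
    per_copy = (nodes * NODE_BYTES
                + relationships * RELATIONSHIP_BYTES
                + properties * PROPERTY_BYTES
                + strings * STRING_CHUNK_BYTES)
    return per_copy * cluster_members


# Made-up counts; only the averages (50 and 0.5) are taken from the text.
model_nodes = [NodeType("Country", 10, 2, 1), NodeType("Client", 100, 4, 3)]
model_relationships = [RelationshipType("Country", 50), RelationshipType("Client", 0.5)]

print(total_relationships(model_nodes, model_relationships), "relationships")
print(footprint_bytes(model_nodes, model_relationships, cluster_members=3), "bytes in total")
```

Adding a node or relationship type here is one more list entry, with no macro to edit.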
You can download the Excel file from this link and use it as a basis for your own calculations. The file contains a VB macro which recalculates the totals when you update any cell.
In case you need to add/remove lines to the tables because your model is bigger/smaller, you will need to edit this macro.
There might be a more elegant way to create this spreadsheet, but I never said I was an Excel guru ;-) Your comments and suggestions are welcome.
I hope you find it useful.
Originally published at https://sgerogia.github.io on July 7, 2015.