Disk capacity planning for Neo4J

Neo4J is a multi-featured graph database, able to store billions of items. This brings up an interesting question: how much space will it take on disk?

Background

A Neo4J database is composed of (or better, stores) the following discrete data items on disk:

  • nodes
  • relationships
  • properties (string values are stored separately, in 128-byte chunks)
  • Lucene indexes over properties
  • other stuff: house-keeping files, logs,…

I will ignore the size of all other stuff and just focus on how the data itself affects disk consumption.
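As a rough illustration of the 128-byte chunking of string values mentioned in the list above, the following Python sketch estimates how many chunks a string value would occupy. It assumes the whole chunk is available for payload and that even an empty string occupies one chunk. Both are simplifications of mine rather than documented Neo4J behaviour, and `string_chunks` and `string_bytes_on_disk` are hypothetical helper names.

```python
import math

CHUNK_BYTES = 128  # chunk size quoted in the list above


def string_chunks(byte_length: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Chunks needed for a string of the given encoded length.

    Assumption: the full chunk holds payload (no per-chunk header),
    and an empty string still takes one chunk.
    """
    if byte_length < 0:
        raise ValueError("byte_length must be non-negative")
    return max(1, math.ceil(byte_length / chunk_bytes))


def string_bytes_on_disk(byte_length: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Disk space taken by one string value, rounded up to whole chunks."""
    return string_chunks(byte_length, chunk_bytes) * chunk_bytes


if __name__ == "__main__":
    for length in (10, 128, 129, 300):
        print(length, string_chunks(length), string_bytes_on_disk(length))
```

The point to take away is that string storage grows in steps of one chunk, so many short strings can cost far more than their raw length suggests.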

It is important to note that in a cluster setup every cluster member holds a complete copy of the data, so the estimate applies to each machine in the cluster.

Overheads

Despite Neo’s great documentation, there is no central reference for the overheads and disk sizes. This is understandable to a degree, as these are internal details and not a published contract. However, it is still a little annoying, as one has to look in various places to find bits and pieces.

The following table gives an overview of the various items

Example calculation

Let’s give an example, to illustrate how the numbers above come into play.

Let’s assume that we have a Neo4J database with the following 3 node instances (JSON representation) and 2 relationships with no properties attached to them.

So for this tiny dataset we have

  • 3 node instances
  • 2 relationship instances
  • 24 property “instances” (or better individual property values)
  • …16 of which are string values
  • 8 index entries

So the Neo4J database folder storing our model will be around 3 KB on disk (provided we ignore logs and other housekeeping files).
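The arithmetic behind a figure like this can be sketched in a few lines of Python. The per-record sizes below are my assumption, taken from the store-format figures documented for Neo4J 2.x (15 bytes per node, 34 per relationship, 41 per property); only the 128-byte string chunk comes from this article. Check them against your own version. Index entries are left out, because I am not assuming any per-entry size for them, and every string value is assumed to fit in a single chunk.

```python
# Assumed per-record sizes in bytes (Neo4J 2.x-era store format).
# These are NOT taken from this article; verify them for your version.
NODE_BYTES = 15
RELATIONSHIP_BYTES = 34
PROPERTY_BYTES = 41
STRING_CHUNK_BYTES = 128  # chunk size mentioned in the article


def estimate_bytes(nodes: int, relationships: int,
                   properties: int, string_values: int) -> int:
    """Rough store size, ignoring indexes, logs and housekeeping files.

    Assumes each string value fits in exactly one chunk.
    """
    return (nodes * NODE_BYTES
            + relationships * RELATIONSHIP_BYTES
            + properties * PROPERTY_BYTES
            + string_values * STRING_CHUNK_BYTES)


if __name__ == "__main__":
    # Counts from the tiny dataset above: 3 nodes, 2 relationships,
    # 24 property values, 16 of which are strings.
    total = estimate_bytes(nodes=3, relationships=2,
                           properties=24, string_values=16)
    print(total)  # 3145
```

Under these assumptions the data records alone come to 3145 bytes, which is in line with the rough 3 KB figure once the 8 index entries are added on top.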

A calculation template

Let’s consider the following example model to store in Neo4J.

To make things simple, we can choose to implement relational inheritance (in this case User with Employee and Contact) by simply collapsing and adding additional labels. So an employee node will have 2 labels: User and Employee.

We should know (or at least guesstimate) the following

  • how many and what type of properties each class will have
  • how many instances of each class (or better, node type) we will have
  • how many other nodes each one is connected to on average

Once we have this information, we can create an Excel spreadsheet similar to the following

The right table captures how many nodes of each type we have along with the properties per node.

The left table calculates the number of relationships in total based on the averages. So, for example, each Country has on average 50 Employees and each Client has 0.5 Contacts (because we know we have a lot of empty/unused Client entries).

These figures are then summed to calculate the final DB footprint.
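For readers who prefer code to spreadsheets, the same template can be sketched in Python. Everything here is illustrative: the `NodeType` and `RelationshipRule` classes, the instance counts and the property counts are hypothetical, and the record sizes are the Neo4J 2.x-era figures (an assumption, not something stated in this article). Only the two averages, 50 Employees per Country and 0.5 Contacts per Client, come from the text above. Indexes are not included.

```python
from dataclasses import dataclass

# Assumed record sizes in bytes (Neo4J 2.x-era); verify for your version.
NODE_BYTES, RELATIONSHIP_BYTES, PROPERTY_BYTES, STRING_CHUNK_BYTES = 15, 34, 41, 128


@dataclass
class NodeType:
    name: str
    count: int                  # number of instances of this node type
    properties: int             # properties per node (strings included)
    string_properties: int      # how many of those are strings


@dataclass
class RelationshipRule:
    source: str                 # node type the average is counted from
    target: str
    avg_per_source: float       # average number of targets per source node


def total_relationships(types, rules) -> int:
    counts = {t.name: t.count for t in types}
    return round(sum(counts[r.source] * r.avg_per_source for r in rules))


def estimate_bytes(types, rules) -> int:
    nodes = sum(t.count for t in types)
    props = sum(t.count * t.properties for t in types)
    strings = sum(t.count * t.string_properties for t in types)
    return (nodes * NODE_BYTES
            + total_relationships(types, rules) * RELATIONSHIP_BYTES
            + props * PROPERTY_BYTES
            + strings * STRING_CHUNK_BYTES)  # one chunk per string assumed


# Hypothetical model sizes, for illustration only.
TYPES = [
    NodeType("Country", 10, 2, 2),
    NodeType("Employee", 500, 5, 4),
    NodeType("Client", 200, 4, 3),
    NodeType("Contact", 100, 6, 5),
]
RULES = [
    RelationshipRule("Country", "Employee", 50),
    RelationshipRule("Client", "Contact", 0.5),
]

if __name__ == "__main__":
    print(total_relationships(TYPES, RULES), estimate_bytes(TYPES, RULES))
```

Adding a node type or a relationship rule is then a one-line change to the lists.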

You can download the Excel file from this link and use it as a basis for your own calculations. The file contains a VB macro which recalculates the totals whenever you update a cell.

If you need to add or remove rows in the tables because your model is bigger or smaller, you will need to edit this macro.

There might be a more elegant way to create this Excel file, but I never said I was an Excel guru ;-) Your comments and suggestions are welcome.

I hope you find it useful.

Originally published at https://sgerogia.github.io on July 7, 2015.

Life-long learner, happy father, trying to do some software engineering on the side.

Stelios Gerogiannakis
