Jun 19, 2023 4:00:00 PM

Importing Relational Data to Neo4j using Apache Hop - Neo4j Output

Import relational data into Neo4j using Apache Hop with our dedicated Neo4j Output guide. Enhance your data integration process step by step.

Introduction

Hi there, this is a how-to post and we’ll walk through how to import relational data into a Neo4j database using Apache Hop.

If you are here I suppose you already know about Neo4j and Apache Hop, let’s say that having a basic understanding of the property graph model is the only requirement to have the task done.

The idea is to export data from a relational database system (we’ll use PostgreSQL in this case) and import the data into a Neo4j database (graph database) using the Neo4j Output transform (Apache Hop plugin).

You can find the code and the files in the public repo how-to-apache-hop.

We can divide the task into small pieces:

First, we need to translate the relational schema to a graph model.
Implement the pipelines to load the data of the nodes.
Implement the pipelines to load the data of the relationships.
Finally, we implement the workflows to orchestrate the execution of the pipelines.

You can also check our post Apache Hop: importing relational data into Neo4j - Neo4j Graph Output, which is similar but in that case we load only the nodes with at least a relationship and we use another Apache Hop plugin: Neo4j Graph Output.

The relational database

We are going to use a sample PostgreSQL database. The dvdrental database represents the business processes of a DVD rental store, including data about the films, actors, and demographic data of the staff.

The dvdrental database model

🗒 To keep the graph smaller, we just choose some of the entities that are most relevant for our graph and would most likely benefit from understanding the relationships in our example.

The dvdrental database model fragment

Step 1: Define the graph model

The relational data model must first be turned into a graph data model. When constructing a graph model from a relational model, there are a few standard guidelines to follow.

Convert relational model to graph model

Based on this, we convert our relational model to a graph model.

Nodes and Labels

Each row on our Actor table becomes a node in our graph with Actor as the label.
Each row on our Film table becomes a node with Film as the label.
Each row on our Category table becomes a node with Category as the label.
Each row on our Language table becomes a node with Language as the label.

Relationships

The join between Actor and Film becomes a relationship named ACTS_IN.
The join between Film and Category becomes a relationship named BELONGS_TO.
The join between Film and Language becomes a relationship named IN.

If we draw our translation on a whiteboard, we have this graph data model.

Graph schema of the dvdrental database

You can conduct extraction and loading with Apache Hop by implementing several pipelines and workflows:

A pipeline per node label:
- Actor
- Film
- Category
- Language
A pipeline per relationship:
- ACTS_IN
- BELONGS_TO
- IN
A workflow to orchestrate the execution of the pipelines to load the nodes.
A workflow to orchestrate the execution of the pipelines to load the relationships.
A workflow to orchestrate the execution of the previous two workflows.

ETL implementation

Step 2: Implement the pipelines to load the nodes data

Film nodes

The nodes data may now be imported into the Neo4j database. Using a Table Input and a Neo4j Output transform, we employ a Hop pipeline to load the films:

Pipeline write_films.hpl

Using a PostgreSQL connection and the Table Input transform, retrieve the data from the film table:

SQL query
```
SELECT *
FROM public.film 
```
Table input transform

The Neo4j Output transform allows you to load the films data into the Neo4j dvdrental database by mapping the table fields as node properties.

Neo4j Output transform - From Node tab
- Transform name: set the name for this transform in the pipeline (write-films).
- Neo4j Connection: select the Neo4j connection to write the graph to (neo4j-connection).
- Batch size: the transform will aggregate 1000 records in a single transaction for better performance.
- Create indexes: select this option generates unique field restrictions for all primary attributes in nodes.
- Use CREATE instead of MERGE: select this option to bypass any lookup and load the data in a faster way.
The label is Film and the properties are the fields extracted in the previous transform.

🗒 You do not have to import all of the table's field columns. Determine which node characteristics are required, which may be omitted, and which may need to be imported into another node type or relationship.

The execution of the pipeline creates 1000 Film nodes in the dvdrental graph database.

Pipeline write_films.hpl

You can view some of the nodes in the database by executing this code:

Cypher
```
MATCH (n:Film) RETURN n LIMIT 10;
```
Film nodes
Similar to the previous one, create a pipeline for the other 3 types of nodes.

Actor nodes

The execution of the pipeline creates 200 Actor nodes in the dvdrental graph database.

Pipeline write_actors.hpl

You can view some of the nodes in the database by executing this code:

Cypher
```
MATCH (n:Actor) RETURN n LIMIT 10;
```
Actor nodes

Category nodes

The execution of the pipeline creates 16 Category nodes in the dvdrental graph database.

Pipeline write_categories.hpl

You can view some of the nodes in the database by executing this code:

Cypher
```
MATCH (n:Category) RETURN n LIMIT 10;
```
Category nodes

Language nodes
The execution of the pipeline creates 16 Language nodes in the dvdrental graph database.

Pipeline write_languages.hpl

You can view the nodes in the database by executing this code:

Cypher
```
MATCH (n:Language) RETURN n;
```
Language nodes

Step 3: Implement the pipelines to load the relationships data

You may now create the relationships between the nodes after you have loaded all of the nodes and indexes.

ACTS_IN relationships

The relationships data may now be imported into the Neo4j database. Using a Table Input and a Neo4j Output transform, we build an Apache Hop pipeline to load the ACTS_IN relationships:

Pipeline to write relationships

Using a PostgreSQL connection and the Table Input transform, retrieve the data from the film_actor table. In this case you only need to get the data from the table film_actor that stores the many-to-many relationship from film and actor.

Table film_actor
SQL query
```
SELECT *
FROM public.film_actor
```
Table input transform

The Neo4j Output transform allows you to load the ACTS_IN data into the Neo4j dvdrental database by mapping the table fields as node properties.

Neo4j Output transform
- Transform name: write-acts-in, the name for this transform in the pipeline.
- Neo4j Connection: neo4j-connection, select the Neo4j connection to write the graph to.
- Batch size: the transform will aggregate 1000 records in a single transaction for better performance.
- Only create relationships: select this option because we are going to create relationships only, in this case.
Next, we have 3 tabs:
- From Node: to specify the node from which the relationship starts (Actor).
- To Node: to specify the relationship's destination node (Film).
- Relationship: to specify the data of the relationship (ACTS_IN).
From Node tab
Neo4j Output transform - From Node tab
- The label is Actor and the property to create the relationship is actor_id.
- The From Property Fields is actor_id extracted from the film_actor table in the previous transform.
- The Property Name we specify is the name of the property we used to create the node.
To Node tab
Neo4j Output transform - To Node tab
- The label is Film and the property to create the relationship is film_id.
- The From Property Fields is actor_id extracted from the film_actor table in the previous transform.
- The Property Name we specify is the name of the property we used to create the node.
Relationship tab

Neo4j Output transform - Relationship tab
- The Relationship value is ACTS_IN.
- In Relationship Properties Fields we specify the field last_update extracted from the film_actor table in the previous transform.
- In Relationship property name we specify the name of the relationship property.
The execution of the pipeline creates 5462 ACTS_IN relationships in the dvdrental graph database.

Pipeline write_acts_in.hpl

You can view some of the relationships in the database by executing this code:

Cypher
```
MATCH (a:Actor)-[]-(f:Film) RETURN a,f LIMIT 10;
```
ACTS_IN relationships

Similar to the previous one, create a pipeline for the other 2 types of relationships.

BELONGS_TO relationships

The execution of the pipeline creates 1000 BELONGS_TO relationships in the dvdrental graph database.

Pipeline write_belongs_to.hpl

You can view the nodes in the database by executing this code:

Cypher

MATCH (f:Film)-[]-(c:Category) RETURN f,c LIMIT 10;

BELONGS_TO relationships

IN relationships

The execution of the pipeline creates 1000 IN relationships in the dvdrental graph database.

Pipeline write_in.hpl

You can view the nodes in the database by executing this code:

Cypher
```
MATCH (f:Film)-[]-(l:Language) RETURN f,l LIMIT 10;
```
IN relationships

But we don’t want to run pipeline by pipeline every time we need to load the data after an update or an error, so we implement a workflows orchestration.

Step 4: Implement the workflows to run the pipelines

Workflow for nodes pipelines

First, we implement a workflow to run all the pipelines for the nodes. Create a new workflow and add a Pipeline action for each node pipeline and configure them as follows:

Pipeline action

Browse the pipeline file and the name of the file will be added to Action Name.

Workflow for relationship pipelines

Then, we implement a workflow to run all the pipelines for the relationships. Create a new workflow and add a Pipeline action for each relationship pipeline and configure them by browsing the pipeline file.

Finally, we create a main workflow to run the previous two workflows at once. In this case we use Workflow actions but the configuration is similar. Just add an action for each workflow to be run and browse the pipeline file; the name of the file will be added to the Action Name.

The dvdrental knowledge graph

If your main workflow runs successfully, you will get a graph database containing your dvdrental data.

Let's explore the graph database you just loaded.

Run the following Cypher code to see the graph schema:

Cypher

CALL db.schema.visualization;

Then if you execute the following code block to collect 15 nodes from the graph, you can observe the distinct relationships:
```
MATCH (n) RETURN n LIMIT 15;
```
```
 
```