Skip to content

Importing Relational Data to Neo4j using Apache Hop - Neo4j Output

Import relational data into Neo4j using Apache Hop with our dedicated Neo4j Output guide. Enhance your data integration process step by step.

Introduction

Hi there, this is a how-to post and we’ll walk through how to import relational data into a Neo4j database using Apache Hop.

Neo4j Logo

If you are here I suppose you already know about Neo4j and Apache Hop, let’s say that having a basic understanding of the property graph model is the only requirement to have the task done.

Apache Hop and Neo4j

The idea is to export data from a relational database system (we’ll use PostgreSQL in this case) and import the data into a Neo4j database (graph database) using the Neo4j Output transform (Apache Hop plugin).

You can find the code and the files in the public repo how-to-apache-hop.

We can divide the task into small pieces:

  • First, we need to translate the relational schema to a graph model.
  • Implement the pipelines to load the data of the nodes.
  • Implement the pipelines to load the data of the relationships.
  • Finally, we implement the workflows to orchestrate the execution of the pipelines.

You can also check our post Apache Hop: importing relational data into Neo4j - Neo4j Graph Output, which is similar but in that case we load only the nodes with at least a relationship and we use another Apache Hop plugin: Neo4j Graph Output.

The relational database

We are going to use a sample PostgreSQL database. The dvdrental database represents the business processes of a DVD rental store, including data about the films, actors, and demographic data of the staff.

The dvdrental database model
The dvdrental database model
 

🗒 To keep the graph smaller, we just choose some of the entities that are most relevant for our graph and would most likely benefit from understanding the relationships in our example.

The dvdrental database model fragment
The dvdrental database model fragment

Step 1: Define the graph model

The relational data model must first be turned into a graph data model. When constructing a graph model from a relational model, there are a few standard guidelines to follow.
 
Convert relational model to graph model
Convert relational model to graph model

Based on this, we convert our relational model to a graph model.

Nodes and Labels

  • Each row on our Actor table becomes a node in our graph with Actor as the label.

  • Each row on our Film table becomes a node with Film as the label.

  • Each row on our Category table becomes a node with Category as the label.

  • Each row on our Language table becomes a node with Language as the label.

Relationships

  • The join between Actor and Film becomes a relationship named ACTS_IN.

  • The join between Film and Category becomes a relationship named BELONGS_TO.

  • The join between Film and Language becomes a relationship named IN.

If we draw our translation on a whiteboard, we have this graph data model.

 
Graph schema of the dvdrental database
Graph schema of the dvdrental database

You can conduct extraction and loading with Apache Hop by implementing several pipelines and workflows:

  • A pipeline per node label:

    • Actor

    • Film

    • Category

    • Language

  • A pipeline per relationship:

    • ACTS_IN

    • BELONGS_TO

    • IN

  • A workflow to orchestrate the execution of the pipelines to load the nodes.

  • A workflow to orchestrate the execution of the pipelines to load the relationships.

  • A workflow to orchestrate the execution of the previous two workflows.

    ETL implementation
    ETL implementation

    Step 2: Implement the pipelines to load the nodes data

    Film nodes

    The nodes data may now be imported into the Neo4j database. Using a Table Input and a Neo4j Output transform, we employ a Hop pipeline to load the films:

    Pipeline write_films.hpl
    Pipeline write_films.hpl

    Using a PostgreSQL connection and the Table Input transform, retrieve the data from the film table:

    SQL query

    SELECT *
    FROM public.film
    Table input transform
    Table input transform

    The Neo4j Output transform allows you to load the films data into the Neo4j dvdrental database by mapping the table fields as node properties.

    Neo4j Output transform - From Node tab
    Neo4j Output transform - From Node tab

    • Transform name: set the name for this transform in the pipeline (write-films).

    • Neo4j Connection: select the Neo4j connection to write the graph to (neo4j-connection).

    • Batch size: the transform will aggregate 1000 records in a single transaction for better performance.

    • Create indexes: select this option generates unique field restrictions for all primary attributes in nodes.

    • Use CREATE instead of MERGE: select this option to bypass any lookup and load the data in a faster way.

    The label is Film and the properties are the fields extracted in the previous transform.

    🗒 You do not have to import all of the table's field columns. Determine which node characteristics are required, which may be omitted, and which may need to be imported into another node type or relationship.

    The execution of the pipeline creates 1000 Film nodes in the dvdrental graph database.

    Pipeline write_films.hpl
    Pipeline write_films.hpl

    You can view some of the nodes in the database by executing this code:

    Cypher

    MATCH (n:Film) RETURN n LIMIT 10;
    Film nodes
    Film nodes

    Similar to the previous one, create a pipeline for the other 3 types of nodes.

    Actor nodes

     

    The execution of the pipeline creates 200 Actor nodes in the dvdrental graph database.

    Pipeline write_actors.hpl
    Pipeline write_actors.hpl

    You can view some of the nodes in the database by executing this code:

    Cypher

    MATCH (n:Actor) RETURN n LIMIT 10;

     

    Actor nodes
    Actor nodes

    Category nodes 

    The execution of the pipeline creates 16 Category nodes in the dvdrental graph database.

    Pipeline write_categories.hpl
    Pipeline write_categories.hpl
     

    You can view some of the nodes in the database by executing this code:

    Cypher

    MATCH (n:Category) RETURN n LIMIT 10;
    Category nodes
    Category nodes
     

    Language nodes 

    The execution of the pipeline creates 16 Language nodes in the dvdrental graph database.

    Pipeline write_languages.hpl
    Pipeline write_languages.hpl

    You can view the nodes in the database by executing this code:

    Cypher

    MATCH (n:Language) RETURN n;
    Language nodes
    Language nodes

    Step 3: Implement the pipelines to load the relationships data

    You may now create the relationships between the nodes after you have loaded all of the nodes and indexes.

    ACTS_IN relationships

    The relationships data may now be imported into the Neo4j database. Using a Table Input and a Neo4j Output transform, we build an Apache Hop pipeline to load the ACTS_IN relationships:

    Pipeline to write relationships
    Pipeline to write relationships

    Using a PostgreSQL connection and the Table Input transform, retrieve the data from the film_actor table. In this case you only need to get the data from the table film_actor that stores the many-to-many relationship from film and actor.

    Table film_actor
    Table film_actor

    SQL query

    SELECT *
    FROM public.film_actor
    Table input transform
    Table input transform

    The Neo4j Output transform allows you to load the ACTS_IN data into the Neo4j dvdrental database by mapping the table fields as node properties.

    Neo4j Output transform
    Neo4j Output transform
    • Transform name: write-acts-in, the name for this transform in the pipeline.

    • Neo4j Connection: neo4j-connection, select the Neo4j connection to write the graph to.

    • Batch size: the transform will aggregate 1000 records in a single transaction for better performance.

    • Only create relationships: select this option because we are going to create relationships only, in this case.

    Next, we have 3 tabs:

    • From Node: to specify the node from which the relationship starts (Actor).

    • To Node: to specify the relationship's destination node (Film).

    • Relationship: to specify the data of the relationship (ACTS_IN).

    From Node tab

    Neo4j Output transform - From Node tab
    Neo4j Output transform - From Node tab
    • The label is Actor and the property to create the relationship is actor_id.

    • The From Property Fields is actor_id extracted from the film_actor table in the previous transform.

    • The Property Name we specify is the name of the property we used to create the node.

    To Node tab

    Neo4j Output transform - To Node tab
    Neo4j Output transform - To Node tab
    • The label is Film and the property to create the relationship is film_id.

    • The From Property Fields is actor_id extracted from the film_actor table in the previous transform.

    • The Property Name we specify is the name of the property we used to create the node.

    Relationship tab

    Neo4j Output transform - Relationship tab
    Neo4j Output transform - Relationship tab
    • The Relationship value is ACTS_IN.

    • In Relationship Properties Fields we specify the field last_update extracted from the film_actor table in the previous transform.

    • In Relationship property name we specify the name of the relationship property.

     

    The execution of the pipeline creates 5462 ACTS_IN relationships in the dvdrental graph database.

    Pipeline write_acts_in.hpl
    Pipeline write_acts_in.hpl

    You can view some of the relationships in the database by executing this code:

    Cypher

    MATCH (a:Actor)-[]-(f:Film) RETURN a,f LIMIT 10;
    ACTS_IN relationships
    ACTS_IN relationships

    Similar to the previous one, create a pipeline for the other 2 types of relationships.

    BELONGS_TO relationships 

    The execution of the pipeline creates 1000 BELONGS_TO relationships in the dvdrental graph database.

    Pipeline write_belongs_to.hpl
    Pipeline write_belongs_to.hpl
    You can view the nodes in the database by executing this code:

    Cypher

    MATCH (f:Film)-[]-(c:Category) RETURN f,c LIMIT 10;

    BELONGS_TO relationships
    BELONGS_TO relationships

    IN relationships 

    The execution of the pipeline creates 1000 IN relationships in the dvdrental graph database.

    Pipeline write_in.hpl
    Pipeline write_in.hpl

    You can view the nodes in the database by executing this code:

    Cypher

    MATCH (f:Film)-[]-(l:Language) RETURN f,l LIMIT 10;
    IN relationships
    IN relationships

    But we don’t want to run pipeline by pipeline every time we need to load the data after an update or an error, so we implement a workflows orchestration.

    Step 4: Implement the workflows to run the pipelines

    Workflow for nodes pipelines

    First, we implement a workflow to run all the pipelines for the nodes. Create a new workflow and add a Pipeline action for each node pipeline and configure them as follows:

    Pipeline action
    Pipeline action

    Browse the pipeline file and the name of the file will be added to Action Name.

    Workflow to run pipelines 

    Workflow for relationship pipelines

    Then, we implement a workflow to run all the pipelines for the relationships. Create a new workflow and add a Pipeline action for each relationship pipeline and configure them by browsing the pipeline file.

    Workflow to run pipelines 

    Finally, we create a main workflow to run the previous two workflows at once. In this case we use Workflow actions but the configuration is similar. Just add an action for each workflow to be run and browse the pipeline file; the name of the file will be added to the Action Name. 

    The dvdrental knowledge graph

    If your main workflow runs successfully, you will get a graph database containing your dvdrental data.

    Let's explore the graph database you just loaded.

    dvdrental knowledge graph

    Run the following Cypher code to see the graph schema:

    Cypher

    CALL db.schema.visualization;

    cypher schema

    Then if you execute the following code block to collect 15 nodes from the graph, you can observe the distinct relationships:

    MATCH (n) RETURN n LIMIT 15;
    MATCH (n) RETURN n LIMIT 15;