
Migrating Data to MongoDB with Apache Hop

Discover how to transfer your data to MongoDB using Apache Hop. Explore a smooth migration process in this insightful post.

Introduction

For those who don't know, Apache Hop is an open-source software project that helps you easily and efficiently prepare and move data between different systems. Think of it like a superhero that can transform messy, unruly data into neat, organized data that can be easily understood and used by other programs.

Now, for those who are familiar with Apache Hop, you already know that it's a versatile tool that can be used to perform a wide range of data processing tasks. With Apache Hop, you can create data pipelines that automate the movement and transformation of data, saving you time and effort while ensuring that your data is accurate and consistent.

The Challenge: Migrating Relational Data to a MongoDB Database

MongoDB and Apache Hop
 
Migrating data to MongoDB can be a challenging task, especially when dealing with large amounts of data from different sources. One of the main challenges is mapping the data from the source to the destination database. This can involve converting data types, restructuring data, and dealing with inconsistencies between the source and destination schemas.

Another challenge is ensuring the data is migrated accurately and without loss. Any errors or omissions during the migration process can result in data inconsistencies, leading to incorrect analysis and decision-making.

Having a streamlined process for data migration is important because it saves time, reduces the risk of errors, and helps ensure data consistency and accuracy, which in turn leads to better analysis and decision-making. A streamlined process can automate many of the migration tasks, minimizing manual intervention and the risk of human error, and it can be repeated easily for future migrations.

Yes, it is definitely possible to streamline the process of migrating data to MongoDB using Apache Hop.

Understanding MongoDB and Apache Hop

MongoDB and its data model

MongoDB is a popular NoSQL document-oriented database that is designed to store and manage unstructured and semi-structured data. The MongoDB data model is based on a flexible document format called BSON (Binary JSON), which allows for dynamic schema structures and nested data structures.

In MongoDB, a document is a unit of data that consists of a set of key-value pairs. Each document is stored in a collection, which is similar to a table in a traditional relational database system. Unlike tables, MongoDB collections do not enforce a fixed schema, which means that documents within a collection can have different structures and fields.
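
For example, a film document might look like the following. This is a minimal, illustrative sketch written as a Python dictionary (the field values are hypothetical); BSON maps naturally onto Python dicts, lists, and scalars:

    # An illustrative film document: key-value pairs, a nested document, and an array
    film = {
        "film_id": 1,
        "title": "Academy Dinosaur",
        "release_year": 2006,
        "language": {"language_id": 1, "name": "English"},            # nested document
        "special_features": ["Deleted Scenes", "Behind the Scenes"],  # array field
    }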

Apache Hop’s plugins for MongoDB

Apache Hop offers a wide range of plugins that make data integration easy and efficient. Among these plugins are the MongoDB input and output plugins, which are designed to streamline the process of reading and writing data to MongoDB databases.

With the MongoDB input plugin, users can easily extract data from MongoDB collections, while the MongoDB output plugin provides a straightforward way to load data into MongoDB collections.

Setting Up the Environment

Here are the steps to set up the environment for using Apache Hop with MongoDB:

  1. Install Apache Hop:

    • Download the latest stable version of Apache Hop from the official website https://hop.apache.org/download.html

    • Extract the downloaded file to a directory of your choice.

    • Apache Hop does not require installation; simply run the hop-gui script in the directory where you extracted Hop to start the application.

  2. Install MongoDB:

    • Download the latest stable version of MongoDB from the official website.

    • Follow the installation guide provided for your specific operating system to install MongoDB on your machine.

  3. Install a MongoDB client:

    • There are many MongoDB clients available to choose from such as Compass, Studio 3T, Robo 3T, and more.

    • Choose a MongoDB client of your choice and follow the installation guide provided for your specific operating system to install it on your machine.

  4. Create a MongoDB connection in Apache Hop:

    • Open Apache Hop, navigate to the Metadata perspective, and select New -> MongoDB Connection.

    • In the New MongoDB Connection window, specify a name for the connection and provide the necessary connection details, such as the server hostname, port, and authentication details, depending on your specific MongoDB setup. Consider storing these connection details as variables in an environment configuration file so they can be adjusted per environment.

      Apache Hop - MongoDB Connection
    • Test the connection by clicking on the Test button to ensure that Apache Hop can connect to your MongoDB instance.

      Apache Hop - Test Connection
    • Once the connection is established, you can use it to read from or write to MongoDB within Apache Hop.
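
Independently of Apache Hop, you can also sanity-check that your MongoDB instance is reachable with a few lines of Python. This is a minimal sketch assuming the pymongo driver is installed and MongoDB is running locally on the default port; adjust the connection string to your setup:

    from pymongo import MongoClient

    # Connect to the MongoDB instance (add credentials if authentication is enabled)
    client = MongoClient("mongodb://localhost:27017/")

    # The ping command raises an exception if the server cannot be reached
    client.admin.command("ping")
    print("MongoDB is reachable:", client.list_database_names())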

Migrating Data from a Relational Database to MongoDB

The relational database model

We are going to use a sample PostgreSQL database. The dvdrental database represents the business processes of a DVD rental store, including data about films, actors, customers, and staff.

dvdrental database schema

 

To keep the diagram readable, we include only the entities most relevant to our example.
dvdrental database schema fragment

 

Map the model from PostgreSQL to MongoDB

To map the dvdrental database from PostgreSQL to a MongoDB model, you can follow these general steps:

  1. Analyze the schema and relationships of the dvdrental database to understand the entities, attributes, and relationships that need to be modeled in MongoDB.

  2. Design a document schema for each entity based on the analysis, considering the data types, cardinality, and relationships between entities.

  3. Use a tool or write a script to migrate the data from PostgreSQL to MongoDB, transforming the data to match the MongoDB document schema (a minimal script sketch follows this list).

  4. Load the data into MongoDB.
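
To illustrate step 3: outside of Apache Hop, such a migration script could look like the following minimal Python sketch for the actor table. This is a hypothetical example assuming the psycopg2 and pymongo drivers and local servers with default settings; adjust credentials to your environment:

    import psycopg2
    from pymongo import MongoClient

    # Source: the dvdrental PostgreSQL database
    pg = psycopg2.connect(host="localhost", dbname="dvdrental",
                          user="postgres", password="postgres")
    cur = pg.cursor()
    cur.execute("SELECT actor_id, first_name, last_name, last_update FROM actor")

    # Destination: an actor collection in a dvdrental MongoDB database
    mongo = MongoClient("mongodb://localhost:27017/")
    actor_coll = mongo["dvdrental"]["actor"]

    # Transform each row into a document matching the target schema, then load it
    docs = [
        {"actor_id": actor_id, "first_name": first,
         "last_name": last, "last_update": updated}
        for (actor_id, first, last, updated) in cur.fetchall()
    ]
    actor_coll.insert_many(docs)

    cur.close()
    pg.close()
    mongo.close()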

With Apache Hop you don't need to manually create the MongoDB schema. With the “MongoDB output” transform, the schema is automatically generated from the input fields. This means that you can focus on designing your ETL processes and mapping your data, rather than spending time on setting up the database schema. This automation makes it easier and faster to export your data from Apache Hop to MongoDB, allowing you to get your data analysis and visualization tasks done more quickly and efficiently.

For our example, we’ll use the tables actor, film, and film_actor. The mapping will be as follows:

film document example

MongoDB film document example

actor document example

MongoDB actor document example
film_actor document example
MongoDB film_actor document example

In this example, each table in the dvdrental database is mapped to a collection in MongoDB. The film collection contains all the information about a film.

The actor collection contains information about each actor, and the film_actor collection is a join collection that connects the actors to the films they starred in.
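
In text form, the three document shapes look roughly like this (illustrative values written as Python dictionaries; the field names come from the corresponding dvdrental tables):

    # One document per film; special_features is kept as the raw string for now
    film = {"film_id": 1, "title": "Academy Dinosaur", "release_year": 2006,
            "rental_rate": 0.99, "rating": "PG",
            "special_features": '{"Deleted Scenes","Behind the Scenes"}'}

    # One document per actor
    actor = {"actor_id": 1, "first_name": "Penelope", "last_name": "Guiness"}

    # One document per actor-film relationship (the join collection)
    film_actor = {"actor_id": 1, "film_id": 1}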

📓 Note that this is just one example of how you can map the dvdrental database to a MongoDB model. The specific structure and data types used may depend on your specific requirements and use case.

But do I need a collection for the relationship between actor and film?

In MongoDB, there are two main approaches for modeling relationships between documents: embedded documents and references.

In the case of the relationship between actors and films, there are different factors to consider in order to decide which approach to take:

  • Size of the collections: If the collections are expected to be relatively small, then embedding the relationship data within the actors and films collections could be a good option. However, if the collections are expected to be very large, it might be better to have a separate collection for the relationship data to avoid excessive document size and improve query performance.

  • Data consistency: If you choose to embed the relationship data within the actors and films collections, you need to make sure that the data is consistent and kept up-to-date. For example, if an actor's name is updated, you need to make sure that the change is reflected in all the films they appeared in. In contrast, if you choose to use references, the relationship data is stored in a separate collection, and changes in the actors and films collections won't affect it.

  • Query complexity: If you embed the relationship data within the actors and films collections, it can make some queries simpler and faster, as you can retrieve all the information you need in a single query. On the other hand, if you use references, you might need to perform multiple queries and join the data to retrieve the information you need.

Based on these factors, both options have their advantages and disadvantages, and the decision will depend on the specific needs of your application.
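
To make the trade-off concrete, here is a hypothetical sketch of both shapes as Python dictionaries:

    # Embedded: each actor document carries its films; simple reads, duplicated data
    actor_embedded = {
        "actor_id": 1,
        "first_name": "Penelope",
        "last_name": "Guiness",
        "films": [{"film_id": 1, "title": "Academy Dinosaur"}],
    }

    # Referenced: actor and film documents stay lean; a join collection links them by id
    actor_referenced = {"actor_id": 1, "first_name": "Penelope", "last_name": "Guiness"}
    film_actor = {"actor_id": 1, "film_id": 1}  # one document per relationship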

Step by Step

Our goals are:

  1. Extract data from the relational database using a “Table input” transform.

  2. Transform the data with the needed transform plugins.

    For example, we want to remove all the quotation mark (") characters from the special_features field, e.g. {"Deleted Scenes","Behind the Scenes"}. The “Replace in String” transform can be used to achieve this.

  3. Load the transformed data into MongoDB using the “MongoDB output” plugin.

The data can now be imported into the MongoDB database. Using the “Table input”, “Replace in String”, and “MongoDB output” transforms, we build an Apache Hop pipeline to load the film data:

Apache Hop - MongoDB write pipeline

 

Using a PostgreSQL connection and the “Table input” transform, retrieve the data from the dvdrental database:

SELECT f.film_id, f.title, f.description, f.release_year, f.language_id,
       f.rental_duration, f.rental_rate, f.length, f.replacement_cost,
       f.rating, f.last_update AS film_last_update, f.special_features, f.fulltext
FROM public.film f;

The “Table input” transform is configured as follows:

Apache Hop - Table input

Use the “Preview” option to see the data to be exported. The image shows a fragment of the preview:

Apache Hop - Preview data

 

Next step? Removing the quotation marks from the special_features field.

Apache Hop - Preview data

 

How? Add and connect a “Replace in String” transform. To configure it, first select the field that needs to be modified (special_features). Then enter the quotation mark character as the search value and leave the replace field blank.
Apache Hop - Replace in String
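
For reference, the transform performs the string-level equivalent of the following (a trivial Python illustration):

    # Remove every quotation mark from the raw special_features value
    raw = '{"Deleted Scenes","Behind the Scenes"}'
    cleaned = raw.replace('"', "")  # search: the " character; replacement: empty
    print(cleaned)  # {Deleted Scenes,Behind the Scenes}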

We can now proceed with configuring the MongoDB import. With the “MongoDB output” transform, we can easily load data into the dvdrental database.

Add and connect a “MongoDB output” transform. In the first tab, set a unique and descriptive name for the transform, and select the MongoDB connection and the target collection. Additionally, you can set the batch insert size, select the “Truncate collection” option to remove the existing data, etc.

Apache Hop - MongoDB output

 

In the second tab, you can easily add the document fields by selecting the "Get fields" option.
Apache Hop - MongoDB output

Repeat the process for the other two collections, actor and film_actor: add and configure a new “MongoDB output” transform for each, choosing the collection name and selecting the collection fields.

The resulting pipeline should look like the following image:

Apache Hop - MongoDB write pipeline

 

Now the pipeline is ready to be executed. Run the pipeline, check the metrics and logs, and verify that the data was imported into your MongoDB database.

If your pipeline runs successfully, you will get the film collection containing the data extracted from the PostgreSQL database.

Apache Hop - Logging

 

You can configure two more pipelines: one for actor data and one for film_actor data. These pipelines can be incorporated into a single workflow, allowing you to execute all three pipelines in one run.

The actor pipeline

Apache Hop - MongoDB write pipeline

The film_actor pipeline 

Apache Hop - MongoDB write pipeline

The main workflow

Apache Hop - MongoDB write workflow

 

The dvdrental MongoDB database
 

Let's explore the MongoDB dvdrental database you just loaded.

dvdrental MongoDB database

 

The database contains 200 actor documents, 1000 film documents, and 5462 film_actor documents.
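
You can confirm these counts outside a GUI client with a quick pymongo check (a sketch assuming a local instance and the collection names used above):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017/")["dvdrental"]
    for name in ("actor", "film", "film_actor"):
        print(name, db[name].count_documents({}))  # expect 200, 1000, and 5462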

Collection actor

MongoDB collection actor

 

Collection film

MongoDB collection film

Collection film_actor

MongoDB collection film_actor

Remarks

  1. The provided example is a simple one, using three tables from the dvdrental database to demonstrate the basic steps involved in migrating data from a relational database to MongoDB. The same steps can be used to migrate data from any relational database to a MongoDB deployment. However, depending on the complexity or size of the data models, users may require a different implementation using Metadata Injection and/or the Pipeline and Workflow Executor transforms.

  2. The implementation of migrating data from a relational database to MongoDB depends on the mapping between the relational model and the MongoDB model. This mapping defines how the tables, columns, and relationships in the relational database are translated into collections, documents, and embedded documents in MongoDB.

  3. In the next post, we will cover an example of how to use the “MongoDB input” transform in Apache Hop to extract data from a MongoDB database and load it into a relational database. So stay tuned for that!

Migrating Data from Other Sources to MongoDB

Is your data in another format or source? Don't worry, Apache Hop has got you covered! Besides relational database inputs, Apache Hop includes several transform plugins for different input formats such as Excel, CSV, JSON, XML, and many others. You can also read data from various sources, such as FTP, HTTP, and REST services. To learn more about Apache Hop's input/output plugins, check out the Official Documentation.

Conclusion

Some of the benefits of using Apache Hop for data migration to MongoDB are:

  1. Ease of use: Apache Hop offers an intuitive graphical user interface that simplifies the migration process and eliminates the need for complex coding.

  2. Support for multiple data sources: Apache Hop supports a wide range of relational databases, making it possible to migrate data from various sources to MongoDB.

  3. Robustness: Apache Hop can handle large datasets and complex data transformations.

  4. Reusability: The workflows and pipelines created in Apache Hop can be easily reused for future data migrations or data integration projects.

I highly encourage readers to try Apache Hop for their own data migration needs. With its user-friendly interface and support for a wide range of data sources and destinations, Apache Hop makes it easy to migrate data to MongoDB and other platforms.

Apache Hop can help streamline your data migration process and ensure the accuracy and integrity of your data. So why not give it a try and see how it can benefit your organization?