Exporting Data from MongoDB using Apache Hop

Export your MongoDB data with Apache Hop and streamline your data export process.

Introduction

In this post, we will explore how Apache Hop can be utilized to export data from MongoDB databases efficiently. Whether you need to migrate MongoDB data to another system, perform data backups, or integrate MongoDB data with other data sources, Apache Hop offers a streamlined solution. With its wide range of plugins and intuitive graphical user interface, Apache Hop enables data professionals to effortlessly extract MongoDB data and transform it into the desired format.

MongoDB and Apache Hop

Why Export Data from MongoDB with Apache Hop?

  1. Simplified Data Extraction: Apache Hop provides a user-friendly interface that allows users to define and execute extraction processes from MongoDB collections without complex coding. The MongoDB input plugin within Apache Hop streamlines the data extraction process, enabling users to retrieve data from specific collections, apply filters, and efficiently handle large datasets.

  2. Flexible Data Transformation: After extracting data from MongoDB, Apache Hop's rich set of transformation plugins offers a multitude of data manipulation options. Users can easily cleanse, filter, aggregate, or reshape the data to meet their specific requirements. Apache Hop's graphical interface empowers users to visually configure the transformations and preview the output, ensuring data accuracy and integrity.

  3. Integration with Various Data Destinations: Apache Hop supports a wide array of output formats and destinations, making it seamless to export MongoDB data to other databases, file formats, or cloud storage systems. Whether you need to export the data to a relational database, generate CSV or Excel files, or load it into cloud services such as Amazon S3 or Google Cloud Storage, Apache Hop provides the necessary plugins to facilitate the data export process.

  4. Scalability and Performance: Apache Hop is designed to handle large datasets efficiently. It leverages parallel execution and optimized data processing techniques, enabling fast and scalable data export from MongoDB. This ensures that even when dealing with vast amounts of data, Apache Hop maintains optimal performance and minimizes processing time.

  5. Automation and Reproducibility: Apache Hop allows users to design reusable data pipelines, automating the export process and ensuring reproducibility. Once a pipeline is created, it can be scheduled to run at specific intervals or triggered by external events. This automation saves time, reduces the risk of errors, and ensures consistent and up-to-date data exports.

In the upcoming sections, we will dive into the practical aspects of exporting data from MongoDB using Apache Hop. We will explore the setup process, examine the MongoDB input plugin, and guide you through the steps required to extract and export data from MongoDB collections. By the end of this guide, you will have the knowledge and tools to export MongoDB data effectively, enabling seamless integration and analysis in your data ecosystem.

So, let's get started on this journey of exporting data from MongoDB using the power of Apache Hop!

The Challenge: Export Data from MongoDB to a Relational Database

Exporting data from MongoDB can be a complex task, especially when dealing with diverse datasets and the need to ensure data integrity. One of the main challenges in this process is efficiently extracting data from MongoDB and transforming it into a format suitable for further analysis or migration to another database.

With Apache Hop, you can easily define the desired data selection criteria, apply necessary transformations, and export the data in a structured format of your choice. Whether you're migrating data to another database, conducting data analysis, or creating data backups, Apache Hop can streamline and automate the process, making it faster and more reliable.

Understanding MongoDB and Apache Hop

MongoDB and its data model

MongoDB is a popular NoSQL document-oriented database that is designed to store and manage unstructured and semi-structured data. The MongoDB data model is based on a flexible document format called BSON (Binary JSON), which allows for dynamic schema structures and nested data structures.

In MongoDB, a document is a unit of data that consists of a set of key-value pairs. Each document is stored in a collection, which is similar to a table in a traditional relational database system. Unlike tables, MongoDB collections do not enforce a fixed schema, which means that documents within a collection can have different structures and fields.
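For example, a simplified film document from the dvdrental database used later in this post might look like the following (the fields shown are illustrative and the _id value is a placeholder; the exact structure depends on how the data was loaded):

```json
{
  "_id": { "$oid": "64a1f0c2e4b0a1b2c3d4e5f6" },
  "film_id": 1,
  "title": "Academy Dinosaur",
  "release_year": 2006,
  "rating": "PG",
  "length": 86
}
```

Two documents in the same collection could freely add or omit fields, which is exactly the flexibility that makes a mapping step necessary when exporting to a relational schema.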

Apache Hop’s plugins for MongoDB

Apache Hop offers a wide range of plugins that make data integration easy and efficient. Among these plugins are the MongoDB input and output plugins, which are designed to streamline the process of reading and writing data to MongoDB databases.

With the MongoDB input plugin, users can easily extract data from MongoDB collections, while the MongoDB output plugin provides a straightforward way to load data into MongoDB collections.

Setting Up the Environment

Here are the steps to set up the environment for using Apache Hop with MongoDB:

  1. Install Apache Hop:

    • Download the latest stable version of Apache Hop from the official website: https://hop.apache.org/download.html

    • Extract the downloaded file to a directory of your choice.

    • Apache Hop does not require installation; simply run the hop-gui script (hop-gui.sh or hop-gui.bat) located in the bin directory to start the application.

  2. Install MongoDB:

    • Download the latest stable version of MongoDB Community Server from the official MongoDB website.

    • Follow the installation guide provided for your specific operating system to install MongoDB on your machine.

  3. Install a MongoDB client:

    • There are many MongoDB clients available to choose from such as Compass, Studio 3T, Robo 3T, and more.

    • Choose a MongoDB client of your choice and follow the installation guide provided for your specific operating system to install it on your machine.

  4. Create a MongoDB connection in Apache Hop:

    • Open Apache Hop, switch to the Metadata perspective, and select New -> MongoDB Connection.

    • In the New MongoDB Connection window, specify a name for the connection and provide the necessary connection details such as the server hostname, port, authentication details, and more, depending on your specific MongoDB setup. Rather than hardcoding these values, you can keep them in an environment config file and reference them as variables (see the sketch after this list).

      Apache Hop - MongoDB Connection

    • Test the connection by clicking on the Test button to ensure that Apache Hop can connect to your MongoDB instance.

      Apache Hop - MongoDB Connection

    • Once the connection is established, you can use it to read from or write to MongoDB within Apache Hop.
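For example, a minimal environment config file holding the connection variables might look like this (the variable names here are our own convention, not something Apache Hop prescribes):

```json
{
  "variables": [
    { "name": "MONGODB_HOSTNAME", "value": "localhost", "description": "MongoDB server host" },
    { "name": "MONGODB_PORT", "value": "27017", "description": "MongoDB server port" },
    { "name": "MONGODB_DATABASE", "value": "dvdrental", "description": "Database to read from" }
  ]
}
```

You can then reference the values in the connection dialog as ${MONGODB_HOSTNAME}, ${MONGODB_PORT}, and so on, keeping host details and credentials out of your pipelines.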

Migrating Data from MongoDB to a Relational Database

The MongoDB database

We are going to use the MongoDB database created in the previous post, based on the dvdrental sample database.

The current MongoDB database includes the following collections:

Apache Hop - MongoDB dvdrental schema

Map the model from MongoDB to PostgreSQL

To map the DVDrental database from MongoDB to PostgreSQL, you can follow these general steps:

  1. Analyze the MongoDB collections and relationships to understand the entities, attributes, and relationships that need to be modeled in PostgreSQL.

  2. Design a relational schema for each entity based on the analysis, considering the data types, cardinality, and relationships between collections.

  3. Use a tool or write a script to migrate the data from MongoDB to PostgreSQL, transforming the data to match the PostgreSQL relational schema.

  4. Load the data into PostgreSQL.

With Apache Hop, you don't need to manually create the PostgreSQL tables. The “Table output” transform can generate the table schema from the incoming fields via its SQL option. This means that you can focus on designing your ETL processes and mapping your data, rather than spending time on setting up the database schema. This automation makes it easier and faster to export your data from MongoDB to a relational database, allowing you to get your data analysis and visualization tasks done more quickly.

For our example, we’ll use the tables actor, film, and film_actor. The mapping would be as follows:


dvdrental database schema fragment

In this example, each collection in the dvdrental MongoDB database is mapped to a table in PostgreSQL. The film collection contains all the information about a film.

The actor collection contains information about each actor, and the film_actor collection is a join collection that connects the actors to the films they starred in.
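Conceptually, the pipeline we are about to build does the equivalent of the following minimal Python sketch (shown with pymongo and psycopg2 purely for illustration, and with the film columns trimmed to a handful of fields; in Apache Hop the two transforms do all of this without coding):

```python
from pymongo import MongoClient
import psycopg2

# Read the film documents from MongoDB (what the "MongoDB input" transform does).
mongo = MongoClient("mongodb://localhost:27017")
films = mongo["dvdrental"]["film"].find({}, {"_id": 0})  # skip the MongoDB-specific _id

# Insert the rows into PostgreSQL (what the "Table output" transform does).
pg = psycopg2.connect(host="localhost", dbname="dvdrental", user="postgres", password="postgres")
with pg, pg.cursor() as cur:
    for doc in films:
        cur.execute(
            "INSERT INTO film (film_id, title, release_year, rating, length) "
            "VALUES (%(film_id)s, %(title)s, %(release_year)s, %(rating)s, %(length)s)",
            doc,
        )
pg.close()
```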

Step by Step

Our goals are:

  1. Extract data from the MongoDB database using a “MongoDB input” transform.

  2. Load the data into PostgreSQL using the “Table output” transform.

The data can now be imported into the PostgreSQL database. We employ an Apache Hop pipeline using a “MongoDB input” and a “Table output” transform:

Using a MongoDB connection and the “MongoDB input” transform, retrieve the data from the dvdrental MongoDB database. The “MongoDB input” transform is configured as follows:


Input options tab

Apache Hop - MongoDB input

Select the connection and the collection.

Query tab

Apache Hop - MongoDB input

Add your query.
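The query field takes a standard MongoDB query document in JSON. An empty query (or {}) exports every document in the collection; a filter such as the following hypothetical one would limit the export to PG-rated films:

```json
{ "rating": "PG" }
```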

Fields tab

Apache Hop - MongoDB input

Uncheck the “Output single JSON field” option to enable the Get fields button, then use it to retrieve all the fields as separate columns.

Remove the “_id” field because we won’t use this MongoDB-specific id. We’ll use “film_id” as the primary key.

Apache Hop - MongoDB input

Use the “Preview” option to see the data to be exported. The image shows a fragment of the preview:

Apache Hop - Preview data

Next step? Load the data into PostgreSQL using the “Table output” transform.

How? Add and connect a “Table output” transform.

Apache Hop - Table output
  • Select the connection.

  • Specify the schema and add the table name. Take into account that in this case, we are using an empty database. We’ll add the tables and columns from Apache Hop.

  • Use the SQL option to generate and execute the SQL statement that creates the film table. You can add the primary key constraint for the film_id column (a sketch of such a statement follows this list).

    Apache Hop - Table output
  • Execute the SQL statement and check the table creation:

    Apache Hop - film table
  • Click OK and save the configuration.

    Apache Hop - Table output
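The generated DDL, once you add the primary key constraint by hand, might look roughly like the following sketch; the exact column list and types depend on the field metadata coming out of the “MongoDB input” transform:

```sql
CREATE TABLE public.film
(
  film_id INTEGER NOT NULL,
  title VARCHAR(255),
  release_year INTEGER,
  rating VARCHAR(10),
  length INTEGER,
  CONSTRAINT film_pkey PRIMARY KEY (film_id)
);
```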

Additionally, you can set the Commit size, select the “Truncate table” option to remove the current data, etc.

The resulting pipeline should look like the following image:

Apache Hop - MongoDB read pipeline

Now the pipeline is ready to be executed. Run the pipeline, check the metrics and logs, and verify that the data was imported into your PostgreSQL database.

If your pipelines run successfully, you will get the three tables containing the data extracted from the MongoDB database.

Apache Hop - Logs

📔 You can configure two more pipelines: one for the actor data and one for the film_actor data. These pipelines can be incorporated into a single workflow, allowing you to execute all three pipelines in one run.

The actor pipeline

Apache Hop - MongoDB read pipeline

The film_actor pipeline

Apache Hop - MongoDB read pipeline

The main-read workflow

Apache Hop - MongoDB read workflow

The dvdrental PostgreSQL database

Let's explore the PostgreSQL database you just loaded.

The database contains 200 actors, 1000 films, and 5462 film_actor rows.
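You can double-check those counts with a quick query:

```sql
SELECT 'film' AS table_name, count(*) FROM film
UNION ALL
SELECT 'actor', count(*) FROM actor
UNION ALL
SELECT 'film_actor', count(*) FROM film_actor;
```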

film table

actor table

film_actor table

film_actor table

Remarks

  1. In our previous post, we demonstrated how to use the “MongoDB output” transform in Apache Hop to extract data from a PostgreSQL database and insert it into a MongoDB database. In this post, we used those populated collections to extract the data and load it back into a PostgreSQL database.

  2. This is a simple example using three collections from the dvdrental database to demonstrate the basic steps involved. The same steps can be used for migrating data from MongoDB to any relational database deployment. However, depending on the complexity or size of the data models, users may require a different implementation using Metadata Injection and/or the Pipeline and Workflow executor transforms.

  3. The implementation of migrating data from a MongoDB database to a relational database depends on the mapping between the MongoDB model and the relational model. This mapping defines how the collections, documents, and embedded documents in MongoDB are translated into tables, columns, and relationships in the relational database.
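For instance, if the film documents had embedded their actors as an array (a hypothetical variation on the model used in this post), each array element would need to be flattened into a row of the film_actor join table, along these lines:

```python
# Hypothetical: flatten an embedded "actor_ids" array into film_actor rows.
film_doc = {"film_id": 1, "title": "Academy Dinosaur", "actor_ids": [1, 10, 20]}

film_actor_rows = [
    {"film_id": film_doc["film_id"], "actor_id": actor_id}
    for actor_id in film_doc["actor_ids"]
]
# -> [{"film_id": 1, "actor_id": 1}, {"film_id": 1, "actor_id": 10}, ...]
```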

Conclusion

Some of the benefits of using Apache Hop for data migration from MongoDB are:

  1. Ease of use: Apache Hop offers an intuitive graphical user interface that simplifies the migration process and eliminates the need for complex coding.

  2. Support for multiple data sources: Apache Hop supports a wide range of relational databases, making it possible to migrate data from MongoDB to various destinations.

  3. Robustness: Apache Hop can handle large datasets and complex data transformations.

  4. Reusability: The workflows and pipelines created in Apache Hop can be easily reused for future data migrations or data integration projects.

I highly encourage readers to try Apache Hop for their own data migration needs. With its user-friendly interface, robust features, and support for a wide range of data sources and destinations, Apache Hop makes it easy to migrate data from MongoDB and other platforms. Whether you're a developer, data analyst, or IT professional, Apache Hop can help streamline your data migration process and ensure the accuracy and integrity of your data. So why not give it a try and see how it can benefit your organization?