Datavin3

The Power of Metadata Objects in Apache Hop: A Comprehensive Guide II

Written by Adalennis Buchillón Soris | May 2, 2023 8:00:00 PM

Discover the potential of Metadata Objects in Apache Hop through this post - Part II. Harness the power of metadata for streamlined data integration.

Introduction

The use of metadata objects in data integration processes has become increasingly important in recent years, and Apache Hop is no exception. With metadata objects, users can easily manage and maintain complex data pipelines and processing tasks by defining and standardizing data definitions, connection information, and processing rules across different workflows and pipelines.

 

In this two-part post, we will delve into the power of metadata objects in Apache Hop, covering the different types of objects available, their benefits, use cases, and step-by-step instructions for creating and managing them using the Hop graphical user interface (Hop GUI). In this second post, we will cover essential metadata objects, including Relational Database Connection, Neo4j Connection, Neo4j Graph Model, MongoDB Connection, and more.

By the end of this series, readers will have a comprehensive understanding of metadata objects in Apache Hop and best practices for using them effectively in data integration workflows.

Check the first post: The Power of Metadata Objects in Apache Hop: A Comprehensive Guide I

How to Create and Manage Metadata Objects in Apache Hop

There are several ways to create and manage metadata objects using the Apache Hop graphical user interface (Hop GUI). The approach depends on the metadata type, but in this post we'll cover two of them.

First way

  1. Open the Apache Hop GUI and select the Metadata perspective.

     

  2. Select a Metadata object type and hit the New button.

  3. Fill in the details for the metadata object, such as the connection details for a database or the file definition.

  4. Click OK to save the metadata object.

Second way

  1. Open the Apache Hop GUI and click Hop -> New, or the New button visible in the horizontal menu.

     

  2. Select the type of metadata object you want to create from the context menu. This takes you to the Metadata perspective and opens a dialog with the fields to fill in.

     

  3. Fill in the details for the metadata object, such as the connection details for a database or the file definition.

  4. Click OK to save the metadata object.

Once you have created a metadata object, you can use it in your workflows and pipelines by referencing it in the appropriate action/transform.

To manage metadata objects, you can use the Metadata perspective to view, edit, or delete existing objects.

Types of Metadata Objects in Apache Hop

Apache Hop provides a variety of metadata objects that users can create and manage to streamline the data integration process. As of Apache Hop 2.4, the types of metadata objects are:

  1. Pipeline Run Configuration
  2. Execution Information Location
  3. Execution Data Profile
  4. Workflow Run Configuration
  5. Pipeline Log
  6. Workflow Log
  7. Pipeline Probe
  8. Pipeline Unit Test
  9. Data Set
  10. Beam File Definition
  11. Relational Database Connection
  12. Neo4j Connection
  13. Neo4j Graph Model
  14. MongoDB Connection
  15. Cassandra Connection
  16. Splunk Connection
  17. Partition Schema
  18. Hop Server
  19. Web Service
  20. Asynchronous Web Service

Yes, there are many metadata objects to cover, which is good news. There is no need to panic because we will guide you through each of them, provide examples, and clarify the dependencies between some of them. This is the second of two posts where we’ll cover all of the current metadata objects in Apache Hop. Check the first post The Power of Metadata Objects in Apache Hop: A Comprehensive Guide I.

This second post will include the following metadata objects:

  1. Relational Database Connection
  2. Neo4j Connection
  3. Neo4j Graph Model
  4. MongoDB Connection
  5. Cassandra Connection
  6. Splunk Connection
  7. Partition Schema
  8. Hop Server
  9. Web Service
  10. Asynchronous Web Service

Relational Database Connection

The Relational Database Connection metadata object in Apache Hop is used to define connections to relational databases, such as MySQL, Oracle, PostgreSQL, SQL Server, and more. This metadata object provides a way to configure the parameters required to connect to a specific database, such as hostname, port, database name, username, and password.

When creating a new Relational Database Connection in Apache Hop, the user is prompted to enter the necessary information for the database they want to connect to; the exact fields depend on the database type. You can create a Generic connection or choose one of the many database types Apache Hop supports.

For instance, if you choose PostgreSQL as your connection type, the following fields will appear:

  • Connection type: Select a connection type to be used.
  • Installed driver: The version of the installed driver class. This is informational only.
  • Username: Set the username you use to connect to the database.
  • Password: Add the password you use to connect to the database.
  • Server host name: Set the database server host.
  • Port number: Add the port to connect to the database.
  • Database name: Set the database name for the connection.
  • Manual connection URL: You could also configure a manual connection URL instead, in the format postgres://{user}:{password}@{hostname}:{port}/{database-name}
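The manual connection URL above follows a fixed format. A minimal sketch of assembling it from the individual connection fields (the values and helper name are illustrative, not part of Hop):

```python
# Sketch: assembling the manual connection URL in the format described
# above for the PostgreSQL connection type. Values are illustrative.
def build_postgres_url(user, password, hostname, port, database_name):
    return f"postgres://{user}:{password}@{hostname}:{port}/{database_name}"

url = build_postgres_url("hop", "secret", "localhost", 5432, "dvdrental")
print(url)  # postgres://hop:secret@localhost:5432/dvdrental
```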

After the connection has been established, the user can easily reference it in other components, such as the Table Input or Table Output transforms, to read or write data from the specified database.

Using the Relational Database Connections metadata object allows for more efficient and standardized management of database connections in Apache Hop.

By defining a connection once and using it throughout a workflow, the user can avoid duplicating connection information and ensure consistency in how the connection is used across different components.

Additionally, the metadata object allows for easy updating of connection information if needed, without having to manually update each component that uses the connection.

Neo4j Connection

The Neo4j Connection metadata object in Apache Hop allows users to connect to a Neo4j graph database. This metadata object contains the information needed to establish a connection, such as host, port, username, and password. In addition, users can specify options such as encryption and trust strategy.

Basic tab

  • Connection name: Specify the name of the metadata object.
  • Protocol: The protocol by default is neo4j. To connect to an Aura version 4 or 5 database you can use the protocol neo4j+s.
  • Server or IP address: Set the name of the Neo4j server.
  • Database name: Specify the name of the Neo4j database to be used.
  • Database port: Specify the number of the port.
  • Username: Specify your username to connect to the Neo4j server.
  • Password: Specify your password to connect to the Neo4j server.
Protocol tab

 

  • Version 4 database?: Enable this option so Hop can generate the most optimized Cypher for your database version.
  • Browser port: Information only. It specifies the port on which the Neo4j browser runs.
  • Use routing, neo4j:// protocol?: Select this option to use the bolt+routing protocol.
  • Routing policy: Specify the bolt+routing policy to use.
  • Use encryption?: Disable this option unless you have generated and configured the appropriate SSL keys.
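The protocol, routing, and Aura options above all end up shaping the connection URI. A small sketch of that mapping, under stated assumptions (the helper and parameter names are illustrative, not Hop internals; the Aura hostname is a made-up example):

```python
# Sketch: how the protocol settings map to a Neo4j connection URI.
# Helper and parameter names are illustrative, not Hop API.
def neo4j_uri(server, port, use_routing=False, aura=False):
    if aura:
        protocol = "neo4j+s"   # Aura v4/v5 uses an encrypted scheme
    elif use_routing:
        protocol = "neo4j"     # routing (formerly bolt+routing)
    else:
        protocol = "bolt"      # direct, single-instance connection
    return f"{protocol}://{server}:{port}"

print(neo4j_uri("localhost", 7687))  # bolt://localhost:7687
print(neo4j_uri("db-id.databases.neo4j.io", 7687, aura=True))
```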

Advanced tab

  • Connection Liveliness Check Timeout (ms): Set a balance between connection problems and performance by testing pooled connections that have been idle for too long. The default value is 0.
  • Maximum Connection Lifetime (ms): Close pooled connections older than this threshold to prevent high connection churn. The default value is 1 hour.
  • Maximum Pool Size: Limit the maximum amount of connections in the pool. The default value is 100.
  • Connection Acquisition Time (ms): Set the maximum amount of time for connection acquisition attempts. The default value is 60 seconds.
  • Connection timeout (ms): This option sets the maximum amount of time the driver will wait for a connection to be established with the database before throwing an error.
  • Maximum Transaction Retry Time (ms): Specify the maximum time for transactions to retry. The default value is 30 seconds.

Manual URLs tab

  • Manual URLs: Specify a list of manual connection URLs to work with advanced features.

Once the Neo4j Connection metadata object is created, it can be used in Apache Hop transforms to read or write data to the connected Neo4j database.

The Neo4j Cypher transform, for example, allows users to specify the Neo4j Connection metadata object, as well as the query or statement to execute against the database.

It is also possible to use actions in workflows to check connections to Neo4j, run Neo4j scripts, and update constraints or indexes.

In future posts, we'll cover the entire list of Apache Hop plugins for Neo4j.

Neo4j Graph Model

The Apache Hop metadata object for Neo4j Graph Model provides a way to define and manage metadata for a Neo4j graph database. The Neo4j Graph Model metadata object consists of two main components: nodes and relationships.

Nodes are represented by labels, and each label can have one or more properties associated with it.

Properties are defined as key-value pairs, where the key is the name of the property and the value is the data type of the property.

Relationships represent the edges between the nodes in the graph, and each relationship can have one or more properties associated with it. Like labels, the properties of relationships are defined as key-value pairs.

You can define a Neo4j Graph Model in Apache Hop as a metadata object. A graph model in Apache Hop allows you to create nodes with their attributes, as well as the connections or relationships between these nodes.

The following is an example of how a graph model is set up.

Model tab

  • In the Model tab, set the name of the graph model: dvdrental.

Nodes tab

  • Add the label Actor and its properties: actor_id, last_name, first_name.
  • Specify the primary key: actor_id.
  • Configure an entry for each of the remaining nodes: Film, Category, and Language.

Relationships tab

  • Configure each relationship by specifying the fields:
    • Name: ACTS_IN
    • Label: ACTS_IN
    • Source: Actor, the origin node of the relationship.
    • Target: Film, the target node of the relationship.

Graph tab

You can visually check the model you created.
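To make the structure of the dvdrental walkthrough concrete, here it is expressed as a plain data structure. The shape is purely illustrative (it is not Hop's on-disk graph model format), showing how labels, properties, a primary key, and relationships fit together:

```python
# Sketch: the dvdrental graph model from the walkthrough above as a
# plain dict. The shape is illustrative, not Hop's file format.
graph_model = {
    "name": "dvdrental",
    "nodes": {
        "Actor": {"properties": ["actor_id", "last_name", "first_name"],
                  "primary_key": "actor_id"},
        "Film": {}, "Category": {}, "Language": {},
    },
    "relationships": [
        {"name": "ACTS_IN", "label": "ACTS_IN",
         "source": "Actor", "target": "Film"},
    ],
}

# Every relationship endpoint must reference a defined node label.
for rel in graph_model["relationships"]:
    assert rel["source"] in graph_model["nodes"]
    assert rel["target"] in graph_model["nodes"]
```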

This metadata object can then be used in transforms or actions within Apache Hop to interact with the Neo4j graph database. For example, the Graph Output transform in Apache Hop allows you to automatically map input fields to a graph model using a Neo4j Graph Model metadata object.

MongoDB Connection

The MongoDB Connection metadata object in Apache Hop is used to define and configure a connection to a MongoDB database. Similar to the other database connections, this metadata object contains various settings for the connection, such as the hostname, port, username, password, and authentication database. Additionally, it allows the user to specify the database and collection that will be queried or written to in the MongoDB database.

Users can also define the read preference and write concern settings for the connection, which determine the behavior of the database in terms of data consistency and availability.

The main fields to be configured:

  • MongoDB Connection name: The name of the metadata object.
  • Hostname: The name of the host.
  • Port: The port number.
  • Database name: The name of the MongoDB database.

Once the MongoDB Connection metadata object is created, it can be used in Apache Hop transforms to read or write data to the connected MongoDB database. The MongoDB Input and MongoDB Output transforms allow users to specify the MongoDB Connection metadata object, as well as the query or field mapping to execute against the database. Users can also specify input and output field mappings to define how data is read from or written to the database.

It is also possible to use the MongoDB Delete transform to execute delete mappings against the database.

Cassandra Connection

The Apache Hop metadata object Cassandra Connection allows users to connect to a Cassandra database to extract and load data. With this connection, users can define the host, port, keyspace, and other configuration options to establish a connection to a Cassandra database.

When defining a Cassandra connection, users should ensure that they have the necessary credentials and permissions to access the database. It's also important to choose appropriate options for consistency levels, compression, and other performance-related settings to ensure efficient data extraction and loading.

  • Hostname: Enter the host name(s) for connecting to the Cassandra server.
  • Port: Specify the port number for the Cassandra server connection.
  • Username: The username used to authenticate against the target keyspace and/or table.
  • Password: The password used to authenticate against the target keyspace and/or table.
  • Socket Timeout: Set a timeout period for the connection, in milliseconds.
  • Keyspace: Set the database name (keyspace). Use the Select Keyspace button to choose a keyspace or the Execute CQL button to create one.
  • Schema hostname: For writes, enter the schema hostname (leave blank if same as hostname).
  • Schema port: For writes, enter the schema port (leave blank if same as port).
  • Use compression: Select this option to compress (with GZIP) the text of each BATCH INSERT statement before transmission to the node.

After the Cassandra connection is defined, it can be used in a variety of transforms such as Cassandra Input, Cassandra Output, and Cassandra Query. The Cassandra Input transform can be used to extract data from a Cassandra table, while the Cassandra Output transform can be used to load data into a Cassandra table. The Cassandra Query transform can be used to execute custom queries on the Cassandra database.
 

Splunk Connection

The metadata object Splunk Connection in Apache Hop is used to define the connection properties for accessing Splunk data. This metadata object allows users to set up a connection to a Splunk instance by specifying the host, port, and credentials required for authentication.

When creating a Splunk connection metadata object, users can specify the connection name, description, and connection properties. The connection properties include the host name, port number, scheme, and authentication credentials. Users can also choose to use a proxy server, and set a timeout value for the connection.

  • Connection name: Specify a name for the connection. It is usually used to help identify a specific connection when there are multiple connections in use.
  • Hostname or IP address: Specify the name or IP address of the server to which the connection will be made.
  • Port: Specify the port number on which the server is listening for connections.
  • Username: Specify the username for the connection.
  • Password: Set the password for the connection.

Once the Splunk connection metadata object is defined, users can use it in Apache Hop's pipeline and workflow designs to extract data from Splunk.

Partition Schema

In Apache Hop, a metadata object partition schema represents the structure of a partitioned data set. This schema specifies the keys used for partitioning, the partition type, and any additional partitioning options. It is a crucial component in defining a data set that can be efficiently processed in parallel by distributed systems.

The partition schema in Apache Hop provides several partitioning options, including hash partitioning, range partitioning, and list partitioning. Hash partitioning distributes data evenly across partitions by hashing a key value, while range partitioning partitions data based on a specified range of key values. List partitioning, on the other hand, partitions data based on a specific list of key values.
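The three strategies above can be sketched in a few lines each. This is a conceptual illustration with made-up function names, not Hop's internal partitioner:

```python
# Sketch of the three partitioning strategies described above.
# Function names are illustrative, not Hop internals.
def hash_partition(key, n_partitions):
    # Distribute rows evenly by hashing the key value.
    return hash(key) % n_partitions

def range_partition(key, boundaries):
    # Assign rows based on which range the key falls into.
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)

def list_partition(key, partition_lists):
    # Assign rows based on an explicit list of key values per partition.
    for i, values in enumerate(partition_lists):
        if key in values:
            return i
    raise ValueError(f"no partition defined for key {key!r}")

assert 0 <= hash_partition("customer-42", 4) < 4
assert range_partition(15, [10, 20, 30]) == 1   # falls in [10, 20)
assert list_partition("BE", [["NL", "BE"], ["US", "CA"]]) == 0
```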

  • Partition schema name: This is a field for specifying the name of the partition schema.
  • Dynamically create the schema definition?: Select this checkbox to create the schema definition dynamically.
  • Number of partitions: This field specifies the number of partitions that will be used to store the data.
  • Partitions: This field specifies the partitions that will be used to store the data. Each partition is identified by a unique identifier and contains a subset of the data.

By utilizing a partition schema metadata object, data processing can be made more efficient and scalable. Additionally, it can help optimize data retrieval and processing by allowing more targeted querying of large data sets.

You can check the following Apache Hop integration tests to see some sample use cases:
  • integration-tests/partitioning/0006-partitioned-when-stream-lookup-should-fail2.hpl

  • integration-tests/partitioning/0004-copies-repartitioning.hpl

  • integration-tests/partitioning/0001-static-partitioning.hpl

  • integration-tests/partitioning/0005-partitioned-stream-lookup.hpl

  • integration-tests/partitioning/0006-partitioned-when-stream-lookup-should-fail.hpl

  • integration-tests/partitioning/0003-repartitioning.hpl

  • integration-tests/partitioning/0005-non-partitioned-stream-lookup.hpl

  • integration-tests/partitioning/0002-dynamic-partitioning.hpl

Check the Apache Hop Git repository.
 

Hop Server

In Apache Hop, the Hop Server is a metadata object that allows you to remotely execute pipelines and workflows. It enables you to centralize the management and execution of your data integration processes in a single server that can be accessed by multiple users or applications. The Hop Server metadata object defines the properties required to connect to a Hop Server instance, such as the hostname, port number, username, and password.

  • Server name: the name to use for this server definition.

Service tab

  • Hostname or IP address: the hostname or IP address where the Hop Server is running.
  • Port (empty is port 80): the port number to use for the Hop Server. If empty, it defaults to port 80.
  • Web app name (optional): the name of the web application to use for the Hop Server. This is an optional field.
  • Username: the username to use for authentication when accessing the Hop Server.
  • Password: the password to use for authentication when accessing the Hop Server.
  • Use https protocol: a boolean flag to indicate whether to use the https protocol for communication with the Hop Server. If set to true, https is used; if set to false, http is used.

Proxy tab

 

  • Proxy server hostname: Specify the hostname of the proxy server.
  • Proxy server port: Specify the port of the proxy server.
  • Ignore proxy for hosts: regexp|separated: Allows the user to specify a regular expression or a separated list of hosts that should not use the configured proxy server for communication. For example, if the regular expression is set to "localhost|127.0.0.1" or the separated list is set to "localhost,127.0.0.1", then the Hop Server will bypass the proxy server when communicating with hosts matching either "localhost" or "127.0.0.1".
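The bypass check from that example can be sketched as a regular-expression match against the hostname before deciding whether to route through the proxy (the function name and decision logic are illustrative, not Hop's implementation):

```python
import re

# Sketch: checking a "regexp|separated" ignore list against a host
# before using the proxy. Logic is illustrative, not Hop's own.
def bypass_proxy(host, ignore_pattern):
    return re.fullmatch(ignore_pattern, host) is not None

ignore = "localhost|127.0.0.1"
assert bypass_proxy("localhost", ignore)        # matches: skip the proxy
assert bypass_proxy("127.0.0.1", ignore)        # matches: skip the proxy
assert not bypass_proxy("example.com", ignore)  # no match: use the proxy
```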

Once the Hop server is defined, it can be used to facilitate the execution of workflows and pipelines remotely through the use of the Remote Pipeline or Remote Workflow run configurations. To run Hop Server, you can use the script available in your Hop installation directory. On Windows, the script is named "hop-server.bat", while on Mac and Linux, it is "hop-server.sh". If you run the script without any parameters, it will display the usage options for Hop Server. The Hop Server can be utilized in conjunction with the Web Service and Asynchronous Web Service metadata types. Check the Hop Official Documentation for more details.

Relationship between Hop Server and Pipeline and Workflow Run Configuration

In Apache Hop, the Hop Server is the component that allows remote execution of pipelines and workflows. In the first post on this topic, we covered the Pipeline Run Configuration and Workflow Run Configuration metadata objects, which allow users to specify and store runtime configuration settings for pipelines and workflows. One of the possible engine types is the Hop remote engine. By specifying this engine, the user can select a Hop Server metadata object to be used for the execution.

Web Service

The Web Service metadata object in Apache Hop is used to run pipelines on a Hop Server. It allows a user to run a pipeline as a service.

The user can define the name of the service, the filename on the server where the pipeline is located, the transform from which the service will take the output rows, and more.

Additionally, the user can enable the option to list the executions of the web service pipeline in the server's status and specify the name of the variable that will contain the content of the request body at runtime.

  • Web service name: The name of the Web Service, used in the Web Service URL to access the service.
  • Enabled: Enables or disables the Web Service.
  • Filename on the server: The name of the pipeline file to be executed on the Hop server. The file should be available on the server for successful execution.
  • Pipeline Run Configuration: The Pipeline Run Configuration to be used for the executions.
  • Output transform: The name of the transform from which the Web Service will take the output row(s).
  • Output field: The output field from which the Web Service will take the data, convert it to a string, and output it.
  • Content type: The content type that will be reported by the Web Service servlet.
  • List status on server: Enables the listing of Web Service pipeline executions in the status of the Hop server.
  • Request body content variable: The name of the variable that will contain the content of the request body at runtime. This is useful when doing a POST request against the Web Service.

To ensure that a Hop Server has access to the metadata you defined, make sure the server can access both the pipelines you want to execute and the server metadata. The recommended way to achieve this is by setting the following option in your XML configuration file:

 

<metadata_folder>/path/to/your/metadata</metadata_folder>
 

An example:

<hop-server-config>
  <hop-server>
    <name>8181</name>
    <hostname>localhost</hostname>
    <port>8181</port>
  </hop-server>
  <metadata_folder>/home/hop/project/services/metadata</metadata_folder>
</hop-server-config>
The base request is as follows, but you can also specify parameters:
http://<hop-server-url>/hop/webService
Check the Hop Official Documentation for more details.
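A minimal sketch of composing such a request URL with query parameters. The service name and parameter below are hypothetical; check your own Web Service definition for the actual names:

```python
from urllib.parse import urlencode

# Sketch: composing the base web service request with optional
# parameters. Host, service name, and parameter are illustrative.
base = "http://localhost:8181/hop/webService"
params = {"service": "my_service", "PARAM_DATE": "2023-05-02"}
url = f"{base}/?{urlencode(params)}"
print(url)  # http://localhost:8181/hop/webService/?service=my_service&PARAM_DATE=2023-05-02
```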
 
Relationship between Hop Server and Web Service
In Apache Hop, the Hop Server is the lightweight server used to run workflows and pipelines remotely. The Web Service metadata type defines a Web Service that, once created, can be accessed remotely through the Hop Server.
 

Asynchronous Web Service

This particular type of web service is designed for executing workflows that take a long time to complete. Unlike other web services that provide immediate results after a workflow is called, this service only returns a unique ID that represents the executing workflow. This ID can then be used to check the status of the workflow. Additionally, it is possible to specify additional variables that will be reported back when querying the status of the asynchronous workflow.

  • Name: The name of the Asynchronous Web Service. This is the name that is passed into the "asyncRun" Web Service URL. Example: "http://localhost:8282/hop/asyncRun/?service=runmainworkflow"
  • Enabled: A boolean flag that indicates whether the web service is enabled or not.
  • Filename: The name of the workflow that will be used for this web service. You can choose to open an existing workflow, create a new one, or browse to select an existing workflow.
  • Status variables: A list of variables that will be reported back when the asynchronous status service is queried. These variables are separated by a comma.
  • Content variable: The name of the variable that will contain the content body of the service call.

Relationship between Hop Server and Asynchronous Web Service

In Apache Hop, the Hop Server and Async Web Service metadata objects are related in that the Async Web Service variant is used to execute long-running workflows on the Hop Server. When using the Async Web Service metadata object, instead of getting immediate results from a pipeline with a Web Service call, the only thing that is returned after the call is the unique ID of the executing workflow.
This unique ID can then be used to query the status of the workflow, including any additional variables that were specified to be reported back during the querying of the status of the asynchronously running workflow. The Hop Server can be accessed in combination with both the Web Service and Async Web Service metadata types to run workflows and pipelines remotely.
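The run-then-poll flow described above can be sketched as follows. The fetch function is injected so the flow can be demonstrated without a live server; the endpoint paths and the response shape are assumptions for illustration, not Hop's documented API:

```python
import time

# Sketch of the asyncRun / status interaction described above. The
# fetch callable is injected so no live server is needed; paths and
# the response dict shape are illustrative assumptions.
def run_and_wait(fetch, service, poll_interval=0.01, max_polls=100):
    # asyncRun returns only the unique ID of the executing workflow.
    unique_id = fetch(f"/hop/asyncRun/?service={service}")["id"]
    for _ in range(max_polls):
        status = fetch(f"/hop/asyncStatus/?service={service}&id={unique_id}")
        if status["finished"]:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"workflow {unique_id} did not finish")

# Fake server for demonstration: finishes on the second status check.
calls = {"n": 0}
def fake_fetch(path):
    if "asyncRun" in path:
        return {"id": "abc-123"}
    calls["n"] += 1
    return {"finished": calls["n"] >= 2, "STATUS_VAR": "ok"}

result = run_and_wait(fake_fetch, "runmainworkflow")
assert result["STATUS_VAR"] == "ok"  # a reported-back status variable
```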

Advantages of Using Metadata Objects in Apache Hop

Using metadata objects in Apache Hop for data integration processes has several benefits:

  1. Reusability: Metadata objects can be reused across multiple workflows and pipelines, reducing the amount of time and effort required to build new data integration processes.
  2. Consistency: By defining metadata objects such as database connections, file formats, and schema definitions, it ensures consistency across workflows and pipelines, reducing the risk of errors and improving data quality.
  3. Manageability: Metadata objects can be managed centrally, making it easier to update and maintain them across multiple workflows and pipelines.
  4. Flexibility: With metadata objects, you can easily switch between different data sources and targets without having to update the entire workflow or pipeline.
  5. Collaboration: Metadata objects can be shared among team members, improving collaboration and reducing the risk of miscommunication or errors.

Metadata objects simplify and standardize data processing across workflows and pipelines by providing a centralized way to manage common data integration elements such as database connections, file formats, and schema definitions. Rather than having to manually configure each of these elements for every workflow or pipeline, metadata objects can be defined once and reused across multiple processes.

This approach ensures consistency across data integration processes, reducing the risk of errors and improving data quality. Additionally, metadata objects can be easily updated and maintained, making it simpler to manage changes to data sources, targets, or processing logic.

By providing a standard way to define and manage metadata objects, Apache Hop streamlines the development and deployment of data integration processes. This approach makes it easier for teams to collaborate on data integration projects and ensures that processing is consistent, repeatable, and reliable across different environments and use cases.

Best Practices for Using Metadata Objects in Apache Hop

Here are some best practices for using metadata objects in Apache Hop:

  1. Use clear and consistent naming conventions for metadata objects to make them easy to identify and use in workflows and pipelines. For example, use names that reflect the purpose of the metadata object and the type of data it represents.
  2. Use variables defined in the environment config file to define the metadata objects. This approach provides a more dynamic and flexible way of managing the metadata objects, as it allows you to easily update the values of the variables without having to modify the metadata objects themselves.
  3. Use metadata inheritance to avoid duplicating information across multiple objects. For example, you might define a database connection that several projects use in a single parent project, and have the other projects inherit the connection details from it.
  4. Use metadata injection to populate metadata objects dynamically at runtime. This can be especially useful when you need to process data from multiple sources that have different metadata properties.
  5. Use version control to manage changes to your metadata objects over time. This can help you track changes and revert to previous versions if necessary.
  6. Document your metadata objects to make it easier for other users to understand their purpose and use them effectively. This can include information about the data source, data types, and other relevant details.

By following these best practices, you can use metadata objects in Apache Hop to streamline your data integration processes, making them more efficient and easier to manage over time.

Conclusion

This post provides an overview of metadata objects in Apache Hop, an open-source data integration tool. The post explains the importance of metadata objects and their different types, including 10 of the 20 metadata types in Apache Hop. The benefits of using metadata objects include reusability, consistency, manageability, and flexibility. The post also provides best practices for creating and managing metadata objects in Apache Hop.

In Apache Hop, metadata objects are used to define the inputs and outputs of pipelines, the format and structure of data sources and targets, and the configuration of various Hop components. They are stored in a centralized metadata repository, which allows for easy access and management of metadata objects across multiple projects.

Metadata objects also enable the automation of data integration processes by providing a way to programmatically manipulate and configure pipelines and workflows. By defining metadata objects once, they can be reused across multiple pipelines and workflows, saving time and effort in development and maintenance.