The Power of Metadata Objects in Apache Hop: A Comprehensive Guide I
Unlock the power of metadata objects in Apache Hop and revolutionize your data integration. Dive into our comprehensive guide now.
Introduction
In Apache Hop, metadata objects are used to store and manage information about data sources, targets, and other components of a data integration process. Metadata objects serve as a central repository for configuration and connection information that can be shared across different pipelines and workflows, making it easier to manage and maintain complex data integration processes.
There are many different types of metadata objects in Apache Hop, including database connections, file definitions, and run configurations. These objects are defined using the graphical user interface (Hop GUI), which allows users to create, edit, and manage metadata objects without writing any code.
One of the key benefits of using metadata objects in Apache Hop is that they allow users to define and manage connections to a wide range of data sources and targets, including databases, file systems, and more. This makes it easy to integrate data from different sources and to build complex pipelines and processing tasks.
Here's an overview of the topics to be covered in the post about metadata objects in Apache Hop:
- The importance of metadata objects in Apache Hop for data integration workflows.
- The different types of metadata objects available in Apache Hop, the relationships/dependencies between some of them, and their use cases.
- The benefits of using metadata objects in Apache Hop for data processing, collaboration, and efficiency.
- Step-by-step instructions and screenshots for creating and managing some metadata objects using the Apache Hop graphical user interface.
- Best practices for using metadata objects in Apache Hop to optimize data integration workflows.
By the end of the post, readers should have a comprehensive understanding of what metadata objects are, their benefits, how to create and manage them in Apache Hop, and best practices for using them effectively in data integration workflows.
How to Create and Manage Metadata Objects in Apache Hop
There are several ways to create and manage metadata objects using the Apache Hop graphical user interface (Hop GUI). The exact steps depend on the metadata type, but in this post we'll cover two of them.
First way
- Open the Apache Hop GUI and select the Metadata perspective.
- Select a metadata object type and hit the New button.
- Fill in the details for the metadata object, such as the connection details for a database or the file definition.
- Click OK to save the metadata object.
Second way
- Open the Apache Hop GUI and hit Hop -> New in the menu, or the New button visible in the horizontal toolbar.
- Select the type of metadata object you want to create from the context menu. This takes you to the Metadata perspective, where the dialog with the fields to be filled in is opened.
- Fill in the details for the metadata object, such as the connection details for a database or the file definition.
- Click OK to save the metadata object.
Once you have created a metadata object, you can use it in your workflows and pipelines by referencing it in the appropriate action/transform.
To manage metadata objects, you can use the Metadata perspective to view, edit, or delete existing objects.
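Under the hood, the GUI persists each metadata object as a JSON file inside the project's metadata folder, grouped in one subfolder per object type. As a minimal sketch (the project path below is hypothetical, and the folder layout is an assumption you should verify against your own project), the following Python snippet lists the metadata objects defined in a project:

```python
from pathlib import Path

# Hypothetical project location; metadata objects are assumed to live as
# one JSON file per object, in one subfolder per metadata type.
metadata_dir = Path("/projects/samples/metadata")

for type_dir in sorted(p for p in metadata_dir.iterdir() if p.is_dir()):
    names = sorted(f.stem for f in type_dir.glob("*.json"))
    print(f"{type_dir.name}: {', '.join(names) if names else '(none)'}")
```

Because these are plain text files, they diff cleanly and fit naturally into version control, a point the best practices section below comes back to.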
Types of Metadata Objects in Apache Hop
Apache Hop provides a variety of metadata objects that users can create and manage to streamline the data integration process. As of Apache Hop 2.4, the available metadata object types are:
- Pipeline Run Configuration
- Execution Information Location
- Execution Data Profile
- Workflow Run Configuration
- Pipeline Log
- Workflow Log
- Pipeline Probe
- Pipeline Unit Test
- Data Set
- Beam File Definition
- Relational Database Connection
- Neo4j Connection
- Neo4j Graph Model
- MongoDB Connection
- Cassandra Connection
- Splunk Connection
- Partition Schema
- Hop Server
- Web Service
- Asynchronous Web Service
Yes, there are many metadata objects to cover, which is good news. There is no need to panic because we will guide you through each of them, provide examples, and clarify the dependencies between some of them. This is the first of two posts where we’ll cover all of the current metadata objects in Apache Hop. This first post will include the following metadata objects:
- Pipeline Run Configuration
- Execution Data Profile
- Execution Information Location
- Workflow Run Configuration
- Pipeline Log
- Workflow Log
- Pipeline Probe
- Pipeline Unit Test
- Data Set
- Beam File Definition
Pipeline Run Configuration
The Pipeline Run Configuration metadata object in Apache Hop provides a way to specify and store runtime configuration settings for pipelines. These settings include the selection of previously configured metadata objects (or the creation of new ones) for the Execution Information Location and the Execution Data Profile, as well as the runtime engine to be used. A minimal command-line example follows the list below.
- The Execution Information Location determines where Apache Hop can send execution information to. This information can be accessed from the Execution Information perspective after the execution.
- The Execution Data Profile feature generates data profiles for data that passes through pipelines. Users can choose from a range of data profilers and configure them to control the type and level of detail of the data profiling.
- The dialog also lists the available pipeline engines (the local and remote Hop engines and the Apache Beam engines); the configuration details vary based on the engine that is selected.
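Run configurations are referenced by name at execution time. As a minimal sketch, the snippet below launches a pipeline with a specific Pipeline Run Configuration using the hop-run command-line tool, wrapped in Python for illustration. It assumes hop-run.sh from the Hop installation is on the PATH; the pipeline file, project name, and run configuration name are placeholders.

```python
import subprocess

# Launch a pipeline with a named Pipeline Run Configuration.
# File, project and run configuration names are hypothetical.
subprocess.run(
    [
        "hop-run.sh",
        "--file=pipelines/load_customers.hpl",  # hypothetical pipeline
        "--project=samples",                    # hypothetical project name
        "--runconfig=local",                    # Pipeline Run Configuration to use
        "--level=Basic",                        # logging level
    ],
    check=True,
)
```

The same --runconfig option applies when executing workflows (.hwf files), in which case it refers to a Workflow Run Configuration.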
Execution Information Location
As we already mentioned, the Execution Information Location metadata object allows users to specify the location for execution information related to a pipeline or workflow. This can include information such as execution status, start and end times, and error messages. By default, Apache Hop stores execution information in memory, but users can choose to store this information in a variety of locations, including a file location, a Neo4j database, or a remote location.
- Data logging delay (ms): Specify the delay (in milliseconds) for the logging.
- Data logging interval (ms): Specify the time interval to be used for logging.
- Location type: The location type can be File location, Neo4j location, or Remote location. Depending on the location type, you will need to fill in different extra fields.
The ability to specify the execution information location is important for several reasons. First, it allows users to store execution information in a more persistent and accessible location, which can be useful for debugging and monitoring purposes. Second, it enables users to easily share execution information across multiple systems or environments. Finally, it provides a way to automate the processing of execution information, such as by triggering alerts or notifications based on specific events.
The execution information can be checked from the Execution Information perspective, which provides a summary of the execution details of workflows and pipelines that have been executed before. With this perspective, you can browse the list of executions and navigate between parent and child workflows and/or pipelines. It also offers details on the status of the execution, logging information, pipeline metrics, and data profiles.
Execution Data Profile
As previously mentioned, the Execution Data Profile metadata object in Apache Hop is used to store and manage data profiling results for a specific execution of a pipeline. The profile includes information about the execution, such as the date and time it was run, the user who ran it, and any parameters or variables that were used.
- The user can choose between four data samplers:
  - Data profile output rows: Basic data profiling is performed on the output rows of a transform.
  - First output rows: The transform output is sampled by selecting the first rows.
  - Last output rows: The transform output is sampled by selecting the last rows.
  - Random output rows: Reservoir sampling is performed on the output rows of a transform (see the short sketch at the end of this section).
The data profiling results stored in the profile include statistics such as the number of null values, the minimum and maximum values, and the average and standard deviation for each column in the input data. The profile also includes histograms for each column, which provide a visual representation of the distribution of the data.
The Execution Data Profile is useful for gaining insight into the quality of the data being processed and identifying potential issues that need to be addressed. It can be used to monitor data quality over time and to track changes in the data.
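Since the random sampler relies on reservoir sampling, here is a brief, Hop-independent Python sketch of that technique: it keeps a fixed-size, uniformly random sample from a stream of rows without knowing the stream's length in advance. The row values are made up for illustration.

```python
import random

def reservoir_sample(rows, k):
    """Keep a uniform random sample of size k from an arbitrarily long stream."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)           # fill the reservoir first
        else:
            j = random.randint(0, i)     # replace with decreasing probability
            if j < k:
                sample[j] = row
    return sample

# Example: sample 5 rows out of a stream of 1000 synthetic rows.
stream = ({"id": i, "amount": i * 0.5} for i in range(1000))
print(reservoir_sample(stream, 5))
```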
Workflow Run Configuration
The Workflow Run Configuration metadata object in Apache Hop allows users to define configuration settings for the execution of a workflow. This metadata object is similar to the Pipeline Run Configuration: it provides the ability to set and manage runtime parameters and variables, in this case for a workflow, which can help streamline and automate the execution of data integration processes.
By using Workflow Run Configuration, users can centralize the management of runtime settings for their workflows, making it easier to maintain and update data integration processes. This metadata object can be particularly useful in large, complex workflows that require fine-tuned configuration and customization options.
The Workflow Run Configuration dialog allows users to configure the following options:
- The user can mark it as the default Workflow Run Configuration.
- As we already mentioned, the Execution Information Location is another metadata object that determines the destination to which Apache Hop can send execution information. Once the execution is completed, this information can be accessed from the Execution Information perspective.
- Unlike the Pipeline Run Configuration, for the execution of workflows there are only two types of engines available:
  - Hop local workflow engine
  - Hop remote workflow engine
Partial summary I
So far, we have covered the following metadata objects:
- Pipeline Run Configuration
- Execution Information Location
- Execution Data Profile
- Workflow Run Configuration
The key points and dependencies between them:
- The Pipeline Run Configuration and the Workflow Run Configuration metadata objects describe the run conditions for pipelines and workflows, respectively.
- The Execution Information Location plays a crucial role in both the Pipeline Run Configuration and the Workflow Run Configuration. It determines the destination where Apache Hop can send execution information.
- On the other hand, the Execution Data Profile is used only in the Pipeline Run Configuration. It is responsible for building data profiles while the data flows through the pipelines.
Pipeline Log
The metadata object Pipeline Log in Apache Hop stores information about the execution of a pipeline, including the start time, end time, and duration of the pipeline run. It also records the number of rows processed and the status of each transform within the pipeline.
This metadata object can be used to monitor pipeline performance and identify potential issues, as well as to track pipeline history and audit trail. It can be viewed and queried through the Hop GUI or accessed programmatically through the Hop API.
- Logging parent pipelines only: The user can check this option to log only the parent pipeline.
- Pipeline executed to capture logging: Specify the pipeline that processes the logging information for this Pipeline Log. To create a pipeline for this purpose, you can either navigate to the perspective area or click the New button in the New Pipeline Log dialog. Choose a folder and name for your pipeline; once created, a new pipeline is automatically generated with a Pipeline Logging transform connected to a Dummy transform (Save logging here). You can then modify the pipeline by replacing the Dummy transform with the target you will use for logging.
- Execute at the start of the pipeline?: Specify whether the Pipeline Log should be executed at the beginning of the pipeline run.
- Execute at the end of the pipeline?: Indicate whether the Pipeline Log should be executed at the end of a pipeline run.
- Execute periodically during execution?: Indicate whether the Pipeline Log should be executed at regular intervals during a pipeline run.
- Interval in seconds: If the Pipeline Log is executed periodically, you can specify the interval at which it should be executed.
Workflow Log
The metadata object Workflow Log in Apache Hop is similar to the Pipeline Log. It is a record of the activities and operations performed during the execution of a workflow. It stores information about each action executed within the workflow, such as the start and end time of each action, the status of the action, and any error messages or warnings that occurred during execution.
The Workflow Log metadata object can be accessed through the Metadata Explorer in Apache Hop and can be used to troubleshoot and debug issues that may arise during workflow execution. It can also be used to monitor performance and identify potential areas for optimization.
- Logging parent workflow only: The user can check this option to log only the parent workflow.
- Pipeline executed to capture logging: Specify the pipeline that processes the logging information for this Workflow Log. To create a pipeline for this purpose, you can either navigate to the perspective area or click the New button in the New Workflow Log dialog. Choose a folder and name for your pipeline; once created, a new pipeline is automatically generated with a Workflow Logging transform connected to a Dummy transform (Save logging here). You can then modify the pipeline by replacing the Dummy transform with the target you will use for logging.
- Execute at the start of the workflow?: Specify whether the Workflow Log should be executed at the beginning of the workflow run.
- Execute at the end of the workflow?: Indicate whether the Workflow Log should be executed at the end of a workflow run.
- Execute periodically during execution?: Indicate whether the Workflow Log should be executed at regular intervals during a workflow run.
- Interval in seconds: If the Workflow Log is executed periodically, you can specify the interval at which it should be executed.
Pipeline Probe
The Pipeline Probe metadata object in Apache Hop is a mechanism to capture the actual rows flowing through selected transforms of a pipeline while it runs and to stream them to a separate pipeline for inspection. It allows you to look at live data without modifying the pipeline being observed, helping to identify potential issues and errors. The captured rows can be processed further, stored, or simply examined for debugging purposes.
To use a Pipeline Probe, you must first define it. This is done by creating a Pipeline Probe metadata object, which can be found in the Metadata perspective of the Hop GUI.
- Pipeline executed to capture logging: Specify the pipeline that processes the data captured by this Pipeline Probe.
- Capture output of the following transforms: The list of pipelines and transforms to capture data from.
Once the Pipeline Probe is defined, you can select the pipelines and transforms whose output you want to capture. At runtime, the probe streams the rows produced by those transforms to the configured processing pipeline.
That processing pipeline can write the captured data to a file or a database for further analysis, or simply log it, helping to catch potential data issues while a pipeline is being developed or monitored.
Pipeline Unit Test
The Pipeline Unit Test metadata object in Apache Hop is a type of metadata that allows users to define tests for their pipelines. It provides an easy and efficient way to test the functionality and behavior of a pipeline by specifying expected results for certain inputs.
Unit tests in Apache Hop are a collection of input sets, golden data sets, and various tweaks that can be applied to the pipelines before testing. These tests enable developers to not only work test-driven but also perform regression testing to ensure that previously resolved issues remain fixed.
To create a Pipeline Unit Test, users can specify the input data, expected output data, and any conditions or constraints that should be tested.
- Type of test: Specify the type of test to be performed.
- The pipeline to test: Add the pipeline to be tested.
- Base test path (or use HOP_UNIT_TESTS_FOLDER): Use the HOP_UNIT_TESTS_FOLDER or specify a new directory.
- The user can mark it as the default Pipeline Unit Test.
- The user can specify a list of database connections in the pipeline to test (Original DB) to be replaced with database connections defined in this unit test (Replacement DB).
- Variables: Specify a list of variable names and values to use for this test.
The Pipeline Unit Test metadata object can be associated with a pipeline, allowing users to easily run the tests for that pipeline. When a pipeline is run, Apache Hop automatically checks if there is an associated Pipeline Unit Test metadata object and runs the defined test cases. If any of the test cases fail, Apache Hop will notify the user, indicating which test cases failed and the reason for the failure.
The input and golden data sets are classified as a distinct type of metadata object, called Data Set. These are utilized as input and output in the pipeline that needs to be tested.
In upcoming posts, we will guide you through the entire process of implementing and running Pipeline Unit Tests in Apache Hop.
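Until then, here is a small, Hop-independent Python sketch of the core idea behind a unit test: compare the rows a pipeline actually produced against a golden data set and report any differences. The file names are placeholders for illustration; Apache Hop's own test runner performs this comparison for you.

```python
import csv

def load_rows(path):
    """Read a CSV file into a list of dictionaries (one per row)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical files: the pipeline's actual output and the golden data set.
actual = load_rows("output/customers_actual.csv")
golden = load_rows("datasets/customers_golden.csv")

if actual == golden:
    print("Unit test passed: output matches the golden data set.")
else:
    print(f"Unit test failed: {len(actual)} actual rows vs {len(golden)} golden rows.")
    for i, (a, g) in enumerate(zip(actual, golden)):
        if a != g:
            print(f"Row {i} differs: actual={a} expected={g}")
```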
Data Set
The Data Set metadata object is a key component in Apache Hop that allows users to define metadata about data sources and targets in a reusable and consistent way. It stores information such as field names, types, formats, and other metadata that is used throughout the pipeline to ensure that data is properly transformed and integrated.
Using the Data Set metadata object, users can create metadata definitions for various data sources and targets. Once created, these definitions can be used in the configuration of Pipeline Unit Tests as input and golden data sets.
- Set Folder (or use HOP_DATASETS_FOLDER): Specify the project directory where data sets are located and stored.
- Base file name: Specify the data set's default file name.
- The data set fields and their column names in the file: A list of field names, types, formats, lengths, and precisions that describes the file layout for this data set.
Partial summary II
- Unit tests in Apache Hop are a collection of input sets, golden data sets, and various tweaks that can be applied to the pipelines before testing.
- The input and golden data sets are classified as a distinct type of metadata object, called Data Set, which allows users to define metadata about data sources and targets in a reusable and consistent way.
- Using the Data Set metadata object, users can create metadata definitions for various data sources and targets, which can be used as part of the configuration of the Pipeline Unit Tests as input and golden data sets.
Beam File Definition
Apache Hop supports the Apache Beam programming model for data processing, which allows users to define data processing pipelines in a flexible and scalable manner. One key feature of Apache Beam in Apache Hop is the Beam File Definition, which provides a way to define the structure and format of input and output files in a pipeline.
The Beam File Definition is a metadata object that describes the structure of files that will be used as input or output in a pipeline. It specifies the file format (such as CSV, JSON, or Avro), the file layout (such as field delimiter and record separator), and the data types of each field. It also allows users to specify additional metadata about the file, such as compression and encryption settings.
- Field separator: The separator that is used between fields in the file definition.
- Field enclosure: The field enclosure that is used for fields in the file definition.
- Field definitions: A list of field names, types, formats, lengths, and precisions. This describes the file layout for the file definition.
To use the Beam File Definition in Apache Hop, users first create a new metadata object and specify the file format and layout. They then define the fields of the file, including their names, data types, and any additional metadata. Once the Beam File Definition is created, it can be used in a pipeline to read or write files of the specified format and structure.
Using the Beam File Definition in Apache Hop provides several benefits. It allows users to define file formats and structures in a centralized location, making it easy to reuse and maintain data processing pipelines. It also enables automatic schema validation and type conversion, which can help prevent errors and improve the reliability of data processing.
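To make the separator, enclosure, and field-definition settings concrete, here is a short, Hop-independent Python sketch that reads a file laid out the way a hypothetical Beam File Definition might describe it: comma as field separator, double quote as enclosure, and three typed fields. It only illustrates the layout; Apache Hop and Beam handle this parsing internally.

```python
import csv
from datetime import datetime
from io import StringIO

# Hypothetical layout: id (Integer), name (String), birth_date (Date, yyyy-MM-dd)
sample = '1,"Doe, John",1984-02-29\n2,"Smith, Ann",1990-07-01\n'

reader = csv.reader(StringIO(sample), delimiter=",", quotechar='"')
for raw in reader:
    row = {
        "id": int(raw[0]),                                    # Integer field
        "name": raw[1],                                       # String field (enclosure keeps the comma)
        "birth_date": datetime.strptime(raw[2], "%Y-%m-%d"),  # Date field with format yyyy-MM-dd
    }
    print(row)
```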
Advantages of Using Metadata Objects in Apache Hop
Using metadata objects in Apache Hop for data integration processes has several benefits:
- Reusability: Metadata objects can be reused across multiple workflows and pipelines, reducing the amount of time and effort required to build new data integration processes.
- Consistency: Defining metadata objects such as database connections, file formats, and schema definitions ensures consistency across workflows and pipelines, reducing the risk of errors and improving data quality.
- Manageability: Metadata objects can be managed centrally, making it easier to update and maintain them across multiple workflows and pipelines.
- Flexibility: With metadata objects, you can easily switch between different data sources and targets without having to update the entire workflow or pipeline.
- Collaboration: Metadata objects can be shared among team members, improving collaboration and reducing the risk of miscommunication or errors.
Metadata objects simplify and standardize data processing across workflows and pipelines by providing a centralized way to manage common data integration elements such as database connections, file formats, and schema definitions. Rather than having to manually configure each of these elements for every workflow or pipeline, metadata objects can be defined once and reused across multiple processes.
This approach ensures consistency across data integration processes, reducing the risk of errors and improving data quality. Additionally, metadata objects can be easily updated and maintained, making it simpler to manage changes to data sources, targets, or processing logic.
By providing a standard way to define and manage metadata objects, Apache Hop streamlines the development and deployment of data integration processes. This approach makes it easier for teams to collaborate on data integration projects and ensures that processing is consistent, repeatable, and reliable across different environments and use cases.
Best Practices for Using Metadata Objects in Apache Hop
Here are some best practices for using metadata objects in Apache Hop:
- Use clear and consistent naming conventions for metadata objects to make them easy to identify and use in workflows and pipelines. For example, use names that reflect the purpose of the metadata object and the type of data it represents.
- Use variables defined in the environment config file to define the metadata objects (see the sketch after this list). This approach provides a more dynamic and flexible way of managing metadata objects, as it allows you to update the values of the variables without having to modify the metadata objects themselves.
- Use metadata inheritance to avoid duplicating information across multiple objects. For example, you might define a database connection once in a parent project and let other projects inherit the connection details from it.
- Use metadata injection to populate metadata objects dynamically at runtime. This can be especially useful when you need to process data from multiple sources that have different metadata properties.
- Use version control to manage changes to your metadata objects over time. This can help you track changes and revert to previous versions if necessary.
- Document your metadata objects to make it easier for other users to understand their purpose and use them effectively. This can include information about the data source, data types, and other relevant details.
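As a minimal sketch of the variables approach, the snippet below writes a hypothetical environment configuration file with a variables section and shows the kind of ${...} reference a metadata object field (for example, a database connection's host name) could then use. The file name, variable names, and values are made up for illustration; check the file Hop creates when you define an environment for the exact structure.

```python
import json

# Hypothetical environment config: a list of name/value variables that
# Apache Hop resolves wherever ${VARIABLE_NAME} appears in metadata fields.
env_config = {
    "variables": [
        {"name": "DB_HOSTNAME", "value": "db.dev.example.com", "description": "Database host"},
        {"name": "DB_PORT", "value": "5432", "description": "Database port"},
    ]
}

with open("dev-env-config.json", "w") as f:
    json.dump(env_config, f, indent=2)

# In the database connection metadata object, the host and port fields would
# then contain expressions like this instead of hard-coded values:
print("${DB_HOSTNAME}:${DB_PORT}")
```

Switching from development to production then only requires pointing the project at a different environment config file, while the metadata objects themselves stay untouched.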
By following these best practices, you can use metadata objects in Apache Hop to streamline your data integration processes, making them more efficient and easier to manage over time.
Conclusion
This post provides an overview of metadata objects in Apache Hop, an open-source data integration tool. It explains the importance of metadata objects and their different types, covering 10 of the 20 metadata types available in Apache Hop 2.4. The benefits of using metadata objects include reusability, consistency, manageability, and flexibility. The post also provides best practices for creating and managing metadata objects in Apache Hop.
In Apache Hop, metadata objects are used to define the inputs and outputs of pipelines, the format and structure of data sources and targets, and the configuration of various Hop components. They are stored centrally in the project's metadata folder, which allows for easy access and management of metadata objects across all of the project's pipelines and workflows.
Metadata objects also enable the automation of data integration processes by providing a way to programmatically manipulate and configure pipelines and workflows. By defining metadata objects once, they can be reused across multiple pipelines and workflows, saving time and effort in development and maintenance.