Understanding and Utilizing Pipeline Logging in Apache Hop
Delve into Apache Hop's Pipeline Logging feature for efficient data processing. Uncover insights and best practices in this guide.
Pipeline Log
Apache Hop introduces a clear separation between data and metadata, enabling you to design data processes independently of the data itself. The Apache Hop Metadata serves as a central repository for shared metadata, including database connections, run configurations, servers, datasets, and more. One useful feature is the "Pipeline Log", which facilitates logging the activity of a pipeline with another pipeline.
The "Pipeline Log" metadata object streams logging information from a running pipeline to another pipeline and is created in JSON format. For each metadata object of this type, you can execute a pipeline of your choice, passing the runtime information of all your pipelines to it.
"Pipeline Log" Configuration
To configure and use the "Pipeline Log" metadata, follow these steps:
Step 1: Create a "Pipeline Log" Metadata Object
- In the horizontal menu, click "New" -> "Pipeline Log".
- Or go to "Metadata" -> "Pipeline Log" -> "New".
- Fill in the required fields:
  - Name: Provide a name for the metadata object (pipelines-logging).
  - Enabled: Check this option to activate the logging.
  - Logging parent pipelines only: Unchecked in our example. This option specifies whether the pipeline logging captures and processes logging information only for the parent pipeline (the pipeline being run), or also for sub-pipelines executed as part of it.
  - Pipeline executed to capture logging: Select or create the pipeline that processes the logging information for this "Pipeline Log". Specify the pipeline's path (${PROJECT_HOME}/hop/logging/pipelines-logging.hpl). We'll create this pipeline in Step 2.
  - Execute at the start of the pipeline?: Checked in our example. Determines whether this pipeline log runs when a pipeline execution starts.
  - Execute at the end of the pipeline?: Checked in our example. Determines whether this pipeline log runs once the pipeline has completed its execution.
  - Execute periodically during execution?: Unchecked in our example. Determines whether this pipeline log runs at regular intervals during a pipeline run.
  - Interval in seconds: The interval, in seconds, at which the pipeline log is executed when "Execute periodically during execution?" is checked.
- 💡 Tip: By default, pipeline logging applies to all pipelines in the current project. To limit logging to specific pipelines, select them in the table labeled "Capture output of the following pipelines" below the configuration options. In our example, only the write-1000-rows.hpl pipeline is selected for logging in the how-to-apache-hop project.
- Save the configuration. The saved object is a small JSON file in the project's metadata folder; a sketch of what it might contain follows below.
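As a rough, minimal sketch of what that saved file could look like: the exact property names vary across Hop versions, so treat every key below as illustrative rather than the literal serialization.

```json
{
  "name": "pipelines-logging",
  "enabled": true,
  "loggingParentsOnly": false,
  "pipelineFilename": "${PROJECT_HOME}/hop/logging/pipelines-logging.hpl",
  "executingAtStart": true,
  "executingAtEnd": true,
  "executingPeriodically": false,
  "intervalInSeconds": "30"
}
```

Note how the file mirrors the dialog options one for one, which is what makes these metadata objects easy to version-control alongside the project.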
Step 2: Create a New Pipeline with the "Pipeline Logging" Transform
- Create a new pipeline from the "New" option in the "Pipeline Log" dialog by choosing a folder and a name.
- The pipeline is automatically generated with a "Pipeline Logging" transform connected to a "Dummy" transform ("Save logging here"). We'll configure another output for this pipeline next. You can also create the pipeline from scratch.
- Configure the "Pipeline Logging" transform:
  - Transform name: Provide a unique name for the transform (piplog).
  - Also log transform details: We keep this option checked.
    - Checked: The transform generates both pipeline and transform logging and metrics. In this scenario, the log has a line for each transform, containing both pipeline logging and metrics information.
    - Unchecked: The transform exclusively produces pipeline logging and metrics.
Step 3: Add and Configure a "Table Output" Transform
- Remove the "Dummy" transform.
- Add a "Table Output" transform to load data into a database table:
  - Click anywhere on the pipeline canvas.
  - Search for 'table output' -> Table Output.
- Configure the "Table Output" transform:
  - Transform name: Provide a unique name for the transform (pipelines-logging).
  - Connection: Select the database connection where the data will be written (dvdrental-connection), which was configured using the logging-connection.json environment file.
  - Target schema: Specify the schema of the table where the data will be written (logging).
  - Target table: Specify the name of the table to which the data will be written (piplog).
- Click the SQL option to automatically generate the SQL for creating the output table; a sketch of what this DDL might look like follows this list.
- Execute the SQL statements and verify the logging fields in the created table.
- Save and close the transform.
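To give an idea of what the SQL option produces, here is a minimal, hypothetical sketch of the generated DDL. The real statement depends on your database dialect and on the output fields of the "Pipeline Logging" transform (including the extra transform-level fields, because "Also log transform details" is checked), so every column below is an assumption; always execute the SQL that Hop generates.

```sql
-- Illustrative only: generate the real DDL with the transform's SQL
-- option; actual column names and types come from the "Pipeline
-- Logging" transform's output fields and your database dialect.
CREATE TABLE logging.piplog (
    loggingDate       TIMESTAMP,    -- when the log line was produced
    loggingPhase      VARCHAR(20),  -- e.g. start or end of the run
    pipelineName      VARCHAR(255), -- name of the logged pipeline
    pipelineStart     TIMESTAMP,    -- execution start time
    pipelineEnd       TIMESTAMP,    -- execution end time
    errorCount        INTEGER,      -- number of errors raised
    statusDescription VARCHAR(32),  -- e.g. Finished, Stopped
    -- present because "Also log transform details" is checked:
    transformName     VARCHAR(255), -- one log line per transform
    linesRead         BIGINT,       -- transform metrics
    linesWritten      BIGINT
);
```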
Step 4: Run a Pipeline and Check the Logs
- Launch a pipeline by clicking "Run" -> "Launch".
- We use a basic pipeline (generate-rows.hpl) that generates a constant and writes 1000 rows to a CSV file.
- The pipeline execution data will be recorded in the piplog table.
- Check the data in the piplog table to review the logs, for example with a query like the one below.
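A query along these lines is a quick way to inspect the most recent runs; the column names are carried over from the illustrative DDL sketch above and should be adjusted to match the table Hop actually generated for you.

```sql
-- Assumes the illustrative column names from the DDL sketch above.
SELECT loggingDate, loggingPhase, pipelineName,
       transformName, linesWritten, errorCount, statusDescription
  FROM logging.piplog
 ORDER BY loggingDate DESC;
```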
Conclusion
Configuring the "Pipeline Log" is straightforward: the logging pipeline can run at the start of a run, at the end, or periodically during execution at an interval you choose. This flexibility allows the logging approach to be tailored to the needs of a specific project.
Apache Hop's "Pipeline Log" is an essential tool for effective logging in data processing pipelines. Its configuration options and seamless integration within Apache Hop's ecosystem make it a valuable asset for data engineers and developers who want to enhance logging capabilities and maintain robust data processes. Combined with the clear separation of data and metadata, tools like the "Pipeline Log" make Apache Hop a strong choice for streamlined and efficient data integration and processing.