Loading data into the Ocient System using the data pipeline functionality is as simple as writing a SQL statement to query data from different supported sources. The loading operation provides real-time transformations and maximizes performance using all available processing resources. With data pipelines, you manage data loading activities using Data Definition Language (DDL) statements similar to the ones used for managing database tables. You can access all information about pipelines by querying the system catalog tables, and you can execute operations on pipelines with SQL statements over standard interfaces such as JDBC and ODBC.

Data Pipelines Overview

A data pipeline is the primary way that the Ocient System loads data. Each data pipeline is a database object that defines the end-to-end processing of data for the extraction, transformation, and load into Ocient tables. Data pipelines execute across your Ocient System, coordinating parallel loading tasks across many Loader Nodes.

The Pipeline Object

Creating a pipeline is complex because there are many options to consider. Use the PREVIEW PIPELINE SQL statement to preview the pipeline creation and data load to ensure that the pipeline definition returns the results you expect. After you are satisfied with the results, execute the CREATE PIPELINE statement to create the pipeline at scale. When you create a pipeline, you assign it a name, and the pipeline exists as an object in the Ocient System associated with that name. Then, you can control the object by using that name with SQL statements like:
  • START PIPELINE
  • STOP PIPELINE
  • CREATE OR REPLACE PIPELINE
  • DROP PIPELINE
You can also define your own functions by using the CREATE PIPELINE FUNCTION SQL statement. Execute all of these statements over a SQL connection. A pipeline maintains its own position during a load and enforces deduplication logic to ensure that the Ocient System loads data only once. The lifecycle of a pipeline defines both the deduplication logic and the load position.
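For example, with a pipeline named orders_pipeline (a hypothetical name used only for illustration), the lifecycle control statements look like this:

```sql
-- Illustrative sketch; the pipeline name is hypothetical.
-- Start loading with the named pipeline:
START PIPELINE orders_pipeline;

-- Stop the pipeline; its load position is retained:
STOP PIPELINE orders_pipeline;

-- Remove the pipeline object entirely:
DROP PIPELINE orders_pipeline;
```

Because the pipeline is a named database object, these statements can be issued from any SQL client connected over JDBC or ODBC.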

Pipeline Events

During the life of a pipeline, you can start, stop, modify, and resume the pipeline without duplicating source data or losing the position in a load.

Pipeline Updates

You can modify pipelines by using the CREATE OR REPLACE PIPELINE SQL statement to update the transformations in a pipeline while maintaining the current position in the load.
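As a sketch, an update keeps the same pipeline name and supplies a revised definition; the name is hypothetical and the elided clauses stand in for the full definition:

```sql
-- Illustrative only: replaces the definition of the existing pipeline.
-- The system preserves the current load position across the replacement.
CREATE OR REPLACE PIPELINE orders_pipeline
  -- ... revised SOURCE, EXTRACT, and INTO ... SELECT clauses ...
;
```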

Load Position

Even if target tables are truncated, or if a target table is dropped and then recreated, the pipeline maintains its own position in the load and continues from that position. Pipelines also gracefully handle many error cases and system failure modes. The pipeline stores any errors that occur in the system catalog in association with the pipeline.

Parts of a Pipeline

Pipelines can operate in either a BATCH or CONTINUOUS mode based on how you plan to load your data. The DDL for Data Pipelines has three sections:

Data Source

The SOURCE part in a pipeline defines the source of the data to load. This DDL part identifies the type of the source (e.g., S3) as well as the details for the data that is relevant to the source (e.g., S3 bucket and filters, Kafka topic and consumer configuration).

Data Format

The EXTRACT part in a pipeline defines the format of the data that the Ocient System extracts from the data source. This part includes the data format (e.g., delimited, binary, JSON) and the details about the records (e.g., record delimiter, how to treat empty data). In addition, the Ocient System extracts metadata that you can load into tables.

Transformations

The INTO table SELECT ... SQL statement defines the target tables where the data loads and uses a SELECT statement to extract fields from the source data and map them to target columns. The SELECT statement in a pipeline can use a subset of SQL functions to convert the extracted data into the required format during loading.
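Putting the three parts together, a pipeline definition follows this general shape. The following is an illustrative sketch only: the pipeline name, source details, and table and column placeholders are hypothetical, and the exact clause syntax for each part is defined by the Ocient SQL reference:

```sql
-- Illustrative sketch; all names and option details are hypothetical.
CREATE PIPELINE orders_pipeline
SOURCE S3                    -- the data source and its details
  -- ... e.g., bucket and file filters ...
EXTRACT FORMAT JSON          -- the format of the extracted records
  -- ... e.g., record delimiter, empty-data handling ...
INTO orders                  -- the target table
SELECT
  -- ... field extraction and mapping to target columns,
  -- optionally applying supported SQL functions ...
;
```

The same three-part structure applies whether the pipeline runs in BATCH or CONTINUOUS mode.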

Pipeline Lifecycle

The pipeline lifecycle includes the following states:
  • Created - Pipeline is defined but not started
  • Running - Pipeline is actively loading data
  • Stopped - Pipeline has been manually stopped
  • Completed - Pipeline has finished loading all data
  • Failed - Pipeline encountered an error
You can start and stop data pipelines using DDL statements. The status of the pipeline transitions automatically as the pipeline completes all assigned work or reaches a failed state.

Observe Pipelines

During pipeline operation, all of the information you need to observe progress, pipeline status, key events, success or failure, and errors is available in system catalog tables in the Ocient System. In addition, performance counters are available for many key loading metrics that you can add to observability systems.
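For example, you can check recent pipeline activity by querying the system catalog. The sys.pipeline_events table is named in this document; the pipeline_name filter column is a hypothetical illustration:

```sql
-- Illustrative: sys.pipeline_events is a system catalog table;
-- the pipeline_name column used here is a hypothetical example.
SELECT *
FROM sys.pipeline_events
WHERE pipeline_name = 'orders_pipeline';
```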

Troubleshooting Pipelines

Data pipelines support robust error-handling capabilities crucial for developing new pipelines, identifying issues with ongoing operations, and correcting bad data. Pipelines can include a bad data target where the Ocient System saves bad data for troubleshooting. In addition, the system captures errors that you encounter during loading in system catalog tables such as sys.pipeline_events and sys.pipeline_errors.
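As a sketch, inspecting load errors for a single pipeline might look like the following. The catalog table name comes from this document, while the filter and ordering columns are hypothetical:

```sql
-- Illustrative: sys.pipeline_errors is a system catalog table;
-- the pipeline_name and event_time columns are hypothetical examples.
SELECT *
FROM sys.pipeline_errors
WHERE pipeline_name = 'orders_pipeline'
ORDER BY event_time DESC;
```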

Loading Architecture

Pipelines require Loader Nodes to operate. Loader Nodes are nodes with the streamloader role assignment, and the loading system can use all active nodes with this role to process pipelines. When you execute a pipeline, Ocient spreads the work among the available Loader Nodes, executing parallel processes as tasks on each node. The system then partitions the files or Kafka partitions to load across these tasks, and loading proceeds in parallel. Many pipelines can run in parallel on the Loader Nodes, and these nodes share resources. The scale-out architecture of the Ocient System allows you to add more Loader Nodes as needed to expand the volume and complexity of the pipelines you can operate at the throughput your application requires.