Use Dataflow SQL

This page explains how to use Dataflow SQL and create Dataflow SQL jobs.

To create a Dataflow SQL job, you must write and run a Dataflow SQL query.

Use the Dataflow SQL editor

The Dataflow SQL editor is a page in the Google Cloud console where you write and run queries for creating Dataflow SQL jobs.

To access the Dataflow SQL editor, follow these steps:

  • In the Google Cloud console, go to the Dataflow SQL Editor page.

    Go to Dataflow SQL editor

You can also access the Dataflow SQL editor from the Dataflow monitoring interface by following these steps:

  1. In the Google Cloud console, go to the Dataflow Jobs page.

    Go to Jobs

  2. In the Dataflow menu, click SQL Workspace.

Write Dataflow SQL queries

Dataflow SQL queries use the Dataflow SQL query syntax. The Dataflow SQL query syntax is similar to BigQuery standard SQL.

You can use the Dataflow SQL streaming extensions to aggregate data from continuously updating Dataflow sources like Pub/Sub.

For example, the following query counts the passengers in a Pub/Sub stream of taxi rides every minute:

SELECT
  DATETIME(tr.window_start) AS starttime,
  SUM(tr.passenger_count) AS pickup_count
FROM TUMBLE ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
  DESCRIPTOR(event_timestamp), 'INTERVAL 1 MINUTE') AS tr
WHERE tr.ride_status = "pickup"
GROUP BY DATETIME(tr.window_start)
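As an illustrative variation (not part of the original example), the same TUMBLE syntax can group by another column in addition to the window start. The following query, which assumes the same public taxirides-realtime topic, counts rides per ride_status in five-minute windows:

SELECT
  tr.ride_status,
  DATETIME(tr.window_start) AS starttime,
  COUNT(*) AS ride_count
FROM TUMBLE ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
  DESCRIPTOR(event_timestamp), 'INTERVAL 5 MINUTE') AS tr
GROUP BY tr.ride_status, DATETIME(tr.window_start)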

Run Dataflow SQL queries

When you run a Dataflow SQL query, Dataflow turns the query into an Apache Beam pipeline and runs the pipeline.

You can run a Dataflow SQL query using the Google Cloud console or the Google Cloud CLI.

Console

To run a Dataflow SQL query, use the Dataflow SQL editor:

  1. Go to the Dataflow SQL Editor page.

    Go to Dataflow SQL editor

  2. Enter the Dataflow SQL query into the query editor.

  3. Click Create job to open a panel of job options.

  4. Optional: For Job name, enter a unique job name.

  5. For Regional endpoint, select a value from the menu.

  6. Optional: Click Show optional parameters, and then enter values for the provided Dataflow pipeline options.

  7. For Destination, select an Output type, and then enter values for the provided fields.

  8. Optional: In the SQL query parameters section, add parameters and then enter values in the provided fields.

  9. Click Create.

gcloud

To run a Dataflow SQL query, use the gcloud dataflow sql query command. The following example creates a Dataflow SQL job that writes its query results to a BigQuery table:

gcloud dataflow sql query \
  --job-name=JOB_NAME \
  --region=REGION \
  --bigquery-table=BIGQUERY_TABLE \
  --bigquery-dataset=BIGQUERY_DATASET \
  --bigquery-project=BIGQUERY_PROJECT \
  'SQL_QUERY'

Replace the following:

  • JOB_NAME: a name for your Dataflow SQL job
  • REGION: the Dataflow location for deploying your Dataflow job
  • BIGQUERY_TABLE: the name of the BigQuery table to which you want to write the output
  • BIGQUERY_DATASET: the BigQuery dataset ID that contains the output table
  • BIGQUERY_PROJECT: the Google Cloud project ID that contains the output BigQuery table
  • SQL_QUERY: your Dataflow SQL query
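For example, the earlier taxi ride query could be submitted as follows. The project, dataset, and table names (my-project, taxi_dataset, passengers_per_minute) are hypothetical placeholders, and the string literals inside the query use double quotes so that the whole query can be wrapped in single shell quotes:

gcloud dataflow sql query \
  --job-name=dfsql-taxi-pickups \
  --region=us-central1 \
  --bigquery-table=passengers_per_minute \
  --bigquery-dataset=taxi_dataset \
  --bigquery-project=my-project \
  'SELECT
     DATETIME(tr.window_start) AS starttime,
     SUM(tr.passenger_count) AS pickup_count
   FROM TUMBLE ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
     DESCRIPTOR(event_timestamp), "INTERVAL 1 MINUTE") AS tr
   WHERE tr.ride_status = "pickup"
   GROUP BY DATETIME(tr.window_start)'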

Set pipeline options

You can set Dataflow pipeline options for Dataflow SQL jobs. Dataflow pipeline options are execution parameters that configure how and where to run Dataflow SQL queries.

To set Dataflow pipeline options for Dataflow SQL jobs,specify the following parameters when you run a Dataflow SQL query.

Console

  • Regional endpoint (String): The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow location. If not set, defaults to us-central1.

  • Max workers (int): The maximum number of Compute Engine instances available to your pipeline during execution. If unspecified, the Dataflow service automatically determines an appropriate number of workers.

  • Worker region (String): The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow job region. If not set, defaults to the specified Dataflow region.

  • Worker zone (String): The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow job region. If not set, defaults to a zone in the worker region; if the worker region is not set, defaults to a zone in the specified Dataflow region.

  • Service account email (String): The email address of the worker service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. If not set, Dataflow workers use the Compute Engine service account of the current project as the worker service account.

  • Machine type (String): The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. If not set, Dataflow automatically chooses the machine type.

  • Additional experiments (String): The experiments to enable. An experiment can be a value, like enable_streaming_engine, or a key-value pair, such as shuffle_mode=service. The experiments must be in a comma-separated list. If unspecified, no experiments are enabled.

  • Worker IP Address Configuration (String): Specifies whether Dataflow workers use public IP addresses. If the value is set to Private, Dataflow workers use private IP addresses for all communication; the specified Network or Subnetwork must have Private Google Access enabled. If the value is set to Private and the Subnetwork option is specified, the Network option is ignored. If not set, defaults to Public.

  • Network (String): The Compute Engine network to which workers are assigned. If not set, defaults to the network named default.

  • Subnetwork (String): The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. If not set, Dataflow automatically determines the subnetwork.

gcloud

  • --region (String): The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow location. If not set, the command returns an error.

  • --max-workers (int): The maximum number of Compute Engine instances available to your pipeline during execution. If unspecified, Dataflow automatically determines an appropriate number of workers.

  • --num-workers (int): The initial number of Compute Engine instances to use when executing your pipeline. This parameter determines how many workers Dataflow starts up when your job begins. If unspecified, Dataflow automatically determines an appropriate number of workers.

  • --worker-region (String): The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow job region. You can specify one of --worker-region or --worker-zone. If not set, defaults to the specified Dataflow region.

  • --worker-zone (String): The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow job region. You can specify one of --worker-region or --worker-zone. If not set, defaults to a zone in the specified Dataflow region.

  • --worker-machine-type (String): The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. If not set, Dataflow automatically chooses the machine type.

  • --service-account-email (String): The email address of the worker service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. If not set, Dataflow workers use the Compute Engine service account of the current project as the worker service account.

  • --disable-public-ips (boolean): Specifies whether Dataflow workers use public IP addresses. If set, Dataflow workers use private IP addresses for all communication. If not set, Dataflow workers use public IP addresses.

  • --network (String): The Compute Engine network to which workers are assigned. If not set, defaults to the network named default.

  • --subnetwork (String): The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. If not set, Dataflow automatically determines the subnetwork.

  • --dataflow-kms-key (String): The customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. The key must be in the same location as the job. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.
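As a sketch of how these flags combine (all values are hypothetical), pipeline options are passed on the same gcloud dataflow sql query invocation alongside the destination flags:

gcloud dataflow sql query \
  --job-name=dfsql-taxi-pickups \
  --region=us-central1 \
  --max-workers=10 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-2 \
  --bigquery-table=passengers_per_minute \
  --bigquery-dataset=taxi_dataset \
  --bigquery-project=my-project \
  'SQL_QUERY'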

For more information, see the gcloud dataflow sql query command reference.

Stop Dataflow SQL jobs

To stop a Dataflow SQL job, you must cancel it. Stopping a Dataflow SQL job with the drain option is not supported.
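For example, a running SQL job can be canceled from the command line with the standard Dataflow cancel command, where JOB_ID and REGION are placeholders for your job's ID and region:

gcloud dataflow jobs cancel JOB_ID --region=REGION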

Pricing

Dataflow SQL uses the standard Dataflow pricing; it does not have separate pricing. You are billed for the resources consumed by the Dataflow jobs that you create based on your SQL statements. The charges for these resources are the standard Dataflow charges for vCPU, memory, Persistent Disk, Streaming Engine, and Dataflow Shuffle.

A Dataflow SQL job might consume additional resources such as Pub/Sub and BigQuery, each billed at their own pricing.

For more information about Dataflow pricing, see Dataflow pricing.

What's next

  • Walk through the Joining streaming data with Dataflow SQL tutorial.
  • Explore the Google Cloud CLI for Dataflow SQL.