Working with BigQuery in Apache Beam Java SDK

Apache Beam is a popular open-source unified programming model for defining and executing data processing pipelines. It provides a simple and expressive way to build batch and streaming data processing applications.

In this blog post, we will explore how to work with BigQuery, a fully-managed data warehouse solution by Google Cloud, using the Apache Beam Java SDK. We will cover some common operations such as reading data from BigQuery, transforming the data, and writing it back to BigQuery.

Setting up the Environment

Before we begin, make sure you have the following tools installed:

- Java 8 or later (JDK)
- Apache Maven
- A Google Cloud project with the BigQuery API enabled, and credentials configured (for example, via the gcloud CLI)
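
You will also need the Beam SDK itself on your classpath. Assuming a standard Maven project, the following dependencies are a reasonable starting point (set beam.version to a current Beam release):

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>${beam.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>${beam.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>${beam.version}</version>
</dependency>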

Reading Data from BigQuery

To read data from BigQuery, we need to define a BigQueryIO source transform. Here’s an example of reading data from a BigQuery table:

// Parse pipeline options (--project, --tempLocation, etc.) from the command-line arguments.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);

PCollection<TableRow> tableRows = p.apply(BigQueryIO.readTableRows()
        .from("<project-id>:<dataset-id>.<table-id>"));

Replace <project-id>, <dataset-id>, and <table-id> with the appropriate values for your BigQuery table.
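
BigQueryIO can also read the result of a SQL query instead of a full table. Here is a minimal sketch (the column names are placeholders for illustration):

PCollection<TableRow> queryRows = p.apply(BigQueryIO.readTableRows()
        .fromQuery("SELECT name, age FROM `<project-id>.<dataset-id>.<table-id>`")
        .usingStandardSql());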

Transforming Data

Once we have the data in the pipeline, we can apply various transformations to manipulate it. For example, we can filter rows based on certain conditions, transform the data schema, or aggregate the data.

Here’s an example of filtering out rows based on a condition:

PCollection<TableRow> filteredRows = tableRows
        .apply(Filter.by((TableRow row) ->
                Long.parseLong(row.get("age").toString()) > 18));

In this example, Filter.by keeps only the rows for which the predicate returns true, so rows where the “age” column value is less than or equal to 18 are dropped. Note that TableRow has no typed getters such as getInt; its values come back as generic objects, so we convert the field before comparing.
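
The same pattern extends to the other transformations mentioned above. As a sketch, here is how you might project each row down to a single column with MapElements and compute the average age with Mean (the choice of fields is illustrative):

// Keep only the "age" field of each row.
PCollection<TableRow> projectedRows = filteredRows.apply(
        MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((TableRow row) -> new TableRow().set("age", row.get("age"))));

// Compute the average age across the whole collection.
PCollection<Double> meanAge = filteredRows.apply(
        MapElements.into(TypeDescriptors.doubles())
                .via((TableRow row) -> Double.parseDouble(row.get("age").toString())))
        .apply(Mean.<Double>globally());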

Writing Data to BigQuery

After transforming the data, we can write it back to BigQuery. The BigQueryIO.writeTableRows() transform lets us specify the destination table. Because we ask Beam to create the table if it does not already exist, we also need to supply a table schema:

// Schema for the destination table (assuming the "age" column used above).
TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("age").setType("INTEGER")));

filteredRows.apply(BigQueryIO.writeTableRows()
        .to("<project-id>:<dataset-id>.<table-id>")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

Replace <project-id>, <dataset-id>, and <table-id> with the appropriate values for the destination BigQuery table.

In this example, we use WRITE_TRUNCATE to overwrite any existing table data. Depending on your requirements, you can instead use WRITE_APPEND or WRITE_EMPTY, and CREATE_NEVER if the table must already exist.

Running the Pipeline

To run the pipeline on Google Cloud Dataflow, execute the following command (this assumes your pom.xml defines a dataflow-runner profile that adds the Dataflow runner dependency, as the Beam quickstart projects do):

mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline -Pdataflow-runner -Dexec.args="--runner=DataflowRunner --project=<project-id> --region=<region> --tempLocation=gs://<temp-bucket>/temp"

Replace <project-id> with your Google Cloud project ID, <region> with the Dataflow region to run in (for example, us-central1), and <temp-bucket> with a Google Cloud Storage bucket where temporary files will be stored during pipeline execution.
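
Note that the snippets above only construct the pipeline; make sure your main method actually launches it at the end:

// Launch the pipeline and block until it finishes.
p.run().waitUntilFinish();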

Conclusion

In this blog post, we explored how to work with BigQuery in the Apache Beam Java SDK. We covered reading data from BigQuery, transforming the data, and writing it back to BigQuery.

Working with BigQuery in the Apache Beam Java SDK provides a powerful and scalable solution for processing and analyzing big data. With Beam’s expressive programming model, you can build complex data pipelines with relatively little code.

#BigQuery #ApacheBeam