Working with side inputs in Apache Beam Java SDK

25 Sep 2023

Apache Beam is a powerful and flexible framework for building data processing pipelines. One of the key features of Apache Beam is the ability to work with side inputs. Side inputs provide additional data that can be used by a pipeline during its execution.

What are Side Inputs?

Side inputs are additional data sources that can be used in parallel with the primary data source. They are called “side inputs” because they are not part of the main data processing path, but rather provide supplementary information that can enhance the processing logic.

Side inputs are especially useful when you need to perform calculations or lookups based on some external data, such as a dictionary or a lookup table. By leveraging side inputs, you can avoid expensive operations like reading the data from disk or making network calls repeatedly.

How to Use Side Inputs in Apache Beam Java SDK

Using side inputs in Apache Beam Java SDK is straightforward. Here’s an example of how to work with side inputs:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;

import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.KV;

import java.util.List;
import java.util.Arrays;

public class SideInputsExample {

  public static void main(String[] args) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create();

    // Read the main input data as a PCollection
    PCollection<String> mainInput = pipeline.apply(TextIO.read().from("input.txt"));

    // Create a side input collection
    List<String> sideInputData = Arrays.asList("apple", "banana", "cherry");
    PCollectionView<List<String>> sideInput = mainInput
      .apply(View.asList());

    // Perform processing using side input
    PCollection<KV<String, Integer>> processedData = mainInput.apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Access the side input data
            List<String> sideData = c.sideInput(sideInput);
            
            // Perform calculations or lookups using the side input
            String inputData = c.element();
            int length = inputData.length();
            
            // Emit the result
            c.output(KV.of(inputData, length + sideData.size()));
        }

    }).withSideInputs(sideInput));

    // Write the processed data output
    processedData.apply(TextIO.write().to("output.txt"));

    // Run the pipeline
    pipeline.run();
  }
}

In this example, we create a side input sideInput by converting the mainInput PCollection into a List. We then use this side input in the ParDo transformation to access the side input data and perform calculations or lookups.

Conclusion

Working with side inputs in Apache Beam Java SDK allows you to leverage additional data sources to enhance your data processing pipelines. By using side inputs, you can avoid costly operations and improve the overall efficiency of your data processing tasks.