
Custom Apache Beam Python Version In Dataflow

I am wondering if it is possible to have a custom Apache Beam Python version running in Google Dataflow. A version that is not available in the public repositories (as of this writ

Solution 1:

I am answering my own question, as I got the answer from one of the Apache Beam JIRA issues I have been helping with.

If you want to use a custom Apache Beam Python version in Google Cloud Dataflow (that is, run your pipeline with --runner DataflowRunner), you must pass the option --sdk_location <apache_beam_v1.2.3.tar.gz> when you run your pipeline, where <apache_beam_v1.2.3.tar.gz> is the location of the packaged version that you want to use.

For example, as of this writing, if you have checked out the HEAD version of the Apache Beam git repository, you first have to package the SDK: navigate to the Python SDK with cd beam/sdks/python and then run python setup.py sdist (a compressed tar file will be created in the dist subdirectory).
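
If you would rather script the packaging step, here is a minimal sketch of how it could look. The helper names build_sdist and newest_sdist are my own (they are not part of Beam), and it assumes the repository is checked out at beam/ relative to the current directory:

```python
import subprocess
import sys
from pathlib import Path


def newest_sdist(dist_dir):
    """Return the most recently modified .tar.gz in dist_dir, or None if there is none."""
    tarballs = sorted(Path(dist_dir).glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
    return tarballs[-1] if tarballs else None


def build_sdist(sdk_dir="beam/sdks/python"):
    """Run `python setup.py sdist` inside sdk_dir and return the resulting tarball path.

    sdk_dir is an assumption: adjust it to wherever your checkout lives.
    """
    sdk_dir = Path(sdk_dir)
    # Same command as in the answer, executed from the Python SDK directory.
    subprocess.run([sys.executable, "setup.py", "sdist"], cwd=sdk_dir, check=True)
    tarball = newest_sdist(sdk_dir / "dist")
    if tarball is None:
        raise FileNotFoundError(f"no .tar.gz found in {sdk_dir / 'dist'}")
    return tarball
```

Calling build_sdist() returns the path of the tarball in the dist subdirectory, which is the value to hand to --sdk_location.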

Thereafter you can run your pipeline like this:

python your_pipeline.py [...your_options...] --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz

Google Cloud Dataflow will use the supplied SDK.
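
If you launch the pipeline from Python code instead of the shell, the same flags can be assembled as a list. This is only a sketch: dataflow_args is a hypothetical helper of mine, and the --project value in the usage line is a placeholder, not something from the answer above.

```python
def dataflow_args(sdk_tarball, extra_args=()):
    """Build the flags that make Dataflow use a locally packaged SDK.

    sdk_tarball is the path to the .tar.gz produced by `python setup.py sdist`.
    extra_args holds whatever other options your pipeline needs.
    """
    return [
        "--runner", "DataflowRunner",
        "--sdk_location", str(sdk_tarball),
        *extra_args,
    ]


# Example with the tarball name from the command above and a placeholder project.
args = dataflow_args(
    "beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz",
    ["--project", "my-project"],
)
```

With Apache Beam installed, a list like this can be passed to apache_beam.options.pipeline_options.PipelineOptions, which accepts command-line style flags.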

