Numpy And Static Linking
Solution 1:
There are at least two problems with your approach, and both come down to the simple fact that NumPy is a heavyweight dependency.
First of all, Debian packages come with multiple dependencies, including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy a NumPy installation and expect things to work (to be honest, you shouldn't do anything like this even if that weren't the case). Theoretically you could try to build it using static linking and ship it with all its dependencies that way, but then you hit the second issue.
NumPy is pretty large by itself. While 20MB doesn't look particularly impressive, and with all the dependencies it shouldn't be more than 40MB, it has to be shipped to the workers each time you start your job. The more workers you have, the worse it gets. If you decide you need SciPy or SciKit, it can get much worse.
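To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name, the worker count and the number of job starts are made up for illustration; only the 40MB figure comes from the estimate above.

```python
def shipped_megabytes(package_mb, workers, job_starts=1):
    """Total data sent when every worker receives the package on each job start."""
    return package_mb * workers * job_starts


# Hypothetical cluster: 100 workers, ~40MB for NumPy plus its dependencies.
print(shipped_megabytes(40, 100))      # 4000
print(shipped_megabytes(40, 100, 25))  # 100000
```

So even a modest cluster moves gigabytes of the same files around once the job is started a few times.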
Arguably this makes NumPy a really bad candidate for being shipped with the pyFile method.
If you didn't have direct access to the workers, but all the dependencies, including header files and a static library, were present, you could simply try to install NumPy in user space from the task itself (this assumes that pip is installed as well) with something like this:
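    import subprocess
    import sys

    try:
        import numpy as np
    except ImportError:
        # pip.main() was removed in pip 10, so run pip via the current interpreter
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "numpy"])
        import numpy as np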
You'll find other variants of this method in How to install and import Python modules at runtime?
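If you need this for more than one module, the same idea can be wrapped in a small helper. This is only a sketch: ensure_module and pip_install_user are names I made up, and the installer is passed in as an argument purely so it can be swapped out (for testing, or for a different install command).

```python
import importlib
import subprocess
import sys


def pip_install_user(package):
    """Install `package` into the user site with the current interpreter's pip."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", package])


def ensure_module(module_name, package=None, installer=pip_install_user):
    """Import `module_name`, installing `package` first if the import fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # The package name on PyPI may differ from the module name.
        installer(package or module_name)
        # Make the import system notice files created since the last lookup.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

Inside the task you would then call something like np = ensure_module("numpy"). Keep in mind it still raises ImportError if the installation didn't actually make the module importable.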
Since you have access to the workers, a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda, which can package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric. It doesn't require administrative privileges, and all you really need is bash and some way to fetch the basic installers (wget, curl, rsync, scp).
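Once the environment is in place, it is worth checking that the workers really run the interpreter you expect. A minimal sketch using only the standard library (describe_environment is a hypothetical helper, not part of any library) that you could run as a task on each worker:

```python
import importlib.util
import sys


def describe_environment(module_name="numpy"):
    """Report which interpreter is running and whether `module_name` is importable."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,  # path of the running Python binary
        "prefix": sys.prefix,          # root of the active environment
        "module_found": spec is not None,
        "module_origin": spec.origin if spec is not None else None,
    }


print(describe_environment())
```

If the reported executable or prefix points at the system Python instead of the new environment, the job is not using the environment you created.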
See also: shipping python modules in pyspark to other nodes?