
Enabling Native Acceleration for MLlib

The undefined symbol issue

I got onto the ship of machine learning, and soon hit the wall of an `undefined symbol` issue on the lab cluster.

Mind you, it was just the simple `MinMaxScaler` example code published in the Spark MLlib (Machine Learning) guide!
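
For reference, here is roughly what that example looks like, sketched in Spark 1.6-era syntax to match this cluster (the canonical version lives in the MLlib guide):

    // Sketch of the MLlib guide's MinMaxScaler example (Spark 1.6-era API).
    import org.apache.spark.ml.feature.MinMaxScaler
    import org.apache.spark.mllib.linalg.Vectors

    val dataFrame = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.1, -1.0)),
      (1, Vectors.dense(2.0, 1.1, 1.0)),
      (2, Vectors.dense(3.0, 10.1, 3.0))
    )).toDF("id", "features")

    val scaler = new MinMaxScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")

    // fit() computes per-column min/max; transform() rescales each feature.
    val scaledData = scaler.fit(dataFrame).transform(dataFrame)
    scaledData.show()   // <- spark-shell died here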

Everything went well until the last line, `scaledData.show()`. Boom: spark-shell died with the following message on the console:

/usr/java/jdk1.7.0_67-cloudera/bin/java: symbol lookup error: /tmp/jniloader82069440205403545netlib-native_system-linux-x86_64.so: undefined symbol: cblas_daxpy

No log. No history server record. Nothing to go on for debugging at first glance. (For the record, `cblas_daxpy` is the CBLAS entry point for the BLAS level-1 update y &lt;- a*x + y, one of the routines MLlib reaches through netlib-java.)

Solution (Wrap-up)

My solution is based on CDH 5.11.0 (parcel) plus the Cloudera GPLExtras parcel plus the Intel MKL library parcel. The steps to enable MKL native acceleration for Cloudera Spark should be as simple as:

  1. Install `netlib-java` by integrating the GPLExtras parcel, as described in Enable Native Acceleration For MLlib

  2. Install the MKL library parcel by following Download Intel Math Kernel Library (Intel MKL) for Cloudera

  3. Enable the MKL library for `netlib-java` by performing a dirty hack on every cluster node:

    alternatives --install /usr/lib64/libblas.so.3 libblas.so.3 /opt/cloudera/parcels/mkl/linux/mkl/lib/intel64/libmkl_rt.so 2000
    alternatives --install /usr/lib64/liblapack.so.3 liblapack.so.3 /opt/cloudera/parcels/mkl/linux/mkl/lib/intel64/libmkl_rt.so 2000

Verification Notes

Do not verify in local mode: the Cloudera parcels (GPLExtras and Intel MKL) are enabled by evaluating parcel environment shell scripts, and those scripts do not seem to be run in local mode.
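
To check which backend a running cluster actually loaded, one can ask netlib-java directly from a spark-shell on YARN. A small sketch (class names per the netlib-java API; `NativeSystemBLAS` means a native system library was found, `F2jBLAS` means the pure-Java fallback):

    // On the driver: print the BLAS/LAPACK backends netlib-java picked.
    import com.github.fommil.netlib.{BLAS, LAPACK}
    println(BLAS.getInstance().getClass.getName)    // expect ...NativeSystemBLAS
    println(LAPACK.getInstance().getClass.getName)

    // On an executor (the alternatives hack must be applied on every node):
    sc.parallelize(1 to 1, 1)
      .map(_ => com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
      .collect()
      .foreach(println)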

Final setup summary

  • OS: RHEL 7
  • Cluster: CDH 5.11.0
    • GPLExtras Parcel
    • MKL Parcel

Troubleshooting Log and Background Information

The solution was developed through many rounds of cluster reconfiguration and verification.

The information needed to set up native acceleration for Cloudera Spark came from various sources:

  • The Cloudera CDH user guide
  • The netlib-java GitHub README.md
  • The Cloudera community forum
  • The Stack Overflow community

The undefined symbol error

As described at the beginning, the only clue to the issue was the error message itself. Googling the error led to a GitHub issue of netlib-java.

The solution proposed there is to use `LD_PRELOAD` to override the OS default `libblas.so` with the OpenBLAS library, e.g. by launching the JVM with `LD_PRELOAD=/path/to/libopenblas.so`.

To my understanding, this would require script hacking on every cluster node. On our Cloudera-based cluster, the GPLExtras are set up by parcel, and I could not draw a clear conclusion about how much effort the next deployment would take.

netlib-java

To ease adoption, I decided to look for an OpenBLAS parcel for Cloudera. During the search I discovered the press release Cloudera And Intel Speed Up Machine Learning Workloads With Apache Spark, Intel Math Kernel Library Integration, showing that Intel MKL is a great option.

Continuing with Intel MKL, I found a great thread in the Cloudera community forum, in which OpenBLAS and MKL were mentioned, together with netlib-java (the JNI adapter for native linear algebra libraries).

The thread also mentions that, according to the netlib-java README.md, the library is able to work with any BLAS implementation.

One can swap one BLAS implementation for another simply by replacing the shared library while keeping the original filename; the netlib-java JNI loader will seamlessly pick up the new library under the same name.
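
For what it is worth, the README also documents JVM system properties for pinning a specific implementation, which is handy for A/B comparisons. A small sketch (property and class names as documented in the netlib-java README; it must run before anything in the JVM touches BLAS, since the choice is made once at class-initialization time):

    // Force the pure-Java F2J fallback, e.g. to compare against native speed.
    System.setProperty("com.github.fommil.netlib.BLAS",
                       "com.github.fommil.netlib.F2jBLAS")
    import com.github.fommil.netlib.BLAS
    println(BLAS.getInstance().getClass.getName)  // now F2jBLAS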

The recommended way to switch libraries on Linux is the Debian-style update-alternatives mechanism (the `alternatives` command on RHEL).

At the end of the netlib-java README, the author also provides a great set of performance benchmarks for various BLAS libraries, which shows that Intel MKL gives the best performance on Intel CPUs.

The Intel MKL parcel

As described in the previous section, there is a press release showing that Cloudera and Intel have already worked out a solution to enable MKL-based acceleration for Cloudera Spark.

At the end of the press release is the great news that Cloudera and Intel will provide an MKL parcel.

A parcel, in Cloudera terms, usually means the packaged feature is ready to use once the parcel is activated in the cluster. This is not true for the MKL parcel: it does not work as expected out of the box, and the undefined symbol issue remains.

It turns out that Intel did not provide any alternatives.json in their MKL parcel (el7 flavor). So even after the parcel is activated, the OS-bundled libblas.so is unchanged, and the native BLAS library netlib-java finds is still the old system one.
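
A quick way to confirm what the loader will pick up is to resolve the symlink chain behind `libblas.so.3`. A hypothetical check, runnable from a Scala REPL on the node (the path assumes the el7 layout used above):

    // Resolve /usr/lib64/libblas.so.3 through /etc/alternatives to its target.
    // After the MKL hack it should end at libmkl_rt.so; before, at the stock BLAS.
    import java.nio.file.{Files, Paths}
    val blas = Paths.get("/usr/lib64/libblas.so.3")
    println(if (Files.exists(blas)) blas.toRealPath().toString
            else "libblas.so.3 not found")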

RHEL/Fedora BLAS library packaging mess

But why did Intel not provide the libblas alternatives for the el7 parcel?

Reading through the related distribution issue records, my understanding is that this is a matter of packaging opinion (the stock RHEL/Fedora `libblas.so.3` apparently exports only the Fortran BLAS symbols, not the `cblas_*` CBLAS entry points), and it will remain no-fix for a long time.

So, as a RHEL/Fedora user, I am on my own: I will create the alternatives myself.

The OpenBLAS journey

According to the netlib-java benchmarks, OpenBLAS is the second-best option, so I gave it a try as well.

Within the EPEL repo, OpenBLAS comes in all kinds of flavors.

There is not much description of these different packagings; one has to choose a library by its name.

After creating alternatives for the OpenMP x64 flavor, the cluster ran very well, until I got the following coredump while running the multilayer perceptron classifier example code:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 28.0 failed 4 times, most recent failure: Lost task 1.3 in stage 28.0 (TID 113, datanode2, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed: container_1535469577308_0009_01_000005 on host: datanode2. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1535469577308_0009_01_000005
Exit code: 134
Exception message: /bin/bash: line 1: 27674 Aborted                 (core dumped) LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/../../../CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/../../../GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/hadoop/lib/native:/opt/cloudera/parcels/mkl-2018.3.222/linux/tbb/lib/intel64_lin/gcc4.7:/opt/cloudera/parcels/mkl-2018.3.222/linux/compiler/lib/intel64_lin:/opt/cloudera/parcels/mkl-2018.3.222/linux/mkl/lib/intel64_lin:/opt/cloudera/parcels/GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/impala/lib:/opt/cloudera/parcels/GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/lib/native /usr/java/jdk1.7.0_67-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms16384m -Xmx16384m -Djava.io.tmpdir=/data/4/yarn/nm/usercache/ericsson/appcache/application_1535469577308_0009/container_1535469577308_0009_01_000005/tmp '-Dspark.authenticate.enableSaslEncryption=false' '-Dspark.authenticate=false' '-Dspark.driver.port=42674' '-Dspark.shuffle.service.port=7337' -Dspark.yarn.app.container.log.dir=/data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.163.170.11:42674 --executor-id 4 --hostname datanode2 --cores 5 --app-id application_1535469577308_0009 --user-class-path file:/data/4/yarn/nm/usercache/ericsson/appcache/application_1535469577308_0009/container_1535469577308_0009_01_000005/__app__.jar > /data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005/stdout 2> /data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 27674 Aborted                 (core dumped) LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/../../../CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/../../../GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/hadoop/lib/native:/opt/cloudera/parcels/mkl-2018.3.222/linux/tbb/lib/intel64_lin/gcc4.7:/opt/cloudera/parcels/mkl-2018.3.222/linux/compiler/lib/intel64_lin:/opt/cloudera/parcels/mkl-2018.3.222/linux/mkl/lib/intel64_lin:/opt/cloudera/parcels/GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/impala/lib:/opt/cloudera/parcels/GPLEXTRAS-5.11.0-1.cdh5.11.0.p0.30/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hadoop/lib/native /usr/java/jdk1.7.0_67-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms16384m -Xmx16384m -Djava.io.tmpdir=/data/4/yarn/nm/usercache/ericsson/appcache/application_1535469577308_0009/container_1535469577308_0009_01_000005/tmp '-Dspark.authenticate.enableSaslEncryption=false' '-Dspark.authenticate=false' '-Dspark.driver.port=42674' '-Dspark.shuffle.service.port=7337' -Dspark.yarn.app.container.log.dir=/data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.163.170.11:42674 --executor-id 4 --hostname datanode2 --cores 5 --app-id application_1535469577308_0009 --user-class-path file:/data/4/yarn/nm/usercache/ericsson/appcache/application_1535469577308_0009/container_1535469577308_0009_01_000005/__app__.jar > /data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005/stdout 2> /data/5/yarn/container-logs/application_1535469577308_0009/container_1535469577308_0009_01_000005/stderr
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
    at org.apache.hadoop.util.Shell.run(Shell.java:504)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 134
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1644)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1862)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1982)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
    at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:218)
    at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
    at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
    at breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
    at breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
    at breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
    at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
    at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
    at org.apache.spark.ml.ann.FeedForwardTrainer.train(Layer.scala:878)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier.train(MultilayerPerceptronClassifier.scala:170)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier.train(MultilayerPerceptronClassifier.scala:110)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:48)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
    at $iwC$$iwC$$iwC.<init>(<console>:56)
    at $iwC$$iwC.<init>(<console>:58)
    at $iwC.<init>(<console>:60)
    at <init>(<console>:62)
    at .<init>(<console>:66)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1000)
    at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1205)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1172)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1165)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:498)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

