
Posts

Showing posts with the label "spark"

Spark: No suitable driver found for jdbc

The famous driver-not-found issue

I was asked to export a batch of data from the big data platform to MySQL recently. After some careful consideration, I decided to code the logic in Spark with the DataFrameWriter.write.jdbc() interface. Everything went well until the integration test, in which I had to install my Spark app for a real run:

java.sql.SQLException: No suitable driver found for jdbc:mysql://bla:bla/bla

Background 1: I did follow the Spark JDBC guide

Actually, I ran a simulation by following DATABRICKS: Connecting to SQL Databases using JDBC in my own Zeppelin notebook, and everything went well.

Background 2: I did follow the MySQL JDBC guide

I did follow the coding sample in the official MySQL Connector/J: 6.1 Connecting to MySQL Using the JDBC DriverManager Interface:

// The newInstance() call is a work around for some
// broken Java implementations
Class.forName("com.mysql.jdbc.Driver").newInstance();

Background 3: I did attach the jar dependency

I did a...
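The excerpt cuts off before the fix, but one common resolution is worth sketching: name the driver class explicitly in the JDBC properties so DriverManager does not have to discover it. This is a minimal sketch, assuming the MySQL connector jar is shipped via spark-submit --jars; someDf, the credentials, URL, and table name are hypothetical placeholders:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")        // hypothetical credentials
props.setProperty("password", "dbpass")
// Naming the driver class explicitly sidesteps the DriverManager lookup
// that otherwise fails with "No suitable driver found" on the executors.
props.setProperty("driver", "com.mysql.jdbc.Driver")

someDf.write
  .mode("append")
  .jdbc("jdbc:mysql://host:3306/db", "target_table", props)  // hypothetical URL/table

Launch with something like spark-submit --jars /path/to/mysql-connector-java.jar ... so the driver class is on both the driver and executor classpaths.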

Exchange data between Zeppelin pyspark and spark sessions

Problem: DataFrames are not shared?

As of Zeppelin 0.7.0, Spark DataFrames are not shared between the %pyspark (Python) and %spark (Scala) sessions.

Solution: exchange them through a temporary table

Do the following:

#%pyspark
somedf.registerTempTable("somedftable")

and then rebuild the DataFrame in the Scala session:

//%scala
val somedf = sqlContext.table("somedftable")
z.show(somedf.limit(20))
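Worth noting: registerTempTable was deprecated in Spark 2.0 in favor of createOrReplaceTempView. On a newer Zeppelin/Spark stack the same round trip would look like this (a sketch, reusing the somedftable name from above):

#%pyspark
somedf.createOrReplaceTempView("somedftable")

//%scala
val somedf = spark.table("somedftable")  // 'spark' is the shared SparkSession
z.show(somedf.limit(20))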

Scala case class and No TypeTag available

No TypeTag available when compiling…

The following code works well in a Zeppelin section:

val textRdd = sc.textFile("hdfs://nameservice1/user/myname/bank/bank.csv")
case class TextLine(lineText: String)
val modelDates = textRdd.map(s => TextLine(s.trim)).toDF()
modelDates.sort(col("lineText").desc).as[String].first()

But if I turn it into a function, error!!

def maxValueIn(hdfsPath: String) = {
  val textRdd = sc.textFile(hdfsPath)
  case class TextLine(lineText: String)
  val modelDates = textRdd.map(s => TextLine(s.trim)).toDF()
  modelDates.sort(col("lineText").desc).as[String].first()
}

Solution: move the case class out of the def!!!

As described in the Stack Overflow question, move the case class out of the method and the code finally compiles.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.col

trait HdfsTextFile {
  val spark: SparkContext
  val sqlContext: HiveContext
  impo...
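Putting the pieces together, a complete version of the fixed code might look like the following. This is a minimal sketch, assuming Spark 2.x with a SparkSession in place of the post's SparkContext/HiveContext pair:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// The case class sits at top level, so the compiler can derive its TypeTag.
case class TextLine(lineText: String)

def maxValueIn(spark: SparkSession, hdfsPath: String): String = {
  import spark.implicits._
  val textRdd = spark.sparkContext.textFile(hdfsPath)
  val modelDates = textRdd.map(s => TextLine(s.trim)).toDF()
  modelDates.sort(col("lineText").desc).as[String].first()
}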

Cloudera 5.11.x Spark action on Oozie fails to access Hive table

Spark workflow fails because it cannot access a Hive table

This is an odd issue. The same Spark program will fail to access a Hive table when scheduled by Oozie, but runs well when launched manually with spark-submit.

Solution

Add hive-site.xml to the Spark workflow to make sure the Hive context is correctly initialized.

Put hive-site.xml onto HDFS
On one of the cluster nodes, find the Hive configuration file 'hive-site.xml' under /etc/hive/conf and copy it to somewhere on HDFS.

Add hive-site.xml as one of the "FILES" of the Spark workflow
In the Hue workflow editor, click the plus sign next to "FILES" to add a new "FILE" element, then fill in the HDFS path of the just-copied 'hive-site.xml'. A minimal sketch of the resulting workflow definition is shown below.
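Behind the Hue editor, the change boils down to a <file> element in the generated workflow.xml. This is a sketch only; the action name, app name, class, and paths are hypothetical:

<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>my-spark-app</name>
        <class>com.example.Main</class>
        <jar>${nameNode}/user/me/lib/my-spark-app.jar</jar>
        <!-- Ship hive-site.xml so the Hive context initializes correctly -->
        <file>${nameNode}/user/me/conf/hive-site.xml#hive-site.xml</file>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>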

Enabling Native Acceleration for MLlib

The undefined symbol issue

I got onto the ship of machine learning, and soon hit the wall of the `undefined symbol' issue on the lab cluster. Hey, it is just the simple `MinMaxScaler' example code published in the Spark MLlib (Machine Learning) guide!! Everything goes well until the last line, `scaledData.show()`: boom. Spark-shell dies with the following message on the console:

/usr/java/jdk1.7.0_67-cloudera/bin/java: symbol lookup error: /tmp/jniloader82069440205403545netlib-native_system-linux-x86_64.so: undefined symbol: cblas_daxpy

No log. No history server record. Nothing that could be used for debugging at first glance.

Solution (wrap-up)

My solution is based on CDH 5.11.0 (parcel) plus the Cloudera GPL Extras parcel plus the Intel MKL library (parcel). The steps to enable MKL native acceleration for Cloudera Spark should be as simple as:

Install `netlib-java` by integrating the GPL Extras parcel as described in Enabling Native Acceleration for MLlib
Install the MKL library parcel by following Download Intel Ma...
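The excerpt stops mid-step, but once the parcels are in place there is a quick way to verify that the native BLAS actually loaded: ask netlib-java which implementation it picked. A sketch to paste into spark-shell; seeing NativeSystemBLAS rather than F2jBLAS means the native path is active:

import com.github.fommil.netlib.{BLAS, LAPACK}

// Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when MKL is wired in,
// or com.github.fommil.netlib.F2jBLAS when it has fallen back to pure Java.
println(BLAS.getInstance().getClass.getName)
println(LAPACK.getInstance().getClass.getName)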