Accessing a Spark RDD from a web browser through the Thrift server - Java
We processed our data with Spark 1.2.1 (Java) and stored it in Hive tables. We want to access this data as an RDD from a web browser.
I read the documentation and figured out the steps to complete this task, but I can't seem to find a way to interact with Spark SQL RDDs through a Thrift server. The examples I found have the line below in the code, and I can't find that class in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I've seen Scala examples using import org.apache.spark.sql.hive.thriftserver, but I don't see this in the Java API docs. Am I missing something?
Has anyone had any luck accessing a Spark SQL RDD from a browser through the Thrift server? Can you post a code snippet? We are using Java.
I got most of this working. Let me break it down piece by piece (links at the bottom of the post):
HiveThriftServer2.startWithContext
is defined in Scala. I was never able to access it from Java, or from Python using Py4J; I'm not a JVM expert, but I switched to Scala. It may have something to do with the @DeveloperApi annotation. This is how I imported it in Scala on Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For those reading this and not using Hive: a plain Spark SQL context will not work, you need a HiveContext. Also note that the HiveContext constructor expects a Scala SparkContext, not a Java one, so if you are holding a JavaSparkContext you have to convert it:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the Thrift server:
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we need to convert them to Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then we need to turn them into Spark SQL tables. You can do this either by saving them to Hive, or by registering the RDD as a temporary table.
Saving to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write().saveAsTable("someTable")
Or use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note: temporary tables are isolated per SQL session. The Spark Thrift server is multi-session by default as of 1.6 (one session per connection). Therefore, for clients to be able to access the temporary tables you registered, you need to set the option spark.sql.hive.thriftServer.singleSession to true.
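For example, a sketch of setting that option (the exact mechanism depends on how you launch things; the paths below assume a standard Spark 1.6 distribution layout):

```shell
# Option 1: add the line to conf/spark-defaults.conf
#   spark.sql.hive.thriftServer.singleSession  true

# Option 2: pass it on the command line when starting the Thrift server
./sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true
```

Either way, the option must be in effect before the Thrift server starts, since it controls how sessions are created.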
You can verify this by querying the tables in beeline, the Thrift server's command-line utility. It ships with Spark.
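A sketch of that check (hostname, port, and credentials are placeholders; 10000 is the default Thrift server port):

```shell
# Connect beeline (bundled with Spark) to the running Thrift server
./bin/beeline -u jdbc:hive2://localhost:10000 -n user -p password

# Then, at the beeline prompt, confirm your tables are visible:
#   show tables;
#   select * from someTable limit 10;
```

If the temporary table does not show up here, revisit the singleSession setting above.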
Finally, you need a way to access the Hive Thrift server from a browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app you can speak the Thrift protocol via AJAX requests from the browser. A simpler strategy might be to create an IPython notebook and use pyhive to connect to the Thrift server.
DataFrame docs: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession pull request: https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31@git.apache.org%3E
HTTP mode and beeline howto: https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive: https://github.com/dropbox/PyHive
HiveThriftServer2.startWithContext definition: https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L73
Thrift is a JDBC/ODBC server. You can connect to it over JDBC/ODBC and access its content through the HiveDriver.
You will not get an RDD back from it, because the HiveContext is not available over that connection.
What you are referring to is an experimental feature that is not available for Java.
As a workaround, you can re-parse the results and build your own structures for your client.
For example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Register the Hive JDBC driver, then query the table
Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from " + tableName;
ResultSet res = stmt.executeQuery(sql);
// parseResultsToObjects is your own method mapping ResultSet rows to client objects
parseResultsToObjects(res);