Accessing Spark RDDs from a web browser through a Thrift server - Java

We processed our data with Spark 1.2.1 (Java) and stored it in Hive tables. We want to access this data as RDDs from a web browser.

I've read the documentation, but I haven't figured out the steps to complete this task.

I can't seem to find a way to interact with Spark SQL RDDs through a Thrift server. The examples I found contain the line below, and I can't find that class in the Spark 1.2.1 Java API docs.

HiveThriftServer2.startWithContext

On GitHub, I've seen Scala examples using import org.apache.spark.sql.hive.thriftserver, but I don't see it in the Java API docs. Not sure if I'm missing something.

Has anyone had any luck accessing Spark SQL RDDs from a browser through Thrift? Could you post a code snippet? We are using Java.


2 answers


I got most of this working. Let's analyze each part of it (links at the bottom of this post):

HiveThriftServer2.startWithContext

is defined in Scala. I was never able to access it from Java, or from Python using Py4J; I'm no JVM expert, so I switched to Scala. It may have something to do with the @DeveloperApi annotation. This is how I imported it in Scala on Spark 1.6.1:

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
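(A side note of mine: if you compile with sbt instead of using the shell, this class lives in the separate spark-hive-thriftserver module. The exact coordinates below are an assumption, so check them against your Spark version:)

libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.6.1" // hypothetical sbt line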


For those following along who don't use Hive: a plain Spark SQL context won't do, you need a HiveContext. Note that the HiveContext constructor requires a Scala SparkContext, not a Java one, so if all you have is a JavaSparkContext, convert it:

import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext

// sc is a JavaSparkContext; HiveContext wants the underlying Scala SparkContext
val hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))

Now start the Thrift server:

HiveThriftServer2.startWithContext(hiveContext) // Yay

Next, we need to make our RDDs available as SQL tables. First, we need to convert them to Spark SQL DataFrames:

val someDF = hiveContext.createDataFrame(someRDD)
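For concreteness, here is a minimal sketch of where someRDD could come from. The Person case class and its data are made up for illustration; Spark infers the DataFrame schema from the case class by reflection:

// Hypothetical example data
case class Person(name: String, age: Int)

// sc and hiveContext are the contexts created above
val scalaSc = JavaSparkContext.toSparkContext(sc)
val someRDD = scalaSc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
val someDF = hiveContext.createDataFrame(someRDD)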


Then we need to turn them into Spark SQL tables. You do this either by saving them to Hive, or by registering the DataFrame as a temporary table.

Saving to Hive:



// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")

// Current at the time of writing:
someDF.write.saveAsTable("someTable")

Or use a temporary table:

// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
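Either way, you can sanity-check from the same context that the table is now visible to SQL before pointing clients at it (a quick check of my own, not part of the original recipe):

hiveContext.sql("SELECT * FROM someTable LIMIT 10").show()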


Note: temporary tables are isolated per SQL session. The Spark Thrift server is multi-session by default in version 1.6 (one session per connection). Therefore, for clients to see the temporary tables you registered, you need to set the option spark.sql.hive.thriftServer.singleSession to true.
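One way to set the option (a sketch, assuming you configure it on the SparkConf before any context is created; the option only exists as of Spark 1.6):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.sql.hive.thriftServer.singleSession", "true")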

You can verify this by querying the tables in beeline, the Thrift server command-line client that ships with Spark.
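For example, assuming the server listens on the default port 10000 on localhost (host and port are assumptions, not from the original answer):

beeline -u jdbc:hive2://localhost:10000

Then, at the beeline prompt:

SHOW TABLES;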

Finally, you need a way to access the Hive Thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can speak the Thrift protocol over AJAX requests from the browser. A simpler strategy might be to build an IPython notebook and use pyhive to connect to the Thrift server.
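For reference, HTTP mode is enabled through the usual Hive properties, set as system properties or in hive-site.xml; these names and defaults are taken from the distributed SQL engine guide linked below:

hive.server2.transport.mode = http
hive.server2.thrift.http.port = 10001 (default)
hive.server2.http.endpoint = cliservice (default)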

DataFrame docs: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html

singleSession option commit: https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31@git.apache.org%3E

HTTP mode and beeline howto: https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine

Pyhive: https://github.com/dropbox/PyHive

HiveThriftServer2 startWithContext definition: https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L73


Thrift is a JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access the content through the HiveDriver.
You will not get RDDs back from it, because HiveContext is not available there.
What you are referring to is an experimental feature, not available for Java.

As a workaround, you could re-parse the ResultSet and build your own structures for your client.
For example:



import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Register the Hive JDBC driver, then query over a plain JDBC connection
Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from " + tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res); // your own mapping from JDBC rows to client objects
