How can I store a graph in HBase and run PageRank-like analytics on it?

Sorry if this question seems a little tricky, but I think it's all related, so I wanted to try to get the answer in one shot. Basically, I have a multi-level graph * made up of datasets that are each connected only to the next dataset (so set1 has vertices with edges to set2, and so on, but set1 has no edges to set3 or to anything other than set2. This detail may not matter). You can think of my data as one massive family tree (each set I add has about a billion nodes) that I keep extending with a new generation in every new set (families create new families, and no edges go back) ...

I have an HBase / Hadoop system running and I know how to use Java to add columns and values, but I don't know how to:

  • Add data to HBase in a graph-like format. Since it's HBase, I want to load it in such a way that I can add a ton of data and it will scale, unlike other databases that limit the graph to the size of a single system. I know how to add data, but I don't understand how to lay it out as a scalable graph.
  • Once the graph is loaded, apply some analytics to it. PageRank is popular, so I thought I'd mention it, but pretty much anything based on graph processing would do.

I guess the simplified way to ask the question is: how can I get the graph into HBase, and once it's there, how do I process it? Is there a tutorial? There is a lot of information on the internet about HBase (I read the HBase book), but I couldn't find anything specific to graphs. I found Giraph, but I don't think it can connect to HBase (yet). Since Hadoop / HBase are implementations of MapReduce / Bigtable, I suspect there is a way to process graphs. I just can't find anything.

* A layered graph is a directed graph whose vertices are grouped into levels, for example: http://en.wikipedia.org/wiki/Layered_graph_drawing



2 answers


I think this question on SO might help:

fooobar.com/questions/1107457 / ...



This part of my answer to this question might be helpful.

Support for using HBase / Accumulo as input to Giraph was recently proposed (7 March 2012) in a new Giraph feature request: HBase / Accumulo input and output formats (GIRAPH-153)



We use Giraph this way: we store only the minimum data at each vertex, run the graph algorithm with Giraph, and then assemble the data-rich result with Pig. For the PageRank algorithm, each vertex only needs to store the vertex ID and its rank, so it can scale to nearly the billion-vertex level.
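
To illustrate why so little per-vertex state is enough, here is a minimal single-machine sketch of PageRank in plain Python. It is not Giraph code and does not touch HBase or Pig; the function name `pagerank`, the `out_edges` dictionary, and the damping and iteration defaults are all assumptions made for the example. The only value kept per vertex between iterations is its rank, keyed by vertex ID.

```python
from typing import Dict, List


def pagerank(out_edges: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 30) -> Dict[str, float]:
    """Iterative PageRank over an adjacency list (hypothetical helper)."""
    # Collect every vertex ID, including ones that only appear as targets.
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)

    # Per-vertex state is just (vertex id -> rank).
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        dangling = 0.0  # rank held by vertices with no outgoing edges
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                # Each vertex sends an equal share of its rank along its edges.
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
            else:
                dangling += rank[v]
        # Dangling rank is spread evenly so the ranks keep summing to 1.
        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return rank


if __name__ == "__main__":
    # A tiny layered graph: edges only go from one level to the next.
    layered = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    for vertex, score in sorted(pagerank(layered).items()):
        print(vertex, round(score, 4))
```

In a layered graph like the one in the question, the last generation has no outgoing edges, so the sketch redistributes that "dangling" rank evenly; that is one common convention, not necessarily what a given Giraph job does.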


