Query a complex data structure in memory
First post on this questions site, but I have a tricky issue that I've been looking at for days.
Background
At work we are introducing a new billing system. We want to take the unprecedented step of actually auditing the new billing system against the old one on an ongoing basis, which will make it significantly more reliable. The reason is that the new billing system is much more flexible for our new billing plans, so marketing is really pushing us to implement it.
Our IT group developed a ridiculously expensive report that runs daily at 8am over the previous day's data, compares the records for byte-count discrepancies, and generates a report. This is not very helpful for us: for one, it only runs the next day, and secondly, if it shows bad results, we have no indication as to why we had the problem the day before.
So, we want to create our own system that can connect to any data source (initially just the new and old User Data Record (UDR) systems) and compare the results in real time.
Just some notes on scale: each billing system produces approximately 6 million records per day, for a total file size of about 1 gigabyte.
My suggested setup
Basically, buy multiple servers: we have a budget for several 8-core / 32GB machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy a larger server if needed, but after a couple of days I see no reason to keep the data in memory anymore; at that point the raw records can be written to persistent storage and the aggregate statistics stored in the database.
Each record essentially contains the platform record id, correlation id, username, login time, duration, bytes in, bytes out, and a few other fields.
I was thinking of using a rather complex data structure for processing. Each record will be broken into a user object and a record object that belongs to either platform A or platform B. At the top level there will be a self-balancing binary search tree keyed on the username. The next level down will be something like a skip list by date, so we will have next matched_record, next day, next hour, next month, next year, etc. Finally, we will have our matched record object, essentially just a holder, which refers to the udr_record object from system A and the udr_record object from system B.
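To make the shape a little more concrete, here's a very rough sketch of what I have in mind. All of the class and member names are made up, and I'm using SortedDictionary purely as a stand-in for both the self-balancing tree and the skip list:

using System;
using System.Collections.Generic;

// one raw record from either platform
class UdrRecord {
    public string CorrelationId { get; set; }
    public string Username { get; set; }
    public DateTime LoginTime { get; set; }
    public TimeSpan Duration { get; set; }
    public long BytesIn { get; set; }
    public long BytesOut { get; set; }
}

// holder pairing the record from system A with the record from system B
class MatchedRecord {
    public UdrRecord SystemA { get; set; }
    public UdrRecord SystemB { get; set; }
}

// per-user node; the date-ordered dictionary stands in for the skip list by date
class UserNode {
    public string Username { get; set; }
    public SortedDictionary<DateTime, MatchedRecord> RecordsByTime =
        new SortedDictionary<DateTime, MatchedRecord>();
}

// top level; SortedDictionary stands in for the self-balancing BST on username
class UdrIndex {
    private readonly SortedDictionary<string, UserNode> users =
        new SortedDictionary<string, UserNode>();

    public UserNode GetOrAddUser(string username) {
        UserNode node;
        if (!users.TryGetValue(username, out node)) {
            node = new UserNode { Username = username };
            users.Add(username, node);
        }
        return node;
    }
}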
I would run a series of internal analytics as data is added to see whether the new billing system has choked, or has started to show large discrepancies against the old system, and if so send an alarm to our operations center to investigate. I have no problem with this part.
Problem
The problem I am facing comes down to ad-hoc statistical analysis: I want to see if I can come up with some kind of query language where the user can enter a query, such as for the top contributors to an alarm, see which records contributed to the mismatch, and then drill down and investigate. I originally wanted to use a Wireshark-like filter syntax with some SQL mixed in.
Example:
udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepancy_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
Another option would be to use DLINQ, but I've been away from C# for several years now, so I'm not 100% up to speed on .NET 3.5. Also, I'm not sure it can handle the data structure I was planning to use. The real question: can I get any feedback on how to approach taking the query string from the user, parsing it, applying it to the data structure (which has a few more attributes than described above), and getting the resulting list back? I can handle the rest on my own.
I'm totally prepared to hard-code many of the possible queries and just have them be more like reports that run with some parameters, but if there is a good, clean way to do this type of query syntax, I think it would be very cool to add.
Actually, for the query type you specify, dynamic LINQ is a good fit. Otherwise you'll end up writing much the same thing yourself - a parser and a mechanism for mapping attributes. Unfortunately it isn't an exact hit, as you have to split out things like the OrderBy, and dates have to be parameterized, but here's a working example:
// requires the dynamic LINQ sample (the System.Linq.Dynamic namespace)
using System;
using System.Linq;
using System.Linq.Dynamic;

class Udr { // formatted for space
    public int BytesIn { get; set; }
    public UdrAnalysis Analysis { get; set; }
    public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
    public int Discrepency { get; set; }
    public int DiscrepencyPercent { get; set; }
}
static class Program {
    static void Main() {
        Udr[] data = new[] {
            new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
                Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 } },
            new Udr { BytesIn = 500, StartedDate = DateTime.Today,
                Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 } }
        };
        DateTime when = DateTime.Parse("2008-11-10 22:00:44");
        var query = data.AsQueryable().Where(
                @"bytesin > 1000 && (analysis.discrepency > 100000
                  || analysis.discrepencypercent > 100)
                  && starteddate > @0", when)
            .OrderBy("analysis.discrepency DESC")
            .Take(10);
        foreach (var item in query) {
            Console.WriteLine(item.BytesIn);
        }
    }
}
You can of course take the dynamic LINQ sample and tweak the parser to do more of what you need...
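For the bits dynamic LINQ doesn't cover, a thin layer of string handling in front of it may be enough. Purely as an illustration (this helper and its handling of "order by" / "LIMIT" are my own sketch, not part of the sample), something like this could split the user's query into the pieces that Where, OrderBy and Take expect:

using System;
using System.Text.RegularExpressions;

static class QuerySplitter {
    // Illustrative helper only: splits "where-part order by X DESC LIMIT n"
    // into the pieces that Where / OrderBy / Take expect. Deliberately naive.
    public static void Split(string input, out string whereClause,
                             out string orderByClause, out int limit) {
        limit = int.MaxValue;
        orderByClause = null;

        Match m = Regex.Match(input, @"\s+LIMIT\s+(\d+)\s*$", RegexOptions.IgnoreCase);
        if (m.Success) {
            limit = int.Parse(m.Groups[1].Value);
            input = input.Substring(0, m.Index);
        }

        m = Regex.Match(input, @"\s+order\s+by\s+(.+)$", RegexOptions.IgnoreCase);
        if (m.Success) {
            orderByClause = m.Groups[1].Value.Trim();
            input = input.Substring(0, m.Index);
        }

        whereClause = input.Trim();
    }
}

Split then gives you the three values to feed into Where(...), OrderBy(...) and Take(...) respectively, falling back to no ordering and no limit when they're absent.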
Whether you use DLINQ or not, I suspect you'll want to use LINQ somewhere in the solution, because it provides so many of the pieces you need.
How much protection do you need from your users, and who are your users? If it's just for a few very technically minded internal staff (for example, people who are already developers), you could just let them write a C# expression, use CSharpCodeProvider to compile the code, and then apply it to your data.
Obviously this requires your users to be able to write C#, or at least enough of it to express a query, and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges, and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you really don't want huge amounts of data crossing the AppDomain boundary.)
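Just to make the idea concrete, here is a minimal sketch of that approach (the wrapper source, class names and error handling are mine, and it assumes Udr is a public type in an assembly you can reference):

using System;
using System.CodeDom.Compiler;
using System.Reflection;
using Microsoft.CSharp;

static class UserQueryCompiler {
    // Wraps the user's C# expression in a predicate method, compiles it
    // in memory, and hands back a delegate suitable for LINQ's Where().
    // Note: Udr must be public; if it lives in a namespace, the generated
    // source below needs a matching using directive.
    public static Func<Udr, bool> CompilePredicate(string expression) {
        string source = @"
            using System;
            public static class UserQuery {
                public static bool Matches(Udr udr) {
                    return " + expression + @";
                }
            }";

        var provider = new CSharpCodeProvider();
        var options = new CompilerParameters { GenerateInMemory = true };
        options.ReferencedAssemblies.Add("System.dll");
        options.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);

        CompilerResults results = provider.CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors) {
            throw new InvalidOperationException(results.Errors[0].ErrorText);
        }

        MethodInfo matches = results.CompiledAssembly
            .GetType("UserQuery").GetMethod("Matches");
        return udr => (bool)matches.Invoke(null, new object[] { udr });
    }
}

Then something like data.Where(UserQueryCompiler.CompilePredicate(userText)) would apply the user's filter as part of a normal LINQ pipeline.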
As for LINQ in general - again, it's a good fit given your sizing issues:
Just some notes on scale: each billing system produces approximately 6 million records per day, for a total file size of about 1 gig.
LINQ can be used in fully streaming solutions. For example, your "source" might be a file reader. The Where would then iterate over the data checking individual rows, without having to buffer the whole lot in memory:
// assumes using System.Collections.Generic, System.IO and System.Linq;
// Foo is just a stand-in record type for this sketch
class Foo { public string Name { get; set; } public int Size { get; set; } }

// parse each pipe-delimited line into a Foo as it is read
static IEnumerable<Foo> ReadFoos(string path) {
    return from line in ReadLines(path)
           let parts = line.Split('|')
           select new Foo { Name = parts[0],
                            Size = int.Parse(parts[1]) };
}

// streams the file one line at a time - nothing is buffered
static IEnumerable<string> ReadLines(string path) {
    using (var reader = File.OpenText(path)) {
        string line;
        while ((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
Now this is lazy loading... we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
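For instance (reusing the sketch above; the file name is arbitrary), something like this keeps the streaming behaviour while still taking a dynamic filter:

// still one line at a time; the dynamic Where is applied as each row streams past
var matches = ReadFoos("yesterday.udr")
    .AsQueryable()
    .Where("size > @0", 1000)
    .Take(10);
foreach (var foo in matches) {
    Console.WriteLine(foo.Name);
}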
If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; it works especially well if you need to group data, since it doesn't buffer everything.
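I won't try to reproduce the Push LINQ API from memory here, but the underlying idea is a single pass over the data feeding several aggregators at once. A plain-delegates illustration of that idea (not the MiscUtil API; the source parameter would be a streaming reader like the one above):

// assumes using System and System.Collections.Generic
static void MultiAggregateDemo(IEnumerable<Udr> source) {
    long totalBytesIn = 0;
    int count = 0, bigDiscrepancies = 0;

    // each "aggregator" is just a delegate observing one record at a time
    Action<Udr> pump = udr => totalBytesIn += udr.BytesIn;
    pump += udr => count++;
    pump += udr => { if (udr.Analysis.Discrepency > 100000) bigDiscrepancies++; };

    foreach (var udr in source)   // single pass, nothing buffered
        pump(udr);

    Console.WriteLine("{0} records, {1} bytes in, {2} large discrepancies",
                      count, totalBytesIn, bigDiscrepancies);
}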
Finally, if you want binary storage, serializers such as protobuf-net can be used to create streaming solutions. At the moment that works best with the push approach of Push LINQ, but I expect it could be inverted for a regular IEnumerable<T> if needed.