Using Hadoop to Store Stock Market Data
I am excited to learn about Hadoop and the various projects around it, and I currently have two different strategies in mind for building a system to store a large collection of market tick data. I am just getting started with Hadoop/HDFS and HBase, but I'm hoping someone can help me plant a system seed that I won't have to re-architect later using these technologies. Below is an outline of my system and requirements, with some examples of how the data would be queried and used, and finally my current thinking on the best approach from the little documentation I have read. This is an open-ended question; I'm happy to take several answers and accept the one I feel is best, so feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to use the data store for historical back-testing of systems, historical charting, and future data mining. Once stored, the data will always be read-only; fast data access is desired but not a must-have when back-testing.
Static Schema - Very simple, I want to capture 3 types of messages from the feed:
- Timestamp including date, day, time
- Quote including symbol, timestamp, ask, askSize, bid, bidSize, volume .... (about 40 columns of data)
- Trade including symbol, timestamp, price, size, exchange .... (about 20 columns of data)
Data Insert Use Cases - Either from a live market feed or lookup via a broker API.
Data Query Use Cases - Below is how I would like to query my data logically:
- Get me all Quotes, Trades, and Timestamps for GOOG on 9/22/2014
- Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014
- Get me the number of trades for these 50 symbols for each day over the last 90 days.
Holy Grail - Can MapReduce be used for use cases like the ones below?
- Generate metadata from the raw market data through distributed agents. For example, write a job that computes the average trading volume on a 1-minute interval for all stocks and all sessions stored in the database. Create the job so that there is an agent for each stock/session that I tell which stock and session it should compute this value for. (Can MapReduce do this?)
- On the classpath of the agents, can I add my own utility code so that, for example, the use case above could publish its value to a central repo or a messaging server? Can I deploy an agent as an OSGi bundle?
- Create different types of agents for different types of metrics and scores that are run every morning before pre-market trading?
High Frequency Trading
I am also wondering if anyone can share any experience of using Hadoop in the context of high-frequency trading systems. Just getting into this technology, my initial sense is that Hadoop can be great for storing and processing large amounts of historical tick data; if anyone is using this for real-time trading, I'd be interested in learning more! - Duncan Krebs
Based on my understanding of your requirements, Hadoop would be a really good solution for storing your data and running your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
The hours folder can hold the data for that particular hour of the day.
One advantage of using such a structure is that you can create external tables in Hive on top of this data, with partitions for years, months, days and hours. Something like this:
Create external table stock_data (schema)
PARTITIONED BY (years bigint, months bigint, days bigint, hours int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '~/stock_data';
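One thing to keep in mind with an external partitioned table is that Hive will not discover the hourly directories on its own; each partition has to be registered before it becomes queryable. A minimal sketch, assuming the directory layout above (the partition values and path shown are just placeholders):

-- Register one hourly partition after the feed has written its directory
ALTER TABLE stock_data ADD IF NOT EXISTS
PARTITION (years = 2014, months = 201409, days = 20140925, hours = 1)
LOCATION '~/stock_data/years=2014/months=201409/days=20140925/hours=01';

-- Or, since the directories already follow the key=value naming convention,
-- let Hive scan the table location and add any missing partitions in one go
MSCK REPAIR TABLE stock_data;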
Coming to the queries part: once you have the data stored in the above format, you can easily run simple queries.
Get me all quotes, trades, timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all trades for GOOG, FB before 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901
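Your third example (trade counts per symbol per day over the last 90 days) is just another aggregate over the same table. A rough sketch, assuming a record_type column (or a separate trades table) to tell trades apart from quotes, and with the day cut-off as a placeholder for "90 days ago":

-- Trade counts per symbol per day; extend the IN list to your 50 symbols
select stock, days, count(*) as trades_per_day
from stock_data
where stock in ('GOOG', 'FB')
and record_type = 'TRADE'   -- assumed column separating trades from quotes
and days >= 20140624        -- placeholder for "90 days ago"
group by stock, days;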
You can run aggregate queries like these once a day and use the output to compute your metrics prior to pre-market trading. Since Hive internally runs MapReduce, these queries will not be very fast.
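As a concrete example of such an aggregate, the 1-minute average-volume job from your "holy grail" section maps fairly naturally onto a Hive query. The sketch below assumes the quote/trade timestamp is stored as epoch milliseconds in a ts column and that volume is a numeric column; both are assumptions about your schema:

-- Average trading volume per stock per 1-minute bucket for one session
select stock,
floor(ts / 60000) as minute_bucket,   -- ts assumed to be epoch milliseconds
avg(volume) as avg_volume
from stock_data
where days = 20140922                 -- restrict to one day/session as needed
group by stock, floor(ts / 60000);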
For faster results, you can use one of the in-memory engines such as Impala or Spark. I have used Impala myself to run queries on my Hive tables and have seen a major improvement in query run times (around 40x). Also, you would not need to make any changes to the structure of the data.
Data Insert Use Cases: You can use tools like Flume or Kafka to insert live data into Hadoop (and hence into the Hive tables). Flume scales linearly and can also help in processing events on the fly while transferring them.
All in all, a combination of these big data technologies can provide a really decent solution to the problem you describe, and the solution will scale to huge volumes of data.