Embracing Manual Data Collection


Collecting Option Order Flow

One of the factors driving me toward data science has always been my love of collecting data. My idea of fun on a Friday night includes web scraping, wrangling, and data engineering. Although a lot of data is floating around on the internet, knowing how to collect it yourself is important because you won't always find the data you want in the form you need. For example, a few years ago I tried my hand at day trading. Although I had limited success in the endeavor, I found I was more interested in analyzing the data than chasing profits. I got involved with a group of guys who watch options order flow from a curated trade-alert feed and thought it would be interesting to see if I could analyze the order flow and find a potential edge hidden within. The website that hosted the feed didn't offer much in the way of searching or analysis, so I decided to start collecting the data myself.

Challenge 1: Copying the Data

Since the website didn't have a public API from which I could pull the data, I had to figure out a way to copy it. The site was built in a way that made scraping difficult, and the page blocked native highlighting, so I couldn't simply copy and paste all the text. I started digging for a solution and came across the Chrome extension Allow Copy, which re-enables select, copy, paste, and right-click on pages that use copy protection. Once the extension was installed, I was able to select the data feed, copy it, and paste it.

[Image: Chrome extension Allow Copy]
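Allow Copy did the trick for me, but if you'd rather script this step, a headless-browser approach could also pull the visible text, since copy protection on the page doesn't block programmatic access to the DOM. This is only a sketch: the URL and the selector for the feed container are placeholders, not the real site's.

```python
# Hypothetical alternative to the copy-enabling extension: grab the feed's
# visible text with Selenium. The URL and CSS selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/flow-feed")  # placeholder URL

# Copy protection doesn't stop programmatic access to the rendered text.
feed = driver.find_element(By.CSS_SELECTOR, ".feed-container")  # assumed selector
lines = feed.text.splitlines()

with open("flow_raw.txt", "w") as f:
    f.write("\n".join(lines))

driver.quit()
```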

Challenge 2: Historical Data

The feed only displayed so many results per day, but it had a Load More button that let me load previous days' results, 50 at a time. As the list grew, the website's performance would tank; I could only go back 30 days before the website crashed. Not only did this make the site unstable, it was also a lot of clicking and dragging to load the maximum number of days into the list. I decided to automate the task with an AutoHotkey macro recorder called Pulover's Macro Creator: I simply recorded myself clicking the Load More button, then looped the playback until I had loaded as many days as I could.

[Image: The Load More button]
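Pulover's Macro Creator worked well for this, but the same record-and-loop idea can be sketched in a few lines of Python with pyautogui. The button coordinates, click count, and delay below are placeholders you'd tune to your own screen and connection, not values from my setup.

```python
# Rough scripted equivalent of the looped macro: click the Load More button
# repeatedly, pausing so the page can fetch each batch of 50 results.
# Coordinates, click count, and delay are placeholders.
import time
import pyautogui

LOAD_MORE_X, LOAD_MORE_Y = 960, 1020   # on-screen position of the Load More button
CLICKS = 30                            # roughly one click per day of history

for _ in range(CLICKS):
    pyautogui.click(LOAD_MORE_X, LOAD_MORE_Y)
    time.sleep(2)  # give the page time to load before the next click
```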

Challenge 3: Cleaning the Data

I had automated loading the data and could select and copy all of it, but it still needed cleaning to make analysis simpler. I needed to add a date and adjust the size column. For example, the data included the amount of money spent on each trade in a column called Size, but displayed it as a string like 400K or 1.2M. I decided to paste the data into a Google spreadsheet and use a formula to transform the size of the bet from a string into a numeric value I called Spent:

=IFERROR(LEFT(G2,LEN(G2)-1)*CHOOSE(MATCH(RIGHT(G2,1),{"K","M"},0),1000,1000000),G2)
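If you'd rather do this cleanup in Python than in a spreadsheet, a small pandas sketch of the same transformation might look like the following. The Size, Spent, and Date columns match what I describe above, but the file names are placeholders.

```python
# Pandas version of the sheet formula: turn strings like "400K" or "1.2M"
# into numbers, and tag each row with a date. File names are placeholders.
import pandas as pd

MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def to_spent(size: str) -> float:
    """Convert '400K' -> 400000.0 and '1.2M' -> 1200000.0; pass plain numbers through."""
    size = str(size).strip()
    if size and size[-1].upper() in MULTIPLIERS:
        return float(size[:-1]) * MULTIPLIERS[size[-1].upper()]
    return float(size)

df = pd.read_csv("order_flow.csv")              # the pasted/exported feed
df["Spent"] = df["Size"].apply(to_spent)        # numeric dollar amount
df["Date"] = pd.Timestamp.today().normalize()   # date the rows were collected
df.to_csv("order_flow_clean.csv", index=False)
```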

Challenge 4: Storing the Data

After I cleaned the data, I wanted to store it on my local machine instead of in the Google Sheet. Since I use a lot of SQL in my day job and am quite familiar with it, I decided to load the data into a SQL database. I prefer Microsoft SQL Server and Management Studio since that's what I've used the most:

1. Download the Google Sheet as a CSV file.

[Image: Downloading a Google Sheet as CSV]

2. In SQL Server Management Studio (SSMS), right-click the database into which you want to load the data.
3. Select Tasks > Import Data. The Import/Export Wizard opens.
4. Select the Data source: Flat File Source.
5. Browse for the downloaded CSV.
6. Click Next.
7. Verify the columns look as expected.
8. Choose a Destination: SQL Server Native Client 11.0.
9. Verify the Server name and database are correct.
10. Click Next.
11. Rename the Destination Table if desired.
12. Click Next.
13. Click Finish.
14. Click Finish again. You'll see the data load.

The data has now been imported into the selected database!
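The wizard is the point-and-click route. If you'd rather script the load, something like the following pandas + SQLAlchemy sketch could append the CSV to a table; the connection string, database, and table names are placeholders rather than my actual setup.

```python
# Hypothetical scripted alternative to the SSMS Import Wizard: load the cleaned
# CSV into a SQL Server table. Server, database, and table names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://localhost/OrderFlow"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

df = pd.read_csv("order_flow_clean.csv")
df.to_sql("OptionOrderFlow", engine, if_exists="append", index=False)
```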

Once the data is in the SQL database, it is simple to perform additional data engineering and analysis. If you’re not familiar with SQL, there are a ton of online resources to help you learn. I started with w3schools!
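As a taste of what that analysis can look like, here is a quick aggregation of dollars spent per ticker. The table and column names are purely illustrative, carried over from the sketches above.

```python
# Example of the kind of quick analysis that becomes easy once the data is in SQL:
# total dollars spent per ticker. Table and column names are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://localhost/OrderFlow"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

query = """
    SELECT Ticker, SUM(Spent) AS TotalSpent, COUNT(*) AS Trades
    FROM OptionOrderFlow
    GROUP BY Ticker
    ORDER BY TotalSpent DESC;
"""
top_flow = pd.read_sql(query, engine)
print(top_flow.head(10))
```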
