Undoubtedly, “big data” is the trend of the moment, and it is growing at a tremendous rate. With data acquisition companies in the market, it is neither difficult nor expensive to collect data from multiple sources on the web. We strongly believe that every company that has anything remotely to do with the Internet (selling products, being reviewed by customers, and so on) would benefit immensely from structuring the data scattered around the web. However, how prepared are you to embrace data?
Here is an attempt at demystifying the issue, based on the hundreds of conversations we have day in and day out with people trying to foray into big data.
Objective
It is really a no-brainer! Without a clear reason for gathering data, it makes no sense to embark on the journey. Yet this crucial detail often goes missing, and projects get started without a predefined outcome or to-be state. Of course, it may sound great to get all that data, but why? Knowing which sites to crawl, which fields to extract and how frequently to crawl does not necessarily mean that your objective is well defined.
Data Acquisition Plan
Simply put: what kind of data are you looking to acquire, from which sites, and at what frequency? If you are unable to pinpoint, say, 10 to 50 websites (or fewer) that would provide you with sufficient data to start with, you might want to think deeper! This is not to say that we do not crawl more than 50 sites for a client; we do. Also, we are referring only to site-specific crawls, not mass-scale crawls. On a lighter note, just to give you some perspective: one of the companies that crawls data from the entire WWW was valued at $527 billion, the last time I checked!
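As a rough illustration, such a plan can be written down as a small, structured configuration before any crawling begins. This is only a sketch; the site URLs, field names and frequencies below are hypothetical placeholders, not real client requirements.

```python
# A minimal sketch of a data acquisition plan, kept as plain Python for readability.
# All site URLs, fields and frequencies below are hypothetical placeholders.

ACQUISITION_PLAN = [
    {
        "site": "https://www.example-retailer.com",
        "fields": ["product_name", "price", "rating", "review_count"],
        "frequency": "daily",
    },
    {
        "site": "https://www.example-reviews.com",
        "fields": ["review_title", "review_text", "stars", "review_date"],
        "frequency": "weekly",
    },
]

if __name__ == "__main__":
    # Quick sanity check: how many sites and fields does the plan cover?
    total_fields = sum(len(entry["fields"]) for entry in ACQUISITION_PLAN)
    print(f"{len(ACQUISITION_PLAN)} sites, {total_fields} fields in total")
```

Even a list this small forces the right questions: why this site, why this field, and why this frequency?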
Tools & Processes
Acquiring data is one thing; utilizing it is another! It would not make much sense to use a spreadsheet application (like Excel) to analyse millions of records. However, if it is just a few thousand records from a couple of sites, even a tool like MS Excel would suffice. Either way, the processes and tools for consuming data should be in reasonably good shape before getting started with large-scale data acquisition.
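To make the point concrete, here is a minimal sketch of consuming a large crawl extract programmatically rather than in a spreadsheet, by streaming the file row by row so memory use stays flat even at millions of records. The file name and column names are assumptions for illustration only.

```python
# A minimal sketch of summarising a large crawl extract without a spreadsheet.
# "products.csv" and its "category"/"price" columns are illustrative assumptions.
import csv
from collections import defaultdict


def average_price_per_category(path: str) -> dict:
    """Stream the CSV row by row and compute an average price per category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                totals[row["category"]] += float(row["price"])
                counts[row["category"]] += 1
            except (KeyError, ValueError):
                continue  # skip malformed rows rather than failing the whole run
    return {category: totals[category] / counts[category] for category in counts}


if __name__ == "__main__":
    print(average_price_per_category("products.csv"))
```

The exact tooling matters less than having some repeatable process in place before the data starts arriving in volume.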
Time Commitment
Let us get real: no project has ever delivered immediate results. At PromptCloud, we ask for a minimum commitment of three months because we have seen that this is the bare minimum required to get everything in order. In most cases, three months is too short even to gauge the success or failure of a project. Data-based projects are generally long-term in nature. If you are just starting out with the intention to “test it out” before committing anything to the project, please go back to point #1 above!
Plan to Scale-up
This is where the long-term vision of the project comes into play. It is always prudent to start small and scale up gradually. This gives you time, and a great deal of learning, to do it the right way. Scaling up could take the form of adding new data sources (websites), data points (fields) or categories, or even increasing the frequency of crawls.
If you think you have answers to the above, it seems like the perfect time to reach out to us!
Stay tuned for our next article to find out the real reason behind Microsoft’s acquisition of LinkedIn.