Days of unstructured raw data are blurry as most data-driven companies turn their eyes on to structured data which can be easily imported into their internal systems, read by their machines, and analyzed by their teams to keep cutting the edges. But while you have stepped on to the task of data acquisition via crawling or various such methods, how do you define the structure that’s most appropriate for your requirements?
Most of the times we find our clients unable to take a decision on which output format to select post crawl and extraction. Here’s our bit to help weigh your options.
1. XML –
By far, the most robust structure that you can opt for. Does not break with unwanted special characters in the data and can be as nested as you like. However, if your internal systems are not designed for large-scale, it might end up being a bloat with huge file sizes to handle.
Summary- trade-off the bloat for robustness of XML and its flexibility of schema.
Follows an illustration of an XML record-
2. CSV-
Almost all off-the-shelf databases have CSV import functionality designed. Working with CSV files is quite a cake walk and are well suited for low volumes where you could simply select the files, specify columns and be done. But not all tools are able to delimit the text accurately without messing up the imports. Summary- Use CSVs when you are dealing with few records and are 100% sure that weird characters (inevitable when dealing with web data) will not break your imports.
Below is a sample CSV record.
3. JSON-
JSON is quite similar to XML; but is less verbose. However, there’s no CDATA equivalent for JSON and hence it’s difficult to represent a text as-is.
Summary- JSON is the preferred data format when working with Python libraries.
Use it for anything as an XML alternative if not dealing with CDATA. A sample JSON follows.
4. XLS-
This could be the most conservative of all as excel files work on a limited number of platforms. So if you are just playing around with tiny volumes of data and are keen on importing it into your Google drive, then XLS will do the job well. Same issues as with CSVs remain here.. XLS files start cribbing with some characters which XML/JSON can rather take care of.
Summary- Use XLS only if you are going to be loading a few dozen rows of data directly into MS Office or Google spreadsheets. XML is the clear winner when evaluating independently of your internal systems, but the decision gets tough when your system presents technical constraints. Nevertheless, you can always change your data format using a lot of tools in the market or building your own parser;
XML -> CSV, CSV -> XML, XML -> XLS, XML -> JSON and the like.
Note- All the above points are valid only with the current state of things and might change as technologies evolve.