Scraping data is becoming a rather mundane job, with every other organization getting its feet wet to meet its own data gathering needs. Plenty of crawlers have been built – some open-sourced, others internal to organizations as in-house utilities. Although crawling might seem like a simple technique at the outset, doing it at large scale is the real deal. You need a distributed stack to handle huge volumes of data, serve data with low latency, and deal with failovers. This is still achievable after crossing the initial tech barrier and with continuous optimization. (P.S. Not underestimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)
However, you land in completely new territory if your goal is to generate clean and usable datasets from these crawls, i.e. to “extract” data in a format that your DB can process and that aids in generating insights. There are two ways of tackling this – generic extraction and site-specific extraction.
Assuming you still do focused crawls on a predefined list of sites, let’s go over the specific scenarios in which you have to pick between the two.
1. Mass-scale crawls; high-level metadata
Use generic extractors when you have a continuous, large-scale crawling requirement. Large-scale here means crawling sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle for just the document-level information from such crawls – the URL, meta keywords, blog or news title, author, date and article content – which is still enough information to be happy with if your requirement is analyzing sentiment of the scraped data.
Generic extractors don’t yield accurate results and often mess up the datasets, rendering them unusable, because programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that merely list blog excerpts and extract only the ones with the complete article? Delineating article content from the title on a blog page is not easy either.
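To make the trade-off concrete, here is a minimal sketch of a generic extractor in Python, assuming requests and BeautifulSoup (this post doesn’t prescribe any particular stack). It pulls only document-level fields and uses a crude longest-text heuristic for the article body – exactly the kind of guess that makes generic extraction fragile.

```python
import requests
from bs4 import BeautifulSoup

def extract_document_fields(url):
    # Fetch and parse the page.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Document-level fields like title and meta keywords are usually
    # safe to extract generically across very different sites.
    keywords_tag = soup.find("meta", attrs={"name": "keywords"})
    fields = {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "meta_keywords": keywords_tag.get("content") if keywords_tag else None,
    }

    # Crude body heuristic: the block with the most <p> text. Real
    # generic extractors score candidates more cleverly, and even
    # then they misfire on pages that merely list article excerpts.
    best_text, best_len = None, 0
    for node in soup.find_all(["article", "div"]):
        text = " ".join(p.get_text(" ", strip=True)
                        for p in node.find_all("p", recursive=False))
        if len(text) > best_len:
            best_text, best_len = text, len(text)
    fields["content"] = best_text
    return fields
```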
To summarize, below is what to expect of a generic extractor.
Pros-
- Minimal manual intervention
- Low on effort and time
- Can work at any scale
Cons-
- Compromised data quality
- Inaccurate and incomplete datasets
- Fewer details, suited only for high-level analyses
2. Low/mid-scale crawls; detailed datasets
If precise extraction is the mandate, there’s no getting away from site-specific extractors. But realistically this is doable only if your scope of work is limited, i.e. a few hundred sites or less. Using site-specific extractors, you can extract as many fields as you need from any nook and corner of the web pages. Most of the time, pages on a website share similar templates; when they don’t, the variations can still be accommodated with site-specific extractors.
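For contrast, here is a minimal sketch of a site-specific extractor for a hypothetical e-commerce product page, again assuming Python with requests and BeautifulSoup. The CSS selectors are made up for illustration; in practice they would be hand-written per site after inspecting its templates.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical selector map for one site's product pages.
PRODUCT_SELECTORS = {
    "name": "h1.product-title",
    "price": "span.price-current",
    "rating": "div.review-summary span.rating-value",
    "specs": "table.spec-table tr",
}

def extract_product(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    record = {"url": url}

    # Simple scalar fields, each pinned to a known location on the page.
    for field in ("name", "price", "rating"):
        node = soup.select_one(PRODUCT_SELECTORS[field])
        record[field] = node.get_text(strip=True) if node else None

    # Site-specific knowledge lets us pull structured fields a generic
    # extractor would never find, e.g. a key/value specification table.
    record["specs"] = {}
    for row in soup.select(PRODUCT_SELECTORS["specs"]):
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) == 2:
            record["specs"][cells[0]] = cells[1]
    return record
```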
Pros-
- High data quality
- Better data coverage on the site
Cons-
- High on effort and time
- Site structures change from time to time, and maintaining these extractors requires a lot of monitoring and manual intervention
- Only for limited scale
Suited for gathering – any data from any domain on any site, be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.
Uses – Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis
Conclusion
Quite obviously, you need both kinds of extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (read our post on standard data formats here). However, given the internet’s penetration to the masses and the variety of things folks like to do on the web, this is overly futuristic.
So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.
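The post doesn’t detail those ML components, so purely as a hypothetical illustration: one such component could be a page-type classifier that decides whether a page is a full article or a listing page before extraction, sketched here with scikit-learn on toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; a real system would label thousands of pages.
pages = [
    "Read more... Read more... Posted in: news. Older posts. Page 2 of 10",
    "Full story: the company announced today that quarterly revenue grew...",
    "Archives: January, February. Read more. Read more. Next page",
    "In a detailed interview, the author explains the methodology behind...",
]
labels = ["listing", "article", "listing", "article"]

# Word and bigram TF-IDF picks up surface cues like "read more".
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
classifier.fit(pages, labels)

# A generic extractor could call this to skip listing pages entirely.
print(classifier.predict(["Read more. Read more. Page 3 of 12"]))
# likely ['listing'] on this toy data
```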
What have your challenges been? Do drop in your comments!