PromptCloud | Data Scraping vs. Data Crawling

Data Scraping vs. Data Crawling

One of our favorite quotes is: ‘If a problem changes by an order, it becomes a totally different problem.’ Therein lies the answer to the question: what’s the difference between scraping and crawling?

Crawling usually refers to dealing with large data sets, where you develop your own crawlers (or bots) that crawl to the deepest of the web pages. Data scraping, on the other hand, refers to retrieving information from any source (not necessarily the web). More often than not, extracting data from the web gets called scraping (or harvesting) irrespective of the approach involved, and that’s a serious misconception.
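
To make the distinction concrete, here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions, not a description of our production crawlers: `scrape` extracts fields from a single document regardless of where it came from, while `crawl` discovers pages by following links and scrapes each one along the way. The names `LinkAndTitleParser`, `scrape`, `crawl`, the `fetch` callable, and the in-memory `SITE` dictionary are all hypothetical stand-ins; a real crawler would fetch over HTTP and resolve relative URLs.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Scraping: extract fields from one document, whatever its source."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


def crawl(start, fetch, max_pages=100):
    """Crawling: discover pages breadth-first by following links.

    `fetch` is a hypothetical callable that maps a URL to its HTML,
    or returns None when the page is unavailable.
    """
    seen = {start}
    queue = deque([start])
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        page = scrape(html)  # extraction is one step inside the crawl
        results[url] = page["title"]
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Toy in-memory "web" standing in for real HTTP fetches (an assumption).
SITE = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}

print(scrape(SITE["/a"]))
print(crawl("/", SITE.get))
```

Calling `scrape` on one page returns just that page's title and links, whereas `crawl("/", SITE.get)` visits all three pages of the toy site exactly once, even though they link back to each other.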

Below are some differences, in our opinion, both evident and subtle:

  1. Scraping data does not necessarily involve the web. Data scraping could refer to extracting information from a local machine or a database, and even on the internet, a mere “Save as” link on a page is a subset of the data scraping universe. Crawling, on the other hand, differs immensely in scale as well as in range. Firstly, crawling = web crawling, which means that “crawling” data happens only on the web. Programs that perform this incredible job are called crawl agents, bots, or spiders (please leave the other spider in Spider-Man’s world). Some web spiders are algorithmically designed to reach the maximum depth of a site and crawl its pages iteratively (did we ever say scrape?).
  2. The web is an open world and the quintessential practising platform of our right to freedom. Thus a lot of content gets created and then duplicated. For instance, the same blog post might appear on different pages, and our spiders don’t understand that. Hence, data de-duplication (affectionately, dedup) is an integral part of data crawling. This achieves two things: it keeps our clients happy by not flooding their machines with the same data more than once, and it saves our own servers some space. However, dedup is not necessarily a part of data scraping.
  3. One of the most challenging things in the web crawling space is coordinating successive crawls. Our spiders have to be polite with the servers they hit so that they don’t piss them off, and this creates an interesting situation to handle. Over time, our intelligent spiders have to get more intelligent (and not crazy!) and learn when and how much to hit a server in order to crawl data on its web pages while complying with its politeness policies.
  4. Finally, different crawl agents are used to crawl different websites, so you need to ensure they don’t conflict with each other in the process. This situation never arises when you intend to just scrape data.
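
Two of the points above, dedup and politeness, can be sketched in a few lines of Python. This is a simplified illustration, not how our spiders actually work: `Deduplicator` catches only exact duplicates (after normalizing case and whitespace) by hashing the content, and `PolitenessGate` enforces a fixed minimum delay per host rather than learning one over time. Both class names, the `content_fingerprint` helper, and the injectable `clock` and `sleep` parameters are assumptions made for the example.

```python
import hashlib
import time
from urllib.parse import urlparse


def content_fingerprint(text):
    """Hash of case- and whitespace-normalized text, to spot exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints of content that has already been seen."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        fingerprint = content_fingerprint(text)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


class PolitenessGate:
    """Enforces a minimum delay between successive requests to one host."""

    def __init__(self, min_delay_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock    # injectable so the gate can be tested
        self._sleep = sleep
        self._last_hit = {}    # host -> time of the most recent request

    def wait_for(self, url):
        """Block until `url`'s host may be hit again; return seconds waited."""
        host = urlparse(url).netloc
        waited = 0.0
        last = self._last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_hit[host] = self._clock()
        return waited


dedup = Deduplicator()
print(dedup.is_new("The same blog post"))     # first copy is kept
print(dedup.is_new("the  same blog  post"))   # reposted copy is dropped
```

In a crawl loop, `wait_for(url)` would be called just before each fetch and `is_new(text)` just before storing or delivering a page. Many sites publish their preferences in robots.txt, which Python's standard `urllib.robotparser` module can read, so a fixed delay like the one here is only a fallback assumption.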

On a concluding note, scraping represents a very superficial node of crawling, which we call extraction, and that again requires a few algorithms and some automation in place.

P.S. This post does not intend to offend anyone who uses the terms ‘scraping’ and ‘crawling’ interchangeably, but purely wishes to create awareness for those interested in the Big Data domain. And sorry! We couldn’t help being biased towards the word “crawl” because that’s what feeds us :).

  • http://www.blogger.com/profile/10599950380674632630 Amey Desai

    It would be interesting to know your crawling and scraping approaches as well: whether you have a distributed crawler architecture, adaptive crawlers, etc. Another thing I would like to read about from you is how you follow robots.txt and the term ‘politeness’ associated with crawling. In a place saturated with web development, it would be really cool if folks could roll out posts on the technical aspects of web crawling.

  • http://www.blogger.com/profile/03234817416494390229 Arpan

    Amey,

    Thanks for your comments. We’ll gradually get to the technical aspects of our infrastructure and technology in our future posts.

  • Anonymous

    What throughput does your platform support? I have about 200k sites to be crawled on a daily basis. Will your system be able to support that?

    • http://www.blogger.com/profile/16024290523117958654 Mohit Sharma

      To know more about our services and discuss your requirements, kindly drop a mail to sales@promptcloud.com.
      Thanks.

  • http://www.blogger.com/profile/05106355807295515042 Hanumesh Palla

    What about “Scrapy”, an open-source tool for web crawling and scraping? Is there anything regarding it?

  • Anonymous

    Question I have:

    Are coding skills transferable between creating a search engine AND web scraping a website?

    By web scraping I mean software functions such as those provided by Outwit Hub Pro, Helium Scraper, or NeedleBase (extinct).

    I have been told web scraping a website requires the following coding skills:
    Python, Regular Expressions (Regex), XPath

    In other words, are the coding skills learned in web scraping transferable to creating a private search engine that indexes an entire website and keeps up to date with all site changes (such as new product promotions)?

    By the way, the website I am keeping tabs on has a new web page for each new product promotion.

    There is no centralized page where I can view a list of latest product promotions.

    Please enlighten.

    Thanks a million.

    • Anonymous

      That’s an interesting question. Have you found anything in the meantime while searching for answers? A resourceful site for you might be http://stackoverflow.com
      The question I also find interesting is: to what extent is data scraping able to retrieve the program language?
