Trulia is a real-estate website that set up shop in 2005, initially with listings in California. Its interactive map with commuter and transit data shows driving or commute times to a property from any given location in the United States, which makes Trulia a good source of real estate data to scrape.
Several of Trulia’s features make it particularly interesting for scraping real estate data:
- What Locals Say: A recent Trulia feature that lets home buyers, sellers, and renters see what locals think of a neighborhood. The information is gathered through polls, surveys, and independent reviews.
- Trulia Neighborhoods: A recently launched feature that gives people more information about the area around a property listing. It includes original photography, descriptions, facts about the area, and even drone footage.
- Local Legal Protections: A service that provides information on local nondiscrimination laws that apply to housing, employment, and public accommodations. This data is shown beside property listings to make it easier for a more diverse crowd to find adequate accommodation in a comfortable environment.
How to get started with scraping Trulia?
For installation and setup, refer to a similar article in which we discussed how to crawl data from a leading travel portal. Once you have installed Python, the other dependencies, and the Atom code editor, come back to this article and read on.
Where is the code to scrape data?
If you are tired of the text, let’s go right to the code. The code is given below, but you can also download it from the link and get down to business. You can run it with the python command, as you might have seen in the other scraping tutorials. Once you have downloaded the program, go to its location in the command prompt and run:
python trulia_extractor.py
It will prompt you to enter the link for a Trulia property listing. Once the extraction is complete, a confirmation message is shown, and you can check your folder for the JSON file and the HTML file it created.
[code language="python"]
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Example session (console output, not part of the script):
#   H:Python_Algorithmic_ProblemsScraping_assignmentsTrulia-Data-Extraction>python trulia_extractor.py
#   Enter Trulia Property Listing Url- https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215
#   ----------Extraction of data is complete. Check json file.----------
import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user
url = input('Enter Trulia Property Listing Url- ')

# Making the website believe that you are accessing it using a Mozilla browser
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the html page for easy extraction of data.
soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
product_json = {}

# This code block will get you a one liner description of the listed property
for meta in soup.find_all('meta', attrs={'name': 'description'}):
    try:
        product_json['description'] = meta['content']
        break
    except KeyError:
        pass

# This code block will get you the link of the listed property
for link in soup.find_all('link', attrs={'rel': 'canonical'}):
    try:
        product_json['link'] = link['href']
        break
    except KeyError:
        pass

# This code block will get you the price and the currency of the listed property
for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
    # JSON-LD is JSON, so parse it with json.loads; skip blocks without an offer
    try:
        details_json = json.loads(script.text.strip())
        offers = details_json['offers']
        product_json['price'] = {
            'amount': offers['price'],
            'currency': offers['priceCurrency'],
        }
    except (ValueError, KeyError, TypeError):
        continue

# This code block will get you the detailed description of the listed property
for paragraph in soup.find_all('p', attrs={'id': 'propertyDescription'}):
    product_json['broad-description'] = paragraph.text.strip()

product_json['overview'] = []

# This code block will get you the important points regarding the listed property
for divs in soup.find_all('div', attrs={'data-auto-test-id': 'home-details-overview'}):
    for divs_second in divs.find_all('div'):
        for uls in divs_second.find_all('ul'):
            for lis in uls.find_all('li', string=True, recursive=False):
                product_json['overview'].append(lis.text.strip())

# Creates a json file with all the information that you extracted
with open('house_details.json', 'w') as outfile:
    json.dump(product_json, outfile, indent=4)

# Creates an html file on your machine with the html content of the page you parsed.
with open('output_file.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]
If you enter the URL mentioned in the example, you will get this JSON saved in your folder:
[code language="php"]
{
    "description": "327 101st St #1A, Brooklyn, NY is a 3 bed, 3 bath, 1302 sq ft home in foreclosure. Sign in to Trulia to receive all foreclosure information.",
    "link": "https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215",
    "price": {
        "amount": "510000",
        "currency": "USD"
    },
    "broad-description": "Very Large Duplex Unit with 1st floor featuring a Finished Recreational Room, an Entertainment Room and a Half Bathroom. Second Level Features 2 Bedrooms, 2 Full Bathrooms, a Living Room/Dining Room and an Outdoor Space. There is Verrazano Bridge views.\n View our Foreclosure Guides",
    "overview": [
        "Condo",
        "3 Beds",
        "3 Baths",
        "Built in 2006",
        "5 days on Trulia",
        "1,302 sqft",
        "$392/sqft",
        "143 views"
    ]
}
[/code]
Data scraping code explained
If you want to understand the code, start with the comments; for the workings of the individual modules, you will need to search online a bit. The most important part here is bs4, or BeautifulSoup. BeautifulSoup came into being because a lot of HTML on the internet is not “well-formed” yet still functional. Browsers render such pages as expected, with only minor, rarely occurring errors, but anyone who tries to parse the same HTML with a strict parser runs into roadblocks: errors saying that the HTML is not well-formed. Converting the HTML into a tree or any other data structure fails with the same error. The developer then has to sit and clean up HTML written by someone else in some other part of the world, which delays the real objective. To make things easier for coders, BeautifulSoup provides a parser that absorbs any HTML file passed to it and creates a BeautifulSoup object with nodes and attributes that you can traverse very easily, much like a tree.
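To make the well-formedness problem concrete, here is a minimal sketch that uses only the Python standard library. The markup in MESSY_HTML and the names TextCollector and strict_parse_ok are made up for illustration; html.parser is the built-in parser that the script above selects when it passes 'html.parser' to BeautifulSoup.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: the <p> and <b> tags are never closed.
MESSY_HTML = "<div><p id='propertyDescription'>Very Large <b>Duplex Unit</div>"


class TextCollector(HTMLParser):
    """Records start tags and text without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def strict_parse_ok(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TextCollector()
collector.feed(MESSY_HTML)
collector.close()

print("strict parser accepts it:", strict_parse_ok(MESSY_HTML))
print("tags seen by lenient parser:", collector.tags)
print("text seen by lenient parser:", "".join(collector.text))
```

The strict XML parser rejects the snippet because its tags are never closed, while the lenient parser still recovers both the tags and the text. BeautifulSoup builds its navigable tree on top of this kind of tolerant parsing.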
For example, when I write this code:
[code language="python"]
for paragraph in soup.find_all('p', attrs={'id': 'propertyDescription'}):
    product_json['broad-description'] = paragraph.text.strip()
[/code]
I am trying to extract the text within a <p> tag whose id is propertyDescription. Simple, isn’t it? To understand more, check out the BeautifulSoup website and explore on your own to extract more data from the HTML file that is also created when you run the program. Here is the link for the HTML generated on running the code with the link provided above.
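The price is read differently: it comes from a script tag of type application/ld+json, whose content is JSON rather than HTML. The sketch below shows one way to read such a payload with the standard json module. SAMPLE_JSON_LD is a hypothetical payload shaped like the offers data the script expects, and extract_price and literal_eval_accepts are illustrative helper names, not part of the original program.

```python
import ast
import json

# Hypothetical JSON-LD payload; real listings carry many more fields.
SAMPLE_JSON_LD = """
{
    "@type": "Product",
    "offers": {"price": "510000", "priceCurrency": "USD", "availability": null}
}
"""


def extract_price(raw):
    """Return {'amount', 'currency'} from a JSON-LD string, or None if absent."""
    try:
        details = json.loads(raw)
    except json.JSONDecodeError:
        return None
    offers = details.get("offers") if isinstance(details, dict) else None
    if not isinstance(offers, dict) or "price" not in offers:
        return None
    return {"amount": offers["price"], "currency": offers.get("priceCurrency")}


def literal_eval_accepts(raw):
    """Return True if ast.literal_eval can read the payload as a Python literal."""
    try:
        ast.literal_eval(raw.strip())
        return True
    except (ValueError, SyntaxError):
        return False


print(extract_price(SAMPLE_JSON_LD))
print("literal_eval accepts payload:", literal_eval_accepts(SAMPLE_JSON_LD))
```

JSON allows the literals null, true, and false, which are not Python literals, so a JSON parser is the more dependable choice for this block. Returning None when the offers data is missing also keeps one unexpected script tag from stopping the whole run.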
So what data did we get from Trulia?
So what were we able to extract using this simple code? If you look at the JSON closely, you can see that we extracted quite a bit.
First, we got the description, a sort of header that you can use for the listing, then the link in case you need it for any reason, followed by the price, broken into the amount and the currency. The broad description is the owner’s description, which paints a picture of what the house is like. The overview contains a number of key aspects. Why are they not in a key: value format?
Well, that is because no two houses may have the same aspects or things to boast of. That is why this field is a list of important features that a prospective buyer might take interest in. It may include points such as the number of beds and bathrooms, when the house was built, how long it has been listed on Trulia, the total area, the price per square foot, the number of people who have viewed the listing to date, and more.
These things can change, so if you run the program on a listing one day, you might not get the same JSON as the one you got the day before from the same listing.
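If a key: value form is needed downstream, some overview entries follow recognizable patterns and can be mapped to keys after the fact. The sketch below is one possible approach. The patterns and key names are assumptions based only on the sample output shown earlier, so anything that does not match is kept in a separate list rather than dropped.

```python
import re

# Assumed patterns, derived from the sample listing only; other listings may differ.
OVERVIEW_PATTERNS = {
    "beds": re.compile(r"^(\d+)\s+Beds?$"),
    "baths": re.compile(r"^(\d+(?:\.\d+)?)\s+Baths?$"),
    "year_built": re.compile(r"^Built in (\d{4})$"),
    "sqft": re.compile(r"^([\d,]+)\s+sqft$"),
    "price_per_sqft": re.compile(r"^\$([\d,]+)/sqft$"),
    "days_on_trulia": re.compile(r"^(\d+) days? on Trulia$"),
    "views": re.compile(r"^([\d,]+) views?$"),
}


def structure_overview(items):
    """Split overview strings into recognized key/value pairs and leftovers."""
    parsed, other = {}, []
    for item in items:
        for key, pattern in OVERVIEW_PATTERNS.items():
            match = pattern.match(item)
            if match:
                # Values stay strings; thousands separators are removed.
                parsed[key] = match.group(1).replace(",", "")
                break
        else:
            other.append(item)
    return parsed, other


sample_overview = [
    "Condo", "3 Beds", "3 Baths", "Built in 2006",
    "5 days on Trulia", "1,302 sqft", "$392/sqft", "143 views",
]
parsed, other = structure_overview(sample_overview)
print(parsed)
print(other)
```

Keeping the unmatched entries, such as the property type, means a new or unusual feature on another listing is still preserved even though no pattern exists for it yet.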
Using this code in business
This Trulia scraping setup can be used in your business in several ways. You can create a CSV of listing links and have an automation script run the code on each row. Better still, you could build a system that, given a location, grabs all of its listings and then runs this code to collect the data for each one. This can easily be achieved using the expertise of web scraping service providers such as PromptCloud.
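The CSV-driven automation could look roughly like the sketch below. It assumes the links live in a column named url and that the scraping logic has been wrapped in a function that takes a URL and returns a dictionary. Both the column name and fake_extract, a stand-in used here so that the example runs without network access, are hypothetical.

```python
import csv
import io
import json


def read_listing_urls(csv_file, column="url"):
    """Yield non-empty URLs from the given column of an open CSV file."""
    for row in csv.DictReader(csv_file):
        url = (row.get(column) or "").strip()
        if url:
            yield url


def run_batch(csv_file, extract, column="url"):
    """Call extract(url) for each row; collect results and failures separately."""
    results, failures = [], []
    for url in read_listing_urls(csv_file, column):
        try:
            results.append(extract(url))
        except Exception as exc:  # one bad listing should not stop the batch
            failures.append({"url": url, "error": str(exc)})
    return results, failures


def fake_extract(url):
    """Stand-in extractor for illustration; replace with the real scraping logic."""
    if "bad" in url:
        raise ValueError("could not parse listing")
    return {"link": url}


sample_csv = io.StringIO(
    "url\nhttps://example.com/listing-1\nhttps://example.com/bad-listing\n"
)
results, failures = run_batch(sample_csv, fake_extract)
print(json.dumps({"results": results, "failures": failures}, indent=4))
```

Recording failures alongside results makes it easy to retry only the listings that did not parse. In a real run, adding a pause between requests would also be a sensible precaution.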
Data is money in today’s data-driven economy, and making the most of the data freely available on the internet can prove very profitable in any avenue of business that you decide to venture into. I will sign off on that note and leave you to ponder it.
Disclaimer: The code provided in this tutorial is only for learning purposes. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.