Python Zillow Scraper Code

A walkthrough of my Zillow Scraper using Beautiful Soup.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML documents. If you ever had a Myspace, Tumblr, or WordPress page, you have probably seen HTML or know a thing or two about it. It’s a markup language that defines the structure of a webpage. When scraping information from Zillow we don’t need all of the markup, so Beautiful Soup helps us extract the data without the excess.

I created a class named ZillowScraper, and within the class are 5 simple methods. The first, __init__, initializes the attributes for the class.

def __init__(self, url, params, headers):
    self.url = url
    self.params = params
    self.headers = headers
    self.results = []  # filled by analyze(), written out by open_csv()

This shows us that a url, params, and headers are needed to create a ZillowScraper. Once the attributes for the scraper are initialized, a function is needed to pull the information.
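
As a rough illustration of how those inputs are supplied, the sketch below builds two scrapers from a stripped-down stand-in for the class (only the constructor is reproduced). The URLs, params, and header values are placeholders invented for this example, not values Zillow requires.

```python
class ZillowScraper:
    """Stand-in with only the constructor, for illustration."""

    def __init__(self, url, params, headers):
        self.url = url
        self.params = params
        self.headers = headers


# Placeholder inputs (assumptions for this example only).
headers = {"user-agent": "Mozilla/5.0"}
san_diego = ZillowScraper("https://example.com/san-diego", {"page": 1}, headers)
los_angeles = ZillowScraper("https://example.com/los-angeles", {"page": 1}, headers)

# Each instance keeps its own attributes.
print(san_diego.url)
print(los_angeles.url)
```

Each object stores its own url and params, which is why the later methods can read them back through self.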

def pull(self, url, params):
    # needs "import requests" at the top of the file
    response = requests.get(url, headers=self.headers, params=params)
    return response

This function requests the page using the url and params given in the arguments, along with the headers stored on the instance, and puts the result into a response variable. self is the conventional name for a method's first parameter and refers to the instance the method is called on. Then I had the function return the response so the following function would be able to analyze it.
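
To make the role of params a little more concrete: requests appends them to the URL as a query string. The sketch below imitates that step with the standard library's urlencode instead of requests itself, so it runs without any network call. The build_url helper, the URL, and the parameter names are all invented for the example.

```python
from urllib.parse import urlencode


def build_url(url, params):
    """Hypothetical helper: return url with params appended as a query string."""
    return f"{url}?{urlencode(params)}"


# Placeholder URL and parameters (assumptions, not Zillow's real ones).
full_url = build_url("https://example.com/search", {"city": "San Diego", "page": 1})
print(full_url)  # https://example.com/search?city=San+Diego&page=1
```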

def analyze(self, response):
    # Use Beautiful Soup to parse the webpage.
    content = BeautifulSoup(response, 'lxml')
    resident_cards = content.find('ul', {'class': 'photo-cards photo-cards_wow photo-cards_short'})
    for card in resident_cards.contents:
        script = card.find('script', {'type': 'application/ld+json'})

        if script:
            json_script = json.loads(script.contents[0])
            self.results.append({
                'price': card.find('div', {'class': 'list-card-price'}).text,
                'type': card.find('div', {'class': 'list-card-type'}).text,
                'postalcode': json_script['address']['postalCode'],
                'city': json_script['address']['addressLocality'],
                'url': json_script['url'],
                'floorsize': json_script['floorSize']['value'],
                'bedrooms': card.find_all('li')[0].text,
                'bathrooms': card.find_all('li')[1].text,
                'seller': card.find('div', {'class': 'list-card-truncate'})
            })

The analyze function just needs the text of the response that was obtained from the pull function. Beautiful Soup parses that text into a tree that can be searched. resident_cards is the list element that holds the property cards on the Zillow results page, and the for loop extracts data from each card inside it. Cards without a JSON-LD script tag are skipped. Thankfully each piece of information is in the same spot for most properties, so I was able to use .find for specific fields.
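
The same pattern can be tried on a small, made-up snippet of HTML. This is only a sketch: the markup below is invented and far simpler than a real Zillow page, it uses the built-in "html.parser" so lxml is not required, and it loops with find_all('li', recursive=False) rather than .contents so that the whitespace between cards is not treated as a card.

```python
import json

from bs4 import BeautifulSoup

# Invented markup that loosely imitates two listing cards.
html = """
<ul class="photo-cards">
  <li>
    <script type="application/ld+json">{"url": "https://example.com/home/1", "address": {"postalCode": "92101", "addressLocality": "San Diego"}}</script>
    <div class="list-card-price">$750,000</div>
  </li>
  <li><div class="list-card-price">$525,000</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.find("ul", {"class": "photo-cards"})

results = []
for card in cards.find_all("li", recursive=False):
    script = card.find("script", {"type": "application/ld+json"})
    if script:  # the second card has no JSON-LD, so it is skipped
        data = json.loads(script.contents[0])
        results.append({
            "price": card.find("div", {"class": "list-card-price"}).text,
            "postalcode": data["address"]["postalCode"],
            "city": data["address"]["addressLocality"],
            "url": data["url"],
        })

print(results)
```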

After the information was extracted from the text using Beautiful Soup, I created an open_csv function, shown below:

def open_csv(self):
    # newline='' is what the csv module recommends for files it writes to
    with open('sandiego_zillow.csv', 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
        writer.writeheader()

        for row in self.results:
            writer.writerow(row)

This saves the extracted data into a CSV file named ‘sandiego_zillow.csv’. DictWriter takes the csv_file and a list of field names, here the keys of the first result, which writeheader() also uses for the header row. It then maps each dictionary onto an output row. The for loop writes every row in the results to the file.
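
Here is a small, self-contained sketch of the same DictWriter steps. It writes to an in-memory buffer instead of a file so the output is easy to inspect, and the two rows are sample data made up for the example.

```python
import csv
import io

# Sample rows (invented for this example).
rows = [
    {"price": "$750,000", "city": "San Diego"},
    {"price": "$525,000", "city": "Chula Vista"},
]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()    # header row comes from fieldnames
writer.writerows(rows)  # one output row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Because the prices contain commas, the csv module wraps them in quotes so they stay in a single column.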

Lastly, a run function was created to put it all together.

def run(self, n):
    # n is the number of pages; range(1, n + 1) runs the loop n times
    for page in range(1, n + 1):
        res = self.pull(self.url, self.params)
        self.analyze(res.text)
        time.sleep(2)

    self.open_csv()

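The run function takes the number of pages for the location being scraped. On each pass of the for loop, the page is pulled and the text of the response is analyzed, which adds every property found to the results. Note that the loop variable page is never passed to pull, so as written every pass sends the same url and params; the page number would need to be worked into the params on each pass to reach later pages. time.sleep(2) pauses for two seconds between requests so the scraper does not hit the site too quickly. Once the loop finishes, the results are saved into the CSV.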
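
One way to make each pass request a different page is to build fresh params inside the loop. The sketch below shows the idea with a fake fetch function in place of a real request. The scrape_pages helper and the "page" parameter name are hypothetical; this post does not show how Zillow actually encodes the page number.

```python
import time


def scrape_pages(fetch, base_params, n, delay=2.0):
    """Call fetch once per page (1 through n), pausing between requests."""
    pages = []
    for page in range(1, n + 1):
        # Copy the base params and set this pass's page number (hypothetical key).
        params = dict(base_params, page=page)
        pages.append(fetch(params))
        if page < n:
            time.sleep(delay)  # wait before the next request
    return pages


# Fake fetch that just reports which page was requested.
seen = scrape_pages(lambda params: params["page"], {"city": "San Diego"}, n=3, delay=0)
print(seen)  # [1, 2, 3]
```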

If you would like information on how to find the arguments needed to use this scraper on Zillow you can check out my blog post here:

https://susanna-jihae-han.medium.com/scraping-zillow-using-beautiful-soup-e32e19bbd986

Once the ZillowScraper has run, you can load the extracted data into a data frame with pandas.read_csv.

import pandas as pd
df = pd.read_csv('sandiego_zillow.csv')