Introduction to Web Scraping, Data Cleaning, and Visualization

Mora Argatha
Dec 11, 2021

by Ardi Imawan, S.Kom., M.Sc

An API is a way of pulling data through a URL. One example is the Google Maps Places API.

Data can be obtained in various ways, depending on the provider. For example, BPS data is downloaded as Excel files. Another way, which needs neither scraping nor downloading, applies to websites that provide their data legally, such as social media services or Google.

Besides selling ads, Google now sells many other things, such as services and data. We get Google Maps for free, but we pay for the API, which is the way to pull the data through a URL.

Some services require payment, such as the Google Translate API and Google Maps. With them you don't need to know how to translate yourself; you just call the Google Translate API.

We can search the same way we would on Google Maps, but through the API: the script sends a URL to Google that is built from our keywords.
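As a rough illustration of how keywords end up in a URL, here is a minimal sketch. It assumes the Places Text Search endpoint of the Google Maps platform; the function name build_search_url and the placeholder key are made up for this example, and the code only builds the URL without sending it.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Places Text Search URL of the Google Maps platform.
BASE_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def build_search_url(keywords: str, api_key: str) -> str:
    """Return a search URL that carries the keywords as a query string."""
    params = {"query": keywords, "key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder; a real key comes from a Google account.
url = build_search_url("coffee shop Jakarta", "YOUR_API_KEY")
print(url)
```

Sending this URL with a valid key would normally return the search results as JSON, which is the data we then clean.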

We need to know the anatomy of the data we have. From the data, we can see what needs to be tidied up or corrected, such as empty rows or columns, incorrect formatting of the review date, and so on.

The trick is to open the file (CSV) with a library called pandas, which is commonly used by scientists for research. The delimiter is the separator between values, such as a semicolon or comma. Then run the script.
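A minimal sketch of that step, assuming a semicolon-delimited file. The sample text and its column names (place, review, review_date) are hypothetical and stand in for the real file so the example can run on its own.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file; columns are made up.
raw = (
    "place;review;review_date\n"
    "Cafe A;Great coffee;2021-11-02\n"
    "Cafe B;;2021-11-05\n"
    "Cafe C;Too crowded;2021-11-09\n"
)

# sep is the delimiter: ";" here, "," for comma-separated files.
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types as pandas inferred them
```

With a real file, the path to the CSV would be passed in place of io.StringIO(raw).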

If the script runs without errors, the file was read without problems. The purpose of the script is to remove unimportant text. We don’t need to memorize the script, just understand it.

If there are empty rows or columns, they are dropped, because they are garbage or noise data. Before deleting, check and make sure which data is actually empty. The subset determines which columns are checked for empty values.
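The check-then-drop idea can be sketched with pandas as follows. The data and column names are hypothetical; isna and dropna are the pandas calls assumed here, since the original script is not shown.

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame(
    {
        "place": ["Cafe A", "Cafe B", "Cafe C"],
        "review": ["Great coffee", None, "Too crowded"],
        "notes": [None, None, None],
    }
)

# First look at how many values are empty in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop columns that are entirely empty, then rows with an empty review.
cleaned = df.dropna(axis="columns", how="all")
cleaned = cleaned.dropna(subset=["review"])
print(cleaned)
```

Because subset names only the review column, a row is removed only when its review is empty; gaps in other columns do not trigger a drop.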

If we put only the review column in the subset, any row with an empty review will be dropped. Once those rows are deleted, the data is clean. What remains is that some values, such as the review date, are still stored as text and must be converted into the format we want. This is casting: changing the format of a column while keeping the same column.
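Casting can be sketched like this, assuming the column in question is a date that was read as plain text. The column name review_date and the date format are assumptions for the example.

```python
import pandas as pd

# Hypothetical column: dates that were read from the file as plain text.
df = pd.DataFrame({"review_date": ["2021-11-02", "2021-11-05", "2021-11-09"]})

# Cast the text column to a datetime column, keeping the same column name.
df["review_date"] = pd.to_datetime(df["review_date"], format="%Y-%m-%d")

print(df.dtypes)
```

After the cast, the column holds real dates, so tools that sort or group by date can treat it as a date dimension.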

We need an output file for visualization. One free, web-based tool is Google Data Studio. The trick is to start from a blank report and choose the data source, such as a Google Sheets file, a CSV file, etc.
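Producing the output file could look like the sketch below. The data, the file name reviews_clean.csv, and the temporary folder are all hypothetical choices made so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
cleaned = pd.DataFrame(
    {"place": ["Cafe A", "Cafe C"], "review_date": ["2021-11-02", "2021-11-09"]}
)

# Write to a temporary folder here; any path works in practice.
output_path = os.path.join(tempfile.mkdtemp(), "reviews_clean.csv")

# index=False leaves out the pandas row numbers, which the report does not need.
cleaned.to_csv(output_path, index=False)
print(output_path)
```

The resulting CSV file can then be uploaded as a data source or pasted into a Google Sheets file.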

When we create a report, a table appears by default. But we can use other visualizations, such as a time series, which displays the data along the date dimension we choose.
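Google Data Studio does this grouping in its own interface, so no code is needed there. Purely as an illustration of what a time series with a date dimension computes, here is a pandas sketch with hypothetical data that counts reviews per day.

```python
import pandas as pd

# Hypothetical reviews, with the date column already cast to datetime.
df = pd.DataFrame(
    {
        "review_date": pd.to_datetime(
            ["2021-11-02", "2021-11-02", "2021-11-05", "2021-11-09"]
        ),
        "review": ["Great", "Nice", "Slow service", "Too crowded"],
    }
)

# Count reviews per day: the date is the dimension, the count is the metric.
reviews_per_day = df.groupby("review_date")["review"].count()
print(reviews_per_day)
```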

With a MySQL database as the data source, it usually takes at least 1 hour. Why? Because with a database, a lot of data may be pulled, and Google stores it temporarily in Google Data Studio, so many resources are needed. So we can manage it, but there is a limit.
