How to scrape news articles with specific words

Nafis Ahmad
2 min read · Oct 6, 2021


Web scraping with Python is very simple to do if you just follow along.

Web scraping is all about data. Access to relevant data, and the means to analyze it (and to act intelligently on that analysis), can make a huge difference to the success and growth of most businesses in the modern world.

We will scrape news articles from Inshorts (https://inshorts.com/en/read/national) with the help of BeautifulSoup.

Just follow the steps from here on and you're good to go:

1. Import the required packages. (Make sure to pip install these first, if you do not already have them.)

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import re

2. Let us write the code for scraping the first page:

d = []
r = requests.get("https://inshorts.com/en/read/national")
soup = BeautifulSoup(r.content, 'html.parser')
min_news_id = soup.findAll("script", {"type": "text/javascript"})[2].text
min_news_id = min_news_id[25:35]
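A caveat: the [25:35] slice assumes the offset always sits at the same character positions inside the script text, which is brittle if the page markup shifts. As a sketch (the sample script text below is assumed, not copied from the live page), a regular expression can pull out the quoted value wherever it appears:

```python
import re

# Hypothetical snippet standing in for the inline <script> text on the page;
# the real markup may differ.
script_text = 'var min_news_id = "abc123xyz-1"; var more = true;'

# Capture whatever is between the quotes after min_news_id =, instead of
# relying on fixed character positions.
match = re.search(r'min_news_id\s*=\s*"([^"]+)"', script_text)
min_news_id = match.group(1) if match else None
print(min_news_id)  # abc123xyz-1
```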

3. Code for scraping more pages:

(Set the number of pages to scrape in range.)

The site itself uses JavaScript to load more data from https://inshorts.com/en/ajax/more_news via POST requests with the parameter 'news_offset', which tells the server which page to send to the client. We can make the same POST requests ourselves to get new data in JSON format: the response carries the page's HTML in json_data['html'] and the offset for the next page in json_data['min_news_id'].

for i in tqdm(range(11000)):
    try:
        params = {'news_offset': min_news_id}
        req = requests.post("https://inshorts.com/en/ajax/more_news", data=params)
        json_data = req.json()
        min_news_id = json_data['min_news_id']
        soup = BeautifulSoup(json_data['html'], 'html.parser')
        for data in soup.select('div.news-card.z-depth-1'):
            if data.find(text=re.compile("vaccine")):
                d.append({
                    'headline': data.find(itemprop="headline").getText(),
                    'article': data.find(itemprop="articleBody").getText(),
                    'date': data.find(class_="date").getText()
                })
    except Exception as e:
        print(e)
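One thing to keep in mind: re.compile("vaccine") is case-sensitive, so headlines like "Vaccine drive begins" slip through. A small sketch of the filtering step on toy records (the sample data here is made up, not scraped) shows how the re.IGNORECASE flag widens the match:

```python
import re

# Toy records standing in for parsed news cards (hypothetical data).
cards = [
    {"headline": "New Vaccine drive begins", "article": "Centres open today."},
    {"headline": "Monsoon update", "article": "Rains expected this week."},
]

# re.IGNORECASE also catches "Vaccine"/"VACCINE", which a plain
# re.compile("vaccine") would miss.
pattern = re.compile("vaccine", re.IGNORECASE)
matches = [c for c in cards
           if pattern.search(c["headline"]) or pattern.search(c["article"])]
print(len(matches))  # 1
```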

4. Now the final step is storing the data in a .csv file. Enter the file name you want at the input prompt and your data-set gets saved.

You can toggle writing row indices alongside the articles by setting index to True or False.

df = pd.DataFrame(d)

def name():
    a = input("File Name: ")
    return a

b = name()
df.to_csv(b + ".csv", index=False)
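If pandas is not available, the standard library's csv module can write the same records. A minimal sketch with toy data (an in-memory buffer stands in for the file here):

```python
import csv
import io

# Toy records shaped like the scraped dictionaries (hypothetical data).
d = [
    {"headline": "Vaccine drive expands", "article": "Some text.", "date": "06 Oct"},
    {"headline": "Second vaccine shipment", "article": "More text.", "date": "06 Oct"},
]

buf = io.StringIO()  # swap in open("news.csv", "w", newline="") to write a real file
writer = csv.DictWriter(buf, fieldnames=["headline", "article", "date"])
writer.writeheader()
writer.writerows(d)

print(buf.getvalue().splitlines()[0])  # headline,article,date
```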

Again, this is simple web scraping, but it covers a large percentage of use cases, especially as you become more familiar with the technique.

Please reach out with any questions: nafis.ahmad0087@gmail.com

:)

