
Module 1 | 8 classes

Scraper

Class 1: Introduction

Class 2: Get the page

Class 3: Get the titles

Class 4: Regular expressions

Class 5: BeautifulSoup

Class 6: Threads

Class 7: Get the article

Class 8: Integrate MongoDB

2 comments

Julio Garcia

30 April 20

from bs4 import BeautifulSoup
import requests
# import threading

GOOGLE_NEWS = 'https://news.google.com.mx/'
CUSTOM_TARGET = 'EKhAIACoHCAowob_vCjCR'


def get_beautiful_soup(href):
    # Download the page and parse it; returns None implicitly on a non-200 status.
    response = requests.get(href)
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')


def scrapping_site():
    soup = get_beautiful_soup(GOOGLE_NEWS)
    if soup is not None:
        articles = soup.find_all('h3', {'class': 'ipQwMb ekueJc gEATFF RD0gLb'})
        # find_all returns an iterable object
        for article in articles:
            # title = article.find('a', {'class': 'DY5T1d'}).getText()
            href = article.find('a').get('href')
            # Drop the leading '.' of the relative href; requests needs the scheme.
            href_complete = 'https://news.google.com' + href[1:]
            if CUSTOM_TARGET in href_complete:
                article_soup = get_beautiful_soup(href_complete)
                if article_soup is not None:
                    container = article_soup.find('div', {'class': 'field field-name-body field-type-text-with-summary field-label-hidden'})
                    # The container is missing when the page layout differs.
                    if container is not None:
                        paragraphs = container.find_all('p')
                        for paragraph in paragraphs:
                            print(paragraph)


if __name__ == '__main__':
    scrapping_site()
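
The article URL in the code above is built by concatenating strings and slicing off the first character of the href. As an alternative, the standard library's urllib.parse.urljoin can resolve a relative href against the page it came from. This is only a minimal sketch: it assumes the links are relative paths of the form './articles/...', and the helper name build_article_url and the sample href are made up for illustration.

```python
from urllib.parse import urljoin

# Base page that the relative links are resolved against (same host as in the comment).
BASE_URL = 'https://news.google.com/'


def build_article_url(href, base=BASE_URL):
    """Resolve an href such as './articles/abc' against the base page URL."""
    return urljoin(base, href)


# Hypothetical href, only for illustration.
example_href = './articles/example-id'
print(build_article_url(example_href))
```

urljoin also leaves an already absolute URL unchanged, so the same helper works if some links on the page are not relative.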


@diegopatodo1

30 December 19

A question:
Why do we use this piece of code:

final_article = ''
for paragraph in paragraphs:
    final_article = '{} {}'.format(final_article, paragraph)

print(final_article)

instead of adding .getText() at the end of the line "paragraphs = container.find('strong')", which would give:

paragraphs = container.find('strong').getText()

We get the same result. Is there a reason for it?
Thanks
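
The two versions are not always equivalent. Below is a minimal sketch with a made-up HTML fragment (the markup is an assumption, not the page used in the class). Iterating over the Tag returned by find walks its direct children, and formatting a nested tag keeps its markup, while getText() returns only the text.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a <strong> element that contains a nested tag.
HTML = '<div><strong>Breaking: <em>markets</em> rally</strong></div>'

soup = BeautifulSoup(HTML, 'html.parser')
container = soup.find('div')
paragraphs = container.find('strong')

# Version 1: loop over the direct children of the Tag and format each one.
# Nested tags are rendered with their markup, and extra spaces are added.
final_article = ''
for paragraph in paragraphs:
    final_article = '{} {}'.format(final_article, paragraph)

# Version 2: getText() collects the text of all descendants, without markup.
text_only = paragraphs.getText()

print(repr(final_article))
print(repr(text_only))
```

When the strong element contains only plain text, both approaches print essentially the same string, which may explain the identical result described in the question.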

Class 7

Get the article

7/8

Build a web scraper with Python