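The task is to find the most frequent words in text extracted from a dynamic source, such as a web page.
First, build a web crawler with the requests module and the Beautiful Soup module. It extracts the text from a web page and stores the words in a list. The list may contain unwanted words or symbols (for example special characters and blank spaces), which can be filtered out to simplify counting and produce the desired result. Once every word has been counted, we can also report the most frequent ones (say, the top 10 or 20).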
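Before adding any network code, the filter-then-count step can be tried on a plain string. The following is only a minimal sketch: the function name `count_words`, the sample sentence, and the use of `string.punctuation` as the set of unwanted symbols are assumptions made for illustration, not part of the program below.

```python
import string
from collections import Counter


def count_words(text, top_n=3):
    """Lowercase the text, strip punctuation and count the words."""
    # Translation table that deletes every ASCII punctuation character.
    strip_table = str.maketrans("", "", string.punctuation)
    words = []
    for raw in text.lower().split():
        word = raw.translate(strip_table)
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return Counter(words).most_common(top_n)


# Made-up sample text standing in for the content of a web page.
sample = "Python is fun. Python is simple, and fun is good!"
print(count_words(sample))
# [('is', 3), ('python', 2), ('fun', 2)]
```

Words with equal counts come back in the order they were first seen, which is why 'python' precedes 'fun' here.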
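Modules and library functions used:
requests: lets you send HTTP/1.1 requests and more.
beautifulsoup4: used to pull data out of HTML and XML files.
operator: exports a set of efficient functions corresponding to the intrinsic operators.
collections: implements high-performance container datatypes.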
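To show what the two standard-library modules contribute, here is a small sketch of `operator.itemgetter` and `collections.Counter` applied to a dictionary of counts. The dictionary contents are made-up sample values chosen for illustration.

```python
import operator
from collections import Counter

# Made-up word counts for illustration.
word_count = {"language": 6, "to": 10, "in": 7}

# itemgetter(1) picks the value out of each (key, value) pair,
# so the pairs are sorted by count in ascending order.
by_count = sorted(word_count.items(), key=operator.itemgetter(1))
print(by_count)
# [('language', 6), ('in', 7), ('to', 10)]

# Counter accepts an existing mapping of counts;
# most_common(n) returns the n entries with the highest counts.
top_two = Counter(word_count).most_common(2)
print(top_two)
# [('to', 10), ('in', 7)]
```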
Below is an implementation of the idea described above:
# Python3 program for a word frequency
# counter after crawling a web-page
import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter
# Web crawler entry point: fetches the given
# page, collects its words and passes them on
# to the second function, clean_wordlist()
def start(url):
    # empty list to store the words taken
    # from the fetched page
    wordlist = []
    source_code = requests.get(url).text
    # BeautifulSoup object that parses
    # the downloaded HTML
    soup = BeautifulSoup(source_code, 'html.parser')
    # The text of the page is stored inside
    # <div> tags with the class "entry-content"
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text
        # convert the text to lowercase, then use
        # split() to break it into words
        words = content.lower().split()
        for each_word in words:
            wordlist.append(each_word)
    clean_wordlist(wordlist)
# Function removes any unwanted symbols
def clean_wordlist(wordlist):
    clean_list = []
    # raw string, so the backslash is taken literally
    symbols = r'!@#$%^&*()_-+={[}]|\;:"<>?/., '
    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')
        # keep only words that still have characters left
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)
# Creates a dictionary containing each word's
# count and prints the 10 most frequent words
def create_dictionary(clean_list):
    word_count = {}
    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    # To print the count of every word in the
    # crawled page, uncomment the loop below.
    # operator.itemgetter() takes the index to
    # fetch from each (key, value) pair:
    # 0 selects the key, 1 selects the value
    # for key, value in sorted(word_count.items(),
    #                          key=operator.itemgetter(1)):
    #     print("%s : %s" % (key, value))
    c = Counter(word_count)
    # returns the most frequent elements
    top = c.most_common(10)
    print(top)
# Driver code
if __name__ == '__main__':
    start("https://www.srcmini.org/programming-language-choose/")
Output:
[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
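The result is dominated by function words such as 'to', 'in' and 'is'. Since unwanted words can be filtered out before counting, one possible refinement is to skip a set of stop words. The sketch below is only an illustration: the `STOP_WORDS` set and the `top_content_words` helper are hypothetical names, and the stop-word list is deliberately tiny.

```python
from collections import Counter

# Hypothetical, deliberately small stop-word list;
# a practical list would be much longer.
STOP_WORDS = {"to", "in", "is", "the", "a", "of", "you"}


def top_content_words(words, n=10, stop_words=STOP_WORDS):
    """Count the words that are not stop words and return the n most frequent."""
    kept = (word for word in words if word not in stop_words)
    return Counter(kept).most_common(n)


# Made-up, already cleaned word list standing in for clean_list.
words = ["to", "language", "is", "programming", "language", "the", "c"]
print(top_content_words(words, n=2))
# [('language', 2), ('programming', 1)]
```

In the program above, a helper like this could take the place of the counting loop in create_dictionary(), receiving clean_list as its input.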