Metadata scraping using the New York Times API
Sep 02, 2023 10:13 PM
Introduction
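Last week, I wrote an introduction to scraping web pages to collect metadata, and mentioned that it isn't possible to scrape the New York Times site. The Times paywall blocks attempts to gather even basic metadata. But there is a way around this using the New York Times API.
Recently I began building a community site on the Yii platform, which I'll publish in a future tutorial. I want to make it easy to add links related to the site's content. While people can easily paste a URL into a form, providing the title and source information is time-consuming.
So in today's tutorial, I'll extend the scraping code I recently wrote to use the New York Times API to gather headlines whenever a Times link is added.
Remember, I take part in the comment threads below, so tell me what you think! You can also reach me on Twitter @lookahead_io.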
Getting Started
Sign Up for an API Key
First, let's sign up and request an API key:
After you submit the form, you'll receive your key by email:
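Exploring the New York Times API
The Times offers APIs in the following categories:
- Archive
- Article Search
- Books
- Community
- Geographic
- Most Popular
- Movie Reviews
- Semantic
- Times Newswire
- TimesTags
- Top Stories
That's a lot. On the Gallery page, you can click any topic to see the documentation for that API category:
The Times uses LucyBot to power its API documentation, and there's a helpful FAQ:
They even show you how to quickly check your API usage limits (you'll need to plug in your key):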
curl --head https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=<your-api-key> 2>/dev/null | grep -i "X-RateLimit"
X-RateLimit-Limit-day: 1000
X-RateLimit-Limit-second: 5
X-RateLimit-Remaining-day: 180
X-RateLimit-Remaining-second: 5
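The same check can be scripted in PHP. As a minimal sketch, the helper below pulls the X-RateLimit values out of raw header text like the curl output above. The function name parseRateLimitHeaders is my own invention for illustration and is not part of any NYT library.

```php
<?php
/**
 * Hypothetical helper: extract X-RateLimit-* headers from raw HTTP header text.
 * Returns an array keyed by lowercase header name with integer values.
 */
function parseRateLimitHeaders(string $rawHeaders): array
{
    $limits = [];
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        // Match lines such as "X-RateLimit-Limit-day: 1000" (case-insensitive)
        if (preg_match('/^(X-RateLimit-[A-Za-z-]+):\s*(\d+)\s*$/i', trim($line), $m)) {
            $limits[strtolower($m[1])] = (int) $m[2];
        }
    }
    return $limits;
}

// Sample header text copied from the curl output above
$sampleHeaders = "X-RateLimit-Limit-day: 1000\r\n"
    . "X-RateLimit-Limit-second: 5\r\n"
    . "X-RateLimit-Remaining-day: 180\r\n"
    . "X-RateLimit-Remaining-second: 5\r\n";

$limits = parseRateLimitHeaders($sampleHeaders);
echo $limits['x-ratelimit-remaining-day'] . " requests left today\n";
```

In a real request you could collect the header text by setting CURLOPT_HEADER and CURLOPT_NOBODY on the cURL handle, then back off when the remaining count gets low.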
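I initially struggled to make sense of the documentation - it's a parameter-based specification, not a programming guide. However, I posted some questions on the New York Times API GitHub page, and they were answered quickly and helpfully.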
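Working With the Article Search
In today's episode, I'll focus on using the New York Times Article Search. Basically, we'll extend the Create Link form from the last tutorial:
When the user clicks Lookup, we'll make an AJAX request through to Link::grab($url). Here's the jQuery: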
$(document).on("click", '[id=lookup]', function(event) {
  $.ajax({
    url: $('#url_prefix').val() + '/link/grab',
    data: {url: $('#url').val()},
    success: function(data) {
      // fill the title field with the headline returned by the server
      $('#title').val(data);
      return true;
    }
  });
});
Here are the controller and model methods:
// Controller call via AJAX Lookup request
public static function actionGrab($url)
{
    Yii::$app->response->format = Response::FORMAT_JSON;
    return Link::grab($url);
}
...
// Link::grab() method
public static function grab($url)
{
    // clean up url for hostname
    $source_url = parse_url($url);
    $source_url = $source_url['host'];
    $source_url = str_ireplace('www.', '', $source_url);
    $source_url = trim($source_url, ' \\');
    // use the NYT API when hostname == nytimes.com
    if ($source_url == 'nytimes.com') {
        ...
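The hostname cleanup above can be isolated into a small function that is easy to test on its own. The sketch below is one possible variant, not the code used on the site: sourceHost is a hypothetical name, the example URLs are made up, and unlike str_ireplace() it strips only a leading "www." and returns null when no host can be parsed.

```php
<?php
/**
 * Hypothetical helper: reduce a URL to a bare lowercase hostname for comparison,
 * e.g. "https://www.nytimes.com/..." becomes "nytimes.com".
 * Returns null when the URL has no parsable host.
 */
function sourceHost(string $url): ?string
{
    $host = parse_url(trim($url), PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    // Strip only a leading "www." so a host like "mywww.example.com" is left alone
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

// Illustrative URLs only
var_dump(sourceHost('https://www.nytimes.com/example-article.html') === 'nytimes.com');
var_dump(sourceHost('not a url'));
```

Returning null for unparsable input lets the caller fall back to the standard scraper instead of triggering an undefined index notice on $source_url['host'].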
Next, let's use our API key to make the Article Search request:
        $nytKey = Yii::$app->params['nytapi'];
        // ask only for the headline of the article whose web_url matches $url
        $curl_dest = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fl=headline&fq=web_url:%22'.
            $url.'%22&api-key='.$nytKey;
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_URL, $curl_dest);
        $result = json_decode(curl_exec($curl));
        $title = $result->response->docs[0]->headline->main;
    } else {
        // not NYT, use the standard metatag scraper from last episode
        ...
    }
}
return $title;
}
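The request above concatenates the article URL straight into the query string and assumes the response always contains a match. As a hedged sketch of a more defensive variant, the two hypothetical helpers below (buildArticleSearchUrl and extractHeadline are my own names) build the query with http_build_query(), which percent-encodes the filter, and return null when no document comes back. The https endpoint follows the curl example earlier, and the sample JSON is invented to mirror only the fields the code reads.

```php
<?php
/**
 * Hypothetical helper: build an Article Search request URL.
 * http_build_query() percent-encodes the quotes, colons and slashes in the fq filter.
 */
function buildArticleSearchUrl(string $articleUrl, string $apiKey, array $fields = ['headline']): string
{
    $query = http_build_query([
        'fl'      => implode(',', $fields),
        'fq'      => 'web_url:"' . $articleUrl . '"',
        'api-key' => $apiKey,
    ]);
    return 'https://api.nytimes.com/svc/search/v2/articlesearch.json?' . $query;
}

/**
 * Hypothetical helper: return the main headline from a decoded response,
 * or null when the response is missing or holds no matching document.
 */
function extractHeadline(?object $result): ?string
{
    return $result->response->docs[0]->headline->main ?? null;
}

// Illustrative article URL and placeholder key
$requestUrl = buildArticleSearchUrl('https://www.nytimes.com/example-article.html', 'YOUR-KEY');
echo $requestUrl . "\n";

// Invented sample response containing only the fields read above
$sampleResult = json_decode('{"response":{"docs":[{"headline":{"main":"Example headline"}}]}}');
var_dump(extractHeadline($sampleResult));
var_dump(extractHeadline(json_decode('{"response":{"docs":[]}}')));
```

With a null result the caller can fall back to the metatag scraper rather than failing on a missing docs[0].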
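It works quite simply - here's the resulting headline (by the way, climate change is killing polar bears and we should care):
If you'd like more detail from your API request, just add additional arguments to the ?fl=headline request, such as keywords and lead_paragraph: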
Yii::$app->response->format = Response::FORMAT_JSON;
$nytKey = Yii::$app->params['nytapi'];
// request the headline, keywords and lead paragraph fields
$curl_dest = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?'.
    'fl=headline,keywords,lead_paragraph&fq=web_url:%22'.$url.'%22&api-key='.$nytKey;
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $curl_dest);
$result = json_decode(curl_exec($curl));
var_dump($result);
Here's the result:
Perhaps I'll write a PHP library to better parse the NYT API in an upcoming episode, but this code breaks out the keywords and lead paragraph:
Yii::$app->response->format = Response::FORMAT_JSON;
$nytKey = Yii::$app->params['nytapi'];
$curl_dest = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?'.
    'fl=headline,keywords,lead_paragraph&fq=web_url:%22'.$url.'%22&api-key='.$nytKey;
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $curl_dest);
$result = json_decode(curl_exec($curl));
// headline, then lead paragraph, then one keyword per line
echo $result->response->docs[0]->headline->main.'<br />'.'<br />';
echo $result->response->docs[0]->lead_paragraph.'<br />'.'<br />';
foreach ($result->response->docs[0]->keywords as $k) {
    echo $k->value.'<br/>';
}
Here's what it displays for this article:
Polar Bears’ Path to Decline Runs Through Alaskan Village

The bears that come here are climate refugees, on land because the sea ice they rely on for hunting seals is receding.

Polar Bears
Greenhouse Gas Emissions
Alaska
Global Warming
Endangered and Extinct Species
International Union for Conservation of Nature
National Snow and Ice Data Center
Polar Bears International
United States Geological Survey
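Rather than echoing each field, the same values can be gathered into an array that the rest of the application can store or render. This is only a sketch: summarizeArticle is a hypothetical name, and the sample JSON is an abbreviated stand-in built from the output shown above, not a full API response.

```php
<?php
/**
 * Hypothetical helper: collect headline, lead paragraph and keyword values
 * from the first document of a decoded Article Search response.
 * Returns null when there is no matching document.
 */
function summarizeArticle(?object $result): ?array
{
    $doc = $result->response->docs[0] ?? null;
    if ($doc === null) {
        return null;
    }
    $keywords = [];
    foreach ($doc->keywords ?? [] as $k) {
        if (isset($k->value)) {
            $keywords[] = $k->value;
        }
    }
    return [
        'headline'       => $doc->headline->main ?? null,
        'lead_paragraph' => $doc->lead_paragraph ?? null,
        'keywords'       => $keywords,
    ];
}

// Abbreviated sample shaped like the fields read above; not a full API response
$sampleJson = '{"response":{"docs":[{'
    . '"headline":{"main":"Polar Bears Path to Decline"},'
    . '"lead_paragraph":"The bears that come here are climate refugees.",'
    . '"keywords":[{"value":"Polar Bears"},{"value":"Greenhouse Gas Emissions"},{"value":"Alaska"}]'
    . '}]}}';

$summary = summarizeArticle(json_decode($sampleJson));
echo $summary['headline'], "\n";
echo implode(', ', $summary['keywords']), "\n";
```

Keeping the parsing separate from the output makes it straightforward to return the array as JSON from the controller or to save the keywords alongside the link.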
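Hopefully, this starts to expand your imagination about how you can use these APIs. What's now possible is pretty exciting.
In Closing
The New York Times API is very useful, and I'm glad to see them offer it to the developer community. It was also refreshing to get such quick API support via GitHub - I just didn't expect it. Keep in mind that the API is intended for non-commercial projects. If you have a money-making idea, send them a note to see if they'll work with you. Publishers are eager for new sources of revenue.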
I hope you find these web scraping snippets helpful and put them to use in your projects. If you'd like to see today's episode in action, you can try out some of the web scraping on my site, Active Together.
Please share any thoughts and feedback in the comments. You can also always contact me directly on Twitter @lookahead_io. Be sure to check out my instructor page and other series: Building Your Startup with PHP and Programming with Yii2.
Related Links
- The New York Times API Gallery
- The New York Times Public API Specification on GitHub
- How to crawl metadata in web pages (Envato Tuts)
- How to use Node.js and jQuery to crawl web pages (Envato Tuts)
- Build your first Web Scraper in Ruby (Envato Tuts)