Google parsing – theory and practice. What data can you get?


This parser is as simple as a pair of twenty-ruble underpants. And that goes not only for its capabilities (which, by the way, are modest: no proxy support, no anti-captcha), but for the interface too.

But just in case, I'll explain what to click and where so that everything goes smoothly :)


1 - Queries to the search engine, one per line. Enter Cyrillic characters as they are; the program will URL-encode them itself. A right mouse click opens a menu with a couple of goodies.

2 - A button that appends site:TLD to each query; the list of TLDs is taken from the zones.txt file.

Why is this necessary? It's very simple: compare the query "google parser" with the query "google parser site:ru".
In the first case the results contain all the sites found; in the second, only sites in the .ru zone.
This is useful when you need more than 1000 results: ideally, you can get 1000 links for each domain zone.
For example, for the query "google parser" alone we received only 1000 links.
But if you click "site:TLD", we can get up to 11,000 links.

3 - The file in which the found links will be saved. If the specified file exists, it is appended to rather than overwritten.

4 - The file in which the found domains will be saved. If the specified file exists, it is appended to rather than overwritten.

5 - Delay interval between requests. It's better not to rush things: set it somewhere between 20 and 30, then go make yourself some tea and a sausage sandwich and read the news while the program runs :)

6 - Drop-down list for controlling parsing: start, stop, pause and resume. The contents of the list change depending on the current task, so only the available actions are shown.

It's no secret to any of you that to promote websites you need links, preferably a lot and for free. Where can I get them? There are sites that receive content thanks to users. For example, directories of websites, articles and companies. A database is a collection of addresses of such sites.
Regardless of what database you collect, you can find sites for relevant queries in search engines. This process is called parsing the results. Usually parsed Google and there are three reasons for this:
1.Good search quality
2.High response speed
3.Availability of operator inurl:
This operator has the following form inurl: “contents of the url of the searched pages”. Using this operator, you can search for specific website engines. IN Yandex there are no analogues to this operator.

For example, to find most Made-Cat catalogs, enter the query inurl:"ext/rules" or inurl:"add/1" into the Google search box.

However, there are a few things you need to know when using this operator. First, Google treats most special characters the same as a space. This is bad, because some engines will be parsed with a huge amount of garbage. For example, the results for the query inurl:"xxx/yyy" can contain both pages with "xxx?yyy" and pages with "xxx.yyy".
Second, for many queries the search engine does not show all the results when this operator is used, precisely to hinder doorway builders.
Sometimes I replace a query with the inurl: operator by a query of the form -intext:"XXX" -intitle:"XXX" "XXX". In other words, we tell Google to look for XXX, but not in the text and not in the title; apart from those, only the URL is left. True, such a replacement is not equivalent: if the desired XXX is in the title or text and at the same time in the URL, such a page will not be shown.
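A minimal sketch of this rewrite, purely as an illustration of the substitution (the free parser described below has its own "convert" button for the same thing):

function convertInurlQuery(query) {
  // inurl:"XXX"  ->  -intext:"XXX" -intitle:"XXX" "XXX"
  return query.replace(/inurl:\s*"([^"]+)"/g, function (match, xxx) {
    return '-intext:"' + xxx + '" -intitle:"' + xxx + '" "' + xxx + '"';
  });
}

// convertInurlQuery('inurl:"add/1"') -> '-intext:"add/1" -intitle:"add/1" "add/1"'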

When parsing there are usually two tasks:
1. Parse as many URLs as possible.
2. Capture as little garbage as possible, that is, pages that we don't need.

To solve the first problem, the following method is used. Suppose the query "XXX" returns only 1000 sites, but there are, say, half a million of them on the Internet. To widen the results, add "useless" clarifications to the main query:
"XXX" company
"XXX" find
"XXX" site
"XXX" page
"XXX" home
For the clarifications we take commonly used words that can be found on almost any website. It is even more useful, though, to split the sites into non-overlapping categories: only English, only Russian, only Ukrainian. Or add a search by domain zone: inurl:".com", inurl:".net" and so on. Take, for example, the query "catalog". There are 209,000,000 pages on the Internet with this word, but we are given no more than 1000. Using six queries
1. catalog inurl:".com"
2. catalog inurl:".net"
3. catalog inurl:".biz"
4. catalog inurl:".ru"
5. catalog inurl:".info"
6. catalog inurl:".org"
we will receive not 1000 but 6000 catalogs. With some resourcefulness you can get tens of thousands of URLs, but most of them will be trash.
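A tiny sketch of this query-multiplication trick (purely illustrative; the refinements can be the "useless" words above or inurl: domain-zone filters):

function expandQuery(baseQuery, refinements) {
  // produce one query per refinement, e.g. 'catalog' + ' inurl:".com"'
  return refinements.map(function (r) {
    return baseQuery + " " + r;
  });
}

// expandQuery("catalog", ['inurl:".com"', 'inurl:".net"', 'inurl:".biz"'])
//   -> ['catalog inurl:".com"', 'catalog inurl:".net"', 'catalog inurl:".biz"']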

Sometimes the garbage problem is quite significant, so before parsing you have to check the quality of the results for each query manually, so that the machine does not pick up a lot of unnecessary sites that you would then have to spend time filtering out. Finding "useful" clarifications helps here.
For example, the query inurl:"add/1" returns a lot of garbage, so it needs a clarification: inurl:"add/1" "Your site URL". You can go further and filter out gray directories: inurl:"add/1" "Your site URL" -"URL where the link is placed".

Manually collecting parsing results is time-consuming, boring and unproductive. That is why there are dedicated programs, parsers, which record the query output and save it. Most parsers are either paid products themselves or come bundled with other paid applications.

Using a free desktop parser

The program does not require installation, so you can use it right after downloading. It only works with Google and has a Spartan interface, but, as they say, "don't look a gift horse in the mouth."

1. Query entry field. Here you enter a list of queries to Google, for example inurl:"xxx" (note that the operator and the query are written without a space).
2. Input/output field for the request URLs sent to Google. This field shows which Google URLs are being parsed for your queries. If you wish, you can enter your own list of Google URLs to be parsed here (see the sketch after this list). For example: "http://www.google.com.ua/search?hl=ru&q=XXX&btnG=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA+%D0%B2+Google&meta="
3. Result output field: the URLs of the sites that were found.
4. Task completion percentage.
5. Filter for parsing only Russian-language sites.
6. Delay in milliseconds, from 0 to 60,000. The delay is needed so that Google does not realize that a program is parsing it and block your access.
7. The "Let's go" button starts parsing.
8. Shows the page being parsed at the moment. Not particularly useful, more for entertainment.
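For reference, such request URLs are easy to build yourself. A minimal sketch using Google's standard hl, q, num and start parameters (the exact parameter set the program uses is not documented, so treat this as an illustration):

function buildGoogleUrl(query, page) {
  var params = [
    "hl=ru",                          // interface language
    "q=" + encodeURIComponent(query), // the search query itself
    "num=100",                        // results per page
    "start=" + page * 100             // offset of the first result
  ];
  return "http://www.google.com/search?" + params.join("&");
}

// buildGoogleUrl('inurl:"add/1"', 0)
//   -> "http://www.google.com/search?hl=ru&q=inurl%3A%22add%2F1%22&num=100&start=0"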

Additionally, above the query input field (1) there is a "convert" button, which converts queries of the form inurl:"XXX" into -intext:"XXX" -intitle:"XXX" "XXX".

How to use the program? Enter a list of queries in the left input field (1), wait, and copy the result from the output field (3). Then remove duplicate domains, for example using http://bajron.od.ua/?p=67. The results are stored as a list of URLs of the found sites.
The program eliminates most of the routine work and parses much faster than a human.
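The same clean-up can also be done locally. A minimal sketch (an alternative to the online tool mentioned above) that keeps only the first URL seen for each host:

function dedupeByDomain(urls) {
  var seen = {};
  return urls.filter(function (url) {
    // take the host part of the URL, e.g. "http://lenta.ru/news/" -> "lenta.ru"
    var host = url.replace(/^https?:\/\//, "").split("/")[0].toLowerCase();
    if (seen[host]) {
      return false; // a URL from this domain was already kept
    }
    seen[host] = true;
    return true;
  });
}

// dedupeByDomain(["http://lenta.ru/", "http://lenta.ru/news/", "http://okna-rassvet.ru/"])
//   -> ["http://lenta.ru/", "http://okna-rassvet.ru/"]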

  • Number of results per query
  • Links, anchors and snippets from the search results
  • List of related keywords
  • Detects whether Google treated the query as containing a typo or not
  • Parses links, anchors and snippets from advertising blocks. Note that the $link variable contains the links in the form Google returns them; to get the links as they are displayed under the anchors in the results, use the $visiblelink variable. This applies only to the ad block.



Possibilities

  • Support for all Google search operators (site:, inurl:, etc.)
  • Parses the maximum number of results Google returns: 10 pages of 100 results each
  • Can automatically parse more than 1000 results per query by substituting additional characters (the Parse all results option)
  • Ability to search for related keywords
  • Supports selecting the search country, city, region, domain and language of the results
  • Supports restricting the results to a time period
  • Ability to parse news and blogs
  • You can specify whether to parse the results when Google reports that nothing was found for the given query and offers results for a similar query
  • Supports disabling Google's filter that hides similar results (filter=)
  • Option to select the Google interface language, so that with identical settings the results in the parser and in the browser match as closely as possible

Use Cases

  • Collection of link databases - for A-Poster, XRumer, AllSubmitter, etc.
  • Assessing competition for keywords
  • Search for backlinks (mentions) of sites
  • Checking site indexing
  • Search for vulnerable sites
  • Any other options that involve Google parsing in one form or another

Examples

Queries

  • Queries must contain search phrases, exactly as if they were entered directly into the Google search form, for example:

windows Moscow
site:http://lenta.ru
inurl:guestbook


Results

  • As a result, a list of links for queries is displayed:

http://lenta.ru/
http://vesti.lenta.ru/
http://old.lenta.ru/
http://1991.lenta.ru/
http://vip.lenta.ru/
http://pda.lenta.ru/
http://3m.lenta.ru/
http://lm-3.lenta.ru/
http://aquarium.lenta.ru/magazine/
http://real.lenta.ru/
http://megafon.lenta.ru/
http://okna-rassvet.ru/
http://www.montblanc.ru/
http://www.probkiizokna.ru/
http://www.panorama-group.ru/
http://www.oknadoz.ru/
http://www.okna-darom.ru/
http://www.oknarosta.ru/
...


Possible settings

Parameter | Default value | Description
Links per page | 100 | Number of links in the search results per page
Pages count | 5 | Number of pages to parse
Google domain | www.google.com | Google domain to parse; all domains are supported
Results language | Any language | Selects the language of the results (lr= parameter)
Search from country | Global | Selects the country from which the search is performed (geo-dependent search, gl= parameter)
Location (city) | - | Search by city or region. Cities can be specified in the form novosibirsk, russia; the full list of locations is available at the link. The correct Google domain must also be set
Hide omitted results | | Determines whether to hide omitted results (filter= parameter)
Serp time | All time | SERP time period (time-dependent search, tbs= parameter)
Serp type | Main | Determines whether to parse the main results, news or blogs
Parse not found | | Determines whether to parse the results if Google reported that nothing was found for the specified query and offered results for another query
Use AntiGate | | Determines whether to use AntiGate to bypass captchas
AntiGate preset | default | Util::AntiGate parser preset. You must first configure the Util::AntiGate parser (specify your access key and other parameters), then select the created preset here
Use digit captcha | | Enables forced use of the digit captcha
Use sessions | | Saves good sessions, which allows parsing even faster with fewer errors
Interface language | English | Selects the Google interface language for maximally identical results in the parser and in the browser
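Several of these settings map directly onto Google's own query parameters (lr=, gl=, filter=, tbs=, num=). A minimal sketch of how such a request URL could be assembled (the option names here are illustrative, not the parser's internal API):

function buildSerpUrl(domain, searchQuery, opts) {
  var params = {
    q: searchQuery,
    num: opts.linksPerPage,           // "Links per page"
    lr: opts.resultsLanguage,         // "Results language"
    gl: opts.searchCountry,           // "Search from country"
    filter: opts.hideOmitted ? 1 : 0, // "Hide omitted results"
    tbs: opts.serpTime                // "Serp time"
  };
  var pairs = [];
  for (var key in params) {
    if (params[key] !== undefined && params[key] !== "") {
      pairs.push(key + "=" + encodeURIComponent(params[key]));
    }
  }
  return "https://" + domain + "/search?" + pairs.join("&");
}

// buildSerpUrl("www.google.com", "windows Moscow", { linksPerPage: 100, hideOmitted: true })
//   -> "https://www.google.com/search?q=windows%20Moscow&num=100&filter=1"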

Sometimes you need to get data from a site in a compressed form so that you don’t have to move from page to page, collecting information bit by bit. This video tutorial will help you collect all the necessary data into a table that is easy to view.

Information gathering tasks are not as rare as they might seem. Sometimes we need to get statistics on the sections, topics or headings of a website whose admin panel we have no access to, or collect data about companies from a catalog where it is scattered across many pages; done manually, this would take more than one person-day. To automate such tasks, you can use scripts in Google Sheets.

To work with a spreadsheet, you use the SpreadsheetApp class, which makes handling the spreadsheet quite convenient. To work with the active sheet, we use its getActiveSheet() method.
All operations within a sheet are performed on ranges. Even if you need to work with a single cell, you first have to define a range for it. A range in the selected sheet is set with the getRange("A1:B2") method, which takes a string describing the range of cells in the familiar Excel notation. Within a given range, a cell is selected with the getCell(row, col) method, where row and col are the row and column numbers of the cell inside the range.

To fetch a web page, we use the UrlFetchApp class, which has the fetch(url) method we need. To be able to work with the result of this method, it has to be converted to text with the getContentText() function. After that call, the page source is stored in a variable you can work with however you like. In principle, nothing stops you from building a DOM tree from the resulting page and working with that, but for small tasks, where the needed data can be isolated directly in the text, working with the string seems simpler and faster.

Script code for Google Spreadsheet

JavaScript

function getContent() {
  // Walk through 18 catalogue pages; each page fills 10 rows,
  // so page j starts at row 1 + 10 * (j - 1).
  for (var j = 1; j < 19; j++) {
    getPageContent(1 + 10 * (j - 1), "https://site/cms/page/" + j + "/"); // placeholder URL from the original article
  }
}

function getPageContent(startRow, url) {
  var sheet = SpreadsheetApp.getActiveSheet();
  var range = sheet.getRange("A1:B181");
  var cell = range.getCell(startRow, 1);
  var response = UrlFetchApp.fetch(url);
  var textResp = response.getContentText();
  var start, name;
  var end = 0;
  // The HTML markers used in the original indexOf() calls were lost when the
  // article was copied, so START_MARKER is a placeholder: substitute the tag
  // that precedes each item on the page you are parsing.
  var START_MARKER = "<!-- item -->";
  for (var i = 0; i < 10; i++) {
    start = textResp.indexOf(START_MARKER, end) + START_MARKER.length;
    start = textResp.indexOf(">", start) + 1; // skip to the end of the opening tag
    end = textResp.indexOf("<", start);       // cut at the next tag
    name = textResp.substring(start, end);    // the extracted item name
    cell.setValue(name);
    cell = cell.offset(1, 0);                 // move one row down
  }
}

Everyone has encountered a situation where they need to collect and systematize a large amount of information. For standard SEO tasks there are ready-made services, for example Netpeak Checker for comparing the performance of competing sites, or Netpeak Spider for parsing a site's internal data. But what if the problem is non-trivial and there are no ready-made solutions? There are two ways: do everything manually and slowly, or put the routine process into a spreadsheet, automate it, and get the results many times faster. That is the case we will talk about.

What is website parsing and why is it needed?

Kimono is a powerful, quick-to-set-up scraper with an intuitive interface. It lets you parse data from other sites and update it later. Free.

You can get to know it better with a short manual on how to use it in Russian, or on moz.com in English. Let's try to parse something useful with Kimono. For example, let's supplement the table of cities we created earlier with a list of resorts in the destination country (the Cities 2 column). Here is how this can be implemented with Kimono Labs. We will need:

  • the Kimono extension for Google Chrome;
  • a Google Docs spreadsheet.

1. Find a site with the information we need, that is, a list of countries and their resorts, and open the page from which the data should be collected.

2. Click the Kimono icon in the upper right corner of Chrome.

3. Select the parts of the page from which data needs to be parsed. If you need to select a new type of data on the same page, click the "+" to the right of "property 1"; this tells Kimono that the new data should go into a separate column.

4. By clicking the curly braces <> and selecting "CSV", you can see how the selected data will be arranged in the table.

5. When all data is checked:

  • click "Done" (in the upper right corner);
  • log in to Kimono to link the API to your account;
  • enter a name for the future API;
  • click "Create API".

6. When the API is created, go to the Google spreadsheet where we want to load the selected data. Select "Connect to Kimono" and click the name of our API, "Resorts". The list of countries and the links to the pages with resort cities are loaded onto a separate sheet.

7. Go back to the site, take Ireland as an example, and again use Kimono to select the cities that need to be parsed. Create an API and call it "Resorts in countries".

9. In "Crawl Strategy", select "URLs from source API". A field appears with a drop-down list of all APIs. Select the "Resorts" API we created earlier, and the list of URLs for parsing is loaded from it automatically. Click the blue "Start Crawl" button and monitor the parsing status. Kimono crawls the pages, parses the data according to the previously specified template and adds it to the table; in other words, it does everything it did for Ireland, but for all the other countries, which were fed in automatically and without our involvement.

10. When the table has been built, synchronize Kimono Labs with the Google spreadsheet exactly as in step 6. As a result, a second sheet with data appears in the spreadsheet.

Suppose we want the table to display all the resort cities in the country of the destination city. We process the data on the Kimono sheets with Google Spreadsheets formulas and output, in a single line, the list of cities where you can also relax in Australia besides Sydney.

For example, it can be done like this. Label the data array (the list of cities) using logical functions that set the cell value to TRUE or FALSE. In the example below we have marked the cities that are located in Australia:

  • TRUE = city is in Australia;
  • FALSE = city is in another country.

Using the TRUE labels, we determine the beginning and end of the processed range, and display the cities corresponding to this range in a line.
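If you prefer script over formulas, the same selection can be done with a small Apps Script function. A minimal sketch, assuming the Kimono data sits on a sheet named "Resorts in countries" with countries in column A and cities in column B (both names are assumptions for illustration):

function listResorts(country, excludeCity) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet()
                            .getSheetByName("Resorts in countries"); // assumed sheet name
  var rows = sheet.getDataRange().getValues();
  var cities = [];
  for (var i = 0; i < rows.length; i++) {
    // the TRUE/FALSE labelling from the formulas becomes a simple comparison here
    if (rows[i][0] === country && rows[i][1] !== excludeCity) {
      cities.push(rows[i][1]);
    }
  }
  return cities.join(", ");
}

// All Australian resorts except Sydney, as a single line:
// listResorts("Australia", "Sydney");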

By analogy, we can derive resort towns for other countries.

We have deliberately given a fairly simple, step-by-step example here; the formula can be made more complex, for example so that it is enough to enter the country in column C and all the other calculations and the output of the cities in the line happen automatically.

Automation results

As mentioned at the beginning, we regularly need to create 20 tables of the same type. This is a routine process that eats up 40-50 minutes per table, or about 16 hours for every 20 of them. Agree, two working days for identical tables is an unreasonable waste of time. After automation, one table takes 5-10 minutes, and 20 take about 2 hours. Each table has 17 cells and is parsed from 5 sources; it fills in automatically once only 2 cells of source data are entered.

Setting up and automating the parsing took about 30 hours in total, which means the time spent will already "pay off" at the stage of producing the second batch of 20 tables.






