Web Scraping and Geocoding: Some Stylized Facts About Regions of The Republic of Armenia

HAL Id: hal-02015746

https://hal.archives-ouvertes.fr/hal-02015746

Preprint submitted on 12 Feb 2019


Sergey Avetisyan

To cite this version:

Sergey Avetisyan. Web Scraping and Geocoding: Some Stylized Facts About Regions of The Republic of Armenia. 2019. ⟨hal-02015746⟩


12/02/2019 Technical Article

Economic Research Department, Central Bank of Armenia, Dilijan Training and Research Centre (email: avetisyan.sergej@gmail.com, sergey.avetisyan@cba.am)

List of Figures

1 Geocoding Location Information/R Studio
2 Scraping data from the web/R Studio
3 Map of Armenian Republic. Source: author’s own elaboration on data from the American University of Armenia (Vector Format Data).
4 Map of Armenian Republic. Source: author’s own elaboration on data from the American University of Armenia (Vector Format Data).
5 Population of Cities in The Republic of Armenia. Source: author’s own elaboration.
6 Scatter Plot. Source: author’s own elaboration on data from Google-Map and the banks’ web pages.

Abstract

Because city-level data are very limited in the Republic of Armenia, this technical paper is exploratory at this stage. It shows how to collect and visualise data from web pages.

Key words: R Studio, web scraping, geocoding, mapping, Republic of Armenia.

1 Notes

R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. S was created by John Chambers [1] in 1976, while at Bell Labs. There are some important differences, but much of the code written for S runs unaltered in R. R was created by Ross Ihaka [2] and Robert Gentleman [3] at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of the first two R authors and partly as a play on the name of S. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000 (Arroyo Abad and Khalifa (2015); Kleiber and Zeileis (2008); Bivand et al. (2008); Rubenstein-Montano et al. (2001)).

Geocoding is generally incorporated in commercial Geographic Information Systems (Bichler and Balchak (2007)), where geocoded data can collectively be used for mapping, visualization, and spatial analysis of events. In the past few years, however, the democratization of internet-based mapping services such as Google Maps or Mapquest has facilitated the use of online geocoding services for non-GIS users (Wu et al. (2005)).

This paper should not be reported as representing the views of the Central Bank of Armenia. The views in this paper are those of the author and should not be interpreted as those of the Central Bank of Armenia.

[1] John McKinley Chambers is the creator of the S programming language and a core member of the R programming language project. He was awarded the 1998 ACM Software System Award for developing S. He donated his prize money (US $10,000) to the American Statistical Association to endow an award for novel statistical software.

[2] George Ross Ihaka obtained his doctorate in 1985 from the University of California, Berkeley, supervised by David R. Brillinger. He received the Royal Society of New Zealand’s Pickering Medal in 2008 for his work on R.

[3] Robert Clifford Gentleman won the Benjamin Franklin Award in 2008, recognising his work on the R programming language, the Bioconductor project and his commitment to data and methods sharing.

2 Web Scraping and Geo Coding

Locational information, for instance in the form of addresses, can be transformed into geographic coordinates through a process known as geocoding. The procedure requires that addresses are standardized. Using a reference dataset (typically a street database), geographic coordinates can be estimated by comparing the address to the range of addresses on each segment of the reference dataset and interpolating within the matching segment. The procedure is sensitive to the completeness of the addresses and to the quality of the local and regional road network database. In the past decade or so, geographic information systems have become increasingly popular among economists. One reason for this popularity is that geographic information allows economists to observe the previously unobserved.
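The interpolation step can be illustrated with a minimal Python sketch (Python is used here for illustration; the paper itself works in R). The `StreetSegment` record, the `interpolate_address` helper, the street name, and the coordinates are all hypothetical, and a real reference database would also handle odd/even street sides and curved segments.

```python
from dataclasses import dataclass


@dataclass
class StreetSegment:
    """One segment of a hypothetical reference street database."""
    street: str
    from_number: int  # house number at the start of the segment
    to_number: int    # house number at the end of the segment
    start: tuple      # (latitude, longitude) of the start point
    end: tuple        # (latitude, longitude) of the end point


def interpolate_address(street, number, segments):
    """Estimate coordinates for a house number by linear interpolation."""
    wanted = street.strip().lower()
    for seg in segments:
        low, high = sorted((seg.from_number, seg.to_number))
        if seg.street.lower() == wanted and low <= number <= high:
            span = seg.to_number - seg.from_number
            # Fraction of the way along the segment (0 = start, 1 = end).
            t = 0.0 if span == 0 else (number - seg.from_number) / span
            lat = seg.start[0] + t * (seg.end[0] - seg.start[0])
            lon = seg.start[1] + t * (seg.end[1] - seg.start[1])
            return lat, lon
    return None  # no matching segment: the address cannot be geocoded


# Illustrative values only, not real survey data.
segments = [StreetSegment("Example Street", 1, 101, (40.0, 44.0), (40.01, 44.02))]
estimate = interpolate_address("example street ", 51, segments)
```

House number 51 lies halfway along the 1 to 101 range, so the estimate falls at the midpoint of the segment. An address on a street missing from `segments` returns `None`, which is how incomplete reference data shows up in practice.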

Web scraping is a method for extracting content, chiefly though not exclusively text, from websites so that it can be analyzed.

Web scraping differs from data mining: the latter is concerned with analyzing data, often with complex statistical algorithms, and how the data were obtained is beside the point. In this study, web scraping is treated as part of data mining, specifically as the first step of the data mining process. Data mining itself is considered to be a part of business intelligence.
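As a rough illustration of scraping as a first, pre-analysis step, the Python sketch below pulls the rows of an HTML table into lists using only the standard library's `html.parser`. The `TableScraper` class and the page fragment are hypothetical; the city names and counts are made up, and a real workflow would first download the page.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment with made-up values.
PAGE = """
<table>
  <tr><th>City</th><th>Branches</th></tr>
  <tr><td>City A</td><td>12</td></tr>
  <tr><td>City B</td><td>2</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(PAGE)
header, *records = scraper.rows
```

The result is plain tabular data (`header` plus `records`), which is the point at which analysis or mapping can begin.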

Geocoding is the process of assigning a location, usually in the form of coordinate values, to an address by comparing the descriptive location elements in the address to those present in the reference material.

Addresses come in many forms, ranging from the common address format of house number followed by the street name and succeeding information to other location descriptions such as postal zone or census tract. In essence, an address includes any type of information that distinguishes a place (Swift et al. (2008)).
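The matching of descriptive address elements against reference material can be sketched as follows. This is a deliberately simple Python illustration under stated assumptions: the abbreviation table, the `standardize` and `geocode` helpers, and the reference entries with their coordinates are all hypothetical, and matching is exact rather than fuzzy.

```python
import re

# Hypothetical abbreviation table; a real one depends on local address conventions.
ABBREVIATIONS = {"st": "street", "str": "street", "ave": "avenue", "sq": "square"}


def standardize(address):
    """Lower-case, drop punctuation, and expand common abbreviations."""
    tokens = re.findall(r"\w+", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def geocode(address, reference):
    """Return the coordinates of the exactly matching reference entry, else None."""
    return reference.get(standardize(address))


# Hypothetical reference material with made-up coordinates.
raw_reference = {"10 Example Street": (40.0, 44.0), "5 Sample Avenue": (40.1, 44.1)}
reference = {standardize(key): value for key, value in raw_reference.items()}

match = geocode("10 Example St.", reference)
no_match = geocode("99 Unknown Rd", reference)
```

Standardizing both the query and the reference keys is what lets "10 Example St." match "10 Example Street"; an address absent from the reference material yields `None`.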

Kudamatsu (2018) surveys the use of geographic information systems (GIS) for the credible identification of causal impacts in recent economics research. The article describes how each geo-processing tool in GIS allows economists to use data on geography and weather as sources of exogenous variation for estimating the impact of various ‘treatments’. The diverse range of treatments discussed in the survey includes disease, school competition, land suitability for agriculture, infrastructure, the elasticity of housing supply, mass media, learning from friends, slave trade, the appropriability of crop harvests, and terrain ruggedness.

3 Implication

Cities are places of incremental decision-making involving complex negotiations that produce accumulations of urban assets and path dependency (Bryson et al. (2017)). Cities generate wealth and improve living standards while providing the density, interaction, and networks that make us more creative and productive. They are the key social and economic organizing units of our time, bringing together people, jobs, and all the inputs required for economic growth. Urbanization and economic development are closely intertwined. While urbanization per se does not cause development, sustained economic development does not occur without urbanization.

Figure 1: Geocoding Location Information/R Studio

Figure 2: Scraping data from the web/R Studio

Figure 3: Map of Armenian Republic. Source: author’s own elaboration on data from the American University of Armenia (Vector Format Data).

Figure 4: Map of Armenian Republic. Source: author’s own elaboration on data from the American University of Armenia (Vector Format Data).

Figure 5: Population of Cities in The Republic of Armenia. Source: author’s own elaboration.

Figure 6: Scatter Plot. Source: author’s own elaboration on data from Google-Map and the banks’ web pages.

Armenia is subdivided into eleven administrative divisions. Of these, ten are provinces, known in Armenian as marzer, or marz in the singular (NUTS 3 [4]).

Yerevan is treated separately and granted special administrative status as the country’s capital. The chief executive in each of the 10 marzer (Figure 3b: Map of Armenian Republic) is the marzpet, appointed by the government of Armenia. In Yerevan, the chief executive is the mayor.

There are a number of concepts relating to cities and urban development that, although complex and multifaceted, are meaningful and desirable to measure. The City Development Index (CDI) cuts across the different clusters identified in the Urban Indicator Framework, as it is based on five sub-indices, namely infrastructure, waste, health, education, and city product. It is useful because it provides a snapshot of how cities are doing with respect to each of these sub-indices.
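A composite index of this kind can be sketched in a few lines of Python. The text names only the five sub-indices; the equal weighting, the 0 to 100 scale, the `city_development_index` helper, and the scores below are assumptions made for illustration, not the author's calculation.

```python
SUB_INDICES = ("infrastructure", "waste", "health", "education", "city_product")


def city_development_index(scores):
    """Average the five sub-indices (equal weights are an assumption here)."""
    missing = [name for name in SUB_INDICES if name not in scores]
    if missing:
        raise ValueError(f"missing sub-indices: {missing}")
    return sum(scores[name] for name in SUB_INDICES) / len(SUB_INDICES)


# Made-up scores for a hypothetical city, each assumed to lie on a 0-100 scale.
example = {
    "infrastructure": 80.0,
    "waste": 60.0,
    "health": 70.0,
    "education": 90.0,
    "city_product": 50.0,
}
cdi = city_development_index(example)
```

Raising an error on a missing sub-index, rather than averaging whatever is available, keeps cities with incomplete data from appearing comparable to those with full coverage.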

For example, to calculate the CDI for Armenia [5], we collected budget data through web scraping (Figure 2). Our goal as data analysts is to arrange the insights from our data so that everyone who sees them can clearly understand their implications and how to act on them (Figure 1, Figure 2, Figure 5, Figure 6, Figure 7).

4 Summary

The code and figures presented here build on three conceptual foundations. The first is the lack of regional and city-level data. The second is the theoretical literature on the importance of city and regional data. The third is the premise that data should be visualized clearly.

Acknowledgement

I would like to thank Vahe Movsisyan for his great guidance, valuable discussions and constructive suggestions to improve this paper.

[4] The Nomenclature of Territorial Units for Statistics (NUTS; French: Nomenclature des unités territoriales statistiques) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail. The Nomenclature of Territorial Units for Statistics is instrumental in the European Union’s Structural Fund delivery mechanisms and for locating the area where goods and services subject to European public procurement legislation are to be delivered.

[5] See Avetisyan (2018), “Who cares about #citydevelopment: Determinants of city development.”

References

L. Arroyo Abad and K. Khalifa. What are stylized facts? Journal of Economic Methodology, 22(2):143–156, 2015.

S. Avetisyan. Who cares about #citydevelopment: Determinants of city development. Available at SSRN 3284707, 2018.

G. Bichler and S. Balchak. Address matching bias: Ignorance is not bliss. Policing: An International Journal of Police Strategies & Management, 30(1):32–60, 2007.

R. S. Bivand, E. J. Pebesma, and V. Gómez-Rubio. Applied spatial data analysis with R. Springer, 2008.

J. R. Bryson, R. A. Mulhall, M. Song, and R. Kenny. Urban assets and the financialisation fix: land tenure, renewal and path dependency in the city of Birmingham. Cambridge Journal of Regions, Economy and Society, 10(3):455–469, 2017.

C. Kleiber and A. Zeileis. Applied econometrics with R. Springer Science & Business Media, 2008.

M. Kudamatsu. GIS for credible identification strategies in economics research. CESifo Economic Studies, 64(2):327–338, 2018.

B. Rubenstein-Montano, J. Liebowitz, J. Buchwalter, D. McCaw, B. Newman, K. Rebeck, and T. K. M. M. Team. A systems thinking framework for knowledge management. Decision Support Systems, 31(1):5–16, 2001.

J. Swift, D. Goldberg, and J. Wilson. Geocoding best practices: review of eight commonly used geocoding systems. Los Angeles, CA: University of Southern California GIS Research Laboratory, 2008.

J. Wu, T. H. Funk, F. W. Lurmann, and A. M. Winer. Improving spatial accuracy of roadway networks and geocoded addresses. Transactions in GIS, 9(4):585–601, 2005.
