Automated Data Collection With R Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis

Page Count: 477

Automated Data
Collection with R
A Practical Guide to
Web Scraping and Text Mining

Simon Munzert | Christian Rubba | Peter Meißner | Dominic Nyhuis

Automated Data Collection with R
A Practical Guide to Web Scraping and
Text Mining
Simon Munzert
Department of Politics and Public Administration, University of Konstanz,
Germany

Christian Rubba
Department of Political Science, University of Zurich and National Center of
Competence in Research, Switzerland

Peter Meißner
Department of Politics and Public Administration, University of Konstanz,
Germany

Dominic Nyhuis
Department of Political Science, University of Mannheim, Germany

This edition first published 2015
© 2015 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for
permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and
product names used in this book are trade names, service marks, trademarks or registered trademarks of their
respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing
this book, they make no representations or warranties with respect to the accuracy or completeness of the contents
of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the
publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert
assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Munzert, Simon.
Automated data collection with R : a practical guide to web scraping and text mining / Simon Munzert, Christian
Rubba, Peter Meißner, Dominic Nyhuis.
pages cm
Summary: “This book provides a unified framework of web scraping and information extraction from text data
with R for the social sciences”– Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-83481-7 (hardback)
1. Data mining. 2. Automatic data collection systems. 3. Social sciences–Research–Data processing.
4. R (Computer program language) I. Title.
QA76.9.D343M865 2014
006.3′12–dc23
2014032266
A catalogue record for this book is available from the British Library.
ISBN: 9781118834817
Set in 10/12pt Times by Aptara Inc., New Delhi, India.

To my parents, for their unending support. Also, to Stefanie.
—Simon
To my parents, for their love and encouragement.
—Christian
To Kristin, Buddy, and Paul for love, regular walks, and a final deadline.
—Peter
Meiner Familie.
—Dominic

Contents

Preface

1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book

Part One A Primer on Web and Data Technologies

2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.2 The metadata tag <meta>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>

2.3.6 Heading tags <h1>, <h2>, <h3>, …
2.3.7 Listing content with