Development Manual
WebGrab+Plus
(WG++)
V1.1.1
Advanced XMLTV EPG Grabber
A program created by:
Jan van Straaten & Francis de Paemeleere
Website: www.webgrabplus.com
(Document revision 22/01/2016, reflects WebGrab+Plus version 1.1.1/56.13)
What’s in this document:
For everyone new to this program: read pages 5, 6, 7 and 8 (up to chapter 4.2) and Appendix B.
The rest of this document is for everyone who wants to develop a SiteIni file or simply wants to know more than the basics.
Document author: Jan van Straaten (jan_van_straaten@outlook.com)
Table of Contents:
1. Introduction
1.1 What it does, features
1.2 How to run, files and folders
1.3 Xmltv, Single - versus multiple - value xmltv elements
1.4 Robots exclusion standard check
2. The grabbing, show update process and update modes:
2.1 The show update process
2.2 The update modes
2.3 The Grabbed Site pages: index-page, detail-page and sub-detail-page
3. Configuration files:
3.1 WebGrab++.config.xml
3.2 MDB.config.xml
3.3 REX.config.xml
4. SiteIni file
4.1 SiteIni file Parts
4.2 The SiteIni file basics
4.2.1 scrubstrings
4.2.1.1 The 'separator strings' method
4.2.1.2 The 'regular expression method':
4.2.2 ElementNames:
4.2.3 Action specifiers
4.2.4 Types
4.2.4.1 type url
4.2.4.2 types single and multi:
4.2.4.3 type regex:
4.2.5 Arguments:
4.2.5.1 Argument includeblock and excludeblock:
4.2.5.2 Argument separator:
4.2.5.3 Argument max:
4.2.5.4 Arguments include and exclude:
4.2.5.5 Argument debug:
4.2.5.6 Dedicated Arguments:
4.2.6 String matching / wildcards
4.2.7 TimeZones
4.3 General Site dependent data:
4.4 Url builder
4.4.1 General URL settings
4.4.1.1 HTTP Headers, method GET, POST, POST-BACK and SOAP
4.4.1.2 argument preload
4.4.2 url_index
4.4.2.1 urldate format:
4.4.2.2 subpage format:
4.4.2.3 Full examples of the url_index specification:
4.4.3 other url elements
4.4.3.1 multiple subdetail pages
4.4.4 the FTP and File protocol
4.4.4.1 FTP
4.4.4.2 File
4.5 Elements
4.5.1 Non optional elements - elements needed by the program
4.5.2 Elements that are processed in a special way
4.5.2.1 Time elements
4.5.2.1.1 Times from the detail page
4.5.2.2 The others
4.5.3 Special elements (see APPENDIX E)
4.5.4 Read_only elements (see APPENDIX C)
4.5.5 XMLTV attributes
4.6 Operations:
4.6.1 Notes and examples of the effects of modify
4.6.1.1 The order of the actions and the argument scope
4.6.1.2 The use and effect of argument scope
4.6.1.3 Multiple value elements and modify
4.6.1.4 Expression-1 with indices
4.6.1.5 Expression-1 with 'regular expressions'
4.6.2 Conditional arguments
4.6.2.1 Pre-Conditional arguments
4.6.2.2 Post-Conditional arguments
4.6.3 Loops
4.6.3.1 Conditional loop:
4.6.3.2 Each – Loop:
4.6.4 The modify commands.
4.6.4.1 Replace
4.6.4.2 Remove
4.6.4.3 Substring
4.6.4.4 Addstart and Addend
4.6.4.5 Calculate
4.6.4.5.1 # Count:
4.6.4.5.2 @ Index-of:
4.6.4.5.3 Date and time calculations
4.6.4.5.4 Bitwise Calculations
4.6.4.6 Cleanup
4.6.4.6.1 Cleanup with argument removeduplicates
4.6.4.6.2 Cleanup with argument tags.
4.6.4.7 Clear
4.6.4.8 Select
4.6.4.9 Sort
4.6.4.10 Set
4.6.5 Examples of operations
5. Special Procedures and Tricks
5.1 Special procedures
5.1.1 How to configure a SiteIni file for a site using the POST Http protocol
5.1.2 How to configure a SiteIni file for a site using the POST_BACK Http protocol
5.1.3 How to configure a SiteIni file for a site using the SOAP http protocol
5.2 Tricks
6. MDBIni file
6.1 Introduction
6.2 MDB Elements
6.2.1 Variables in URL element values
6.3 Differences between MDBIni and SiteIni syntax.
6.3.1 Element prefix.
6.3.2 Argument urlencode
6.4 Series episode details
7. How to develop a new SiteIni file
7.1 Preparation
7.2 SiteIniIDE
7.3 Development steps
APPENDIX A WebGrab+Plus Features
APPENDIX B Example config files:
WebGrab++.config.xml file
mdb.config.xml file
rex.config.xml file
APPENDIX C Read-only elements
APPENDIX D Site Dependent settings
APPENDIX E Element names
WebGrab+Plus, an advanced XMLTV EPG Grabber
1. Introduction
Besides this manual, www.webgrabplus.com/documentation provides additional documentation on various topics not
covered here.
1.1 What it does, features
The program grabs EPG data from TV Guide internet sites. It:
runs on WINDOWS, LINUX and OSX
can grab from multiple sites in one run, programmable by the user through a SiteIni file
is very fast through incremental grabbing (only changed and new shows are grabbed)
is programmable through editing commands that enable changing, filtering, adding, moving, removing (parts of) and
calculating the xmltv elements
comes with regular updates, support, documentation, user guides and a vast collection of SiteIni files, available on
webgrabplus.com
For a full list of features see APPENDIX A
1.2 How to run, files and folders
For WINDOWS, an installation package is provided that creates the default home-folder
C:\ProgramData\ServerCare\WebGrab and fills it with all the necessary files and sub-folders. The program can be run
by double-clicking the (also provided) icon, or by running the executable which is located in the (x86)
C:\ProgramFiles.
LINUX users and users that prefer another home-folder must copy all the required files and folders to it manually. In
this non-standard environment the program must be run in command-line mode, specifying the path of the
home-folder as a command-line parameter. A simple user guide is provided for this situation. Regular upgrades and
beta versions are also available at the program's website download page http://webgrabplus.com/download
Detailed guidelines for the various types of installation and use of the program are available online in the
documentation pages of www.webgrabplus.com.
1.3 Xmltv, Single - versus multiple - value xmltv elements
For an overview of the xmltv elements supported see APPENDIX E, column: xmltv name.
According to the xmltv specification, some elements can have more than one value in the xmltv file. We distinguish
single value xmltv elements (e.g. description) and multiple value xmltv elements (e.g. category, actor).
WebGrab+Plus treats them differently. (for examples see 4.2.4 Types)
Note that the element 'title' is a single value element but the program supports a second version of the same title
(titleoriginal) with different 'lang=' attributes. (See 4.2.5.6 argument lang)
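The difference between single and multiple value elements can be illustrated with a small Python sketch. It only shows
how the values end up in the xmltv output (one desc element, several category elements); it is not how WebGrab+Plus
is implemented. The function name add_element and the choice to raise an error on a second description are
assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# xmltv names of the example elements mentioned above
# (the xmltv name of 'description' is 'desc').
SINGLE_VALUE = {"desc"}
MULTI_VALUE = {"category"}


def add_element(programme: ET.Element, name: str, values: list[str]) -> None:
    """Append one xmltv child element per value (hypothetical helper)."""
    if name in SINGLE_VALUE and len(values) > 1:
        # Assumption of this sketch: refuse a second value for a single value element.
        raise ValueError(f"'{name}' is a single value element")
    for value in values:
        ET.SubElement(programme, name).text = value


programme = ET.Element("programme")
add_element(programme, "desc", ["Amerikaanse actiefilm."])
add_element(programme, "category", ["speelfilm", "sequel"])
print(ET.tostring(programme, encoding="unicode"))
```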
1.4 Robots exclusion standard check
A quote from http://www.robotstxt.org/orig.html to explain:
quote/
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by
recursively retrieving linked pages. The ‘Robots exclusion standard’ is a common facility the majority of robot authors
offer the WWW community to protect WWW server against unwanted accesses by their robots.
/end quote
Following this definition of WWW Robots, WebGrab+Plus is such a program. Therefore, it obeys the methods and
rules of this standard in that it displays a warning to the user if a site disallows access to pages that the program
wants to grab from.
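As an illustration of such a check, the sketch below uses the robots.txt parser from the Python standard library to
find the pages a site disallows and prints a warning for each. It is not the code WebGrab+Plus uses; the robots.txt
content, the user agent name and the URLs are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def disallowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and page URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

blocked = disallowed_urls(
    SAMPLE_ROBOTS,
    "ExampleGrabber",
    [
        "http://example.com/guide/today.html",
        "http://example.com/private/detail.html",
    ],
)
for url in blocked:
    print(f"Warning: robots.txt disallows access to {url}")
```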
2. The grabbing, show update process and update modes:
2.1 The show update process
Assuming a previous xmltv listing exists (e.g. yesterday's), the program reads it and stores it as the target for the
update and as a reference for which shows have to be changed or added. If no xmltv listing exists, the program creates
a new one. Before grabbing show details, the program determines whether each existing show in the xmltv listing is
still valid or needs an update. For that it connects to the TV Guide website and grabs the so-called index pages (the
html pages that contain an overview of the scheduled shows per timespan, e.g. a day or several days). It then compares
the shows listed there (channel, start and stop times and title) with the shows in the existing xmltv listing. This
comparison results in one of the following situations:
same (.), no update. The show in the index page is considered the same as the one in the existing xmltv listing.
changed (c), update. The index show is different from the xmltv show but they have overlapping or equal time
span.
gap (g), insert. The index show fits in a time gap of the xmltv listing.
new (n), add. The index show is new; it will be added to the end (or to the beginning if that is the case) of the
xmltv listing.
repair (r), update. This is a special situation that occurs if errors or overlapping shows are detected in the xmltv
listing. The program tries to solve this by removing and updating.
When the program runs, these resulting situations for each show are printed in the command window like this (the iiii
indicates 4 days of index pages downloaded):
iiii...............g............ccc........c.c.......g....r.....nnnnnnnnnnnnnnnnnnnnnnn
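To read such a progress line more easily, the characters can be tallied. The Python sketch below is a reader's aid
written for this manual, not part of WebGrab+Plus; the function name summarize is hypothetical and the labels follow
the list of situations above.

```python
from collections import Counter

# Meaning of each progress character, following the list of situations above.
STATUS = {
    "i": "index page downloaded",
    ".": "same (no update)",
    "c": "changed (update)",
    "g": "gap (insert)",
    "n": "new (add)",
    "r": "repair (update)",
}


def summarize(progress: str) -> dict[str, int]:
    """Count how often each known progress character occurs in the line."""
    counts = Counter(progress)
    return {STATUS[ch]: counts[ch] for ch in STATUS if counts[ch]}


line = "iiii...............g............ccc........c.c.......g....r.....nnnnnnnnnnnnnnnnnnnnnnn"
for label, number in summarize(line).items():
    print(f"{label}: {number}")
```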
The comparison of the show title in the index page (index_title) with the one in the xmltv file is rather complicated
and tricky, because the index_title frequently differs to a certain extent from the title in the show detail page.
Differences can be due to abbreviation of long titles, different use of punctuation characters and combination of the
title with other elements in the index_title (like category and subtitle). The program deals with all those differences
through a weighted comparison. The result of this comparison is a 'title match factor', which, roughly, is the biggest
percentage of 'matching' words between the two titles in any of the elements of the index_title. If this title match
factor is less than the value set for it in the SiteIni file (see 4.3), the show is considered not the same and a show
update is started. For that update the program grabs the show details from the show detail html page(s) (see 2.3) of
the TV Guide website, if the site provides them.
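The idea of a word based match percentage can be sketched in Python as follows. This is a strongly simplified
illustration: the actual weighted comparison in WebGrab+Plus is not described here, and the function names, the way
words are split and the threshold value REQUIRED_FACTOR are all assumptions of this sketch.

```python
import re


def words(text: str) -> list[str]:
    """Lower-case the text and split it into words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def title_match_factor(index_title: str, xmltv_title: str) -> int:
    """Percentage of the words of the shorter title that also occur in the other title."""
    a, b = words(index_title), words(xmltv_title)
    if not a or not b:
        return 0
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    matching = sum(1 for word in shorter if word in longer)
    return round(100 * matching / len(shorter))


REQUIRED_FACTOR = 60  # hypothetical threshold, standing in for the SiteIni value

factor = title_match_factor("Basilisk - Serpent King (film)", "Basilisk: Serpent King")
# A factor below the threshold means 'not the same', so an update is started.
print(factor, "same" if factor >= REQUIRED_FACTOR else "update")
```

Punctuation and the added word in the index title do not lower the factor here, which is the kind of tolerance the
text describes.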
2.2 The update modes
The program supports a variety of update modes. The preferred and most efficient is:
'incremental' (i) Works as described above for all shows in the index page. In this mode the download time is
kept to a minimum.
Other update modes are:
'light' (l) which is incremental but forces a re-grab of all shows for 'today',
'smart' (s) is the same with a forced re-grab for today and tomorrow,
'full' (f) not incremental, forces a full re-grab of all days requested.
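The four update modes differ only in which days are re-grabbed regardless of the comparison result. The Python
sketch below expresses that difference; it assumes that the first requested day is 'today', and the function name
forced_regrab_days is made up for this illustration.

```python
from datetime import date, timedelta


def forced_regrab_days(mode: str, first_day: date, timespan_days: int) -> list[date]:
    """Days whose shows are re-grabbed regardless of the index comparison."""
    requested = [first_day + timedelta(days=n) for n in range(timespan_days)]
    if mode == "i":  # incremental: nothing is forced
        return []
    if mode == "l":  # light: today
        return requested[:1]
    if mode == "s":  # smart: today and tomorrow
        return requested[:2]
    if mode == "f":  # full: all requested days
        return requested
    raise ValueError(f"unknown update mode: {mode!r}")


start = date(2016, 1, 22)
for mode in "ilsf":
    print(mode, [d.isoformat() for d in forced_regrab_days(mode, start, 4)])
```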
Index-only mode:
Besides, and independent from, the modes mentioned above, there is a special grabbing mode:
'index-only', which is automatically selected by the program if no elements need to be scrubbed from the show
detail page (see also 4.5). This mode is 'superfast' but seldom useful because most sites provide very little show
data on the index page. But if you are satisfied with just start and stop times and a title, it's there. Occasionally
there is a site with richer data on the index page (like tvguide.co.uk). Some sites list details only on the index
page, or provide more detailed information on detail pages for some shows only. The program automatically
recognizes these cases.
2.3 The Grabbed Site pages: index-page, detail-page and sub-detail-page
As explained in 2.1, the update process, the program starts by grabbing the index-page to get an overview of the
shows for the time period for which epg data is requested. Depending on the outcome of the update decision and on
the availability of detail pages, the program grabs detailed show epg data from the show detail html page. Some sites
split the epg data into sub-detail pages. The program supports additional grabbing from one or more of such
sub-detail pages. (see 4.5 and appendix E)
3. Configuration files:
3.1 WebGrab++.config.xml
This file supplies all settings for WebGrab+Plus that are independent of the TV Guide website. Among them are:
filename, the path and name of the xmltv output file
update mode, as discussed in 2.2
timespan, the number of days to grab
and, most important, a
list of channels to grab. Each channel for which epg data in the xmltv listing is requested needs to be added to
this channel list. The channel data in this list consists of the update mode (see 2.2), the site to get the data from
(see 4), the site-id (the channel id of the site, see 4.4.2), the xmltv_id (the id by which xmltv recognises the
channel) and the channel display name.
Besides these, there are several other settings, like mode, postprocess, proxy, user-agent, logging, credentials, retry
and skip.
A typical WebGrab++.config.xml file is listed in APPENDIX B. It also provides the explanation of all the settings. The
file is self-explanatory. For detailed configuration instructions see
http://www.webgrabplus.com/documentation/configuration
3.2 MDB.config.xml
The MDB postprocessor of WebGrab+Plus, which is available from Version 1.1.0 onwards, automatically adds movie
and series details from online 'MDB' sites (e.g. IMDb.com) to the xmltv file created by the basic WebGrab+Plus EPG
frontend grabber. It has its own configuration file, which resides in the subfolder \mdb of the program’s home-folder.
This mdb.config.xml file also serves as the mdb configuration user guide. An example of it is also listed in APPENDIX B.
For detailed configuration instructions see http://www.webgrabplus.com/documentation/configuration-mdb
3.3 REX.config.xml
The purpose of this postprocessor is to re-arrange and edit the xmltv file created by the grabber section of
WebGrab+Plus. This can be useful or necessary if the EPG viewer of the PVR/Media-Centre used, or the xmltv
importer it uses, does not support all the xmltv elements in the xmltv file created by WG++.
It can:
Move the content of xmltv elements to other xmltv elements
Merge the content of several xmltv elements
Add comments/prefix/postfix text
Remove or create xmltv elements
E.g.: If the PVR doesn't support import of credit elements (actors, directors etc.), it can add their content to the
description and remove the original credit elements, which are then useless. Or it can move the episode data to the
beginning or end of the subtitle element, etc.
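The effect of the first example, moving the actors into the description, can be shown with a Python stand-in. This is
not REX configuration syntax (that is documented in rex.config.xml itself); it only demonstrates the transformation on
one xmltv programme. The sample programme data, the 'Actors:' prefix and the function name are assumptions of this
sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical xmltv programme, for illustration only.
SAMPLE = """<programme start="20160122203000 +0100" channel="RTL 7">
  <title>Basilisk: Serpent King</title>
  <desc>Amerikaanse actiefilm.</desc>
  <credits>
    <director>Louie Myman</director>
    <actor>Jeremy London</actor>
    <actor>Wendy Carter</actor>
  </credits>
</programme>"""


def credits_to_description(programme: ET.Element) -> None:
    """Append the actor names to <desc> and remove the <credits> element."""
    credits = programme.find("credits")
    desc = programme.find("desc")
    if credits is None or desc is None:
        return
    actors = [actor.text for actor in credits.findall("actor") if actor.text]
    if actors:
        desc.text = f"{desc.text or ''} Actors: {', '.join(actors)}".strip()
    programme.remove(credits)


programme = ET.fromstring(SAMPLE)
credits_to_description(programme)
print(ET.tostring(programme, encoding="unicode"))
```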
It has its own configuration file, which resides in the subfolder \rex of the program's home-folder. This
rex.config.xml file also serves as the rex configuration user guide. An example of it is also listed in APPENDIX B.
4. SiteIni file
For each TV Guide website that is entered in the channel list of the config file (see above), a SiteIni file is required
to supply WebGrab+Plus with site dependent settings. The name of this file is the value of the site attribute in the
channel list with .ini added to it (e.g. channel list site attribute: tvgids.nl, SiteIni file name: tvgids.nl.ini).
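The naming rule can be sketched in Python. The channel entry below is a hypothetical example; its attribute names
(update, site, site_id, xmltv_id) are assumptions based on the channel data described in 3.1, and the example config
file in APPENDIX B remains the reference for the real layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical channel entry; the attribute names are assumptions of this sketch.
CHANNEL = (
    '<channel update="i" site="tvgids.nl" site_id="1" '
    'xmltv_id="Nederland 1">Nederland 1</channel>'
)


def siteini_name(channel_xml: str) -> str:
    """Derive the SiteIni file name from the site attribute of a channel entry."""
    channel = ET.fromstring(channel_xml)
    return channel.attrib["site"] + ".ini"


print(siteini_name(CHANNEL))  # tvgids.nl.ini
```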
4.1 SiteIni file Parts
The data in this file consists of the following parts:
A top header section that contains meta data like the site, the required WG++ version, revision number, date and
author and any remarks.
General Site dependent data (see 4.3)
Data that WebGrab+Plus needs to compose the url's to download pages (see 4.4)
Data that WebGrab+Plus needs to scrub xmltv elements from the downloaded pages (see 4.5)
Optional data that allows post modification of the scrubbed xmltv elements (see 4.6)
A channel file creation part (see 4.5.3)
4.2 The SiteIni file basics
4.2.1 scrubstrings
A scrubstring is just one line in the SiteIni that specifies an action for a SiteIni element. The general format is
Elementname.action {type(arguments)|datastrings required for the action}
Most of the settings in this file relate to how WebGrab+Plus extracts (“scrubs”) xmltv elements from the TV Guide
website html pages. The program supports two methods for that: The ‘separator strings method’ (described in
4.2.1.1), by means of element separator strings pointing to the start and end of the element to be scrubbed and the
‘regular expression method’ (described in 4.2.1.2), by which the element to be scrubbed is extracted by means of a
‘regular expression’.
Both methods can be mixed in one SiteIni file and both cover more or less the same functionality. The
‘separator strings method’ is the easiest to understand and is recommended if you are not familiar with ‘regular expressions’.
The ‘regular expression method’ can be considered as the ‘expert’ method and is extremely powerful and compact.
4.2.1.1 The 'separator strings' method
To locate the element to scrub, this method uses (up to) 4 strings that point to its beginning and end. The first two are
the element start es and the element end ee string.
They represent the unique strings (e.g. html tags or parts of them) between which the required element is always
located on the html page. In most cases such unique es and ee are unavailable because somewhere else in the
html page the same strings exist enclosing other data. In that case we need to separate the right es and ee pairs
from the unwanted pairs.
For that we use the block separators:
block start bs and block end be .
These should enclose a html region (block) in which es and ee enclose our wanted element and nothing else.
Consider the following sample html:
Basilisk: Serpent King
RTL 7
Amerikaanse actiefilm. Een team van archeologen ontwaakt een mythische slang die vernieling zaait. De enige
manier om het wezen te stoppen is door een magische scepter te vinden.
- Genre:
- speelfilm
- Genre:
- sequel
- Subgenre:
- avontuur
- Duur:
- 90 min
- Regie:
- Louie Myman
- Met:
- Jeremy London, Wendy Carter, Griff Furst, Cleavant Derricks, Daniel Ponsky, Bashar
Rahal
To scrub the title - Basilisk: Serpent King- we need es= and ee=
. In fact if it is sure that tag is
uniquely used to enclose the title we wouldn't need more than that. However even if that is the case on this (part of)
html page, simple html tags like are seldom unique and thus it is more secure to use the block separators bs=
and be = and ee= . Very likely this
es is unique for the description, so we wouldn't need block separators.
Strings like bs, es, ee and be will be called separatorstrings in the remainder of this document.
The syntax in which the SiteIni file expects them is :
{type(optional arguments)|bs|optional es|ee|optional be}
or:
{type(optional arguments)|bs|optional es|optional ee|be}
To complete a SiteIni scrubstring we need to add the xmltv element name and an action specifier :
ElementName.ActionSpecifier {type(optional arguments)|separatorstrings}
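The general format above can be made concrete with a small parser. The following Python sketch is an illustration only: the function name parse_scrubstring and the regular expression are assumptions made for this example and are not part of WebGrab+Plus. It ignores escaped characters and trailing * comments.

```python
import re

# Hypothetical helper: splits one scrubstring line into its visible parts.
# It mirrors only the general format quoted above, not WebGrab+Plus's own parser.
SCRUBSTRING = re.compile(
    r"^(?P<element>\w+)\.(?P<action>\w+)\s*"  # ElementName.ActionSpecifier
    r"\{(?P<type>\w+)\s*"                      # type
    r"(?:\((?P<args>.*?)\))?"                  # optional (arguments)
    r"\|(?P<fields>.*)\}\s*$",                 # separatorstrings
    re.DOTALL,
)


def parse_scrubstring(line):
    """Return the parts of a scrubstring as a dictionary."""
    m = SCRUBSTRING.match(line.strip())
    if m is None:
        raise ValueError("not a scrubstring: %r" % line)
    return {
        "element": m.group("element"),
        "action": m.group("action"),
        "type": m.group("type"),
        "arguments": m.group("args") or "",
        # The | character separates bs, es, ee and be.
        "fields": m.group("fields").split("|"),
    }


parsed = parse_scrubstring('actor.scrub {single(separator=", " max=3)|Met: || |}')
print(parsed)
```

The four entries of "fields" correspond to bs, es, ee and be in that order; an empty entry means the separatorstring was left out.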
The scrubstrings for the two examples above, description and title respectively:
description.scrub {single|||||
|
Amerikaanse actiefilm.
Een team van archeologen ontwaakt een mythische slang die vernieling zaait. De enige manier om het
wezen te stoppen is door een magische scepter te vinden.
Geproduceerd in 1998
In such a case we use type multi to instruct WebGrab+Plus to scrub all the elements within the block with the
specified element separators, like here es = and ee =
To illustrate the scrub results from this html with type single:
description.scrub {single|||
|Amerikaanse actiefilm.
While the same with type multi :
description.scrub {multi|||
|Amerikaanse actiefilm. Een team van archeologen ontwaakt een mythische slang die
vernieling zaait. De enige manier om het wezen te stoppen is door een magische scepter te vinden. Geproduceerd in
1998
Notice that WebGrab+Plus adds the three description paragraphs together. This is due to the fact that the element
description is a single value xmltv element. (see 1 and 4.2.2)
To illustrate what happens with a multiple value xmltv element, consider the category.
In the html the genre and subgenre are the obvious choice for that. Xmltv doesn't specify a subgenre element, so we
take them all together as category:
- Genre:
- speelfilm
- Genre:
- sequel
- Subgenre:
- avontuur
- Duur:
- 90 min
There are two genre entries in the html, with the same element separators, so we use type multi to grab them both.
category.scrub {multi||Genre: | |}
The result will be the following xmltv listing for category:
speelfilm
sequel
Because category is a multiple value xmltv element the two are not joined to one xmltv element but listed as separate
category elements.
To add the third category element, the Subgenre in the html, we use another feature of the SiteIni specification: for
most SiteIni elements it is allowed to use more than one scrubstring for the same xmltv element! (see APPENDIX
E, column -multiple scrub-)
So we add:
category.scrub {single||Subgenre: | |}
The final result:
speelfilm
sequel
avontuur
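The difference between type single and type multi can be tried out with a few lines of Python. This is a minimal sketch of the idea as described in this chapter, not the program's actual algorithm. The function scrub and the html in PAGE are hypothetical; the tags in particular are invented for the example.

```python
def scrub(page, bs="", es="", ee="", be="", multi=False):
    """Sketch of the 'separator strings' idea; an empty string means 'not given'."""
    # Step 1: cut out the block enclosed by bs and be (the whole page if omitted).
    block = page
    if bs:
        pos = block.find(bs)
        if pos < 0:
            return []
        block = block[pos + len(bs):]
    if be:
        pos = block.find(be)
        if pos >= 0:
            block = block[:pos]
    # Step 2: collect the text between es and ee inside that block.
    found = []
    cursor = 0
    while True:
        if es:
            begin = block.find(es, cursor)
            if begin < 0:
                break
            begin += len(es)
        else:
            begin = cursor
        if ee:
            end = block.find(ee, begin)
            if end < 0:
                break
        else:
            end = len(block)
        found.append(block[begin:end].strip())
        cursor = end + len(ee)
        # Type single stops after the first element; multi keeps going.
        if not multi or not es or not ee:
            break
    return found


# Hypothetical html, written only for this example.
PAGE = (
    "<dl>"
    "<dt>Genre:</dt><dd>speelfilm</dd>"
    "<dt>Genre:</dt><dd>sequel</dd>"
    "<dt>Subgenre:</dt><dd>avontuur</dd>"
    "<dt>Duur:</dt><dd>90 min</dd>"
    "</dl>"
)

# Two scrubstrings for the same element: one multi, one single.
categories = scrub(PAGE, es="<dt>Genre:</dt><dd>", ee="</dd>", multi=True)
categories += scrub(PAGE, es="<dt>Subgenre:</dt><dd>", ee="</dd>")
print(categories)
```

The first call plays the role of the multi scrubstring for Genre, the second that of the single scrubstring for Subgenre; together they give the three category values.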
4.2.4.3 type regex :
Used to specify the ‘regex’ method of data extraction. See also 4.2.1.2 for some background information. This method
doesn’t need the type single and multi distinction as is explained below.
The syntax:
Element.scrub {regex(optional argument)||regular expression||}
regex : the type specifier for this method
argument : for this method the only arguments supported are debug and pattern (see arguments 4.2.5.5)
regular expression: The regular expression that matches the desired element content.
The place of the regular expression in the scrubstring is the same as the 'element start - es' in the separator
string method (see 4.2.1.1, syntax) or, simply put, it has || in front of it and || after it.
The easiest way to get started with this method (after mastering the ‘separator string’ method) is to use a direct
substitution of the separator strings bs, es, ee and be used there.
Remember the syntax for the separator string method (see 4.2.4.2 type single and multi explanation). For type single:
Element.scrub {single (arguments)|bs|es|ee|be}
A direct ‘regex’ substitute for it will be:
Element.scrub {regex (arguments)||bs(?:.*)es(.*?)ee(?:.*?)be||}
Examples: The ‘separator string method’ title solution of the previous chapter:
title.scrub {single|}
description.scrub {regex||(.*?)||}
For type multi the substitution is as follows:
Element.scrub {multi(arguments)|bs|es|ee|be}
Will look like this in regex:
Element.scrub {regex(arguments)||bs(?:.*)(?:es(.*?)ee(?:.*?))*be||}
Example: the category of chapter 4.2.4.2
category.scrub {multi||Genre: | |}
category.scrub {regex||(?:.*?)(?:Genre: (.*?) (?:.*?))*||}
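The substitution for type single can be checked outside WebGrab+Plus. The sketch below uses Python's re module, which may behave differently from the regex engine inside the program. The function single_as_regex, the html in PAGE and the use of re.escape are assumptions made for this example.

```python
import re


def single_as_regex(bs, es, ee, be):
    """Build the pattern bs(?:.*)es(.*?)ee(?:.*?)be from four separatorstrings."""
    # re.escape keeps html characters literal; this escaping is our own addition.
    escaped = tuple(re.escape(s) for s in (bs, es, ee, be))
    return "%s(?:.*)%s(.*?)%s(?:.*?)%s" % escaped


# Hypothetical html, written only for this example.
PAGE = '<div class="info"><h1>Basilisk: Serpent King</h1><p>RTL 7</p></div>'

pattern = single_as_regex('<div class="info">', "<h1>", "</h1>", "</div>")
match = re.search(pattern, PAGE, re.DOTALL)
title = match.group(1) if match else None
print(title)
```

The type multi substitution cannot be tried the same way in Python, because Python's re keeps only the last repetition of a repeated capture group.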
4.2.5 Arguments:
Arguments can be any combination of includeblock, excludeblock, separator, max, include, exclude, debug
and the dedicated arguments lang, force, pattern, sort, timespan, preload, alloc, target.
!! All these arguments are irrelevant for type regex, with the exception of debug and pattern !
4.2.5.1 Argument includeblock and excludeblock :
If it is only possible to find blocks that, apart from the required information, contain unwanted information with the
same element separators es and ee , these arguments can be used to select the correct blocks. The syntax:
includeblock=bn1,bn2, .. ,bnn/tn -or- "string-1""string-2" .. "string-n"
excludeblock=bn1,bn2, .. ,bnn/tn -or- "string-1""string-2" .. "string-n"
bn , the block number to include or exclude, starting with 1
tn , the number of blocks for which the block numbers bn repeat
"string" , include or exclude only the blocks that contain the "string". When more than one "string" is entered, the
block selection is done by an 'or' function of the strings. The use of wildcards [x] and [?] is allowed (see 4.2.6)
Example : includeblock="abc""def" , the blocks included contain the string "abc" or "def" .
When more than one "string" is entered separated by the char & , the block selection is done by an 'and' function.
Example : includeblock="abc"&"def" , the blocks included contain the string "abc" and "def".
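The block selection rules can be sketched in Python as follows. This is our reading of the syntax above, in particular of the /tn part, and not the program's own code. The function names and the sample blocks are hypothetical, and wildcards are not handled.

```python
def select_by_number(blocks, numbers, repeat=None, include=True):
    """Keep (include) or drop (exclude) blocks by their 1-based block number.

    With repeat (the tn value) the numbers apply again to every next group
    of tn blocks; this is an interpretation of the bn1,bn2/tn syntax.
    """
    kept = []
    for index, block in enumerate(blocks, start=1):
        position = (index - 1) % repeat + 1 if repeat else index
        if (position in numbers) == include:
            kept.append(block)
    return kept


def select_by_strings(blocks, strings, mode="or", include=True):
    """Keep or drop blocks that contain any ('or') or all ('and') of the strings."""
    test = any if mode == "or" else all
    return [b for b in blocks if test(s in b for s in strings) == include]


# Hypothetical blocks, as they might come out of the bs/be split.
blocks = ["abc 1", "def 2", "abc def 3", "xyz 4"]

either = select_by_strings(blocks, ["abc", "def"])             # "abc""def"
both = select_by_strings(blocks, ["abc", "def"], mode="and")   # "abc"&"def"
odd = select_by_number(blocks, {1}, repeat=2)                  # 1/2
print(either, both, odd)
```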
All characters are allowed.
The characters " ' { and ) need to be preceded by \ . So the string ("O'Neil {superhero}") must be entered as
"\(\"O\'Neil \{superhero}\"\)"
4.2.5.2 Argument separator :
As example take a look at the actors :
Regie: Louie Myman
Met: Jeremy London, Wendy Carter, Griff Furst, Cleavant Derricks, Daniel Ponsky, Bashar
Rahal
If we use : actor.scrub {single|Met: || |} the xmltv listing of actor will be
Jeremy London, Wendy Carter, Griff Furst, Cleavant Derricks, Daniel Ponsky, Bashar Rahal
That is clearly not what we want. To separate them we use the separator argument. It specifies which string or
strings separates the elements. Its syntax is:
separator="string-1" "string-2" .. "string-n"
Between the separator strings a space is allowed but not required.
All characters are allowed with the exception of | (vertical line). This is no limitation of this function because the
program automatically replaces all | characters in the html page with the character combination !?!?!, to
avoid problems with the special function of this character.
The characters " ' { and ) need to be preceded by \ . So the string ("O'Neil") must be entered as
separator="\(\"O\'Neil\"\)"
The scrubstring for actor then becomes:
actor.scrub {single(separator=", ")|Met: || |}
and the resulting xmltv listing:
Jeremy London
Wendy Carter
Griff Furst
Cleavant Derricks
Daniel Ponsky
Bashar Rahal
Suppose the html line with the actors looked like this:
Met: Jeremy London, Wendy Carter, Griff Furst, Cleavant Derricks, Daniel Ponsky and Bashar
Rahal
(The last two actors separated by the word - and - ) We then can use separator=", " " and " for the same result.
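The effect of a list of separator strings can be reproduced with a short Python sketch. The function split_elements is a hypothetical name and the code only imitates the behaviour described above; it is not how WebGrab+Plus implements the argument.

```python
import re


def split_elements(value, separators):
    """Split one scrubbed value on any of the given separator strings."""
    pattern = "|".join(re.escape(s) for s in separators)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


scrubbed = (
    "Jeremy London, Wendy Carter, Griff Furst, Cleavant Derricks, "
    "Daniel Ponsky and Bashar Rahal"
)
# Corresponds to separator=", " " and "
actors = split_elements(scrubbed, [", ", " and "])
print(actors)
```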
4.2.5.3 Argument max :
To limit the number of elements (either added together in the case of single value xmltv elements or listed separately
in the case of multiple value xmltv elements) we can use the argument max. Its syntax:
max=n
in which n=positive integer
actor.scrub {single(separator=", " max=3)|Met: || |} will result in:
Jeremy London
Wendy Carter
Griff Furst
4.2.5.4 Arguments include and exclude :
These allow further control over which of the scrubbed elements will be passed to the final result. It is important to
realise that both include and exclude can be used together in one scrubstring. The program will execute these in the
order in which they occur in this specification. See 5 for an example of this effect.
Its syntax:
include=n -or- first -or- firstn -or- last -or- lastn -or- "string"
exclude=n -or- first -or- firstn -or- last -or- lastn -or- "string"
n the element number to include or exclude, starting with 1
first or firstn (like first2) , the first or the first n elements to include or exclude
last or lastn (like last2) , the last or the last n elements to include or exclude
"string" , like "met o.m.", include or exclude only elements containing the "string". The use of wildcards [x] and
[?] (see 4.2.6) is supported.
All characters are allowed with the exception of | (vertical line). This is no limitation of this function because the
program automatically replaces all | characters in the html page with the character combination !?!?! , to
avoid problems with the special function of this character.
The characters " ' { and ) need to be preceded by \ . So the string ("O'Neil") must be entered as "\(\"O\'Neil\"\)"
As with the argument separator (see 4.2.5.2) a list of strings is allowed like:
include="string-1" "string-2" .. "string-n"
The effect of these arguments differs depending on whether they are entered in combination with, and after, the argument
separator (case A) or not (case B).
Case A (after the argument separator):
In this case they allow us to select the elements we want after they have been separated.
As an example we use the following html for a title and sub-title combination that occurs frequently:
Motociclismo: Cto. del Mundo
Here, the title Motociclismo, is separated from the sub-title Cto. del Mundo with a : character.
So we can use the arguments separator=": " to separate them , we then use include=first for the title and
exclude=first for the sub-title, like this:
title.scrub {single(separator=": " include=first)|||
|}
subtitle.scrub {single(separator=": " exclude=first)|||
|}
The xmltv result :
Motociclismo
Cto. del Mundo
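Case A can be imitated in Python to see how the positional values n, first, firstn, last and lastn select elements after the split. This is a sketch under our own assumptions: the functions positions and apply_rule are hypothetical and the "string" form of include and exclude is left out.

```python
import re


def positions(rule, count):
    """Translate n, first, firstN, last or lastN into 0-based indexes."""
    m = re.fullmatch(r"(first|last)(\d*)", rule)
    if m:
        n = int(m.group(2) or 1)
        if m.group(1) == "first":
            return set(range(n))
        return set(range(count - n, count))
    return {int(rule) - 1}  # a plain element number, starting with 1


def apply_rule(parts, rule, keep):
    """keep=True imitates include=rule, keep=False imitates exclude=rule."""
    hit = positions(rule, len(parts))
    return [p for i, p in enumerate(parts) if (i in hit) == keep]


# separator=": " applied to the scrubbed text
parts = "Motociclismo: Cto. del Mundo".split(": ")

title = apply_rule(parts, "first", keep=True)      # include=first
subtitle = apply_rule(parts, "first", keep=False)  # exclude=first
print(title, subtitle)
```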
Case B (not after the argument separator ):
The program will evaluate all the scrubbed elements (single or multi) on the conditions specified by the include
and/or exclude values.
As example we use the description again:
Amerikaanse actiefilm.
Een team van archeologen ontwaakt een mythische slang die vernieling zaait. De enige manier om het
wezen te stoppen is door een magische scepter te vinden.
Geproduceerd in 1998
Remember the original scrubstring:
description.scrub {multi|||
|Amerikaanse actiefilm. Een team van archeologen ontwaakt een mythische slang die
vernieling zaait. De enige manier om het wezen te stoppen is door een magische scepter te vinden.
Geproduceerd in 1998
But, the last element - Geproduceerd in 1998 - actually belongs to another xmltv element - date - which is meant
to contain the date of production. So in fact it shouldn't be part of the description. We can use the following to
exclude it from the description:
description.scrub {multi(exclude="Geproduceerd")|||
|||
|||
|||
|||
|page not found
} * output for subsequent pages: 1 2 3 ..
subpage.format {number|section_|1|page not found} * output: section_1 section_2 section_3 ..
subpage.format {letter|p|a|page not found} * output: pa pb pc ...
subpage.format {list|04:00|12:00|20:00} * output: 04:00 12:00 20:00
subpage.format {list(format=D2 step=6 count=4)|0} * output: 00 06 12 18
4.4.2.3 Full examples of the url_index specification:
Suppose WebGrab++.config.xml channel entry:
| Genre | Film |
|---|---|
| Acteur | Kathy Bates, Jennifer Jason Leigh, Judy Parfitt, Christopher Plummer |
| Regisseur | Taylor Hackford |
||
|| || | }
That leaves us with a description that contains all the actors, which isn't perfect. However, they can be removed with the following:
description.modify{remove(null)|Met: '{single|
| }
mdb_temp_1.modify {select|'mdb_subtitle' ~} * select the one and only with the episode title
- Now get the mdb_episode_id:
mdb_episode_id.modify {substring(type=regex)|'mdb_temp_1' "(\d\{7\})/\">"}
With this mdb_episode_id grab the episode details with url_mdb’s like these for IMDb.com
url_mdb_p2.modify {addstart|http://www.imdb.com/title/tt'mdb_episode_id'} * the episode detail page
url_mdb_p3.modify {addstart|http://www.imdb.com/title/tt'mdb_episode_id'/synopsis?ref_=tt_stry_pl} * the full synopsis
url_mdb_p4.modify {addstart|http://www.imdb.com/title/tt'mdb_episode_id'/fullcredits?ref_=tt_ql_1} * full cast and crew
url_mdb_p5.modify {addstart|http://www.imdb.com/title/tt'mdb_episode_id'/plotsummary?ref_=tt_ql_5} * plot summary
url_mdb_p6.modify {addstart|http://www.imdb.com/title/tt'mdb_episode_id'/reviews?ref_=tt_ql_7} * user reviews
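What these lines do with the episode id can be followed step by step in Python. The sketch below only imitates the substring(type=regex) and addstart operations for illustration. The html fragment in mdb_temp_1 and the id in it are invented, and the dictionary TEMPLATES is our own construction.

```python
import re

# Hypothetical fragment of an episode list; only its shape matters here.
mdb_temp_1 = '<a href="/title/tt1234567/">Some episode title</a>'

# Python spelling of the pattern; the SiteIni line writes the braces as \{7\}.
found = re.search(r'(\d{7})/">', mdb_temp_1)
mdb_episode_id = found.group(1) if found else ""

# The url_mdb values from the text, with 'mdb_episode_id' as the placeholder.
TEMPLATES = {
    "url_mdb_p2": "http://www.imdb.com/title/tt'mdb_episode_id'",
    "url_mdb_p3": "http://www.imdb.com/title/tt'mdb_episode_id'/synopsis?ref_=tt_stry_pl",
    "url_mdb_p4": "http://www.imdb.com/title/tt'mdb_episode_id'/fullcredits?ref_=tt_ql_1",
}

# Replace the quoted element name by its value, as the program does for 'element' references.
urls = {
    name: template.replace("'mdb_episode_id'", mdb_episode_id)
    for name, template in TEMPLATES.items()
}
for name, url in urls.items():
    print(name, url)
```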
7. How to develop a new SiteIni file
7.1 Preparation
Familiarize yourself with the basics of the SiteIni as described in chapter 4. of this document.
To develop a new SiteIni file it is necessary to collect information about the url and the structure of the html pages of
the EPG of the site for which the SiteIni file is to be created. For that, use your internet browser and familiarize
yourself with the 'developer' tools that your browser provides. E.g.:
o In Microsoft IE or Edge this tool is supplied as standard. It can be activated with F12
o For FireFox an add-on 'Firebug' must be installed
o Chrome: https://developers.google.com/web/tools/chrome-devtools/
o Other tools: Fiddler Web Debugger, …
o More information about these developer tools can be found in http://devtoolsecrets.com/
Development environment: Install and familiarize yourself with SiteIniIDE (see 7.2)
It helps to look at a few example SiteIni files as provided @ http://www.webgrabplus.com/epg-channels
7.2 SiteIniIDE
This tool provides a dedicated development environment for SiteIni's. It can be obtained @
http://webgrabplus.com/sites/default/files/downloads/Misc/SiteIniIDE_V0.12.zip
The basics are described in a readme.txt document.
7.3 Development steps
1. Start SiteIniIDE (WG++IDE.exe). It uses the third party text editor NotePad++, equipped with the SiteIniIDE
language pack as described in http://www.webgrabplus.com/download/utility/notepad-syntax-highlighting , which
provides coloured highlighting for a SiteIni listing.
2. Enter Alt+N to load a template for a new SiteIni. It will prompt for the name. It is essential to use a name that
reflects the basic url of the tvguide site for which you want to make the SiteIni, e.g. tvguide.com.
This command creates a folder in the debug area of SiteIniIDE. It also creates a SiteIni with the name
you have chosen plus the .ini file extension (tvguide.com.ini), filled with the basic structure and template
scrubstrings, and a WebGrab++.config.xml file.
3. Open the SiteIni file. A lot of the scrubstrings in it are disabled (* at the beginning of the line), but some are
already enabled to start with. E.g.
site {url=your_site_name|timezone=UTC+00:00|maxdays=6|cultureinfo=en-GB|charset=UTF8|titlematchfactor=90|nopageoverlaps}
urldate.format {daycounter|0}
url_index{url|http://www.your_site_name}
url_index.headers {customheader=Accept-Encoding=gzip,deflate} * to speed up the downloading of the index pages
index_showsplit.scrub {multi(debug)||||}
4. In your internet browser, go to the webpage that displays the tvguide of a certain channel for 'today'. Enable the
developer tool as mentioned in 7.1 and make notes of:
o URL: locate the url for the page that contains the actual tvguide data (that is not always the same as the url
that shows in the address bar of the browser!). For this url find out:
o The Webrequest method: GET, POST, POST-BACK or SOAP (consult 4.4.1.1).
For POST, POST-BACK and SOAP follow the special procedures described in 5.1
o The Webrequest headers (see 4.4.1.1)
o How the date is specified. Generally, especially with Webrequest method GET, it is part of the url. It can be just
a simple number or any other date string. Read 4.4.2.1 urldate format. It can also be part of the postdata
header in case of a POST or a POST-BACK Webrequest method.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
Similar to the date, find out where and with what string the requested channel is specified.
The URL just located is the URL of the index_page (the page that lists all the shows for a certain timespan, just
one day in most cases, and for a certain channel). There are, however, all kinds of other index_page structures, like
multi page ('subpage'), multi channel, multi day, time fragmented etc.
Compose and enter url_index. (4.4.2 and an example in 4.4.2.3)
Add the debug argument (4.2.5.5)
Enter urldate.format. The date format as found above in step 4. (Read 4.4.2.1)
(if the index_page is split into several pages, use subpage.format) (Read 4.4.2.2)
Add all the url_index.headers .
Enter a few simple values in the line that starts with site (see 4.3 General Site dependent data):
timezone: Enter the timezone in which the data on the index_page is given. This is mostly the timezone of the
country. But sometimes the index_page is given in the UTC timezone. See 4.2.7
cultureinfo: The culture info string for the country and the language
maxdays: Figure out for how many days the site provides tvguide data.
charset: The charset in which the index_page is written. The value is often found near the top of the
index_page. If unclear, start with utf-8. If the result of the first run looks garbled, try other values.
Leave index_showsplit as given in step 3: index_showsplit.scrub {multi(debug)||||}
This will simply copy the complete index_page into the element index_showsplit and, because of the argument
debug, also into the logfile WebGrab++.log.txt
Now open the config file: WebGrab++.config.xml
APPENDIX B contains an example config file including an explanation of all the values.
Locate the sample channel entry :