Diffstat (limited to 'network/youtube-dl/youtube-dl.1')
-rw-r--r-- | network/youtube-dl/youtube-dl.1 | 274
1 file changed, 244 insertions, 30 deletions
diff --git a/network/youtube-dl/youtube-dl.1 b/network/youtube-dl/youtube-dl.1
index 321792e8840d..087f4164dc29 100644
--- a/network/youtube-dl/youtube-dl.1
+++ b/network/youtube-dl/youtube-dl.1
@@ -133,8 +133,8 @@ Make all connections via IPv6 (experimental)
 .RS
 .RE
 .TP
-.B \-\-cn\-verification\-proxy \f[I]URL\f[]
-Use this proxy to verify the IP address for some Chinese sites.
+.B \-\-geo\-verification\-proxy \f[I]URL\f[]
+Use this proxy to verify the IP address for some geo\-restricted sites.
 The default proxy specified by \-\-proxy (or none, if the options is
 not present) is used for the actual downloading.
 (experimental)
@@ -859,6 +859,8 @@ On Linux and OS X, the system wide configuration file is located at
 On Windows, the user wide configuration file locations are
 \f[C]%APPDATA%\\youtube\-dl\\config.txt\f[] or
 \f[C]C:\\Users\\<user\ name>\\youtube\-dl.conf\f[].
+Note that by default the configuration file may not exist, so you may
+need to create it yourself.
 .PP
 For example, with the following configuration file youtube\-dl will
 always extract the audio, not copy the mtime, use a proxy and save all
@@ -1757,11 +1759,26 @@ legally, you can follow this quick list (assuming your service is called
 .IP " 1." 4
 Fork this repository (https://github.com/rg3/youtube-dl/fork)
 .IP " 2." 4
-Check out the source code with
-\f[C]git\ clone\ git\@github.com:YOUR_GITHUB_USERNAME/youtube\-dl.git\f[]
+Check out the source code with:
+.RS 4
+.IP
+.nf
+\f[C]
+git\ clone\ git\@github.com:YOUR_GITHUB_USERNAME/youtube\-dl.git
+\f[]
+.fi
+.RE
 .IP " 3." 4
 Start a new git branch with
-\f[C]cd\ youtube\-dl;\ git\ checkout\ \-b\ yourextractor\f[]
+.RS 4
+.IP
+.nf
+\f[C]
+cd\ youtube\-dl
+git\ checkout\ \-b\ yourextractor
+\f[]
+.fi
+.RE
 .IP " 4." 4
 Start with this simple template and save it to
 \f[C]youtube_dl/extractor/yourextractor.py\f[]:
@@ -1831,32 +1848,12 @@ extractor should and may return (https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/common.py#L74-L252).
 Add tests and code for as many as you want.
 .IP " 8." 4
-Keep in mind that the only mandatory fields in info dict for successful
-extraction process are \f[C]id\f[], \f[C]title\f[] and either
-\f[C]url\f[] or \f[C]formats\f[], i.e.
-these are the critical data the extraction does not make any sense
-without.
-This means that any
-field (https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/common.py#L148-L252)
-apart from aforementioned mandatory ones should be treated \f[B]as
-optional\f[] and extraction should be \f[B]tolerate\f[] to situations
-when sources for these fields can potentially be unavailable (even if
-they always available at the moment) and \f[B]future\-proof\f[] in order
-not to break the extraction of general purpose mandatory fields.
-For example, if you have some intermediate dict \f[C]meta\f[] that is a
-source of metadata and it has a key \f[C]summary\f[] that you want to
-extract and put into resulting info dict as \f[C]description\f[], you
-should be ready that this key may be missing from the \f[C]meta\f[]
-dict, i.e.
-you should extract it as \f[C]meta.get(\[aq]summary\[aq])\f[] and not
-\f[C]meta[\[aq]summary\[aq]]\f[].
-Similarly, you should pass \f[C]fatal=False\f[] when extracting data
-from a webpage with \f[C]_search_regex/_html_search_regex\f[].
-.IP " 9." 4
-Check the code with flake8 (https://pypi.python.org/pypi/flake8).
+Make sure your code follows youtube\-dl coding
+conventions (#youtube-dl-coding-conventions) and check the code with
+flake8 (https://pypi.python.org/pypi/flake8).
 Also make sure your code works under all
 Python (http://www.python.org/) versions claimed supported by
 youtube\-dl, namely 2.6, 2.7, and 3.2+.
-.IP "10." 4
+.IP " 9." 4
 When the tests pass, add (http://git-scm.com/docs/git-add) the new
 files and commit (http://git-scm.com/docs/git-commit) them and
 push (http://git-scm.com/docs/git-push) the result, like this:
@@ -1871,12 +1868,229 @@ $\ git\ push\ origin\ yourextractor
 \f[]
 .fi
 .RE
-.IP "11." 4
+.IP "10." 4
 Finally, create a pull
 request (https://help.github.com/articles/creating-a-pull-request).
 We\[aq]ll then review and merge it.
 .PP
 In any case, thank you very much for your contributions!
+.SS youtube\-dl coding conventions
+.PP
+This section introduces guidelines for writing idiomatic, robust and
+future\-proof extractor code.
+.PP
+Extractors are very fragile by nature since they depend on the layout of
+the source data provided by a 3rd party media hoster that is out of your
+control, and this layout tends to change.
+As an extractor implementer your task is not only to write code that
+will extract media links and metadata correctly but also to minimize
+code dependency on the source\[aq]s layout changes and even to make the
+code foresee potential future changes and be ready for that.
+This is important because it will allow the extractor not to break on
+minor layout changes, thus keeping old youtube\-dl versions working.
+Even though this breakage issue is easily fixed by releasing a new
+version of youtube\-dl with the fix incorporated, all the previous
+versions become broken in all repositories and distros\[aq] packages
+that may not be so prompt in fetching the update from us.
+Needless to say, some may never receive an update at all, as is possible
+for non\-rolling\-release distros.
+.SS Mandatory and optional metafields
+.PP
+For extraction to work youtube\-dl relies on the metadata your extractor
+extracts and provides to youtube\-dl, expressed as an information
+dictionary (https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/common.py#L75-L257)
+or simply \f[I]info dict\f[].
+Only the following meta fields in the \f[I]info dict\f[] are considered
+mandatory for a successful extraction process by youtube\-dl:
+.IP \[bu] 2
+\f[C]id\f[] (media identifier)
+.IP \[bu] 2
+\f[C]title\f[] (media title)
+.IP \[bu] 2
+\f[C]url\f[] (media download URL) or \f[C]formats\f[]
+.PP
+In fact only the last option is technically mandatory (i.e.
+if you can\[aq]t figure out the download location of the media the
+extraction does not make any sense).
+But by convention youtube\-dl also treats \f[C]id\f[] and \f[C]title\f[]
+as mandatory.
+Thus the aforementioned metafields are the critical data without which
+the extraction does not make any sense; if any of them fails to be
+extracted, the extractor is considered completely broken.
+.PP
+Any
+field (https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/common.py#L149-L257)
+apart from the aforementioned ones is considered \f[B]optional\f[].
+That means that extraction should be \f[B]tolerant\f[] to situations
+when sources for these fields can potentially be unavailable (even if
+they are always available at the moment) and \f[B]future\-proof\f[] in
+order not to break the extraction of general purpose mandatory fields.
+.SS Example
+.PP
+Say you have some source dictionary \f[C]meta\f[] that you\[aq]ve
+fetched as JSON with an HTTP request and it has a key \f[C]summary\f[]:
+.IP
+.nf
+\f[C]
+meta\ =\ self._download_json(url,\ video_id)
+\f[]
+.fi
+.PP
+Assume at this point \f[C]meta\f[]\[aq]s layout is:
+.IP
+.nf
+\f[C]
+{
+\ \ \ \ ...
+\ \ \ \ "summary":\ "some\ fancy\ summary\ text",
+\ \ \ \ ...
+}
+\f[]
+.fi
+.PP
+Assume you want to extract \f[C]summary\f[] and put it into the
+resulting info dict as \f[C]description\f[].
+Since \f[C]description\f[] is an optional metafield you should be
+prepared for this key to be missing from the \f[C]meta\f[] dict, so you
+should extract it like:
+.IP
+.nf
+\f[C]
+description\ =\ meta.get(\[aq]summary\[aq])\ \ #\ correct
+\f[]
+.fi
+.PP
+and not like:
+.IP
+.nf
+\f[C]
+description\ =\ meta[\[aq]summary\[aq]]\ \ #\ incorrect
+\f[]
+.fi
+.PP
+The latter will break the extraction process with a \f[C]KeyError\f[] if
+\f[C]summary\f[] disappears from \f[C]meta\f[] at some later time, but
+with the former approach extraction will just go ahead with
+\f[C]description\f[] set to \f[C]None\f[], which is perfectly fine
+(remember, \f[C]None\f[] is equivalent to the absence of data).
+.PP
+Similarly, you should pass \f[C]fatal=False\f[] when extracting optional
+data from a webpage with \f[C]_search_regex\f[],
+\f[C]_html_search_regex\f[] or similar methods, for instance:
+.IP
+.nf
+\f[C]
+description\ =\ self._search_regex(
+\ \ \ \ r\[aq]<span[^>]+id="title"[^>]*>([^<]+)<\[aq],
+\ \ \ \ webpage,\ \[aq]description\[aq],\ fatal=False)
+\f[]
+.fi
+.PP
+With \f[C]fatal\f[] set to \f[C]False\f[], if \f[C]_search_regex\f[]
+fails to extract \f[C]description\f[] it will emit a warning and
+continue extraction.
+.PP
+You can also pass \f[C]default=<some\ fallback\ value>\f[], for example:
+.IP
+.nf
+\f[C]
+description\ =\ self._search_regex(
+\ \ \ \ r\[aq]<span[^>]+id="title"[^>]*>([^<]+)<\[aq],
+\ \ \ \ webpage,\ \[aq]description\[aq],\ default=None)
+\f[]
+.fi
+.PP
+On failure this code will silently continue the extraction with
+\f[C]description\f[] set to \f[C]None\f[].
+That is useful for metafields that may or may not be
+present.
+.SS Provide fallbacks
+.PP
+When extracting metadata, try to provide several fallback scenarios.
+For example, if \f[C]title\f[] is present in several places/sources, try
+extracting it from at least some of them.
+This would make it more future\-proof in case some of the sources become
+unavailable.
+.SS Example
+.PP
+Say \f[C]meta\f[] from the previous example has a \f[C]title\f[] and you
+are about to extract it.
+Since \f[C]title\f[] is a mandatory metafield you should end up with
+something like:
+.IP
+.nf
+\f[C]
+title\ =\ meta[\[aq]title\[aq]]
+\f[]
+.fi
+.PP
+If \f[C]title\f[] disappears from \f[C]meta\f[] in the future due to
+some changes on the hoster\[aq]s side, the extraction would fail since
+\f[C]title\f[] is mandatory.
+That\[aq]s expected.
+.PP
+Assume that you have some other source you can extract \f[C]title\f[]
+from, for example the \f[C]og:title\f[] HTML meta of the \f[C]webpage\f[].
+In this case you can provide a fallback scenario:
+.IP
+.nf
+\f[C]
+title\ =\ meta.get(\[aq]title\[aq])\ or\ self._og_search_title(webpage)
+\f[]
+.fi
+.PP
+This code will try to extract from \f[C]meta\f[] first and if that fails
+it will try extracting \f[C]og:title\f[] from the \f[C]webpage\f[].
+.SS Make regular expressions flexible
+.PP
+When using regular expressions, keep them fuzzy and flexible.
+.SS Example
+.PP
+Say you need to extract \f[C]title\f[] from the following HTML code:
+.IP
+.nf
+\f[C]
+<span\ style="position:\ absolute;\ left:\ 910px;\ width:\ 90px;\ float:\ right;\ z\-index:\ 9999;"\ class="title">some\ fancy\ title</span>
+\f[]
+.fi
+.PP
+The code for that task should look similar to:
+.IP
+.nf
+\f[C]
+title\ =\ self._search_regex(
+\ \ \ \ r\[aq]<span[^>]+class="title"[^>]*>([^<]+)\[aq],\ webpage,\ \[aq]title\[aq])
+\f[]
+.fi
+.PP
+Or even better:
+.IP
+.nf
+\f[C]
+title\ =\ self._search_regex(
+\ \ \ \ r\[aq]<span[^>]+class=(["\\\[aq]])title\\1[^>]*>(?P<title>[^<]+)\[aq],
+\ \ \ \ webpage,\ \[aq]title\[aq],\ group=\[aq]title\[aq])
+\f[]
+.fi
+.PP
+Note how you tolerate potential changes in the \f[C]style\f[]
+attribute\[aq]s value or a switch from double quotes to single quotes
+for the \f[C]class\f[] attribute.
+.PP
+The code definitely should not look like:
+.IP
+.nf
+\f[C]
+title\ =\ self._search_regex(
+\ \ \ \ r\[aq]<span\ style="position:\ absolute;\ left:\ 910px;\ width:\ 90px;\ float:\ right;\ z\-index:\ 9999;"\ class="title">(.*?)</span>\[aq],
+\ \ \ \ webpage,\ \[aq]title\[aq])
+\f[]
+.fi
+.SS Use safe conversion functions
+.PP
+Wrap all extracted numeric data into safe functions from \f[C]utils\f[]:
+\f[C]int_or_none\f[], \f[C]float_or_none\f[].
+Use them for string to number conversions as well.
 .SH EMBEDDING YOUTUBE\-DL
 .PP
 youtube\-dl makes the best effort to be a good command\-line program,
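As a rough illustration of the conventions the new man page section describes (this sketch is not part of the patch itself), the following Python snippet shows how optional metafields and the safe conversion helpers from youtube_dl.utils might be combined when building an info dict. The meta dictionary and its keys are invented for the example; a real extractor would obtain them from something like self._download_json(url, video_id).

    # Illustration only: combining the conventions described above.
    # The contents of `meta` are made up for this example.
    from youtube_dl.utils import int_or_none, float_or_none

    meta = {
        'id': '42',
        'title': 'some fancy title',
        'duration': '128',  # numeric data often arrives as strings
    }

    info = {
        'id': meta['id'],                    # mandatory: a KeyError here should break extraction
        'title': meta['title'],              # mandatory as well
        'description': meta.get('summary'),  # optional: tolerate a missing key, yielding None
        'duration': int_or_none(meta.get('duration')),         # safe string-to-int conversion
        'average_rating': float_or_none(meta.get('rating')),   # missing value becomes None, not a crash
    }

    print(info)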