20 Commits

Author SHA1 Message Date
TinyToweringTree
1326a5aa38 [archiveorg] Make metadata extraction more robust 2020-02-20 00:02:32 +01:00
TinyToweringTree
b98d1c0d5a [archiveorg] Use and fix get_element_by_class()
Use get_element_by_class() from utils to get rid of yet another regex.
This function used to return only the content of the element, not the
element itself with its tag and attributes. The names of the whole
get_element_by_X() group are a bit of a misnomer for this reason.

All these functions can now return the whole element when setting
their `include_tag` parameter to `True`. By default it is `False` so
no other code will be affected by this change. Tests have been added
to test/test_utils.py accordingly.

This uncovered a bug that prevented elements whose class name starts
with a hyphen from being found. The regex used in
get_elements_by_class() has been corrected to fix this.
2020-02-19 22:42:00 +01:00
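The behaviour described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the code from utils: the matching strategy here (splitting the class attribute into tokens instead of using the regex the commit mentions) is an assumption made for illustration, and it does not handle nested elements with the same tag name or single-quoted attributes. Only the function name and the `include_tag` parameter come from the message above.

```python
import re

# Hypothetical simplified matcher: finds opening tags carrying a
# double-quoted class attribute. The real helper is regex-based too,
# but its exact pattern is not reproduced here.
OPEN_TAG = re.compile(
    r'<(?P<tag>[a-zA-Z0-9]+)\b[^>]*\sclass="(?P<classes>[^"]*)"[^>]*>')


def get_element_by_class(class_name, html, include_tag=False):
    """Return the first element with the given class, or None.

    By default only the element's content is returned; with
    include_tag=True the opening and closing tags are included.
    """
    for m in OPEN_TAG.finditer(html):
        # Compare whole class tokens, so a leading hyphen needs no
        # special treatment and 'title' does not match '-title'.
        if class_name not in m.group('classes').split():
            continue
        close = '</%s>' % m.group('tag')
        end = html.find(close, m.end())
        if end == -1:
            continue
        if include_tag:
            return html[m.start():end + len(close)]
        return html[m.end():end]
    return None


html = '<div class="meta"><span class="-title big">Cops</span></div>'
print(get_element_by_class('-title', html))
# Cops
print(get_element_by_class('-title', html, include_tag=True))
# <span class="-title big">Cops</span>
```

Keeping `include_tag` off by default mirrors the compatibility argument in the message: existing callers keep receiving only the content.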
TinyToweringTree
e910f498d3 [archiveorg] Use extract_attributes() 2020-02-19 22:04:47 +01:00
TinyToweringTree
8df0c2c7a5 [archiveorg] Fix extraction (closes #21330, closes #23586, closes #23700) 2020-01-24 17:49:45 +01:00
Sergey M․
81dc74966a [archiveorg] Fix extraction (closes #15770, closes #15772) 2018-03-05 22:30:32 +07:00
Tithen-Firion
c12b4b80f8 [archiveorg] Update test 2017-04-28 03:48:32 +07:00
Yen Chi Hsuan
a4a554a793 [generic] Try parsing JWPlayer embedded videos (closes #12030) 2017-02-16 23:44:03 +08:00
Sergey M․
84bc23b41b [archiveorg] PEP 8 2016-08-05 23:16:19 +07:00
Remita Amine
d50aca41f8 [archiveorg] improve format extraction (closes #10219) 2016-08-05 16:42:15 +01:00
blissland
d6a1738892 [archive.org] Fix incorrect url condition (closes #5628)
The condition for assigning to json_url is the wrong way round:

currently for url: aaa.com/xxx

we get:

aaa.com/xxx&output=json

instead of the correct value:

aaa.com/xxx?output=json
2015-05-06 15:06:10 +02:00
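The corrected condition from this commit message can be sketched as follows. The helper name `build_json_url` is hypothetical and the extractor's actual code is not shown in this log; only the `json_url` value and the `output=json` parameter come from the message above.

```python
def build_json_url(url):
    """Append output=json with the separator the URL actually needs."""
    # '?' starts a query string when none exists yet; '&' extends one.
    separator = '&' if '?' in url else '?'
    return url + separator + 'output=json'


json_url = build_json_url('aaa.com/xxx')
print(json_url)
# aaa.com/xxx?output=json
```

A URL that already carries a query string, such as `aaa.com/xxx?start=1`, gets `&output=json` appended instead.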
Sergey M․
e8e28989eb [archiveorg] Add test, simplify and modernize 2014-12-29 02:08:46 +06:00
Johannes Knoedtel
ff7a07d5c4 [archiveorg] most metadata fields are optional
Example: https://archive.org/details/Cops1922
2014-12-28 20:31:25 +01:00
Philipp Hagemeister
42154ad5bc [archiveorg] Use centralized sorting 2014-01-07 10:16:22 +01:00
Philipp Hagemeister
3798eadccd More unicode literals 2014-01-07 10:06:30 +01:00
Philipp Hagemeister
29030c0a4c Merge remote-tracking branch 'dstftw/correct-valid-urls' 2013-12-04 19:56:05 +01:00
dst
c0ade33e16 Correct some extractor _VALID_URL regexes 2013-12-04 20:34:47 +07:00
Jaime Marquínez Ferrándiz
fb7abb31af Remove the compatibility code used before the new format system was implemented 2013-12-03 14:31:20 +01:00
Jaime Marquínez Ferrándiz
471a5ee908 Set the ext field for each format 2013-09-14 14:45:04 +02:00
Philipp Hagemeister
690e872c51 Remove video_result helper method
Calling it was more complex than actually including the type in the video info
2013-07-11 12:12:30 +02:00
Philipp Hagemeister
5fe3a3c3fb [archive.org] Add extractor (Fixes #1003) 2013-07-08 02:05:02 +02:00