Tilman Bayer
2017-11-01 17:40:48 UTC
CCing the data dumps mailing list, which is the recommended venue for
questions like this
(https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help).
On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <
Also, important categories like Computer Architecture, Human-based
computation, Programming language theory, Software Engineering, and Theory
of Computation are missing from the subcategories of Areas of Computer
Science.
*Regards,*
*Shubhanshu Mishra*
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
*Website:* http://shubhanshu.com
*LinkedIn Profile: *http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <
Analytics mailing list
https://lists.wikimedia.org/mailman/listinfo/analytics
Hi,
When using the Wikipedia dump files, I am unable to find many categories
and pages in the dump.
E.g. under the Areas_of_computer_science category I get only 13
subcategories and 2 pages instead of 17 subcategories and 2 pages.
Furthermore, one page, "Computational_creativity", is not present as a
subcategory.
I am using the following Wikipedia dump files:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
files, but I also tried searching the sql.gz files directly and couldn't
find any entry for 16300571 in either page.sql.gz or category.sql.gz.
16300571 supposedly refers to the Computational_creativity page, as in the
following rows:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
More details can be found at:
https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from
the dumps?
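
For reference, the direct search of the sql.gz files described above can be
sketched roughly as follows. This is only a minimal sketch, not the
WikiUtils implementation: the function name find_rows_by_id is
hypothetical, and it assumes the dump has the usual mysqldump layout of
"INSERT INTO ... VALUES (...),(...);" lines with the ID in the first column
of each row.

```python
import gzip
import re


def find_rows_by_id(dump_path, row_id):
    """Yield the start of every row whose first column equals row_id.

    Heuristic scan (assumption: mysqldump-style INSERT lines). It does not
    parse SQL, so a "),(<id>," sequence inside a string value would also match.
    """
    id_text = str(int(row_id))
    # A row starts right after "VALUES (" or after the "),(" row separator.
    # The trailing comma ensures that 16300571 does not match 163005710.
    pattern = re.compile(r"(?:VALUES \(|\),\()" + id_text + ",")
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue  # skip schema definitions and comments
            for match in pattern.finditer(line):
                start = match.end() - len(id_text) - 1
                # Return a short preview of the row, not the whole line.
                yield line[start:start + 200]


if __name__ == "__main__":
    for row in find_rows_by_id("enwiki-20170920-page.sql.gz", 16300571):
        print(row)
```

An empty result for page.sql.gz would be consistent with the ID being
absent from that file, but since this is a plain regex scan it should be
treated as a quick check, not as proof.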
*Regards,*
*Shubhanshu Mishra*
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
*Website:* http://shubhanshu.com
*LinkedIn Profile: *http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB