The Child Online Protection Act and Internet Content Filtering
The Child Online Protection Act and Internet Content Filtering¶
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/QKNnwLL991c" frameborder="0" allowfullscreen></iframe>')
Background¶
Study commissioned by DoJ re Child Online Protection Act of 1998 (COPA).
Apologies: stale data. 2005–2006. Required subpoenas of Google, AOL, MSN, Yahoo!
Attempts to legislate protection of minors: CDA, CIPA, COPA.
I worked primarily on COPA; a little on CIPA.
Team at CRAI led by Paul Mewett collected and categorized the webpages and ran filter tests.
I designed the experiments, drew the random samples, analyzed the data.
News coverage of Google subpoena generated lots of hate mail.
Legal¶
COPA¶
2nd attempt to legislate protection from commercial “harmful-to-minors” content
NOT ABOUT CHILD PORNOGRAPHY
Exemptions for literary, artistic, and educational content, ISPs, search engines.
Requires age screen for commercial porn.
Credit card number deemed adequate proof of age.
Supreme Court¶
Feds have legitimate interest in protecting children.
COPA potentially “chills” free speech.
DoJ had to show that COPA is “least restrictive alternative.”
How well do filters work?
Data¶
Data Sources: Search Engines¶
Filters over-block and under-block (make Type I and II errors).
Population of pages matters. What’s relevant?
Internet largely mediated by search engines.
Random sample of 50,000 webpages from Google search index in 2006. (Pages users might find.)
Random sample of 1 million webpages from MSN search index in 2005. (Pages users might find.)
Week of search queries from AOL, MSN and Yahoo! by subpoena, about 1.3 billion queries. (Pages users do find.)
685 most popular queries from Wordtracker 11/12/05–2/20/06. (Pages users find most often.)
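The two error rates can be made concrete with a small sketch. It assumes underblocking means the share of adult pages a filter lets through and overblocking means the share of clean pages it blocks; the function name and the toy labels are illustrative, not data from the study.

```python
def error_rates(is_adult, is_blocked):
    """Return (underblocking, overblocking) for one filter on a labelled sample.

    is_adult[i]   -- True if page i was categorized as adult
    is_blocked[i] -- True if the filter blocked page i
    """
    if len(is_adult) != len(is_blocked):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(is_adult, is_blocked))
    adult_total = sum(1 for adult, _ in pairs if adult)
    clean_total = len(pairs) - adult_total
    if adult_total == 0 or clean_total == 0:
        raise ValueError("need at least one adult page and one clean page")
    # Underblocking: adult pages the filter failed to block.
    missed_adult = sum(1 for adult, blocked in pairs if adult and not blocked)
    # Overblocking: clean pages the filter blocked anyway.
    blocked_clean = sum(1 for adult, blocked in pairs if not adult and blocked)
    return missed_adult / adult_total, blocked_clean / clean_total


# Toy labels (hypothetical): 2 adult pages, 3 clean pages.
toy_is_adult = [True, True, False, False, False]
toy_is_blocked = [True, False, True, False, False]
under, over = error_rates(toy_is_adult, toy_is_blocked)
print(f"underblocking = {under:.1%}, overblocking = {over:.1%}")
```

Both rates are conditional on the true category, so they do not depend on how common adult pages are in the population; the population matters once the rates are turned into numbers of pages blocked or missed.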
Categorizing Pages¶
Team at CRA International attempted to view and categorize:
39,999 random webpages from MSN index
11,000 random webpages from Google index
first 10 results of each of a stratified random sample of 7,541 queries (total weight 15,461)
first 10 results of the 685 Wordtracker searches
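Because the sampled queries carry weights, estimates for search results are weighted averages, not simple proportions. A minimal sketch, assuming each sampled query has a single weight that already reflects how often it was issued and how its stratum was sampled; the study's actual weighting scheme is not described here, and the flags and weights below are made up.

```python
def weighted_share(flags, weights):
    """Weighted proportion of sampled queries whose flag is True."""
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    flagged_weight = sum(w for flag, w in zip(flags, weights) if flag)
    return flagged_weight / total


# Hypothetical sample of four queries: True means the query returned
# at least one adult page among its first 10 results.
toy_flags = [True, False, False, True]
toy_weights = [5, 1, 3, 1]

weighted = weighted_share(toy_flags, toy_weights)
unweighted = sum(toy_flags) / len(toy_flags)
print(f"weighted = {weighted:.2f}, unweighted = {unweighted:.2f}")
```

The two numbers differ whenever frequently issued queries behave differently from rare ones, which is why the weighting matters for "pages users do find."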
Raw results¶
68,150 webpages of which 63,105 worked.
60,833 Category 1a: no reference to sex and no nudity.
1,382 Category 5f: adult entertainment.
890 in other categories, e.g., show genitalia in an artistic or educational context.
I drew random samples of the Category 1a pages to test filters.
Results¶
Prevalence of Adult Content¶
Sizes of populations and samples. Searches weighted by frequency.¶
result | Google index | MSN index | AOL, MSN, Yahoo! searches | Wordtracker searches |
---|---|---|---|---|
pages in sample | 11,100 | 39,999 | 22,405 | 206 million |
working pages in sample | 10,009 | 36,557 | 21,870 | 195 million |
queries in pop | | | 1.3 billion | 20.6 million |
queries in sample | | | 2,345 | 20.6 million |
Estimated prevalence of adult pages¶
Source | Google index | MSN index | AOL, MSN, Yahoo! searches | Wordtracker searches |
---|---|---|---|---|
adult webpages | 1.1% | 1.1% | 1.7% | 14.1% |
domestic adult webpages | 44.2% | 56.7% | 88.4% | 87.4% |
searches w adult results | | | 6.0% | 37.1% |
searches w domestic adult results | | | 5.7% | 37.0% |
Conservative 95% lower confidence limits found by inverting binomial tests.¶
bound | Google index | MSN index | AOL, MSN, Yahoo! searches |
---|---|---|---|
adult | 1.0% | 1.0% | 2.5% |
domestic adult | 0.4% | 0.5% | 2.2% |
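One common way to invert a binomial test is the Clopper-Pearson construction: the one-sided 95% lower limit is the value of p at which seeing at least x adult pages in n draws has probability 5%. The sketch below assumes that construction and treats the sample as n independent draws; the study's exact procedure may differ, and the count x = 110 is a hypothetical figure chosen only to be close to 1.1% of the 10,009 working Google pages.

```python
import math


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), computed as 1 - P(X <= x - 1)."""
    if x <= 0:
        return 1.0
    if x > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    lower = 0.0
    for k in range(x):
        # log of C(n, k) * p**k * (1 - p)**(n - k); lgamma avoids overflow
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        lower += math.exp(log_term)
    return max(0.0, 1.0 - lower)


def lower_confidence_limit(x, n, alpha=0.05):
    """One-sided lower limit for p: solve P(X >= x | p) = alpha by bisection."""
    if not 0 <= x <= n:
        raise ValueError("need 0 <= x <= n")
    if x == 0:
        return 0.0
    lo, hi = 0.0, x / n  # the limit always lies below the sample proportion
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_upper_tail(x, n, mid) < alpha:
            lo = mid  # the test rejects mid, so the limit is above it
        else:
            hi = mid
    return lo  # slightly below the exact root, which errs on the conservative side


# Hypothetical count: 110 adult pages among 10,009 working pages.
limit = lower_confidence_limit(110, 10009)
print(f"95% lower confidence limit: {limit:.2%}")
```

The bound is conservative in the usual sense for exact binomial intervals: its coverage is at least 95% for every true p, at the cost of being somewhat lower than an approximate normal-theory limit would be.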
Filtering¶
Estimated underblocking & overblocking¶
Filter | Underblocking, Google | Underblocking, MSN | Overblocking, Google | Overblocking, MSN |
---|---|---|---|---|
AOL Mature Teen | 8.9% | 8.6% | 22.6% | 23.6% |
MSN Pornography | 16.8% | 18.7% | 19.6% | 10.3% |
MSN Teen | 17.7% | 20.5% | 21.9% | 18.9% |
ContentProtect Default | 38.3% | 45.4% | 2.8% | 3.0% |
ContentProtect Custom | 28.3% | 46.7% | 1.4% | 0.7% |
CyberPatrol Custom | 31.0% | 33.5% | 1.4% | 0.9% |
CyberSitter Default | 12.7% | 16.5% | 3.6% | 4.1% |
CyberSitter Custom | 12.4% | 18.9% | 4.0% | 3.7% |
McAfee Young Teen | 16.1% | 26.0% | 12.4% | 13.2% |
Net Nanny Level 2 | 44.0% | 46.1% | 3.3% | 2.2% |
Norton Default | 60.2% | 54.9% | 1.4% | 0.7% |
Norton Custom | 58.4% | 54.2% | 0.9% | 0.4% |
Verizon | 41.8% | 40.3% | 9.4% | 5.7% |
8e6 | 18.3% | 23.0% | 9.4% | 7.5% |
SafeEyes | 16.2% | 15.2% | 3.3% | 3.2% |
Conservative 95% lower confidence limits¶
Filter | Underblocking, Google | Underblocking, MSN | Overblocking, Google | Overblocking, MSN |
---|---|---|---|---|
AOL Mature Teen | 5.6% | 6.5% | 18.4% | 21.0% |
MSN Pornography | 12.1% | 15.7% | 15.8% | 8.5% |
MSN Teen | 12.8% | 17.4% | 17.8% | 16.6% |
ContentProtect Default | 31.3% | 41.3% | 1.5% | 2.1% |
ContentProtect Custom | 22.2% | 42.6% | 0.6% | 0.4% |
CyberPatrol Custom | 24.6% | 29.7% | 0.6% | 0.5% |
CyberSitter Default | 8.6% | 13.6% | 2.1% | 3.1% |
CyberSitter Custom | 8.4% | 15.9% | 2.4% | 2.7% |
McAfee Young Teen | 11.4% | 22.5% | 9.3% | 11.3% |
Net Nanny Level 2 | 36.8% | 41.9% | 1.9% | 1.5% |
Norton Default | 52.9% | 50.7% | 0.6% | 0.4% |
Norton Custom | 51.1% | 50.1% | 0.4% | 0.2% |
Verizon | 34.7% | 36.2% | 6.7% | 4.4% |
8e6 | 13.1% | 19.6% | 6.7% | 6.0% |
SafeEyes | 11.4% | 12.3% | 1.9% | 2.3% |
Of adult pages not blocked, estimated percentage that are domestic¶
Filter | Google | MSN |
---|---|---|
AOL Mature Teen | 40.0% | 40.6% |
MSN Pornography | 31.6% | 42.9% |
MSN Teen | 40.0% | 37.7% |
ContentProtect Default | 39.0% | 45.8% |
ContentProtect Custom | 40.6% | 47.1% |
CyberPatrol Custom | 48.6% | 44.0% |
CyberSitter Default | 50.0% | 32.8% |
CyberSitter Custom | 57.1% | 36.2% |
McAfee Young Teen | 44.4% | 37.5% |
Net Nanny Level 2 | 41.7% | 48.1% |
Norton Default | 35.3% | 49.3% |
Norton Custom | 36.4% | 49.7% |
Verizon | 37.0% | 42.4% |
8e6 | 42.1% | 46.8% |
SafeEyes | 35.3% | 40.4% |
Estimated underblocking and overblocking for AOL, MSN, & Yahoo! search results¶
filter | underblocking results | overblocking results | domestic underblocked | underblocking queries | 95% CL |
---|---|---|---|---|---|
AOL Mature Teen | 6.2% | 12.5% | 57.0% | 15.6% | 5.3% |
MSN Pornography | 21.4% | 4.4% | 86.1% | 32.3% | 20.9% |
MSN Teen | 20.8% | 5.8% | 91.9% | 28.1% | 18.8% |
ContentProtect Default | 18.4% | 6.4% | 70.1% | 46.2% | 10.0% |
ContentProtect Custom | 20.4% | 0.0% | 62.1% | 42.2% | 25.4% |
CyberPatrol Custom | 34.6% | 0.4% | 94.9% | 65.6% | 24.4% |
CyberSitter Default | 11.2% | 4.6% | 33.8% | 23.2% | 11.2% |
CyberSitter Custom | 10.0% | 5.3% | 44.1% | 20.1% | 8.1% |
McAfee Young Teen | 14.2% | 20.7% | 80.7% | 30.9% | 10.4% |
Net Nanny Level 2 | 28.1% | 3.7% | 79.4% | 36.6% | 20.8% |
Norton Default | 42.1% | 0.8% | 85.3% | 51.6% | 49.3% |
Norton Custom | 43.4% | 0.0% | 85.6% | 56.1% | 54.3% |
Verizon | 23.1% | 1.3% | 80.9% | 41.6% | 31.4% |
8e6 | 7.3% | 7.5% | 78.0% | 23.4% | 11.7% |
SafeEyes | 13.7% | 1.9% | 87.8% | 29.8% | 14.9% |
Estimated underblocking and overblocking for Wordtracker query results¶
filter | underblocking results | overblocking results | domestic underblocked | underblocking queries |
---|---|---|---|---|
AOL Mature Teen | 1.3% | 19.6% | 69.2% | 4.3% |
MSN Pornography | 2.7% | 13.3% | 86.1% | 8.2% |
MSN Teen | 2.6% | 13.7% | 83.1% | 8.3% |
ContentProtect Default | 7.5% | 12.4% | 84.1% | 23.1% |
ContentProtect Custom | 8.1% | 7.8% | 84.9% | 25.3% |
CyberPatrol Custom | 3.9% | 9.2% | 86.4% | 10.1% |
CyberSitter Default | 1.4% | 19.9% | 69.3% | 5.1% |
CyberSitter Custom | 2.9% | 18.2% | 84.0% | 9.4% |
McAfee Young Teen | 2.8% | 32.8% | 70.7% | 9.3% |
Net Nanny Level 2 | 12.6% | 9.5% | 82.9% | 34.4% |
Norton Default | 9.9% | 4.8% | 79.4% | 25.2% |
Norton Custom | 10.2% | 2.9% | 79.4% | 25.9% |
Verizon | 4.4% | 16.1% | 67.9% | 15.0% |
8e6 | 3.4% | 25.1% | 93.0% | 10.3% |
SafeEyes | 2.0% | 16.5% | 96.6% | 6.4% |
Filter Results¶
The most restrictive filter blocked 91% of adult pages; it also blocked about 23–24% of the clean webpages in the indexes.
It would block 22–23 clean webpages for each adult page it blocks in the Google or MSN search index.
Less restrictive filters blocked as little as 40% of the adult pages.
The most restrictive filter blocked about 94% of the adult pages among search results; it also blocked about 13% of clean search results.
On average, it would block about 7.6 clean results for every adult result it blocks.
For the most popular queries, the most restrictive filter blocked over 98% of adult results; it also blocked about 20% of clean results.
It would block about 1.1 clean results of popular searches for each adult result it blocks.
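The "clean pages blocked per adult page blocked" figures follow from three numbers: the prevalence of adult pages, the underblocking rate, and the overblocking rate. A minimal sketch of that arithmetic, using the rounded AOL Mature Teen entries from the tables above; the function name is illustrative, and because the inputs are rounded the results only approximate the quoted figures.

```python
def clean_blocked_per_adult_blocked(prevalence, underblocking, overblocking):
    """Expected number of clean pages blocked for each adult page blocked.

    prevalence    -- fraction of pages that are adult
    underblocking -- fraction of adult pages the filter fails to block
    overblocking  -- fraction of clean pages the filter blocks
    """
    adult_blocked = prevalence * (1 - underblocking)
    clean_blocked = (1 - prevalence) * overblocking
    if adult_blocked == 0:
        raise ValueError("the filter blocks no adult pages")
    return clean_blocked / adult_blocked


# Rounded AOL Mature Teen entries from the tables above.
index_ratio = clean_blocked_per_adult_blocked(0.011, 0.089, 0.226)
search_ratio = clean_blocked_per_adult_blocked(0.017, 0.062, 0.125)
popular_ratio = clean_blocked_per_adult_blocked(0.141, 0.013, 0.196)

for label, ratio in [("index sample", index_ratio),
                     ("search results", search_ratio),
                     ("popular queries", popular_ratio)]:
    print(f"{label}: {ratio:.1f} clean pages blocked per adult page blocked")
```

With these rounded inputs the ratio is roughly 22 for the index sample, close to 8 for search results, and a little over 1 for the popular queries. The same filter looks far more costly when adult pages are rare, because nearly everything it blocks is then clean.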
Location: Foreign Adult Websites with Commercial Ties to the US¶
Data Source | Percentage |
---|---|
Google index | 90.3% |
MSN index | 89.8% |
AOL, MSN & Y! queries | 88.2% |
Wordtracker queries | 95.9% |
Estimated percentage of nominally free adult foreign webpages that have commercial ties to the United States, based on data provided by CRA International. Estimates for query results take into account query weights.
The other side¶
Filtering studies cited by Plaintiffs’ Expert¶
Reference | Year | Sample type | Quantitative | Source of pages |
---|---|---|---|---|
eTesting Labs | 2001 | convenience | yes | searches on Google |
eTesting Labs | 2002 | convenience | yes | searches on Google; DMOZ |
NetAlert | 2001 | quota | yes | unknown |
PC Magazine | 2004 | unknown | no | unknown |
Consumer Reports | 2005 | convenience | no | unknown |
Rulespace depo | 2006 | convenience | yes | unknown |
eTesting 1: Google search for “free adult sex.”
eTesting 2: Added DMOZ; took sample of results.
NetAlert: at most 30 webpages.
Plaintiffs’ “Internet Geography” Study¶
Claim: fewer than half of “free” porn sites are in the US, and about 2/3 of adult membership websites are in the US.
Universe: Adultreviews.net, Adultwebmasters.org, Google Web Directory, Sextracker.com.
Sample of convenience, not census or random sample.
According to his database, the following are porn sites: aol.com, msn.com, yahoo.com, about.com, lycos.fr, lycos.co.uk, com.ar, com.au, com.br, co.hu, co.il, co.kr, com.mx, co.nz, com.pl, com.pt, com.tw, com.ua, co.uk, com.ve, co.yu, co.za
Serious bug: claims entire commercial domains of at least 17 countries are porn sites.
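The kind of check that would have caught this bug is easy to sketch. The heuristic below flags entries that consist only of "co" or "com" followed by a two-letter country code, that is, a whole national commercial namespace and not a single site. It is a deliberately crude stand-in for a proper public-suffix lookup, the function name is illustrative, and it is applied only to the entries quoted above, which may not be the full list.

```python
def is_bare_country_commercial_suffix(entry):
    """True for entries like 'co.uk' or 'com.au' that name no specific site."""
    labels = entry.lower().strip(".").split(".")
    return (
        len(labels) == 2
        and labels[0] in {"co", "com"}
        and len(labels[1]) == 2
        and labels[1].isalpha()
    )


# Entries quoted from the database in the notes above.
listed = [
    "aol.com", "msn.com", "yahoo.com", "about.com", "lycos.fr", "lycos.co.uk",
    "com.ar", "com.au", "com.br", "co.hu", "co.il", "co.kr", "com.mx", "co.nz",
    "com.pl", "com.pt", "com.tw", "com.ua", "co.uk", "com.ve", "co.yu", "co.za",
]

flagged = [entry for entry in listed if is_bare_country_commercial_suffix(entry)]
print("entries that cover a whole country's commercial domain:")
print(", ".join(flagged))
```

The remaining entries (aol.com, msn.com, yahoo.com, about.com and the Lycos sites) are ordinary portals, so they need a different kind of sanity check, such as comparing the database against a list of known general-purpose sites.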
The Public¶
Surprising outcry: people thought the suit enabled DoJ to get personal info. Of course,
every ISP, search engine, & e-commerce site had the info…as did the NSA.
the subpoenas specifically said not to include IP addresses or any other information that would identify users http://www.nytimes.com/2006/01/20/technology/google-resists-us-subpoena-of-search-data.html
the queries were not the point: the point was what results the queries retrieved
Google fought the subpoena for search records at the same time it was censoring search results in China
Court ordered Google to give sample from search index, but no queries
well now good for you – instead of teaching parents/caregivers of minors how to block unwanted porn sites you have given this administration an EXCUSE to peruse search engine data bases.
enough erosion of civil liberties
Dorothy Grimes
earthchildren@comcast.net
Heartwood Books, Heartwood@cstone.net, 1/20/06
Dear Professor Stark,
The Google user is an actual person, not just a statistic, and your attempt to expose my personal information (even buried in a large quantity of data) is at best short sighted on your part. It is also annoying. It is absolutely NONE OF YOUR BUSINESS what I search for on Google.
I am aware of the fact that some people (especially the young) seem to place no value on privacy. But this is not the case for everyone. Do you think for a minute that the government will be satisfied with “anonymous” data if it sees “suspicious” patterns? Using statistical methods to identify criminals has enormous potential for misuse. Look at the early use of genetics that produced eugenics. Before you accept your next consulting fee, stop and talk with someone about the ethics of your work.
Even if you do not value your personal privacy in this matter, ask yourself if you would want the public or the government examining all of your communication or internet use. When the government gains the right to watch our private non-criminal lives, this power will not exist only for the current well meaning Bush administration but will be available for the next Bush, Clinton or Nixon as well.
It is absolutely NONE OF YOUR BUSINESS what I search for on Google. It is none of my business whether the baseball cap just looks cute or is hiding thinning hair. Some things are private.
Paul Collinge
Heartwood Books 5 Elliewood Ave. Charlottesville, Va. 22903 434 295 7083