The Child Online Protection Act and Internet Content Filtering

from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/QKNnwLL991c" frameborder="0" allowfullscreen></iframe>')

Background

  • Study commissioned by DoJ re Child Online Protection Act of 1998 (COPA).

  • Apologies: stale data. 2005–2006. Required subpoenas of Google, AOL, MSN, Yahoo!

  • Attempts to legislate protection of minors: CDA, CIPA, COPA.

  • I worked primarily on COPA; a little on CIPA.

  • Team at CRAI led by Paul Mewett collected and categorized the webpages and ran filter tests.

  • I designed the experiments, drew the random samples, analyzed the data.

  • News coverage of Google subpoena generated lots of hate mail.

Data

Data Sources: Search Engines

Filters over-block and under-block (make Type I and II errors).
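As a minimal sketch of the two error rates, assuming underblocking means the share of adult pages a filter lets through and overblocking means the share of clean pages it blocks (which matches how the rates are used later in these slides). The class name, field names, and counts below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FilterTest:
    """Outcome counts for one filter on a set of hand-categorized pages."""
    adult_blocked: int
    adult_allowed: int
    clean_blocked: int
    clean_allowed: int


def underblocking_rate(t: FilterTest) -> float:
    """Fraction of adult pages the filter failed to block."""
    return t.adult_allowed / (t.adult_blocked + t.adult_allowed)


def overblocking_rate(t: FilterTest) -> float:
    """Fraction of clean pages the filter blocked."""
    return t.clean_blocked / (t.clean_blocked + t.clean_allowed)


# Hypothetical counts, not data from the study.
example = FilterTest(adult_blocked=182, adult_allowed=18,
                     clean_blocked=230, clean_allowed=770)
print(f"underblocking: {underblocking_rate(example):.1%}")
print(f"overblocking:  {overblocking_rate(example):.1%}")
```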

Population of pages matters. What’s relevant?

Internet largely mediated by search engines.

  • Random sample of 50,000 webpages from Google search index in 2006. (Pages users might find.)

  • Random sample of 1 million webpages from MSN search index in 2005. (Pages users might find.)

  • One week of search queries from AOL, MSN and Yahoo!, obtained by subpoena: about 1.3 billion queries. (Pages users do find.)

  • 685 most popular queries from Wordtracker 11/12/05–2/20/06. (Pages users find most often.)

Categorizing Pages

Team at CRA International attempted to view and categorize:

  • 39,999 random webpages from MSN index

  • 11,000 random webpages from Google index

  • first 10 results of each of a stratified random sample of 7,541 queries (total weight 15,461)

  • first 10 results of the 685 Wordtracker searches

Raw results

  • 68,150 webpages of which 63,105 worked.

  • 60,833 Category 1a: no reference to sex and no nudity.

  • 1,382 Category 5f: adult entertainment.

  • 890 in other categories, e.g., show genitalia in an artistic or educational context.

I drew random samples of the Category 1a pages to test filters.

Results

Prevalence of Adult Content

Sizes of populations and samples. Searches weighted by frequency.

Result                    Google index   MSN index   AOL, MSN, Yahoo! searches   Wordtracker searches
pages in sample           11,100         39,999      22,405                      206 million
working pages in sample   10,009         36,557      21,870                      195 million
queries in population     -              -           1.3 billion                 20.6 million
queries in sample         -              -           2,345                       20.6 million

Estimated prevalence of adult pages

Source                                 Google index   MSN index   AOL, MSN, Yahoo! searches   Wordtracker searches
adult webpages                         1.1%           1.1%        1.7%                        14.1%
domestic adult webpages                44.2%          56.7%       88.4%                       87.4%
searches with adult results            -              -           6.0%                        37.1%
searches with domestic adult results   -              -           5.7%                        37.0%
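Because searches are weighted by frequency, the share of searches with adult results is a weighted proportion rather than a simple count. The sketch below shows the arithmetic under the assumption that each sampled query carries one weight; the function name and the data are hypothetical, and the study's actual estimator (which also had to handle stratification) is not given in these slides.

```python
def weighted_share(flags, weights):
    """Weighted fraction of items whose flag is True.

    flags[i]   -- True if query i returned at least one adult result
    weights[i] -- weight of query i (for example, how often it was issued)
    """
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return flagged / total


# Hypothetical sample of four queries, not data from the study.
has_adult_result = [True, False, False, True]
query_weight = [5, 90, 4, 1]
print(f"{weighted_share(has_adult_result, query_weight):.1%} of weighted searches")
```

With these made-up numbers, half of the distinct queries return adult results, but they account for only a small share of search volume, which is why the weighting matters.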

Conservative 95% lower confidence limits found by inverting binomial tests.

Bound            Google index   MSN index   AOL, MSN, Yahoo! searches
adult            1.0%           1.0%        2.5%
domestic adult   0.4%           0.5%        2.2%
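One standard way to invert a binomial test into a one-sided lower confidence limit is the Clopper-Pearson construction: find the smallest proportion p for which observing at least k hits in n trials is not rejected at level 5%. The sketch below does this by bisection with the standard library only. It assumes a simple random sample with equal weights, so it is an illustration of the idea and not necessarily the exact procedure used for the tables; the counts in the usage line are hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), 0 < p < 1, with terms computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    below = 0.0
    for i in range(k):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


def lower_confidence_limit(k, n, alpha=0.05, iterations=100):
    """One-sided (1 - alpha) lower confidence limit for a binomial proportion."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) < alpha:
            lo = mid  # p = mid is rejected: k hits would be too surprising
        else:
            hi = mid  # p = mid is not rejected
    return lo  # slightly below the exact root, which errs on the conservative side


# Hypothetical counts: 110 adult pages among 10,000 working pages.
print(f"{lower_confidence_limit(110, 10_000):.2%}")
```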

Filtering

Estimated underblocking & overblocking

Filter                    Underblocking       Overblocking
                          Google    MSN       Google    MSN
AOL Mature Teen           8.9%      8.6%      22.6%     23.6%
MSN Pornography           16.8%     18.7%     19.6%     10.3%
MSN Teen                  17.7%     20.5%     21.9%     18.9%
ContentProtect Default    38.3%     45.4%     2.8%      3.0%
ContentProtect Custom     28.3%     46.7%     1.4%      0.7%
CyberPatrol Custom        31.0%     33.5%     1.4%      0.9%
CyberSitter Default       12.7%     16.5%     3.6%      4.1%
CyberSitter Custom        12.4%     18.9%     4.0%      3.7%
McAfee Young Teen         16.1%     26.0%     12.4%     13.2%
Net Nanny Level 2         44.0%     46.1%     3.3%      2.2%
Norton Default            60.2%     54.9%     1.4%      0.7%
Norton Custom             58.4%     54.2%     0.9%      0.4%
Verizon                   41.8%     40.3%     9.4%      5.7%
8e6                       18.3%     23.0%     9.4%      7.5%
SafeEyes                  16.2%     15.2%     3.3%      3.2%

Conservative 95% lower confidence limits

Filter                    Underblocking       Overblocking
                          Google    MSN       Google    MSN
AOL Mature Teen           5.6%      6.5%      18.4%     21.0%
MSN Pornography           12.1%     15.7%     15.8%     8.5%
MSN Teen                  12.8%     17.4%     17.8%     16.6%
ContentProtect Default    31.3%     41.3%     1.5%      2.1%
ContentProtect Custom     22.2%     42.6%     0.6%      0.4%
CyberPatrol Custom        24.6%     29.7%     0.6%      0.5%
CyberSitter Default       8.6%      13.6%     2.1%      3.1%
CyberSitter Custom        8.4%      15.9%     2.4%      2.7%
McAfee Young Teen         11.4%     22.5%     9.3%      11.3%
Net Nanny Level 2         36.8%     41.9%     1.9%      1.5%
Norton Default            52.9%     50.7%     0.6%      0.4%
Norton Custom             51.1%     50.1%     0.4%      0.2%
Verizon                   34.7%     36.2%     6.7%      4.4%
8e6                       13.1%     19.6%     6.7%      6.0%
SafeEyes                  11.4%     12.3%     1.9%      2.3%

Of adult pages not blocked, estimated percentage that are domestic

Filter                    Google    MSN
AOL Mature Teen           40.0%     40.6%
MSN Pornography           31.6%     42.9%
MSN Teen                  40.0%     37.7%
ContentProtect Default    39.0%     45.8%
ContentProtect Custom     40.6%     47.1%
CyberPatrol Custom        48.6%     44.0%
CyberSitter Default       50.0%     32.8%
CyberSitter Custom        57.1%     36.2%
McAfee Young Teen         44.4%     37.5%
Net Nanny Level 2         41.7%     48.1%
Norton Default            35.3%     49.3%
Norton Custom             36.4%     49.7%
Verizon                   37.0%     42.4%
8e6                       42.1%     46.8%
SafeEyes                  35.3%     40.4%

Estimated underblocking and overblocking for AOL, MSN, & Yahoo! search results

Filter                    Underblocking results   Overblocking results   Domestic underblocked   Underblocking queries   95% CL
AOL Mature Teen           6.2%                    12.5%                  57.0%                   15.6%                   5.3%
MSN Pornography           21.4%                   4.4%                   86.1%                   32.3%                   20.9%
MSN Teen                  20.8%                   5.8%                   91.9%                   28.1%                   18.8%
ContentProtect Default    18.4%                   6.4%                   70.1%                   46.2%                   10.0%
ContentProtect Custom     20.4%                   0.0%                   62.1%                   42.2%                   25.4%
CyberPatrol Custom        34.6%                   0.4%                   94.9%                   65.6%                   24.4%
CyberSitter Default       11.2%                   4.6%                   33.8%                   23.2%                   11.2%
CyberSitter Custom        10.0%                   5.3%                   44.1%                   20.1%                   8.1%
McAfee Young Teen         14.2%                   20.7%                  80.7%                   30.9%                   10.4%
Net Nanny Level 2         28.1%                   3.7%                   79.4%                   36.6%                   20.8%
Norton Default            42.1%                   0.8%                   85.3%                   51.6%                   49.3%
Norton Custom             43.4%                   0.0%                   85.6%                   56.1%                   54.3%
Verizon                   23.1%                   1.3%                   80.9%                   41.6%                   31.4%
8e6                       7.3%                    7.5%                   78.0%                   23.4%                   11.7%
SafeEyes                  13.7%                   1.9%                   87.8%                   29.8%                   14.9%

Estimated underblocking and overblocking for Wordtracker query results

Filter                    Underblocking results   Overblocking results   Domestic underblocked   Underblocking queries
AOL Mature Teen           1.3%                    19.6%                  69.2%                   4.3%
MSN Pornography           2.7%                    13.3%                  86.1%                   8.2%
MSN Teen                  2.6%                    13.7%                  83.1%                   8.3%
ContentProtect Default    7.5%                    12.4%                  84.1%                   23.1%
ContentProtect Custom     8.1%                    7.8%                   84.9%                   25.3%
CyberPatrol Custom        3.9%                    9.2%                   86.4%                   10.1%
CyberSitter               1.4%                    19.9%                  69.3%                   5.1%
CyberSitter Custom        2.9%                    18.2%                  84.0%                   9.4%
McAfee Young Teen         2.8%                    32.8%                  70.7%                   9.3%
Net Nanny Level 2         12.6%                   9.5%                   82.9%                   34.4%
Norton Default            9.9%                    4.8%                   79.4%                   25.2%
Norton Custom             10.2%                   2.9%                   79.4%                   25.9%
Verizon                   4.4%                    16.1%                  67.9%                   15.0%
8e6                       3.4%                    25.1%                  93.0%                   10.3%
SafeEyes                  2.0%                    16.5%                  96.6%                   6.4%

Filter Results

  • The most restrictive filter blocked 91% of adult pages; it also blocked about 23–24% of the clean webpages in the indexes.

  • It would block 22–23 clean webpages for each adult page it blocks in the Google or MSN search index.

  • Less restrictive filters blocked as little as 40% of the adult pages.

  • Among search results, the most restrictive filter blocked about 94% of the adult pages; it also blocked about 13% of clean search results.

  • On average, it would block about 7.6 clean results for every adult result it blocks.

  • For the most popular queries, the most restrictive filter blocked over 98% of adult results; it also blocked ~20% of clean results.

  • It would block ~1.1 clean results of popular searches for each adult result it blocks.
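The "clean pages blocked per adult page blocked" figures follow from combining prevalence with the two error rates. The sketch below shows one plausible reading of that arithmetic (the formula and function name are assumptions, not taken from the study), using the rounded prevalence and AOL Mature Teen rates from the tables above. Because the inputs are rounded, the outputs land close to the quoted ratios but need not match them exactly.

```python
def clean_blocked_per_adult_blocked(prevalence, underblocking, overblocking):
    """Expected number of clean pages blocked for each adult page blocked.

    prevalence    -- fraction of pages that are adult
    underblocking -- fraction of adult pages the filter lets through
    overblocking  -- fraction of clean pages the filter blocks
    """
    adult_blocked = prevalence * (1 - underblocking)
    clean_blocked = (1 - prevalence) * overblocking
    if adult_blocked == 0:
        raise ValueError("filter blocks no adult pages; ratio is undefined")
    return clean_blocked / adult_blocked


# Rounded inputs from the slides: adult prevalence and AOL Mature Teen rates.
google_index = clean_blocked_per_adult_blocked(0.011, 0.089, 0.226)
msn_index = clean_blocked_per_adult_blocked(0.011, 0.086, 0.236)
search_results = clean_blocked_per_adult_blocked(0.017, 0.062, 0.125)

for name, ratio in [("Google index", google_index),
                    ("MSN index", msn_index),
                    ("AOL, MSN, Yahoo! search results", search_results)]:
    print(f"{name}: about {ratio:.1f} clean pages blocked per adult page blocked")
```

The ratio is driven mostly by prevalence: when only about 1% of pages are adult, even a modest overblocking rate swamps the adult pages that are correctly blocked.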

Location: Foreign Adult Websites with Commercial Ties to the US

Data source                 Percentage
Google index                90.3%
MSN index                   89.8%
AOL, MSN & Yahoo! queries   88.2%
Wordtracker queries         95.9%

Estimated percentage of nominally free adult foreign webpages that have commercial ties to the United States, based on data provided by CRA International. Estimates for query results take into account query weights.

The other side

Filtering studies cited by Plaintiffs’ Expert

Reference          Year   Sample type   Quantitative   Source of pages
eTesting Labs      2001   convenience   yes            searches on Google
eTesting Labs      2002   convenience   yes            searches on Google; DMOZ
NetAlert           2001   quota         yes            unknown
PC Magazine        2004   unknown       no             unknown
Consumer Reports   2005   convenience   no             unknown
Rulespace depo     2006   convenience   yes            unknown

eTesting 1: Google search for “free adult sex.”

eTesting 2: Added DMOZ; took sample of results.

NetAlert: at most 30 webpages.

This isn't science.

Plaintiffs’ “Internet Geography” Study

  • Claim: fewer than half of “free” porn sites are in the US, and about 2/3 of adult membership websites are in the US.

  • Universe: Adultreviews.net, Adultwebmasters.org, Google Web Directory, Sextracker.com.

  • Sample of convenience, not census or random sample.

  • According to his database, the following are porn sites: aol.com, msn.com, yahoo.com, about.com, lycos.fr, lycos.co.uk, com.ar, com.au, com.br, co.hu, co.il, co.kr, com.mx, co.nz, com.pl, com.pt, com.tw, com.ua, co.uk, com.ve, co.yu, co.za

  • Serious bug: claims entire commercial domains of at least 17 countries are porn sites.
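A quick sanity check could have caught the suffix problem. The sketch below flags database entries that look like a country's whole commercial namespace (such as co.uk) rather than an individual site. The regular expression is a rough heuristic chosen for this illustration, and the variable and function names are hypothetical; a real check would consult the Public Suffix List instead.

```python
import re

# Entries listed on the slide as "porn sites" in the plaintiffs' database.
LISTED = [
    "aol.com", "msn.com", "yahoo.com", "about.com", "lycos.fr", "lycos.co.uk",
    "com.ar", "com.au", "com.br", "co.hu", "co.il", "co.kr", "com.mx", "co.nz",
    "com.pl", "com.pt", "com.tw", "com.ua", "co.uk", "com.ve", "co.yu", "co.za",
]

# Heuristic (an assumption): "co" or "com" followed by a two-letter country code.
SUFFIX_PATTERN = re.compile(r"com?\.[a-z]{2}")


def looks_like_country_commercial_suffix(entry: str) -> bool:
    """True if the entry is a bare suffix like 'co.uk', not a registered site."""
    return SUFFIX_PATTERN.fullmatch(entry.lower()) is not None


flagged = [e for e in LISTED if looks_like_country_commercial_suffix(e)]
print("Entries that are whole commercial namespaces, not sites:")
print(", ".join(flagged))
```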

This isn't science. Judge took his results at face value nonetheless.

The Public

Surprising outcry: thought the suit enabled DOJ to get personal info. Of course,

well now good for you – instead of teaching parents/caregivers of minors how to block unwanted porn sites you have given this administration an EXCUSE to peruse search engine data bases.

enough erosion of civil liberties

Dorothy Grimes

earthchildren@comcast.net

Heartwood Books Heartwood@cstone.net 1/20/06

Dear Professor Stark,

The Google user is an actual person, not just a statistic, and your attempt to expose my personal information (even buried in a large quantity of data) is at best short sighted on your part. It is also annoying. It is absolutely NONE OF YOUR BUSINESS what I search for on Google.

I am aware of the fact that some people (especially the young) seem to place no value on privacy. But this is not the case for everyone. Do you think for a minute that the government will be satisfied with “anonymous” data if it sees “suspicious” patterns? Using statistical methods to identify criminals has enormous potential for misuse. Look at the early use of genetics that produced eugenics. Before you accept your next consulting fee, stop and talk with someone about the ethics of your work.

Even if you do not value your personal privacy in this matter, ask yourself if you would want the public or the government examining all of your communication or internet use. When the government gains the right to watch our private non-criminal lives, this power will not exist only for the current well meaning Bush administration but will be available for the next Bush, Clinton or Nixon as well.

It is absolutely NONE OF YOUR BUSINESS what I search for on Google. It is none of my business whether the baseball cap just looks cute or is hiding thinning hair. Some things are private.

Paul Collinge

Heartwood Books 5 Elliewood Ave. Charlottesville, Va. 22903 434 295 7083