How to crawl a quarter billion webpages in 40 hours

There is an example showing that crawling a quarter billion webpages in under two days is feasible:

"More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances."

Instance    vCPU    Memory (GiB)    Storage     Network performance (Gbit/s)
a1.xlarge   4       8               EBS only    Up to 10

Across the 20 instances that works out to 80 vCPUs and 160 GiB of RAM, roughly 500 gigabytes of outgoing bandwidth for the HTTP requests, 1.69 terabytes of downloaded content, and 2,800 crawler agents.
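A quick back-of-the-envelope check of these figures; all numbers come from the quoted write-up, and the per-instance split simply assumes 20 identical machines of the class shown in the table:

```python
# Headline numbers from the crawl write-up quoted above.
pages = 250_113_669                # pages crawled
cost_usd = 580                     # approximate total EC2 cost, USD
duration_s = 39 * 3600 + 25 * 60   # 39 h 25 min in seconds
machines = 20                      # EC2 instances
downloaded_tb = 1.69               # terabytes of content downloaded

pages_per_sec = pages / duration_s            # overall crawl rate
pages_per_machine = pages_per_sec / machines  # assumed even split
usd_per_million = cost_usd / (pages / 1e6)    # cost per million pages
kb_per_page = downloaded_tb * 1e9 / pages     # average page size in KB

print(f"{pages_per_sec:.0f} pages/s overall, "
      f"{pages_per_machine:.1f} pages/s per machine")
print(f"${usd_per_million:.2f} per million pages, "
      f"{kb_per_page:.1f} KB average per page")
```

That comes to roughly 1,760 pages per second overall, about $2.32 per million pages, and an average of under 7 KB of content per page.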

"According to this presentation by Googler Jeff Dean, as of November 2010 Google was indexing “tens of billions of pages”. "
