YaCy
| Developed by | Michael Peter Christen |
|---|---|
| Latest release | 1.3 / December 2012 |
| OS | Platform independent (Java): GNU/Linux, Windows, Mac OS X, etc. |
| Type | Free Software Search Engine |
| License | GPL |
| Website | http://www.yacy.net |
YaCy is a free, open-source, peer-to-peer (distributed) Internet search-engine program written in Java and designed to run on your personal desktop computer or on your server. It is also a caching proxy, and it comes complete with a web server, wiki, bookmark manager, blog and internal messaging system.
What is it?
The YaCy software aims at becoming a global distributed Internet search-engine. Your YaCy indexes a small part of the web, your own searches are distributed among other peers, and your YaCy replies when other people search for web-pages it knows about.
This, in theory, will eventually allow you to search the whole indexable web with your own YaCy software even though your personal YaCy-installation only knows about a small portion of it.
But YaCy is much more than that. It is almost a complete all-in-one Internet assistant, because YaCy has many other interesting features, including:
- A built-in web-server. The web-server which shows you the search-page also allows you to publish a homepage, files and so on.
- A proxy, which caches the pages you visit and optionally indexes them so they become part of the global search-index.
- A password-protected file-sharing area, which allows you to share files with your friends.
- A very basic wiki, and a very simple blog function.
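The distributed search idea described at the start of this section can be illustrated with a toy model. This is only a minimal sketch and not YaCy's actual code: the `Peer` class, the `distributed_search` function and the example URLs are hypothetical names made up for illustration, and a real network would contact peers over the Internet rather than call them directly.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """A toy peer that indexes only a small part of the web."""
    name: str
    index: dict = field(default_factory=dict)  # keyword -> set of URLs

    def add_page(self, url, text):
        # Index every lowercase word of the page text under its URL.
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(url)

    def answer(self, keyword):
        # Reply with a copy of the URLs this peer knows for the keyword.
        return set(self.index.get(keyword.lower(), set()))


def distributed_search(local, others, keyword):
    # Combine the local results with whatever the other peers reply.
    results = local.answer(keyword)
    for peer in others:
        results |= peer.answer(keyword)
    return results


alice = Peer("alice")
bob = Peer("bob")
alice.add_page("http://example.org/a", "free software search")
bob.add_page("http://example.net/b", "distributed search engine")

found = distributed_search(alice, [bob], "search")
print(sorted(found))
```

In this sketch neither peer knows both pages, yet a search started at either one returns both, which is the property the section describes.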
The Search Engine
The search engine is YaCy's main feature. It works like any other search engine: you type in what you want to read about and it gives you a list of results. You can optionally choose how long you want YaCy to search before it presents the results; the longer you ask it to search, the more peers are asked for results.
The search results
YaCy knows about quite a few pages regarding most keywords, but it does present far fewer search-results than most major search-engines. It also doesn't do a very good job at sorting the pages according to relevance.
Seeing is believing: you can try a few random searches using YaCySearch.com and compare them with search engines such as Google.com and MSN.
The mainstream press reported that YaCy covered 205 million Internet-pages in January 2007[1], but the actual total of known URLs was closer to 214 million at that time. This may sound like a lot, but search engines such as Exalead.com brag that they cover 8 billion pages. It's very hard to say just how big the web really is, so it's hard to know how much of it YaCy covers. The Internet has about 100,000,000 hostnames according to NetCraft[2], but one hostname could hold anything from zero to millions of actual pages.
The search results YaCy presents are very impressive if you consider that it is a free, open-source, GPL-licensed search-engine running on home computers, and that it has already indexed two hundred million pages with only close to a hundred active peers in the YaCy-network.
Blacklists and Censorship
The indexable web is what the crawler software used by major search-engines, and by P2P search engines like YaCy, can index and make appear in the search-results.
The visible web is what is actually shown on the Internet's front pages[3][4].
There is intentionally a big difference between the visible web and the indexable web. Governments, especially in NATO-countries, can and do instruct search-engines that they must "not embarrass us", and all major search-engines are more than willing to censor their services[5].
YaCy is a peer-to-peer search-engine, so there is no central authority that can say "blahblahblah.tld? We don't like that one. We want as few people as possible to be aware that it even exists. We'll just blacklist it and hope nobody notices."
You can add URLs to a blacklist to prevent them from showing up in your peer's search-results. And you can also import blacklists from other peers.
But you cannot blacklist or censor sites globally. You can make your own peer not know about a website, but you can't make every YaCy peer ignore it. This makes YaCy very censorship-resistant. It is simply not possible for someone to censor parts of the indexable web out of the visible web. The one exception is your Internet Service Provider, which can make sure you can't communicate with other YaCy peers, in which case you're pretty much screwed anyway.
YaCy's search-results are censorship-resistant, and that is its greatest strength.
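The local-only nature of blacklists described in this section can be sketched as follows. This is a toy model under stated assumptions, not YaCy's implementation: the `BlacklistingPeer` class, its methods and the choice to blacklist by hostname are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


class BlacklistingPeer:
    """Toy peer with its own list of known URLs and its own local blacklist."""

    def __init__(self, name, known_urls):
        self.name = name
        self.known_urls = list(known_urls)
        self.blacklist = set()  # hostnames this peer refuses to show

    def import_blacklist(self, other):
        # Importing is a voluntary, local decision made by this peer.
        self.blacklist |= other.blacklist

    def results(self):
        # Hide only the entries whose hostname is on this peer's own blacklist.
        return [u for u in self.known_urls
                if urlparse(u).hostname not in self.blacklist]


urls = ["http://blahblahblah.tld/page", "http://example.org/"]
a = BlacklistingPeer("a", urls)
b = BlacklistingPeer("b", urls)
c = BlacklistingPeer("c", urls)

a.blacklist.add("blahblahblah.tld")  # peer a blacklists the site
b.import_blacklist(a)                # peer b chooses to import a's list

print(a.results())
print(b.results())
print(c.results())
```

Peers `a` and `b` no longer show the blacklisted site, but peer `c`, which never imported the list, still does. No single peer in this model can remove the site from everyone's results.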
Resource demands
YaCy is written in Java, so its resource usage can be high. However, system resource monitoring is integrated into YaCy, and the amount of resources that it uses is highly configurable, so YaCy can be configured to run unobtrusively on low-spec servers yet still index websites effectively. YaCy's minimum system requirements are low, and it has been demonstrated successfully indexing and performing web-searches on smartphones and on devices slower than 600 MHz.
Search Engine Optimization
The first thing that came to mind when reading about YaCy was: it must be really easy to do "Search Engine Optimization" (SEO) using YaCy. And it is.
See, YaCy isn't really that big yet, and there really aren't that many peers yet, so YaCy doesn't know about every single website out there. Traditionally you'd have to wait for the big search-engines to finally decide to crawl your site(s), you'd have to guess what they think of them, and you'd have to hope that they would grab your entire site and not just grab your front page and then wait months before re-visiting. If you run a YaCy peer then you can simply ask your YaCy to spider and index your site, and voilà, it does, and voilà, it now shows up in the search-results at other YaCy peers. It does appear that some of the peers run YaCy for exactly this reason.
Further, YaCy's search-results pages have a link to "Info", and here you'll find interesting things like "Parsed text" and "Parsed sentences". This can be useful even if you're not running a YaCy peer, or don't even care what it is, since it shows you exactly what part of the text on your
Privacy
It must be mentioned that the very nature of peer-to-peer systems in general, and especially search systems, raises many disturbing privacy issues that must be considered.
Few people are aware that many major search engines are working closely with intelligence agencies[6] and that they do give out names of journalists so their friendly governments can "help them"[7][8].
This is absolutely not solved by running your own P2P-distributed search-engine software.
First of all, the searches are distributed among a huge number of YaCy peers. Thus, any of these peers can easily monitor search-requests done on other peers. This means that if, for example, a local "intelligence" agency within the NATO alliance wants to know which citizens are aware that 9/11 was a NATO-approved US Department of Defense operation[9] and understand that "The truth is, there is no Islamic army or terrorist group called Al Qaida. And any informed intelligence officer knows this. But there is a propaganda campaign to make the public believe in the presence of an identified entity representing the 'devil' only in order to drive the 'TV watcher' to accept a unified international leadership for a war against terrorism."[10] then they should simply run a YaCy peer and look for frequent searches like "911 inside job" and "wtc demolition". This is a serious problem, because local NATO intelligence agencies, like Norwegian PST, covertly torture law-abiding citizens who raise awareness of such issues.
In short: if you're searching using a traditional search-engine then you're telling them and them only (and those they share that information with) what subjects you are interested in. With YaCy, you're telling everybody. Your peer is telling all the other peers that you're interested in keyword foo. They can give you the links they have that are relevant to keyword foo and leave it at that, or they can say "Oh, you're interested in foo, that's interesting, I'll make a note of that and add it to your already thick file".
A partial solution is to use publicly available YaCy peers[11] such as YaCySearch to do your YaCy-searching. But this brings up a whole other problem: you're back to square one. The peer-owner can easily look at the interface and see which IP address is searching for which keyword, so you're basically back to the "All your keywords belong to us" privacy issues that traditional search-engine systems have. If you have three friends who are running YaCy, and you randomly use their peers to search while they randomly use yours, then it does become slightly harder for an adversary to see who is searching for which keywords.
Note that YaCy only shows the hash of a keyword that is searched for remotely, so the adversary would have to modify YaCy's source-code to look the keywords up and/or to see which result reports match them. YaCy logs local searches with the keywords in plain text at the YaCy peer where the search is done.
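Why a keyword hash offers only limited protection can be shown with a small sketch. It assumes a stand-in hash (truncated SHA-256), because this article does not say which hash function YaCy uses; `word_hash`, `build_lookup` and `reveal` are hypothetical names. The point is only that anyone who can compute the same hash can precompute the hashes of the words they are interested in and match them against observed requests.

```python
import hashlib


def word_hash(word):
    # Stand-in hash (assumption): YaCy's real word-hash scheme may differ.
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()[:12]


def build_lookup(wordlist):
    # Precompute hash -> word for every word the observer cares about.
    return {word_hash(w): w for w in wordlist}


def reveal(observed_hashes, lookup):
    # Map each observed hash back to a word, or None if it is not in the list.
    return [lookup.get(h) for h in observed_hashes]


lookup = build_lookup(["foo", "bar", "baz"])
observed = [word_hash("foo"), word_hash("unlisted")]
print(reveal(observed, lookup))
```

In this sketch the hash of "foo" is recognized because it is in the observer's wordlist, while the hash of a word outside the list stays opaque. Hashing hides unexpected keywords, not the ones an observer is already looking for.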
The best solution to YaCy's privacy issues is the same solution that applies to all other web-browsing: use the Tor Internet privacy system for all your traffic, including YaCy searches, and all other searches for that matter, and don't search using your own peer but encourage others to do so.
The censorship-resistant properties YaCy has are great. The privacy properties it has are not so great. You can and should use software like Tor in combination with YaCy to get good censorship-resistant properties and good privacy properties.
Note from YaCy developers: it is not true that a remote search asks every other peer for its results for a search request. The index is distributed over a DHT (distributed hash table), and during a search only some other peers are contacted, not all of them. The developers are aware that a malicious person can occupy a sensitive position inside the DHT and monitor the search requests from outside. We developed a peer-hopping procedure against such surveillance methods, which will be introduced sometime this year and will make surveillance of remote searches impossible. Furthermore, remote searches never use plain-text words in a request, only word hashes.
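The developers' point that only some peers are contacted can be sketched with a toy DHT lookup. Everything here is an assumption made for illustration: the 64-bit key space, the truncated SHA-256 stand-in hash, the XOR distance (borrowed from Kademlia-style DHTs) and the names `key`, `peers_for_word` and `k` are not taken from YaCy, whose real selection scheme may work differently.

```python
import hashlib


def key(text):
    # Stand-in hash onto a 64-bit integer key space (assumption).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def peers_for_word(word, peer_names, k=3):
    # Pick the k peers whose hashed name is closest to the word hash.
    target = key(word.lower())
    return sorted(peer_names, key=lambda p: key(p) ^ target)[:k]


peers = [f"peer{i}" for i in range(20)]
contacted = peers_for_word("foo", peers)
print(contacted)
```

Of the twenty peers in this sketch, only three are asked about "foo", and the same three are chosen every time. That determinism is also what the note warns about: a peer positioned close to a given word hash would see most requests for that word.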
Corporate Website Search Engine?
YaCy can be set up as a back-end for a corporate website, but there are far better alternatives for adding a simple search-function to a website. YaCy isn't designed to be a search-back-end for a single site, and it requires too many resources to make it worth having on a server just to add a search-function. Take a look at the DataparkSearch Engine if that's what you need.
References
- BerliOS Project: YaCy P2P Web-Search
- SVN changes (WebSVN)
- Crawl requests - Ask for your site to be crawled
- [1] Heise.de (German): Jimmy Wales' Suchmaschine zum Mitmachen nimmt Gestalt an
- [2] NetCraft: January 2007 Web Server Survey
- [3] Alexa Top 100 Websites
- [4] Winner-Take-All: Google and the Third Age of Computing
- [5] Google censors itself for China
- [6] Former Intelligence Agent Says Google In Bed With CIA
- [7] Yahoo, Chinese police, and a jailed journalist
- [8] BBC NEWS | World | Asia-Pacific | Yahoo 'helped jail China writer'
- [9] Scholars for 911 truth
- [10] Al-Qaeda, described by Pierre-Henry Bunel, a former agent for French military intelligence
- [11] YaCy Demopeers