YaCy

From LinuxReviews

Maintainer: Michael Peter Christen
Latest release: 0.52 / May 2007
OS: Java (platform independent: GNU/Linux, Windows, Mac OS X, etc.)
Use: Free Software Search Engine
License: GPL
Website: http://www.yacy.net/yacy/

YaCy is a free, open-source, peer-to-peer (distributed) Internet search engine program written in Java. It is designed to run on your personal desktop computer or on your server, and it comes complete with a web server, wiki, bookmark manager, caching proxy, blog and internal messaging system.

What is it?

YaCy's front page, by default available at 127.0.0.1:8080 when the YaCy software is running.

The YaCy software aims at becoming a global distributed Internet search-engine. Your YaCy indexes a small part of the web, your own searches are distributed among other peers, and your YaCy replies when other people search for web-pages it knows about.

This, in theory, will eventually allow you to search the whole indexable web with your own YaCy software even though your personal YaCy-installation only knows about a small portion of it.
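The idea can be illustrated with a toy model. This is only a sketch, not YaCy's actual code: the peer names, the URLs and the `distributed_search` function are made up for illustration. Each peer knows a small slice of the web, and a search merges whatever the asked peers return.

```python
# Toy model: each peer indexes only a small slice of the web.
# Peer names and URLs are hypothetical.
peer_indexes = {
    "peer_a": {"linux": ["http://a.example/linux"]},
    "peer_b": {"linux": ["http://b.example/kernel"],
               "java": ["http://b.example/java"]},
    "peer_c": {"java": ["http://c.example/jvm"]},
}


def distributed_search(keyword, peers):
    """Ask each given peer for the keyword and merge the answers without duplicates."""
    seen = set()
    merged = []
    for name in sorted(peers):
        for url in peers[name].get(keyword, []):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

No single peer in this model knows every page about "linux", but the merged result covers all of them.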

But YaCy is much more than that. It is almost a complete all-in-one Internet assistant, because YaCy has many other interesting features, including:

  • A built-in web-server. The web-server which shows you the search-page also allows you to publish a homepage, files and so on.
  • A proxy, which caches the pages you visit and optionally indexes them so they become part of the global search-index.
  • A password-protected file-sharing area, which allows you to share files with your friends.
  • A very basic wiki, and a very simple blog function.

The Search Engine

The search engine is YaCy's main feature. It works like any other search engine: you type in what you want to read about and it gives you a list of results. You can optionally choose how long you want YaCy to search before it presents the results; the longer you let it search, the more peers are asked for results.

The search results

YaCy knows about quite a few pages regarding most keywords, but it does present far fewer search-results than most major search-engines. It also doesn't do a very good job at sorting the pages according to relevance.

Seeing is believing: you can try a few random searches using YaCySearch.com and compare them with search engines such as Google.com and MSN.

It was reported by the mainstream press that YaCy covered 205 million Internet pages in January 2007[1], but the actual total of known URLs was closer to 214 million at that time. This may sound like a lot, but search engines such as Exalead.com brag that they cover 8 billion pages. It's very hard to say just how big the web really is, so it's hard to know how much of it YaCy covers. The Internet has about 100,000,000 hostnames according to Netcraft[2], but one hostname could hold anything from zero to millions of actual pages.

The search results YaCy presents are very impressive if you consider that it is a free, open-source, GPL-licensed search engine running on home computers, and that it has already indexed two hundred million pages with only close to a hundred active peers in the YaCy network.

Blacklists and Censorship

The indexable web is what the crawler software used by major search-engines - and P2P search engines like YaCy - can index and make appear in the search-results.

The visible web is what is actually shown on the Internet's front pages[3][4].

There is intentionally a big difference between the visible web and the indexable web. Governments, especially in NATO countries, can and do instruct search engines that they must "not embarrass us", and all major search engines are more than willing to censor their services[5].

YaCy is a peer-to-peer search engine, so there is no central authority that can say "blahblahblah.tld? We don't like that one. We want as few people as possible to be aware that it even exists. We'll just blacklist it and hope nobody notices."

You can add URLs to a blacklist to prevent them from showing up in your peer's search-results. And you can also import blacklists from other peers.

But you cannot blacklist or censor sites globally. You can make your own peer not know about a website, but you can't make every YaCy peer ignore it. This makes YaCy very censorship-resistant. It is simply not possible for someone to censor parts of the indexable web away from the visible web (except for your Internet Service Provider, which can make sure you can't communicate with other YaCy peers, in which case you're pretty much screwed anyway).

YaCy's search results are censorship-resistant, and that's its biggest quality.
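A minimal sketch of why a blacklist stays local, assuming a blacklist is simply a set of hostnames kept by each peer. The `visible_results` function and the variable names are hypothetical, and YaCy's real blacklist format is not reproduced here.

```python
from urllib.parse import urlparse


def visible_results(urls, blacklist):
    """Return the URLs whose hostname is not on this peer's own blacklist."""
    return [u for u in urls if urlparse(u).hostname not in blacklist]


# The same raw results, seen through two peers with different local blacklists.
results = ["http://blahblahblah.tld/page", "http://example.org/"]
peer_a_blacklist = {"blahblahblah.tld"}  # peer A chose to hide this site
peer_b_blacklist = set()                 # peer B hides nothing
```

Filtering with `peer_a_blacklist` hides the site on peer A only; peer B, with its own empty blacklist, still shows it.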

Resource demands

YaCy is written in Java and, like all Java programs, it demands huge amounts of resources just to say "hello world".

The YaCy FAQ's story regarding its performance is this:

"You don't need a fast machine to run YaCy. You also don't need a lot of space. You can configure the amount of Megabytes that you want to spend for the cache. YaCy can also run on a vServer."

However, YaCy does require more than its fair share of resources, especially if you ask it to crawl the web using the default settings. The interface has a configuration page called Performance, and this page allows you to set delay values. You can demand that YaCy chill out for a number of seconds between each time it does something like fetching or analyzing a web page, and you can make YaCy hog far fewer resources than it does out of the box by cranking those delay numbers up real high.

YaCy is also very resource-demanding when it's used as a proxy. The idea is supposedly that you can use YaCy as a caching proxy and at the same time allow it to analyze the cached pages in order to increase the number of web pages in its search index. But even more disturbing, the proxy is utterly slow. It takes about half a cup of coffee to load a page when using YaCy as a proxy, and this whole part of the software is basically a total outrage. Polipo makes web browsing seem faster than not using a proxy; Privoxy makes web browsing slightly slower, but it still loads pages way faster than YaCy, and Privoxy does a huge amount of filtering, rewrites web pages so advertisements disappear, and so on.
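What such a delay value does can be sketched in a few lines. This is an illustration only, not YaCy's implementation: the `crawl` function, its parameters and the injected `fetch` and `sleep` callables are assumptions made for the example.

```python
import time


def crawl(urls, fetch, delay_seconds, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_seconds between consecutive fetches."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # A larger delay means fewer fetches per minute and less load.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

With a delay of 2 seconds, three pages take at least 4 seconds of idle time; with a delay of 0 the crawler runs as fast as the machine and network allow. Passing `sleep` in as a parameter just makes the pause easy to replace when testing.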

Heavy load

It was widely reported on the #yacy IRC channel on OFTC that the YaCy peer http://www.suma-lab.de:8080/ stopped responding when the German publication Heise ran a story which mentioned it[1]. YaCy's performance is horrible, most likely because it's Java, which means that it really doesn't take more than a few users doing simultaneous searches before a YaCy peer runs into serious trouble.

This isn't that big of a problem if you run your own peer and you're the only one using it (which is a bad idea because of privacy issues).

Search Engine Optimization

The first thing that came to mind when reading about YaCy was: It must be real easy to do "Search Engine Optimization" (SEO) using YaCy. And it is.

See, YaCy isn't really that big yet, and there aren't that many peers yet, so YaCy doesn't know about every single website out there. Traditionally you'd have to wait for the big search engines to finally decide to crawl your site(s), you'd have to guess what they think of them, and you'd have to hope that they would grab your entire site and not just grab your front page and then wait months before revisiting. If you run a YaCy peer then you can simply ask your YaCy to spider and index your site, and voilà, it does, and voilà, it now shows up in the search results at other YaCy peers. It does appear that some of the peers run YaCy for exactly this reason.

Further, YaCy's search-results pages have a link to "Info", and here you'll find interesting things like "Parsed text" and "Parsed sentences". This can be useful even if you're not running a YaCy peer or don't even care what it is, since it shows you exactly what part of the text on your

Privacy

It must be mentioned that the very nature of peer-to-peer systems in general, and especially of search systems, raises many disturbing privacy issues which must be considered.

Few people are aware that many major search engines are working closely with intelligence agencies[6] and that they do give out names of journalists so their friendly governments can "help them"[7][8].

This is absolutely not solved by running your own P2P-distributed search-engine software.

First of all, the searches are distributed among a huge number of YaCy peers. Thus, any of these peers can easily monitor search requests done on other peers. This means that if, for example, a local "intelligence" agency within the NATO alliance wants to know which citizens are aware that 9/11 was a NATO-approved US Department of Defense operation[9] and understand that "The truth is, there is no Islamic army or terrorist group called Al Qaida. And any informed intelligence officer knows this. But there is a propaganda campaign to make the public believe in the presence of an identified entity representing the 'devil' only in order to drive the 'TV watcher' to accept a unified international leadership for a war against terrorism."[10] then they should simply run a YaCy peer and look for frequent searches like "911 inside job" and "wtc demolition". This is a serious problem, because local NATO intelligence agencies, like Norwegian PST, covertly torture law-abiding citizens who raise awareness of such issues.

YaCy shows you the hashed keywords remote peers are searching for (and the IP of their peer). You don't see the cleartext keywords, and you'd have to modify YaCy to figure them out and/or list the sites that are returned for that keyword, which isn't that hard. And you can't be entirely sure whether it is the local peer operator or someone using that peer who actually did the search.

In short: if you're searching using a traditional search engine then you're telling them and them only (and those they share that information with) what subjects you are interested in. With YaCy, you're telling everybody. Your peer is telling all the other peers that you're interested in keyword foo, and they can give you the links they have that are relevant to keyword foo and that's it, or they can say "Oh, you're interested in foo, that's interesting, I'll make a note of that and add it to your already thick file".

A partial solution is to use publicly available YaCy peers[11] such as YaCySearch to do your YaCy searching. But this brings up a whole other problem: you're back to square one. The peer owner can easily look at the interface and see what IP is searching for what keyword, so you're basically back to the "All your keywords belong to us" privacy issues traditional search-engine systems have. If you have three friends who are running YaCy and you randomly use their peers to search and they randomly use yours, then it does become slightly harder for an adversary to see who's searching for what keywords.

Note that YaCy only shows the hash of a keyword which is searched for remotely, so the adversary would have to modify YaCy's source code to look up the keywords and/or the results that match them. YaCy logs local searches with the keywords in plain text at the YaCy peer where the search is done.

The best solution to YaCy's privacy issues is the same solution that applies to all other web browsing: use the Tor Internet privacy system for all your traffic, including YaCy searches and all other searches for that matter, and don't search using your own peer but encourage others to do so.
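Why hashing alone offers little protection can be sketched as follows. The code assumes an unsalted hash of the lowercased keyword and uses SHA-256 purely as a stand-in; YaCy's real word-hash function is not reproduced here, and `word_hash`, `build_lookup` and `recover` are hypothetical names.

```python
import hashlib


def word_hash(word):
    """Stand-in keyword hash (SHA-256); an assumption, not YaCy's real function."""
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()


def build_lookup(dictionary):
    """Precompute hash -> cleartext for every word the observer can think of."""
    return {word_hash(w): w for w in dictionary}


def recover(observed_hashes, lookup):
    """Map observed hashes back to cleartext; None where the word is unknown."""
    return [lookup.get(h) for h in observed_hashes]
```

Anyone who hashes a dictionary of likely search terms in advance can translate most observed hashes back to words; only keywords missing from the dictionary stay hidden.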

The censorship-resistant properties YaCy has are great. The privacy properties it has are not so great. You can and should use software like Tor in combination with YaCy to get good censorship-resistant properties and good privacy properties.

Note from YaCy developers: it is not true that a remote search asks every other peer for its results of a search request. The index is distributed into a DHT (distributed hash table), and during a search only some other peers are contacted, not all of them. The developers are aware of the fact that an evil person can occupy a sensitive position inside the DHT and monitor the search requests from outside. We developed a peer-hopping procedure against such supervision methods, which will arrive sometime this year and will make supervision of remote searches impossible. Furthermore, remote searches never use clear words for requests, only word hashes.
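How a DHT limits the number of peers contacted can be sketched like this. It is a generic illustration under stated assumptions, not YaCy's actual routing: positions are derived from SHA-256, closeness is measured with XOR distance, and `to_position`, `responsible_peers` and the parameter `k` are hypothetical.

```python
import hashlib


def to_position(text):
    """Map a word or a peer name to a position in a 64-bit key space."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def responsible_peers(word, peer_names, k=2):
    """Pick the k peers whose position is closest (XOR distance) to the word's position."""
    target = to_position(word)
    ranked = sorted(peer_names, key=lambda p: (to_position(p) ^ target, p))
    return ranked[:k]
```

Only the `k` selected peers see the request for a given word hash. This also shows the weakness the developers mention: a peer that manages to sit close to a particular word's position sees the searches for that word.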

Corporate Website Search Engine?

YaCy can be set up to be a back-end for a corporate website, but there are other, far better alternatives for adding a simple search function to a website. YaCy isn't designed to be a search back-end for a single site, and it requires too many resources to make it worth having on a server just to add a search function. Take a look at the DataparkSearch Engine if that's what you need.

References

  1. Heise.de (German): Jimmy Wales' Suchmaschine zum Mitmachen nimmt Gestalt an
  2. NetCraft: January 2007 Web Server Survey
  3. Alexa Top 100 Websites
  4. Winner-Take-All: Google and the Third Age of Computing
  5. Google censors itself for China
  6. Former Intelligence Agent Says Google In Bed With CIA
  7. Yahoo, Chinese police, and a jailed journalist
  8. BBC NEWS | World | Asia-Pacific | Yahoo 'helped jail China writer'
  9. Scholars for 911 truth
  10. Al-Qaeda, described by Pierre-Henry Bunel, a former agent for French military intelligence
  11. YaCy Demopeers
 